Probability & Statistics
About the Author
Athanasios Papoulis was educated at the Polytechnic University of Athens
and at the University of Pennsylvania. He started teaching in 1948 at the
University of Pennsylvania, and in 1952 he joined the faculty of the then
Polytechnic Institute of Brooklyn. He has also taught at Union College,
U.C.L.A., Stanford, and at the TH Darmstadt in Germany.

A major component of his work is academic research. He has consulted with many companies including Burroughs, United Technologies,
and IBM, and published extensively in engineering and mathematics, concentrating on fundamental concepts of general interest. In recognition of his
contributions, he received the distinguished alumnus award from the University of Pennsylvania in 1973, and, recently, the Humboldt award given to
American scientists for internationally recognized achievements.

Professor Papoulis is primarily an educator. He has taught thousands
of students and lectured in hundreds of schools. In his teaching, he stresses
clarity, simplicity, and economy. His approach, reflected in his articles and
books, has been received favorably throughout the world. All of his books
have international editions and translations. In Japan alone six of his major
texts have been translated. His book Probability, Random Variables, and
Stochastic Processes has been the standard text for a quarter of a century. In
1980, it was chosen by the Institute of Scientific Information as a citation
classic.

Every year, the IEEE, an international organization of electrical engineers, selects one of its members as the outstanding educator. In 1984, this
prestigious award was given to Athanasios Papoulis with the following citation:

For inspirational leadership in teaching through thought-provoking
lectures, research, and creative textbooks.
PROBABILITY
& STATISTICS
Athanasios Papoulis
Polytechnic University
Prentice-Hall International, Inc.

This edition may be sold only in those countries to which
it is consigned by Prentice-Hall International. It is not to
be re-exported and it is not for sale in the U.S.A., Mexico,
or Canada.

© 1990 by Prentice-Hall, Inc.
A Division of Simon & Schuster
Englewood Cliffs, NJ 07632

All rights reserved. No part of this book may be
reproduced, in any form or by any means,
without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 0-13-711730-2

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Simon & Schuster Asia Pte. Ltd., Singapore
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
Prentice-Hall, Inc., Englewood Cliffs, New Jersey
Contents

Preface

PART ONE
PROBABILITY

1  The Meaning of Probability
   1-1  Introduction
   1-2  The Four Interpretations of Probability

2  Fundamental Concepts
   2-1  Set Theory
   2-2  Probability Space
   2-3  Conditional Probability and Independence
   Problems

3  Repeated Trials
   3-1  Dual Meaning of Repeated Trials
   3-2  Bernoulli Trials
   3-3  Asymptotic Theorems
   3-4  Rare Events and Poisson Points
   Appendix: Area Under the Normal Curve
   Problems

4  The Random Variable
   4-1  Introduction
   4-2  The Distribution Function
   4-3  Illustrations
   4-4  Functions of One Random Variable
   4-5  Mean and Variance
   Problems

5  Two Random Variables
   5-1  The Joint Distribution Function
   5-2  Mean, Correlation, Moments
   5-3  Functions of Two Random Variables
   Problems

6  Conditional Distributions, Regression, Reliability
   6-1  Conditional Distributions
   6-2  Bayes' Formulas
   6-3  Nonlinear Regression and Prediction
   6-4  System Reliability
   Problems

7  Sequences of Random Variables
   7-1  General Concepts
   7-2  Applications
   7-3  Central Limit Theorem
   7-4  Special Distributions of Statistics
   Appendix: Chi-Square Quadratic Forms
   Problems

PART TWO
STATISTICS

8  The Meaning of Statistics
   8-1  Introduction
   8-2  The Major Areas of Statistics
   8-3  Random Numbers and Computer Simulation

9  Estimation
   9-1  General Concepts
   9-2  Expected Values
   9-3  Variance and Correlation
   9-4  Percentiles and Distributions
   9-5  Moments and Maximum Likelihood
   9-6  Best Estimators and the Rao-Cramer Bound
   Problems

10  Hypothesis Testing
    10-1  General Concepts
    10-2  Basic Applications
    10-3  Quality Control
    10-4  Goodness-of-Fit Testing
    10-5  Analysis of Variance
    10-6  Neyman-Pearson, Sequential, and Likelihood Ratio Tests
    Problems

11  The Method of Least Squares
    11-1  Introduction
    11-2  Deterministic Interpretation
    11-3  Statistical Interpretation
    11-4  Prediction
    Problems

12  Entropy
    12-1  Entropy of Partitions and Random Variables
    12-2  Maximum Entropy and Statistics
    12-3  Typical Sequences and Relative Frequency
    Problems

Tables
Answers and Hints for Selected Problems
Index
Preface
Probability is a difficult subject. A major reason is uncertainty about its
meaning and skepticism about its value in the solution of real problems.
Unlike other scientific disciplines, probability is associated with randomness, chance, even ignorance, and its results are interpreted not as objective
scientific facts, but as subjective expressions of our state of knowledge. In
this book, I attempt to convince the skeptical reader that probability is no
different from any other scientific theory: All concepts are precisely defined
within an abstract model, and all results follow logically from the axioms. It
is true that the practical consequences of the theory are only inductive
inferences that cannot be accepted as logical certainties; however, this is
characteristic not only of statistical statements, but of all scientific conclusions.

The subject is developed as a mathematical discipline; however, mathematical subtleties are avoided and proofs of difficult theorems are merely
sketched or, in some cases, omitted. The applications are selected not only
because of their practical value, but also because they contribute to the
mastery of the theory. The book concentrates on basic topics. It also includes a simplified treatment of a number of advanced ideas.

In the preparation of the manuscript, I made a special effort to clarify
the meaning of all concepts, to simplify the derivations of most results, and
to unify apparently unrelated concepts. For this purpose, I reexamined the
conventional approach to each topic, departing in many cases from traditional methods and interpretations. A few illustrations follow:

In the first chapter, the various definitions of probability are analyzed
and the need for a clear distinction between concepts and reality is stressed.
These ideas are used in Chapter 8 to explain the difference between probability and statistics, to clarify the controversy surrounding Bayesian statistics,
and to develop the dual meaning of random numbers. In Chapter 11, a
comprehensive treatment of the method of least squares is presented, showing the connection between deterministic curve fitting, parameter estimation, and prediction. The last chapter is devoted to entropy, a topic rarely
discussed in books on statistics. This important concept is defined as a
number associated to a partition of a probability space and is used to solve a
number of ill-posed problems in statistical estimation. The empirical interpretation of entropy and the rationale for the method of maximum entropy
are related to repeated trials and typical sequences.

The book is written primarily for upper division students of science and
engineering. The first part is suitable for a one-semester junior course in
probability. No prior knowledge of probability is required. All concepts are
developed slowly from first principles, and they are illustrated with many
examples. The first three chapters involve mostly only high school mathematics; however, a certain mathematical maturity is assumed. The level of
sophistication increases in subsequent chapters. Parts I and II can be covered in a two-semester senior/graduate course in probability and statistics.

This work is based on notes written during my stay in Germany as a
recipient of the Humboldt award. I wish to express my appreciation to the
Alexander von Humboldt Foundation and to my hosts Dr. Eberhard Hansler
and Dr. Peter Hagedorn of the TH Darmstadt for giving me the opportunity
to develop these notes in an ideal environment.
Athanasios Papoulis
PART ONE
PROBABILITY
1  The Meaning of Probability
Most scientific concepts have a precise meaning corresponding, more or less
exactly, to physical quantities. In contrast, probability is often viewed as a
vague concept associated with randomness, uncertainty, or even ignorance.
This is a misconception that must be overcome in any serious study of the
subject. In this chapter, we argue that the theory of probability, like any
other scientific discipline, is an exact science, and all its conclusions follow
logically from basic principles. The theoretical results must, of course, correspond in a reasonable sense to the real world; however, a clear distinction
must always be made between theoretical results and empirical statements.
1-1  Introduction
The theory of probability deals mainly with averages of mass phenomena
occurring sequentially or simultaneously: games of chance, polling, insurance, heredity, quality control, statistical mechanics, queuing theory, noise.
It has been observed that in these and other fields, certain averages approach
a constant value as the number of observations increases, and this value
remains the same if the averages are evaluated over any subsequence selected prior to the observations. In a coin experiment, for example, the ratio
of heads to tosses approaches 0.5 or some other constant, and the same ratio
is obtained if one considers, say, every fourth toss. The purpose of the
theory is to describe and predict such averages in terms of probabilities of
events. The probability of an event A is a number P(A) assigned to A. This
number is central in the theory and applications of probability; its significance is the main topic of this chapter. As a measure of averages, P(A) is
interpreted as follows:
If an experiment is performed n times and the event A occurs n_A times,
then almost certainly the relative frequency n_A/n of the occurrence of A is
close to P(A):

    P(A) ≈ n_A/n                                   (1-1)
provided that n is sufficiently large. This will be called the empirical or
relative frequency interpretation of probability.
Equation (1-1) is only a heuristic relationship because the terms almost
certainly, close, and sufficiently large have no precise meaning. The relative
frequency interpretation cannot therefore be used to define P(A) as a theoretical concept. It can, however, be used to estimate P(A) in terms of the
observed n_A and to predict n_A if P(A) is known. For example, if 1,000 voters
are polled and 451 respond Republican, then the probability P(A) that a
voter is Republican is about .45. With P(A) so estimated, we predict that in
the next election, 45% of the people will vote Republican.
The relative frequency interpretation of probability is objective in the
sense that it can be tested experimentally. Suppose, for example, that we
wish to test whether a coin is fair, that is, whether the probability of heads
equals .5. To do so, we toss it 1,000 times. If the number of heads is "about"
500, we conclude that the coin is indeed fair (the precise meaning of this
conclusion will be clarified later).
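As a purely illustrative aside (not part of the original text), the relative frequency interpretation is easy to explore with a short simulation. The sketch below, in Python, tosses a simulated coin n times and reports the ratio n_h/n; the probability value and the sample sizes are assumptions chosen only for the illustration.

    import random

    def relative_frequency(p_heads=0.5, n_trials=1_000, seed=1):
        """Toss a simulated coin n_trials times and return the
        observed relative frequency n_h / n of heads."""
        rng = random.Random(seed)
        n_h = sum(rng.random() < p_heads for _ in range(n_trials))
        return n_h / n_trials

    for n in (100, 1_000, 100_000):
        print(n, relative_frequency(n_trials=n))

As n grows, the printed ratios settle near .5, which is exactly the empirical behavior that (1-1) summarizes.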
Probability also has another interpretation. It is used as a measure of
our state of knowledge or belief that something is or is not true. For example, based on evidence presented, we conclude with probability .6 that a
defendant is guilty. This interpretation is subjective. Another juror, having
access to the same evidence, might conclude with probability .95 (beyond
any reasonable doubt) that the defendant is guilty.
We note, finally, that in applications involving predictions, both interpretations might be relevant. Consider the following weather forecast: "The
probability that it will rain tomorrow in New York is .6." In this forecast, the
number .6 is derived from past records, and it expresses the relative frequency of rain in New York under similar conditions. This number, however, has no relevance to tomorrow's weather. Tomorrow it will either rain
or not rain. The forecast expresses merely the state of the forecaster's
knowledge, and it helps us decide whether we should carry an umbrella.

Concepts and Reality
Students are often skeptical about the validity of probabilistic statements.
They have been taught that the universe evolves according to physical laws
that specify exactly its future (determinism) and that probabilistic descriptions are used only for "random" or "chance" phenomena, the initial conditions of which are unknown. This deep-rooted skepticism about the "truth"
of probabilistic results can be overcome only by a proper interpretation of
the meaning of probability. We shall attempt to show that, like any other
scientific discipline, probability is an exact science and that all conclusions
follow logically from the axioms. It is, of course, true that the correspondence between theoretical results and the real world is imprecise; however,
this is characteristic not only of probabilistic conclusions but of all scientific
statements.
In a probabilistic investigation the following steps must be clearly distinguished (Fig. 1.1).

Step 1 (Physical). We determine, by a process that is not and cannot
be made exact, the probabilities P(A_i) of various physical
events A_i.

Step 2 (Conceptual). We assume that the numbers P(A_i) satisfy certain axioms, and by deductive logic we determine the probabilities P(B_i) of other events B_i.

Step 3 (Physical). We make physical predictions concerning the
events B_i based on the numbers P(B_i) so obtained.

In steps 1 and 3 we deal with the real world, and all statements are
inexact. In step 2 we replace the real world with an abstract model, that is,
with a mental construct in which all concepts are precise and all conclusions
follow from the axioms by deductive logic. In the context of the resulting
theory, the probability of an event A is a number P(A) satisfying the axioms;
its "physical meaning" is not an issue.

Figure 1.1  Step 1 (Physical): Observation; Step 2 (Conceptual): Deduction, Model; Step 3 (Physical): Prediction.
We should stress that model formation is used not only in the study of
"random" phenomena but in all scientific investigations. The resulting theories are, of course, of no value unless they help us solve real problems. We
must assign specific, if only approximate, numerical values to physical quantities, and we must give physical meaning to all theoretical conclusions. The
link, however, between theory (model) and applications (reality) is always
inexact and must be separated from the purely deductive development of the
theory. Let us examine two illustrations of model formulation from other
fields.
Geometry Points and lines, as interpreted in theoretical geometry, are
not real objects. They are mental constructs having, by assumption, certain
properties specified in terms of the axioms. The axioms are chosen to correspond in some sense to the properties of real points and lines. For example,
the axiom "one and only one line passes through two points" is in reasonable agreement with our perception of the corresponding property of a real
line.
Starting from the axioms, we derive, by pure reasoning, other properties that we call theorems. The theorems are then used to draw various
useful conclusions about real geometric objects. For example, we prove that
the sum of the angles of a conceptual triangle equals 180°, and we use this
theorem to conclude that the sum of the angles of a real triangle equals
approximately 180°.
Circuit Theory  In circuit theory, a resistor is by definition a two-terminal device with the property that its voltage V is proportional to the
current I. The proportionality constant

    R = V/I                                        (1-2)

is the resistance of the device.
This is, of course, only an idealized model of a real resistor, and (1-2) is
an axiom (Ohm's law). A real resistor is a complicated device without
clearly defined terminals, and a relationship of the form in (1-2) can be
claimed only as a convenient idealization valid with a variety of qualifications and subject to unknown "errors." Nevertheless, in the development of
the theory, all these uncertainties are ignored. A real resistor is replaced by a
mental concept, and a theory is developed based on (1-2). It would not be
useful, we must agree, if at each stage of the theoretical development we
were concerned with the "true" meaning of R.
Returning to statistics, we note that, in the context of an abstract
model (step 2), the probability P(A) is interpreted as a number P(A) that
satisfies various axioms but is otherwise arbitrary. In the applications of the
theory to real problems, however (steps 1 and 3), the number P(A) must be
given a physical interpretation. We shall establish the link between model
and reality using three interpretations of probability: relative frequency,
classical, and subjective. We introduce here the first two in the context of
the die experiment.
Example 1.1
We wish to find the probability of the event A = {even} in the single-die experiment.
In the relative frequency interpretation, we rely on (1-1): We roll the die n
times, and we set P(A) ≈ n_A/n, where n_A is the number of times the event A occurs.
This interpretation can be used for any die, fair or not. In the classical interpretation,
we assume that the six faces of the die are equally likely; that is, they have the same
probability of showing (this is the "fair die" assumption). Since the event {even}
occurs if one of the three outcomes f_2, f_4, or f_6 shows, we conclude that P{even} =
3/6. This conclusion seems logical and is generally used in most games of chance. As
we shall see, however, the equally likely condition on which it is based is not a simple
consequence of the fact that the die is symmetrical. It is accepted because, in the
long history of rolling dice, it was observed that the relative frequency of each face
equals 1/6. •
Illustrations
We give next several examples of simple experiments, starting with a brief
explanation of the empirical meaning of the terms trials, outcomes, and
events.
A trial is the single performance of an experiment. Experimental outcomes are various observable characteristics that are of interest in the performance of the experiment. An event is a set (collection) of outcomes. The
certain event is the set S consisting of all outcomes. An elementary event is
an event consisting of a single outcome. At a single trial, we observe one and
only one outcome. An event occurs at a trial if it contains the observed
outcome. The certain event occurs at every trial because it contains every
outcome.
Consider, for example, the die experiment. The outcomes of this experiment are the six faces f_1, . . . , f_6; the event {even} consists of the three
outcomes f_2, f_4, and f_6; the certain event S consists of all six outcomes. A
trial is the roll of the die once. Suppose that at a particular trial, f_2 shows. In
this case we observe the outcome f_2; however, many events occur, namely,
the certain event, the event {even}, the elementary event {f_2}, and 29 other
events!
Example 1.2
The coin experiment has two outcomes: heads (h) and tails (t). The event heads = {h}
consists of the single outcome h. To find the probability P{h} of heads, we toss the
coin n = 1,000 times, and we observe that heads shows n_h = 508 times. From this we
conclude that P{h} ≈ .51 (step 1). This leads to the expectation that in future trials,
about 51% of the tosses will show heads (step 3).
One might argue that this is only an approximation. Since a coin is symmetrical
(the "equally likely" assumption), the probability of heads is .5. Had we therefore
kept tossing, the ratio n_h/n would have approached .5. This is generally true; however, it is based on our long experience with coins, and it holds only for the limited
class of fair coins. •
Example 1.3
We wish to find the probability that a newborn child is a girl. In this experiment, we
have again two outcomes: boy (b) and girl (g). We observe that among the 1,000
recently born children, 489 are girls. From this we conclude that P{g} ≈ .49. This
leads to the expectation that about 49% of the children born under similar circumstances will be girls.
Here again, there are only two outcomes, and we have no reason to believe
that they are not equally likely. We should expect, therefore, that the correct value
for P{g} is .5. However, as extensive records show, this expectation is not necessarily correct. •
Example 1.4
A poll taken for the purpose of determining Republican (r) or Democratic (d) party
affiliation specifies an experiment consisting of the two outcomes r and d. A trial is
the polling of one person. It was found that among 1,000 voters questioned, 382 were
Republican. From this it follows that the probability P{r} that a voter is Republican is
about .38, and it leads to the expectation that in the next election, about 38% of the
people will vote Republican.
In this case, it is obvious that the equally likely condition cannot be used to
determine P{r}. •
Example 1.5
Daily highway accidents specify an experiment. A trial of this experiment is the
determination of the accidents in a day. An outcome is the number k of accidents. In
principle, k can take any value; hence, the experiment has infinitely many outcomes,
namely all integers from 0 to ∞. The event A = {k = 3} consists of the single outcome
k = 3. The event B = {k ≤ 3} consists of the four outcomes k = 0, 1, 2, and 3.
We kept a record of the number of accidents in 1,000 days. Here are the
numbers n_k of days on which k accidents occurred:

    k      0    1    2    3    4    5    6    7    8    9   10   >10
    n_k   13   80  144  200  194  155  120   75   14    4    1     0
From the table and (1-1) it follows that n_A = n_3 = 200, n_B = n_0 + n_1 + n_2 + n_3 = 437, and n = 1,000; hence

    P(A) = P{k = 3} ≈ .2        P(B) = P{k ≤ 3} ≈ .437   •

Example 1.6
We monitor all telephone calls originating from a station between 9:00 and 10:00 A.M.
We thus have an experiment, the outcomes of which are all time instances between
9:00 and 10:00. A single trial is a particular call, and an outcome is the time of the
call. The experiment therefore has infinitely many outcomes. We observe that among
the last 1,000 calls, 248 occurred between 9:00 and 9:15. From this we conclude that
the probability of the event A = {the call occurs between 9:00 and 9:15} equals
P(A) ≈ .25. We expect, therefore, that among all future calls occurring between 9:00
and 10:00 A.M., 25% will occur between 9:00 and 9:15. •

Example 1.7
The age t at death of a person specifies an experiment with infinitely many outcomes.
We wish to find the probability of the event A = {death occurs before 60}. To do so,
we record the ages at death of 1,000 persons, and we observe that 682 of them are
less than 60. From this we conclude that P(A) = 682/1,000 ≈ .68. We should expect,
therefore, that 68% of future deaths will occur before the age of 60. •
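A minimal sketch (an editorial illustration, not part of the original text) of how such empirical probabilities are computed from recorded frequencies, using the accident counts of Example 1.5:

    # Number of days n_k on which k accidents occurred (Example 1.5).
    n_k = {0: 13, 1: 80, 2: 144, 3: 200, 4: 194, 5: 155,
           6: 120, 7: 75, 8: 14, 9: 4, 10: 1}
    n = sum(n_k.values())                        # total number of trials (days)

    p_eq_3 = n_k[3] / n                          # P{k = 3}  -> 0.2
    p_le_3 = sum(n_k[j] for j in range(4)) / n   # P{k <= 3} -> 0.437
    print(n, p_eq_3, p_le_3)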
Regularity and Randomness We note that to make predictions about future
averages based on past averages, we must impose the following restrictions
on the underlying experiment.

A. Its trials must be performed under "essentially similar" conditions (regularity).
B. The ratio n_A/n must be "essentially" the same for any subsequence of trials selected prior to the observation (randomness).

These conditions are heuristic and cannot easily be tested. As we
illustrate next, the difficulties vary from experiment to experiment. In the
coin experiment, both conditions can be readily accepted. In the birth experiment, A is acceptable but B might be questioned: If we consider only the
subsequence of births of twins where the firstborn is a boy, we might find a
different average. In the polling experiment, both conditions might be challenged: If the voting does not take place soon after the polling, the voters
might change their preference. If the polled voters are not "typical," for
example, if they are taken from an affluent community, the averages might
change.
1-2  The Four Interpretations of Probability
The term probability has four interpretations:
1. Axiomatic definition (model concept)
2. Relative frequency (empirical)
3. Classical (equally likely)
4. Subjective (measure of belief)
In this book, we shall use only the axiomatic definition as the basis of the
theory (step 2). The other three interpretations will be used in the determination of probabilistic data of real experiments (step 1) and in the applications
of the theoretical results to real experiments (step 3). We should note that
the last three interpretations have also been used as definitions in the theoretical development of probability; as we shall see, such definitions can be
challenged.
Axiomatic
In the axiomatic development of the theory of probability, we start with a
probability space. This is a set S of abstract objects (elements) called outcomes. The set S and its subsets are called events. The probability of an
event A is by definition a number P(A) assigned to A. This number satisfies
the following three axioms but is otherwise arbitrary.

I.   P(A) is a nonnegative number:
         P(A) ≥ 0                                  (1-3)
II.  The probability of the event S (certain event) equals 1:
         P(S) = 1                                  (1-4)
III. If two events A and B have no common elements, the probability
     of the event A ∪ B consisting of the outcomes that are in A or B
     equals the sum of their probabilities:
         P(A ∪ B) = P(A) + P(B)                    (1-5)

The resulting theory is useful in the determination of averages of mass
phenomena only if the axioms are consistent with the relative frequency
interpretation of probability, equation (1-1). This means that if in (1-3), (1-4), and (1-5) we
replace all probabilities by the corresponding ratios n_A/n, the resulting equations remain approximately true. We maintain that they do.
Clearly, n_A ≥ 0; furthermore, n_S = n because the certain event occurs
at every trial. Hence,

    P(A) = n_A/n ≥ 0        P(S) = n_S/n = 1

in agreement with axioms I and II. To show the consistency of axiom III
with (1-1), we observe that if the events A and B have no common elements
and at a specific trial the event A occurs, the event B does not occur. And
since the event A ∪ B occurs when either A or B occurs, we conclude that
n_{A∪B} = n_A + n_B. Hence,

    P(A ∪ B) = n_{A∪B}/n = (n_A + n_B)/n = P(A) + P(B)

in agreement with (1-5).
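This consistency can also be checked numerically. The following heuristic sketch (an editorial illustration under the stated assumptions, not part of the theory) simulates die rolls and verifies that the frequency ratios of two disjoint events add as in axiom III.

    import random

    rng = random.Random(0)
    n = 100_000
    rolls = [rng.randint(1, 6) for _ in range(n)]

    A = {2, 4, 6}                                 # the event {even}
    B = {1}                                       # an event disjoint from A
    n_A  = sum(r in A for r in rolls)
    n_B  = sum(r in B for r in rolls)
    n_AB = sum((r in A) or (r in B) for r in rolls)   # the event A U B

    # n_AB/n equals n_A/n + n_B/n exactly, and all ratios lie in [0, 1].
    print(n_A / n, n_B / n, n_AB / n)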
Model Formation  We comment next on the connection between an abstract space S (model) and the underlying real experiment. The first step in
model formation is the correspondence between elements of S and experimental outcomes. In Section 1-1 we assumed routinely that the outcomes of
an experiment are readily identified. This, however, is not always the case.
The actual outcomes of a real experiment can involve a large number of
observable characteristics. In the formation of the model, we select from all
these characteristics the ones that are of interest in our investigation. We
demonstrate with two examples.
Example 1.8
Consider the possible models of the die experiment as interpreted by the three
players X, Y, and Z.
X says that the outcomes of this experiment are the six faces of the die,
forming the space S = {f_1, . . . , f_6}. In this space, the event {even} consists of the
three outcomes f_2, f_4, and f_6.
Y wants to bet on even or odd only. He argues, therefore, that the experiment
has only the two outcomes even and odd, forming the space S = {even, odd}. In this
space, {even} is an elementary event consisting of the single outcome even.
Z bets that the die will rest on the left side of the table and f_1 will show. He
maintains, therefore, that the experiment has infinitely many outcomes consisting of
the six faces of the die and the coordinates of its center. The event {even} consists not
of one or of three outcomes but of infinitely many. •
Example 1.9
In a polling experiment, a trial is the selection of a person. The person might be
Republican or Democrat, male or female, black or white, smoker or nonsmoker, and
so on. Thus the observable outcomes are a myriad of characteristics. In Example 1.4
we considered as outcomes the characteristics "Republican" and "Democrat" because we were interested only in party affiliation. We would have four outcomes if
we considered also the sex of the selected persons, eight outcomes if we included
their color, and so on. •
Thus the outcomes of a probabilistic model are precisely defined objects corresponding not to the myriad of observable characteristics of the
underlying real experiment but only to those characteristics that are of interest in the investigation.
The axiomatic approach to probability is relatively recent* (Kolmogoroff, 1933); however, the axioms and the formal results had been used earlier. Kolmogoroff's contribution is the interpretation of probability as an abstract concept and the development of the theory as a precise mathematical
discipline based on measure theory.

* A. Kolmogoroff, "Grundbegriffe der Wahrscheinlichkeitsrechnung," Ergeb. Math. und ihrer Grenzg., Vol. 2, 1933.
Relative Frequency
The relative frequency interpretation (1-1) of probability states that if in n
trials an event A occurs n_A times, its probability P(A) is approximately n_A/n:

    P(A) ≈ n_A/n                                   (1-6)

provided that n is sufficiently large and the ratio n_A/n is nearly constant as n
increases.
This interpretation is fundamental in the study of averages, establishing the link between the model parameter P(A), however it is defined, and
the empirical ratio n_A/n. In our investigation, we shall use (1-6) to assign
probabilities to the events of real experiments. As a reminder of the connection between concepts and reality, we shall give a relative frequency interpretation of various axioms, definitions, and theorems based on (1-6).
The relative frequency cannot be used as the definition of P(A) because
(1-6) is an approximation. The approximation improves, however, as n increases. One might wonder, therefore, whether we can define P(A) as a
limit:

    P(A) = lim_{n→∞} n_A/n                         (1-7)

We cannot, of course, do so if n and n_A are experimentally determined
numbers because in any real experiment, the number n of trials, although it
might be large, is always finite. To give meaning to the limit, we must
interpret (1-7) as an assumption used to define P(A) as a theoretical concept.
This approach was introduced by Von Mises* early in the century as the
foundation of a new theory based on (1-7). At that time the prevailing point
of view was still the classical, and his work offered a welcome alternative to
the concept of probability defined independently of any observation. It removed from this concept its metaphysical implications, and it demonstrated
that the classical definition works in real problems only because it makes
implicit use of relative frequencies based on our long experience. However,
the use of (1-7) as the basis for a deductive theory has not enjoyed wide
acceptance. It has generally been recognized that Kolmogoroff's approach
is superior.

* Richard Von Mises, Probability, Statistics, and Truth, English edition, H. Geiringer, ed. (London: G. Allen and Unwin Ltd., 1957).
Classical
Until recently, probability was defined in terms of the classical interpretation. As we shall see, this definition is restrictive and cannot form the basis
of a deductive theory. It is, however, important in assigning probabilities to
the events of experiments that exhibit geometric or other symmetries.
The classical definition states that if an experiment consists of N outcomes and N_A of these outcomes are "favorable" to an event A (i.e., they
are elements of A), then

    P(A) = N_A/N                                   (1-8)

In words, the probability of an event A equals the ratio of the number of
outcomes N_A favorable to A to the total number N of outcomes.
This definition, as it stands, is ambiguous because, as we have noted,
the outcomes of an experiment can be interpreted in several ways. We shall
demonstrate the ambiguity and the need for improving the definition with an
example.
Example 1.10
We roll two dice and wish to find the probability p that the sum of the faces that show
equals 7. We shall analyze this problem in terms of the following models.

(a) We consider as experimental outcomes the N = 11 possible sums 2,
3, . . . , 12. Of these, only the outcome 7 is favorable to the event A = {7};
hence N_A = 1. If we use (1-8) to determine p, we must conclude that p =
1/11.
(b) We count as outcomes the N = 21 pairs 1-1, 1-2, 1-3, . . . , 6-6, not
distinguishing between the first and the second die. The favorable outcomes are now N_A = 3, namely the pairs 1-6, 2-5, and 3-4. Again using
(1-8), we must conclude that p = 3/21.
(c) We count as outcomes the N = 36 pairs distinguishing between the first
and the second die. The favorable outcomes are now the N_A = 6 pairs 1-6,
6-1, 2-5, 5-2, 3-4, 4-3, and (1-8) yields p = 6/36.

We thus have three different solutions for the same problem. Which is correct?
One might argue that the third is correct because the "true" number of outcomes is
36. Actually all three models can be used to describe the die experiment. The third
leads to the correct solution because its outcomes are "equally likely." For the other
two models, we cannot determine p from (1-8). •
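The three models of Example 1.10 can be compared by direct enumeration; a short sketch (illustrative only) in Python:

    from itertools import product
    from fractions import Fraction

    # Model (c): the 36 ordered pairs, all equally likely.
    pairs = list(product(range(1, 7), repeat=2))
    favorable = [p for p in pairs if sum(p) == 7]
    print(Fraction(len(favorable), len(pairs)))   # 1/6

    # Models (a) and (b) merely count outcomes differently (11 sums,
    # 21 unordered pairs); applying (1-8) to those counts would give
    # 1/11 and 3/21, which disagree with observed relative frequencies.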
Example 1.10 leads to the following refinement of (1-8): The probability
of an event A equals the ratio of the number of outcomes N_A favorable to A
to the total number N of outcomes, provided that all outcomes are equally
likely.
As we shall see, this refinement does not eliminate the problems inherent in the classical definition. We comment next on the various objections to
the classical definition as the foundation of a precise theory and on its value
in the determination of probabilistic data and as a working hypothesis.
Note  Be aware of the difference between the numbers n and n_A in (1-1) and the
numbers N and N_A in (1-8). In the former, n is the total number of trials
(repetitions) of the experiment and n_A is the number of successes of the event
A. In the latter, N is the total number of outcomes of the experiment and N_A is
the number of outcomes that are favorable to A (are elements of A).

CRITICISMS

1. The term equally likely used in the refined version of (1-8) can
mean only that the outcomes are equally probable. No other interpretation consistent with the equation is possible. Thus the definition is circular: the concept to be defined is used in the definition.
This often leads to ambiguities about the correct choice of N and,
in fact, about the validity of (1-8).
2. It appears that (1-8) is a logical necessity that does not depend on
experience: "A die is symmetrical; therefore, the probability that 5
will show equals 1/6." However, this is not so. We accept certain
alternatives as equally likely because of our collective experience.
The probability that 5 will show equals 1/6 not only because the die
exhibits geometric symmetries but also because it was observed in
the long history of rolling dice that 5 showed in about 1/6 of the
trials.
In the next example, the equally likely condition appears logical but is
not in agreement with observation.
Example 1.11
We wish to find the probability that a newborn baby is a girl. It is generally assumed
that p = 1/2 because the outcomes boy and girl are "obviously" equally likely.
However, this conclusion cannot be reached as a logical necessity unrelated to
observation. In the first place, it is only an approximation. Furthermore, without
access to long records, we would not know that the boy-girl alternatives are equally
likely regardless of the sex history of the baby's family, the season or place of its
birth, or other factors. It is only after long accumulation of records that such factors
become irrelevant and the two alternatives are accepted as approximately equally
likely. •
3. The classical definition can be used only in a limited class of problems. In the die experiment, for example, (1-8) holds only if the die
is fair, that is, if its six outcomes have the same probability. If it is
loaded and the probability of 5 equals, say, .193, there is no direct
way of deriving this probability from the equation.
The problem is more difficult in applications involving infinitely many outcomes. In such cases, we introduce as a measure of
the number of outcomes length, area, or volume. This makes reliance on (1-8) questionable, and, as the following classic example
suggests, it leads to ambiguous solutions.
Example 1.12
Given a circle C of radius r, we select "at random" a chord AB. Find the probability
p of the event A = {the length l of the chord is larger than the length r√3 of the side
of the inscribed equilateral triangle}.
We shall show that this problem can be given at least three reasonable solutions.

First Solution  The center M of the chord can be any point in the interior of the circle
C. The point is favorable to the event A if it is in the interior of the circle C1 of radius
r/2 (Fig. 1.2a). Thus in this interpretation of "randomness," the experiment consists
of all points in the circle C. Using the area of a region as a measure of the points in
that region, we conclude that the measure of the total number of outcomes is the area
πr² of the circle C and the measure of the outcomes favorable to the event A equals
the area πr²/4 of the circle C1. This yields

    p = (πr²/4) / (πr²) = 1/4
Second Solution We now assume that the end A of the chord AB is fixed. This
reduces the number of possibilities but has no effect on the value of p because the
number of favorable outcomes is reduced proportionately. We can thus consider as
experimental outcomes all points on the circumference of the circle. Since l > r√3 if
B is on the 120° arc DBE of Fig. 1.2b, the outcomes favorable to the event A are all
points on that arc. Using the length of the arcs as measure of the outcomes, we
conclude that the measure of the total number of outcomes is the length 2πr of the
circle and the measure of the favorable outcomes is the length 2πr/3 of the arc DBE.
Hence,

    p = (2πr/3) / (2πr) = 1/3

Figure 1.2  (a) First solution: the circle C1 of radius r/2; (b) second solution: the 120° arc DBE; (c) third solution: the segment GH of the line FK.
Third Solution We assume that the direction of AB is perpendicular to the line FK of
Fig. 1.2c. As before, this assumption has no effect on the value of p. Clearly l > r√3
if the center M of the chord AB is between the points G and H. Thus the outcomes of
the experiment are all points on the line FK and the favorable outcomes are the
points on the segment GH. Using the lengths r and r/2 of these segments as measures
of the outcomes, we conclude that

    p = (r/2) / r = 1/2

This example, known as the Bertrand paradox, demonstrates the possible ambiguities associated with the classical definition, the meaning of the terms possible and
favorable, and the need for a clear specification of all experimental outcomes. •
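The three answers can be reproduced by simulating the three notions of a "random chord." The sampling recipes below are the standard constructions implied by the three solutions; the sketch is an editorial illustration, not part of the text.

    import math, random

    rng = random.Random(0)
    r, n = 1.0, 200_000
    threshold = r * math.sqrt(3)          # side of the inscribed triangle

    def chord_from_midpoint():
        # First solution: midpoint uniform in the disk.
        while True:
            x, y = rng.uniform(-r, r), rng.uniform(-r, r)
            d2 = x * x + y * y
            if d2 <= r * r:
                return 2 * math.sqrt(r * r - d2)

    def chord_from_endpoint():
        # Second solution: one endpoint fixed, the other uniform on the circle.
        theta = rng.uniform(0, 2 * math.pi)
        return 2 * r * math.sin(theta / 2)

    def chord_from_distance():
        # Third solution: distance of the midpoint from the center uniform on (0, r).
        d = rng.uniform(0, r)
        return 2 * math.sqrt(r * r - d * d)

    for chord in (chord_from_midpoint, chord_from_endpoint, chord_from_distance):
        freq = sum(chord() > threshold for _ in range(n)) / n
        print(chord.__name__, round(freq, 3))     # about .25, .33, .50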
USES OF THE CLASSICAL DEFINITION
1. In many experiments, the assumption that there are N equally
likely alternatives is well established from long experience. In such
cases, (1-8) is accepted as self-evident; for example, "If a ball is
selected at random from a box containing m black and n white
balls, the probability p that the ball is white equals n/(m + n),"
and "if a telephone call occurs at random in the time interval (0,
T), the probability p that it occurs in the interval (0, t) equals t/T."
Such conclusions are valid; however, their validity depends on the
meaning of the word random. The conclusion of the call example
that p = t/T is not a consequence of the randomness of the call.
Randomness in this case is equivalent to the assumption that p =
t/T, and it follows from past records of telephone calls.
2. The specification of a probabilistic model is based on the probabilities of various events of the underlying physical experiment. In a
number of applications, it is not possible to determine these probabilities empirically by repeating the experiment. In such cases, we
use the classical definition as a working hypothesis; we assume
that certain alternatives are equally likely, and we determine the
unknown probabilities from (1-8). The hypothesis is accepted if its
theoretical consequences agree with experience; otherwise, it is
rejected. This approach has important applications in statistical
mechanics (see Example 2.25).
3. The classical definition can be used as the basis of a deductive
theory if (1-8) is accepted not as a method of determining the
probability of real events but as an assumption. As we show in the
next chapter, a deductive theory based on (1-8) is only a special
case of the axiomatic approach to probability, involving only experiments in which all elementary events have the same probability. We should note, however, that whereas the axiomatic development is based on the three axioms of probability, a theory based on
(1-8) requires no axioms. The reason is that if we assume that all
probabilities satisfy the equally likely condition, all axioms become
simple theorems. Indeed, axioms I and II are obvious. To prove
(1-5), we observe that if the events A and B consist of N_A and N_B
outcomes, respectively, and they are mutually exclusive, their
union A ∪ B consists of N_A + N_B outcomes. And since all probabilities satisfy (1-8), we conclude that

    P(A ∪ B) = (N_A + N_B)/N = P(A) + P(B)         (1-9)
Subjective
In the subjective interpretation of probability, the number P(A) assigned to a
statement A is a measure of our state of knowledge or belief concerning the
truth of A. The underlying theory can be generally accepted as a form of
"inductive" reasoning developed "deductively." We shall not discuss this
interpretation or its effectiveness in decisions based on inductive inference.
We note only that the three axioms on which the theory is based are in
reasonable agreement with our understanding of the properties of inductive
inference. In our development, the subjective interpretation of probability
will be considered only in the context of Bayesian estimation (Section 8-2).
There we discuss the use of the subjective interpretation in problems involving averages, and we comment on the resulting controversies between "subjectivists" and "objectivists."
Here we discuss a special case of the subjective interpretation of P(A)
involving total prior ignorance, and we show that in this case it is formally
equivalent to the classical definition.
PRINCIPLE OF INSUFFICIENT REASON  This principle states that if an
experiment has N alternatives (outcomes) ζ_i and we have no knowledge
about their occurrence, we must assign to all alternatives the same probability. This yields

    P{ζ_i} = 1/N                                   (1-10)

In the die experiment, we must assume that P{f_i} = 1/6. In the polling
experiment, we must assume that P{r} = P{d} = 1/2.
Note that (1-10) is equivalent to (1-8). However, the classical definition, on which Equation (1-8) is based, is conceptually different from the
principle of insufficient reason. In the classical definition, we know, from
symmetry considerations or from past experience, that the N outcomes are
equally likely. Furthermore, this conclusion is objective and is not subject to
change. The principle of insufficient reason, by contrast, is a consequence of
our total ignorance about the experiment. Furthermore, it leads to conclusions that are subjective and subject to change in the face of any evidence.
Concluding Remarks
In this book we present a theory based on the axiomatic definition of probability. The theory is developed deductively, and all conclusions follow logically from the axioms. In the context of the theory, the question "what is
probability?" is not relevant. The relevant question is the correspondence
between probabilities and observations. This question is answered in terms
of the other three interpretations of probability.
As a motivation, and as a reminder of the connection between concepts
and reality, we shall often give an empirical interpretation (relative frequency) of the various axioms, definitions, and theorems. This portion of the
book is heuristic and does not obey the rules of deductive reasoning on
which the theory is based.
We conclude with the observation that all statistical statements concerning future events are inductive and must be interpreted as reasonable
approximations. We stress, however, that our inability to make exact predictions is not limited to statistics. It is characteristic of all scientific investigations involving real phenomena, deterministic or random. This suggests that
physical theories are not laws of nature, whatever that may mean. They are
human inventions (mental constructs) used to describe with economy patterns of real events and to predict, but only approximately, their future
behavior. To "prove" that the future will evolve exactly as predicted, we
must invoke metaphysical causes.
2  Fundamental Concepts
The material in Chapters 2 and 3 is based on the notion of outcomes, events,
and probabilities and requires, for the most part, only high school mathematics. It is self-contained and richly illustrated, and it can be used to solve a
large variety of problems. In Chapter 2, we develop the theory of probability
as an abstract construct based on axioms. For motivation, we also make
frequent reference to the physical interpretation of all theoretical results.
This chapter is the foundation of the entire theory.
2-1  Set Theory
Sets are collections of objects. The objects of a set are called elements. Thus
the set
    {apple, boy, pencil}
consists of the three elements apple, pencil, and boy. The elements of a set
are usually placed in braces; the order in which they are written is immaterial. They can be identified by words or by suitable abbreviations. For example, {h, t} is a set consisting of the elements h for heads and t for tails.
Similarly, the six faces of a die form the set
    {f_1, f_2, f_3, f_4, f_5, f_6}
In this chapter all sets will be identified by script letters* A, B,
C, . . . ; their elements will, in general, be identified by the Greek letter ζ.
Thus the expression
    A = {ζ_1, ζ_2, . . . , ζ_N}
will mean that A is a set consisting of the N elements ζ_1, ζ_2, . . . , ζ_N.
The notation
    ζ_i ∈ A
will mean that ζ_i is an element of the set A (belongs to the set A); the
notation
    ζ_i ∉ A
will mean that ζ_i is not an element of A. Here is a simple illustration. The set
A = {2, 4, 6} consists of the three elements 2, 4, and 6. Thus
    2 ∈ A        3 ∉ A
In this illustration, the elements of the set A are numbers. Note, however,
that numbers are used merely for the purpose of identifying the elements; for
the specification of the set, their numerical properties are irrelevant.
The elements of a set might be simple objects, as in the preceding
examples, or each might consist of several objects. For example,
    A = {hh, ht, th}
is a set consisting of the three elements hh, ht, and th. Sets of this form will
be used in experiments involving repeated trials. In the set A, the elements
ht and th are different; however, the set {th, hh, ht} equals A.
In the preceding examples, we identified all sets explicitly in terms of
their elements. We shall also identify sets in terms of the properties of their
elements. For example,
    A = {all integers from 1 to 6}                 (2-2)
is the set {1, 2, 3, 4, 5, 6}; similarly,
    B = {all even integers from 1 to 6}
is the set {2, 4, 6}.
Venn Diagrams  We shall assume that all elements under consideration
belong to a set S called space (or universe). For example, if we consider
children in a certain school, S is the set of all children in that school.
The set S is often represented by a rectangle, and its elements by the
points in the rectangle. All other sets under consideration are thus represented by various regions in this rectangle. Such a representation is called a
Venn diagram. A Venn diagram consists of infinitely many points; however,
the set S that it represents need not have infinitely many elements. The
diagram is used merely to represent graphically various set operations.
* In subsequent chapters, we shall use script letters to identify only sets representing events of
a probability space.
Figure 2.1  B ⊂ A

Subsets  We shall say that a set B is a subset of a set A if every element
of B is also an element of A (Fig. 2.1). The notations

    B ⊂ A        A ⊃ B                             (2-3)

will mean that the set B is a subset of the set A. For example, if
    A = {f_1, f_2, f_3}        B = {f_1, f_3}
then B ⊂ A. In the Venn diagram representation of (2-3), the set B is
included in the set A.

Equality  The notation
    A = B
will mean that the sets A and B consist of the same elements. To establish
the equality of the sets A and B, we must show that every element of B is an
element of A and every element of A is an element of B. In other words,
    A = B    iff*    B ⊂ A   and   A ⊂ B
We shall say that B is a proper subset of A if B is a subset of A but
does not equal A. The distinction between a subset and a proper subset will
not always be made.

Unions and Intersections  Given two sets A and B, we form a set
consisting of all elements that are either in A or in B or in both. This set is
written in the form
    A ∪ B    or    A + B
and it is called the union of the sets A and B (shaded in Fig. 2.2).
Given two sets A and B, we form a set consisting of all elements that
are in A and in B. This set is written in the form
    A ∩ B    or    AB
and it is called the intersection of the sets A and B (shaded in Fig. 2.3).

Complement  Given a set A, we form a set consisting of all elements
of S that are not in A (shaded in Fig. 2.4). This set is denoted by Ā and it is
called the complement of A.

* Iff is an abbreviation for "if and only if."

Figure 2.2  A ∪ B (shaded)
Figure 2.3  A ∩ B (shaded)
Figure 2.4  Ā (shaded)
Example 2.1
Suppose that S is the set of all children in a community, A is the set of children in
fifth grade, and B is the set of all boys. In this case, Ā is the set of children that are
not in fifth grade, and B̄ is the set of all girls. The set A ∪ B consists of all girls in fifth
grade and all the boys in the community. The set A ∩ B consists of all boys in fifth
grade. •
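These operations map directly onto the set type of a programming language. A small sketch in the spirit of Example 2.1 (the membership lists are invented for the illustration):

    S = {"ann", "bob", "carl", "dora", "ed"}   # space: all children
    A = {"ann", "bob"}                         # children in fifth grade
    B = {"bob", "carl", "ed"}                  # boys

    print(A | B)                   # union A U B
    print(A & B)                   # intersection: boys in fifth grade
    print(S - B)                   # complement of B: the girls
    print(A <= S)                  # A is a subset of S -> True
    print({"dora"}.isdisjoint(B))  # {dora} and B have no common elements -> True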
Disjoint Sets  We shall say that the sets A and B are disjoint if they
have no common elements. If two sets are disjoint, their intersection has a
meaning only if we agree to define a set without elements. Such a set is
denoted by ∅ and it is called the empty (or null) set. Thus
    A ∩ B = ∅
iff the sets A and B are disjoint.
We note that a set may have finitely many or infinitely many elements.
There are two kinds of infinities: countable and noncountable. A set is called
Figure 2.5

countable if its elements can be brought into one-to-one correspondence
with the positive integers. For example, the set of even numbers is countable; this is easy to show. The set of rational numbers is countable; the proof
is more difficult. The set of all numbers in an interval is noncountable; this is
difficult to show.
PROPERTIES The following set properties follow readily from the defini-
tions.
    A ∪ S = S        A ∩ S = A        A ∪ ∅ = A        A ∩ ∅ = ∅

If B ⊂ A, then A ∪ B = A and A ∩ B = B.

Transitive Property
    If A ⊂ B and B ⊂ C, then A ⊂ C (Fig. 2.5).

Commutative Property
    A ∪ B = B ∪ A

Associative Property
    (A ∪ B) ∪ C = A ∪ (B ∪ C)

From this it follows that we can omit parentheses in the last two operations.

Distributive Law (See Fig. 2.6.)
    (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)

Thus set operations are the same as the corresponding arithmetic operations if we replace A ∪ B and A ∩ B by A + B and AB, respectively. With
the latter notation, the distributive law for sets yields
    (A + B)(C + D) = A(C + D) + B(C + D) = AC + AD + BC + BD

Figure 2.6  (A ∪ B) ∩ C (shaded)
Figure 2.7  The partition A = [A_1, . . . , A_m]

Partitions  A partition of S is a collection
    A = [A_1, . . . , A_m]
of subsets A_1, . . . , A_m of S with the following property (Fig. 2.7): They
are disjoint and their union equals S:
    A_i ∩ A_j = ∅,  i ≠ j        A_1 ∪ . . . ∪ A_m = S        (2-4)
The set S has, of course, many partitions. If A is a set with complement Ā, then
    A ∩ Ā = ∅        A ∪ Ā = S
hence, [A, Ā] is a partition of S.
Example 2.2
Suppose that S is the set of all children in a school. If A_i is the set of all children in
the ith grade, then [A_1, . . . , A_12] is a partition of S. If B is the set of all boys and
G = B̄ the set of all girls, then [B, G] is also a partition. •
Cartesian Product  Given two sets A and B with elements α_i and β_j, respectively, we form a new set C, the elements of which are all possible pairs α_iβ_j.
The set so formed is denoted by
    C = A × B
and it is called the Cartesian product of the sets A and B. Clearly, if A has m
elements and B has n elements, the set C so constructed has mn elements.

Example 2.3
If A = {car, apple, bird} and B = {heads, tails}, then
    C = A × B = {ch, ct, ah, at, bh, bt}   •

The Cartesian product can be defined even if the sets A and B are
identical.

Example 2.4
If A = {heads, tails} = B, then
    C = A × B = {hh, ht, th, tt}   •
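In code, the Cartesian product is provided by itertools.product; a brief sketch reproducing Examples 2.3 and 2.4 (illustrative only):

    from itertools import product

    A = ["car", "apple", "bird"]
    B = ["heads", "tails"]
    print(list(product(A, B)))        # 3 x 2 = 6 pairs, as in Example 2.3

    coin = ["h", "t"]
    print(list(product(coin, coin)))  # hh, ht, th, tt, as in Example 2.4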
Subsets and Combinatorics
Probabilistic statements involving finitely many elements and repeated trials
are based on the determination of the number of subsets of S having specific
properties. Let us review the underlying mathematics.
PERMUTATIONS AND COMBINATIONS  Given N distinct objects and a
number m ≤ N, we select m of these objects and place them in line. The
number of configurations so formed is denoted by P_m^N and is called "permutations of N objects taken m at a time." In this definition, two permutations
are different if they differ by at least one object or by the order of placement.
To clarify this concept, here is a list of all permutations of the N = 4
objects a, b, c, d for m = 1 and m = 2:

    m = 1:    a    b    c    d                       P_1^4 = 4    (2-6)

    m = 2:    ab   ac   ad   ba   bc   bd
              ca   cb   cd   da   db   dc            P_2^4 = 12   (2-7)

Note that P_2^4 = 3P_1^4. This is so because each term in (2-6) generates 4 - 1 = 3
terms in (2-7). Clearly, 3 is the number of letters remaining after one is
selected. This leads to the following generalization.

• Theorem
    P_m^N = N(N - 1) · · · (N - m + 1)             (2-8)

• Proof. Clearly, P_1^N = N; reasoning as in (2-7) we obtain P_2^N = N(N - 1).
We thus have N(N - 1) permutations of the N objects taken 2 at a time. At
the end of each permutation so formed, we attach one of the remaining
N - 2 objects. This yields (N - 2)P_2^N permutations of N objects taken 3 at a
time. By simple induction, we obtain
    P_m^N = (N - m + 1)P_{m-1}^N
and (2-8) results.
We emphasize that in each permutation, a specific object appears only
once, and two permutations are different even if they consist of the same
objects in a different order. Thus in (2-7), ab is distinct from ba; the configuration aa does not appear.
Example 2.5
As we see from (2-8),
    P_2^10 = 10 × 9 = 90
If the ten objects are the numbers 0, 1, . . . , 9, then P_2^10 is the total number of
two-digit numbers excluding 00, 11, 22, . . . , 99. •
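The count in Example 2.5 can be checked by direct enumeration (math.perm requires Python 3.8 or later); a minimal sketch:

    from itertools import permutations
    from math import perm

    digits = range(10)
    two_digit = list(permutations(digits, 2))   # ordered pairs of distinct digits
    print(len(two_digit), perm(10, 2))          # 90 90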
• Corollary. Setting m = N in (2-8), we obtain
    P_N^N = N(N - 1) · · · 1 = N!
This is the number of permutations of N objects (the phrase "taken N at a
time" is omitted).
Example 2.6
Here are the 3! = 3 × 2 permutations of the objects a, b, and c:
    abc    acb    bac    bca    cab    cba   •
Combinations  Given N distinct objects and a number m ≤ N, we select m
of these objects in all possible ways. The number of groups so formed is
denoted by C_m^N and is called "combinations of N objects taken m at a time."
Two combinations are different if they differ by at least one object; the order
of selection is immaterial.
If we change the order of the objects of a particular combination in all
possible ways, we obtain m! permutations of these objects. Since there are
C_m^N combinations, we conclude that
    P_m^N = m!C_m^N                                (2-9)
From this and (2-8) it follows that
    C_m^N = N(N - 1) · · · (N - m + 1) / m!
This fraction is denoted by (N choose m). Multiplying numerator and denominator
by (N - m)!, we obtain
    C_m^N = (N choose m) = N! / [m!(N - m)!]       (2-10)
Note, finally, that
    C_m^N = C_{N-m}^N
This can be established also directly: Each time we take m out of N objects,
we leave N - m objects.
Example 2.7
With N = 4 and m = 2, (2-10) yields
C_2^4 = (4 choose 2) = (4 × 3)/(1 × 2) = 6
Here are the six combinations of the four objects a, b, c, d taken 2 at a time:
ab   ac   ad   bc   bd   cd •
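Since formulas (2-8) to (2-10) are purely combinatorial, they can be checked with the Python standard library. The sketch below (the variable names are our own) reproduces the counts of Examples 2.5 to 2.7.

```python
# Permutations and combinations of N objects taken m at a time.
from itertools import combinations
from math import comb, perm     # available in Python 3.8+

print(perm(10, 2))                      # 90, as in Example 2.5
print(perm(3, 3))                       # 6 = 3!, as in Example 2.6
print(comb(4, 2))                       # 6, as in Example 2.7
print(["".join(c) for c in combinations("abcd", 2)])   # ab ac ad bc bd cd
print(perm(4, 2) == 2 * comb(4, 2))     # True: relation (2-9), P = m! C
```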
Applications  We have N objects forming two groups. Group 1 consists of m identical objects identified by h; group 2 consists of N − m identical objects identified by t. We place these objects in N boxes, one in each box. We maintain that the number x of ways we can do so equals
x = (N choose m)        (2-11)
• Proof. The m objects of group 1 are placed in m out of the N available boxes and the N − m objects of group 2 in the remaining N − m boxes. This yields (2-11) because there are C_m^N ways of selecting m out of N objects.
Example 2.8
Here are the q = 6 ways of placing the four objects h, h, t, t in the four boxes B1, B2, B3, B4:
hhtt   htht   htth   thht   thth   tthh
q = 6 •
Example 2.9
This example has applications in statistical mechanics (see Example 2.25). We place m identical balls in n > m boxes. We maintain that the number y of ways that we can do so equals
y = (N choose m)        (2-12)
where N = n + m − 1.
• Proof. The solution of this problem is rather tricky. We consider the m balls as group 1 of objects and the n − 1 interior walls separating the n boxes as group 2. We thus have N = n + m − 1 objects, and (2-12) follows from (2-11). In Fig. 2.8, we demonstrate one of the placements of m = 4 balls in n = 7 boxes. In this case, we have n − 1 = 6 interior walls. The resulting sequence of N = n + m − 1 = 10 balls and walls is bwwbbwwwbw, as shown. •
Binary Numbers  We wish to find the total number z of N-digit binary numbers consisting of m 1s and N − m 0s. Identifying the 1s and 0s as the objects of group 1 and group 2, respectively, we conclude from (2-11) that z = (N choose m).
[Figure 2.8: the placement bwwbbwwwbw of m = 4 balls (b) and n − 1 = 6 walls (w) for n = 7 boxes]
Example 2.10
Here are the (4 choose 2) = 6 four-digit binary numbers consisting of two 1s and two 0s:
1100   1010   1001   0110   0101   0011 •
Note  From the identity (binomial expansion)
(a + b)^N = Σ_{m=0}^{N} (N choose m) a^m b^{N−m}
it follows with a = b = 1 that
Σ_{m=0}^{N} (N choose m) = 2^N        (2-13)
Hence, the total number of N-digit binary numbers equals 2^N.
Subsets  Consider a set S consisting of the N elements ζ1, . . . , ζN. We maintain that the total number of its subsets, including S itself and the empty set ∅, equals 2^N.
• Proof. It suffices to show that we can associate to each subset of S one and only one N-digit binary number [see (2-13)]. Suppose that A is a subset of S. If A contains the element ζi, we write 1 as the ith binary digit; otherwise, we write 0. We have thus established a one-to-one correspondence between all the N-digit binary numbers and the subsets of S. Note, in particular, that S corresponds to 11 . . . 1 and ∅ to 00 . . . 0.
Example 2.11
The set S = {a, b, c, d} has four elements and 2^4 = 16 subsets. They are listed below, each followed by its corresponding four-digit binary number:
∅ 0000     {a} 1000     {b} 0100     {c} 0010     {d} 0001
{a, b} 1100     {a, c} 1010     {a, d} 1001     {b, c} 0110     {b, d} 0101     {c, d} 0011
{a, b, c} 1110     {a, b, d} 1101     {a, c, d} 1011     {b, c, d} 0111     {a, b, c, d} 1111 •
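The correspondence used in the proof (the ith binary digit is 1 iff the ith element belongs to the subset) can be generated mechanically. Here is a minimal Python sketch, with names of our own choosing, that lists the 16 subsets of Example 2.11.

```python
# Subsets of {a, b, c, d} paired with their 4-digit binary numbers.
elements = ["a", "b", "c", "d"]
N = len(elements)

for code in range(2 ** N):                      # 0 .. 15
    bits = format(code, "04b")                  # e.g. '0110'
    subset = [e for e, bit in zip(elements, bits) if bit == "1"]
    print(bits, set(subset) if subset else "the empty set")

print("number of subsets:", 2 ** N)             # 16
```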
Generalized Combinations  We are given N objects and wish to group them into r classes A1, . . . , Ar such that the ith class Ai consists of ki objects, where
k1 + · · · + kr = N
The total number of such groupings equals
C^N_{k1, . . . , kr} = N! / (k1! k2! · · · kr!)        (2-14)
• Proof. We select first k1 out of the N objects to form the first class A1. As we know, there are (N choose k1) ways to do so. We next select k2 out of the remaining N − k1 objects to form the second class A2. There are (N − k1 choose k2) ways to do so. We then select k3 out of the remaining N − k1 − k2 objects to form the third class A3, and so we continue. After the formation of the class A_{r−1}, there remain N − k1 − · · · − k_{r−1} = kr objects. There is (kr choose kr) = 1 way of selecting kr out of the remaining kr objects; hence, there is only one way of forming the last class Ar.
From the foregoing it follows that
C^N_{k1, . . . , kr} = [N! / (k1!(N − k1)!)] × [(N − k1)! / (k2!(N − k1 − k2)!)] × · · · × [(k_{r−1} + kr)! / (k_{r−1}! kr!)]
and (2-14) results.
Note that (2-10) is a special case of (2-14) obtained with r = 2, k1 = k, k2 = N − k, and C^N_{k1,k2} = C^N_k.
Example 2.12
We wish to determine the number M of bridge hands that we can deal using a deck with N = 52 cards.
In a bridge game, there are r = 4 hands; each hand consists of 13 cards. With k1 = k2 = k3 = k4 = 13, (2-14) yields
M = C^52_{13,13,13,13} = 52! / (13! 13! 13! 13!) ≈ 5.36 × 10^28 •
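The multinomial count (2-14) can be evaluated as a product of binomial coefficients, mirroring the proof above. The sketch below is one way to do this in Python; the function name multinomial is our own, and it reproduces the order of magnitude quoted in Example 2.12.

```python
# Generalized combinations (2-14): split N objects into classes of sizes k1, ..., kr.
from math import comb, factorial

def multinomial(sizes):
    """N! / (k1! k2! ... kr!) computed as a product of binomial coefficients."""
    total, result = 0, 1
    for k in sizes:
        total += k
        result *= comb(total, k)
    return result

M = multinomial([13, 13, 13, 13])                    # bridge deals, Example 2.12
print(M == factorial(52) // factorial(13) ** 4)      # True
print(f"{M:.2e}")                                    # about 5.36e+28
```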
Let us look at two other interpretations of (2-14).
1. We have r groups of objects. Group 1 consists of k1 repetitions of a particular object h1; group 2 consists of k2 repetitions of another object h2, and so on. These objects are placed into N = k1 + · · · + kr boxes, one in each box. The total number of ways that we can do so equals C^N_{k1, . . . , kr}.
2. Suppose that S is a set consisting of N elements and that [A1, . . . , Ar] is a partition of S formed with the r sets Ai as in (2-4). If the ith set Ai consists of ki elements, where the ki are given numbers, the total number of such partitions equals C^N_{k1, . . . , kr}.
2-2  Probability Space
In the theory of probability, as in all scientific investigations, it is essential that we describe a physical experiment in terms of a clearly defined model. In this section, we develop the underlying concepts.
• Definition. An experimental model is a set S. The elements ζi of S are called outcomes. The subsets of S are called events. The empty set ∅ is called the impossible event and the set S the certain event. Two events A and B are called mutually exclusive if they have no common elements, that is, if A ∩ B = ∅. An event {ζi} consisting of the single outcome ζi is called an elementary event. Note the important distinction between the element ζi and the event {ζi}.
If S has N elements, the number of its subsets equals 2^N. Hence, an experiment with N outcomes has 2^N events (we include the certain event and the impossible event).
Example 2.13
In the single-die experiment, the space S consists of six outcomes:
S = {f1, . . . , f6}
It therefore has 2^6 = 64 events. We list them according to the number of elements in each.
(6 choose 0) = 1 event without elements, namely, the event ∅.
(6 choose 1) = 6 events with one element each, namely, the elementary events {f1}, {f2}, . . . , {f6}.
(6 choose 2) = 15 events with two elements each, namely, {f1, f2}, {f1, f3}, . . . , {f5, f6}.
(6 choose 3) = 20 events with three elements each, namely, {f1, f2, f3}, {f1, f2, f4}, . . . , {f4, f5, f6}.
(6 choose 4) = 15 events with four elements each, namely, {f1, f2, f3, f4}, {f1, f2, f3, f5}, . . . , {f3, f4, f5, f6}.
(6 choose 5) = 6 events with five elements each, namely, {f1, f2, f3, f4, f5}, . . . , {f2, f3, f4, f5, f6}.
(6 choose 6) = 1 event with six elements, namely, the event S.
In this listing, the various events are specified explicitly in terms of their elements as in (2-1). They can, however, be described in terms of the properties of the elements as in (2-2). We now cite various events using both descriptions:
{odd} = {f1, f3, f5}        {even} = {f2, f4, f6}
{less than 4} = {f1, f2, f3}        {even, less than 4} = {f2}
Note, finally, that each element of S belongs to 2^5 = 32 events. For example, f1 belongs to the event {odd}, the event {less than 4}, the event {f1, f2}, and 29 other events. •
Example 2.14
(a) In the single toss of a coin, the space S consists of two outcomes: S = {h, t}. It therefore has 2^2 = 4 events, namely ∅, {h}, {t}, and S.
(b) If the coin is tossed twice, S consists of four outcomes:
S = {hh, ht, th, tt}
It therefore has 2^4 = 16 events. These include the four elementary events {hh}, {ht}, {th}, and {tt}. The event H1 = {heads at the first toss} is not an elementary event. It is an event consisting of the two outcomes hh and ht. Thus H1 = {hh, ht}. Similarly,
H2 = {heads at the second toss} = {hh, th}
T1 = {tails at the first toss} = {th, tt}
T2 = {tails at the second toss} = {ht, tt}
The intersection of the events H1 and H2 is the elementary event {hh}. Thus
H1 ∩ H2 = {heads at both tosses} = {hh}
Similarly,
H1 ∩ T2 = {heads first, tails second} = {ht}
T1 ∩ T2 = {tails at both tosses} = {tt} •
Empirical Interpretation of Outcomes and Events. In the applications of probability to real problems, the underlying experiment is repeated a large number of times, and probabilities are introduced to describe various averages. A single performance of the experiment will be called a trial. The repetitions of the experiment form repeated trials. In the single-die experiment, a trial is the roll of the die once. Repeated trials are repeated rolls. In the polling experiment, a trial is the selection of a voter from a given population.
At each trial, we observe one and only one outcome, whatever we have agreed to consider as the outcome for that experiment. The set of all outcomes is modeled by the certain event S. If the observed outcome ζ is an element of the event A, we say that the event A occurred at that particular trial. Thus at a single trial only one outcome is observed; however, many events occur, namely, all the 2^(N−1) subsets of S that contain the particular outcome ζ. The remaining 2^(N−1) events do not occur.
Example 2.15
(a) We conduct a poll to determine whether a voter is Republican (r) or Democrat (d). In this case, the experiment consists of two outcomes, S = {r, d}, and four events: ∅, {r}, {d}, S.
(b) We wish to determine also whether the voter is male or female. We now have four outcomes:
S = {rm, rf, dm, df}
and 16 events. The elementary events are {rm}, {rf}, {dm}, and {df}. Thus
R = {Republican} = {rm, rf}
D = {Democrat} = {dm, df}
M = {male} = {rm, dm}
F = {female} = {rf, df}
R ∩ M = {Republican, male} = {rm}
D ∩ M = {Democrat, male} = {dm}
R ∩ F = {Republican, female} = {rf}
D ∩ F = {Democrat, female} = {df} •
Note that the impossible event does not occur in any trial. The certain event occurs in every trial. If the events A and B are mutually exclusive and A occurs at a particular trial, B does not occur at that trial. If B ⊂ A and B occurs, A also occurs. At a particular trial, either the event A or its complement Ā occurs. More generally, suppose that [A1, . . . , Am] is a partition of S. It follows from Equation (2-4) that at a particular trial, one and only one event of this partition will occur.
Example 2.16
Consider the single-die experiment. The die is rolled and 2 shows. In our terminology, the outcome f2 is observed. In this case, the event {even}, the event {less than 4}, and 30 other events occur, namely, all subsets of S that contain the element f2. •
Returning to an arbitrary S, we denote by nA the number of times the event A occurs in n trials. Clearly, nA ≤ n, nS = n, and n∅ = 0. Furthermore, if the events A and B are mutually exclusive, then
nA∪B = nA + nB        (2-15)
If the events A and B have common elements, then
nA∪B = nA + nB − nA∩B        (2-16)
Example 2.17
We roll a die 10 times. In the observed sequence, the elementary event {f2} occurs 3 times, the event {even} 7 times, and the event {odd} 3 times. Furthermore, if A = {even} and B = {less than 5}, then
A ∪ B = {f1, f2, f3, f4, f6}        A ∩ B = {f2, f4}
and the observed numbers of occurrences are
nA = 7        nB = 8        nA∩B = 6        nA∪B = 9
in agreement with (2-16). •
Note  This interpretation of repeated trials is used to relate probabilities to real experiments as in (1-1). If the experiment is performed n times and the event A occurs nA times, then P(A) ≈ nA/n, provided that n is sufficiently large. We should point out, however, that repeated trials is also a model concept used to create other models from a given experimental model. Consider the coin experiment. The model of a single toss has only two outcomes. However, if we are interested in establishing averages involving two tosses, our model has four outcomes: hh, ht, th, and tt. The model interpretation is thus fundamentally different from the empirical interpretation of repeated tosses. This distinction is subtle but fundamental. It will be discussed further in Chapter 3 and will be used throughout the book.
The Axioms
A probabilistic model is a set S the elements of which are experimental outcomes. The subsets of S are events. To complete the specification of the model, we shall assign probabilities to all events.
• Definition. The probability of an event A is a number P(A) assigned to A. This number satisfies the following axioms.
I. It is nonnegative:
P(A) ≥ 0        (2-17)
II. The probability of the certain event equals 1:
P(S) = 1        (2-18)
III. If the events A and B are mutually exclusive, then
P(A ∪ B) = P(A) + P(B)        (2-19)
This axiom can be readily generalized. Suppose that the events A, B, and C are mutually exclusive, that is, that no two of them have common elements. Repeated application of (2-19) yields
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
This can be extended to any finite number of terms; we shall assume that it holds also for an infinite but countable number of terms. Thus if the events A1, A2, . . . are mutually exclusive, then
P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·        (2-20)
This does not follow from (2-19); it is an additional requirement known as the axiom of infinite additivity.
Axiomatic Definition of an Experiment  In summary, a model of an experiment is specified in terms of the following concepts:
1. A set S consisting of the outcomes ζi
2. Subsets of S called events
3. A number P(A) assigned to every event A (this number satisfies the listed axioms but is otherwise arbitrary)
The letter S will be used to identify not only the certain event but also the entire experiment.
Probability Masses  We shall find it convenient to interpret the probability P(A) of an event A as its probability mass. In Venn diagrams, S is the entire rectangle, and its mass equals 1. The mass of the region of the diagram representing an event A equals P(A). This interpretation of P(A) is consistent with the axioms and can be used to facilitate the interpretation of various results.
PROPERTIES In the development of the theory, all results must be derived
from the axioms. We must not accept any statement merely because it
appears intuitively reasonable. Let us look at a few illustrations.
1. The probability of the impossible event is 0:
P(∅) = 0        (2-21)
• Proof. For any A, the events A and ∅ are mutually exclusive; hence (axiom III), P(A ∪ ∅) = P(A) + P(∅). Furthermore, P(A ∪ ∅) = P(A) because A ∪ ∅ = A, and (2-21) results.
2. The probability of Ā equals
P(Ā) = 1 − P(A)        (2-22)
• Proof. The events A and Ā are mutually exclusive, and their union equals S; hence,
P(S) = P(A ∪ Ā) = P(A) + P(Ā)
and (2-22) follows from (2-18).
3. Since P(Ā) ≥ 0 (axiom I), it follows from (2-22) that P(A) ≤ 1. Combining with (2-17), we obtain
0 ≤ P(A) ≤ 1        (2-23)
4. If B ⊂ A, then
P(B) ≤ P(A)        (2-24)
• Proof. Clearly (see Fig. 2.9),
A = A ∪ B = B ∪ (A ∩ B̄)        (2-25)
Furthermore, the events B and A ∩ B̄ are mutually exclusive; hence (axiom III),
P(A) = P(B) + P(A ∩ B̄)
And since P(A ∩ B̄) ≥ 0 (axiom I), (2-24) follows.
[Figure 2.9]
5. For any A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)        (2-26)
• Proof. This is an extension of axiom III. To prove it, we shall express the event A ∪ B as the union of two mutually exclusive events. As we see from Fig. 2.10, A ∪ B = A ∪ (Ā ∩ B). Hence, as in (2-25),
P(A ∪ B) = P(A) + P(Ā ∩ B)        (2-27)
Furthermore,
B = B ∩ S = B ∩ (A ∪ Ā) = (B ∩ A) ∪ (B ∩ Ā)
(distributive law); and since the events A ∩ B and Ā ∩ B are mutually exclusive, we conclude from axiom III that
P(B) = P(A ∩ B) + P(Ā ∩ B)        (2-28)
Eliminating P(Ā ∩ B) from (2-27) and (2-28), we obtain (2-26).
[Figure 2.10]
We note that the properties just discussed can be derived simply in
terms of probability masses.
Empirical Interpretation. The theory of probability is based on the axioms and on deduction. However, for the results to be useful in the applications, all concepts must be in reasonable agreement with the empirical interpretation P(A) ≈ nA/n of P(A). In Section 1-2, we showed that this agreement holds for the three axioms. Let us look at the empirical interpretation of the properties.
1. The impossible event never occurs; hence,
P(∅) = n∅/n = 0
2. As we know, nA + nĀ = n; hence,
nĀ/n = 1 − nA/n ≈ 1 − P(A)
3. Clearly, 0 ≤ nA ≤ n; hence, 0 ≤ nA/n ≤ 1, in agreement with (2-23).
4. If B ⊂ A, then nB ≤ nA; hence,
P(B) ≈ nB/n ≤ nA/n ≈ P(A)
5. In general [see (2-16)], nA∪B = nA + nB − nA∩B; hence,
P(A ∪ B) ≈ nA∪B/n = nA/n + nB/n − nA∩B/n ≈ P(A) + P(B) − P(A ∩ B)
Model Specification
An experimental model is specified in terms of the probabilities of all its
events. However, because of the axioms, we need not assign probabilities to
every event. For example, if we know P(A), we can find P(Ā) from (2-22).
We will show that the probabilities of the events of any experiment can be
determined in terms of the probabilities of a minimum number of events.
COUNTABLE OUTCOMES  Suppose, first, that S consists of N outcomes ζ1, . . . , ζN. In this case, the experiment is specified in terms of the probabilities pi = P{ζi} of the elementary events {ζi}. Indeed, if A is an event consisting of the r outcomes ζk1, . . . , ζkr, it can be written as the union of the corresponding elementary events:
A = {ζk1} ∪ · · · ∪ {ζkr}        (2-29)
hence [see (2-20)],
P(A) = P{ζk1} + · · · + P{ζkr} = pk1 + · · · + pkr        (2-30)
Thus the probability of any event A of S equals the sum of the probabilities pi of the elementary events formed with all the elements of A. The numbers pi are such that
p1 + · · · + pN = 1        pi ≥ 0        (2-31)
but otherwise arbitrary.
The foregoing holds also if S consists of countably many outcomes (see axiom III). It does not hold if the elements of S are noncountable (points in an interval, for example). In fact, it is not uncommon that the probabilities of all elementary events of a noncountable space equal zero even though P(S) = 1.
Equally Likely Outcomes  We shall say that the outcomes of an experiment are equally likely if
p1 = · · · = pN = 1/N        (2-32)
In the context of an abstract model, (2-32) is only a special assumption. However, for real experiments, the equally likely assumption covers a large number of applications. In many problems, this assumption is established empirically in terms of observed frequencies or by "reasoning" based on physical "symmetries." This includes games of chance, statistical mechanics, coding, and many other applications.
From (2-30) and (2-32) it follows that if an event A consists of NA outcomes, then
P(A) = NA / N        (2-33)
This relationship can be phrased as follows: The probability P(A) of an event A equals the number NA of elements of A divided by the total number N of elements. It appears, therefore, that (2-32) is equivalent to the classical definition of probability [see (1-8)]. There is, however, a fundamental difference. In the axiomatic approach to probability, the equally likely condition of (2-32) is an assumption used to establish the probabilities of an experimental model. In the classical approach, (2-32) is a logical conclusion and is used, in fact, to define the probability of A.
Let us look at several illustrations of this important special case.
Example 2.18
(a) We shall say that a coin is fair if its outcomes h and t are equally likely, that is, if
P{h} = P{t} = 1/2
(b) A coin tossed twice generates the space S = {hh, ht, th, tt}. If its four outcomes are equally likely (assumption), then
P{hh} = P{ht} = P{th} = P{tt} = 1/4        (2-34)
In this experiment, the event H1 = {heads at the first toss} = {hh, ht} has two outcomes; hence, P(H1) = 1/2.
(c) A coin tossed three times generates the space
S = {hhh, hht, hth, htt, thh, tht, tth, ttt}
Assuming again that all outcomes are equally likely, we conclude that
P{hhh} = · · · = P{ttt} = 1/8        (2-35)
The event T2 = {tails at the second toss} = {hth, htt, tth, ttt} has four outcomes; hence, P(T2) = 1/2. The event A = {two heads show} = {hht, hth, thh} has three outcomes; hence, P(A) = 3/8.
We show in Section 3-1 that the equally likely assumptions leading to (2-34) and (2-35) are equivalent to the assumption that the coin is fair and the tosses are independent. •
Example 2.19
(a) We shall say that a die is fair if its six outcomes fi are equally likely, that is, if
P{f1} = · · · = P{f6} = 1/6
The event {even} = {f2, f4, f6} has three outcomes; hence, P{even} = 3/6.
(b) In the experiment with two dice, we have 36 outcomes fifj. If they are equally likely, P{fifj} = 1/36. The event {11} = {f5f6, f6f5} has two outcomes; hence, P{11} = 2/36. The event {7} = {f1f6, f2f5, f3f4, f4f3, f5f2, f6f1} has six outcomes; hence, P{7} = 6/36. •
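For equally likely outcomes, (2-33) reduces every probability to a count, so a small experiment such as the two dice of Example 2.19(b) can be checked by brute-force enumeration. A minimal Python sketch (the names are our own):

```python
# Two fair dice: 36 equally likely outcomes; count the favorable ones.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     # all 36 pairs (i, j)

def prob(event):
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

print(prob(lambda o: sum(o) == 7))      # 1/6  (= 6/36)
print(prob(lambda o: sum(o) == 11))     # 1/18 (= 2/36)
print(prob(lambda o: o[0] % 2 == 0))    # 1/2  (even shows on the first die)
```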
Example 2.20
(a) If a coin is not fair, then
P{h} = p        P{t} = q        p + q = 1        (2-36)
For a theoretical investigation, p is an arbitrary number. If the model represents a real coin, p is determined empirically, as in (1-1).
(b) A coin is tossed twice, generating a space with four outcomes. We assign to the elementary events the following probabilities:
P{hh} = p^2        P{ht} = pq        P{th} = qp        P{tt} = q^2        (2-37)
These probabilities are consistent with (2-31) because
p^2 + pq + qp + q^2 = (p + q)^2 = 1
The assumption of (2-37) seems artificial. As we show in Section 3-1, it is equivalent to (2-36) and the independence of the two tosses. In preparation, we note the following.
With H1, H2, T1, T2 the events heads at the first, heads at the second, tails at the first, and tails at the second toss, respectively, we have
P(H1) = P{hh} + P{ht} = p^2 + pq = p(p + q) = p
P(H2) = P{hh} + P{th} = p^2 + qp = p(p + q) = p
P(T1) = P{th} + P{tt} = qp + q^2 = q(p + q) = q
P(T2) = P{ht} + P{tt} = pq + q^2 = q(p + q) = q
The elementary event {hh} is the intersection of the events H1 and H2; hence, P(H1 ∩ H2) = P{hh}. Thus
P(H1 ∩ H2) = P{hh} = p^2 = P(H1)P(H2)
P(H1 ∩ T2) = P{ht} = pq = P(H1)P(T2)
P(T1 ∩ H2) = P{th} = qp = P(T1)P(H2)
P(T1 ∩ T2) = P{tt} = q^2 = P(T1)P(T2)        (2-38) •
Example 2.21
(a) In Example 2.15(a), S = {r, d} and P{r} = p, P{d} = q as in (2-36).
(b) In Example 2.15(b), the space is
S = {rm, rf, dm, df}
To specify it, we assign probabilities to its elementary events:
P{rm} = p1        P{rf} = p2        P{dm} = p3        P{df} = p4
where p1 + p2 + p3 + p4 = 1. Thus p1 is the probability that a person polled is Republican and male. If S is the model of an actual poll, then p1 ≈ nrm/n, where nrm is the number of male Republicans out of n persons polled.
In this experiment, the event R = {Republican} consists of two outcomes. Applying (2-30), we obtain
P(R) = P{rm} + P{rf} = p1 + p2
Similarly,
P(M) = p1 + p3        P(D) = p3 + p4        P(F) = p2 + p4 •
Equally Likely Events  The assumption of equally likely outcomes on which (2-32) is based can be extended to events. Suppose that A1, . . . , Am are m events of a partition. We shall say that these events are equally likely if their probabilities are equal. Since the events are mutually exclusive and their union equals S, we conclude as in (2-32) that
P(A1) = · · · = P(Am) = 1/m        (2-39)
Problems involving equally likely outcomes and events are important in a variety of applications, and their solution is often difficult. However, the difficulties are mainly combinatorial (counting the number of outcomes in an event). Since our primary objective is the clarification of the underlying theory, we shall not dwell on such problems. We shall give only a few illustrations.
Example 2.22
We deal a bridge hand from a well-shuffled deck of cards. Find the probabilities of the following events:
A = {hand contains 4 aces}        B = {4 kings}
C = {4 aces and 4 kings}        D = {4 aces or 4 kings}
The model of this experiment has (52 choose 13) outcomes, namely, the number of ways that we can take 13 out of 52 objects [see (2-9)]. In the context of the model, the assumption that the deck is well shuffled means that all outcomes are equally likely.
In the event A, there are 4 aces, and the remaining 9 cards are taken from the 48 cards that are not aces. Thus the number of outcomes in A equals (48 choose 9). This is true also for the event B. Hence,
P(A) = P(B) = (48 choose 9) / (52 choose 13) = .0026
The event C = A ∩ B contains 4 aces and 4 kings. The remaining 5 cards are taken from the remaining 44 cards that are not aces or kings. Hence,
P(C) = (44 choose 5) / (52 choose 13) = 1.7 × 10^−6
Finally, D = A ∪ B; hence [see (2-26)],
P(D) = P(A) + P(B) − P(C) = .0052 •
Example 2.23
(a) A box contains 60 red and 40 black balls. A ball is selected at random. Find the probability pa that the ball is red.
The model of this experiment has 60 + 40 = 100 outcomes. In the context of the model, the random selection is equivalent to the assumption that all outcomes are equally likely. Since there are 60 red balls, the event R = {the selected ball is red} has 60 outcomes. Hence, pa = P(R) = 60/100.
(b) We select 20 balls from the box. Find the probability pb that 15 of the selected balls are red and 5 black.
In this case, an outcome is the selection of 20 balls. There are (100 choose 20) ways of selecting 20 out of 100 objects; hence, the experiment has (100 choose 20) equally likely outcomes. There are (60 choose 15) ways of selecting 15 out of the 60 red balls and (40 choose 5) ways of selecting 5 out of the 40 black balls. Hence, there are (60 choose 15) × (40 choose 5) ways of selecting 15 red and 5 black balls. Since all outcomes are equally likely, we conclude that
pb = (60 choose 15)(40 choose 5) / (100 choose 20) = .065        (2-40)
This can be readily generalized. Suppose that a set contains L red objects and M black objects (two kinds of elements), where L + M = N. We select n ≤ N of these objects at random. Find the probability p that l of these objects are red and m are black, where l + m = n.
This experiment has (N choose n) equally likely outcomes. There are (L choose l) ways of selecting l out of the L red objects and (M choose m) ways of selecting m out of the M black objects. Hence, as in (2-40),
p = (L choose l)(M choose m) / (N choose n) = [L!/(l!(L − l)!)] [M!/(m!(M − m)!)] / [N!/(n!(N − n)!)]        (2-41) •
Example 2.24
We deal a 5-card poker hand out of a 52-card well-shuffled deck. Find the probability p that the hand contains 3 spades.
This is a special case of (2-41) with N = 52, L = 13, and l = 3 if we identify the 13 spades with the red objects and the other 39 cards with the black objects. With M = 39, n = 5, and m = 2, (2-41) yields
p = (13 choose 3)(39 choose 2) / (52 choose 5) = .082 •
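Formula (2-41) is the hypergeometric probability. A small helper function (the name and argument order below are our own choices) evaluates it and reproduces Examples 2.23(b) and 2.24.

```python
# Probability of l red and m black in a random sample of n = l + m objects
# taken from L red and M black objects, as in (2-41).
from math import comb

def hypergeometric(L, M, l, m):
    return comb(L, l) * comb(M, m) / comb(L + M, l + m)

print(round(hypergeometric(60, 40, 15, 5), 3))    # 0.065, Example 2.23(b)
print(round(hypergeometric(13, 39, 3, 2), 3))     # 0.082, Example 2.24
```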
In certain applications involving a large number of outcomes, it is not possible to determine empirically the probabilities pi of the elementary events. In such cases, a theoretical model is formed by using the equally likely assumption leading to (2-33) as a working hypothesis. The hypothesis is accepted if its consequences agree with experimental observations. The following is an important application from statistical mechanics.
Example 2.25
We place at random m particles in n ≥ m boxes. Find the probability p that the particles are located in m preselected boxes, one in each box.
The solution to this problem depends on what we consider as outcomes. We shall analyze the following celebrated models.
(a) Maxwell–Boltzmann. We assume that the m particles are distinct, and we consider as outcomes all possible ways of placing them into the n boxes. There are n choices for each particle; hence, the number N of outcomes equals n^m. The number NA of favorable outcomes equals the m! ways of placing the m particles into the m preselected boxes (permutations of m objects). Thus N = n^m, NA = m!, and (2-33) yields
p = m! / n^m
(b) Bose–Einstein. We now assume that the m particles are identical. In this case, N equals the number of ways of placing m identical objects into n boxes. As we know from (2-12) and from Fig. 2.8, this number equals (n + m − 1 choose m). There is, of course, only one way of placing the m particles in the m preselected boxes. Hence,
p = 1 / (n + m − 1 choose m) = m!(n − 1)! / (n + m − 1)!
(c) Fermi–Dirac. We assume again that the particles are identical and that we place only one particle in each box. In this case, the number of possibilities equals the number (n choose m) of combinations of n objects taken m at a time. Of these, only one is favorable; hence,
p = 1 / (n choose m) = m!(n − m)! / n!
One might argue, as indeed it was argued in the early years of statistical mechanics, that only the first of these solutions may be accepted. However, in the absence of direct or indirect experimental evidence, no single model is logically correct. The models are actually only hypotheses; the physicist accepts the particular model whose consequences agree with observations. •
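The three counting models of Example 2.25 give different numerical answers even for very small systems. The sketch below evaluates them for m = 2 particles and n = 4 boxes and, for the Maxwell–Boltzmann case, confirms the formula by enumerating all placements; the parameter values are arbitrary choices of ours.

```python
# Example 2.25 for m = 2 particles in n = 4 boxes.
from itertools import product
from math import comb, factorial

m, n = 2, 4

p_mb = factorial(m) / n**m         # Maxwell-Boltzmann: distinct particles
p_be = 1 / comb(n + m - 1, m)      # Bose-Einstein: identical particles
p_fd = 1 / comb(n, m)              # Fermi-Dirac: identical, at most one per box
print(p_mb, p_be, p_fd)            # 0.125  0.1  0.1666...

# Brute-force check of the Maxwell-Boltzmann value: enumerate the n**m ways of
# placing distinct particles and keep those filling boxes 0 and 1, one in each.
placements = list(product(range(n), repeat=m))
favorable = [p for p in placements if sorted(p) == [0, 1]]
print(len(favorable) / len(placements))    # 0.125 again
```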
NONCOUNTABLE OUTCOMES  We now consider experiments consisting of a noncountable number of outcomes. We assume, first, that S consists of all points on the real line
S = {−∞ < t < ∞}        (2-42)
This case arises often: system failure, telephone calls, birth and death, arrival times, and many others. The events of this experiment are all intervals {t1 ≤ t ≤ t2} and their unions and intersections (see "Fundamental Note" later in this discussion). The elementary events are of the form {ti}, where ti is any point on the t-axis, and their number is noncountable. Unlike the case of countable elements, S is not specified in terms of the probabilities of the elementary events. In fact, it is possible that P{ti} = 0 for every outcome ti even though S is the union of all elementary events. This is not in conflict with (2-20) because in that equation the events Ai are countably many.
We give next a set of events whose probabilities specify S completely. To facilitate the clarification of the underlying concept of density, we shall use the mass interpretation of probability. If the experiment S has a countable number of outcomes ζi, the probabilities pi = P{ζi} of the elementary events {ζi} can be viewed as point masses (Fig. 2.11). If S is noncountable as in (2-42) and P{ζi} = 0 for every ζi, the probability masses are distributed along the axis and can be specified in terms of the density function α(t) defined as follows. The mass in an interval (t1, t2) equals the probability of the event {t1 ≤ t ≤ t2}. Thus
P{t1 ≤ t ≤ t2} = ∫_{t1}^{t2} α(t) dt        (2-43)
The function α(t) can be interpreted as a limit involving probabilities. As we see from (2-43), if Δt is sufficiently small, P{t1 ≤ t ≤ t1 + Δt} ≈ α(t1)Δt, and in the limit
α(t1) = lim_{Δt→0} P{t1 ≤ t ≤ t1 + Δt} / Δt        (2-44)
We maintain that the function α(t) specifies the experiment S completely. Indeed, any event A of S is a set R of points that can be written as a countable union of nonoverlapping (disjoint) intervals (see "Fundamental Note"). From this and (2-20) it follows that P(A) equals the area under the curve α(t) in the region R. As we see from (2-43), the area of α(t) in any interval is nonnegative; hence, α(t) ≥ 0 for every t. Furthermore, its total area equals P(S) = 1. Hence, α(t) is such that
α(t) ≥ 0        ∫_{−∞}^{∞} α(t) dt = 1        (2-45)
Note  The function α(t) is related to but conceptually different from the density of a random variable, a concept to be developed in Chapter 4.
In (2-45) we assumed that S is the set of all points on the entire line. In many cases, however, the given set S is only a region R of the axis. In such cases, α(t) is specified only for t ∈ R, and its area in the region R equals 1. The following is an important special case.
[Figure 2.11: the probabilities pi viewed as point masses at the outcomes ζi]
We shall say that S is a set of random points in the interval (a, b) if it consists of all points in this interval and α(t) is constant. In this case [see (2-45)], α(t) = 1/(b − a) and
P{t1 ≤ t ≤ t2} = ∫_{t1}^{t2} α(t) dt = (t2 − t1) / (b − a)        (2-46)
Fundamental Note. In the definition of a probability space, we have assumed tacitly that all subsets of S are events. We can do so if S is countable, but if it is noncountable, we cannot assign probabilities to all its subsets consistent with axiom III. For this reason, the events of S are all subsets that can be expressed as countable unions and intersections of intervals. That this does not include all subsets of S is not easy to show. However, this is only of mathematical interest. For most applications, sets that are not countable unions or intersections of intervals are of no interest.
Empirical Interpretation of α(t). As we know, the probabilities of the events of a model can be evaluated empirically as relative frequencies. If, therefore, a model parameter can be expressed as the probability of an event, it can be so evaluated. The function α(t) is not a probability; however, its integral is [see (2-43)]. This fact leads to the following method for evaluating α(t).
To find α(t) for t = ti, we form the event
Ai = {ti ≤ t ≤ ti + Δt}
At a single trial, we observe an outcome t. If the observed t is in the interval (ti, ti + Δt), the event Ai occurs. Denoting by Δni the number of such occurrences at n trials, we conclude from (2-43) and (1-1) that
P(Ai) = ∫_{ti}^{ti+Δt} α(t) dt ≈ Δni / n        (2-47)
If Δt is sufficiently small, the integral equals α(ti)Δt; hence,
α(ti)Δt ≈ Δni / n        (2-48)
This expresses α(ti) in terms of the number Δni of observed outcomes in the interval (ti, ti + Δt).
Example 2.26
We denote by t the age of a person when he dies, ignoring the part of the population that lives more than 100 years. The outcomes of the resulting experiment are all points in the interval (0, 100). The experimental model is thus specified in terms of a function α(t) defined for every t in this interval. This function can be determined from (2-48) if sufficient data are available. We shall assume that
α(t) = 3 × 10^−9 t^2 (100 − t)^2        0 ≤ t < 100        (2-49)
(see Fig. 2.12). The probability that a person will die between the ages of 60 and 70 equals
P{60 ≤ t ≤ 70} = 3 × 10^−9 ∫_{60}^{70} t^2 (100 − t)^2 dt = .154
The probability that a person is alive at 60 equals
P{t > 60} = 3 × 10^−9 ∫_{60}^{100} t^2 (100 − t)^2 dt = .317
[Figure 2.12: the density α(t) of (2-49) on the interval (0, 100)]
Thus according to this model, 15.4% of the population dies between the ages of 60 and 70, and 31.7% is alive at age 60. •
Example 2.27
Consider a radioactive substance emitting particles at various times ti. We observe the emissions starting at t = 0, and we denote by t1 the time of emission of the first particle (Fig. 2.13). Find the probability p that t1 is less than a given time t0.
In this experiment, S is the axis t > 0. We shall assume that
α(t) = λe^(−λt)        t ≥ 0        (2-50)
From this and (2-43) it follows that
p = P{t1 < t0} = λ ∫_{0}^{t0} e^(−λt) dt = 1 − e^(−λt0) •
Points on the Plane  Experiments involving points on the plane or in space can be treated similarly. Suppose, for example, that the experimental outcomes are pairs of numbers (x, y) on the entire plane
S = {−∞ < x, y < ∞}        (2-51)
or in a certain subset of the plane. Events of this experiment are all rectangles and their countable unions and intersections. This includes all nonpathological plane regions D. To complete the specification of S, we must assign probabilities to these events. We can do so as in (2-43): We select a positive function α(x, y), and we assign to the event {(x, y) ∈ D} the probability
P{(x, y) ∈ D} = ∫∫_D α(x, y) dx dy        (2-52)
The function α(x, y) can be interpreted as a surface mass density.
[Figure 2.13: the emission times ti on the t-axis]
[Figure 2.14: the rectangle R and the trapezoidal region T]
Example 2.28
A point is selected at random from the rectangle R of Fig. 2.14. Find the probability p that it is taken from the trapezoidal region T.
The model of this experiment consists of all points in R. The assumption that the points are selected at random is equivalent to the model assumption that the probability density α(x, y) is constant. The area of R equals 24; hence, α(x, y) = 1/24. And since the area of T equals 9, we conclude from (2-52) that
p = (1/24) ∫∫_T dx dy = 9/24 •
2-3  Conditional Probability and Independence
Given an event M such that P(M) ≠ 0, we form the ratio P(A ∩ M)/P(M), where A is any event of S. This ratio is denoted by P(A|M) and is called the "conditional probability of A assuming M." Thus
P(A|M) = P(A ∩ M) / P(M)        (2-53)
The significance of this important concept will be appreciated in the course of our development.
Empirical Interpretation. We repeat the experiment n times, and we denote by nM and nA∩M the number of occurrences of the events M and A ∩ M, respectively. If n is large, then [see (1-1)]
P(M) ≈ nM / n        P(A ∩ M) ≈ nA∩M / n
Hence,
P(A ∩ M) / P(M) ≈ (nA∩M / n) / (nM / n) = nA∩M / nM
and (2-53) yields
P(A|M) ≈ nA∩M / nM        (2-54)
The event A ∩ M occurs iff M and A occur; hence, nA∩M is the number of occurrences of A in the subsequence of trials in which M occurs. This leads to the relative frequency interpretation of P(A|M): The conditional probability of A assuming M is nearly equal to the relative frequency of the occurrence of A in the subsequence of trials in which M occurs. This is true if not only n but also nM is large.
Example 2.29
Given a fair die, we shall determine the probability of 2 assuming even. This is the conditional probability of the elementary event A = {f2} assuming M = {even}. Clearly, A ⊂ M; hence, A ∩ M = A. Furthermore, P(A) = 1/6 and P(M) = 3/6; hence,
P(f2|even) = P(A ∩ M) / P(M) = (1/6) / (1/2) = 1/3
Thus the relative frequency of the occurrence of 2 in the subsequence of trials in which even shows equals 1/3. •
Example 2.30
In the mortality experiment (Example 2.26), we wish to find the probability that a person will die between the ages of 60 and 70, assuming that the person is alive at 60.
Our problem is to find P(A|M) where
A = {60 < t ≤ 70}        M = {t > 60}
As we have seen, P(A) = .154 and P(M) = .317. Since A ∩ M = A, we conclude that
P(A|M) = P(A) / P(M) = .154 / .317 = .486
Thus 15.4% of all people die between the ages of 60 and 70. However, 48.6% of the people who are alive at 60 die between the ages of 60 and 70. •
Example 2.31
A box contains 3 white balls and 2 red balls. We select 2 balls in succession. Find the probability p that the first ball is white and the second red.
The probability of the event W1 = {white first} equals 3/5. After the removal of the white ball, there remain 2 white and 2 red balls. Hence, the conditional probability of the event R2 = {red second} assuming W1 equals 2/4. And since W1 ∩ R2 is the event {white first, red second}, we conclude from (2-53) that
p = P(W1 ∩ R2) = P(R2|W1)P(W1) = (2/4) × (3/5) = 6/20
Next let us find a direct solution. The experiment has 20 outcomes, namely, the P_2^5 = 5 × 4 permutations
w1w2, w1w3, w1r1, . . . , r2w2, r2w3, r2r1
of the 5 objects w1, w2, w3, r1, r2 taken 2 at a time. The elementary events are equally likely, and their probability equals 1/20. The event {white first, red second} consists of the 6 outcomes
w1r1, w2r1, w3r1, w1r2, w2r2, w3r2
Hence, its probability equals 6/20. •
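The direct solution of Example 2.31 is a pure counting argument and can be reproduced by enumerating all ordered draws; the ball labels below are our own.

```python
# Example 2.31: 3 white and 2 red balls, two drawn in succession without replacement.
from itertools import permutations

balls = ["w1", "w2", "w3", "r1", "r2"]
draws = list(permutations(balls, 2))          # 5 * 4 = 20 equally likely outcomes

favorable = [d for d in draws if d[0][0] == "w" and d[1][0] == "r"]
print(len(favorable), "/", len(draws))        # 6 / 20
```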
In a number of applications, the available information (data), although sufficient to specify the model, does not lead directly to the determination of the probabilities of its events but is used to determine conditional probabilities. The next example is an illustration.
Example 2.32
We are given two boxes. Box 1 contains 2 red and 8 white cards; box 2 contains 9 red and 6 white cards. We select at random one of the boxes, and we pick at random one of its cards. Find the probability p that the selected card is red.
The outcomes of this experiment are the 25 cards contained in both boxes. We denote by B1 the event consisting of the 10 cards in box 1 and by B2 the event consisting of the 15 cards in box 2 (Fig. 2.15). From the assumption that a box is selected at random we conclude that
P(B1) = P(B2) = 1/2        (2-55)
The event R = {red} consists of 11 outcomes. Our problem is to find its probability. We cannot do so directly because the 25 outcomes of S are not equally likely. They are, however, conditionally equally likely, subject to the condition that a box is selected. As a model concept this means that
P(R|B1) = 2/10        P(R|B2) = 9/15        (2-56)
Equations (2-55) and (2-56) are not derived. They are assumptions about the model based on the hypothesis that the box and the card are selected at random. Using these assumptions, we shall derive P(R) deductively.
Since
P(R|B1) = P(R ∩ B1) / P(B1)        P(R|B2) = P(R ∩ B2) / P(B2)
it follows from (2-55) and (2-56) that
P(R ∩ B1) = (2/10) × (1/2) = 1/10        P(R ∩ B2) = (9/15) × (1/2) = 3/10
The events R ∩ B1 and R ∩ B2 are mutually exclusive (see Fig. 2.15), and their union equals the event R. Hence,
P(R) = P(R ∩ B1) + P(R ∩ B2) = 4/10
Thus if we pick at random a card from a randomly selected box, in 40% of the trials the selected card will be red. •
[Figure 2.15: the events B1 (2 red, 8 white cards) and B2 (9 red, 6 white cards) and the event R = {red}]
From (2-53) it follows that
P(A ∩ B) = P(A|B)P(B)        (2-57)
Repeated application of this yields
P(A ∩ B ∩ C) = P(A|B ∩ C)P(B ∩ C) = P(A|B ∩ C)P(B|C)P(C)
This is the chain rule for conditional probabilities and can be readily generalized (see Problem 2-14).
Fundamental Property  We shall now examine the properties of the numbers P(A|M) for a fixed M as A ranges over all the events of S. We maintain that these numbers are, indeed, probabilities; that is, they satisfy the axioms. To do so, we must prove the following:
I. P(A|M) ≥ 0        (2-58)
II. P(S|M) = 1        (2-59)
III. If A ∩ B = ∅, then
P(A ∪ B|M) = P(A|M) + P(B|M)        (2-60)
• Proof. Equation (2-58) follows readily from (2-17) because A ∩ M is an event. Equation (2-59) is a consequence of the fact that S ∩ M = M. To prove (2-60), we observe that if the sets A and B are disjoint, their subsets A ∩ M and B ∩ M are also disjoint (Fig. 2.16). And since (A ∪ B) ∩ M = (A ∩ M) ∪ (B ∩ M), we conclude from (2-19) that
P(A ∪ B|M) = P[(A ∪ B) ∩ M] / P(M) = P(A ∩ M)/P(M) + P(B ∩ M)/P(M)
and (2-60) results.
The foregoing shows that conditional probabilities can be used to create from a given experiment S a new experiment conditioned on an event M of S. This experiment has the same outcomes ζi and events Ai as the original experiment S, but its probabilities equal P(Ai|M). These probabilities specify a new experiment because, as we have just shown, they satisfy the axioms.
[Figure 2.16]
Note, finally, that P(A|M) can be given the following mass interpretation. In the Venn diagram (Fig. 2.16), the event A ∩ M is the part of A in M, and P(A|M) is the mass in that region normalized by the factor 1/P(M).
Total Probability and Bayes' Theorem
In Example 2.32, we expressed the probability P(R) of the event R in terms of the known conditional probabilities P(R|B1) and P(R|B2). The following is an important generalization.
Suppose that [A1, . . . , Am] is a partition of S consisting of m events, as shown in Fig. 2.17. We maintain that the probability P(B) of an arbitrary event B of S can be written as a sum:
P(B) = P(B|A1)P(A1) + · · · + P(B|Am)P(Am)        (2-61)
• Proof. Clearly [see (2-4)],
B = B ∩ S = B ∩ (A1 ∪ · · · ∪ Am) = (B ∩ A1) ∪ · · · ∪ (B ∩ Am)
But the events B ∩ Ai are mutually exclusive because the events Ai are mutually exclusive. Hence [see (2-20)],
P(B) = P(B ∩ A1) + · · · + P(B ∩ Am)        (2-62)
And since P(B ∩ Ai) = P(B|Ai)P(Ai), (2-61) follows.
This is called the theorem of total probability. It is used to evaluate the probability P(B) of an event B if its conditional probabilities P(B|Ai) are known.
Example 2.33
In a political poll, the following results are recorded: Among all voters, 70% are male and 30% are female. Among males, 40% are Republican and 60% Democratic. Among females, 45% are Republican and 55% are Democratic. Find the probability that a voter selected at random is Republican.
In this experiment, S = {rm, rf, dm, df}. We form the events M = {male}, F = {female}, and R = {Republican}. The results of the poll yield the following data:
P(M) = .70        P(F) = .30        P(R|M) = .40        P(R|F) = .45
Hence [see (2-61)],
P(R) = P(R|M)P(M) + P(R|F)P(F) = .415 •
[Figure 2.17: a partition A1, . . . , Am of S and an arbitrary event B]
• Bayes' Theorem. We show next that
P(Ai|B) = P(B|Ai)P(Ai) / [P(B|A1)P(A1) + · · · + P(B|Am)P(Am)]        (2-63)
This is an important result known as Bayes' theorem. It expresses the posterior probabilities P(Ai|B) of the events Ai in terms of their prior probabilities P(Ai). Its significance will become evident later.
• Proof. From (2-53) it follows that
P(Ai|B) = P(Ai ∩ B) / P(B)        P(B|Ai) = P(Ai ∩ B) / P(Ai)
Hence,
P(Ai|B) = P(B|Ai)P(Ai) / P(B)        (2-64)
Inserting (2-61) into (2-64), we obtain (2-63).
Example 2.34
We have two coins: coin A is fair with P{heads} = 1/2, and coin B is loaded with P{heads} = 2/3. We pick one of the coins at random, we toss it, and heads shows. Find the probability that we picked the fair coin.
This experiment has four outcomes:
S = {ah, at, bh, bt}
For example, ah is the outcome "the fair coin is tossed and heads shows." We form the events
A = {fair coin} = {ah, at}        B = {loaded coin} = {bh, bt}        H = {heads shows} = {ah, bh}
The assumption that the coin is picked at random yields P(A) = P(B) = 1/2. The probability of heads assuming that coin A is picked equals P(H|A) = 1/2. Similarly, P(H|B) = 2/3. This completes the specification of the model. Our problem is to find the probability P(A|H) that we picked the fair coin assuming that heads showed. To do so, we use (2-63):
P(A|H) = P(H|A)P(A) / [P(H|A)P(A) + P(H|B)P(B)] = (1/2 × 1/2) / (1/2 × 1/2 + 2/3 × 1/2) = 3/7 •
In the next example, pay particular attention to the distinction between
the determination of the model data based on the description of the experiment and the results that follow deductively from the axioms.
Example 2.35
We have four boxes. Box 1 contains 2,000 components, of which 1,900 are good and 100 are defective. Box 2 contains 500 components, of which 300 are good and 200 defective. Boxes 3 and 4 each contain 1,000 components, of which 900 are good and 100 defective. We select at random one of the boxes and pick at random a single component.
(a) Find the probability pa that the selected component is defective.
(b) The selected component is defective; find the probability pb that it came from box 2.
Model Specification  The space S of this experiment has 4,000 good (g) elements forming the event G and 500 defective (d) elements forming the event D. The elements in each box form the events
B1 = {1,900g, 100d}        B2 = {300g, 200d}        B3 = {900g, 100d}        B4 = {900g, 100d}
The information that the boxes are selected at random leads to the assumption that the four events have the same probability. Hence [see (2-39)],
P(B1) = P(B2) = P(B3) = P(B4) = 1/4        (2-65)
The random selection of a component from the ith box leads to the assumption that all elements in that box are conditionally equally likely. From this it follows that the conditional probability P(D|Bi) that a component taken from the ith box is defective equals the proportion of defective components in that box. This is the extension of (2-32) to conditional probabilities, and it yields
P(D|B1) = 100/2,000 = .05        P(D|B2) = 200/500 = .4
P(D|B3) = 100/1,000 = .1        P(D|B4) = 100/1,000 = .1        (2-66)
Deduction
(a) From (2-62) and the foregoing it follows that the probability P(D) that the selected component is defective equals
P(D) = .05 × 1/4 + .4 × 1/4 + .1 × 1/4 + .1 × 1/4 = .1625
(b) The probability pb that the defective component came from box 2 equals P(B2|D). Hence [see (2-64)],
P(B2|D) = P(D|B2)P(B2) / P(D) = (.4 × .25) / .1625 = .615
Thus the prior probability P(B2) of selecting box 2 equals .25; the posterior probability, assuming that the component is defective, equals .615. •
Empirical Interpretation. We perform the experiment n times. In 25% of the trials, box 2 is selected. If we consider only the nD trials in which the selected part is defective, then in 61.5% of such trials it came from box 2.
Example 2.36
In a large city, it is established that 0.5% of the population has contracted AIDS. The available tests give the correct diagnosis for 80% of healthy persons and for 98% of sick persons. A person is tested and found sick. Find the probability that the diagnosis is wrong, that is, that the person is actually healthy.
We introduce the events
A = {healthy}        B = {tested healthy}        C = {sick} = Ā        D = {tested sick} = B̄
The unknown probability is P(A|D). From the description of the problem it follows that
P(A) = .995        P(C) = .005        P(D|A) = .20        P(D|C) = .98
Hence [see (2-61)],
P(D) = P(D|A)P(A) + P(D|C)P(C) = .2039
This is the probability that a person selected at random will test sick. Inserting into (2-64), we conclude that
P(A|D) = P(D|A)P(A) / P(D) = .1990 / .2039 = .976 •
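The arithmetic of Example 2.36 is a direct application of (2-61) and (2-64); the short sketch below (the variable names are our own) reproduces the two numbers.

```python
# Example 2.36: probability that a person who tests sick is actually healthy.
p_healthy, p_sick = 0.995, 0.005
p_pos_given_healthy = 0.20     # the test is correct for 80% of healthy persons
p_pos_given_sick = 0.98        # the test is correct for 98% of sick persons

p_pos = p_pos_given_healthy * p_healthy + p_pos_given_sick * p_sick   # (2-61)
p_healthy_given_pos = p_pos_given_healthy * p_healthy / p_pos         # (2-64)

print(round(p_pos, 4))                 # 0.2039
print(round(p_healthy_given_pos, 3))   # 0.976
```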
Independent Events
Two events A and B are called statistically independent if the probability of their intersection equals the product of their probabilities:
P(A ∩ B) = P(A)P(B)        (2-67)
The word statistical will usually be omitted. As we see from (2-53), if the events A and B are independent, then
P(A|B) = P(A)        P(B|A) = P(B)        (2-68)
The concept of independence is fundamental in the theory of probability. However, this is not apparent from the definition. Why should a relationship of the form of (2-67) merit special consideration? The importance of the
concept will become apparent in the context of repeated trials and combined
experiments. Let us examine briefly independence in the context of relative
frequencies.
Empirical Interpretation. The probability P(A) of the event A equals the relative frequency nA/n of the occurrence of A in n trials. The conditional probability P(A|B) of A assuming B equals the relative frequency of the occurrence of A in the subsequence of nB trials in which B occurs [see (2-54)]. If the events A and B are independent, then P(A|B) = P(A); hence,
nA / n ≈ nA∩B / nB        (2-69)
Thus if the events A and B are independent, the relative frequency of the occurrence of A in a sequence of n trials equals its relative frequency in the subsequence of nB trials in which B occurs. This agrees with our heuristic understanding of independence.
Example 2.37
In this example we use the notion of independence to investigate the possible connection between smoking and lung cancer. We conduct a survey among the following four groups: cancer patients who are smokers (cs), cancer patients who are nonsmokers (cs̄), healthy smokers (c̄s), and healthy nonsmokers (c̄s̄). The results of the survey show that
P(cs) = p1        P(cs̄) = p2        P(c̄s) = p3        P(c̄s̄) = p4
We next form the events
C = {cancer patients} = {cs, cs̄}        S = {smokers} = {cs, c̄s}
C ∩ S = {cancer patients, smokers} = {cs}
Clearly,
P(C) = p1 + p2        P(S) = p1 + p3        P(C ∩ S) = p1
If P(C ∩ S) ≠ P(C)P(S), that is, if p1 ≠ (p1 + p2)(p1 + p3), the events {cancer patients} and {smokers} are statistically dependent.
Note that this reasoning does not lead to the conclusion that there is a causal relationship between lung cancer and smoking. Both factors might result from a common cause (work habits, for example) that has not been considered in the experimental model. •
Example 2.38
Two trains, X and Y, arrive at a station at random between 0:00 and 0:20 A.M. The times of their arrival are independent. Train X stops for 5 minutes, and train Y stops for 4 minutes.
(a) Find the probability p1 that train X arrives before train Y.
(b) Find the probability p2 that the trains meet.
(c) Assuming that the trains meet, find the probability p3 that train X arrived before train Y.
Model Specification  An outcome of this experiment is a pair of numbers (x, y), where x and y are the arrival times of train X and train Y, respectively. The resulting space S is the set of points in the square of Fig. 2.18a. The event
A = {X arrives in the interval (t1, t2)} = {t1 ≤ x ≤ t2}
is a vertical strip as shown. The assumption that x is a random number in the interval (0, 20) yields
P(A) = (t2 − t1) / 20
The event
B = {Y arrives in the interval (t3, t4)} = {t3 ≤ y ≤ t4}
is a horizontal strip, and its probability equals
P(B) = (t4 − t3) / 20
The event A ∩ B is a rectangle as shown, and its probability equals
P(A ∩ B) = P(A)P(B) = (t4 − t3)(t2 − t1) / (20 × 20)
This is the model form of the assumed independence of the arrival times. Thus the probability that (x, y) is in a rectangular set equals the area of the rectangle divided by 400. And since any event D can be expressed as a countable union of disjoint rectangles, we conclude that
P(D) = (area of D) / 400
[Figure 2.18: (a) the square 0 ≤ x, y ≤ 20 with the strips A and B; (b) the region {−5 ≤ x − y ≤ 4}]
This concludes the specification of S. We should stress again that the relationships are not derived; they are assumptions based on the description of the problem.
Deduction
(a) We wish to find the probability p1 of the event C = {X arrives before Y}. This event consists of all points in S such that x ≤ y. Thus C is a triangle, and its area equals 200. Hence,
p1 = P(C) = 200/400 = .500
(b) The trains meet iff x ≤ y + 4, because train Y stops for 4 minutes, and y ≤ x + 5, because train X stops for 5 minutes. Thus the trains meet iff the event D = {−5 ≤ x − y ≤ 4} occurs. This event is the region of Fig. 2.18b consisting of two trapezoids, and its area equals 159.5. Hence,
p2 = P(D) = 159.5/400 = .399
(c) The probability that X arrives before Y (i.e., that event C occurred) assuming that the trains met (i.e., that event D occurred) equals P(C|D). Clearly, C ∩ D is the trapezoidal strip {0 ≤ y − x ≤ 5}, and its area equals 87.5. Hence,
p3 = P(C|D) = P(C ∩ D) / P(D) = 87.5 / 159.5 = .549
This example demonstrates the value of the precise model specification in the solution of a probabilistic problem. •
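Because this model assigns to every event a probability proportional to its area, the answers of Example 2.38 can also be estimated by simulation. A minimal Monte Carlo sketch follows; the sample size and seed are arbitrary choices of ours.

```python
# Monte Carlo estimate of p1, p2, p3 in Example 2.38.
import random

random.seed(1)
trials = 200_000
meet = x_first = both = 0

for _ in range(trials):
    x = random.uniform(0, 20)          # arrival time of train X (stays 5 minutes)
    y = random.uniform(0, 20)          # arrival time of train Y (stays 4 minutes)
    m = (x <= y + 4) and (y <= x + 5)  # the trains meet
    meet += m
    x_first += x <= y
    both += m and x <= y

print(x_first / trials)     # p1, close to .500
print(meet / trials)        # p2, close to .399  (= 159.5 / 400)
print(both / meet)          # p3, close to .549  (= 87.5 / 159.5)
```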
Example 2.39
P(a)
Here we shall use the notion of independence to complete the specification of an
experiment formed by combining two unrelated experiments.
SEC.
2-3
CONDITIONAL PROBABILITY AND INDEPENDENCE
55
We are given two experiments. The first is a fair die specified by the model

S_1 = {f_1, ..., f_6}    P{f_i} = 1/6

and the second is a fair coin specified by the model

S_2 = {h, t}    P{h} = P{t} = 1/2

We perform both experiments, and we wish to find the probability p that 5 shows on the die and heads on the coin. If we make the reasonable assumption that the two experiments are independent, it would appear from (2-67) that we should conclude that

p = (1/6)(1/2) = 1/12    (2-70)

This conclusion is correct; it does not, however, follow from (2-67). In that equation, as in our entire development, we dealt only with subsets of a single space. To accept (2-70) not merely as a heuristic statement but as a conclusion that follows logically from the axioms in a single probabilistic model, we must construct a new experiment S in which {5} and {heads} are subsets. This is done as follows.

The space S of the new experiment consists of 12 outcomes, namely, all pairs of objects that we can form taking one element from S_1 and one from S_2:

S = {f_1h, ..., f_6h, f_1t, ..., f_6t}

Thus S = S_1 × S_2 is the Cartesian product of the sets S_1 and S_2 [see (2-5)]. In this experiment, {5} is an event A consisting of the two outcomes f_5h and f_5t; {heads} is an event B consisting of the six outcomes f_1h, ..., f_6h. Thus

A = {5} = {f_5h, f_5t}    B = {heads} = {f_1h, ..., f_6h}

To complete the specification of S, we must assign probabilities to its subsets. Since {5} and {heads} are events in the original experiments, we must set

P(A) = 1/6    P(B) = 1/2

In the experiment S, the event {5 on the die and heads on the coin} is the intersection A ∩ B = {f_5h} of the events A and B. To find the probability of this event, we use the independence of A and B. This yields P(A ∩ B) = 1/6 × 1/2 = 1/12, in agreement with (2-70). •
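The construction of the product space can be mirrored in a few lines of code. The sketch below (not from the text; the names `die` and `coin` are hypothetical) enumerates S_1 × S_2 and recovers (2-70).

    from itertools import product
    from fractions import Fraction

    # Build S = S1 x S2 and assign each pair the product of the elementary probabilities.
    die  = {f"f{i}": Fraction(1, 6) for i in range(1, 7)}
    coin = {"h": Fraction(1, 2), "t": Fraction(1, 2)}

    space = {(d, c): pd * pc for (d, pd), (c, pc) in product(die.items(), coin.items())}

    event = {("f5", "h")}                    # "5 on the die and heads on the coin"
    print(sum(space[o] for o in event))      # 1/12, in agreement with (2-70)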
We show next that if the events A and B are independent, the events Ā and B are also independent:

If P(A ∩ B) = P(A)P(B), then P(Ā ∩ B) = P(Ā)P(B)    (2-71)

• Proof. As we know,

A ∪ Ā = S    B = (A ∩ B) ∪ (Ā ∩ B)

Hence, P(A) + P(Ā) = 1 and P(B) = P(A ∩ B) + P(Ā ∩ B). From this it follows that

P(Ā ∩ B) = P(B) - P(A ∩ B) = P(B) - P(A)P(B) = (1 - P(A))P(B) = P(Ā)P(B)

and (2-71) results.

We can similarly show that the events Ā and B̄ are also independent.
[Figure 2.19]

Generalization  The events A_1, ..., A_n are called mutually statistically independent, or, simply, independent if the probability of the intersection of any number of them in a group equals the product of the probabilities of each event in that group. For example, if the events A_1, A_2, and A_3 are independent, then

P(A_1 ∩ A_2) = P(A_1)P(A_2)    P(A_1 ∩ A_3) = P(A_1)P(A_3)    P(A_2 ∩ A_3) = P(A_2)P(A_3)    (2-72)

P(A_1 ∩ A_2 ∩ A_3) = P(A_1)P(A_2)P(A_3)    (2-73)

Thus the events A_1, A_2, A_3 are independent if they are independent in pairs and their intersection satisfies (2-73). As the next example shows, three events might be independent in pairs but not mutually independent.
Example 2.40

The events A, B, and C are such that (Fig. 2.19)

P(A) = P(B) = P(C) = 1/5

P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/25    P(A ∩ B ∩ C) = 1/25

This shows that they are independent in pairs. However, they are not mutually independent because they do not satisfy (2-73). •
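A short check of Example 2.40 (an illustrative sketch, not part of the text) confirms that the stated probabilities satisfy the pairwise conditions (2-72) but violate (2-73).

    from fractions import Fraction
    from itertools import combinations

    P = {"A": Fraction(1, 5), "B": Fraction(1, 5), "C": Fraction(1, 5)}
    P_pair = Fraction(1, 25)          # P(A∩B) = P(A∩C) = P(B∩C)
    P_all  = Fraction(1, 25)          # P(A∩B∩C)

    for x, y in combinations(P, 2):
        print(x, y, P_pair == P[x] * P[y])        # True three times: pairwise independent
    print(P_all == P["A"] * P["B"] * P["C"])      # False: (2-73) fails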
Problems
2-1  Show that: (a) if A ∪ B = A ∩ B, then A = B; (b) (A ∪ B) ∩ C ⊂ A ∪ (B ∩ C); (c) if A_1 ⊂ A, B_1 ⊂ B, and A ∩ B = ∅, then A_1 ∩ B_1 = ∅.
2-2  If S = {-∞ < t < ∞}, A = {4 ≤ t ≤ 8}, and B = {7 ≤ t ≤ 10}, find A ∪ B, A ∩ B, Ā, and Ā ∩ B.
2-3  The set S consists of the 10 integers 1 to 10. Find the number of its subsets that contain the integers 1, 3, and 7.
2-4  (De Morgan's law) Using Venn diagrams, show that the complement of A ∪ B equals Ā ∩ B̄ and the complement of A ∩ B equals Ā ∪ B̄.
2-5  Express the following statements concerning the three events A, B, and C in terms of the occurrence of a single event: (a) none occurs; (b) only one occurs; (c) at least one occurs; (d) at most two occur; (e) two and only two occur; (f) A and B occur but C does not occur.
2-6  If P(A) = .6, P(B) = .3, and P(A ∩ B) = .2, find the probabilities of the events Ā ∪ B̄, Ā ∪ B, A ∩ B̄, and Ā ∩ B̄.
2-7  Show that: (a) P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C); (b) (Boole's inequality) P(A_1 ∪ ··· ∪ A_n) ≤ P(A_1) + ··· + P(A_n).
2-8  Show that: (a) if P(A) = P(B) = 1, then P(A ∩ B) = 1; (b) P²(A ∩ B) ≤ P(A)P(B) and P(A ∩ B) ≤ [P(A) + P(B)]/2.
2-9  Find the probability that a bridge hand consists only of cards from 2 to 10.
2-10  In a raffle, 100 tickets numbered 1 to 100 are sold. Seven are picked at random, and each wins a prize. (a) Find the probability that ticket number 27 wins a prize. (b) Find the probability that each of the numbers 5, 15, and 40 wins a prize.
2-11  A box contains 100 fuses, 10 of which are defective. We pick 20 at random and test them. Find the probability that two will be defective.
2-12  If P(A) = .6, P(A ∪ B) = .8, and P(A | B) = .5, find P(B).
2-13  Show that P(A) = P(A | M)P(M) + P(A | M̄)P(M̄).
2-14  Show that P(A ∩ B ∩ C ∩ D) = P(A | B ∩ C ∩ D) P(B | C ∩ D) P(C | D) P(D).
2-15  Show that: (a) if A ∩ B = ∅, then

P(A | A ∪ B) = \frac{P(A)}{P(A) + P(B)}    and    P(B | Ā) = \frac{P(B)}{1 - P(A)}

(b) P(Ā | M) = 1 - P(A | M).
2-16  We receive 100 bulbs of type A and 200 bulbs of type B. The probability that a bulb will last more than three months is .6 if it is type A and .8 if it is type B. The bulbs are mixed, and one is picked at random. Find the probability p_1 that it will last more than three months; if it does, find the probability p_2 that it is type A.
2-17  The duration t of a telephone call is an element of the space S = {t ≥ 0} specified in terms of the function α(t) = (1/c)e^{-t/c}, c = 5 minutes, as in (2-43). Find the probability of the events A = {0 ≤ t ≤ 10} and B = {t ≥ 5}; find the conditional probability P(A | B).
2-18  The events A and B are independent, and B ⊂ A. Find P(A).
2-19  Can two events be independent and mutually exclusive?
2-20  Two houses A and B are far apart. The probability that in the next decade A will burn down equals 10⁻³ and that B will burn down equals 2 × 10⁻³. Find the probability p_1 that at least one and the probability p_2 that both will burn down.
2-21  Show that if the events A, B, and C are independent, then (a) the events Ā and B̄ are independent; (b) the events A, B, and C̄ are independent.
2-22  Show that 11 equations are needed to establish the independence of four events; generalize to n events.
2-23 Show that four events are independent iff they are independent in pairs and
each is independent of the intersection of any of the others.
2-24  A string of Christmas lights consists of 50 independent bulbs connected in series; that is, the lights are on if all bulbs are good. The probability that a bulb is defective equals .01. Find the probability p that the lights are on.
3
Repeated Trials
Repeated trials have two interpretations. The first is empirical: We equate the probability P(A) of an event A defined on an experiment S to the ratio n_A/n, where n_A is the number of successes of A in n repetitions of the underlying physical experiment. The second is conceptual: We form a new experiment S_n = S × ··· × S, the elements of which are sequences ζ_1 ··· ζ_n, where ζ_i is any one of the elements of S. In the first interpretation n is large; in the second, n is arbitrary. In this chapter, we use the second interpretation of repeated trials and determine the probabilities of various events in the experiment S_n.
3-1
Dual Meaning of Repeated Trials
In Chapter 1, we used the notion of repeated trials to establish the relationship between a theoretical model and a real experiment. This was based on the approximation P(A) ≈ n_A/n relating the model parameter P(A) to the observed ratio n_A/n. In this chapter, we give an entirely different interpretation to the notion of repeated trials. To be concrete, we start with the coin experiment.

REPEATED TOSSES OF A COIN  The experiment of the single toss of a coin is specified in terms of the space S = {h, t} and the probabilities of its elementary events

P{h} = p    P{t} = q    (3-1)

Suppose that we wish to determine the probabilities of various events involving n tosses of the coin, for example, the probability that in 10 tosses, 7 heads will show. To do so, we must form a new model S_n, the outcomes of which are sequences of the form

ζ_1 ··· ζ_i ··· ζ_n    (3-2)

where ζ_i is h or t. The space S_n so formed is written in the form

S_n = S × ··· × S    (3-3)

and is called a Cartesian product. This is a reminder of the fact that the elements of S_n are sequences as in (3-2), where ζ_i is one of the elements of S. Clearly, there are 2^n such sequences; hence, S_n has 2^n elements.

The experiment S_n cannot be specified in terms of S alone. Additional information concerning the multiple tosses must be known. We shall presently show that if the tosses are independent, the model S_n is completely specified in terms of S. Independence in the context of a real coin means that the outcome of a particular toss is not affected by the outcomes of the preceding tosses. This is in general a reasonable assumption. In the context of a theoretical model, independence will be interpreted in the sense of (2-67). As preparation, we discuss first the special case n = 3.
Example 3.1

A coin tossed n = 3 times generates the space S_3 = S × S × S consisting of the 2³ = 8 outcomes

hhh, hht, hth, htt, thh, tht, tth, ttt

We introduce the events

H_i = {heads at the ith toss}    T_i = {tails at the ith toss}    (3-4)

and we assign to these events the probabilities

P(H_i) = p    P(T_i) = q    (3-5)

This is consistent with (3-1). Using (3-5) and the independence of the tosses, we shall determine the probabilities of the elementary events of S_3. Each of the events H_i and T_i consists of four outcomes. For example,

H_1 = {hhh, hht, hth, htt}    T_1 = {thh, tht, tth, ttt}

The elementary event {hhh} can be written as the intersection of the events H_1, H_2, H_3; hence,

P{hhh} = P(H_1 ∩ H_2 ∩ H_3)

From the independence of the tosses and (2-73) it follows that

P(H_1 ∩ H_2 ∩ H_3) = P(H_1)P(H_2)P(H_3) = p³

hence, the probability of the event {hhh} equals p³. Proceeding similarly, we can determine the probabilities of all elementary events of S_3. The result is

P{hhh} = p³     P{thh} = p²q
P{hht} = p²q    P{tht} = pq²
P{hth} = p²q    P{tth} = pq²
P{htt} = pq²    P{ttt} = q³

Thus the probability of an elementary event equals p^k q^{3-k}, where k is the number of heads. This completes the specification of S_3. To find the probability of any event in S_3, we add the probabilities of its elementary events as in (2-30). For example, the event A = {two heads in any order} consists of the three outcomes hht, hth, thh; hence,

P(A) = P{hht} + P{hth} + P{thh} = 3p²q    (3-6) •
A coin tossed n times generates the space S_n consisting of 2^n outcomes. Its elementary events are of the form {k heads in a specific order}. We shall show that

P{k heads in a specific order} = p^k q^{n-k}    (3-7)

We introduce the events H_i and T_i as in (3-4) and assign to them the probabilities p and q as in (3-5). To prove (3-7), it suffices to express the elementary event as an intersection of the events H_i and T_i. Suppose, to be concrete, that the outcome is {hth ... h}. In this case,

{hth ... h} = H_1 ∩ T_2 ∩ H_3 ∩ ··· ∩ H_n

where the right side contains k events of the form H_i and n - k events of the form T_i. From the independence of the tosses and (2-73), it follows that the probability of the right side equals p^k q^{n-k} as in (3-7).
Example 3.2

Find the probability p_a that the first 10 tosses of a fair coin will show heads and the probability p_b that the first 9 will be heads and the next one tails.

In the experiment S_10, the events

A = {10 heads in a row}    B = {9 heads, then tails}

are elementary. With p = q = 1/2 and n = 10, (3-7) yields

P(A) = 1/2^{10}    P(B) = 1/2^{10}

Thus, contrary to a common impression, the events A and B are equally rare. •

Note  Equation (3-7) seems heuristically obvious: We have k heads and n - k tails; the probability for heads equals p and for tails q; the tosses are independent; hence, (3-7) must be true. However, although heuristics lead in this case to a correct conclusion, it is essential that we phrase the problem in terms of a single model and interpret independence in terms of events satisfying (2-67).

We shall now show that the probability that we get k heads (and n - k tails) in any order equals

P{k heads in any order} = \binom{n}{k} p^k q^{n-k}    (3-8)
• Proof. The event {k heads in any order} consists of all outcomes formed with k heads and n - k tails. The number of such outcomes equals the number of ways that we can place k heads and n - k tails on a line. As we have shown in Section 2-1, this equals the number of combinations

C_k^n = \binom{n}{k}

of n objects taken k at a time [see (2-10)]. Multiplying by the probability p^k q^{n-k} of each elementary event, we obtain (3-8).
For n = 3 and k = 2, (3-8) yields

P{2 heads in any order} = \binom{3}{2} p²q = 3p²q

in agreement with (3-6).
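For readers who want to experiment with (3-8), the following sketch (not part of the text; the function name is arbitrary) evaluates the binomial probabilities p_n(k) directly.

    from math import comb

    # The binomial probability (3-8): P{k heads in any order} = C(n, k) p^k q^(n-k).
    def p_n(n, k, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    print(p_n(3, 2, 0.5))     # 3 p^2 q = .375 for a fair coin, as in (3-6)
    print(p_n(10, 5, 0.5))    # 252/1024 ≈ .246, Example 3.3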
Example 3.3

A fair coin is tossed 10 times. Find the probability that heads will show 5 times.

In this problem,

p = q = 1/2    k = 5    n = 10

and (3-8) yields

P{5 heads in any order} = \binom{10}{5} × \frac{1}{2^5} × \frac{1}{2^5} = \frac{252}{1,024} •

Example 3.4
We have two coins as in Example 2.34. Coin A is fair, and coin B is loaded with P{h} = 2/3. We pick one of the coins at random, toss it 10 times, and observe that heads shows 4 times. Find the probability that we picked the fair coin.

The space of this experiment is a Cartesian product S_2 × S_10, where S_2 = {a, b} is the selection of a coin and S_10 is the toss of a coin 10 times. Thus this experiment has 2 × 2^{10} outcomes. We introduce the events

A = {coin A tossed 10 times}    B = {coin B tossed 10 times}    H = {4 heads in any order}

Our problem is to find the conditional probability P(A | H).

From the randomness of the coin selection, it follows that P(A) = P(B) = 1/2. If coin A is selected, the probability P(H | A) that 4 heads will show is given by (3-8) with p = q = 1/2. Thus

P(H | A) = \binom{10}{4} × \left(\frac{1}{2}\right)^4 × \left(\frac{1}{2}\right)^6    P(H | B) = \binom{10}{4} × \left(\frac{2}{3}\right)^4 × \left(\frac{1}{3}\right)^6

Inserting into Bayes' theorem (2-63), we obtain

P(A | H) = \frac{P(H | A)P(A)}{P(H | A)P(A) + P(H | B)P(B)} = .783 •
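The posterior probability .783 can be reproduced numerically. The sketch below (not part of the text) applies Bayes' theorem with the two likelihoods given by (3-8).

    from math import comb

    def binom(n, k, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    prior_A = prior_B = 0.5
    like_A = binom(10, 4, 1/2)      # fair coin
    like_B = binom(10, 4, 2/3)      # loaded coin
    posterior_A = like_A * prior_A / (like_A * prior_A + like_B * prior_B)
    print(posterior_A)              # ≈ .783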
Example 3.5

A coin with P{h} = p is tossed n times. Find the probability p_k that at the first n - 1 tosses, heads shows k - 1 times, and at the nth toss, heads shows.

First Solution  In n tosses, the probability of k heads in a specific order equals p^k q^{n-k}. There are \binom{n-1}{k-1} ways of obtaining k - 1 heads at the first n - 1 trials and heads at the nth trial; hence,

p_k = \binom{n-1}{k-1} p^k q^{n-k}    (3-9)

Second Solution  The probability of k - 1 heads in n - 1 trials equals

\binom{n-1}{k-1} p^{k-1} q^{n-k}    (3-10)

The probability of heads at the nth trial equals p. Multiplying (3-10) by p (independent tosses), we obtain (3-9).

For k = 1, equation (3-9) yields

p_1 = p q^{n-1}    (3-11)

This is the probability that heads will show at the nth toss but not before. •

Note  Example 3.5 can be given a different interpretation: If we toss the coin an infinite number of times, the probability that heads will show at the nth toss but not before equals pq^{n-1}. In this interpretation, the underlying experiment is the space S_∞ of the infinitely many tosses.
Probability Tree  In Fig. 3.1 we give a graphical representation of repeated trials. The two horizontal segments under the letter S_1 represent the two elements h and t of the experiment S_1 = S. The probabilities of {h} and {t} are shown at the end of each segment. The four horizontal segments under the letter S_2 represent the four elements of the space S_2 = S × S of the toss of a coin twice. The probabilities of the corresponding elementary events are shown at the end of each segment. Proceeding similarly, we can form a tree representing the experiments S_1, ..., S_n for any n.

[Figure 3.1: probability tree; S_1 = {h, t}, S_2 = {hh, ht, th, tt} with probabilities p², pq, pq, q²]

Dual Meaning of Repeated Trials  The concept of repeated trials has two fundamentally different interpretations. The first is empirical, the second conceptual. We shall explain the difference in the context of the coin experiment.

In the first interpretation, the experimental model is the toss of a coin once. The space S consists of the two outcomes h and t. A trial is a single toss of the coin. The experiment is completely specified in terms of the probabilities P{h} = p and P{t} = q of its elementary events {h} and {t}. Repeated trials are thus used to determine p and q empirically: We toss the real coin n times, and we set p ≈ n_h/n, where n_h is the observed number of heads. This is the empirical version of repeated trials. The approximation p ≈ n_h/n is based on the assumption that n is sufficiently large.

In the second interpretation, the experimental model is the toss of a coin n times, where n is any number. The space S_n is now the Cartesian product S_n = S × ··· × S consisting of 2^n outcomes of the form hth ... h. A single trial is the toss of the coin n times. This is the conceptual interpretation of repeated trials. In this interpretation, all statements are exact, and they hold for any n, large or small. If we wish to give a relative frequency interpretation to the probabilities in the space S_n, we must repeat the experiment of the n tosses of the coin a large number of times and apply (1-1).
3-2
Bernoulli Trials
Using the coin experiment as an illustration, we have shown that if S = {ζ_1, ζ_2} is an experiment consisting of the two outcomes ζ_1, ζ_2 and it is repeated n times, the probability that ζ_1 will show k times in a specific order equals p^k q^{n-k}, and the probability that it will show k times in any order equals

p_n(k) = \binom{n}{k} p^k q^{n-k}    p = P{ζ_1}

The following is an important generalization.

Suppose that S is an experiment consisting of the elements ζ_i. Repeating S n times, we obtain a new experiment

S_n = S × S × ··· × S

(Cartesian product). The outcomes of this experiment are sequences of the form

ζ_1 ζ_2 ··· ζ_n    (3-12)

where ζ_i is any one of the elements of S.

Consider an event A of S with P(A) = p. Clearly, the complement Ā is also an event with P(Ā) = 1 - p = q, and [A, Ā] is a partition of S. The ith element ζ_i of the sequence (3-12) is an element of either A or Ā. Each sequence of the form (3-12) thus generates a sequence of A's and Ā's

B = A Ā A ··· Ā    (3-13)

where we place A at the ith position if A occurs at the ith trial, that is, if ζ_i ∈ A; otherwise, we place Ā. All sequences of the form (3-13) are events of the experiment S_n.

Clearly, B is an event in the space S_n; we shall determine its probability under the assumption that the repeated trials of S are independent. From this assumption it follows as in (3-7) that

P(A Ā A ··· Ā) = P(A)P(Ā)P(A) ··· P(Ā) = p q p ··· q    (3-14)

If in the sequence B the event A appears k times, then the event Ā appears n - k times and the right side of (3-14) equals p^k q^{n-k}. Thus

P{A occurs k times in a specific order} = p^k q^{n-k}    (3-15)

We shall next determine the probability of the event

D = {A occurs k times in any order}

• Fundamental Theorem. In n independent trials, the probability

p_n(k) = P{A occurs k times in any order}

that the event A will occur k times (and the event Ā  n - k times) in any order equals

p_n(k) = \binom{n}{k} p^k q^{n-k}    (3-16)

• Proof. There are C_k^n = \binom{n}{k} events of the form {A occurs k times in a specific order}, namely, the ways in which we can place A  k times and Ā  n - k times in a row [see (2-10)]. Furthermore, all these events are mutually exclusive, and their union equals the event {A occurs k times in any order}. Hence, (3-16) follows from (3-15).
Example 3.6

A fair die is rolled seven times. Find the probability p_7(2) that 4 will show twice.

The original experiment S is the single roll of the die, and A = {f_4}. Thus

P(A) = 1/6    P(Ā) = 5/6

With n = 7 and k = 2, (3-16) yields

p_7(2) = \frac{7!}{2!5!} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^5 = .234 •

Example 3.7

A pair of fair dice is rolled four times. Find the probability p_4(0) that 11 will not show.

In this case, S is the single roll of two dice and A = {f_5 f_6, f_6 f_5}. Thus

P(A) = 2/36    P(Ā) = 34/36

With n = 4 and k = 0, (3-16) yields

p_4(0) = \left(\frac{34}{36}\right)^4 = .796 •
We discuss next a number of problems that can be interpreted as repeated trials.
Example 3.8

Twenty persons arrive in a store between 9:00 and 10:00 A.M. The arrival times are random and independent. Find the probability p_a that four of these persons arrive between 9:00 and 9:10.

This can be phrased as a problem in repeated trials. The original experiment S is the random arrival of one person, and the repeated trials are the arrivals of the 20 persons. Clearly,

A = {a specific person arrives between 9:00 and 9:10}

is an event in S, and its probability equals

P(A) = 10/60 = 1/6

[see (2-46)] because the arrival time is random. In the experiment S_20 of the 20 arrivals, the event {four persons arrive between 9:00 and 9:10} is the same as the event {A occurs four times}. Hence,

p_a = p_20(4) = \frac{20!}{4!16!} \left(\frac{1}{6}\right)^4 \left(\frac{5}{6}\right)^{16} = .20 •
Example 3.9

In a lottery, 2,000 persons take part, each selecting at random a number between 1 and 1,000. The winning number is 253 (how this number is selected is immaterial). (a) Find the probability p_a that no one will win. (b) Find the probability p_b that two persons will win.

(a) Interpreting this as a problem in repeated trials, we consider as the original experiment S the random selection of a number N between 1 and 1,000. The space S has 1,000 outcomes, and the event A = {N = 253} is an elementary event with probability

P(A) = P{N = 253} = .001

The selection of 2,000 numbers is the repetition of S 2,000 times. It follows, therefore, from (3-16) with k = 0 and p = .001 that

p_a = (.999)^{2,000} ≈ e^{-2} = .135

(b) If two persons win, A occurs twice. Hence,

p_b = \binom{2,000}{2} (.001)^2 (.999)^{1,998} ≈ .27 •

Example 3.10
A box contains K white and N - K black cards.

(a) With replacement. We pick a card at random, examine it, and put it back. We repeat this process n times. Find the probability p_a that k of the picked cards are white and n - k are black.

Since the picked card is put back, the conditions of the experiment remain the same at each selection. We can therefore apply the results of repeated trials. The original experiment is the selection of a single card. The probability that the selected card is white equals K/N. With

p = \frac{K}{N}    q = \frac{N - K}{N}

the probability that k out of the n selections are white equals

p_a = \binom{n}{k} \left(\frac{K}{N}\right)^k \left(1 - \frac{K}{N}\right)^{n-k}    (3-17)

(b) Without replacement. We again pick a card from the box, but this time we do not replace it. We repeat this process n times. Find the probability p_b that k of the selected cards are white.

This time we cannot use (3-16) because the conditions of the experiment change after each selection. To solve the problem, we proceed directly. Since we are interested only in the total number of white cards, the order in which they are selected is immaterial. We can consider, therefore, the n selections as a single outcome of our experiment. In this experiment, the possible outcomes are the \binom{N}{n} ways of selecting n out of N objects. Furthermore, there are \binom{K}{k} ways of selecting k out of the K white cards, and \binom{N-K}{n-k} ways of selecting n - k out of the N - K black cards. Hence [see also (2-41)], p_b is given by the hypergeometric series

p_b = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}}    (3-18)

Note that if k << K and n << N, then after each drawing, the number of white and black cards in the box remains essentially constant; hence, p_a ≈ p_b. •
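To see how close the two sampling models are when n << N, the sketch below (not from the text; the parameter values are illustrative) evaluates (3-17) and (3-18) side by side.

    from math import comb

    def binomial(N, K, n, k):
        p = K / N
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def hypergeometric(N, K, n, k):
        return comb(K, k) * comb(N - K, n - k) / comb(N, n)

    N, K, n, k = 1000, 100, 20, 2
    print(binomial(N, K, n, k))         # with replacement, (3-17)
    print(hypergeometric(N, K, n, k))   # without replacement, (3-18); close when n << N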
We determine next the probability P{k_1 ≤ k ≤ k_2} that the number k of successes of an event A in n trials is between k_1 and k_2.

• Theorem

P{k_1 ≤ k ≤ k_2} = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k}    (3-19)

where p = P(A).

• Proof. Clearly, {k_1 ≤ k ≤ k_2} is the union of the events

B_k = {A occurs k times in any order}

These events are mutually exclusive, and P(B_k) = p_n(k) as in (3-16). Hence, (3-19) follows from axiom III on page 33.

Example 3.11

An order of 1,000 parts is received. The probability that a part is defective equals .1. Find the probability p_a that the total number of defective parts does not exceed 110.

In the context of repeated trials, S is the arrival of a single part, and A = {the part is defective} is an event of S with P(A) = .1. The arrival of 1,000 parts generates the space S_n of repeated trials, and p_a is the probability that the number of times A occurs is between 0 and 110. With

p = .1    k_1 = 0    k_2 = 110    n = 1,000

(3-19) yields

p_a = \sum_{k=0}^{110} \binom{1,000}{k} (.1)^k (.9)^{1,000-k} •

Example 3.12

We receive a lot of N mass-produced objects. Of these, K are defective. We select at random n of these objects and inspect them. Based on the inspection results, we decide whether to accept or reject the lot. We use the following acceptance test.

Simple Sampling  Suppose that among the n sampled objects, k are defective. We choose a number k_0, depending on the particular application, and we accept the lot if k ≤ k_0. If k > k_0, the lot is rejected. Show that the probability that a lot so tested is accepted equals

\sum_{k=0}^{k_0} \frac{\binom{pN}{k} \binom{N - pN}{n - k}}{\binom{N}{n}}    p = \frac{K}{N}    (3-20)

• Proof. The lot is accepted if k ≤ k_0. It suffices therefore to find the probability p_k that among the n sampled components, k will be defective. As we have shown in Example 3.10, p_k is given by the hypergeometric series (3-18). Summing for k from 0 to k_0, we obtain (3-20). •
We shall now examine the behavior of the numbers

p_n(k) = \binom{n}{k} p^k q^{n-k}    \binom{n}{k} = \frac{n!}{k!(n-k)!}

for fixed n, as k increases from 0 to n (see also Problem 3-8).

[Figure 3.2: p_n(k) for p = .5, n = 20 and for p = .3, n = 20]

If p = 1/2, then p_n(k) is proportional to the binomial coefficients \binom{n}{k}:

p_n(k) = \binom{n}{k} \frac{1}{2^n}

In this case, p_n(k) is symmetrical about the midpoint n/2 of the interval (0, n). If n is even, it has a single maximum for k = k_m = n/2; if n is odd, it has two maxima,

k = k_m = \frac{n-1}{2}    and    k = k_m' = \frac{n+1}{2}

If p ≠ 1/2, then p_n(k) is not symmetrical; its maximum is reached for k ≈ np. Precisely, if (n + 1)p is not an integer, then p_n(k) has a single maximum for k = k_m = [(n + 1)p].* If (n + 1)p is an integer, then p_n(k) has two maxima:

k = k_m = (n + 1)p - 1    and    k = k_m' = (n + 1)p

These results are illustrated in Fig. 3.2 for the following cases:

1. n = 20, p = .5:  k_m = np = 10
2. n = 11, p = .5:  k_m = (n - 1)p = 5,  k_m' = (n + 1)p = 6
3. n = 20, p = .3:  (n + 1)p = 6.3,  k_m = [6.3] = 6
4. n = 9, p = .4:  (n + 1)p = 4,  k_m = 3,  k_m' = 4

The significance of the curves shown is explained in Section 3-3.

* [x] means "the largest integer smaller than x."
3-3
Asymptotic Theorems

In repeated trials, we are faced with the problem of evaluating the probabilities

p_n(k) = \frac{n(n-1) \cdots (n-k+1)}{1 \cdot 2 \cdots k} p^k q^{n-k}    (3-21)

and their sum (3-19). For large n, this is a complicated task. In the following, we give simple approximations.

The Normal Curves

We introduce the function

g(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}    (3-22)

and its integral

G(x) = \int_{-\infty}^{x} g(y)\,dy    (3-23)

These functions are called (standard) normal or Gaussian curves. As we show later, they are used extensively in the theory of probability and statistics.

Clearly,

g(-x) = g(x)

Furthermore (see the appendix to this chapter),

\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 1    (3-24)

From this and the evenness of g(x) it follows that

G(∞) = 1    G(0) = 1/2    G(-x) = 1 - G(x)    (3-25)

In Fig. 3.3 we show the functions g(x) and G(x), and in Table 1a (all tables are at the back of the book) we tabulate G(x) for 0 ≤ x ≤ 3. For x > 3 we can use the approximation

G(x) ≈ 1 - \frac{1}{x} g(x)    (3-26)

De Moivre-Laplace Theorem

It can be shown that for large n, p_n(k) can be approximated by the samples of the normal curve g(x), properly scaled and shifted:

p_n(k) ≈ \frac{1}{\sigma} g\left(\frac{k - \eta}{\sigma}\right) = \frac{1}{\sqrt{2\pi npq}} e^{-(k-np)^2/2npq}    (3-27)
where

η = np    σ = \sqrt{npq}    (3-28)

[Figure 3.3: the standard normal curves g(x) and G(x)]

In the next chapter, we give another interpretation to the function g(x) (density) and to the constants η (mean) and σ (standard deviation).

The result in (3-27) is known as the De Moivre-Laplace theorem. We shall not give its rather difficult proof here; we shall only comment on the range of values of n and k for which the approximation is satisfactory. The standard normal curve takes significant values only for |x| < 3; for |x| > 3, it is negligible. The scaled and shifted version of the interval (-3, 3) is the interval D = (η - 3σ, η + 3σ). The rule-of-thumb condition for the validity of (3-27) is as follows: If the interval D is in the interior of the interval (0, n), (3-27) is a satisfactory approximation for every k in the interval D. In other words, (3-27) can be used if

0 < np - 3\sqrt{npq} < k < np + 3\sqrt{npq} < n    (3-29)

Note, finally, that the approximation is best if p_n(k) is nearly symmetrical, that is, if p is close to .5, and it deteriorates if p is close to 0 or to 1.

We list the exact values of p_n(k) and its approximate values obtained from (3-27) for n = 8 and p = .5. As we see, even for such moderate values of n, the approximation error is small.

k          0     1     2     3     4     5     6     7     8
p_8(k)    .004  .031  .109  .219  .273  .219  .109  .031  .004
approx.   .005  .030  .104  .220  .282  .220  .104  .030  .005
Example 3.13

A fair coin is tossed 100 times. Find the probability that heads will show k = 50 and k = 45 times.

In this case,

n = 100    p = q = .5    np = 50    npq = 25

Condition (3-29) yields 0 < 50 - 15 < k < 50 + 15 < 100; hence, the approximation

p_n(k) ≈ \frac{1}{5\sqrt{2\pi}} e^{-(k-50)^2/50}

is satisfactory provided that k is between 35 and 65. Thus

P{k = 50} ≈ \frac{1}{5\sqrt{2\pi}} = .08    P{k = 45} ≈ \frac{1}{5\sqrt{2\pi}} e^{-1/2} = .048 •
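The quality of the De Moivre-Laplace approximation can be checked directly. The sketch below (not part of the text) compares (3-27) with the exact binomial probabilities for the coin of Example 3.13.

    from math import comb, exp, pi, sqrt

    n, p = 100, 0.5
    q = 1 - p
    eta, sigma = n * p, sqrt(n * p * q)

    def exact(k):
        return comb(n, k) * p**k * q**(n - k)

    def de_moivre_laplace(k):
        return exp(-(k - eta)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

    for k in (50, 45):
        print(k, exact(k), de_moivre_laplace(k))   # ≈ .0796 vs .0798, and .0485 vs .0484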
As we have seen, the probability that the number k of successes of an event A in n trials is between k_1 and k_2 equals

P{k_1 ≤ k ≤ k_2} = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k}

Using the De Moivre-Laplace theorem (3-27), we shall give an approximate expression for this sum in terms of the normal curve G(x) defined in (3-23).

• Theorem

\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} ≈ G\left(\frac{k_2 - np}{\sqrt{npq}}\right) - G\left(\frac{k_1 - np}{\sqrt{npq}}\right)    (3-30)

provided that k_1 or k_2 satisfies (3-29) and

σ = \sqrt{npq} >> 1    (3-31)

• Proof. From (3-27) it follows that

\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} ≈ \frac{1}{\sigma\sqrt{2\pi}} \sum_{k=k_1}^{k_2} e^{-(k-\eta)^2/2\sigma^2}

The right side is the sum of the k_2 - k_1 + 1 samples of the function

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\eta)^2/2\sigma^2} = \frac{1}{\sigma} g\left(\frac{x-\eta}{\sigma}\right)

[Figure 3.4: (a) the samples of f(x) at k = k_1, ..., k_2; (b) the inscribed staircase with limits k_1 - 0.5 and k_2 + 0.5]

for x = k_1, k_1 + 1, ..., k_2. Since σ >> 1 by assumption, the function f(x) is nearly constant in an interval (k, k + 1) of unit length (Fig. 3.4a); hence, its area in that interval is nearly f(k). Thus

\frac{1}{\sigma\sqrt{2\pi}} \sum_{k=k_1}^{k_2} e^{-(k-\eta)^2/2\sigma^2} ≈ \frac{1}{\sigma\sqrt{2\pi}} \int_{k_1}^{k_2} e^{-(x-\eta)^2/2\sigma^2}\,dx    (3-32)

With the transformation

y = \frac{x-\eta}{\sigma}    dx = \sigma\,dy

the right side of (3-32) equals [see (3-23)]

\frac{1}{\sqrt{2\pi}} \int_{(k_1-\eta)/\sigma}^{(k_2-\eta)/\sigma} e^{-y^2/2}\,dy = G\left(\frac{k_2-\eta}{\sigma}\right) - G\left(\frac{k_1-\eta}{\sigma}\right)    (3-33)

and (3-30) results.
The approximation (3-27) on which (3-30) is based holds only if k is in the interval (η - 3σ, η + 3σ). However, outside this interval, the numbers p_n(k) and the corresponding values of the exponential in (3-27) are negligible compared to the terms inside the interval. Hence, (3-30) holds so long as the sum in (3-19) contains terms in the interval (η - 3σ, η + 3σ).

Note in particular that if k_1 = 0, then

G\left(\frac{k_1 - \eta}{\sigma}\right) = G\left(-\frac{np}{\sqrt{npq}}\right) < G(-3) ≈ .001 ≈ 0

because [see (3-29)] np > 3\sqrt{npq}; hence,

\sum_{k=0}^{k_2} \binom{n}{k} p^k q^{n-k} ≈ G\left(\frac{k_2 - np}{\sqrt{npq}}\right)    (3-34)

provided that k_2 is between η - 3σ and η + 3σ.

Example 3.14

The probability that a voter in an election is a Republican equals .4. Find the probability that among 1,000 voters, the number k of Republicans is between 370 and 430.

This is a problem in repeated trials where A = {r} is an event in the experiment S and P(A) = p = .4. With

n = 1,000    k_1 = 370    k_2 = 430    np = 400    npq = 240

(3-30) yields

P{370 ≤ k ≤ 430} ≈ G\left(\frac{30}{\sqrt{240}}\right) - G\left(-\frac{30}{\sqrt{240}}\right) = .951 •

Example 3.15

We receive an order of 10,000 parts. The probability that a part is defective equals .1. Find the probability that the number k of defective parts does not exceed 1,100.

With

n = 10,000    k_2 = 1,100    p = .1    np = 1,000    npq = 900

(3-34) yields

P{k ≤ 1,100} ≈ G\left(\frac{100}{\sqrt{900}}\right) = .999 •

Correction  In (3-32), we approximated the sum of the k_2 - k_1 + 1 terms on the left by the area of f(x) in k_2 - k_1 intervals of length 1. If k_2 - k_1 >> 1, the additional term on the left that is ignored is negligible, and the approximation (3-30) is satisfactory. For moderate values of k_2 - k_1, however, a better approximation results if the integration limits k_1 and k_2 on the right of (3-32) are replaced by k_1 - 1/2 and k_2 + 1/2. This yields the improved approximation

\sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} ≈ G\left(\frac{k_2 + 0.5 - np}{\sqrt{npq}}\right) - G\left(\frac{k_1 - 0.5 - np}{\sqrt{npq}}\right)    (3-35)

of (3-30), obtained by replacing the normal curve g(x) by the inscribed staircase function of Fig. 3.4b.
THE LAW OF LARGE NUMBERS  According to the empirical interpretation of the probability p of an event A, we should expect with near certainty that the number k of successes of A in n trials be close to np, provided that n is large enough. Using (3-30), we shall give a probabilistic interpretation of this expectation in the context of the model S_n of repeated trials.

As we have shown, the most likely value of k is, as expected, np. However, not only is the probability [see (3-27)]

P{k = np} ≈ \frac{1}{\sqrt{2\pi npq}}    (3-36)

that k equals np not close to 1, but it tends to zero as n → ∞. What is almost certain is not that k equals np but that the ratio k/n is arbitrarily close to p for n large enough. This is the essence of the law of large numbers. A precise formulation follows.

• Theorem. For any positive ε,

P\left\{p - ε ≤ \frac{k}{n} ≤ p + ε\right\} > .997    (3-37)

provided that n > 9pq/ε².

• Proof. With k_1 = (p - ε)n, k_2 = (p + ε)n, (3-30) yields

P{(p - ε)n ≤ k ≤ (p + ε)n} ≈ G\left[\frac{(p + ε)n - np}{\sqrt{npq}}\right] - G\left[\frac{(p - ε)n - np}{\sqrt{npq}}\right]

Hence,

P\left\{p - ε ≤ \frac{k}{n} ≤ p + ε\right\} ≈ G\left(ε\sqrt{\frac{n}{pq}}\right) - G\left(-ε\sqrt{\frac{n}{pq}}\right) = 2G\left(ε\sqrt{\frac{n}{pq}}\right) - 1    (3-38)

If n > 9pq/ε², then ε\sqrt{n/pq} > 3 and

2G\left(ε\sqrt{\frac{n}{pq}}\right) - 1 > 2G(3) - 1 ≈ 2 × .9987 - 1 ≈ .997

and (3-37) results.

Note, finally, that

G\left(ε\sqrt{\frac{n}{pq}}\right) → G(∞) = 1    as n → ∞

Since this is true for any ε, we conclude with near certainty that k/n tends to p as n → ∞. This is, in a sense, the theoretical justification of the empirical interpretation (1-1) of probability.
Example 3.16

(a) A fair coin is tossed 900 times. Find the probability that the ratio k/n is between .49 and .51.

In this problem, n = 900, ε = .01, ε\sqrt{n/pq} = .6, and (3-38) yields

P{.49 ≤ k/n ≤ .51} ≈ 2G(.6) - 1 = .4515

(b) Find n such that the probability that k/n is between .49 and .51 is .95.

In this case, (3-38) yields

P{.49 ≤ k/n ≤ .51} ≈ 2G(.02\sqrt{n}) - 1 = .95

Thus n is the solution of the equation G(.02\sqrt{n}) = .975. From Table 1 we see that

G(1.95) = .9744 < .975 < G(2.00) = .97725

Using linear interpolation, we conclude that .02\sqrt{n} ≈ 1.96; hence, n ≈ 9,600. •
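The bound (3-38) is easy to evaluate numerically. The following sketch (not part of the text) reproduces both parts of Example 3.16; G(x) is computed from the error function.

    from math import erf, sqrt

    def G(x):
        return 0.5 * (1 + erf(x / sqrt(2)))

    p = q = 0.5
    eps = 0.01

    n = 900
    print(2 * G(eps * sqrt(n / (p * q))) - 1)      # ≈ .4515, part (a)

    # part (b): smallest n with 2G(eps*sqrt(n/pq)) - 1 >= .95
    n = 1
    while 2 * G(eps * sqrt(n / (p * q))) - 1 < 0.95:
        n += 1
    print(n)                                       # 9,604, i.e., about 9,600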
Generalized Bernoulli Trials

In Section 3-2 we determined the probability p_n(k) that an event A of an experiment S will occur k times in n independent repetitions of S. Clearly, if A occurs k times, its complement occurs n - k times. The fundamental theorem (3-16) can thus be phrased as follows:

[Figure 3.5: (a) the partition [A, Ā]; (b) a partition A_1, ..., A_r of S; (c) two mutually exclusive events A_1 and A_2]

The events A_1 = A, A_2 = Ā form a partition of S (Fig. 3.5a), and their probabilities equal p_1 = p and p_2 = 1 - p, respectively. In the space S_n of repeated trials, the probability of the event {A_1 occurs k_1 = k times and A_2 occurs k_2 = n - k times} equals

p_n(k) = p_n(k_1, k_2) = \frac{n!}{k_1! k_2!} p_1^{k_1} p_2^{k_2}    (3-39)

Our purpose now is to extend this result to arbitrary partitions.

We are given a partition A = [A_1, ..., A_r] of the space S, consisting of the r events A_i (Fig. 3.5b). The probabilities of these events are r numbers p_i = P(A_i) such that

p_1 + ··· + p_r = 1

Repeating the experiment S n times, we obtain the experiment S_n of repeated trials. A single outcome of S_n is a sequence

ζ_1 ζ_2 ··· ζ_n

where ζ_j is an element of S. Since A is a partition, the element ζ_j belongs to one and only one of the events A_i. Thus at a particular trial, only one of the events of A occurs. Denoting by k_i the number of occurrences of A_i in n trials, we conclude that

k_1 + ··· + k_r = n

At the jth trial, the probability that the event A_i occurs equals p_i. Furthermore, the trials are independent by assumption. Introducing the event

B = {A_i occurs k_i times in a specific order}

we obtain

P(B) = p_1^{k_1} ··· p_r^{k_r}    (3-40)

This is the extension of (3-15) to arbitrary partitions.

We shall next determine the probability p_n(k_1, ..., k_r) of the event

D = {A_i occurs k_i times in any order}

• Theorem

p_n(k_1, ..., k_r) = \frac{n!}{k_1! \cdots k_r!} p_1^{k_1} \cdots p_r^{k_r}    (3-41)
• Proof. For a specific set of numbers k_1, ..., k_r, all events of S_n of the form B have the same probability. Furthermore, they are mutually exclusive, and their union is the event D. To find P(D) it suffices to find the total number N of such events. Clearly, N equals the number C_{k_1, ..., k_r}^n of combinations of n objects grouped in r classes, with k_i objects in the ith class. As we have shown in (2-14), the number of such combinations equals

C_{k_1, ..., k_r}^n = \frac{n!}{k_1! \cdots k_r!}

Multiplying by P(B), we obtain (3-41).

• Corollary. We are given two mutually exclusive events A_1 and A_2 (Fig. 3.5c) with p_1 = P(A_1) and p_2 = P(A_2). We wish to find the probability p_n that in n trials, the event A_1 occurs k_1 times and the event A_2 occurs k_2 times.

To solve this problem, we introduce the event

A_3 = \overline{A_1 ∪ A_2}

This event occurs k_3 = n - (k_1 + k_2) times, and its probability equals p_3 = 1 - p_1 - p_2. Furthermore, the events A_1, A_2, A_3 form a partition of S. Hence,

p_n = p_n(k_1, k_2, k_3) = \frac{n!}{k_1! k_2! k_3!} p_1^{k_1} p_2^{k_2} p_3^{k_3}    (3-42)
Example 3.17

A fair die is rolled 10 times. Find the probability p_n that f_1 shows 3 times and even shows 5 times.

In this case, A_1 = {f_1}, A_2 = {even}, and

n = 10    k_1 = 3    k_2 = 5    p_1 = \frac{1}{6}    p_2 = \frac{3}{6}

Clearly, the events A_1 and A_2 are mutually exclusive. We can therefore apply (3-42) with

k_3 = 2    p_3 = 1 - \frac{1}{6} - \frac{3}{6} = \frac{2}{6}

This yields

p_{10}(3, 5, 2) = \frac{10!}{3!5!2!} \left(\frac{1}{6}\right)^3 \left(\frac{3}{6}\right)^5 \left(\frac{2}{6}\right)^2 = .04 •
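Formula (3-42) can be evaluated mechanically. The sketch below (not in the text; the helper name `multinomial` is arbitrary) reproduces the probability of Example 3.17.

    from math import factorial

    def multinomial(n, ks, ps):
        coef = factorial(n)
        for k in ks:
            coef //= factorial(k)
        prob = coef
        for k, p in zip(ks, ps):
            prob *= p**k
        return prob

    print(multinomial(10, [3, 5, 2], [1/6, 3/6, 2/6]))   # ≈ .04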
3-4
Rare Events and Poisson Points
We shall now examine the asymptotic behavior of the probabilities

p_n(k) = \frac{n(n-1) \cdots (n-k+1)}{k!} p^k q^{n-k}

under the assumption that the event A is rare, that is, that p << 1. If n is so large that np >> 1, we can use the De Moivre-Laplace approximation (3-27). If, however, n >> 1 but np is of the order of 1 (for example, if n = 1,000 and p = .002), then the approximation no longer holds. In such cases, the following important result can be used.

• Poisson Theorem. If p << 1 and n >> 1, then, for k of the order of np,

\binom{n}{k} p^k q^{n-k} ≈ e^{-np} \frac{(np)^k}{k!}    (3-43)

• Proof. The probabilities p_n(k) take significant values for k near np. Since p << 1, we can use the approximations np << n, k << n:

n(n-1) \cdots (n-k+1) ≈ n^k    q = 1 - p ≈ e^{-p}    q^{n-k} ≈ q^n ≈ e^{-np}

This yields

\frac{n(n-1) \cdots (n-k+1)}{k!} p^k q^{n-k} ≈ \frac{n^k}{k!} p^k e^{-np}    (3-44)

and (3-43) results.

We note that (3-43) is only an approximation. The formal theorem can be phrased as a limit:

\binom{n}{k} p^k q^{n-k} → e^{-a} \frac{a^k}{k!}    (3-45)

as n → ∞, p → 0, and np → a. The proof is based on a refinement of the approximation (3-44).
Example 3.18

An order of 2,000 parts is received. The probability that a part is defective equals 10⁻³. Find the probability p_a that no component is defective and the probability p_b that there are at most 3 defective components.

With n = 2,000, p = 10⁻³, np = 2, (3-43) yields

p_a = P{k = 0} = q^n = (1 - 10^{-3})^{2,000} ≈ e^{-2} = .135

p_b = P{k ≤ 3} = \sum_{k=0}^{3} \binom{n}{k} p^k q^{n-k} ≈ \sum_{k=0}^{3} e^{-2} \frac{2^k}{k!} = e^{-2}\left(1 + 2 + \frac{2^2}{2!} + \frac{2^3}{3!}\right) = .857 •
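To see how good the approximation (3-43) is for these parameters, the sketch below (not part of the text) compares the exact binomial probabilities with the Poisson values for Example 3.18.

    from math import comb, exp, factorial

    n, p = 2000, 1e-3
    a = n * p                                        # a = np = 2

    def binom_pmf(k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k):
        return exp(-a) * a**k / factorial(k)

    print(binom_pmf(0), poisson_pmf(0))              # ≈ .135 for both
    print(sum(binom_pmf(k) for k in range(4)),
          sum(poisson_pmf(k) for k in range(4)))     # ≈ .857 for both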
Generalization  Consider the r events A_i of the partition A of Fig. 3.5. We shall examine the asymptotic behavior of the probabilities p_n(k_1, ..., k_r) in (3-41) under the assumption that the first r - 1 of these events are rare, that is, that

p_1 << 1, ..., p_{r-1} << 1

Proceeding as in (3-44), we obtain the approximation

p_n(k_1, ..., k_r) = \frac{n!}{k_1! \cdots k_r!} p_1^{k_1} \cdots p_r^{k_r} ≈ e^{-a_1} \frac{a_1^{k_1}}{k_1!} \cdots e^{-a_{r-1}} \frac{a_{r-1}^{k_{r-1}}}{k_{r-1}!}    (3-46)

where a_1 = np_1, ..., a_{r-1} = np_{r-1}.

This approximation holds for k_i of the order of np_i, and it is an equality in the limit as n → ∞.

[Figure 3.6: random points in the interval (-T/2, T/2); the nonoverlapping intervals (t_1, t_2) and (t_3, t_4)]

Poisson Points

The Poisson approximation is of particular importance in problems involving random points in time or space. This includes radioactive emissions, telephone calls, and traffic accidents. In these and other fields, the points are generated by a model S_n = S × ··· × S of repeated trials where S is the experiment of the random selection of a single point. In the following, we discuss the resulting model as n → ∞, starting with the single-point experiment S.

We are given an interval (-T/2, T/2) (Fig. 3.6), and we place a point in this interval at random. The probability that the point is in the interval (t_1, t_2) of length t_a = t_2 - t_1 equals t_a/T [see (2-46)]. We thus have an experiment S with outcomes all points in the interval (-T/2, T/2).

The outcomes of the experiment S_n of the repeated trials of S are n points in the interval (-T/2, T/2). We shall find the probability p_n(k) that k of these points are in the interval (t_1, t_2). Clearly, p_n(k) is the probability of the event {k points in t_a}. This event occurs if the event {t_1 ≤ t ≤ t_2} of the experiment S occurs k times. Hence,

P{k points in t_a} = \binom{n}{k} \left(\frac{t_a}{T}\right)^k \left(1 - \frac{t_a}{T}\right)^{n-k}    (3-47)

We shall now assume that n >> 1 and T >> t_a. In this case, {t_1 ≤ t ≤ t_2} is a rare event, and (3-44) yields

P{k points in t_a} ≈ e^{-n t_a / T} \frac{(n t_a / T)^k}{k!}    (3-48)

This, we repeat, holds only if t_a << T and k << n. Note that this probability depends not on n and T separately but only on their ratio

λ = \frac{n}{T}    (3-49)

This ratio will be called the density of the points. As we see from (3-49), λ equals the average number of points per unit of time.

Next we increase n and T, keeping the ratio n/T constant. The limit of S_n as n → ∞ will be denoted by S_∞ and will be called the experiment of Poisson points with density λ. Clearly, a single outcome of S_n is a set of n points in the interval (-T/2, T/2); hence, a single outcome of S_∞ is a set of infinitely many points on the entire t axis.
In the experiment S_∞, the probability that k points will be in an interval (t_1, t_2) of length t_a equals

P{k points in t_a} = e^{-λ t_a} \frac{(λ t_a)^k}{k!}    k = 0, 1, ...    (3-50)

This follows from (3-48) because the right side depends only on the ratio n/T.

Nonoverlapping Intervals  Consider the nonoverlapping intervals (t_1, t_2) and (t_3, t_4) of Fig. 3.6 with lengths t_a = t_2 - t_1 and t_b = t_4 - t_3. In the experiment S of placing a single point in the interval (-T/2, T/2), the events A_1 = {the point is in t_a}, A_2 = {the point is in t_b}, and A_3 = {the point is outside the intervals t_a and t_b} form a partition, and their probabilities equal

P(A_1) = \frac{t_a}{T}    P(A_2) = \frac{t_b}{T}    P(A_3) = 1 - \frac{t_a}{T} - \frac{t_b}{T}

respectively. In the space S_n of placing n points in the interval (-T/2, T/2), the event {k_a in t_a, k_b in t_b} occurs iff A_1 occurs k_1 = k_a times, A_2 occurs k_2 = k_b times, and A_3 occurs k_3 = n - k_a - k_b times. Hence [see (3-46)],

P{k_a in t_a, k_b in t_b} = \frac{n!}{k_a! k_b! k_3!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 - \frac{t_a}{T} - \frac{t_b}{T}\right)^{k_3}    (3-51)

From this and (3-46) it follows that if

n → ∞    T → ∞    \frac{n}{T} = λ    (3-52)

then

P{k_a in t_a, k_b in t_b} = e^{-λ t_a} \frac{(λ t_a)^{k_a}}{k_a!} \, e^{-λ t_b} \frac{(λ t_b)^{k_b}}{k_b!}    (3-53)

Note from (3-51) that if T is finite, the events {k_a in t_a} and {k_b in t_b} are not independent; however, for T → ∞, these events are independent because then [see (3-53) and (3-50)]

P{k_a in t_a, k_b in t_b} = P{k_a in t_a} P{k_b in t_b}    (3-54)

Summary  Starting from an experiment involving n points randomly placed in a finite interval, we constructed a model S_∞ consisting of infinitely many points on the entire axis, with the following properties:

1. The number of points in an interval of length t_a is an event {k_a in t_a}, the probability of which is given by (3-50).
2. If two intervals t_a and t_b are nonoverlapping, the events {k_a in t_a} and {k_b in t_b} are independent.

These two properties and the parameter λ specify completely the model S_∞ of Poisson points.
Appendix
Area under the Normal Curve

We shall show that

I = \int_{-\infty}^{\infty} e^{-a x^2}\,dx = \sqrt{\frac{\pi}{a}}    a > 0    (3A-1)

• Proof. We set

I = \int_{-\infty}^{\infty} e^{-a x^2}\,dx

This yields

I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-a(x^2 + y^2)}\,dx\,dy

In the differential ring ΔR of Fig. 3.7, the integrand equals e^{-a r^2} because x² + y² = r². From this it follows that the integral in ΔR equals e^{-a r^2} times the area 2πr dr of the ring. Integrating the resulting product for r from 0 to ∞ and setting r² = z, we obtain

I^2 = \int_0^{\infty} 2\pi r e^{-a r^2}\,dr = \pi \int_0^{\infty} e^{-a z}\,dz = \frac{\pi}{a}

and (3A-1) results.

[Figure 3.7: the differential ring ΔR of radius r and width dr]
Problems

3-1  A fair coin is tossed 8 times. (a) Find the probability p_1 that heads shows once. (b) Find the probability p_2 that heads shows at the sixth toss but not earlier.
3-2  A fair coin is tossed 10 times. Find the probability p that at the eighth toss but not earlier heads shows for the second time.
3-3  A fair coin is tossed four times. Find the number of outcomes of the space S_4 and of the event A = {heads shows three times in any order}; find P(A).
3-4  Two fair dice are rolled 10 times. Find the probability p_1 that 7 will show twice; find the probability p_2 that 11 will show once.
3-5  Two fair dice are rolled three times. Find the number of outcomes of the space S_3 and of the events A = {7 shows at the second roll} and B = {11 does not show}; find P(A) and P(B).
3-6  A string of Christmas lights consists of 20 bulbs connected in series. The probability that a bulb is defective equals .01. (a) Find the probability p_1 that the string of lights works. (b) If it does not work, we replace one bulb at a time with a good bulb until the string works; find the probability p_2 that the string works when the fifth bulb is replaced. (c) Find the probability p_3 that we need at most five replacements until the string works.
3-7  (a) A shipment consists of 100 units, and the probability that a unit is defective is .2. We select 12 units; find the probability p_1 that 3 of them are defective. (b) A shipment consists of 80 good and 20 defective components. We select at random 12 units; find the probability p_2 that 3 of them are defective.
3-8  Compute the probabilities p_n(k) = \binom{n}{k} p^k q^{n-k} for: (a) n = 20, p = .3, k = 0, 1, ..., 20; (b) n = 9, p = .4, k = 0, 1, ..., 9. Show that the ratio r(k) = p_n(k-1)/p_n(k) of two consecutive values of p_n(k) equals

r(k) = \frac{kq}{(n - k + 1)p}

and that r(k) increases as k increases from 0 to n. Using this, find the values of k for which p_n(k) is maximum.
3-9  Using the approximation (3-27), find p_n(k) for n = 900, p = .4, and k = 340, 350, 360, and 370.
3-10  The probability that a salesman completes a sale is .2. In one day, he sees 100 customers. Find the probability p that he will complete 16 sales.
3-11  Compute the two sides of (3-18): (a) for n = 20, p = .4, and k from 5 to 11; (b) for n = 20, p = .2, and k from 1 to 7.
3-12  A fair coin is tossed n times, and heads shows k times. Find the smallest number n such that P{.49 ≤ k/n ≤ .51} > .95.
3-13  A fair die is rolled 720 times, and 5 shows k times. Find the probability that k is a number in the interval (96, 144).
3-14  The probability that a passenger smokes is .3. A plane has 189 passengers, of whom k are smokers. Find the probability p that 45 ≤ k ≤ 67: (a) using the approximation (3-30); (b) using the approximation (3-35).
3-15  Of all highway accidents, 52% are minor, 30% are serious, and 18% are fatal. In one day, 10 accidents are reported. Using (3-42), find the probability p that two of the reported accidents are serious and one is fatal.
3-16  The probability that a child is handicapped is 10⁻³. Find the probability p_1 that in a school of 1,800 children, one is handicapped, and the probability p_2 that more than two children are handicapped: (a) exactly; (b) using the approximation (3-43).
3-17  We receive a 12-digit number by teletype. The probability that a digit is printed wrong equals 10⁻³. Find the probability p that one of the digits is wrong: (a) exactly; (b) using the Poisson approximation.
3-18  The probability that a driver has an accident in a month is .01. Find the probability p_1 that in one year he will have one accident and the probability p_2 that he will have at least one accident: (a) exactly; (b) using (3-43).
3-19  Particles emitted from a radioactive substance form a Poisson set of points with λ = 1.7 per second. Find the probability p that in 2 seconds, fewer than five particles will be emitted.
4
The Random Variable
A random variable is a function x(ζ) with domain the set S of experimental outcomes ζ and with range a set of numbers. Thus x(ζ) is a model concept, and all its properties follow from the properties of the experiment S. A function y = g(x) of the random variable x is a composite function y(ζ) = g(x(ζ)) with domain the set S. Some authors define random variables as functions with domain the real line. The resulting theory is consistent and operationally equivalent to ours. We feel strongly, however, that it is conceptually preferable to interpret x as a function defined on an abstract space S, even if S is not explicitly used. This approach avoids the use of infinite-dimensional spaces, and it leads to a unified theory.
4-1
Introduction
We have dealt so far with experimental outcomes, events, and probabilities.
The outcomes are various objects that can be identified somehow, for example, "heads," "red," "the queen of spades." We have also considered
experiments the outcomes of which are numbers, for example, "time of a
call," "IQ of a child"; however, in the study of events and probabilities, the
numerical character of such outcomes is only a way of identifying them. In this chapter we introduce a new concept. We assign to each outcome ζ of an experiment S a number x(ζ). This number could be the gain or loss in a game of chance, the size of a product, the voltage of a random source, or any other quantity of interest. We thus establish a relationship between the elements ζ_i of the set S and various numbers x(ζ_i). In other words, we form a function with domain the set S of abstract objects ζ_i and with range a set of numbers. Such a function will be called a random variable.
Example 4.1

The die experiment has six outcomes. To the outcome f_i we assign the number 10i. We have thus formed a function x such that x(f_i) = 10i. In the same experiment, we form another function y such that y(f_i) = 0 if i is odd and y(f_i) = 1 if i is even. In the following table, we list the functions so constructed.

          f_1   f_2   f_3   f_4   f_5   f_6
x(f_i)    10    20    30    40    50    60
y(f_i)     0     1     0     1     0     1

The domain of both functions is the set S. The range of x consists of the six numbers 10, ..., 60. The range of y consists of the two numbers 0 and 1. •
To clarify the concept of a random variable, we review briefly the notion of a function.
Meaning of a Function  As we know, a function x = x(t) is a rule of correspondence between the values of t and x. The independent variable t takes numerical values forming a set S_t on the t-axis called the domain of the function. To every t in S_t we assign, according to some rule, a number x(t) to the dependent variable x. The values of x form a set S_x on the x-axis called the range of the function. Thus a function is a mapping of the set S_t onto the set S_x. The rule of correspondence between t and x could be a table, a curve, or a formula, for example, x(t) = t².

The notation x(t) used to represent a function has two meanings. It means the particular number x(t) corresponding to a specific t; it also means the function x(t), namely, the entire mapping of the set S_t on the set S_x. To avoid this ambiguity, we shall denote the mapping by x, leaving its dependence on t understood.

Generalization  The definition of a function can be phrased as follows: We are given two sets of numbers S_t and S_x. To every t ∈ S_t we assign a number x(t) belonging to the set S_x. This leads to the following generalization.

We are given two sets of objects S_α and S_β consisting of the arbitrary elements α and β, respectively:

α ∈ S_α    β ∈ S_β

We say that β is a function of α if to every element α of the set S_α we make correspond one element β of the set S_β. The set S_α is called the domain of the function and the set S_β its range. Suppose that S_α is the set of all children in a community and S_β the set of their mothers. The pairing of a child with its mother is a function. We note that to a given α there corresponds a single β. However, more than one element of the set S_α might be paired with the same β (a child has only one mother, but a mother might have more than one child). Thus the number N_β of elements of the set S_β is equal to or smaller than the number N_α of the elements of the set S_α. If the correspondence is one-to-one, then N_α = N_β.
The Random Variable

A random variable (RV) represents a process of assigning to every outcome ζ of an experiment S a number x(ζ). Thus an RV x is a function with domain the set S of experimental outcomes and range a set of numbers (Fig. 4.1). All RVs will be written in boldface letters. The notation x(ζ) will indicate the number assigned to the specific outcome ζ, and the notation x will indicate the entire function, that is, the rule of correspondence between the elements ζ of S and the numbers x(ζ) assigned to these elements. In Example 4.1, x indicates the table pairing the six faces of the die with the six numbers 10, ..., 60. The domain of this function is the set S = {f_1, ..., f_6}, and its range is the set {10, ..., 60}. The expression x(f_2) is the number 20. In the same example, y indicates the correspondence between the six faces f_i and the two numbers 0 and 1. The range of y is therefore the set {0, 1}. The expression y(f_2) is the number 1 (Fig. 4.2).
[Figure 4.1: the RV x as a function from S to the x-axis; the events A = {x ≤ x} and B = {x_1 ≤ x ≤ x_2}]

[Figure 4.2: the RVs x and y of Example 4.1; x has range 10, 20, ..., 60 and y has range 0, 1]

Events Generated by RVs  In the study of RVs, questions of the following form arise: What is the probability that the RV x is less than a given number
x? What is the probability that x is between the numbers x_1 and x_2? We might, for example, wish to find the probability that the height x of a person selected at random will not exceed certain bounds. As we know, probabilities are assigned only to events; to answer such questions, we must therefore express the various conditions imposed on x as events.

We start with the determination of the probability that the RV x does not exceed a specific number x. To do so, we introduce the notation

A = {x ≤ x}

This notation specifies an event A consisting of all outcomes ζ such that x(ζ) ≤ x. We emphasize that {x ≤ x} is not a set of numbers; it is a set A of experimental outcomes (Fig. 4.1). The probability P(A) of this set is the probability that the RV x does not exceed the number x.

The notation

B = {x_1 ≤ x ≤ x_2}

specifies an event B consisting of all outcomes ζ such that the corresponding values x(ζ) of the RV x are between the numbers x_1 and x_2.

Finally,

C = {x = x_0}

is an event consisting of all outcomes ζ such that the value x(ζ) of x equals the number x_0.
Example 4.2

We shall illustrate with the RVs x and y of Example 4.1. The set {x ≤ 35} consists of the elements f_1, f_2, and f_3 because x(f_i) ≤ 35 only if i = 1, 2, or 3. The set {x ≤ 5} is empty because there is no outcome such that x(f_i) ≤ 5. The set {20 ≤ x ≤ 35} consists of the outcomes f_2 and f_3 because 20 ≤ x(f_i) ≤ 35 only if i = 2 or 3. The set {x = 40} consists of the element f_4 because x(f_i) = 40 only if i = 4. Finally, {x = 35} is the empty set because there is no outcome such that x(f_i) = 35.

Similarly, {y < 0} is the empty set because there is no outcome such that y(f_i) < 0. The set {y < 1} consists of the outcomes f_1, f_3, and f_5 because y(f_i) < 1 for i = 1, 3, or 5. Finally, {y ≤ 1} is the certain event because y(f_i) ≤ 1 for every f_i. •
In the definition of an RV, the numbers x(ζ) assigned to the outcomes can be finite or infinite. We shall assume, however, that the set of outcomes ζ such that x(ζ) = ±∞ has zero probability:

P{x = ∞} = 0    P{x = -∞} = 0    (4-1)

With this mild restriction, the definition of an RV is complete.
4-2
The Distribution Function

It appears from the definition of an RV that to determine the probability that x takes values in a set I of the x-axis, we must first determine the event {x ∈ I} consisting of all outcomes ζ such that x(ζ) is in the set I. To do so, we need to know the underlying experiment S. However, as we show next, this is not necessary. To find P{x ∈ I} it suffices to know the distribution of the RV x. This is a function F_x(x) of x defined as follows.

Given a number x, we form the event A_x = {x ≤ x}. This event depends on the number x; hence, its probability is a function of x. This function is denoted by F_x(x) and is called the cumulative distribution function (c.d.f.) of the RV x. For simplicity, we shall call it the distribution function or just the distribution of x.

• Definition. The distribution of the RV x is the function

F_x(x) = P{x ≤ x}    (4-2)

defined for every x from -∞ to ∞.

In the notation F_x(x), the subscript x identifies the RV x and the independent variable x specifies the event {x ≤ x}. The variable x could be replaced by any other variable. Thus F_x(w) equals the probability of the event {x ≤ w}. The distributions of the RVs x, y, and z are denoted by F_x(x), F_y(y), and F_z(z), respectively. If, however, there is no fear of ambiguity, the subscripts will be omitted and all distributions will be identified by their independent variables. In this notation, the distributions of the RVs x, y, and z will be F(x), F(y), and F(z), respectively. Several illustrations follow.
Example 4.3
(a) In the fair-die experiment, we define the RV x such that x(fi) = 10i as in
Example 4.1. We shall determine its distribution Fx(x) for every x from -∞
to ∞. We start with specific values of x:

Fx(200) = P{x ≤ 200} = P(𝒮) = 1
Fx(45) = P{x ≤ 45} = P{f1, f2, f3, f4} = 4/6
Fx(30) = P{x ≤ 30} = P{f1, f2, f3} = 3/6
Fx(29.99) = P{x ≤ 29.99} = P{f1, f2} = 2/6
Fx(10.1) = P{x ≤ 10.1} = P{f1} = 1/6
Fx(5) = P{x ≤ 5} = P(∅) = 0

Proceeding similarly for any x, we obtain the staircase function Fx(x) of Fig. 4.3a.
(b) In the same experiment, the RV y is such that

y(fi) = 0 for i = 1, 3, 5        y(fi) = 1 for i = 2, 4, 6

In this case,

Fy(15) = P{y ≤ 15} = P(𝒮) = 1
Fy(1) = P{y ≤ 1} = P(𝒮) = 1
Fy(0) = P{y ≤ 0} = P{f1, f3, f5} = 3/6
Fy(-20) = P{y ≤ -20} = P(∅) = 0

The function Fy(y) is shown in Fig. 4.3b. •

[Figure 4.3: (a) the staircase distribution Fx(x) of the die RV, with jumps of 1/6 at 10, 20, . . . , 60; (b) the distribution Fy(y), with jumps of 1/2 at y = 0 and y = 1.]
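A minimal numerical sketch of Example 4.3 (not from the text): assuming the fair-die RV x(fi) = 10i, the function F_x below evaluates the staircase distribution directly from the definition (4-2). The function name and the use of exact fractions are our own choices.

from fractions import Fraction

values = [10 * i for i in range(1, 7)]   # values taken by the RV x
prob = Fraction(1, 6)                    # each face has probability 1/6

def F_x(x):
    # P{x <= x}: add the probabilities of all faces whose value does not exceed x
    return sum((prob for v in values if v <= x), Fraction(0))

for x in (200, 45, 30, 29.99, 10.1, 5):
    print(x, F_x(x))    # 1, 2/3, 1/2, 1/3, 1/6, 0, as in Example 4.3
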
Example 4.4
In the coin experiment,

𝒮 = {h, t}        P{h} = p        P{t} = q

We form the RV x such that

x(h) = 1        x(t) = 0

The distribution of x is the staircase function F(x) of Fig. 4.4. Note, in particular, that

F(0.9) = P{x ≤ 0.9} = P{t} = q
F(4) = P{x ≤ 4} = P(𝒮) = 1
F(-5) = P{x ≤ -5} = P(∅) = 0
F(0) = P{x ≤ 0} = P{t} = q    •

[Figure 4.4: the staircase distribution F(x) of the coin RV, with a jump of q at x = 0 and a jump of p at x = 1.]
Example 4.5
Telephone calls occurring at random and uniformly in the interval (0, T) specify an
experiment 𝒮 the outcomes of which are all points in this interval. The probability
that t is in the interval (t1, t2) equals [see (2-46)]

P{t1 ≤ t ≤ t2} = (t2 - t1)/T

We introduce the RV x such that

x(t) = t        0 ≤ t ≤ T

In this example, the variable t has a double meaning: it is the outcome of the
experiment 𝒮 and the corresponding value x(t) = t of the RV x. We shall show that
the distribution of x is a ramp, as in Fig. 4.5. To do so, we must find the probability of
the event {x ≤ x} for every x.
Suppose, first, that x > T. In this case, x(t) ≤ x for every t in the interval (0, T)
because x(t) = t; hence,

F(x) = P{x(t) ≤ x} = P{0 ≤ t ≤ T} = 1        x > T

If 0 ≤ x ≤ T, then x(t) ≤ x for every t in the interval (0, x); hence,

F(x) = P{x ≤ x} = P{0 ≤ t ≤ x} = x/T        0 ≤ x ≤ T

Finally, if x < 0, then {x(t) ≤ x} = ∅ because x(t) = t ≥ 0 for every t in 𝒮; hence,

F(x) = P{x ≤ x} = P(∅) = 0        x < 0    •

[Figure 4.5: the ramp distribution F(x) of Example 4.5, rising linearly from 0 at x = 0 to 1 at x = T.]

Example 4.6
The experiment 𝒮 consists of all points in the interval (0, ∞). The events of 𝒮 are all
intervals (t1, t2) and their unions. To specify 𝒮, it suffices, therefore, to know the
probability of the event {t1 ≤ t ≤ t2} for every t1 and t2 in 𝒮. This probability can be
specified in terms of a function α(t) as in (2-43):

P{t1 ≤ t ≤ t2} = ∫ from t1 to t2 of α(t) dt

We shall assume that

α(t) = 2e^{-2t}

This yields

P{0 ≤ t ≤ t0} = 2 ∫ from 0 to t0 of e^{-2t} dt = 1 - e^{-2t0}    (4-3)

(a) We form an RV x such that

x(t) = t        t ≥ 0

Thus, as in Example 4.5, t is an outcome of the experiment 𝒮 and the
corresponding value of the RV x. We shall show (Fig. 4.6a) that

Fx(x) = 1 - e^{-2x} for x ≥ 0        Fx(x) = 0 for x < 0    (4-4)

If x ≥ 0, then x(t) ≤ x for every t in the interval (0, x); hence,

Fx(x) = P{x ≤ x} = P{0 ≤ t ≤ x} = 1 - e^{-2x}

If x < 0, then {x ≤ x} = ∅ because x(t) ≥ 0 for every t in 𝒮; hence,

Fx(x) = P{x ≤ x} = P(∅) = 0

Note that whereas (4-3) has a meaning only for t0 ≥ 0, (4-4) is defined
for all x.
(b) In the same experiment, we define the RV y such that

y(t) = 0 for 0 ≤ t ≤ 0.5        y(t) = 1 for t > 0.5

Thus y takes the values 0 and 1 and

P{y = 0} = P{0 ≤ t ≤ 0.5} = 1 - e^{-1}
P{y = 1} = P{t > 0.5} = 2 ∫ from 0.5 to ∞ of e^{-2t} dt = e^{-1}

From this it follows (Fig. 4.6b) that

Fy(y) = 1 for y ≥ 1        Fy(y) = 1 - e^{-1} for 0 ≤ y < 1        Fy(y) = 0 for y < 0    •

[Figure 4.6: (a) the continuous distribution Fx(x) = 1 - e^{-2x}, x ≥ 0; (b) the staircase distribution Fy(y), with a jump of 1 - e^{-1} at y = 0 and a jump of e^{-1} at y = 1.]
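A small numerical check of Example 4.6 (our own sketch, not from the text), assuming α(t) = 2e^{-2t}: a midpoint-rule integration of α over (0, x) is compared with the closed form 1 - e^{-2x}, and the two probabilities of the zero-one RV y of part (b) are recovered.

import math

def alpha(t):
    return 2.0 * math.exp(-2.0 * t)

def F_x(x, steps=100_000):
    # F_x(x) = P{0 <= t <= x} = integral of alpha(t) over (0, x), for x >= 0
    if x < 0:
        return 0.0
    dt = x / steps
    return sum(alpha((k + 0.5) * dt) for k in range(steps)) * dt   # midpoint rule

for x in (0.25, 0.5, 1.0, 2.0):
    print(x, round(F_x(x), 5), round(1 - math.exp(-2 * x), 5))     # the columns agree

p0 = F_x(0.5)              # P{y = 0} = P{0 <= t <= 0.5}, about 1 - e^{-1}
print(round(p0, 4), round(1 - p0, 4))   # about 0.6321 and 0.3679 (= e^{-1})
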
It is clear from the foregoing examples that if the experiment 𝒮 consists
of finitely many outcomes, F(x) is a staircase function. This is also true if 𝒮
consists of infinitely many outcomes but x takes finitely many values.
PROPERTIES OF DISTRIBUTIONS   The following properties are simple
consequences of (4-1) and (4-2).

1.  F(-∞) = P{x = -∞} = 0        F(∞) = P{x ≤ ∞} = 1    (4-5)

2.  The function F(x) is monotonically increasing; that is,

if x1 < x2  then  F(x1) ≤ F(x2)    (4-6)

• Proof. If x1 < x2 and ζ is such that x(ζ) ≤ x1, then x(ζ) ≤ x2;
hence, the event {x ≤ x1} is a subset of the event {x ≤ x2}. This yields

F(x1) = P{x ≤ x1} ≤ P{x ≤ x2} = F(x2)    (4-7)

From (4-5) and (4-7) it follows that

0 ≤ F(x) ≤ 1    (4-8)

Furthermore,

if F(x0) = 0  then  F(x) = 0 for every x ≤ x0    (4-9)

3.  P{x > x} = 1 - F(x)    (4-10)

• Proof. For a specific x, the events {x ≤ x} and {x > x} are mutually exclusive, and their union equals 𝒮. Hence,

P{x ≤ x} + P{x > x} = P(𝒮) = 1

4.  P{x1 < x ≤ x2} = F(x2) - F(x1)    (4-11)

• Proof. The events {x ≤ x1} and {x1 < x ≤ x2} are mutually exclusive, and their union is the event {x ≤ x2}:

{x ≤ x1} ∪ {x1 < x ≤ x2} = {x ≤ x2}

This yields

P{x ≤ x1} + P{x1 < x ≤ x2} = P{x ≤ x2}

and (4-11) results.
5.  The function F(x) might be continuous or discontinuous. We shall
examine its behavior at or near a discontinuity point. Consider first
the die experiment of Example 4.3a. Clearly, F(x) is discontinuous
at x = 30, and

F(30) = F(30.01) = 3/6        F(29.99) = 2/6

Thus the value of F(x) for x = 30 equals its value for x near 30 to
the right but is different from its value for x near 30 to the left.
Furthermore, the discontinuity jump of F(x) from 2/6 to 3/6 equals
the probability P{x = 30} = 1/6. We maintain that this is true in
general.
Suppose that F(x) is discontinuous at the point x = x0. We
denote by F(x0+) and F(x0-) the limit of F(x) as x approaches x0 from
the right and left, respectively (Fig. 4.7). The difference

p0 = F(x0+) - F(x0-)    (4-12)

is the "discontinuity jump" of F(x) at the point x0. As we see from
the figure,

P{x < x0} = F(x0-)        P{x ≤ x0} = F(x0+)    (4-13)

This shows that

F(x0) = F(x0+)    (4-14)

Note, finally, that the events {x < x0} and {x = x0} are mutually
exclusive, and their union equals {x ≤ x0}. This yields

P{x < x0} + P{x = x0} = P{x ≤ x0}

From this and (4-13) it follows that

P{x = x0} = F(x0+) - F(x0-) = p0        P{x < x0} = F(x0-)    (4-15)

[Figure 4.7: a distribution F(x) with a discontinuity at x = x0; the jump equals p0 = F(x0+) - F(x0-).]
Continuous, Discrete, and Mixed Type RVs   We shall say that an RV x is of
continuous type if its distribution is continuous for every x (Fig. 4.8). In this
case, F(x+) = F(x) = F(x-); hence,

P{x = x} = 0    (4-16)

[Figure 4.8: distributions of continuous, discrete, and mixed type.]
for every x. Thus if x is of continuous type, the probability that it equals a
specific number x is zero for every x. We note that in this case,

P{x ≤ x} = P{x < x} = F(x)
P{x1 ≤ x ≤ x2} = P{x1 < x ≤ x2} = F(x2) - F(x1)    (4-17)

We shall say that an RV x is of discrete type if its distribution is a
staircase function. Denoting by xi the discontinuity points of F(x) and by pi
the jumps at xi, we conclude as in (4-12) and (4-14) that

P{x = xi} = F(xi) - F(xi-) = pi    (4-18)

Since F(-∞) = 0 and F(∞) = 1, it follows that if F(x) has N steps, then

p1 + · · · + pN = 1    (4-19)

Thus if x is of discrete type, it takes the values xi with probabilities pi. It
might take also other values; however, the set of the corresponding outcomes has zero probability.
We shall say that an RV x is of mixed type if its distribution is discontinuous but not a staircase.
If an experiment 𝒮 has finitely many outcomes, any RV x defined on 𝒮
is of discrete type. However, an RV x might be of discrete type even if 𝒮 has
infinitely many outcomes. The next example is an illustration.

Example 4.7
Suppose that 𝒜 is an event of an arbitrary experiment 𝒮. We shall say that x_𝒜 is the
zero-one RV associated with the event 𝒜 if

x_𝒜(ζ) = 1 if ζ ∈ 𝒜        x_𝒜(ζ) = 0 if ζ ∉ 𝒜

Thus x_𝒜 takes the values 0 and 1 (Fig. 4.9), and

P{x_𝒜 = 1} = P(𝒜) = p        P{x_𝒜 = 0} = P(𝒜̄) = 1 - p    •
The Percentile Curve   The distribution function equals the probability u =
F(x) that the RV x does not exceed a given number x. In many cases, we are
faced with the inverse problem: we are given u and wish to find the value xu
of x such that P{x ≤ xu} = u.

[Figure 4.9: the distribution F(x) of the zero-one RV x_𝒜, with a jump of 1 - p at x = 0 and a jump of p at x = 1.]
[Figure 4.10: the percentile curve xu obtained from the distribution curve u = F(x) by interchanging the axes.]
Clearly, xu is a number that depends on u; it is found by solving the equation

F(xu) = u    (4-20)

Thus xu is a function of u, called the u-percentile (or quantile or fractile) of
the RV x. Empirically, this means that 100u% of the observed values of x do
not exceed the number xu. The function xu is the inverse of the function u =
F(x). To find its graph, we interchange the axes of the F(x) curve (Fig. 4.10).
The domain of xu is the interval 0 ≤ u ≤ 1, and its range is the x-axis
-∞ ≤ x ≤ ∞.
Note that if the function F(x) is tabulated, we use interpolation to find
the values of xu for specific values of u. Suppose that u is between the
tabulated numbers ua and ub:

F(xa) = ua < u < ub = F(xb)

The corresponding xu is obtained by the straight-line approximation

xu ≈ xa + [(xb - xa)/(ub - ua)](u - ua)    (4-21)

of F(x) in the interval (xa, xb).
In Fig. 4.11 we demonstrate the determination of xu for u = .95, .975, and .99,
where we use for F(x) the standard normal curve G(x) [see (3-23)].

Figure 4.11
  x       F(x)            u       xu
1.60     .94520          .95     1.64
1.65     .95053          .975    1.96
1.95     .97441          .99     2.33
2.00     .97725
2.30     .98928
2.35     .99061
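A quick numerical illustration of (4-21) (our own sketch): the tabulated (x, F(x)) pairs below are the ones quoted in Fig. 4.11, and the helper name interp_percentile is ours.

table = [(1.60, 0.94520), (1.65, 0.95053),
         (1.95, 0.97441), (2.00, 0.97725),
         (2.30, 0.98928), (2.35, 0.99061)]

def interp_percentile(u, table):
    # find consecutive tabulated points (xa, ua), (xb, ub) with ua <= u <= ub
    # and apply the straight-line approximation (4-21)
    for (xa, ua), (xb, ub) in zip(table, table[1:]):
        if ua <= u <= ub:
            return xa + (xb - xa) * (u - ua) / (ub - ua)
    raise ValueError("u outside the tabulated range")

for u in (0.95, 0.975, 0.99):
    print(u, round(interp_percentile(u, table), 3))   # about 1.645, 1.960, 2.327
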
Median   The .5-percentile x.5 is of particular interest. It is denoted by
m and is called the median of x. Thus

F(m) = .5        m = x.5
The Empirical Distribution.   We shall now give the relative frequency interpretation of the function F(x). To do so, we perform the experiment n times and
denote by ζi the observed outcome at the ith trial. We thus obtain a sequence

ζ1, . . . , ζi, . . . , ζn    (4-22)

of n outcomes where ζi is one of the elements ζ of 𝒮. The RV x provides a rule
for assigning to each element ζ of 𝒮 a number x(ζ). The sequence of outcomes
(4-22) therefore generates the sequence of numbers

x1, . . . , xi, . . . , xn    (4-23)

where xi = x(ζi) is the value of the RV x at the ith trial. We place the numbers xi
on the x-axis and form a staircase function Fn(x) consisting of n steps, as in Fig.
4.12. The steps are located at the points xi (identified by dots), and their height
equals 1/n. The first step is at the smallest value xmin of the xi and the last at the
largest value xmax. Thus

Fn(x) = 0  for  x < xmin        Fn(x) = 1  for  x ≥ xmax

The function Fn(x) so constructed is called the empirical distribution of the
RV x.
As n increases, the number of steps increases, and their height 1/n tends
to zero. We shall show that for sufficiently large n,

Fn(x) ≈ F(x)    (4-24)

in the sense of (1-1). We denote by nx the number of trials such that xi ≤ x. Thus
nx is the number of steps of Fn(x) to the left of x; hence,

Fn(x) = nx/n    (4-25)

[Figure 4.12: the empirical distribution Fn(x), a staircase with steps of height 1/n at the observed values xi.]
[Figure 4.13: the distribution F(x) and the empirical distribution Fn(x) of Example 4.8; at a value observed m times the step of Fn(x) has height m/n.]

As we know, {x ≤ x} is an event with probability F(x). This event occurs at the
ith trial iff xi ≤ x. From this it follows that nx is the number of successes of the
event {x ≤ x} in n trials. Applying (1-1) to this event, we obtain

F(x) = P{x ≤ x} ≈ nx/n = Fn(x)

Thus the empirical function Fn(x) can be used to estimate the conceptual
function F(x) (see also Section 9-4).
In the construction of Fn(x), we assumed that the numbers xi are all
different. This is most likely the case if F(x) is continuous. If, however, the RV
x is of discrete type taking the N values ck, then xi = ck for some k. In this case,
the steps of Fn(x) are at the points ck, and the height of each step equals m/n
where m is the multiplicity of the numbers xi that equal ck (Fig. 4.13).
Example 4.8
We roll a fair die 10 times and observe the outcomes

f1  f5  f6  f4  f5  f2  f6  f3  f5  f3

The corresponding values of the RV x defined as in Example 4.1 are

10  50  60  40  50  20  60  30  50  30

In Fig. 4.13 we show the distribution F(x) and the empirical distribution Fn(x). •
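A short sketch of the empirical distribution (4-25) for the ten rolls of Example 4.8 (our own code, not from the text), compared with the theoretical staircase F(x) of the fair-die RV.

samples = [10, 50, 60, 40, 50, 20, 60, 30, 50, 30]   # observed values x_i
n = len(samples)

def F_n(x):        # empirical distribution: fraction of observed values not exceeding x
    return sum(1 for xi in samples if xi <= x) / n

def F(x):          # theoretical distribution of the die RV x(f_i) = 10*i
    return sum(1 for v in (10, 20, 30, 40, 50, 60) if v <= x) / 6

for x in (5, 25, 35, 45, 55, 100):
    print(x, F_n(x), round(F(x), 3))
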
The Empirical Percentile (Quantile) Curve.   Using the n numbers xi in (4-23),
we form n segments of length xi. We place them in line parallel to the y-axis in
order of increasing length, a distance 1/n apart (Fig. 4.14). If, for example, x is
the length of pine needles, the segments are n needles selected at random. We
then form a polygon the corners of which are the endpoints of the segments.
For sufficiently large n, this polygon approaches the u-percentile curve xu of
the RV x.
The Density Function
We can use the distribution function to determine the probability P{x ∈ R}
that the RV x takes values in an arbitrary region R of the real axis. To do so,
we express R as a union of nonoverlapping intervals and apply (4-11). We
show next that the result can be expressed in terms of the derivative f(x) of
F(x). We shall assume, first, that F(x) is continuous and that its derivative
exists nearly everywhere.

[Figure 4.14: the empirical percentile curve formed from the n observed segments, spaced 1/n apart along the u-axis.]
• Definition. The derivative

f(x) = dF(x)/dx    (4-26)

of F(x) is called the probability density function (p.d.f.) or the frequency
function of the RV x. We discuss next various properties of f(x).
Since F(x) increases as x increases, we conclude that

f(x) ≥ 0    (4-27)

Integrating (4-26) from x1 to x2, we obtain

F(x2) - F(x1) = ∫ from x1 to x2 of f(ξ) dξ    (4-28)

With x1 = -∞, this yields

F(x) = ∫ from -∞ to x of f(ξ) dξ    (4-29)

because F(-∞) = 0. Setting x = ∞, we obtain

∫ from -∞ to ∞ of f(x) dx = 1    (4-30)

Note, finally, that

P{x1 ≤ x ≤ x2} = ∫ from x1 to x2 of f(x) dx    (4-31)
[Figure 4.15: (a) the area of f(x) to the left of the u-percentile xu equals u; (b) an even (symmetric) density, for which x(1-u) = -xu.]
This follows from (4-28) and (4-17). Thus the area of f(x) in an interval (x1,
x2) equals the probability that x is in this interval.
If x1 = x, x2 = x + Δx, and Δx is sufficiently small, the integral in (4-31)
is approximately equal to f(x)Δx; hence,

P{x ≤ x ≤ x + Δx} ≈ f(x)Δx    (4-32)

From this it follows that the density f(x) can be defined directly as a limit
involving probabilities:

f(x) = lim as Δx → 0 of P{x ≤ x ≤ x + Δx}/Δx    (4-33)

With xu the u-percentile of x, (4-29) yields (Fig. 4.15a)

u = F(xu) = ∫ from -∞ to xu of f(x) dx

Note, finally, that if f(x) is an even function, that is, if f(-x) = f(x) (Fig.
4.15b), then

1 - F(x) = F(-x)        x(1-u) = -xu    (4-34)

From this and the table in Fig. 4.11 it follows that if F(x) is a normal
distribution, then

x.01 = -x.99 = -2.3        x.05 = -x.95 = -1.6
DISCRETE TYPE RVS   Suppose now that F(x) is a staircase function with
discontinuities at the points xk. In this case, the RV x takes the values xk with
probability

P{x = xk} = pk = F(xk) - F(xk-)    (4-35)

The numbers pk will be represented graphically either in terms of F(x)
or by vertical segments at the points xk with height equal to pk (Fig. 4.16).
Occasionally, we shall also use the notation

pk = f(xk)

to specify the probabilities pk. The function f(x) so defined will be called the
point density. It should be understood that its values f(xk) are not the derivatives of F(x); they equal the discontinuity jumps of F(x).
[Figure 4.16: the distribution F(x) and the point density pk = f(xk) of the RV of Example 4.9; the jumps are 1/8, 3/8, 3/8, 1/8 at x = 0, 1, 2, 3.]
Example 4.9
The experiment 𝒮 is the toss of a fair coin three times. In this case, 𝒮 has eight
outcomes, as in Example 3.1. We define the RV x such that its value at a specific
outcome equals the number of heads in that outcome. Thus x takes the values 0, 1, 2,
and 3, and

P{x = 0} = 1/8        P{x = 1} = 3/8        P{x = 2} = 3/8        P{x = 3} = 1/8

In Fig. 4.16, we show its distribution F(x) and the probabilities pk. •
The Empirical Density (Histogram).   We have performed an experiment n
times, and we obtained the n values xi of the RV x. In Fig. 4.12, we placed the
numbers xi on the x-axis and formed the empirical curve Fn(x). In many cases,
this is too detailed; what is needed is not the exact values of xi but their number
in various intervals of the x-axis. For example, if x represents yearly income,
we might wish to know only the number of persons in various income brackets.
To display such information graphically, we proceed as follows.
We divide the x-axis into intervals of length Δ, and we denote by nk the
number of points xi that are in the kth interval. We then form a staircase
function fn(x), as in Fig. 4.17. The kth step is in the kth interval (ck, ck + Δ), and
its height equals nk/(nΔ). Thus

fn(x) = nk/(nΔ)        ck ≤ x ≤ ck + Δ    (4-36)

The function fn(x) is called the histogram of the RV x. The histogram is used to
describe economically the data xi. We show next that if n is large and Δ small,
fn(x) approaches the density f(x):

fn(x) ≈ f(x)    (4-37)

Indeed, the event {ck ≤ x < ck + Δ} occurs nk times in n trials, and its probability equals f(ck)Δ [see (4-32)]. Hence,

f(ck)Δ ≈ P{ck ≤ x < ck + Δ} ≈ nk/n = fn(ck)Δ

[Figure 4.17: a histogram fn(x); the bar over the kth interval has height nk/(nΔ) (the counts nk shown are 2, 3, 6, 11, 15, 9, 7, 3, 2).]
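A minimal sketch of (4-36)-(4-37) (our own, not from the text): we draw n samples from the exponential density f(x) = 2e^{-2x} of Example 4.6 and compare the histogram heights nk/(nΔ) with f(x); the sample size and bin width are arbitrary choices.

import math, random

random.seed(0)
n, delta = 100_000, 0.1                                   # number of trials and bin width
samples = [random.expovariate(2.0) for _ in range(n)]     # density f(x) = 2 e^{-2x}, x >= 0

def f_n(x):
    # histogram estimate (4-36): fraction of samples in the bin containing x,
    # divided by the bin width delta
    k = int(x / delta)
    n_k = sum(1 for s in samples if k * delta <= s < (k + 1) * delta)
    return n_k / (n * delta)

for x in (0.05, 0.55, 1.05):
    print(x, round(f_n(x), 3), round(2 * math.exp(-2 * x), 3))   # close for large n, small delta
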
Probability Mass Density   In Section 2-2 we interpreted the probability P(𝒜) of an event 𝒜 as mass associated with 𝒜. We shall now give a
similar interpretation of the distribution and the density of an RV x. The
function F(x) equals the probability of the event {x ≤ x}; hence, F(x) can be
interpreted as the mass along the x-axis from -∞ to x. The difference F(x2) -
F(x1) is the mass in the interval (x1, x2), and the difference F(x + Δx) -
F(x) ≈ f(x)Δx is the mass in the interval (x, x + Δx). From this it follows that
f(x) can be interpreted as mass density. If x is of discrete type, taking the
values xk with probability pk, then the probabilities are point masses pk
located at xk. Finally, if x is of mixed type, it has distributed masses with
density f(x), where F'(x) exists, and point masses at the discontinuities of
F(x).
4-3
Illustrations
Now we shall introduce various RVs with specified distributions. It might
appear that to do so, we need to start with the specification of the underlying
experiment. We shall show, however, that this is not necessary. Given a
distribution Φ(x), we shall construct an experiment 𝒮 and an RV x such that
its distribution equals Φ(x).

FROM THE DISTRIBUTION TO THE MODEL   We are given a function Φ(x)
having all the properties of a distribution: It increases monotonically from 0
to 1 as x increases from -∞ to ∞, and it is continuous from the right. Using
this function, we construct an experimental model 𝒮 as follows.
The outcomes of 𝒮 are all points on the t-axis. The events of 𝒮 are all
intervals and their unions and intersections. The probability of the event
{t1 ≤ t ≤ t2} equals

P{t1 ≤ t ≤ t2} = Φ(t2) - Φ(t1)    (4-38)

This completes the specification of 𝒮.
We next form an RV x with domain the space 𝒮 and distribution the
given function Φ(x). To do so, we set

x(t) = t    (4-39)

Thus t has a dual meaning: It is an element of 𝒮 identified by the letter t, and
it is the value of the RV x corresponding to this element. For a given x, the
event {x ≤ x} consists of all elements t such that x(t) ≤ x. Since x(t) = t, we
conclude that {x ≤ x} = {t ≤ x}; hence [see (4-38)],

F(x) = P{x ≤ x} = P{t ≤ x} = Φ(x)    (4-40)

Note that 𝒮 is the entire t-axis even if Φ(x) is a staircase function. In
this case, however, all probability masses are at the discontinuity points of
Φ(x). All other points of the t-axis form a set with zero probability.
We have thus constructed an experiment specified in terms of an arbitrary function Φ(x). This experiment is, of course, only a theoretical model.
Whether it can be used as the model of a real experiment is another matter.
In the following illustrations, we shall often identify also various physical
problems generating specific idealized distributions.
Fundamental Note.   From the foregoing construction it follows that in the
study of a single RV x, we can avoid the notion of an abstract space. We can
assume in all cases that the underlying experiment 𝒮 is the real line and its
outcomes are the values x of x. This approach is taken by many authors. We
believe, however, that it is preferable to differentiate between experimental
outcomes and values of the RV x and to interpret all RVs as functions with
domain an abstract set of objects, limiting the real line to special cases. One
reason we do so is to make clear the conceptual difference between outcomes
and RVs. The other reason involves the study of several, possibly noncountably many RVs (stochastic processes). If we use the real-line approach, we
must consider spaces with many, possibly infinitely many, coordinates. It is
conceptually much simpler, it seems to us, to define all RVs as functions with
domain an abstract set 𝒮.
We shall use the following notational simplifications. First, we introduce the step function (Fig. 4.18):

U(x) = 1 for x ≥ 0        U(x) = 0 for x < 0    (4-41)

This function will be used to identify distributions that equal zero for x < 0.
For example, f(x) = 2e^{-2x}U(x) will mean that f(x) = 2e^{-2x} for x ≥ 0 and
f(x) = 0 for x < 0.

[Figure 4.18: the unit step function U(x).]

The notation

f(x) ~ φ(x)    (4-42)

will mean that f(x) = γφ(x) where γ is a factor that does not depend on x. If
f(x) is a density, then γ can be found from (4-30).
Normal   We shall say that an RV x is standard normal or Gaussian if
its density is the function (Fig. 4.19)

g(x) = (1/√(2π)) e^{-x²/2}

introduced in (3-22). The corresponding distribution is the function

G(x) = (1/√(2π)) ∫ from -∞ to x of e^{-ξ²/2} dξ

From the evenness of g(x) it follows that

G(-x) = 1 - G(x)

Shifting and scaling g(x), we obtain the general normal curves

f(x) = (1/(σ√(2π))) e^{-(x-η)²/(2σ²)} = (1/σ) g((x - η)/σ)    (4-44)

F(x) = (1/(σ√(2π))) ∫ from -∞ to x of e^{-(ξ-η)²/(2σ²)} dξ = G((x - η)/σ)    (4-45)

We shall use the notation N(η, σ) to indicate that the RV x is normal, as
in (4-44). Thus N(0, 1) indicates that x is standard normal.
From (4-45) it follows that if x is N(η, σ), then

P{x1 ≤ x ≤ x2} = F(x2) - F(x1) = G((x2 - η)/σ) - G((x1 - η)/σ)    (4-46)

[Figure 4.19: the N(0, 1) and N(3, 2) normal densities.]
With x1 = η - kσ and x2 = η + kσ, (4-46) yields

P{η - kσ ≤ x ≤ η + kσ} = G(k) - G(-k) = 2G(k) - 1    (4-47)

This is the area of the normal curve (4-44) in the interval (η - kσ, η + kσ).
The following special cases are of particular interest. As we see from
Table 1a,

G(1) = .8413        G(2) = .9772        G(3) = .9987

Inserting into (4-47), we obtain

P{η - σ < x ≤ η + σ} ≈ .683
P{η - 2σ < x ≤ η + 2σ} ≈ .954
P{η - 3σ < x ≤ η + 3σ} ≈ .997    (4-48)

We note further (see Fig. 4.11) that

P{η - 1.96σ < x ≤ η + 1.96σ} = .95
P{η - 2.58σ < x ≤ η + 2.58σ} = .99
P{η - 3.29σ < x ≤ η + 3.29σ} = .999    (4-49)

In Fig. 4.20, we show the areas under the N(η, σ) curve for the intervals in
(4-48) and (4-49).
The normal distribution is of central importance in the theory and the
applications of probability. It is a reasonable approximation of empirical
distributions in many problems, and it is used even in cases involving RVs
with domain a finite interval (a, b). In such cases, the approximation is
possible if the normal curve is suitably truncated and scaled or if its area is
negligible outside the interval (a, b).

[Figure 4.20: areas under the N(η, σ) curve; the intervals η ± 1.96σ, η ± 2.58σ, and η ± 3.29σ carry probability .95, .99, and .999, respectively.]

Example 4.10
The diameters of cylinders coming out of a production line are the values of a normal
RV with η = 10 cm, σ = 0.05 cm.
(a) We set as tolerance limits the points 9.9 and 10.1, and we reject all units
outside the interval

(9.9, 10.1) = (η - 2σ, η + 2σ)

Find the percentage of the rejected units.
As we see from (4-48), P{9.9 < x ≤ 10.1} ≈ .954; hence, 4.6% of the
units are rejected.
(b) We wish to find a tolerance interval (10 - c, 10 + c) such that only 1% of
the units will be rejected.
From (4-49) it follows that P{10 - c < x ≤ 10 + c} = .99 for c =
2.58σ = .129 cm. Thus if we increase the size of the tolerance interval from
0.2 cm to 0.258 cm, we decrease the number of rejected components from
4.6% to 1%. •
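A quick check of (4-48), (4-49), and Example 4.10 (our own sketch): the only identity assumed is G(x) = (1 + erf(x/√2))/2 for the standard normal distribution.

import math

def G(x):                      # standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within(k):            # P{eta - k*sigma < x <= eta + k*sigma} = 2G(k) - 1, see (4-47)
    return 2.0 * G(k) - 1.0

for k in (1, 2, 3, 1.96, 2.58, 3.29):
    print(k, round(prob_within(k), 4))    # .6827, .9545, .9973, .95, .99, .999

# Example 4.10(a): eta = 10, sigma = 0.05, tolerance limits 9.9 and 10.1 (k = 2)
print("rejected:", round(1 - prob_within(2), 3))     # about 0.046, i.e. 4.6%
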
Uniform   We shall say that an RV x is uniform (or uniformly distributed) in the interval (a - c/2, a + c/2) if

f(x) = 1/c  for  a - c/2 ≤ x ≤ a + c/2        f(x) = 0  elsewhere

The corresponding distribution is a ramp, as shown in Fig. 4.21.
Gamma   The RV x has a gamma distribution if

f(x) = γ x^{b-1} e^{-cx} U(x)        b > 0    c > 0    (4-50)

The constant γ can be expressed in terms of the following integral:

Γ(a) = ∫ from 0 to ∞ of y^{a-1} e^{-y} dy    (4-51)

This integral converges for a > 0, and it is called the gamma function.
Clearly [see (3A-1)],

Γ(1) = ∫ from 0 to ∞ of e^{-y} dy = 1

Γ(1/2) = ∫ from 0 to ∞ of (1/√y) e^{-y} dy = 2 ∫ from 0 to ∞ of e^{-z²} dz = √π    (4-52)
[Figure 4.21: the uniform density f(x) = 1/c on (a - c/2, a + c/2) and its ramp distribution F(x).]
Replacing a by a + 1 in (4-51) and integrating by parts, we obtain

Γ(a + 1) = a ∫ from 0 to ∞ of y^{a-1} e^{-y} dy = aΓ(a)    (4-53)

This shows that if we know Γ(a) for 1 ≤ a ≤ 2 (Fig. 4.22), we can determine
it recursively for any a > 0. Note, in particular, that if a = n is an integer,

Γ(n + 1) = nΓ(n) = n(n - 1) · · · Γ(1) = n!

For this reason, the gamma function is called the generalized factorial.
With y = cx, (4-51) yields

γ ∫ from 0 to ∞ of x^{b-1} e^{-cx} dx = (γ/c^b) ∫ from 0 to ∞ of y^{b-1} e^{-y} dy = (γ/c^b) Γ(b)    (4-54)

And since the area of f(x) equals 1, we conclude that

γ = c^b / Γ(b)

The gamma density has extensive applications. The following special
cases are of particular interest.

Chi-square

f(x) = [1/(2^{n/2} Γ(n/2))] x^{n/2 - 1} e^{-x/2} U(x)        n an integer

Of central interest in statistics.

Erlang

f(x) = [c^n/(n - 1)!] x^{n-1} e^{-cx} U(x)        n an integer

Used in queueing theory, traffic, radioactive emission.

[Figure 4.22: the gamma function Γ(a); Γ(1/2) = √π.]
[Figure 4.23: the exponential density f(x) and its distribution F(x).]

Exponential   (Fig. 4.23)

f(x) = c e^{-cx} U(x)        F(x) = (1 - e^{-cx}) U(x)

Important in the study of Poisson points.
Cauchy   We shall introduce this density in terms of the following experiment. A particle leaves the origin in free motion. Its path is a straight
line forming an angle θ with the horizontal axis (Fig. 4.24). The angle θ is
selected at random in the interval (-π/2, π/2). This specifies an experiment
𝒮 the outcomes of which are all points in that interval. The probability of the
event {θ1 ≤ θ ≤ θ2} equals

P{θ1 ≤ θ ≤ θ2} = (θ2 - θ1)/π

as in (2-46). In this experiment, we define an RV x such that

x(θ) = a tan θ

Thus x(θ) equals the ordinate of the point of intersection of the particle's path with
the vertical line of Fig. 4.24. Clearly, the event {x ≤ x} consists of all outcomes θ in the interval (-π/2, φ) where x = a tan φ; hence,

F(x) = P{x ≤ x} = P{-π/2 ≤ θ ≤ φ} = (φ + π/2)/π = 1/2 + (1/π) arctan(x/a)

Differentiating, we obtain the Cauchy density:

f(x) = (a/π)/(x² + a²)    (4-55)

[Figure 4.24: a particle leaving the origin at a random angle θ crosses the vertical line at distance a at the height x = a tan θ; the resulting Cauchy density.]
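A small simulation of the Cauchy construction (our own sketch), assuming a = 1: θ is drawn uniformly in (-π/2, π/2), x = tan θ is formed, and the observed fraction of values not exceeding x is compared with F(x) = 1/2 + (1/π) arctan x.

import math, random

random.seed(1)
n, a = 200_000, 1.0
thetas = [random.uniform(-math.pi / 2, math.pi / 2) for _ in range(n)]
xs = [a * math.tan(th) for th in thetas]        # the RV x(theta) = a tan(theta)

def F_emp(x):                                   # empirical distribution, as in (4-25)
    return sum(1 for v in xs if v <= x) / n

def F_cauchy(x):                                # the distribution derived above
    return 0.5 + math.atan(x / a) / math.pi

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(F_emp(x), 3), round(F_cauchy(x), 3))
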
Binomial   We shall say that an RV x has a binomial distribution of
order n if it takes the values 0, 1, . . . , n with probabilities

P{x = k} = C(n, k) p^k q^{n-k}        k = 0, 1, . . . , n    (4-56)

where C(n, k) denotes the binomial coefficient. Thus x is a discrete type RV, and its distribution F(x) is a staircase function

F(x) = Σ over k ≤ x of C(n, k) p^k q^{n-k}        q = 1 - p    (4-57)

with discontinuities at the points x = k (Fig. 4.25). The density f(x) of x is
different from zero only for x = k, and

f(k) = C(n, k) p^k q^{n-k}        k = 0, 1, . . . , n

The binomial distribution originates in the experiment 𝒮n of repeated
trials if we define an RV x equal to the number of successes of an event 𝒜 in n
trials. Suppose, first, that 𝒮n is the experiment of the n tosses of a coin. An
outcome of this experiment is a sequence

t = t1 . . . tn

where ti is h or t. We define x such that x(t) = k where k is the number of
heads in t. Thus {x = k} is the event {k heads}; hence, (4-56) follows from
(3-8). Suppose, next, that 𝒮n is the space of the n repetitions of an arbitrary
experiment 𝒮 and that 𝒜 is an event of 𝒮 with P(𝒜) = p. In this case, we set
x(t) = k if t is an element of the event {𝒜 occurs k times}, and (4-56) follows
from (3-16).
Large n   As we have seen in Chapter 3 (De Moivre-Laplace theorem),
the binomial probabilities f(k) approach the samples of a normal curve with

η = np        σ = √(npq)    (4-58)

Thus [see (3-27)]

f(k) ≈ [1/√(2πnpq)] e^{-(k - np)²/(2npq)}    (4-59)

[Figure 4.25: the binomial point density f(k) and distribution F(x) for n = 25, p = .2, together with their normal approximations.]
Note, however, the difference between a binomial and a normal RV. A binomial RV is of discrete type, and the function f(x) is defined at x = k only.
Furthermore, f(x) is not a density; its values f(k) are probabilities.
The approximation (4-59) is satisfactory even for moderate values of n.
In Fig. 4.25, we show the functions F(x) and f(x) and their normal approximations for n = 25 and p = .2. In this case, η = np = 5 and σ = √(npq) = 2. In
the following table, we show the exact values of f(k) and the corresponding
values of the normal density N(5, 2).

k         0     1     2     3     4     5     6     7     8     9    10    11
f(k)    .004  .024  .071  .136  .187  .196  .163  .111  .062  .029  .012  .004
N(5, 2) .009  .027  .065  .121  .176  .199  .176  .121  .065  .027  .009  .002
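The table above can be reproduced with a few lines of Python (our own sketch); nothing is assumed beyond the binomial probabilities (4-56) and the normal approximation (4-59) with n = 25, p = .2.

import math

n, p = 25, 0.2
q = 1 - p
eta, sigma = n * p, math.sqrt(n * p * q)          # eta = 5, sigma = 2

def f_binom(k):                                   # exact binomial probability (4-56)
    return math.comb(n, k) * p**k * q**(n - k)

def f_normal(k):                                  # normal approximation (4-59)
    return math.exp(-(k - eta)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

for k in range(12):
    print(k, round(f_binom(k), 3), round(f_normal(k), 3))
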
Poisson   An RV x has a Poisson distribution with parameter a if it
takes the values 0, 1, 2, . . . with probability

P{x = k} = e^{-a} a^k / k!        k = 0, 1, 2, . . .    (4-60)

The distribution of x is a staircase function

F(x) = e^{-a} Σ over k ≤ x of a^k / k!    (4-61)

The discontinuity jumps of F(x) form the sequence (Fig. 4.26)

f(k) = e^{-a} a^k / k!        k = 0, 1, 2, . . .    (4-62)

depending on the parameter a. We maintain that if a < 1, then f(k) is
maximum for k = 0 and decreases monotonically as k → ∞.

[Figure 4.26: the Poisson point density f(k) and distribution F(x) for a = 1.5.]
If a > 1 and a is not an integer, f(k) has a single maximum for k = [a]. If a > 1 and a is an
integer, f(k) has two maxima: for k = a - 1 and for k = a.
All this follows readily if we form the ratio of two consecutive terms in
(4-62):

f(k - 1)/f(k) = [a^{k-1}/(k - 1)!] / [a^k/k!] = k/a
Large a   If a >> 1, the Poisson distribution approaches a normal distribution with η = a, σ = √a:

e^{-a} a^k / k! ≈ [1/√(2πa)] e^{-(k - a)²/(2a)}        a >> 1    (4-63)

This is a consequence of the following: The binomial distribution (4-56)
approaches the Poisson distribution with a = np if n >> 1 and p << 1 [see
Poisson theorem (3-43)]. If n >> 1, p << 1, and np >> 1, both distributions
tend to a normal curve, as in (4-63).
Poisson Points   In Section 3-4 we introduced the space of Poisson
points specified in terms of the following properties:

1.  The probability that there are k points in an interval (t1, t2) of
length ta = t2 - t1 equals

e^{-λta} (λta)^k / k!        k = 0, 1, . . .    (4-64)

where λ is the "density" of the points.
2.  If (t1, t2) and (t3, t4) are two nonoverlapping intervals, the events
{ka points in (t1, t2)} and {kb points in (t3, t4)} are independent.

Given an interval (t1, t2) as here, we define the RV x as follows: An
outcome ζ of the experiment is an infinite set of points on the real axis. If k of these points
are in the interval (t1, t2), then x(ζ) = k. From (4-64) it follows that this RV is
Poisson-distributed with parameter a = λta where λ is the density of the
points and ta = t2 - t1.
In the next example we show the relationship between Poisson points
and exponential distributions.
Example 4.11
Given a set of Poisson points, identified by dots in Fig. 4.27, we select an arbitrary
point 0 and denote by w the distance from 0 to the first Poisson point to the right of 0.
We have thus created an RV w depending on the set of the Poisson points. We
maintain that the RV w has an exponential density:

fw(w) = λ e^{-λw} U(w)        Fw(w) = (1 - e^{-λw}) U(w)    (4-65)

[Figure 4.27: Poisson points on the axis; w is the distance from the point 0 to the first Poisson point on its right.]

• Proof. It suffices to find the probability P{w ≤ w} of the event {w ≤ w} where w is a
specified positive number. Clearly, w ≤ w iff there is at least one Poisson point in the
interval (0, w). We denote by x the number of Poisson points in the interval (0, w). As
we know, the RV x is Poisson-distributed with parameter λw. Hence,

P{x = 0} = e^{-λw}        w > 0

And since {w ≤ w} = {x ≥ 1}, we conclude that

Fw(w) = P{w ≤ w} = P{x ≥ 1} = 1 - P{x = 0} = 1 - e^{-λw}

Differentiating, we obtain (4-65). •
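A small simulation of Example 4.11 (our own sketch), assuming λ = 2 and an observation window (0, T) with T = 10: the points in the window are generated by drawing a Poisson-distributed count and placing that many points uniformly, and the observed P{w ≤ w} is compared with 1 - e^{-λw}.

import numpy as np

rng = np.random.default_rng(0)
lam, T, trials = 2.0, 10.0, 20_000
w_samples = []
for _ in range(trials):
    k = rng.poisson(lam * T)                # number of Poisson points in (0, T), see (4-64)
    if k == 0:
        continue                            # probability e^{-20}, negligible here
    points = rng.uniform(0.0, T, size=k)
    w_samples.append(points.min())          # distance from 0 to the first point on its right

w_samples = np.array(w_samples)
for w in (0.1, 0.25, 0.5, 1.0):
    emp = float(np.mean(w_samples <= w))
    print(w, round(emp, 3), round(1 - np.exp(-lam * w), 3))
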
Geometric   An RV x has a geometric distribution if it takes the values
1, 2, 3, . . . with probability

P{x = k} = p q^{k-1}        k = 1, 2, 3, . . .    (4-66)

where q = 1 - p. This is a geometric sequence, and

Σ from k = 1 to ∞ of p q^{k-1} = p/(1 - q) = 1

The geometric distribution has its origin in the following application of Bernoulli trials (see Example 3.5): Consider an event 𝒜 of an experiment 𝒮 with
P(𝒜) = p. We repeat 𝒮 an infinite number of times, and we denote by x the
number of trials until the event 𝒜 occurs for the first time. Clearly, x is an RV
defined in the space of Bernoulli trials, and, as we have shown in (3-11), it
has a geometric distribution.
Hypergeometric   The RV x has a hypergeometric distribution if it takes
the values 0, 1, . . . , n with probabilities

P{x = k} = C(K, k) C(N - K, n - k) / C(N, n)        k = 0, 1, . . . , n    (4-67)

where N, K, and n are given numbers such that

n ≤ K ≤ N
Example 4.12
A set contains K red objects and N - K black objects. We select n ≤ K of these
objects and denote by x the number of red objects among the n selections. As we see
from (2-41), the RV x so formed has a hypergeometric distribution. •

Example 4.13
We receive a shipment of 1,000 units, 200 of which are defective. We select at
random from this shipment 25 units, test them, and accept the shipment if the
defective units are at most 4. Find the probability p that the shipment is accepted.
The number of defective components is a hypergeometric RV x with

N = 1,000        K = 200        n = 25

The shipment is accepted if x ≤ 4; hence,

p = Σ from k = 0 to 4 of P{x = k} = Σ from k = 0 to 4 of C(200, k) C(800, 25 - k) / C(1000, 25) = .419

This result is used in quality control. •
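A direct evaluation of the acceptance probability of Example 4.13 using (4-67) (our own sketch):

import math

def hyper_pmf(k, N, K, n):
    # P{x = k} for the hypergeometric RV: C(K, k) C(N-K, n-k) / C(N, n)
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

N, K, n = 1000, 200, 25
p = sum(hyper_pmf(k, N, K, n) for k in range(5))   # accept if at most 4 defectives
print(round(p, 3))                                  # approximately 0.419
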
4-4
Functions of One Random Variable
Recall from calculus that a composite function y(t) = g(x(t)) is a function
g(x) of another function x(t). The domain of y(t) is the t-axis. A function of
an RV is an extension of this concept to functions with domain a probability
space 𝒮.
Given a function g(x) of the real variable x and an RV x with domain the
space 𝒮, we form the composite function

y(ζ) = g(x(ζ))    (4-68)

This function defines an RV y with domain the set 𝒮. For a specific ζi ∈ 𝒮, the
value y(ζi) of the RV y so formed is given by yi = g(xi), where xi = x(ζi) is the
corresponding value of the RV x (Fig. 4.28). We have thus constructed a
function y = g(x) of the RV x.

Distribution of g(x)
We shall express the distribution Fy(y) of the RV y so formed in terms of the
distribution Fx(x) of the RV x and the function g(x). We start with an
example.

[Figure 4.28: the composite mapping ζ → x(ζ) → y(ζ) = g(x(ζ)).]
We shall find the distribution of the RV

y = x²

starting with the determination of Fy(4). Clearly, y ≤ 4 iff -2 ≤ x ≤ 2; hence,
the events {y ≤ 4} and {-2 ≤ x ≤ 2} are equal. This yields

Fy(4) = P{y ≤ 4} = P{-2 ≤ x ≤ 2} = Fx(2) - Fx(-2)

We shall next find Fy(-3). The event {y ≤ -3} consists of all outcomes such
that y(ζ) ≤ -3. This event has no elements because y(ζ) = x²(ζ) ≥ 0 for
every ζ; hence,

Fy(-3) = P{y ≤ -3} = P(∅) = 0

Suppose, finally, that y ≥ 0 but is otherwise arbitrary. The event {y ≤
y} consists of all outcomes ζ such that the values y(ζ) of the RV y are on the
portion of the parabola g(x) = x² below the horizontal line Ly of Fig. 4.29.
This event consists of all outcomes ζ such that x²(ζ) ≤ y where, we repeat,
y is a specific positive number. Hence, if y ≥ 0, then

Fy(y) = P{y ≤ y} = P{-√y ≤ x ≤ √y} = Fx(√y) - Fx(-√y)    (4-69)

If y < 0, then {y ≤ y} is the impossible event, and

Fy(y) = P{y ≤ y} = P(∅) = 0    (4-70)

With Fy(y) so determined, the density fy(y) of y is obtained by differentiating Fy(y). Since

dFx(√y)/dy = [1/(2√y)] fx(√y)
[Figure 4.29: the parabola g(x) = x²; the event {y ≤ y} corresponds to the interval -√y ≤ x ≤ √y.]
(4-69) and (4-70) yield

fy(y) = [1/(2√y)] fx(√y) + [1/(2√y)] fx(-√y)    y > 0
fy(y) = 0    y < 0    (4-71)

Example 4.14
Suppose that

fx(x) = [1/(σ√(2π))] e^{-x²/(2σ²)}    and    y = x²

Since fx(-x) = fx(x), (4-71) yields

fy(y) = (1/√y) fx(√y) U(y) = [1/(σ√(2πy))] e^{-y/(2σ²)} U(y)    •
We proceed similarly for an arbitrary g(x). To find Fy(y) for a specific
y, we find the set Iy of points on the x-axis such that g(x) ≤ y. If x ∈ Iy, then
g(x) is below the horizontal line Ly (heavy in Fig. 4.30) and y ≤ y. Hence,

Fy(y) = P{y ≤ y} = P{x ∈ Iy}    (4-72)

Thus to find Fy(y) for a specific y, it suffices to find the set Iy such that g(x) ≤
y and the probability that x is in this set. For y as in Fig. 4.30, Iy is the union
of the half line x ≤ x1 and the interval x2 ≤ x ≤ x3; hence, for that value of y,

Fy(y) = P{x ∈ Iy} = P{x ≤ x1} + P{x2 ≤ x ≤ x3} = Fx(x1) + Fx(x3) - Fx(x2)

The function g(x) of Fig. 4.30 is between the horizontal lines y = ya and
y = yb:

yb < g(x) < ya    for every x    (4-73)

Since y(ζ) is a point on that curve, we conclude that if y > ya, then y(ζ) ≤ y
for every ζ ∈ 𝒮 and if y < yb, then there is no ζ such that y(ζ) ≤ y. Hence,

Fy(y) = P(𝒮) = 1    y > ya        Fy(y) = P(∅) = 0    y < yb    (4-74)

This holds for any g(x) satisfying (4-73).

[Figure 4.30: a function g(x) bounded between yb and ya; the set Iy = {x : g(x) ≤ y}.]
Let us look at several examples. In all cases, the RV x is of continuous
type. The examples are introduced not only because they illustrate the determination of the distribution Fy(y) of the RV y = g(x) but also because the
underlying reasoning contributes to the understanding of the meaning of an
RV. In the determination of Fy(y), it is essential to differentiate between the
RV y and the number y.
Illustrations
1.  Linear transformation. (See Fig. 4.31.)
(a) g(x) = x/2 + 3
Clearly, y ≤ y iff x ≤ xa = 2y - 6 (Fig. 4.31a); hence,

Fy(y) = P{x ≤ xa} = Fx(2y - 6)

(b) g(x) = 8 - x/2
In this case, y ≤ y iff x ≥ xb = 16 - 2y (Fig. 4.31b); hence,

Fy(y) = P{x ≥ xb} = 1 - Fx(16 - 2y)

2.  Limiter. (See Fig. 4.32.)

g(x) = a for x > a        g(x) = x for -a ≤ x ≤ a        g(x) = -a for x < -a

If y > a, then {y ≤ y} = 𝒮; hence, Fy(y) = 1.
If -a ≤ y ≤ a, then {y ≤ y} = {x ≤ y}; hence, Fy(y) = Fx(y).
If y < -a, then {y ≤ y} = ∅; hence, Fy(y) = 0.
In this example, Fy(y) is discontinuous at y = ±a; hence, the
RV y is of the mixed type, and

P{y = a} = P{x ≥ a} = 1 - Fx(a)        P{y = -a} = P{x ≤ -a} = Fx(-a)

[Figure 4.31: the linear transformations of Illustration 1, (a) g(x) = x/2 + 3 and (b) g(x) = 8 - x/2.]
[Figure 4.32: the limiter g(x) and the resulting mixed-type distribution Fy(y), with jumps at y = ±a.]
3.  Dead zone. (See Fig. 4.33.)

g(x) = x - c for x > c        g(x) = 0 for -c ≤ x ≤ c        g(x) = x + c for x < -c

If y ≥ 0, then {y ≤ y} = {x ≤ y + c}; hence, Fy(y) = Fx(y + c).
If y < 0, then {y ≤ y} = {x ≤ y - c}; hence, Fy(y) = Fx(y - c).
Again, Fy(y) is discontinuous at y = 0; hence, the RV y is of
mixed type, and

P{y = 0} = P{-c < x < c} = Fx(c) - Fx(-c)

4.  Discontinuous transformation. (See Fig. 4.34.)

g(x) = x + c for x ≥ 0        g(x) = x - c for x < 0

If y ≥ c, then {y ≤ y} = {x ≤ y - c}; hence, Fy(y) = Fx(y - c).
If -c < y < c, then {y ≤ y} = {x < 0}; hence, Fy(y) = Fx(0).
If y < -c, then {y ≤ y} = {x ≤ y + c}; hence, Fy(y) = Fx(y + c).

[Figure 4.33: the dead-zone transformation g(x) and the resulting mixed-type distribution, with a jump at y = 0.]
[Figure 4.34: the discontinuous transformation g(x) of Illustration 4, with a jump of 2c at x = 0.]

5.  In this illustration, we select for g(x) the distribution Fx(x) of the RV
x (Fig. 4.35). We shall show that the resulting RV

y = Fx(x)

is uniform in the interval (0, 1) for any Fx(x).
The function g(x) = Fx(x) is between the lines y = 0 and y = 1.
Hence [see (4-74)],

Fy(y) = 1    y > 1        Fy(y) = 0    y < 0    (4-75)

If 0 ≤ y ≤ 1, then {y ≤ y} = {x ≤ xy} where xy is such that
Fx(xy) = y. Thus xy is the y-percentile of x, and

Fy(y) = P{x ≤ xy} = y        0 ≤ y ≤ 1    (4-76)

Illustration 5 is used to construct an RV y with uniform distribution
starting from an RV x with an arbitrary distribution. Using a related approach, we can construct an RV y = g(x) with a specified distribution starting
with an RV x with an arbitrary distribution (see Section 8-3).
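A brief numerical illustration of Illustration 5 and (4-76) (our own sketch), assuming an exponential x with Fx(x) = 1 - e^{-2x}: the transformed values y = Fx(x) should behave like a uniform RV on (0, 1).

import math, random

random.seed(2)
samples_x = [random.expovariate(2.0) for _ in range(100_000)]
samples_y = [1.0 - math.exp(-2.0 * x) for x in samples_x]    # y = F_x(x)

# Check F_y(y) = y for a few values of y, as in (4-76)
for y in (0.1, 0.3, 0.5, 0.7, 0.9):
    F_y = sum(1 for v in samples_y if v <= y) / len(samples_y)
    print(y, round(F_y, 3))      # each printed value is close to y itself
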
Density of g(x)
If we know Fy(y), we can find fy(y) by differentiation. In many cases, however, it is simpler to express the density fy(y) of the RV y = g(x) directly in
terms of the density fx(x) of x and the function g(x). To do so, we form the
equation

g(x) = y    (4-77)

where y is a specific number, and we solve for x. The solutions of this
equation are the abscissas xi of the intersection points of the horizontal line
Ly (Fig. 4.36) with the curve g(x).

[Figure 4.35: the transformation g(x) = Fx(x) of Illustration 5.]
[Figure 4.36: the horizontal lines Ly and Ly + dy intersecting the curve g(x) at the roots xi of the equation g(x) = y.]
• Fundamental Theorem. For a specific y, the density fy(y) is given by

fy(y) = fx(x1)/|g'(x1)| + · · · + fx(xi)/|g'(xi)| + · · · = Σ over i of fx(xi)/|g'(xi)|    (4-78)

where xi are the roots of (4-77):

y = g(x1) = · · · = g(xi) = · · ·    (4-79)

and g'(xi) are the derivatives of g(x) at x = xi.

• Proof. To avoid generalities, we assume that the equation y = g(x) has
three roots, as in Fig. 4.36. As we know [see (4-32)],

fy(y) dy = P{y < y < y + dy}    (4-80)

Clearly, y is between y and y + dy iff x is in any one of the intervals

(x1, x1 + dx1)        (x2 - |dx2|, x2)        (x3, x3 + dx3)

where dx1 > 0, dx2 < 0, dx3 > 0. Hence,

P{y < y < y + dy} = P{x1 < x < x1 + dx1} + P{x2 - |dx2| < x < x2} + P{x3 < x < x3 + dx3}    (4-81)

This is the probability that y is in the three segments of the curve g(x)
between the lines Ly and Ly + dy. The terms on the right of (4-81) equal

P{x1 < x < x1 + dx1} = fx(x1) dx1        dx1 = dy/g'(x1)
P{x2 - |dx2| < x < x2} = fx(x2)|dx2|        dx2 = dy/g'(x2)
P{x3 < x < x3 + dx3} = fx(x3) dx3        dx3 = dy/g'(x3)

Inserting into (4-81), we obtain

fy(y) dy = [fx(x1)/g'(x1)] dy + [fx(x2)/|g'(x2)|] dy + [fx(x3)/g'(x3)] dy

and (4-78) results.
Thus to find fy(y) for a specific y, we find the roots xi of the equation
y = g(x) and insert into (4-78). The numbers xi depend, of course, on y.
We assumed that the equation y = g(x) has at least one solution. If for
some values of y the line Ly does not intersect g(x), then fy(y) = 0 for these
values of y.
Note, finally, that if g(x) = y0 for every x in an interval (a, b) as in Fig.
4.33, then Fy(y) is discontinuous at y = y0.
Illustrations
1.  g(x) = ax + b
The equation y = ax + b has a single solution x1 = (y - b)/a
for every y. Furthermore, g'(x) = a; hence,

fy(y) = (1/|a|) fx(x1) = (1/|a|) fx((y - b)/a)    (4-82)

Thus fy(y) is obtained by shifting and scaling the density fx(x) of x.

2.  g(x) = 1/x
The equation y = 1/x has a single solution x1 = 1/y for every y.
Furthermore,

g'(x) = -1/x²        g'(x1) = -y²

Hence,

fy(y) = (1/y²) fx(1/y)    (4-83)

3.  g(x) = ax²        a > 0
If y > 0, the equation y = ax² has two solutions: x1 = √(y/a)
and x2 = -√(y/a) (Fig. 4.29). Furthermore,

g'(x) = 2ax        g'(x1) = 2√(ay)        g'(x2) = -2√(ay)

Hence,

fy(y) = [1/(2√(ay))] fx(√(y/a)) + [1/(2√(ay))] fx(-√(y/a))    (4-84)

If y < 0, the equation y = ax² has no real roots; hence, fy(y) = 0
[see also (4-71)].
Example 4.15
The RV x is uniform in the interval (5, 10), as in Fig. 4.37, and y = 4x². In this case,
fx(-√(y/4)) = 0 for every y > 0 and

fx(√(y/4)) = 1/5        5 < √(y/4) < 10,  that is,  100 < y < 400

Hence,

fy(y) = 1/(20√y)        100 < y < 400        fy(y) = 0 elsewhere    •

[Figure 4.37: the uniform density fx(x) on (5, 10) and the density fy(y) = 1/(20√y) on (100, 400) of Example 4.15.]
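A Monte Carlo sanity check of Example 4.15 (our own sketch): x is drawn uniformly from (5, 10), y = 4x² is formed, and a histogram estimate of fy, as in (4-36), is compared with 1/(20√y).

import math, random

random.seed(3)
n = 200_000
ys = [4.0 * random.uniform(5.0, 10.0) ** 2 for _ in range(n)]   # y = 4 x^2

def f_y_hist(y, delta=5.0):
    # histogram estimate of the density of y over the bin [y, y + delta)
    count = sum(1 for v in ys if y <= v < y + delta)
    return count / (n * delta)

for y in (120.0, 200.0, 300.0, 380.0):
    print(y, round(f_y_hist(y), 5), round(1.0 / (20.0 * math.sqrt(y)), 5))
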
4.  g(x) = sin x
If |y| > 1, the equation y = sin x has no real solutions; hence,
fy(y) = 0. If |y| < 1, it has infinitely many solutions xi:

y = sin xi        xi = arcsin y

Furthermore,

g'(xi) = cos xi = √(1 - sin²xi) = √(1 - y²)

and (4-78) yields

fy(y) = [1/√(1 - y²)] Σ over i of fx(xi)    (4-85)

where fx(xi) are the values of the density fx(x) of x (Fig. 4.38).
Suppose now that the RV x is uniform in the interval (-π, π).

[Figure 4.38: the curve g(x) = sin x; for |y| < 1 the line Ly intersects it at infinitely many points xi, and the resulting density fy(y).]

In this case, fx(xi) = 1/(2π) for -π < xi < π and fx(xi) = 0 otherwise.
For any |y| < 1, exactly two of the solutions xi are in the interval (-π, π)
for any y; hence, the sum in (4-85) equals 2/(2π). This yields

fy(y) = 1/(π√(1 - y²))    |y| < 1        fy(y) = 0    |y| > 1    (4-86)

5.  g(x) = Fx(x)
as in Fig. 4.35. Clearly, g(x) is between the lines y = 0 and y = 1;
hence, fy(y) = 0 for y < 0 and y > 1. For 0 ≤ y ≤ 1, the equation
y = Fx(x) has a single solution x1 = xy, where xy is the y-percentile of
the RV x; that is, y = F(xy). Furthermore, g'(x) = F'x(x) = fx(x);
hence,

fy(y) = fx(xy)/fx(xy) = 1        0 ≤ y ≤ 1

Thus the RV y is uniform in the interval (0, 1) for any Fx(x) [see also
(4-76)].

[Figure 4.39: a staircase transformation g(x) of a continuous RV x; the resulting RV y is of discrete type.]
Point Masses   Suppose that x is a discrete type RV taking the values xk with
probability pk. In this case, the RV y = g(x) is also of discrete type, taking the
values yk = g(xk). If yk = g(xk) only for one xk, then

P{y = yk} = P{x = xk} = pk = fx(xk)

If, however, y = yk for x = xa and x = xb, that is, if yk = g(xa) = g(xb), then
the event {y = yk} is the union of the events {x = xa} and {x = xb}; hence,

P{y = yk} = P{x = xa} + P{x = xb} = pa + pb

Note, finally, that if x is of continuous type but g(x) is a staircase
function with discontinuities at the points xk, then y is of discrete type,
taking the values yk = g(xk). In Fig. 4.39, for example,

P{y = 7} = P{2 < x ≤ 6} = Fx(6) - Fx(2) = .3

This is the discontinuity jump of Fy(y) at y = 7.
4-5
Mean and Variance
The properties of an RV x are completely specified in terms of its distribution.
In many cases, however, it is sufficient to specify x only partially in terms of
certain parameters. For example, knowledge of the percentiles xu of x, given
only in increments of .1, is often adequate. The most important numerical
parameters of x are its mean and its variance.
Mean
The mean or expected value or statistical average of an RV x is, by definition, the center of gravity of the probability masses of x (Fig. 4.40). This
number will be denoted by E{x} or ηx or η.
If x is of continuous type with density f(x) (distributed masses), then

E{x} = ∫ from -∞ to ∞ of x f(x) dx    (4-87)

If x is of discrete type taking the values xk with probabilities pk = f(xk) (point
masses), then

E{x} = Σ over k of xk pk = Σ over k of xk f(xk)    (4-88)

A constant c can be interpreted as an RV x taking the value c for every
ζ. Applying (4-88), we conclude that

E{c} = cP{x = c} = c
Empirical Interpretation.   We repeat the experiment n times and denote by xi
the resulting values of the RV x [see also (4-23)]. We next form the arithmetic
average

x̄ = (x1 + · · · + xn)/n = (1/n) Σ over i of xi    (4-89)

of these numbers. We maintain that x̄ tends to the statistical average E{x} of x
as n increases:

x̄ ≈ E{x}        n large    (4-90)

[Figure 4.40: the mean η = ∫ x f(x) dx as the center of gravity of the probability masses.]
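A quick illustration of (4-89) and (4-90) (our own sketch), assuming an exponential RV with c = 2, for which E{x} = 1/c = 0.5 by (4-107) below: the arithmetic average approaches E{x} as n grows.

import random

random.seed(4)
for n in (100, 10_000, 1_000_000):
    xs = [random.expovariate(2.0) for _ in range(n)]   # E{x} = 1/2
    print(n, round(sum(xs) / n, 4))                     # tends to 0.5 as n increases
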
We divide the x-axis into intervals of length Δ as in Fig. 4.17 and denote by nk
the number of xi between ck and ck + Δ. If Δ is small, then xi ≈ ck for ck ≤ xi ≤
ck + Δ. Introducing this approximation into (4-89), we obtain

x̄ ≈ Σ over k of ck (nk/n) = Σ over k of ck fn(ck) Δ    (4-91)

where fn(ck) = nk/(nΔ) is the histogram of x [see (4-36)]. As we have shown in
(4-37), fn(x) ≈ f(x) for small Δ; hence, the last sum in (4-91) tends to the integral
in (4-87) as n → ∞; this yields (4-90).
We note that xi/n is the area under the empirical percentile curve in an
interval of length 1/n (Fig. 4.41); hence, the sum in (4-89) equals the total area
of the empirical percentile.
From the foregoing it follows that the mean E{x} of our RV x equals the
algebraic area under its percentile curve xu, namely,

∫ from -∞ to ∞ of x f(x) dx = ∫ from 0 to 1 of xu du        F(xu) = u

Interchanging the x and F(x) axes, we conclude that E{x} equals the difference of the areas of the regions ACD and OAB of Fig. 4.41. This yields

E{x} = ∫ from 0 to ∞ of R(x) dx - ∫ from -∞ to 0 of F(x) dx    (4-92)

where R(x) = 1 - F(x).
Equation (4-92) can be established directly from (4-87) if we use integration by parts.
Symmetrical Densities   If f(x) is an even function, that is, if f(-x) = f(x),
then E{x} = 0. More generally, if f(x) is symmetrical about the number a,
that is, if

f(a + x) = f(a - x)

then E{x} = a. This follows readily from (4-87) or, directly, from the mass interpretation of
the mean.
A density that has no center of symmetry is called skewed.

[Figure 4.41: the mean as the area under the percentile curve; E{x} = area(ACD) - area(OAB).]
MEAN OF g(x)
Given an RV x and a function g(x), we form the RV

y = g(x)

As we see from (4-87) and (4-88), the mean of y equals

E{y} = ∫ from -∞ to ∞ of y fy(y) dy    or    E{y} = Σ over k of yk fy(yk)    (4-93)

for the continuous and the discrete case, respectively. To determine E{y}
from (4-93), we need to find fy(y). In the following, we show that E{y} can be
expressed directly in terms of fx(x) and g(x).

• Fundamental Theorem

E{g(x)} = ∫ from -∞ to ∞ of g(x) fx(x) dx    or    E{g(x)} = Σ over k of g(xk) fx(xk)    (4-94)

• Proof. The discrete case follows readily: If yk = g(xk) for only a single xk,
then

Σ over k of yk P{y = yk} = Σ over k of g(xk) P{x = xk}

If yk = g(xk) for several xk, we add the corresponding terms on the right.
To prove the continuous case, we assume that g(x) is the curve of Fig.
4.36. Clearly,

{y < y < y + dy} = {x1 < x < x1 + dx1} ∪ {x2 < x < x2 + dx2} ∪ {x3 < x < x3 + dx3}

where, in contrast to (4-81), all differentials are positive. Multiplying the
probabilities of both sides by y = g(x1) = g(x2) = g(x3) and using (4-81), we
obtain

y fy(y) dy = g(x1) fx(x1) dx1 + g(x2) fx(x2) dx2 + g(x3) fx(x3) dx3

Thus to each differential of the integral in (4-93) corresponds one or more
nonoverlapping differentials of the integral in (4-94). As dy covers the y-axis,
each corresponding dx covers the x-axis; hence, the two integrals are equal.
We shall verify this theorem with an example.
Example 4.16
The RV x is uniform in the interval (1, 3) (Fig. 4.42), and y = 2x + 1. As we see from
(4-82),

fy(y) = (1/2) fx((y - 1)/2)

From this it follows that y is uniform in the interval (3, 7) and

E{y} = ∫ from -∞ to ∞ of y fy(y) dy = .25 ∫ from 3 to 7 of y dy = 5

This agrees with (4-94) because g(x) = 2x + 1 and

∫ from -∞ to ∞ of g(x) fx(x) dx = .5 ∫ from 1 to 3 of (2x + 1) dx = 5    •
[Figure 4.42: the uniform densities fx(x) on (1, 3) and fy(y) on (3, 7) of Example 4.16, with ηx = 2 and ηy = 5.]
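A numerical check of Example 4.16 and of the theorem (4-94) (our own sketch): the sample mean of 2x + 1, for x drawn uniformly from (1, 3), approaches 5.

import random

random.seed(5)
n = 1_000_000
xs = [random.uniform(1.0, 3.0) for _ in range(n)]
print(round(sum(2 * x + 1 for x in xs) / n, 3))    # E{2x + 1} = 2 E{x} + 1 = 5
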
Linearity
From (4-94) it follows that

E{ax} = a ∫ from -∞ to ∞ of x fx(x) dx = aE{x}    (4-95)

E{g1(x) + · · · + gn(x)} = ∫ from -∞ to ∞ of [g1(x) + · · · + gn(x)] fx(x) dx
= E{g1(x)} + · · · + E{gn(x)}    (4-96)

Thus the mean of the sum of n functions gi(x) of the RV x equals the sum of
their means.
Note, in particular, that E{b} = b and

E{ax + b} = aE{x} + b    (4-97)

In Example 4.16, ηx = 2 and ηy = 5 = 2ηx + 1, in agreement with (4-97).
Variance
The variance or dispersion of an RV x is by definition the central moment of
inertia of its probability masses. This number is denoted by σ². Thus for
distributed masses,

σ² = ∫ from -∞ to ∞ of (x - η)² f(x) dx    (4-98)

and for point masses,

σ² = Σ over k of (xk - η)² pk    (4-99)

From (4-98) and (4-94) it follows that

σ² = E{(x - η)²} = E{x²} - 2ηE{x} + η²

where η = E{x}. This yields

E{x²} = σ² + η²    (4-100)

This has a familiar mass interpretation: Clearly, E{x²} is the moment of
inertia of the probability masses with respect to the origin and η² is the
moment of inertia of a unit point mass located at x = η. Thus (4-100) states
that the moment of inertia with respect to the origin equals the sum of the
central moment of inertia plus the moment, with respect to the origin, of a
unit mass at x = η. In probability theory, E{x} is called the first moment,
E{x²} the second moment, and σ² the second central moment (see also
Section 5-2). The square root σ of σ² is called the standard deviation of the
RV x.
We show next that the variance σy² of the RV y = ax + b equals

σy² = a²σx²    (4-101)

We know that ηy = aηx + b; hence,

σy² = E{(y - ηy)²} = E{[(ax + b) - (aηx + b)]²} = E{a²(x - ηx)²}

and (4-101) follows from (4-95).
Note, in particular, that the variance of x + b equals the variance of x.
This shows that a shift of the origin has no effect on the variance.
As we show in (4-114), the variance is a measure of the concentration
of the probability masses near their mean. In fact, if σ = 0, all masses are at a
single point because then f(x) = 0 for x ≠ η [see (4-98)]. Thus

if σ = 0  then  x = η = constant    (4-102)

in the sense that the probability that x ≠ η is zero.
Note, in particular, that

if E{x²} = 0  then  x = 0    (4-103)
Illustrations
We determine next the mean and the variance of various distributions. Note
that it is often simpler to determine σ² from (4-100).
Normal   The density of a standard normal RV z is even; hence, ηz = 0. We shall show that the
variance of z equals 1:

σz² = (1/√(2π)) ∫ from -∞ to ∞ of z² e^{-z²/2} dz = 1    (4-104)

To do so, we differentiate the identity [see (3A-1)]

∫ from -∞ to ∞ of e^{-az²} dz = √π a^{-1/2}
with respect to a. This yields

∫ from -∞ to ∞ of (-z²) e^{-az²} dz = -(√π/2) a^{-3/2}

Setting a = 1/2, we obtain (4-104).
We next form the RV x = az + b. As we see from (4-82), if a > 0, then

fx(x) = (1/a) g((x - b)/a) = [1/(a√(2π))] e^{-(x-b)²/(2a²)}

Thus x is N(b, a), and [see (4-100)]

ηx = b        σx = a    (4-105)

This justifies the use of the letters η and σ in the definition (4-44) of a general
normal density.
Uniform   Suppose that x is uniform in the interval (a - c/2, a + c/2),
as in Fig. 4.21. In this case, a is the center of symmetry of f(x), and since the
location of the origin is irrelevant in the determination of σ², we conclude
that

η = a        σ² = (1/c) ∫ from -c/2 to c/2 of x² dx = c²/12    (4-106)

Exponential   If

f(x) = c e^{-cx} U(x)

then

E{x} = c ∫ from 0 to ∞ of x e^{-cx} dx = 1/c

Hence,

η = 1/c        E{x²} = 2/c²        σ² = 1/c²    (4-107)

We continue with discrete type RVs.
Zero-One   The RV x takes the values 1 and 0 with

P{x = 1} = p        P{x = 0} = q = 1 - p

In this case,

E{x} = 0 × q + 1 × p = p        E{x²} = 0 × q + 1 × p = p

Hence,

η = p        σ² = E{x²} - η² = pq    (4-108)
Geometric   The RV x takes the values 1, 2, . . . with

P{x = k} = p q^{k-1}        k = 1, 2, . . .

We shall show that

η = 1/p        σ² = q/p²    (4-109)

• Proof. Differentiating the geometric series

Σ from k = 1 to ∞ of q^k = q/(1 - q)

twice, we obtain

Σ from k = 1 to ∞ of k q^{k-1} = 1/(1 - q)²        Σ from k = 1 to ∞ of k(k - 1) q^{k-2} = 2/(1 - q)³

From this it follows that

E{x} = p Σ from k = 1 to ∞ of k q^{k-1} = p/(1 - q)² = 1/p

E{x²} = p Σ from k = 1 to ∞ of k² q^{k-1} = 2pq/(1 - q)³ + p/(1 - q)² = (1 + q)/p²

and (4-109) results [see (4-100)].
From (4-109) it follows that the expected number of tosses until heads
shows for the first time (see Example 3.5) equals 1/p.
Poisson   The RV x takes the values 0, 1, . . . with

P{x = k} = e^{-a} a^k / k!        k = 0, 1, . . .

We shall show that

η = a        σ² = a    (4-110)

• Proof. Differentiating the identity

e^a = Σ from k = 0 to ∞ of a^k / k!

twice, we obtain

e^a = Σ from k = 0 to ∞ of k a^{k-1} / k!        e^a = Σ from k = 0 to ∞ of k(k - 1) a^{k-2} / k!

Hence,

E{x} = e^{-a} Σ over k of k a^k / k! = a        E{x²} = a² + a

and (4-110) results.
From (4-110) and (4-64) it follows that the expected number of Poisson
points in an interval of length ta equals λta. This shows that λ equals the
expected number of points in a unit interval.
APPROXIMATE EVALUATION OF E{g(x)}   We wish to determine the mean
of the RV y = g(x). To do so, we must know the density of x. We shall show
that if g(x) is sufficiently smooth in the region (ηx - c, ηx + c) where fx(x)
takes significant values (Fig. 4.43), we can express E{g(x)} in terms of the
mean ηx and the variance σx² of x.

[Figure 4.43: a smooth function g(x) over the region (ηx - c, ηx + c) where the density fx(x) is concentrated.]
Suppose, first, that g(x) = ax + b. In this case,

E{g(x)} = aηx + b = g(ηx)

This suggests that if g(x) is approximated by its tangent

g(x) ≈ g(ηx) + g'(ηx)(x - ηx)

in the interval ηx ± c, then E{g(x)} ≈ g(ηx). Indeed, since E{x - ηx} = 0, the
estimate

E{g(x)} ≈ g(ηx)    (4-111)

results. This estimate can be improved if we approximate g(x) by a parabola

g(x) ≈ g(ηx) + g'(ηx)(x - ηx) + [g''(ηx)/2](x - ηx)²    (4-112)

Since E{(x - ηx)²} = σx², we conclude that

E{g(x)} ≈ g(ηx) + [g''(ηx)/2] σx²    (4-113)

This is an approximation based on the truncation (4-112) of the Taylor
series expansion of g(x) about the point x = ηx. The approximation can be
further improved if we include more terms in (4-112). The result, however,
involves knowledge of higher-order moments of x.
Example 4.17
g(x) = 1/x. In this case,

g'(x) = -1/x²        g''(x) = 2/x³        g(ηx) = 1/ηx        g''(ηx) = 2/ηx³
and (4-113) yields

E{1/x} ≈ 1/ηx + σx²/ηx³    •
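A small check of (4-113) for g(x) = 1/x (our own sketch), assuming x uniform on (2, 4), so that ηx = 3 and σx² = 1/3 by (4-106): the approximation 1/η + σ²/η³ is compared with the exact value E{1/x} = (1/2) ln 2 and with a sample average.

import math, random

random.seed(6)
eta, var = 3.0, (4.0 - 2.0) ** 2 / 12.0              # uniform on (2, 4): eta = 3, sigma^2 = 1/3
approx = 1.0 / eta + var / eta**3                    # the estimate (4-113) for g(x) = 1/x
exact = 0.5 * math.log(2.0)                          # E{1/x} for this density
sample = sum(1.0 / random.uniform(2.0, 4.0) for _ in range(1_000_000)) / 1_000_000
print(round(approx, 4), round(exact, 4), round(sample, 4))   # all close to 0.346
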
TCHEBYCHEFF'S INEQUALITY   The mean η of an RV x is the center of
gravity of its masses. The variance σ² is their moment of inertia with respect
to η. We shall show that a major proportion of the masses is concentrated in
an interval of the order of σ centered at η.
Consider the points a = η - kσ, b = η + kσ of Fig. 4.44 where k is an
arbitrary constant. If we remove all masses between a and b (Fig. 4.44b) and
replace the masses p1 = P{x ≤ η - kσ} to the left of a by a point mass p1 at a
and the masses p2 = P{x ≥ η + kσ} to the right of b by a point mass p2 at b
(Fig. 4.44c), the resulting moment of inertia with respect to η will be smaller
than σ²:

(p1 + p2) k²σ² ≤ σ²

Hence,

p1 + p2 ≤ 1/k²

From this it follows that

P{|x - η| ≥ kσ} ≤ 1/k²    (4-114)

or, equivalently,

P{η - kσ < x < η + kσ} ≥ 1 - 1/k²    (4-115)

where equality holds iff the original masses consist of two equal point
masses.
We have thus shown that the probability masses p1 + p2 outside the
interval (η - kσ, η + kσ) are smaller than 1/k² for any k and for any F(x).
Note, in particular, that if 1/k² = .05, that is, if k = 4.47, then at most 5%
of all masses are outside the interval η ± 4.47σ regardless of the form of F(x).
Figure 4.44
b =f1 + ka
a =11- ka
a
.,
(a)
X
~
a
.,
{b)
1'--- .
b
X
··1 .,
I
a
{c)
1P2
b
..
.t
PROBLEMS
f(y)
c
13 1
=Aq
p
c
X
c
0
c
0
X
(b)
(a)
(c)
Figure 4.45
F(x). If F(x) is known, tighter bounds can be obtained; if, for example, xis
normal, then Lsee (4-50)) 5% of all masses are outside the interval., ::t 1.96u.
Markoff's Inequality  Suppose now that x ≥ 0. In this case, F(x) = 0 for x < 0; hence, all masses are on the positive axis. We shall show that a major proportion of the masses is concentrated in an interval (0, c) of the order of η.
Consider the point c = λη of Fig. 4.45 where λ is an arbitrary constant. If we remove all masses to the left of c (Fig. 4.45b) and replace the mass p = P{x ≥ λη} to the right of c by a point mass p at c, the moment with respect to the origin will decrease from η to pc; hence, pλη < η. This yields
P{x ≥ λη} ≤ 1/λ    (4-116)
for any λ and for any F(x).
Note that (4-114) is a special case of (4-116). Indeed, the RV (x - η)² is positive and its mean equals σ². Applying (4-116) to the RV (x - η)², we conclude with λ = k² that
P{|x - η|² ≥ k²σ²} ≤ 1/k²
and (4-114) results because the left side equals P{|x - η| ≥ kσ}.
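Both bounds are easy to probe numerically. The following sketch (illustrative only; the exponential model and the constants k and λ are assumptions) estimates the two probabilities by simulation and compares them with 1/k² and 1/λ.

```python
# Empirical check of the Tchebycheff bound (4-114) and the Markoff bound (4-116)
# for an exponential RV with eta = sigma = 1/c.
import random

c, n, k, lam = 0.5, 200_000, 2.0, 3.0
xs = [random.expovariate(c) for _ in range(n)]
eta = sigma = 1.0 / c
p_cheb = sum(abs(x - eta) >= k * sigma for x in xs) / n
p_mark = sum(x >= lam * eta for x in xs) / n
print(f"P{{|x - eta| >= {k}*sigma}} = {p_cheb:.4f}  <=  1/k^2 = {1 / k**2:.4f}")
print(f"P{{x >= {lam}*eta}} = {p_mark:.4f}  <=  1/lambda = {1 / lam:.4f}")
```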
Problems
4-1   The roll of two fair dice specifies an experiment with 36 outcomes f_i f_k. The RVs x and y are such that
      x(f_i f_k) = i + k        y(f_i f_k) = { . . . if i + k = 7;   . . . if i + k = 11 }
      and y(f_i f_k) = 0 otherwise. (a) Find the probabilities of the events {6.5 < x < 10}, {x > 2}, {5 < y < 25}, {y < 4}. (b) Sketch the distributions F_x(x), F_y(y) and the point densities f_x(x), f_y(y).
4-2   The RV x is normal with η = 10, σ = 2. Find the probabilities of the following events: {x > 14}, {8 < x < 14}, {|x - 10| < 2}, {x - 5 < 7 < x - 1}.
4-3   The SAT scores of students in a school district are modeled by a normal RV with η = 510 and σ = 50. Find the percentage of students scoring between 500 and 600.
4-4   The distribution of x equals F(x) = (1 - e^(-2x))U(x). Find P{x > .5}, P{.3 < x < .7}. Find the u-percentiles of x for u = .1, .2, . . . , .9.
4-5   Given F(5) = .940 and F(6) = .956, find the x_0.95 percentile of x using linear interpolation.
4-6   If F(x) = x/5 for 0 ≤ x ≤ 5, find f(x) for all x. Find the corresponding u-percentiles x_u for u = .1, .2, . . . , .9.
4-7   The following is a list of the observed values x_i of an RV x:
      -17 -9 -7 -6 -2 3 6 11 14 18 23 26 31 34 35 37 39 42 44 48 51 53 61
      Sketch the empirical distribution F_n(x) and the percentile curve x_u of x. Sketch the histogram f_h(x) for Δ = 5 and Δ = 10.
4-8   The life length of a pump is modeled by an RV x with distribution F(x) = (1 - e^(-cx))U(x). The median of x equals x_0.5 = 50 months. (a) Under a warranty, a pump is replaced if it fails within five months. Find the percentage of units that are replaced under the warranty. (b) How long should the warranty last if under the warranty only 3% of the units are replaced?
4-9   (a) Show that if f(x) = c²xe^(-cx)U(x), then F(x) = (1 - e^(-cx) - cxe^(-cx))U(x). (b) Find the distribution of the Erlang density (page 106).
4-10  The maximum electric power needed in a community is modeled by an RV x with density c²xe^(-cx)U(x) where c = 5 × 10^(-6) per kilowatt. (a) The power available is 10^6 kW; find the probability of a blackout. (b) Find the power needed so that the probability of a blackout is less than .005.
4-11  We wish to control the quality of resistors with nominal resistance R = 1,000 Ω. To do so, we measure all resistors and accept only the units with R between 960 and 1,040 Ω. Find the percentage of the units that are rejected (a) if R is modeled by an N(1000, 20) RV; (b) if R is modeled by an RV uniformly distributed in the interval (950, 1,050).
4-12  The probability that a manufactured product is defective equals p. We test each unit and denote by x the number of units tested until a defective one is found. (a) Find the distribution of x. (b) Find P{x > 30} if p = .05.
4-13  We receive a shipment of 200 units, of which 18 are defective. We test 10 units at random, and we model the number k of the units that test defective by an RV x. (a) Find the distribution of x. (b) Find P{x = 2}.
4-14  (Pascal distribution) The RV x equals the number of tosses of a coin until heads shows k times. Show that
      P{x = n} = C(n - 1, k - 1) p^k q^(n-k)    n = k, k + 1, . . .
4-15  The number of daily accidents in a region is a Poisson RV x with parameter a = 3. (a) Find the probability that there are no accidents in a day. (b) Find the probability that there are more than four accidents. (c) Sketch F(x).
4-16  Given an N(1, 2) RV x and the functions (Fig. P4.16)
      g₁(x) = { x, |x| ≤ 2;  0, |x| > 2 }
      g₂(x) = { -2, x < -2;  x, |x| ≤ 2;  2, x > 2 }
      g₃(x) = { -1, x < 0;  1, x > 0 }

      Figure P4.16

      we form the RVs y = g₁(x), z = g₂(x), w = g₃(x). (a) Find their distributions. (b) Find the probabilities of the events {y = 0}, {z = 0}, and {w = 0}.
4-17  The RV x is uniform in the interval (0, 6). Find and sketch the density of the RV y = -2x + 3.
4-18  The input to a system is the sum x = 10 - ν where ν is an N(0, 2) RV, and the output is y = x². Find f_y(y) and F_y(y).
4-19  Express the density f_y(y) of the RV y = g(x) in terms of the density f_x(x) of x for the following cases: (a) g(x) = x³; (b) g(x) = x²; (c) g(x) = |x| (full-wave rectifier); (d) g(x) = xU(x) (half-wave rectifier).
4-20  The base of a right triangle equals 5, and the adjacent angle is an RV θ uniformly distributed in the interval (0, π/4). Find the distribution of the length b = 5 tan θ of the opposite side.
4-21  The RV x is uniform in the interval (0, 1). Show that the density of the RV y = -ln x equals e^(-y)U(y).
4-22  Lognormal distribution. The RV x is N(η, σ) and y = e^x. Show that
      f_y(y) = [1/(σy√(2π))] exp{-(ln y - η)²/(2σ²)} U(y)
      This density is called lognormal.
4-23  Given an RV x with distribution F_x(x), we form the RV y = 2F_x(x) + 3. Find f_y(y) and F_y(y).
4-24  Find the constants a and b such that if y = ax + b and η_x = 5, σ_x = 2, then η_y = 0, σ_y = 1.
4-25  A fair die is rolled once, and x equals the number of faces up. Find η_x and σ_x.
4-26  We roll two fair dice and denote by x the number of rolls until 7 shows. Find E{x}.
4-27  A game uses two fair dice. To participate, you pay $20 per roll. You win $10 if even shows, $42 if 7 shows, and $102 if 11 shows. The game is fair if your expected gain is $20. Is the game fair?
4-28  Show that for any c, E{(x - c)²} = (η - c)² + σ_x².
4-29  The resistance of a resistor is an RV R with mean 1,000 Ω and standard deviation 200 Ω. It is connected to a voltage source V = 110 V. Find approximately the mean of the power P = V²/R dissipated in the resistor, using (4-113).
4-30  The RV x is uniform in the interval (9, 11) and y = x³. (a) Find f_y(y); (b) find η_x and σ_x; (c) find the mean of y using three methods: (i) directly from (4-93); (ii) indirectly from (4-94); (iii) approximately from (4-113).
4-31  The RV x is N(η, σ), and the function g(x) is nearly linear in the interval η - 3σ < x < η + 3σ with derivative g'(x). Show that the RV y = g(x) is nearly normal with mean η_y = g(η) and standard deviation σ_y = |g'(η)|σ.
4-32  Show that if f_x(x) = e^(-x)U(x) and y = √(2x), then f_y(y) = ye^(-y²/2)U(y).
5
Two Random Variables
Extending the concepts developed in Chapter 4 to two RVs, we associate a distribution, a density, an expected value, and a moment-generating function to the pair (x, y) where x and y are two RVs. In addition, we introduce the notion of independence of two RVs in terms of the independence of two events as defined in Chapter 2. Finally, we form the composite functions z = g(x, y), w = h(x, y) and express their joint distribution in terms of the joint distribution of the RVs x and y.
5-1
The Joint Distribution Function
We are given two RVs x and y defined on an experiment S. The properties of each of these RVs are completely specified in terms of their respective distributions F_x(x) and F_y(y). However, their joint properties, that is, the probability P{(x, y) ∈ D} that the point (x, y) is in an arbitrary region D of the plane, cannot generally be determined if we know only F_x(x) and F_y(y). To find this probability, we introduce the following concept.
• Definition. The joint cumulative distribution function or, simply, the joint distribution F_xy(x, y) of the RVs x and y is the probability of the event
{x ≤ x, y ≤ y} = {x ≤ x} ∩ {y ≤ y}
consisting of all outcomes ζ such that x(ζ) ≤ x and y(ζ) ≤ y. Thus
F_xy(x, y) = P{x ≤ x, y ≤ y}    (5-1)
The function F_xy(x, y) is defined for every x and y; its subscripts will often be omitted.
The probability P{(x, y) ∈ D} will be interpreted as mass in the region D. Clearly, F(x, y) is the probability that the point (x, y) is in the quadrant D₀ of Fig. 5.1; hence, it equals the mass in D₀. Guided by this, we conclude that the masses in the regions D₁, D₂, and D₃ are given by
P{x ≤ x, y₁ < y ≤ y₂} = F(x, y₂) - F(x, y₁)    (5-2)
P{x₁ < x ≤ x₂, y ≤ y} = F(x₂, y) - F(x₁, y)    (5-3)
P{x₁ < x ≤ x₂, y₁ < y ≤ y₂} = F(x₂, y₂) - F(x₂, y₁) - F(x₁, y₂) + F(x₁, y₁)    (5-4)
respectively.
JOINT DENSITY  If the RVs x and y are of continuous type, the probability masses in the xy plane can be described in terms of the function f(x, y) specified by
P{x < x ≤ x + dx, y < y ≤ y + dy} = f(x, y) dx dy    (5-5)
This is the probability that the RVs x and y are in a differential rectangle of area dx dy. The function f(x, y) will be called the joint density function of the RVs x and y.
From (5-5) it follows that the probability that (x, y) is a point in an arbitrary region D of the plane equals
P{(x, y) ∈ D} = ∬_D f(x, y) dx dy    (5-6)
Applying (5-6) to the region D₀ of Fig. 5.1, we obtain
F(x, y) = ∫_(-∞)^x ∫_(-∞)^y f(α, β) dα dβ    (5-7)
This yields
f(x, y) = ∂²F(x, y)/∂x ∂y    (5-8)
Note, finally, that
f(x, y) ≥ 0    ∫∫ f(x, y) dx dy = F(∞, ∞) = 1    (5-9)
This follows from (5-5) and the fact that {x ≤ ∞, y ≤ ∞} is the certain event.

Figure 5.1
MARGINAL DISTRIBUTIONS  In the study of the joint properties of several RVs, the distributions of each are called marginal. Thus F_x(x) is the marginal distribution and f_x(x) the marginal density of the RV x. As we show next, these functions can be expressed in terms of the joint distribution of x and y. We maintain that
F_x(x) = F_xy(x, ∞)    f_x(x) = ∫_(-∞)^∞ f_xy(x, y) dy    (5-10)
The first equation follows from the fact that {x ≤ x} = {x ≤ x, y ≤ ∞} is the event consisting of all outcomes such that the point (x, y) is on the left of the line L_x of Fig. 5.2a. To prove the second equation, we observe that f_x(x) dx = P{x ≤ x ≤ x + dx} is the probability mass in the shaded region ΔD of Fig. 5.2b. This yields
f_x(x) dx = ∬_ΔD f_xy(x, y) dx dy = dx ∫_(-∞)^∞ f_xy(x, y) dy
We can show similarly that
F_y(y) = F_xy(∞, y)    f_y(y) = ∫_(-∞)^∞ f_xy(x, y) dx    (5-11)

Figure 5.2
POINT AND LINE MASSES  Suppose that the RVs x and y are of discrete type, taking the values x_i and y_k, respectively, with joint probabilities
P{x = x_i, y = y_k} = f(x_i, y_k) = p_ik    (5-12)
In this case, the probability masses are at the points (x_i, y_k) and f(x_i, y_k) is a point density.
The marginal probabilities
P{x = x_i} = f_x(x_i) = p_i    P{y = y_k} = f_y(y_k) = q_k    (5-13)
can be expressed in terms of the joint probabilities p_ik. Clearly, P{x = x_i} equals the sum of the masses of all points on the vertical line x = x_i, and P{y = y_k} equals the sum of the masses of all points on the horizontal line y = y_k. Hence,
p_i = Σ_k p_ik    q_k = Σ_i p_ik    (5-14)
Note, finally, that
Σ_i p_i = Σ_k q_k = Σ_(i,k) p_ik = 1    (5-15)

Example 5.1
A fair die rolled twice defines an experiment with 36 outcomes f_i f_j.
(a) The RVs x and y equal the number of faces up at the first and second roll, respectively:
x(f_i f_j) = i    y(f_i f_j) = j    i, j = 1, . . . , 6    (5-16)
This yields the 36 points of Fig. 5.3a; the mass of each point equals 1/36.
(b) We now define x and y such that
x(f_i f_j) = |i - j|    y(f_i f_j) = i + j
Thus x takes the six values 0, 1, . . . , 5 and y takes the 11 values 2, 3, . . . , 12. The corresponding marginal probabilities equal
p_i = 6/36, 10/36, 8/36, 6/36, 4/36, 2/36
q_j = 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36

Figure 5.3

In this case, we have 21 points on the plane (Fig. 5.3b). There are six points on the line x = 0 with masses of 1/36; for example, the mass of the point (0, 4) equals 1/36 because {x = 0, y = 4} = {f₂f₂}. The masses of all other points equal 2/36; for example, the mass of the point (3, 5) equals 2/36 because {x = 3, y = 5} = {f₁f₄, f₄f₁}.  •
In addition to continuous and discrete type RVs, joint distributions can be of mixed type involving distributed masses, point masses, and line masses of various kinds. We comment next on two cases involving line masses only.
Suppose, first, that x is of discrete type, taking the values x_i, and y is of continuous type. In this case, all probability masses are on the vertical lines x = x_i (Fig. 5.4a). The mass between the points y₁ and y₂ of the line x = x_i equals the probability of the event {x = x_i, y₁ < y < y₂}. Suppose, finally, that x is of continuous type and y = g(x). If D is a region of the plane not containing any part of the curve y = g(x), the probability that (x, y) is in this region equals zero. From this it follows that all masses are on the curve y = g(x). In this case, the joint distribution F(x, y) can be expressed in terms of the marginal distribution F_x(x) and the function g(x). For example, with x and y as in Fig. 5.4b, F(x, y) equals the masses on the curve y = g(x) inside the shaded area. This includes the masses on the left of the point A and between the points B and C. Hence,
F(x, y) = F_x(x₁) + F_x(x) - F_x(x₂)
Figure 5.4

Independent RVs
As we recall [see (2-67)], two events A and B are independent if
P(A ∩ B) = P(A)P(B)
The notion of independence of two RVs is based on this.
• Definition. We shall say that the RVs x and y are statistically independent if the events {x ≤ x} and {y ≤ y} are independent, that is, if
P{x ≤ x, y ≤ y} = P{x ≤ x}P{y ≤ y}    (5-17)
for any x and y. This yields
F_xy(x, y) = F_x(x)F_y(y)    (5-18)
Differentiating, we obtain
f_xy(x, y) = f_x(x)f_y(y)    (5-19)
Thus two RVs are independent if they satisfy (5-17) or (5-18) or (5-19). Otherwise, they are "statistically dependent."
From the definition it follows that the events {x₁ ≤ x < x₂} and {y₁ ≤ y < y₂} are independent; hence, the mass in the rectangle D₃ of Fig. 5.1 equals the product of the masses in the vertical strip (x₁ ≤ x < x₂) times the masses in the horizontal strip (y₁ ≤ y < y₂). More generally, if A and B are two point sets on the x-axis and the y-axis, respectively, the events {x ∈ A} and {y ∈ B} are independent.
Applying this reasoning to the events {x = x_i} and {y = y_k}, we conclude that if x and y are two discrete type RVs as in (5-12) and independent, then
p_ik = p_i q_k    (5-20)
Note, finally, that if two RVs x and y are "functionally dependent," that is, if y = g(x), they cannot be (statistically) independent.
Independent Experiments  Independent RVs are generated primarily by combined experiments. Suppose that S₁ is an experiment with outcomes ζ₁ and S₂ another experiment with outcomes ζ₂. Proceeding as in Section 3-1, we form the combined experiment (product space) S = S₁ × S₂. The outcomes of this experiment are ζ₁ζ₂ where ζ₁ is any one of the elements of S₁ and ζ₂ is any one of the elements of S₂. In the experiment S, we define the RVs x and y such that x depends only on the outcomes of S₁ and y depends only on the outcomes of S₂. In other words, x(ζ₁ζ₂) depends only on ζ₁ and y(ζ₁ζ₂) depends only on ζ₂. We can show that if the experiments S₁ and S₂ are independent, the RVs x and y so formed are independent as well.
Consider, for example, the RVs x and y in (5-16). In this case, S₁ is the first roll of the die and S₂ the second. Furthermore, P{x = i} = 1/6, P{y = k} = 1/6. This yields
P{x = i, y = k} = 1/36 = P{x = i}P{y = k}
Hence, in the space S of the two independent rolls of a die, the RVs x and y are independent.
ILLUSTRATIONS  The RVs x and y are independent, x is uniform in the interval (0, a), and y is uniform in the interval (0, b):
f_x(x) = { 1/a, 0 ≤ x ≤ a;  0, otherwise }    f_y(y) = { 1/b, 0 ≤ y ≤ b;  0, otherwise }
Thus the probability that x is in the interval (x, x + Δx) equals Δx/a, and the probability that y is in the interval (y, y + Δy) equals Δy/b. From this and the independence of x and y it follows that the probability that the point (x, y) is in a rectangle with sides Δx and Δy included in the region R = {0 ≤ x ≤ a, 0 ≤ y ≤ b} equals ΔxΔy/ab. This leads to the conclusion that
f(x, y) = { 1/ab, 0 ≤ x ≤ a, 0 ≤ y ≤ b;  0, otherwise }    (5-21)
The probability that (x, y) is in a region D included in R equals the area of D divided by ab.
Example 5.2
A fine needle of length c is dropped "at random" on a board covered with parallel lines distance d apart (Fig. 5.5a). We wish to find the probability p that the needle intersects one of the lines. This experiment is called Buffon's needle.
We shall first explain the meaning of randomness: We denote by x the angle between the needle and the parallel lines and by y the distance from the center of the needle to the nearest line. We assume that the RVs x and y are independent and uniform in the intervals (0, π/2) and (0, d/2), respectively. Clearly, the needle intersects one of the lines iff
y ≤ (c/2) sin x

Figure 5.5

Hence, to find p, it suffices to find the area of the shaded region of Fig. 5.5b and use (5-21) with a = π/2 and b = d/2. Thus
p = (4/πd) ∫_0^(π/2) (c/2) sin x dx = 2c/πd
The Monte Carlo Method  From the empirical interpretation of probability it follows that if the needle is dropped n times, the number n_i of times it intersects one of the lines is approximately n_i ≈ np = 2nc/πd. This relationship is used to determine π empirically: if n is sufficiently large, π ≈ 2nc/(dn_i). Thus we can evaluate approximately the deterministic number π in terms of averages of numbers obtained from a random experiment. This technique, known as the Monte Carlo method, is used in performing empirically various mathematical operations, especially estimating complicated integrals in several variables (see Section 8-3).  •
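A minimal sketch of this procedure (the needle length, line spacing, and number of drops are arbitrary choices, not values from the text) is given below; it simulates the drops directly and recovers an estimate of π.

```python
# Monte Carlo sketch of Buffon's needle: estimate pi from the intersection count.
import math
import random

c, d, n = 1.0, 2.0, 1_000_000          # needle length c <= line spacing d
hits = 0
for _ in range(n):
    x = random.uniform(0.0, math.pi / 2)   # angle between needle and the lines
    y = random.uniform(0.0, d / 2)         # distance of the center to the nearest line
    if y <= (c / 2) * math.sin(x):         # intersection condition
        hits += 1
print("pi estimate:", 2 * n * c / (d * hits))
```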
The RVs x and y are independent and normal, with densities
f_x(x) = (1/σ√(2π)) e^(-x²/2σ²)    f_y(y) = (1/σ√(2π)) e^(-y²/2σ²)
In this case, (5-19) yields
f(x, y) = (1/2πσ²) e^(-(x²+y²)/2σ²)    (5-22)

Example 5.3
With f(x, y) as in (5-22), find the probability p that the point (x, y) is in the circle √(x² + y²) ≤ a.
As we see from (5-22) and (5-6),
P{√(x² + y²) ≤ a} = (1/2πσ²) ∬_(x²+y²≤a²) e^(-(x²+y²)/2σ²) dx dy
= (1/2πσ²) ∫_0^a 2πr e^(-r²/2σ²) dr = 1 - e^(-a²/2σ²)  •
Circular Symmetry  We shall say that a density function f(x, y) has circular symmetry if it depends only on the distance of the point (x, y) from the origin, that is, if
f(x, y) = φ(r)    r = √(x² + y²)    (5-23)
As we see from (5-22), if the RVs x and y are normal independent with zero mean and equal variance, their joint density has circular symmetry. We show next that the converse is also true.
• Theorem. If the RVs x and y are independent and their joint density is circularly symmetrical as in (5-23), they are normal with zero mean and equal variance.
This remarkable theorem is another example of the importance of the normal distribution. It shows that normality is a consequence of independence and circular symmetry, conditions that are met in many applications.
• Proof. From (5-23) and (5-19) it follows that
φ(r) = f_x(x)f_y(y)    r = √(x² + y²)    (5-24)
We shall show that this identity leads to the conclusion that the functions f_x(x) and f_y(y) are normal. Differentiating both sides with respect to x and using the identity
∂φ(r)/∂x = (dφ(r)/dr)(∂r/∂x) = (x/r)φ'(r)
we obtain
(x/r)φ'(r) = f_x'(x)f_y(y)    (5-25)
As we see from (5-24), xφ(r) = xf_x(x)f_y(y). From this and (5-25) it follows that
(1/r)(φ'(r)/φ(r)) = (1/x)(f_x'(x)/f_x(x))
The left side is a function of x and y only through r = √(x² + y²), and the right side is independent of y; hence, both sides are constant. Thus
(1/x)(f_x'(x)/f_x(x)) = a = constant    (d/dx) ln f_x(x) = ax    ln f_x(x) = ax²/2 + C    f_x(x) = c e^(ax²/2)
Since f_x(x) is a density, a < 0; hence, x is normal. Reasoning similarly, we conclude that y is also normal with the same a.
t"undions of Independent R\"S Suppose now that z is a function of the
and that w is a function of the RV y:
z
= g(x)
w = h~)
RV
x
(5-26)
We maintain that if the avs x andy are statistically independent, the avs z
and w are also statistically independent.
• Proof. We denote by A: the set of values of x such that g(x) s z and by B..,
the set of values y such that h(y) s w. From this it follows that
{z s z} = {x EAt}
{w s w} = {y E B.,..}
for any z and w. Since x andy are independent, the events {x E A4 } and {y E
B,..} are independent; hence,
P{z s z. w s w} = P{z s z}P{w s w}
(5-27)
Note, for example, that the avs x2 and y3 are independent.
5-2
Mean, Correlation, Moments
The properties of two RVs x and y are completely specified in terms of their joint distribution. In this section, we give a partial specification involving only a small number of parameters. As a preparation, we define the function g(x, y) of the RVs x and y and determine its mean.
Given a function g(x, y), we form the composite function
z = g(x, y)
This function is an RV as in Section 4-1, with domain the set S of experimental outcomes. For a specific outcome ζ_i ∈ S, the value z(ζ_i) of z equals g(x_i, y_i) where x_i and y_i are the corresponding values of x and y. We shall express the mean of z in terms of the joint density f(x, y) of the RVs x and y.
As we have shown in (4-94), the mean of the RV g(x) is given by an integral involving the density f_x(x). Expressing f_x(x) in terms of f(x, y) [see (5-10)], we obtain
E{g(x)} = ∫_(-∞)^∞ g(x)f_x(x) dx = ∫_(-∞)^∞ ∫_(-∞)^∞ g(x)f(x, y) dx dy    (5-28)
We shall obtain a similar expression for E{g(x, y)}.
The mean of the RV z = g(x, y) equals
η_z = ∫_(-∞)^∞ z f_z(z) dz    (5-29)
To find η_z, we must first find the density of z. We show next that, as in (5-28), this can be avoided.
• Theorem
E{g(x, y)} = ∫_(-∞)^∞ ∫_(-∞)^∞ g(x, y)f(x, y) dx dy    (5-30)
and if the RVs x and y are of discrete type as in (5-12),
E{g(x, y)} = Σ_(i,k) g(x_i, y_k)f(x_i, y_k)    (5-31)
• Proof. The theorem is an extension of (4-94) to two variables. To prove it, we denote by ΔD_z the region of the xy plane such that z < g(x, y) < z + dz. To each differential dz in (5-29) there corresponds a region ΔD_z in the xy plane where g(x, y) ≈ z and
P{z ≤ z ≤ z + dz} = P{(x, y) ∈ ΔD_z}
As dz covers the z-axis, the corresponding regions ΔD_z are nonoverlapping and they cover the entire xy plane. Hence, the integrals in (5-29) and (5-30) are equal.
It follows from (5-30) and (5-31) that
E{g₁(x, y) + · · · + g_n(x, y)} = E{g₁(x, y)} + · · · + E{g_n(x, y)}    (5-32)
as in (4-96) (linearity of expected values).
Note [see (5-28)] that
η_x = ∫∫ x f(x, y) dx dy    σ_x² = ∫∫ (x - η_x)² f(x, y) dx dy
Thus (η_x, η_y) is the center of gravity of the probability masses on the plane. The variances σ_x² and σ_y² are measures of the concentration of these masses near the lines x = η_x and y = η_y, respectively. We introduce next a fifth parameter that gives a measure of the linear dependence between x and y.
Covariance and Correlation
The covariance μ_xy of two RVs x and y is by definition the "mixed central moment":
Cov(x, y) = μ_xy = E{(x - η_x)(y - η_y)}    (5-33)
Since
E{(x - η_x)(y - η_y)} = E{xy} - η_x E{y} - η_y E{x} + η_x η_y = E{xy} - E{x}E{y}
(5-33) yields
μ_xy = E{xy} - E{x}E{y}    (5-34)
The ratio
r = μ_xy/(σ_x σ_y)    (5-35)
is called the correlation coefficient of the RVs x and y.
Note that r is the covariance of the centered and normalized RVs
x₀ = (x - η_x)/σ_x    y₀ = (y - η_y)/σ_y
Indeed,
E{x₀} = E{y₀} = 0    σ_x₀² = E{x₀²} = 1    σ_y₀² = E{y₀²} = 1
E{x₀y₀} = E{(x - η_x)(y - η_y)/(σ_x σ_y)} = μ_xy/(σ_x σ_y) = r
Uncorrelatedness  We shall say that the RVs x and y are uncorrelated if r = 0 or, equivalently, if
E{xy} = E{x}E{y}    (5-36)
We shall next express the mean and the variance of the sum of two RVs in terms of the five parameters η_x, η_y, σ_x, σ_y, μ_xy. Suppose that
z = ax + by
In this case, η_z = aη_x + bη_y and
E{(z - η_z)²} = E{[a(x - η_x) + b(y - η_y)]²}
Expanding the square and using the linearity of expected values, we obtain
σ_z² = a²σ_x² + b²σ_y² + 2ab μ_xy    (5-37)
Note, in particular, that if μ_xy = 0, then
σ_(x+y)² = σ_x² + σ_y²    (5-38)
Thus if two RVs are uncorrelated, the variance of their sum equals the sum of their variances.
Orthogonality  We shall say that two RVs x and y are orthogonal if
E{xy} = 0    (5-39)
In this case, E{(x + y)²} = E{x²} + E{y²}.
Orthogonality is closely related to uncorrelatedness: If the RVs x and y are uncorrelated, the centered RVs x - η_x and y - η_y are orthogonal.
Independence  Recall that two RVs x and y are independent iff
f(x, y) = f_x(x)f_y(y)    (5-40)
• Theorem. If two RVs are independent, they are uncorrelated.
• Proof. If (5-40) is true,
E{xy} = ∫∫ xy f(x, y) dx dy = ∫∫ xy f_x(x)f_y(y) dx dy = ∫ x f_x(x) dx ∫ y f_y(y) dy = E{x}E{y}
The converse is true for normal RVs (see page 163), but generally it is not true: if two RVs are uncorrelated, they are not necessarily independent. Independence is a much stronger property than uncorrelatedness. The first is a point property; the second involves only averages. The following is an illustration of the difference.
Given two independent RVs x and y, we form the RVs g(x) and h(y). As we know from (5-27), these RVs are also independent. From this and the theorem just given it follows that
E{g(x)h(y)} = E{g(x)}E{h(y)}    (5-41)
for any g(x) and h(y). This is not necessarily true if the RVs x and y are merely uncorrelated; functions of uncorrelated RVs are generally not uncorrelated.
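A standard illustration of the gap between the two notions (not taken from the text): with x uniform in (-1, 1) and y = x², the covariance is zero, yet y is completely determined by x. The sketch below verifies the zero covariance empirically.

```python
# Uncorrelated but dependent: x uniform on (-1, 1), y = x^2.
# E{xy} = E{x^3} = 0 = E{x}E{y}, yet y is a function of x.
import random

n = 200_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]
mx, my = sum(xs) / n, sum(ys) / n
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
print(f"cov(x, y) = {cov:.5f}  (about 0, yet y = x^2 is determined by x)")
```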
SCHWARZ'S INEQUALITY  We show next that the covariance of two RVs x and y cannot exceed the product σ_x σ_y. This is based on the following result, known as Schwarz's inequality:
E²{xy} ≤ E{x²}E{y²}    (5-42)
Equality holds iff
y = c₀x    (5-43)
• Proof. With c an arbitrary constant, we form the mean of the RV (y - cx)²:
I(c) = E{(y - cx)²} = ∫∫ (y - cx)² f(x, y) dx dy    (5-44)
Expanding the square and using the linearity of expected values, we obtain
I(c) = E{y²} - 2cE{xy} + c²E{x²}    (5-45)
Thus I(c) is a parabola (Fig. 5.6), and I(c) ≥ 0 for every c; hence, the parabola cannot cross the c-axis. Its discriminant D must therefore be negative:
D = 4E²{xy} - 4E{x²}E{y²} ≤ 0
and (5-42) results.
To prove (5-43), we observe that if D = 0, then I(c) has a real root c₀; that is,
I(c₀) = E{(y - c₀x)²} = 0
This is possible only if the RV y - c₀x equals zero [see (4-103)], as in (5-43).
• Corollary. Applying (5-42) to the RVs x - η_x and y - η_y, we conclude that
E²{(x - η_x)(y - η_y)} ≤ E{(x - η_x)²}E{(y - η_y)²}

Figure 5.6
Hence,
|μ_xy| ≤ σ_x σ_y    |r| ≤ 1
Furthermore [see (5-43)],
|r| = 1 iff y - η_y = c₀(x - η_x)    (5-46)
To find c₀, we multiply both sides of (5-46) by (x - η_x) and take expected values. This yields
E{(x - η_x)(y - η_y)} = c₀E{(x - η_x)²}
Hence,
c₀ = μ_xy/σ_x² = rσ_y/σ_x = { σ_y/σ_x if r = 1;  -σ_y/σ_x if r = -1 }
We have thus reached the conclusion that if |r| = 1, then y is a linear function of x and all probability masses are on the line
y - η_y = ±(σ_y/σ_x)(x - η_x)    r = ±1
As |r| decreases from 1 to 0, the masses move away from this line (Fig. 5.7).
Empirical Interpretation. We repeat the experiment n times and denote by x_i and y_i the observed values of the RVs x and y at the ith trial. At each of the n points (x_i, y_i), we place a point mass equal to 1/n. The resulting pattern is the empirical mass distribution of x and y. The joint empirical distribution F_n(x, y) of these RVs is the sum of all masses in the quadrant D₀ of Fig. 5.1. The joint empirical density (two-dimensional histogram) is a function f_n(x, y) such that the product Δ₁Δ₂ f_n(x, y) equals the number of masses in a rectangle with sides Δ₁ and Δ₂ containing the point (x, y). For large n,
F_n(x, y) ≈ F(x, y)    f_n(x, y) ≈ f(x, y)
as in (4-24) and (4-37).

Figure 5.7
We next form the arithmetic averages x̄ and ȳ as in (4-89) and the empirical estimates
σ̄_x² = (1/n) Σ_i (x_i - x̄)²    σ̄_y² = (1/n) Σ_i (y_i - ȳ)²    μ̄_xy = (1/n) Σ_i (x_i - x̄)(y_i - ȳ)
of the parameters σ_x², σ_y², and μ_xy. The point (x̄, ȳ) is the center of gravity of the empirical masses, σ̄_x² is the second moment with respect to the line x = x̄, and σ̄_y² is the second moment with respect to the line y = ȳ. The ratio μ̄_xy/σ̄_xσ̄_y is a measure of the concentration of the points near the line
y - ȳ = ±(σ̄_y/σ̄_x)(x - x̄)
This line approaches the line (5-46) as n → ∞. In Fig. 5.7 we show the empirical masses for various values of r.
LINEAR REGRESSION  Regression is the determination of a function y = φ(x) "fitting" the probability masses in the xy plane. This can be phrased as a problem in estimation: Find a function φ(x) such that, if the RV y is estimated by φ(x), the estimation error y - φ(x) is minimum in some sense. The general estimation problem is developed in Section 6-3 for two RVs and in Section 11-4 for an arbitrary number of RVs. In this section, we discuss the linear case; that is, we assume that φ(x) = a + bx.
We are given two RVs x and y, and we wish to find two constants a and b such that the line y = a + bx is the best fit of "y on x" in the sense of minimizing the mean-square value
e = E{[y - (a + bx)]²} = ∫∫ [y - (a + bx)]² f(x, y) dx dy    (5-48)
of the deviation v = y - (a + bx) of y from the straight line a + bx (Fig. 5.8).

Figure 5.8

Clearly, e is a function of a and b, and it is minimum if
∂e/∂a = -2 ∫∫ [y - (a + bx)] f(x, y) dx dy = -2E{y - (a + bx)} = 0    (5-49)
∂e/∂b = -2 ∫∫ [y - (a + bx)] x f(x, y) dx dy = -2E{[y - (a + bx)]x} = 0
The first equation yields a = η_y - bη_x. From this it follows that
y - (a + bx) = (y - η_y) - b(x - η_x)    E{(y - η_y) - b(x - η_x)} = 0
Inserting into the second equation in (5-49), we obtain
E{[(y - η_y) - b(x - η_x)](x - η_x)} = rσ_xσ_y - bσ_x² = 0
Hence, b = rσ_y/σ_x. We thus conclude that the linear LMS (least mean squares) fit of y on x is the line
y - η_y = r(σ_y/σ_x)(x - η_x)    (5-50)
known as the regression line (Fig. 5.8). This line passes through the point (η_x, η_y), and its slope equals rσ_y/σ_x. The LMS fit of "x on y" is the line
x - η_x = r(σ_x/σ_y)(y - η_y)
passing through the point (η_x, η_y) with slope σ_y/rσ_x. If |r| = 1, the two regression lines coincide with the line (5-46).
APPROXIMATE EVALUATION OF E{g(x, y)}  For the determination of the mean of g(x, y), knowledge of the joint density f(x, y) is required. We shall show that if g(x, y) is sufficiently smooth in the region D where f(x, y) takes significant values, E{g(x, y)} can be expressed in terms of the parameters η_x, η_y, σ_x, σ_y, and μ_xy.
Suppose, first, that we approximate g(x, y) by a plane:
g(x, y) ≈ g(η_x, η_y) + (x - η_x) ∂g/∂x + (y - η_y) ∂g/∂y
(all derivatives are evaluated at x = η_x and y = η_y). Since E{x - η_x} = 0 and E{y - η_y} = 0, it follows that
E{g(x, y)} ≈ g(η_x, η_y)    (5-51)
This estimate can be improved if we approximate g(x, y) by a quadratic surface:
g(x, y) ≈ g(η_x, η_y) + (x - η_x) ∂g/∂x + (y - η_y) ∂g/∂y + (1/2)(x - η_x)² ∂²g/∂x² + (x - η_x)(y - η_y) ∂²g/∂x∂y + (1/2)(y - η_y)² ∂²g/∂y²
This yields
E{g(x, y)} ≈ g(η_x, η_y) + (1/2)σ_x² ∂²g/∂x² + μ_xy ∂²g/∂x∂y + (1/2)σ_y² ∂²g/∂y²    (5-52)

Example 5.4
g(x, y) = x²y
Clearly,
∂²g/∂x² = 2y    ∂²g/∂x∂y = 2x    ∂²g/∂y² = 0
Inserting into (5-52), we obtain
E{x²y} ≈ η_x²η_y + η_yσ_x² + 2η_x rσ_xσ_y  •
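A numerical check of (5-52) for g(x, y) = x²y is sketched below (the means, deviations, and correlation are arbitrary choices, not values from the text); for jointly normal x and y the second-order formula in fact reproduces E{x²y} exactly in this case.

```python
# Compare the quadratic approximation (5-52) for g(x, y) = x^2 * y with simulation.
import math
import random

ex, ey, sx, sy, r, n = 2.0, 1.0, 0.5, 0.4, 0.6, 300_000
total = 0.0
for _ in range(n):
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    x = ex + sx * u
    y = ey + sy * (r * u + math.sqrt(1 - r * r) * v)   # y correlated with x (coeff r)
    total += x * x * y
approx = ex**2 * ey + ey * sx**2 + 2 * ex * r * sx * sy
print(f"simulated E{{x^2 y}} = {total / n:.4f}, approximation (5-52) = {approx:.4f}")
```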
Moments and Moment Functions
The first moment η and the variance σ² give a partial characterization of the distribution F(x) of an RV x. We shall introduce higher-order moments and use them to improve the specification of F(x).
• Definitions. The mean
m_n = E{x^n} = ∫_(-∞)^∞ x^n f(x) dx    (5-53)
of x^n is the moment of order n of the RV x. The mean
μ_n = E{(x - η)^n} = ∫_(-∞)^∞ (x - η)^n f(x) dx    (5-54)
is the central moment of order n.
Clearly,
m₀ = μ₀ = 1    m₁ = η    μ₁ = 0    μ₂ = σ²    m₂ = μ₂ + η²
Similarly, we define the absolute moments E{|x|^n} and the moments E{(x - a)^n} with respect to an arbitrary point a.
Note that if f(x) is even, then η = 0, m_n = μ_n, and μ_(2n-1) = 0; if f(x) is symmetrical about the point a, then η = a and μ_(2n-1) = 0.

Example 5.5
We shall show that the central moments of a normal RV equal
μ_n = { 1 × 3 · · · (n - 1)σ^n, n even;  0, n odd }    (5-55)
• Proof. Since f(x) is symmetrical about its mean, it suffices to find the moments of the centered density
f(x) = (1/σ√(2π)) e^(-x²/2σ²)
Differentiating the normal integral (3-55) k times with respect to a, we obtain
∫_(-∞)^∞ z^(2k) e^(-az²) dz = [1 × 3 · · · (2k - 1)/2^k] √(π/a^(2k+1))
With a = 1/2σ² and n = 2k, this equation yields
(1/σ√(2π)) ∫_(-∞)^∞ z^n e^(-z²/2σ²) dz = 1 × 3 · · · (n - 1)σ^n
And since μ_n = 0 for n odd, (5-55) results.  •
MOMENT-GENERATING FUNCTIONS  The moment-generating function, or, simply, the moment function, of an RV x is by definition the mean of the function e^(sx). This function is denoted by Φ(s), and it is defined for every s for which E{e^(sx)} exists. Thus
Φ(s) = E{e^(sx)} = ∫_(-∞)^∞ e^(sx) f(x) dx    (5-56)
and for discrete type RVs
Φ(s) = Σ_k e^(sx_k) f(x_k)    (5-57)
From the definition it follows that
E{e^((ax+b)s)} = e^(bs) E{e^(asx)}
Hence, the moment function Φ_y(s) of the RV y = ax + b equals
Φ_y(s) = E{e^(s(ax+b))} = e^(bs) Φ_x(as)    (5-58)

Example 5.6
We shall show that the moment function of an N(η, σ) RV equals
Φ(s) = e^(ηs) e^(σ²s²/2)    (5-59)
• Proof. We shall find first the moment function Φ₀(s) of the RV x₀ = (x - η)/σ. Clearly, x₀ is N(0, 1); hence,
Φ₀(s) = (1/√(2π)) ∫_(-∞)^∞ e^(sx) e^(-x²/2) dx
Inserting the identity
sx - x²/2 = -(1/2)(x - s)² + s²/2
into the integral, we obtain
Φ₀(s) = e^(s²/2) ∫_(-∞)^∞ (1/√(2π)) e^(-(x-s)²/2) dx = e^(s²/2)
because the last integral equals 1. And since x = η + σx₀, (5-59) follows from (5-58).  •

Example 5.7
The RV x is Poisson-distributed with parameter a:
P{x = k} = e^(-a) a^k/k!    k = 0, 1, . . .
Inserting into (5-57), we obtain
Φ(s) = e^(-a) Σ_(k=0)^∞ (a^k/k!) e^(ks) = e^(-a) e^(a e^s)    (5-60)  •
We shall now relate the moments m_n of the RV x to the derivatives at the origin of its moment function. Other applications of the moment function include the evaluation of the density of certain functions of x and the determination of the density of the sum of independent RVs.
• Moment Theorem. We maintain that
E{x^n} = m_n = Φ^(n)(0)    (5-61)
• Proof. Differentiating (5-56) and (5-57) n times with respect to s, we obtain
Φ^(n)(s) = ∫_(-∞)^∞ x^n e^(sx) f(x) dx    and    Φ^(n)(s) = Σ_k x_k^n e^(sx_k) f(x_k)
for continuous and discrete type RVs, respectively. With s = 0, (5-61) follows from (5-53).
• Corollary. With n = 0, 1, and 2, the theorem yields Φ(0) = 1,
Φ'(0) = m₁ = η    Φ''(0) = m₂ = η² + σ²    (5-62)
We shall use (5-62) to determine the mean and the variance of the gamma and the binomial distribution.

Example 5.8
The moment-generating function of the gamma distribution
f(x) = γx^(b-1) e^(-cx) U(x)
equals
Φ(s) = γ ∫_0^∞ x^(b-1) e^(-(c-s)x) dx = γΓ(b)/(c - s)^b = c^b/(c - s)^b    (5-63)
[see (4-54)]. Thus
Φ^(n)(s) = b(b + 1) · · · (b + n - 1) c^b/(c - s)^(b+n)
Hence,
Φ^(n)(0) = E{x^n} = b(b + 1) · · · (b + n - 1)/c^n    (5-64)
With n = 1 and n = 2, this yields
E{x} = b/c    E{x²} = b(b + 1)/c²    σ² = b/c²    (5-65)  •

Example 5.9
The RV x takes the values 0, 1, . . . , n with
P{x = k} = C(n, k) p^k q^(n-k)
In this case,
Φ(s) = Σ_(k=0)^n C(n, k) (pe^s)^k q^(n-k) = (pe^s + q)^n    (5-66)
Clearly,
Φ'(s) = n(pe^s + q)^(n-1) pe^s
Φ''(s) = n(n - 1)(pe^s + q)^(n-2) p²e^(2s) + n(pe^s + q)^(n-1) pe^s
Φ'(0) = np    Φ''(0) = n²p² - np² + np
Hence,
η = np    σ² = npq    (5-67)  •
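The moment theorem lends itself to a simple numerical experiment: approximate Φ'(0) and Φ''(0) by finite differences of the binomial moment function and recover η = np and σ² = npq. The sketch below (parameters chosen arbitrarily, not from the text) does exactly that.

```python
# Moment theorem (5-61) applied numerically to Phi(s) = (p*e^s + q)^n.
import math

n, p = 20, 0.3
q, h = 1 - p, 1e-4
phi = lambda s: (p * math.exp(s) + q) ** n
m1 = (phi(h) - phi(-h)) / (2 * h)                  # Phi'(0)  = eta
m2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2      # Phi''(0) = eta^2 + sigma^2
print(f"eta ~ {m1:.4f}   (np  = {n * p})")
print(f"var ~ {m2 - m1**2:.4f}   (npq = {n * p * q})")
```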
The integral in (5-56) is also called the Laplace transform of the function f(x). For an arbitrary f(x), this integral exists in a vertical strip R of the complex s plane. If f(x) is a density, R contains the jω-axis; hence, the function
Φ(jω) = ∫_(-∞)^∞ e^(jωx) f(x) dx    (5-68)
exists for every real ω. In probability theory, Φ(jω) is called the characteristic function of the RV x. In general, it is called the Fourier transform of the function f(x).
It can be shown that
f(x) = (1/2π) ∫_(-∞)^∞ e^(-jωx) Φ(jω) dω    (5-69)
This important result, known as the inversion formula, shows that f(x) is uniquely determined in terms of its moment function.
Note, finally, that f(x) can be determined in terms of its moments. Indeed, the moments m_n equal the derivatives of Φ(s) at the origin, and Φ(s) is determined in terms of these derivatives if its Taylor expansion converges for every s. Inserting the resulting Φ(jω) into (5-69), we obtain f(x).
JOINT MOMENTS  Proceeding as in (5-53) and (5-54), we introduce the joint moments
m_kr = E{x^k y^r} = ∫∫ x^k y^r f(x, y) dx dy    (5-70)
and the joint central moments
μ_kr = E{(x - η_x)^k (y - η_y)^r} = ∫∫ (x - η_x)^k (y - η_y)^r f(x, y) dx dy    (5-71)
Clearly,
m₁₀ = η_x    m₀₁ = η_y    m₂₀ = η_x² + σ_x²    m₀₂ = η_y² + σ_y²
μ₁₀ = 0    μ₀₁ = 0    μ₁₁ = μ_xy    μ₂₀ = σ_x²    μ₀₂ = σ_y²
The joint moment function Φ(s₁, s₂) of the RVs x and y is by definition
Φ(s₁, s₂) = E{e^(s₁x + s₂y)} = ∫∫ e^(s₁x + s₂y) f(x, y) dx dy    (5-72)
Repeated differentiation yields
∂^(k+r) Φ(s₁, s₂)/∂s₁^k ∂s₂^r evaluated at s₁ = s₂ = 0 equals E{x^k y^r}    (5-73)
This is the two-dimensional form of the moment theorem (5-61).
Denoting by Φ_x(s) and Φ_y(s) the moment functions of x and y, we obtain the following relationship between marginal and joint moment functions:
Φ_x(s₁) = E{e^(s₁x)} = Φ(s₁, 0)    Φ_y(s₂) = E{e^(s₂y)} = Φ(0, s₂)    (5-74)
Note, finally, that if the RVs x and y are independent, the RVs e^(s₁x) and e^(s₂y) are independent [see (5-41)]. From this it follows that
Φ(s₁, s₂) = E{e^(s₁x)}E{e^(s₂y)} = Φ_x(s₁)Φ_y(s₂)    (5-75)

Example 5.10
From (5-75) and (5-59) it follows that if the RVs x and y are normal and independent, with zero mean, then
Φ(s₁, s₂) = exp{(1/2)(σ_x²s₁² + σ_y²s₂²)}    (5-76)  •
5-3
Functions of Two Random Variables
Given two RVs x and y and two functions g(x, y) and h(x, y), we form the functions
z = g(x, y)    w = h(x, y)
These functions are composite with domain the set S; that is, they are RVs. We shall express their joint distribution in terms of the joint distribution of the RVs x and y.
We start with the determination of the marginal distribution F_z(z) of the RV z. As we know, F_z(z) is the probability of the event {z ≤ z}. To find F_z(z), it suffices, therefore, to express the event {z ≤ z} in terms of the RVs x and y. To do so, we introduce the region D_z of the xy plane such that g(x, y) ≤ z (Fig. 5.9). Clearly, z ≤ z iff g(x, y) ≤ z; hence,
P{z ≤ z} = P{g(x, y) ≤ z} = P{(x, y) ∈ D_z}
The region D_z is the projection on the xy plane of the part of the g(x, y) surface below the plane z = constant; the function F_z(z) equals the probability masses in that region.
Example 5.11
The RVs x and y are normal and independent with joint density
f_xy(x, y) = (1/2πσ²) e^(-(x²+y²)/2σ²)
We shall determine the distribution of the RV
z = +√(x² + y²)

Figure 5.9

If z ≥ 0, then D_z is the circle g(x, y) = √(x² + y²) ≤ z shown in Fig. 5.9. Hence (see Example 5.3),
F_z(z) = (1/2πσ²) ∬_(D_z) e^(-(x²+y²)/2σ²) dx dy = 1 - e^(-z²/2σ²)
If z < 0, then {z ≤ z} = ∅; hence, F_z(z) = 0.
Rayleigh Density. Differentiating F_z(z), we obtain the function
f_z(z) = (z/σ²) e^(-z²/2σ²) U(z)    (5-77)
known as the Rayleigh density.  •
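The Rayleigh result can be checked by simulation. The sketch below (σ and the test point z₀ are arbitrary choices, not values from the text) compares the empirical value of F_z(z₀) with 1 - e^(-z₀²/2σ²).

```python
# Empirical check of the Rayleigh distribution of z = sqrt(x^2 + y^2).
import math
import random

sigma, n, z0 = 1.5, 200_000, 2.0
count = 0
for _ in range(n):
    x, y = random.gauss(0, sigma), random.gauss(0, sigma)
    if math.hypot(x, y) <= z0:
        count += 1
print(f"empirical F_z({z0}) = {count / n:.4f}, "
      f"theory = {1 - math.exp(-z0**2 / (2 * sigma**2)):.4f}")
```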
Example 5.12
(a) z = max(x, y)
The region D_z of the xy plane such that max(x, y) ≤ z is the set of points such that x ≤ z and y ≤ z (shaded in Fig. 5.10). The probability masses in that region equal F_xy(z, z). Hence,
F_z(z) = F_xy(z, z)    (5-78)
(b) z = min(x, y)
The region D_z of the xy plane such that min(x, y) ≤ z is the set of points such that x ≤ z or y ≤ z (shaded in Fig. 5.10). The probability masses in that region equal the masses F_x(z) to the left of the vertical line x = z, plus the masses F_y(z) below the horizontal line y = z, minus the masses F_xy(z, z) in the quadrant (x ≤ z, y ≤ z). Hence,
F_z(z) = F_x(z) + F_y(z) - F_xy(z, z)    (5-79)  •
Joint Density of g(x, y) and h(x, y)
We shall express the joint density f_zw(z, w) of the RVs z = g(x, y) and w = h(x, y) in terms of the joint density f_xy(x, y) of the RVs x and y. The following theorem is an extension of (4-78) to two RVs, and it involves the Jacobian J(x, y) of the transformation
z = g(x, y)    w = h(x, y)    (5-80)

Figure 5.10
The function J(x, y) is by definition the determinant
J(x, y) = det [ ∂g(x, y)/∂x   ∂g(x, y)/∂y ;  ∂h(x, y)/∂x   ∂h(x, y)/∂y ]    (5-81)
and it is used in the determination of two-dimensional integrals involving change of variables.
• Fundamental Theorem. To find f_zw(z, w), we solve the system (5-80) for x and y. If this system has no real solutions in some region of the zw plane, f_zw(z, w) = 0 for every (z, w) in that region. Suppose, then, that (5-80) has one or more solutions (x_i, y_i), that is,
g(x_i, y_i) = z    h(x_i, y_i) = w    (5-82)
In this case,
f_zw(z, w) = f_xy(x₁, y₁)/|J(x₁, y₁)| + · · · + f_xy(x_i, y_i)/|J(x_i, y_i)| + · · ·    (5-83)
where (x_i, y_i) are all pairs that satisfy (5-82). The number of such pairs and their values depend, of course, on the particular values of z and w.
• Proof. The system (5-82) transforms the differential rectangle A of Fig. 5.11a into one or more differential parallelograms B_i of Fig. 5.11b. As we know from calculus, the area |B_i| of the ith parallelogram equals the area |A| = dz dw of the rectangle, divided by |J(x_i, y_i)|. Thus
P{z ≤ z ≤ z + dz, w ≤ w ≤ w + dw} = f_zw(z, w) dz dw
P{(x, y) ∈ B_i} = f_xy(x_i, y_i) |J^(-1)(x_i, y_i)| dz dw    (5-84)
And since the event {z ≤ z ≤ z + dz, w ≤ w ≤ w + dw} is the union of the disjoint events {(x_i, y_i) ∈ B_i}, (5-83) follows from (5-84).
With f_zw(z, w) so determined, the marginal densities f_z(z) and f_w(w) can be obtained as in (5-10).
Figure 5.11

Auxiliary Variable  We can use the preceding theorem to find the density f_z(z) of one function z = g(x, y) of two RVs x and y. To do so, we introduce an auxiliary RV w = h(x, y), we determine the joint density f_zw(z, w) from (5-83), and we find the density f_z(z) from (5-10):
f_z(z) = ∫_(-∞)^∞ f_zw(z, w) dw    (5-85)
The variable w is selected at our convenience. For example, we can set w = x or w = y. We continue with several illustrations.
Linear Transformations  Suppose first that
z = ax + b    w = cy + d
This system has a single solution
x₁ = (z - b)/a    y₁ = (w - d)/c
for any z and w, and its Jacobian equals ac. Inserting into (5-83), we obtain
f_zw(z, w) = (1/|ac|) f_xy((z - b)/a, (w - d)/c)    (5-86)
Suppose, next, that
z = a₁x + b₁y    w = a₂x + b₂y    (5-87)
In this case, the Jacobian equals
J(x, y) = a₁b₂ - a₂b₁ = D
If D = 0, then w = a₂z/a₁; hence, the joint distribution of z and w consists of line masses and can be expressed in terms of F_z(z). It suffices, therefore, to assume that D ≠ 0. With this assumption, we have a single solution
x₁ = (b₂z - b₁w)/D    y₁ = (-a₂z + a₁w)/D
and (5-83) yields
f_zw(z, w) = f_xy((b₂z - b₁w)/D, (-a₂z + a₁w)/D) · (1/|D|)    (5-88)
DISTRIBUTION OF x + y  An important special case is the determination of the distribution of the sum z = x + y. Introducing the auxiliary variable w = y, we obtain from (5-88) with a₁ = b₁ = 1, a₂ = 0, b₂ = 1,
f_zw(z, w) = f_xy(z - w, w)    (5-89)
Hence,
f_z(z) = ∫_(-∞)^∞ f_xy(z - w, w) dw    (5-90)
This result has the following mass interpretation. Clearly, z ≤ z ≤ z + dz iff z ≤ x + y ≤ z + dz, that is, iff the point (x, y) is in the shaded region ΔD_z of Fig. 5.12. Hence,
f_z(z) dz = P{z ≤ x + y ≤ z + dz} = P{(x, y) ∈ ΔD_z}    (5-91)
Thus to find f_z(z), it suffices to find the masses in the strip ΔD_z.
For discrete type RVs, we proceed similarly. Suppose that x and y take the values x_i and y_k, respectively. In this case, their sum z = x + y takes the values z_r = x_i + y_k, and P{z = z_r} equals the sum of all point masses on the line x + y = z_r.

Figure 5.12    Figure 5.13

Example 5.13
The RVs x and y are independent, taking the values 1, 2, . . . , 6 with probability 1/6 (two fair dice rolled once). In this case, there are 36 point masses on the xy plane, as in Fig. 5.13, and each mass equals 1/36. The sum z = x + y takes the values 2, 3, . . . , 12. On the line x + y = 5, there are four point masses; hence, P{z = 5} = 4/36. On the line x + y = 12, there is a single point; hence, P{z = 12} = 1/36.  •
The determination of the density of the sum of two independent RVs is often simplified if we use moment functions. As we know, if the RVs x and y are independent, the RVs e^(sx) and e^(sy) are also independent; hence [see (5-75)],
E{e^(s(x+y))} = E{e^(sx)}E{e^(sy)}
From this it follows that
Φ_(x+y)(s) = Φ_x(s)Φ_y(s)    (5-92)
Thus the moment function of the sum of two independent RVs equals the product of their moment functions.
Example 5.14
If the RVs x and y are normal, then [see (5-59)]
Φ_x(s) = exp{η_x s + (1/2)σ_x²s²}    Φ_y(s) = exp{η_y s + (1/2)σ_y²s²}
Inserting into (5-92), we obtain
Φ_(x+y)(s) = exp{(η_x + η_y)s + (1/2)(σ_x² + σ_y²)s²}    (5-93)
Clearly, (5-93) is the moment function of a normal RV; hence, the sum of two normal independent RVs is normal with η_z = η_x + η_y, σ_z² = σ_x² + σ_y².  •
Example 5.15
If the RVs x and y are Poisson, then [see (5-60)]
Φ_x(s) = exp{a_x(e^s - 1)}    Φ_y(s) = exp{a_y(e^s - 1)}
Inserting into (5-92), we obtain
Φ_(x+y)(s) = exp{(a_x + a_y)(e^s - 1)}    (5-94)
Thus the sum of two independent Poisson RVs with parameters a_x and a_y, respectively, is a Poisson RV with parameter a_x + a_y.  •

Example 5.16
If the RVs x and y have a binomial distribution with the same p, then [see (5-66)]
Φ_x(s) = (pe^s + q)^m    Φ_y(s) = (pe^s + q)^n
Hence,
Φ_(x+y)(s) = (pe^s + q)^(m+n)    (5-95)
Thus the sum of two independent binomial RVs of order m and n, respectively, and with the same p is a binomial RV of order m + n.  •
Independence and Convolution  We shall determine the density of the sum z = x + y of two independent continuous type RVs. In this case,
f_xy(x, y) = f_x(x)f_y(y)
Inserting into (5-90), we obtain
f_z(z) = ∫_(-∞)^∞ f_x(z - w)f_y(w) dw    (5-96)
This integral is called the convolution of the functions f_x(x) and f_y(y). We have thus reached the important conclusion that the density of the sum of two independent RVs equals the convolution of their respective densities.
Combining (5-96) and (5-92), we obtain the following mathematical result, known as the convolution theorem: The moment function Φ_z(s) of the convolution f_z(z) of two densities f_x(x) and f_y(y) equals the product of their moment functions Φ_x(s) and Φ_y(s).
Clearly, if the range of x is the interval (a, b) and the range of y the interval (c, d), the range of their sum z = x + y is the interval (a + c, b + d). From this it follows that if the functions f_x(x) and f_y(y) equal zero outside the intervals (a, b) and (c, d), respectively, their convolution f_z(z) equals zero outside the interval (a + c, b + d). Note in particular that if
f_x(x) = 0 for x < 0    and    f_y(y) = 0 for y < 0
then
f_z(z) = { ∫_0^z f_x(z - w)f_y(w) dw, z > 0;  0, z < 0 }    (5-97)
because f_x(z - w) = 0 for w > z.

Example 5.17
From (5-97) it follows that if the RVs x and y are independent with densities
f_x(x) = αe^(-αx)U(x)    f_y(y) = αe^(-αy)U(y)
then the density of their sum equals
f_z(z) = α²U(z) ∫_0^z e^(-α(z-w)) e^(-αw) dw = α²ze^(-αz)U(z)  •
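The convolution result of Example 5.17 can be confirmed with a short simulation (α and the histogram bin are arbitrary choices, not values from the text):

```python
# The sum of two independent exponential RVs has density alpha^2 * z * exp(-alpha*z)U(z).
import math
import random

alpha, n, z0, dz = 2.0, 300_000, 1.0, 0.05
zs = [random.expovariate(alpha) + random.expovariate(alpha) for _ in range(n)]
empirical = sum(z0 <= z < z0 + dz for z in zs) / (n * dz)   # histogram estimate of f_z(z0)
theory = alpha**2 * z0 * math.exp(-alpha * z0)
print(f"f_z({z0}): empirical ~ {empirical:.3f}, theory = {theory:.3f}")
```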
In the following example, we use the mass interpretation of f_z(z) (see Fig. 5.12) to facilitate the evaluation of the convolution integral.

Example 5.18
The RVs x and y are independent, and each is uniformly distributed in the interval (0, c). We shall show that the density of their sum is a triangle as in Fig. 5.14.
Clearly, f(x, y) = 1/c² for every (x, y) in the square S (shaded area) and f(x, y) = 0 elsewhere. From this it follows that f_z(z) dz equals |ΔD_z|/c² where |ΔD_z| is the area of the region z ≤ x + y ≤ z + dz inside the square S [see (5-91)]. As we see from Fig. 5.14,
|ΔD_z| = { z dz, 0 ≤ z ≤ c;  (2c - z) dz, c < z < 2c }
Hence,
f_z(z) = { z/c², 0 ≤ z ≤ c;  (2c - z)/c², c < z < 2c }  •

Figure 5.14

JOINTLY NORMAL DISTRIBUTIONS  We define joint normality in terms of the normality of a single RV.
• Definition. Two RVs x and y are jointly normal iff the sum
z = ax + by
is normal for every a and b.
We show next that this definition specifies completely the moment-generating function
Φ_xy(s₁, s₂) = E{e^(s₁x + s₂y)}
of x and y. We assume for simplicity that E{x} = E{y} = 0.
• Theorem. Two RVs x and y with zero mean are jointly normal iff
Φ_xy(s₁, s₂) = exp{(1/2)(σ₁²s₁² + 2rσ₁σ₂s₁s₂ + σ₂²s₂²)}    (5-98)
where σ₁ and σ₂ are the standard deviations of x and y, respectively, and r is their correlation coefficient.
• Proof. The moment function of the RV z = ax + by equals [see (5-59)]
Φ_z(s) = E{e^(sz)} = exp{(1/2)σ_z²s²}
where
σ_z² = a²σ₁² + 2abrσ₁σ₂ + b²σ₂²
Furthermore,
Φ_z(1) = E{e^(ax+by)} = Φ_xy(a, b)    Φ_z(1) = exp{(1/2)σ_z²} = exp{(1/2)(a²σ₁² + 2abrσ₁σ₂ + b²σ₂²)}
Setting a = s₁ and b = s₂, we obtain (5-98).
Conversely, if Φ_xy(s₁, s₂) is given by (5-98) and z = ax + by, then
Φ_z(s) = Φ_xy(as, bs) = exp{(s²/2)(a²σ₁² + 2abrσ₁σ₂ + b²σ₂²)}
This shows that z is a normal RV; hence (definition), the RVs x and y are jointly normal.
Joint Density  From (5-98) and (5-72) it follows that
f_xy(x, y) = [1/(2πσ₁σ₂√(1 - r²))] exp{-[1/(2(1 - r²))](x²/σ₁² - 2rxy/σ₁σ₂ + y²/σ₂²)}    (5-99)
The proof is involved.
Joint normality can be defined directly: Two RVs x and y are jointly normal if their joint density equals e^(-Q(x, y)) where Q(x, y) is a positive quadratic function of x and y, that is,
Q(x, y) = c₁x² + c₂xy + c₃y² + c₄x + c₅y + c₆ ≥ 0
Expressing the five parameters η₁, η₂, σ₁, σ₂, and r in terms of c_i and using the fact that the integral of f(x, y) equals 1, we obtain
f_xy(x, y) = (1/2πD) exp{-(1/2D²)[σ₂²(x - η₁)² - 2rσ₁σ₂(x - η₁)(y - η₂) + σ₁²(y - η₂)²]}    (5-100)
where D = σ₁σ₂√(1 - r²). Note that (5-100) reduces to (5-99) if η₁ = η₂ = 0.
Uncorrelatedness and Independence  It follows from (5-100) that if r = 0, then
f_xy(x, y) = f_x(x)f_y(y)
This shows that if two RVs are jointly normal and uncorrelated, they are independent.
Marginal Normality  It follows from (5-98), or directly from the definition with a = 1 and b = 0, that if two RVs are jointly normal, they are marginally normal.
Linear Transformations  If
z = ax + by    w = cx + dy
then [see (5-88)]
Q₁(z, w) = Q((dz - bw)/(ad - bc), (-cz + aw)/(ad - bc))
If Q(x, y) is a quadratic function of x and y, Q₁(z, w) is a quadratic function of z and w. This leads to the following conclusion: If two RVs z and w are linear functions of two jointly normal RVs x and y, they are jointly normal. To find their joint density, it suffices therefore to know the five parameters η_z, η_w, σ_z, σ_w, and r_zw.
Example 5.19
The RVs x and y are normal with
η_x = 0    η_y = 2    σ_x = 1    σ_y = 2    r = 3/8
Find the density of the RV z = 2x + 3y.
As we know, z is N(η_z, σ_z) with
η_z = 2η_x + 3η_y = 6    σ_z² = 4σ_x² + 9σ_y² + 12rσ_xσ_y = 49  •
GENERAL TRANSFORMATIONS  We conclude with an illustration of the fundamental theorem (5-83). We shall determine the distribution of the RV
z = x/y
using as auxiliary variable the RV w = y. The system
z = x/y    w = y
has a single solution x₁ = zw, y₁ = w, and its Jacobian equals
J(x, y) = det [ 1/y  -x/y² ;  0  1 ] = 1/y
Inserting into (5-83), we obtain
f_zw(z, w) = |w| f_xy(zw, w)
and (5-85) yields
f_z(z) = ∫_(-∞)^∞ |w| f_xy(zw, w) dw    (5-101)

Example 5.20
The RVs x and y are jointly normal with zero mean as in (5-99). We shall show that their ratio z = x/y has a Cauchy density centered at z = rσ₁/σ₂:
f_z(z) = [σ₁σ₂√(1 - r²)/π] / [σ₂²(z - rσ₁/σ₂)² + σ₁²(1 - r²)]    (5-102)
• Proof. Since f_xy(-x, -y) = f_xy(x, y), it follows from (5-101) that
f_z(z) = [1/(πσ₁σ₂√(1 - r²))] ∫_0^∞ w exp{-[w²/2(1 - r²)](z²/σ₁² - 2rz/σ₁σ₂ + 1/σ₂²)} dw
With
A = [1/(1 - r²)](z²/σ₁² - 2rz/σ₁σ₂ + 1/σ₂²)
the integral equals
∫_0^∞ w e^(-Aw²/2) dw = 1/A
and (5-102) follows.
Integrating (5-102), we obtain the distribution
F_z(z) = ∫_(-∞)^z f_z(α) dα = 1/2 + (1/π) arctan[(σ₂z - rσ₁)/(σ₁√(1 - r²))]    (5-103)  •
Problems
5-1   If f(x, y) = γe^(-2x-8y)U(x)U(y), find: (a) the marginal densities f_x(x) and f_y(y); (b) the constant γ; (c) the probability p = P{x ≤ .5, y ≤ .5}.
5-2   (a) Express F_xy(x, y) in terms of F_x(x) if: y = 2x; y = -2x; y = x². (b) Express the probability P{x ≤ x, y > y} in terms of F_xy(x, y).
5-3   The RVs x and y are of discrete type and independent, taking the values x = n, n = 0, . . . , 3 and y = m, m = 0, . . . , 5 with P{x = n} = 1/4, P{y = m} = 1/6. Find and sketch the point density of their sum z = x + y.
5-4   Show that
      F(x, y) ≤ [F_x(x) + F_y(y)]/2
5-5   (a) Show that if the RVs x and y are independent and F_x(w) = F_y(w) for every w, then P{x ≤ y} = P{x ≥ y}. (b) Show that if the RVs y and z = x/y are independent, then E{x³/y³} = E{x³}/E{y³}.
5-6   Show that if η_x = η_y, σ_x = σ_y, and r_xy = 1, then x = y in probability.
5-7   Show that if m_n = E{x^n} and μ_n = E{(x - η)^n}, then
      m_n = Σ_(k=0)^n C(n, k) μ_k η^(n-k)
5-8   Show that if the RV x is N(0, σ), then
      E{|x|^n} = { 1 × 3 · · · (n - 1)σ^n, n = 2k;   2^k k! σ^(2k+1) √(2/π), n = 2k + 1 }
      Find the mean and the variance of the RV y = x².
5-9   Give the empirical interpretation: (a) of the identity E{x + y} = E{x} + E{y}; (b) of the fact that, in general, E{xy} ≠ E{x}E{y}.
5-10  Show that if z = ax + b, w = cy + d, and ac ≠ 0, then r_zw = r_xy.
5-11  Show that if the RVs x and y are jointly normal with zero mean and equal variance, the RVs z = x + y and w = x - y are independent.
5-12  Show that if the RVs x₁ and x₂ are N(η, σ) and independent, the RVs
      x̄ = (x₁ + x₂)/2    y = [(x₁ - x̄)² + (x₂ - x̄)²]/2
      are independent.
5-13  We denote by a₁x + b₁ the LMS fit of y on x and by a₂y + b₂ the LMS fit of x on y. Show that a₁a₂ = r²_xy.
5-14  The RV x is N(0, 2) and y = x³. Find the LMS fit a + bx of y on x.
5-15  Using the Taylor series approximation
      g(x, y) ≈ g(η_x, η_y) + (x - η_x) ∂g/∂x + (y - η_y) ∂g/∂y
      show that if z = g(x, y), then
      σ_z² ≈ σ_x²(∂g/∂x)² + σ_y²(∂g/∂y)² + 2rσ_xσ_y(∂g/∂x)(∂g/∂y)
5-16  The voltage v and the current i are two independent RVs with
      η_v = 110 V    σ_v = 2 V    η_i = 2 A    σ_i = 0.1 A
      Using Problem 5-15 and (5-52), find approximately the mean η_w and the standard deviation σ_w of the power w = vi.
5-17  The RV x is uniform in the interval (0, c). (a) Find its moment function Φ(s). (b) Using the moment theorem (5-62), find η_x and σ_x.
5-18  We say that an RV x has a Laplace distribution if f(x) = (1/2)e^(-|x|). Find the corresponding moment function Φ(s) and determine η_x and σ_x using (5-62).
5-19  Show that if two RVs x and y are jointly normal with zero mean, then
      E{x²y²} = E{x²}E{y²} + 2E²{xy}
5-20  The RVs x and y are jointly normal with zero mean. (a) Express the moment function Φ_z(s) of the RV z = ax + by in terms of the three parameters σ_x, σ_y, and r_xy. (b) Show that if two RVs x and y are such that the sum z = ax + by is a normal RV for every a and b, then x and y are jointly normal.
5-21  The logarithm Ψ(s) = ln Φ(s) of the moment-generating function Φ(s) is called the second moment-generating function or the cumulant generating function. The derivatives k_n = Ψ^(n)(0) of Ψ(s) are called the cumulants of x. (a) Show that
      k₀ = 0    k₁ = η    k₂ = σ²    k₃ = m₃    k₄ = m₄ - 3σ⁴
      where m_n = E{x^n}. (b) Find Ψ(s) if x is a Poisson RV with parameter a. Find the mean and the variance of x using (a).
5·ll The avs x andy are uniform in the interval (0, I) and independent. Find the
density of the avs z = x T y. w = x - y, s = lx - y;.
5-23 The resista'lces of two resistors are two independent avs, and each is uniform
in the interval (9000, 1,1000). Connected in series. they form a resistor with
resistance R = R1 + R2 • (a) Find the density of R. (b) Find the probability p
that R is between l,900U and 2,1000.
5-24 The times of arrival of two trains are two independent avs x and y, and each is
uniform in the interval (0, 20). (a) Find the density of the RV z = x- y. (b) Find
the probability p 2 in Example 2-38 using avs.
S·lS Show that if z = xy. then
/:(r.)
•
= J"-x -1Xi. f..· (x. X!)dx
1
5-16 The avs x andy are independent. Express the joint density fz,..(z.. w) of the Rvs
z = x2 • w = y3 in terms of f.< x) and J.( y ). Use the result to show that the avs z
·
and w are independent.
5-27 The coordinates of a point (x, y) on the plane are two RVs x and y with joint density f_xy(x, y). The corresponding polar coordinates are
    r = √(x² + y²)    φ = arctan(y/x)    −π < φ < π
Show that their joint density equals
    f_rφ(r, φ) = r f_xy(r cos φ, r sin φ)
Special case: Show that if the RVs x and y are N(0, σ) and independent, the RVs r and φ are independent, r has a Rayleigh distribution, and φ is uniform in the interval (−π, π).
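The special case is easy to check by simulation. The following Python sketch (with σ = 2 chosen arbitrarily) generates independent N(0, σ) pairs and verifies that r has the Rayleigh mean σ√(π/2) and that φ is nearly uniform on (−π, π).

import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(0, sigma, 500_000)
y = rng.normal(0, sigma, 500_000)
r = np.hypot(x, y)            # sqrt(x^2 + y^2)
phi = np.arctan2(y, x)        # angle in (-pi, pi)

print(r.mean(), sigma * np.sqrt(np.pi / 2))          # Rayleigh mean check
counts, _ = np.histogram(phi, bins=8, range=(-np.pi, np.pi))
print(counts / counts.sum())                          # all bins close to 1/8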
5-28 The RVs x and y are independent and z = x + y. (a) Find the density of y if
    f_x(x) = c e^{−cx} U(x)    f_z(z) = c² z e^{−cz} U(z)
(b) Show that if y is uniform in the interval (0, 1), then
    f_z(z) = F_x(z) − F_x(z − 1)
5-29 Show that if the RVs x and y are independent with exponential densities c e^{−cx} U(x) and c e^{−cy} U(y), respectively, their difference z = x − y has the Laplace density
    (c/2) e^{−c|z|}
5-30 Given two normal RVs x and y with joint density as in (5-99), show that the probability masses m₁, m₂, m₃, m₄ in the four quadrants of the xy plane equal
    m₁ = m₃ = 1/4 + α/(2π)    m₂ = m₄ = 1/4 − α/(2π)
where α = arcsin r.

Figure P5.30: the four quadrant masses of the xy plane.
5-31 The RVs x and y are N(0, σ) and independent. Show that the RVs z = x/y and w = √(x² + y²) are independent, z has a Cauchy density, and w has a Rayleigh density.
6
Conditional Distributions,
Regression, Reliability
In this chapter, we use the concept of the conditional probability of events to
define conditional distributions, densities, and expected values. In the first
section, we develop various extensions of Bayes' formula, and we introduce
the notion of Bayesian estimation in the context of an unknown probability.
Later, we treat the nonlinear prediction problem and its relationship to the
regression line defined as conditional mean. In the last section, we present a
number of basic concepts related to reliability and system failure.
6-1
Conditional Distributions
Recall that the conditional probability of an event A assuming M is the ratio
    P(A|M) = P(A ∩ M)/P(M)   (6-1)
defined for every M such that P(M) > 0. In the following, we express one or both events A and M in terms of various RVs. Unless otherwise stated, it will be assumed that all RVs are of continuous type. The discrete case leads to similar results.
We start with the definition of the conditional distribution F(x|M) of the RV x assuming M. This is a function defined as in (4-2) where all probabilities are replaced by conditional probabilities. Thus
    F(x|M) = P{x ≤ x | M} = P{x ≤ x, M}/P(M)   (6-2)
Here, {x ≤ x, M} is the intersection of the events {x ≤ x} and M; that is, it is an event consisting of all outcomes ζ that are in M and such that x(ζ) ≤ x.
The derivative
    f(x|M) = dF(x|M)/dx   (6-3)
of F(x|M) is the conditional density of x assuming M.
It follows from the fundamental note on page 48 that conditional distributions have all the properties of unconditional distributions. For example,
    F(x₂|M) − F(x₁|M) = P{x₁ < x ≤ x₂ | M}   (6-4)
the area of f(x|M) equals 1, and [see (4-32)]
    f(x|M) dx = P{x < x ≤ x + dx | M} = P{x < x ≤ x + dx, M}/P(M)   (6-5)

Example 6.1
In the fair-die experiment, the RV x is such that x(f_i) = 10i and its distribution is the staircase function of Fig. 6.1. We shall determine the conditional distribution F(x|M) where M = {f₂, f₄, f₆}.
If x ≥ 60, then {x ≤ x} is the certain event and {x ≤ x, M} = M; hence,
    F(x|M) = P(M)/P(M) = 1
Figure 6.1: F(x) and F(x|even).
If 40 ≤ x < 60, then {x ≤ x, M} = {f₂, f₄}; hence,
    F(x|M) = P{f₂, f₄}/P(M) = (2/6)/(3/6) = 2/3
If 20 ≤ x < 40, then {x ≤ x, M} = {f₂}; hence,
    F(x|M) = P{f₂}/P(M) = (1/6)/(3/6) = 1/3
Finally, if x < 20, then {x ≤ x, M} = ∅; hence, F(x|M) = 0. •
TOTAL PROBABILITY AND BAYES' FORMULA We have shown [see (2-24)] that if [A₁, . . . , A_n] is a partition of S, then (total probability theorem)
    P(B) = P(B|A₁)P(A₁) + · · · + P(B|A_n)P(A_n)   (6-6)
for any B. With B = {x ≤ x}, this yields
    F(x) = F(x|A₁)P(A₁) + · · · + F(x|A_n)P(A_n)   (6-7)
and by differentiation
    f(x) = f(x|A₁)P(A₁) + · · · + f(x|A_n)P(A_n)   (6-8)

Example 6.2
Two machines M₁ and M₂ produce cylinders at the rate of 3 units per second and 7 units per second, respectively. The diameters of these cylinders are two RVs with densities N(η₁, σ) and N(η₂, σ). The daily outputs are combined into a single lot, and the RV x equals the diameter of the cylinders. We shall find the density of x.
In this experiment,
    A₁ = {the unit came from M₁}    A₂ = {the unit came from M₂}
are two events consisting of 30% and 70% of the units, respectively. Thus f(x|A₁) is the conditional density of x assuming that the unit came from machine M₁; hence, f(x|A₁) is N(η₁, σ). Similarly, f(x|A₂) is N(η₂, σ). And since the events A₁ and A₂ form a partition, we conclude from (6-8) with P(A₁) = .3 and P(A₂) = .7 that
    f(x) = (.3/(σ√2π)) e^{−(x−η₁)²/2σ²} + (.7/(σ√2π)) e^{−(x−η₂)²/2σ²}
as in Fig. 6.2. •

Figure 6.2
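A small Python sketch of the resulting mixture density follows; only the weights .3 and .7 come from the example, while the values of η₁, η₂, and σ below are illustrative assumptions.

import numpy as np

def normal_pdf(x, eta, sigma):
    return np.exp(-(x - eta)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def f_mixture(x, eta1=10.0, eta2=10.5, sigma=0.2):
    # Density of the combined lot, as in (6-8): .3 N(eta1, sigma) + .7 N(eta2, sigma).
    return 0.3 * normal_pdf(x, eta1, sigma) + 0.7 * normal_pdf(x, eta2, sigma)

grid = np.linspace(8, 13, 5001)
print(f_mixture(np.linspace(9.5, 11.0, 7)))               # sample values of f(x)
print(np.sum(f_mixture(grid)) * (grid[1] - grid[0]))       # ~ 1: f(x) is a valid density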
We shall now extend Bayes' formula
    P(A|B) = P(B|A)P(A)/P(B)   (6-9)
to RVs. With B = {x ≤ x}, (6-9) yields
    P{A|x ≤ x} = (P{x ≤ x|A}/P{x ≤ x}) P(A) = (F(x|A)/F(x)) P(A)   (6-10)
Similarly [see (6-4)],
    P{A|x₁ < x ≤ x₂} = ((F(x₂|A) − F(x₁|A))/(F(x₂) − F(x₁))) P(A)   (6-11)
Using (6-11), we shall define the conditional probability
    P{A|x} = P{A|x = x}
of an event A assuming x = x. If P{x = x} > 0, we can use (6-9). We cannot do so, however, if x is of continuous type because then P{x = x} = 0. In this case, we define P{A|x} as a limit:
    P{A|x = x} = lim_{Δx→0} P{A|x ≤ x ≤ x + Δx}
With x₁ = x, x₂ = x + Δx, it follows from this and (6-11) that
    P{A|x} = f(x|A)P(A)/f(x)   (6-12)
We next multiply both sides by f(x) and integrate. Since the area of f(x|A) equals 1, we obtain
    P(A) = ∫_{−∞}^{∞} P{A|x} f(x) dx   (6-13)
This is another form of the total probability theorem. The corresponding version of Bayes' formula follows from (6-12):
    f(x|A) = P{A|x} f(x) / ∫_{−∞}^{∞} P{A|x} f(x) dx   (6-14)
Bayesian Estimation
We are given a coin and we wish to determine the probability p that heads
will show. To do so. we toss it n times and observe that heads shows k times.
What conclusion can be drawn from this observation about the unknown p?
This problem can be given two interpretations. In the first interpretation, p is viewed as an unknown parameter. In the second, p is the value of an RV p. These interpretations are examined in detail in Part Two. Here we introduce the second interpretation (Bayesian) in the context of Bayes' formula (6-14).
We assume that the probability of heads is an RV p defined in an experiment S_c. This experiment could be the random selection of a coin from a large supply of coins. The RV p takes values only in the interval (0, 1); hence, its density f(p) vanishes outside this interval. We toss the selected
coin once, and we wish to find the probability P(H) that heads will show, that is, that the event H = {heads} will occur.
• Theorem. In the single toss of a randomly selected coin, the probability that heads will show equals the mean of p:
    P(H) = ∫₀¹ p f(p) dp   (6-15)
• Proof. The experiment of the single toss of a randomly selected coin is a Cartesian product S_c × S₁ where S₁ = {h, t}. In this experiment, the event H = {heads} consists of all pairs ζ_c h where ζ_c is any element of the space S_c. The probability that a particular coin with p(ζ_c) = p will show heads equals p. This is the conditional probability of the event H assuming p = p. Thus
    P{H|p} = p   (6-16)
Inserting into (6-13), we obtain (6-15).
Suppose that the coin is tossed and heads shows. We then have an updated version of the density of p, given by f(p|H). This density is obtained from (6-14) and (6-16):
    f(p|H) = p f(p) / ∫₀¹ p f(p) dp   (6-17)
REPEATED TRIALS We now consider the toss of a randomly selected coin n times. The space of this experiment is a Cartesian product S_c × S_n where S_n consists of all sequences of length n formed with h and t. In the space S_c × S_n,
    A = {k heads in a specific order}
is an event consisting of all outcomes of the form ζ_c ht · · · h where ζ_c is an element of S_c and ht · · · h is an element of S_n. As we know from (3-15),
    P(A|p) = p^k q^{n−k}    q = 1 − p   (6-18)
This is the probability that a particular coin with p(ζ_c) = p, tossed n times, will show k heads. From this and (6-13) it follows that
    P(A) = ∫₀¹ p^k q^{n−k} f(p) dp   (6-19)
Equation (6-19) is an extension of (6-15) to repeated trials. The corresponding extension of (6-17) is the conditional density
    f(p|A) = p^k q^{n−k} f(p) / ∫₀¹ p^k q^{n−k} f(p) dp   (6-20)
of the RV p assuming that in n trials k heads show. This density is used in the following problem.
We have selected a coin, tossed it n times, and observed k heads. Find the probability that at the next toss heads will show. This is equivalent to the problem of determining the probability of heads of a coin with prior density f(p|A). We can therefore use theorem (6-15). The unknown probability equals
    ∫₀¹ p f(p|A) dp   (6-21)
Example 6.3
Suppose that p is uniform in the interval (0, 1). In this case, f(p) = 1; hence, the probability of heads [see (6-15)] equals
    ∫₀¹ p dp = 1/2
We toss the coin n times; if heads shows k times, the updated density of p is obtained from (6-20). Using the identity
    ∫₀¹ p^k (1 − p)^{n−k} dp = k!(n − k)!/(n + 1)!   (6-22)
we obtain
    f(p|A) = ((n + 1)!/(k!(n − k)!)) p^k q^{n−k}   (6-23)
This function is known as the beta density. Inserting into (6-21), we conclude that the probability of heads at the next toss equals
    ∫₀¹ p f(p|A) dp = ((n + 1)!/(k!(n − k)!)) ∫₀¹ p^{k+1} q^{n−k} dp = (k + 1)/(n + 2)   (6-24)
Thus after the observation of k heads, the density of the coin is updated from uniform to beta and the probability of heads from 1/2 to (k + 1)/(n + 2). This result is known as the law of succession. •
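The updating in this example is easy to reproduce numerically. The following Python sketch (with the illustrative values n = 10, k = 7) evaluates the beta posterior (6-23) and confirms that its mean equals the law-of-succession value (6-24).

from math import factorial
import numpy as np

def posterior_beta(p, n, k):
    """Posterior density (6-23) of p given k heads in n tosses, uniform prior."""
    return factorial(n + 1) / (factorial(k) * factorial(n - k)) * p**k * (1 - p)**(n - k)

def prob_heads_next(n, k):
    """Law of succession (6-24)."""
    return (k + 1) / (n + 2)

n, k = 10, 7
p = np.linspace(0, 1, 100_001)
post = posterior_beta(p, n, k)
print(np.sum(p * post) / np.sum(post))   # numerical mean of the posterior
print(prob_heads_next(n, k))             # (k + 1)/(n + 2) = 8/12, the same value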
The assumption that the density of the coin, prior to any observation, is constant is justified by the subjectivists as a consequence of the principle of insufficient reason (page 17). In the context of our discussion, however, it is only an assumption (and not a very good one, because most coins are fair).
Returning to the general case, we shall call the densities f(p) and f(p|A) in (6-20) prior and posterior, respectively. The posterior density f(p|A) is an updated version of the prior, and its form depends on the observed number of heads. The factor
    l(p) = p^k (1 − p)^{n−k}
in (6-21) is called the likelihood function (see Section 9-5). This function is maximum for p = k/n.
In Fig. 6.3 we show the functions f(p), l(p), and f(p|A) ∝ l(p) f(p). For moderate values of n, the factor l(p) is smooth, and the product l(p)f(p) exhibits two maxima: one near the maximum k/n of l(p) and one near the maximum of f(p). As n increases, the sharpness of l(p) prevails, and the function f(p|A) approaches l(p) regardless of the form of the prior f(p). Thus as n → ∞, f(p|A) approaches a line at p = k/n and its mean, that is, the probability of heads at the next toss, tends to k/n. This is, in a sense, the model justification of the empirical interpretation k/n of p.

Figure 6.3: (1) the prior f(p), (2) the likelihood l(p), (3) the posterior f(p|A).
Bayes' Formulas
The conditional distribution F(x|M) of an RV x assuming M involves the event {x ≤ x, M}. For the determination of F(x|M), knowledge of the underlying experiment is therefore required. However, if M is an event that can be expressed in terms of the RV x, then F(x|M) can be expressed in terms of the unconditional distribution F(x) of x. Let us look at several illustrations (we assume that all RVs are of continuous type).
Suppose, first, that
    M = {x ≤ a}
In this case, F(x|M) is the conditional distribution of x assuming that x ≤ a. Thus
    F(x|x ≤ a) = P{x ≤ x, x ≤ a}/P{x ≤ a}   (6-25)
where a is a fixed number and x is a variable ranging from −∞ to ∞. The event {x ≤ x, x ≤ a} consists of all outcomes such that x ≤ x and x ≤ a; its probability depends on x.
If x ≥ a, then {x ≤ x, x ≤ a} = {x ≤ a}; hence,
    F(x|x ≤ a) = P{x ≤ a}/P{x ≤ a} = 1
Figure 6.4: F(x|x ≤ a).
If x < a, then {x ≤ x, x ≤ a} = {x ≤ x}; hence,
    F(x|x ≤ a) = P{x ≤ x}/P{x ≤ a} = F(x)/F(a)
Thus F(x|x ≤ a) is proportional to F(x) for x ≤ a, and for x > a it equals 1 (Fig. 6.4). The conditional density f(x|x ≤ a) is obtained by differentiation.
Suppose, next, that
    M = {a ≤ x ≤ b}
We shall determine f(x|M) directly, using (6-5):
    f(x|a ≤ x ≤ b) dx = P{x < x ≤ x + dx, a ≤ x ≤ b}/P{a ≤ x ≤ b}
To do so, we must find the probability of the event
    {x < x ≤ x + dx, a ≤ x ≤ b} = {x < x ≤ x + dx} if a < x < b, and ∅ otherwise   (6-26)
Since P{x < x ≤ x + dx} = f(x) dx, we conclude from (6-5) that
    f(x|a ≤ x ≤ b) = f(x)/(F(b) − F(a))  for a < x < b   (6-27)
and zero otherwise.
Example 6.4
The RV x is N(η, σ); we shall find its conditional density f(x|M) assuming M = {η − σ ≤ x ≤ η + σ}.
As we know (see Fig. 4.20), the probability that x is in the interval (η − σ, η + σ) equals .683. Setting F(b) − F(a) = .683 in (6-27), we obtain the truncated normal density
    f(x| |x − η| < σ) = (1/(.683 σ√2π)) e^{−(x−η)²/2σ²}
for η − σ < x < η + σ and zero otherwise (Fig. 6.5). •
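The truncated density can be checked numerically; the following Python sketch (with η = 0 and σ = 1 as illustrative values) verifies that its area equals 1.

import numpy as np

def truncated_normal_pdf(x, eta, sigma):
    # Conditional density of Example 6.4: the normal density divided by .683
    # inside (eta - sigma, eta + sigma), and zero outside that interval.
    inside = np.abs(x - eta) < sigma
    pdf = np.exp(-(x - eta)**2 / (2 * sigma**2)) / (0.683 * sigma * np.sqrt(2 * np.pi))
    return np.where(inside, pdf, 0.0)

eta, sigma = 0.0, 1.0
grid = np.linspace(-2, 2, 8001)
area = np.sum(truncated_normal_pdf(grid, eta, sigma)) * (grid[1] - grid[0])
print(area)    # ~ 1: the conditional density integrates to one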
Figure 6.5: the truncated normal density f(x| |x − η| < σ).

Empirical Interpretation. We wish to determine empirically the conditional distribution F(x|a ≤ x ≤ b). To do so, we repeat the experiment n times, and we reject all outcomes ζ such that x(ζ) is outside the interval (a, b). In the subsequence s₁ of the remaining n₁ trials, the function F(x|a ≤ x ≤ b) has the empirical interpretation of an unconditional distribution: For a specific x, it equals the ratio n_x/n₁ where n_x is the number of trials of the subsequence s₁ such that x(ζ) ≤ x.
Suppose that our experiment is the manufacture of cylinders and that
x_i = x(ζ_i) is the diameter of the ith unit. To control the quality of the output, we
specify a tolerance interval (a, b) and reject all units that fall outside this
interval. The function F<xla < x s b) is the distribution of the accepted units.
Note, finally, that if y = g(x) is a function of the RV x, its conditional density f_y(y|M) is obtained from (4-78), where all densities are replaced by conditional densities. If, in particular, M is an event that can be expressed in terms of x, then f_y(y|M) can be determined in terms of f_x(x) and the function g(x).
Example 6.5
If M = {x ≥ 0} and y = x², then [see (4-84)]
    f_y(y|x ≥ 0) = (1/(2√y)) f_x(√y) U(y) / (1 − F_x(0)) •
Joint Distributions
We shall investigate the properties of the conditional probability P(A|B) when both events A and B are specified in terms of the RVs x and y.
We start with the determination of the function F_y(y|x₁ ≤ x ≤ x₂). With A = {y ≤ y} and B = {x₁ ≤ x ≤ x₂}, it follows from (6-1) that
    F_y(y|x₁ ≤ x ≤ x₂) = P{x₁ ≤ x ≤ x₂, y ≤ y}/P{x₁ ≤ x ≤ x₂} = (F(x₂, y) − F(x₁, y))/(F_x(x₂) − F_x(x₁))   (6-28)
We shall use (6-28) to determine the conditional distribution F_y(y|x) of y assuming x = x. This function cannot be determined from (6-1) because the event {x = x} has zero probability. It will be defined as a limit:
    F_y(y|x) = lim_{Δx→0} F_y(y|x ≤ x ≤ x + Δx)
Setting x₁ = x and x₂ = x + Δx in (6-28) and dividing numerator and denominator of the right side by Δx, we conclude with Δx → 0 that
    F_y(y|x) = (1/f_x(x)) ∂F(x, y)/∂x   (6-29)
The conditional density f_y(y|x) of y, assuming x = x, is the derivative of F_y(y|x) with respect to y. Thus
    f_y(y|x) = ∂F_y(y|x)/∂y = (1/f_x(x)) ∂²F(x, y)/∂x ∂y
The function f_x(x|y) is defined similarly. Omitting subscripts, we conclude from the foregoing and from (5-8) that
    f(y|x) = f(x, y)/f(x)    f(x|y) = f(x, y)/f(y)   (6-30)
If the RVs x and y are independent,
    f(x, y) = f(x)f(y)    f(y|x) = f(y)    f(x|y) = f(x)
For a specific x, the function f(x, y) is the intersection (profile) of the surface z = f(x, y) by the plane x = constant. The conditional density f(y|x), considered as a function of y, is the profile of f(x, y) normalized by the factor 1/f_x(x).
From (6-30) and the relationship
    f(y) = ∫_{−∞}^{∞} f(x, y) dx
between marginal and joint densities, it follows that
    f(y) = ∫_{−∞}^{∞} f(y|x) f(x) dx   (6-31)
This is another form of the total probability theorem. The corresponding form of Bayes' formula is
    f(x|y) = f(y|x)f(x)/f(y) = f(y|x)f(x) / ∫_{−∞}^{∞} f(y|x)f(x) dx   (6-32)

Example 6.6
The RVs x and y are normal with zero mean and density [see (5-99)]
    f(x, y) = (1/(2πσ₁σ₂√(1 − r²))) exp{−(1/(2(1 − r²)))(x²/σ₁² − 2r xy/(σ₁σ₂) + y²/σ₂²)}   (6-33)
We shall show that
    f(y|x) = (1/(σ₂√(2π(1 − r²)))) exp{−(y − rσ₂x/σ₁)²/(2σ₂²(1 − r²))}   (6-34)
• Proof. As we know,
    f(x) = (1/(σ₁√2π)) e^{−x²/2σ₁²}   (6-35)
Dividing (6-33) by (6-35), we obtain an exponential with exponent
    −(1/(2(1 − r²)))(x²/σ₁² − 2r xy/(σ₁σ₂) + y²/σ₂²) + x²/2σ₁² = −(1/(2σ₂²(1 − r²)))(y − (rσ₂/σ₁)x)²
and (6-34) results. •
Conditional Expected Values
The mean E{x} of a continuous-type RV x equals the integral in (4-87). The conditional mean E{x|M} is given by the same integral where f(x) is replaced by f(x|M). Thus
    E{x|M} = ∫_{−∞}^{∞} x f(x|M) dx   (6-36)
The empirical interpretation of E{x|M} is the arithmetic mean of the samples x_i of x in the subsequence of trials in which the event M occurs.
Example 6.7
Two light bulbs are bought from the same lot. The first is turned on on May 1 and the second on June 1. If the first is still good on June 1, can it, on the average, last longer than the second?
Suppose that the density of the time to failure is the function f(x) of Fig. 6.6. In this case, E{x} = 3.15 months. The conditional density f(x|x ≥ 1) is a triangle, and E{x|x ≥ 1} = 5 months. Thus the average time to failure after June 1 of the old bulb is 5 − 1 = 4 months and of the new bulb 3.15 months. Thus the old bulb is better than the new bulb!
This phenomenon is observed in statistics of populations with high infant mortality: The mean life expectancy a year after birth is larger than at birth. •
Figure 6.6: f(x) and f(x|x ≥ 1).
Figure 6.7

If M is an event that can be expressed in terms of the RV x, the conditional mean E{y|M} is specified if f(x, y) is known. Of particular interest is
the conditional mean E{y|x} of y assuming x = x. As we shall see in Section 6-3, this concept is important in mean-square estimation.
Setting M = {x ≤ x ≤ x + Δx} in (6-36), we conclude with Δx → 0 that
    E{y|x} = ∫_{−∞}^{∞} y f(y|x) dy   (6-37)
Similarly,
    E{g(y)|x} = ∫_{−∞}^{∞} g(y) f(y|x) dy   (6-38)
where f(y|x) is given by (6-30). For a given x, the integral in (6-37) is the center of gravity of all masses in the strip (x, x + dx) of the xy plane (Fig. 6.7). The locus of these points is the function
    φ(x) = ∫_{−∞}^{∞} y f(y|x) dy   (6-39)
known as the regression curve of y on x.
Example 6.8
If the RVs x and y are jointly normal with zero mean, then [see (6-34)]
    f(y|x) ∝ exp{−(y − rσ₂x/σ₁)²/(2σ₂²(1 − r²))}
For a fixed x, this is a normal density with mean rσ₂x/σ₁; hence,
    E{y|x} = φ(x) = r (σ₂/σ₁) x   (6-40)
If the RVs x and y are normal with means η_x and η_y, respectively, then f(y|x) is given by (6-34) where y and x are replaced by y − η_y and x − η_x, respectively. In this case, f(y|x) is a normal density in y with mean
    φ(x) = η_y + (rσ₂/σ₁)(x − η_x)   (6-41)
Thus the regression curve of normal RVs is a straight line with slope rσ₂/σ₁ passing through the point (η_x, η_y). •
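The regression line (6-40) can be verified by simulation. The Python sketch below uses the illustrative values σ₁ = 1, σ₂ = 2, r = .6 and compares the empirical conditional mean of y, given x in a narrow window, with rσ₂x/σ₁.

import numpy as np

rng = np.random.default_rng(2)
sigma1, sigma2, r = 1.0, 2.0, 0.6
cov = [[sigma1**2, r * sigma1 * sigma2],
       [r * sigma1 * sigma2, sigma2**2]]
x, y = rng.multivariate_normal([0, 0], cov, 400_000).T

x0 = 1.2                                        # condition on x near x0
mask = np.abs(x - x0) < 0.05
print(y[mask].mean())                           # empirical E{y | x ~ x0}
print(r * sigma2 / sigma1 * x0)                 # regression line value, ~1.44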
The mean E{y} of an RV y is a deterministic number. The conditional mean E{y|x} is a function φ(x) of the real variable x. Using this function, we form the composite function φ(x) as in Section 4-4. This function is an RV with domain the space S. Thus starting from the deterministic function E{y|x}, we have formed the RV E{y|x} = φ(x).
• Theorem. The mean of the RV E{y|x} equals the mean of y:
    E{E{y|x}} = E{y}   (6-42)
• Proof. As we know from (4-94),
    E{E{y|x}} = E{φ(x)} = ∫_{−∞}^{∞} φ(x) f(x) dx
Inserting (6-39) into this and using the fact that f(y|x)f(x) = f(x, y), we obtain
    E{φ(x)} = ∫_{−∞}^{∞} (∫_{−∞}^{∞} y f(y|x) dy) f(x) dx = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f(x, y) dx dy
This yields (6-42) because the last integral equals E{y} [see (5-28)].
In certain problems, it is easy to evaluate the function E{y|x}. In such cases, (6-42) is used to find E{y}. Let us now look at an illustration involving discrete-type RVs.
Example 6.9
The number of accidents in a day is a Poisson RV x with parameter a. The accidents are independent, and the probability that an accident is fatal equals p. Show that the number of fatal accidents in a day is a Poisson RV y with parameter ap.
To solve this problem, it suffices to show that the moment function of y [see (5-60)] equals
    E{e^{sy}} = e^{ap(e^s − 1)}   (6-43)
This involves only expected values; we can therefore apply (6-42).
• Proof. The RV x is Poisson by assumption; hence,
    P{x = n} = e^{−a} a^n/n!    n = 0, 1, . . .
If x = n, we have n independent accidents during that day, and the probability that each is fatal equals p. From this it follows that the conditional distribution of the number y of fatal accidents assuming x = n is a binomial distribution:
    P{y = k|x = n} = (n!/(k!(n − k)!)) p^k q^{n−k}    k = 0, 1, . . . , n   (6-44)
and its moment function [see (5-66)] equals
    E{e^{sy}|x = n} = (pe^s + q)^n   (6-45)
The right side is the value of the RV (pe^s + q)^x for x = n. Since x has a Poisson distribution, it follows from (4-94) that the expected value of the RV (pe^s + q)^x equals
    Σ_{n=0}^{∞} (pe^s + q)^n P{x = n} = Σ_{n=0}^{∞} (pe^s + q)^n e^{−a} a^n/n! = e^{−a} e^{a(pe^s + q)}
Hence [see (6-42)],
    E{e^{sy}} = E{E{e^{sy}|x}} = E{(pe^s + q)^x} = e^{ap(e^s − 1)}
and (6-43) results. •
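The result of this example is easy to confirm by simulation. In the Python sketch below, the parameter values a = 5 and p = .3 are illustrative assumptions; the fatal count y should behave like a Poisson RV with parameter ap.

import numpy as np

rng = np.random.default_rng(3)
a, p = 5.0, 0.3
x = rng.poisson(a, 1_000_000)            # number of accidents per day
y = rng.binomial(x, p)                   # given x = n, y is binomial(n, p)

print(y.mean(), y.var())                 # both ~ a*p = 1.5, as for a Poisson RV
print(np.mean(y == 0), np.exp(-a * p))   # P{y = 0} ~ e^{-ap}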
We conclude with the specification of the conditional mean E{g(x, y)|x} of the RV g(x, y) assuming x = x. Clearly, E{g(x, y)|x} is a function of x that can be determined as the limit of the conditional mean E{g(x, y)|x ≤ x ≤ x + Δx}. We shall, however, specify it using the interpretation of the mean as an empirical average.
The function E{g(x, y)|x} is the average of the samples g(x_i, y_i) in the subsequence of trials in which x_i = x. It therefore equals the average of the samples g(x, y_i) of the RV g(x, y). This leads to the conclusion that
    E{g(x, y)|x} = E{g(x, y)|x}   (6-46)
Note the difference between the RVs g(x, y) and g(x, y): the first is a function of the RVs x and y; the second is a function of the RV y, depending on the parameter x. However, as (6-46) shows, both RVs have the same mean, assuming x = x.
assuming x = x.
Since g(x, y) is a function ofy (depending also on the parameter x). its
conditional mean is given [see (6-38)] by
E{g(x. y)lx}
=
r.
g(x, y)f(y 1x)dy
(6-47)
This integral is a function 6(x) of x; it therefore defines the Rv 6(x)
E{g(x, y)lx}. The mean of 9(x) equals
r.
O<x>.f<x>dx =
=
r. f .
f. r.
.lf(X.
y>f<ylx>f<x)c/xdy
g(x. ylf<x. y)dxdy
But the last integml is the mean of lf(X. yl: hence.
E{E{g(x. yllx}} = E{lf(X, yl}
Note the following special cases of (6-46) and (6-48):
E{g,(x)g~<y>lx}
f;{lft(X)g~(y)}
=
= lft(X )f;{lf~(y>lx}
= E{g,(x)E{.!!~(yllx}}
(6-48)
(6-49)
6-3
Nonlinear Regression and Prediction
Figure 6.8

The RV y models the values of a physical quantity in a real experiment, and its distribution is a function F(y) determined from past observations. We wish to predict the value y(ζ) = y of this RV at the next trial (Fig. 6.8a). The outcome ζ of the trial is an unknown element of the space S; hence, y could
be any number in the range of y. We therefore cannot predict y; we can only estimate it. Suppose that we estimate y by a constant c. The estimation error y − c is the value of the difference y − c, and our goal is to choose c so as to minimize in some sense this error. We shall use as our criterion for selecting c the minimization of the mean-square (MS) error
    e = E{(y − c)²} = ∫_{−∞}^{∞} (y − c)² f(y) dy   (6-50)
This criterion is reasonable; however, it is used primarily because it leads to a simple solution. At the end of the section, we comment briefly on other criteria.
To find c, we shall use the identity (see Problem 4-28)
    E{(y − c)²} = (η_y − c)² + σ_y²   (6-51)
Since η_y and σ_y² are given constants, (6-51) is minimum if
    c = η_y = ∫_{−∞}^{∞} y f(y) dy   (6-52)
Thus the least mean square (LMS) estimate of an RV by a constant is its mean.
REGRESSION Suppose now that at the next trial we observe the value x(ζ) = x of another RV x. On the basis of this information, we might improve the estimate of y if we use as its predicted value not a constant but a function φ(x) of the observed x.
It might be argued that if we know the number x = x(ζ), we know also ζ and, hence, the value y = y(ζ) of y. This is the case, however, only if y is a function of x. In general, x(ζ_i) = x for every ζ_i in the set A_x = {x = x}, but the corresponding values y(ζ_i) of y might be different (Fig. 6.8b). Thus the observed value x of x does not determine the unknown value y = y(ζ_i) of y. It reduces our uncertainty about y, however, because it tells us that ζ_i is not an arbitrary element of S but an element of its subset A_x.
For example, suppose that the RV y represents the height of all boys in a community and the RV x their weight. We wish to estimate the height y of Jim. The best estimate of y by a number is the mean η_y of y. This is the average of the heights of all boys. Suppose, however, that we weigh Jim and his weight is x. As we shall show, the best estimate of y is now the average E{y|x} of all children that have Jim's weight.
Again using the LMS criterion, we shall determine the function φ(x) so as to minimize the mean value
    e = E{[y − φ(x)]²} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [y − φ(x)]² f(x, y) dx dy   (6-53)
of the square of the estimation error y − φ(x).
• Theorem. The LMS estimate of the RV y in terms of the observed value x of the RV x is the conditional mean
    φ(x) = E{y|x} = ∫_{−∞}^{∞} y f(y|x) dy   (6-54)
• Proof. Inserting the identity f(x, y) = f(y|x)f(x) into (6-53), we obtain
    e = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [y − φ(x)]² f(y|x) f(x) dx dy = ∫_{−∞}^{∞} f(x) (∫_{−∞}^{∞} [y − φ(x)]² f(y|x) dy) dx
All integrands are positive; hence, e is minimum if the inner integral on the right is minimum. This integral is of the form (6-50) if c is changed to φ(x) and f(y) is changed to f(y|x). Therefore, the integral is minimum if φ(x) is given by (6-52), mutatis mutandis. Changing the function f(y) in (6-52) to f(y|x), we obtain (6-54).
We have thus concluded that the LMS estimate of y in terms of x is the ordinate φ(x) of the regression curve (6-39).
Note that if the RVs x and y are normal as in (6-33), the regression curve is a straight line [see (6-40)]. Thus, for normal RVs, linear and nonlinear predictors are identical. Here is another example of RVs with this property.
Example 6.10
Suppose that x and y are two RVs with joint density equal to 1 in the parallelogram of Fig. 6.9 and zero elsewhere. In this case, f(y|x) is constant in the segment AB of the line L_x of the figure. Since the center of that segment is at y = x/2, we conclude that E{y|x} = x/2. •
Figure 6.9: the parallelogram where f(x, y) = 1 (Example 6.10).

Galton's Law The term regression has its origin in the following observation by the geneticist and biostatistician Sir Francis Galton (1822-1911): "Population extremes regress toward their mean." In terms of average
heights of parents and their adult children, this can be phrased as follows:
Children of tall (short) parents are on the average shorter (taller) than their
parents. This observation is based on the fact that the height y of children
depends not only on the height x of their parents but also on other genetic
factors. As a result, the conditional mean of children born of tall (short)
parents, although larger (smaller) than the population mean, is smaller
(larger) than the height of their parents. This process continues until after
several generations, the mean height of descendants of tall (short) parents
approaches the population mean. This empirical result is called Galton's
law.
The statistical interpretation of the law can be expressed in terms of the properties of the regression line φ(x) = E{y|x}. We observe, first, that
    if x > η, then φ(x) < x;    if x < η, then φ(x) > x   (6-55)
because φ(x) is the mean height of all children whose father's height equals x. This shows that φ(x) is below the line y = x if x > η and above the line y = x if x < η. The evolution of this process over several generations leads to the following property of φ(x). We start with a group of parents with height x₀ > η as in Fig. 6.10, and we find the average y₀ = φ(x₀) < x₀ of the height of their children. We next form a group of parents with height x₁ = y₀ and find the average y₁ = φ(x₁) < y₀ of the height of their children. Continuing this process, we obtain two sequences x_n and y_n such that
    x_n > φ(x_n) = y_n = x_{n+1} > φ(x_{n+1}) = y_{n+1} = x_{n+2} > · · · → η    as n → ∞
Starting with short parents, we obtain similarly two sequences x_n′ and y_n′ such that
    x_n′ < φ(x_n′) = y_n′ = x_{n+1}′ < φ(x_{n+1}′) = y_{n+1}′ = x_{n+2}′ < · · · → η    as n → ∞
This completes the properties of a regression curve obeying Galton's law. Today, the term regression curve is used to characterize not only a function obeying Galton's law but any conditional mean E{y|x}.
Figure 6.10: the sequences x_n and y_n approaching η.
THE ORTHOGONALITY PRINCIPLE We have shown in (5-49) that if a + bx is the best MS fit of y on x, then
    E{[y − (a + bx)]x} = 0   (6-56)
This result can be phrased as follows: If a + bx is the linear predictor of y in terms of x, then the prediction error y − (a + bx) is orthogonal to x. We show next that if φ(x) is the nonlinear predictor of y, then the error y − φ(x) is orthogonal not only to x but to any function q(x) of x.
• Theorem. If φ(x) = E{y|x}, then
    E{[y − φ(x)]q(x)} = 0   (6-57)
• Proof. From the linearity of expected values, it follows that
    E{y − φ(x)|x} = E{y|x} − φ(x) = 0
Hence [see (6-49)],
    E{[y − φ(x)]q(x)} = E{q(x)E{y − φ(x)|x}} = 0
and (6-57) results.
The Rao-Blackwell Theorem The following corollary of (6-57) is used in parameter estimation (see page 313). We have shown in (6-42) that the mean η_φ of φ(x) equals the mean η_y of y. We show next that the variance of φ(x)
does not exceed the variance of y:
    η_φ = η_y    σ_φ² ≤ σ_y²   (6-58)
• Proof. From the identity y − η_y = [y − φ(x)] + [φ(x) − η_y], it follows that
    (y − η_y)² = [y − φ(x)]² + [φ(x) − η_y]² + 2[y − φ(x)][φ(x) − η_y]
We next take expected values of both sides. With q(x) = φ(x) − η_y, it follows from (6-57) that the expected value of the last term equals zero. Hence,
    σ_y² = E{[y − φ(x)]²} + σ_φ² ≥ σ_φ²
and (6-58) results.
Risk and Loss Returning to the problem of estimating an RV y by a constant c, we note that the choice of c is not unique. For each c, we commit an error y − c, and our objective is to reduce some function L(y − c) of this error. The selection of the form of L(y − c) depends on the application and is based on factors unrelated to the statistical model. The curve L(y − c) is called the loss function. Its mean value
    R = E{L(y − c)} = ∫_{−∞}^{∞} L(y − c) f(y) dy
is the average risk. If L(y − c) = (y − c)², then R is the MS error e in (6-50) and c = η_y.
Another choice of interest is the loss function L(y − c) = |y − c|. In this case, our problem is to find c so as to minimize the average risk R = E{|y − c|}. Clearly,
    R = ∫_{−∞}^{∞} |y − c| f(y) dy = ∫_{−∞}^{c} (c − y) f(y) dy + ∫_{c}^{∞} (y − c) f(y) dy
Differentiating with respect to c, we obtain
    dR/dc = ∫_{−∞}^{c} f(y) dy − ∫_{c}^{∞} f(y) dy = F(c) − [1 − F(c)] = 2F(c) − 1
Hence, the average risk E{|y − c|} is minimum if F(c) = 1/2, that is, if c equals the median of y.
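The conclusion that the median minimizes the absolute-error risk can be illustrated numerically; the exponential sample in the Python sketch below is an arbitrary choice of a skewed density, for which the median (ln 2) and the mean (1) are clearly different.

import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(1.0, 200_000)       # mean 1, median ln 2

c_grid = np.linspace(0.2, 1.5, 261)
abs_risk = [np.mean(np.abs(y - c)) for c in c_grid]
print(c_grid[np.argmin(abs_risk)], np.log(2))   # minimizer ~ 0.69, the median
print(np.mean(y))                                # mean ~ 1 minimizes the MS error instead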
6-4
System Reliability
A system is an object made to perform a function. The system is good if it
performs that function, defective if it does not. In reliability theory, the state
of a system is often interpreted statistically. This interpretation has two
related forms. The first is time-dependent: The interval of time from the
moment the system is put into operation until it fails is a random variable.
For example, the life length of a light bulb is the value of a random variable.
The second is time-independent: The system is either good with probability
p or defective with probability 1 − p. In this interpretation, the state of the
system is specified in terms of the number p: time is not a factor. For
example, for all practical purposes, a bullet is either good or defective. In
this section, we deal primarily with time-dependent systems. We introduce
the notion of time to failure and explain the meaning of conditional failure
rate in the context of conditional probabilities. At the end of the section, we
consider the properties of systems formed by the interconnection of components.
• Definition. The time to failure or life length of a system is the time interval from the moment the system is put into operation until it fails. This interval is an RV x ≥ 0 with distribution F(t) = P{x ≤ t}. The difference
    R(t) = 1 − F(t) = P{x > t}   (6-59)
is the system reliability. Thus F(t) is the probability that the system fails prior to time t, and R(t) is the probability that the system functions at time t. The mean of x is called mean time to failure. As we see from (4-92),
    E{x} = ∫₀^∞ x f(x) dx = ∫₀^∞ R(t) dt   (6-60)
because F(x) = 0 for x < 0.
The conditional distribution
    F(x|x > t) = P{x ≤ x, x > t}/P{x > t}   (6-61)
is the probability that a system functioning at time t will fail prior to time x. Clearly, F(x|x > t) = 0 if x < t, and
    F(x|x > t) = (F(x) − F(t))/(1 − F(t))    x > t   (6-62)
Differentiating with respect to x, we obtain the conditional density
    f(x|x > t) = f(x)/(1 − F(t))    x > t   (6-63)
The product f(x|x ≥ t) dx is the probability that the system will fail in the time interval (x, x + dx), assuming that it functions at time t.
Example 6.11
Suppose that x has an exponential distribution
    F(x) = (1 − e^{−ax}) U(x)    f(x) = a e^{−ax} U(x)
In this case (Fig. 6.11),
    f(x|x ≥ t) = a e^{−ax}/e^{−at} = a e^{−a(x−t)}    x > t   (6-64)
Thus, in this case, the probability that a system functioning at time t will fail in the interval (x, x + dx) depends only on the difference x − t. As we shall see, this is true only if f(x) is an exponential. •
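The memoryless property expressed by (6-64) can be seen in a simulation; in the following Python sketch the failure rate a = .5 and the conditioning times are illustrative choices. The residual life x − t of the units still good at time t has the same mean 1/a regardless of t.

import numpy as np

rng = np.random.default_rng(5)
a = 0.5
x = rng.exponential(1 / a, 1_000_000)    # exponential times to failure

for t in (0.0, 1.0, 3.0):
    residual = x[x > t] - t              # remaining life of units good at time t
    print(t, residual.mean())            # all ~ 1/a = 2, independent of t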
Figure 6.11
Conditional Failure Rate
The conditional density f(x|x ≥ t) is a function of x and t. Its value at x = t is a function of t
    β(t) = f(t|x ≥ t)   (6-65)
known as the conditional rate of failure or hazard rate. The product β(t)dt is the probability that a system functioning at time t will fail in the interval (t, t + dt). Since f(x) = F′(x) = −R′(x), (6-63) yields
    β(t) = F′(t)/(1 − F(t)) = −R′(t)/R(t)   (6-66)

Example 6.12
The time to failure is an RV x uniformly distributed in the interval (0, T) (Fig. 6.12). In this case, F(x) = x/T, R(t) = 1 − t/T, and (6-65) yields
    β(t) = (1/T)/(1 − t/T) = 1/(T − t)    0 ≤ t < T •
Using (6-66), we shall express F(x) in terms of β(t).
• Theorem
    1 − F(x) = R(x) = exp{−∫₀^x β(t) dt}   (6-67)

Figure 6.12: f(x), F(x), and β(t) for the uniform time to failure.

• Proof. Integrating (6-66) from 0 to x and using the fact that F(0) = 1 − R(0) = 0, we obtain
    −∫₀^x β(t) dt = ln R(x)
and (6-67) follows.
To express f(x) in terms of β(t), we differentiate (6-67). This yields
    f(x) = β(x) exp{−∫₀^x β(t) dt}   (6-68)
Note that the function β(t) equals the value of the density f(x|x ≥ t) for x = t; however, β(t) is not a density. A conditional density f(x|M) has all the properties of a density only if the event M is fixed (it does not depend on x). Thus f(x|x ≥ t) considered as a function of x is a density; however, the function β(t) = f(t|x ≥ t) does not have the properties of densities. In fact, its area is infinite for any F(x). This follows from (6-67) because F(∞) = 1.
EXPECTED FAILURE RATE The probability that a given unit functioning at time t will fail in the interval (t, t + δ) equals P{t < x ≤ t + δ | x > t}. Hence, for small δ,
    β(t)δ ≈ P{t < x ≤ t + δ | x > t}   (6-69)
This has the following empirical interpretation: Suppose, to be concrete, that x is the time to failure of a light bulb. We turn on n bulbs at t = 0 and denote by n_t the number of bulbs that are still good at time t and by Δn_t the number of bulbs that fail in the interval (t, t + δ). As we know [see (2-54)], the conditional probability β(t)δ equals the relative frequency of failures in the interval (t, t + δ) in the subsequence of trials involving only bulbs that are still good at time t. Hence
    β(t)δ ≈ Δn_t/n_t   (6-70)
Equation (6-69) is a probabilistic statement involving a single component; (6-70) is an empirical statement involving a large number of components. In the following, we give a probabilistic interpretation of (6-70) in terms of N components, where N is any number, large or small.
Suppose that a system consists of N components and that the ith component is modeled by an RV x_i with R_i(t) = P{x_i > t}. The number of units that are still good at time t is an RV n(t) depending on t. We maintain that its expected value equals
    η(t) = E{n(t)} = R₁(t) + · · · + R_N(t)   (6-71)
• Proof. We denote by y_i the zero-one RV associated with the event {x_i > t}. Thus y_i = 1 if x_i > t and 0 if x_i ≤ t; hence, n(t) = y₁ + · · · + y_N. This yields
    E{n(t)} = Σ_{i=1}^{N} E{y_i} = Σ_{i=1}^{N} P{x_i > t}
and (6-71) results.
Suppose now that all components are equally reliable. In this case,
    R₁(t) = · · · = R_N(t) = R(t)    η(t) = N R(t)   (6-72)
and (6-71) yields [see (6-66)]
    β(t) = −R′(t)/R(t) = −η′(t)/η(t)   (6-73)
Thus we have a new interpretation of β(t): The product β(t)dt equals the ratio of the expected number η(t) − η(t + dt) of failures in the interval (t, t + dt) divided by the expected number η(t) of good components at time t. This interpretation justifies the term expected failure rate used to characterize the conditional failure rate β(t). Note, finally, that (6-73) is the probabilistic interpretation of (6-70).
Weibull Distribution A special case of particular interest in reliability studies is the function
    β(t) = c t^{b−1}   (6-74)
This is a satisfactory approximation of a failure rate for most applications, at least for values of t near t = 0. The density of the corresponding time to failure is given by [see (6-68)]
    f(x) = c x^{b−1} e^{−cx^b/b} U(x)   (6-75)
This function is called the Weibull density. It depends on the two parameters c and b, and its first two moments equal
    E{x} = (b/c)^{1/b} Γ((b + 1)/b)    E{x²} = (b/c)^{2/b} Γ((b + 2)/b)
where Γ(x) is the gamma function. This follows from (4-54) with x^b = y. In Fig. 6.13, we show f(x) and β(t) for b = 1, 2, and 3. If b = 1, then β(t) = c = constant and f(x) = c e^{−cx} U(x). This case has the following interesting property.
Figure 6.13: Weibull densities f(x) and failure rates β(t) for c = 1 and b = 1, 2, 3.
Memoryless Systems We shall say that a system is memoryless if the probability that it fails in an interval (t, x), assuming that it functions at time t, depends only on the length x − t of this interval. This is equivalent to the assumption that
    f(x|x ≥ t) = f(x − t)  for every x ≥ t   (6-76)
With x = t, it follows from (6-76) and (6-66) that β(t) = f(t|x ≥ t) = f(0) = constant. Thus a system is memoryless iff f(x) is an exponential density or, equivalently, iff its conditional failure rate is constant.
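For a numerical illustration of the Weibull model with hazard rate β(t) = ct^{b−1} (the values c = 1, b = 2 below are arbitrary), the following Python sketch evaluates R(x) from (6-67) and f(x) from (6-68), and checks the mean time to failure against the closed form given above; setting b = 1 reduces it to the constant-rate (memoryless) exponential case.

import numpy as np
from math import gamma

def reliability(x, c, b):
    # R(x) = exp(-integral of beta) = exp(-c*x**b/b) for beta(t) = c*t**(b-1)
    return np.exp(-c * x**b / b)

def density(x, c, b):
    # f(x) = beta(x) * R(x), as in (6-68) and (6-75)
    return c * x**(b - 1) * reliability(x, c, b)

c, b = 1.0, 2.0
grid = np.linspace(1e-6, 10, 20001)
dx = grid[1] - grid[0]
print(np.sum(density(grid, c, b)) * dx)                     # ~ 1: a valid density
print(np.sum(grid * density(grid, c, b)) * dx,              # mean time to failure
      (b / c)**(1 / b) * gamma((b + 1) / b))                # closed form, ~1.2533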
Interconnection of Systems
We determine next the reliability of a system S consisting of two or more components.
Parallel Connection We shall say that two components S₁ and S₂ are connected in parallel, forming a system S, if S functions when at least one of the systems S₁ and S₂ functions (Fig. 6.14a). Denoting by x₁, x₂, and z the times to failure of the systems S₁, S₂, and S, respectively, we conclude that z = t if the larger of the numbers x₁ and x₂ equals t; hence,
    z = max(x₁, x₂)   (6-77)
The distribution F_z(z) of z can be expressed in terms of the joint distribution of x₁ and x₂ as in (5-78). If the systems S₁ and S₂ are independent, that is, if the RVs x₁ and x₂ are independent, then
    F_z(t) = F₁(t)F₂(t)   (6-78)
This follows from (5-78); however, we shall establish it directly: The event {z < t} occurs if S fails prior to time t. This is the case if both systems fail prior to time t, that is, if both events {x₁ < t} and {x₂ < t} occur. Thus {z < t} is the intersection of the events {x₁ < t} and {x₂ < t}. And since these events are independent, (6-78) follows.
We shall say that n systems S_i are connected in parallel, forming a system S, if S functions when at least one of the systems S_i functions. Reasoning as in (6-77) and (6-78), we conclude that if the systems S_i are independent, then
    z = max(x₁, . . . , x_n)    F_z(t) = F₁(t) · · · F_n(t)   (6-79)

Figure 6.14: (a) parallel, (b) series, and (c) stand-by connection.
Series Connection Two systems are connected in series, forming a system S, if S fails when at least one of the systems S₁ and S₂ fails (Fig. 6.14b). Denoting by w the time to failure of the system S, we conclude that w = t if the smaller of the numbers x₁ and x₂ equals t; hence,
    w = min(x₁, x₂)   (6-80)
The reliability R_w(t) = P{w > t} of S can be determined from (5-79). If the systems S₁ and S₂ are independent, then
    R_w(t) = R₁(t)R₂(t)   (6-81)
Indeed, the event {w > t} occurs if S fails after time t. This is the case if both systems fail after t, that is, if both events {x₁ > t} and {x₂ > t} occur. Thus {w > t} = {x₁ > t} ∩ {x₂ > t}; and since the two events on the right are independent, (6-81) follows.
Generalizing, we note that if n independent systems are connected in series, forming a system with time to failure w, then
    w = min(x₁, . . . , x_n)    R_w(t) = R₁(t) · · · R_n(t)   (6-82)
Stand-by Connection We put system S₁ into operation, keeping system S₂ in reserve. When S₁ fails, we put S₂ into operation. When S₂ fails, the system S so formed fails (Fig. 6.14c). Thus if t₁ and t₂ are the times of operation of S₁ and S₂, then t₁ + t₂ is the time of operation of S. Denoting by s its time to failure, we conclude that
    s = x₁ + x₂   (6-83)
The density f_s(s) of s is obtained from (5-90). If the systems S₁ and S₂ are independent, then [see (5-97)] f_s(t) equals the convolution of the densities f₁(t) and f₂(t):
    f_s(t) = ∫₀^t f₁(z) f₂(t − z) dz = f₁(t) * f₂(t)   (6-84)
Note, finally, that if S is formed by the stand-by connection of n independent systems S_i, then
    s = x₁ + · · · + x_n    f_s(t) = f₁(t) * · · · * f_n(t)   (6-85)
Example 6.13
We connect n identical, independent systems in series. Find the mean time to failure of the system so formed if their conditional failure rate is constant.
In this problem,
    β_i(t) = c    R_i(t) = e^{−ct}    R(t) = R₁(t) · · · R_n(t) = e^{−nct}
[see (6-82)]. Hence,
    E{w} = ∫₀^∞ e^{−nct} dt = 1/(nc) •
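The mean time to failure 1/(nc) is easy to check by simulation; the Python sketch below uses the illustrative values n = 4 and c = .25.

import numpy as np

rng = np.random.default_rng(6)
n, c = 4, 0.25
x = rng.exponential(1 / c, size=(500_000, n))   # component lifetimes, constant rate c
w = x.min(axis=1)                               # series system fails at the first failure
print(w.mean(), 1 / (n * c))                    # both ~ 1.0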
TIME-INDEPENDENT SYSTEMS A time-independent system is either good or defective at all times. The probability p = 1 − q that the system is good is called system reliability. Problems involving interconnections of time-independent or time-dependent systems are equivalent if time is only a parameter. This is the case for series-parallel but not for stand-by connections. Thus to find the time-independent form of (6-78) and (6-81), we set p = R(t) and q = F(t).
System interconnections are represented by linear graphs involving links and nodes. A link represents a component and is closed if the component is good, open if it is defective. A system has an input node and an output node. It is good if it contains one or more connected paths linking the input to the output. To find the reliability p of a system, we must trace all paths from the input to the output and find the probability that at least one is connected. In general, this is not a simple task. The problem is simplified if we consider only series-parallel connections. Here are the reliabilities p = 1 − q of the four systems of Fig. 6.15:
    q_a = q₁q₂    p_b = p₁p₂    q_c = (1 − p₁p₂)q₃    p_d = (1 − q₁q₂)p₃
Figure 6.15: the four series-parallel systems (a)-(d).

Structure Function The state of the system S is specified in terms of an RV y taking the values 1 and 0 with probabilities
    P{y = 1} = p    P{y = 0} = q
where p = 1 − q is the reliability of the system. This RV will be called the state variable. Thus a state variable is the zero-one RV associated with the event {good}. Suppose that S consists of n components S_i with state variables
y_i. Clearly, y is a function
    y = ψ(y₁, . . . , y_n)
of the variables y_i, called the structure function of S. Here are its values for parallel and series connections:
Parallel Connection From (6-79) it follows that
    ψ(y₁, . . . , y_n) = max y_i = 1 − (1 − y₁) · · · (1 − y_n)   (6-86)
Series Connection From (6-82) it follows that
    ψ(y₁, . . . , y_n) = min y_i = y₁ y₂ · · · y_n   (6-87)
To determine the structure function of a system, we identify all paths from the input to the output and use (6-86) and (6-87). We give next an illustration.
Example 6.14
The structure on the left of Fig. 6.16 is called a bridge. It consists of four paths forming four subsystems as shown. From (6-87) it follows that the structure functions of these subsystems equal
    y₁y₂    y₃y₄    y₁y₄y₅    y₂y₃y₅
respectively. The bridge is good if at least one of the subsystems is good. Hence [see (6-86)],
    ψ(y₁, . . . , y₅) = max(y₁y₂, y₃y₄, y₁y₄y₅, y₂y₃y₅)
                      = 1 − (1 − y₁y₂)(1 − y₃y₄)(1 − y₁y₄y₅)(1 − y₂y₃y₅)
The determination of the reliability of this bridge is discussed in Problem 6-17. •

Figure 6.16: the bridge structure and its four paths.
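A Monte Carlo estimate of the bridge reliability (the subject of Problem 6-17) can be obtained directly from the structure function above; in the Python sketch below, the common component reliability p = .9 is an illustrative choice.

import numpy as np

rng = np.random.default_rng(7)
p, trials = 0.9, 500_000
y = (rng.random((5, trials)) < p).astype(np.int8)   # state variables y1..y5
y1, y2, y3, y4, y5 = y
psi = 1 - (1 - y1*y2) * (1 - y3*y4) * (1 - y1*y4*y5) * (1 - y2*y3*y5)
print(psi.mean())     # estimated bridge reliability; ~0.978 for p = 0.9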
Problems
6-1 We are given 10 coins; one has two heads, and nine are fair. We pick one at random and toss it three times. (a) Find the probability p₁ that at the first toss, heads will show. (b) We observe that the first three tosses showed heads; find the probability p₂ that at the next toss, heads will show.
6-2 Suppose that the age (in years) of the males in a community is an RV x with density .03e^{−.03x}. (a) Find the percentage of males between 20 and 50. (b) Find the average age of males between 20 and 50.
6-3 Given two independent N(0, 2) RVs x and y, we form the RVs z = 2x + y, w = x − y. Find the conditional density f(z|w) and the conditional mean E{z|w = 5}.
6-4 If F(x) = (1 − e^{−.2x})U(x), find the conditional probabilities
    p₁ = P{x < 5 | 3 < x < 6}    p₂ = P{x > 5 | 3 < x < 6}
6-5 The RVs x and y have a uniform joint density in the shaded region of Fig. P6.5 between the parabola y = x(2 − x) and the x-axis. Find the regression line φ(x) = E{y|x}.
Figure P6.5: the region between y = x(2 − x) and the x-axis.
6-6 Suppose that x and y are two normal RVs such that η_x = η_y = 0, σ_x = 2, σ_y = 4, r_xy = .5. (a) Find the regression line E{y|x} = φ(x). (b) Show that the RVs x and y − φ(x) are independent.
6-7 Using the regression line φ(x) = E{y|x}, we form the RV z = φ(x). Show that if ax is the homogeneous linear MS estimate of the RV y in terms of x and Ax is the corresponding estimate of z in terms of x, then A = a.
6-8 The length of rods coming out of a production line is modeled by an RV c uniform in the interval (10, 12). We measure each rod and obtain the RV x = c + ν where ν is the error, which we assume uniform in the interval (−0.2, 0.2) and independent of c. (a) Find the conditional density f(x|c), the joint density f(x, c), and the marginal density f(x). (b) Find the LMS estimate ĉ = E{c|x} of the length c of the received rods in terms of its measured value x.
6-9 We toss a coin 18 times, and heads shows 11 times. The probability p of heads
is an RV p with density f(p). Show that the LMS estimate p̂ of p equals
    p̂ = γ ∫₀¹ p^{12}(1 − p)^{7} f(p) dp
Find γ if f(p) = 1.
6-10 The time to failure of electric bulbs is an RV x with density ce^{−cx}U(x). A box contains 200 bulbs; of these, 50 are of type A with c = 4 per year and 150 are of type B with c = 6 per year. A bulb is selected at random. (a) Using (6-7), find the probability that it will function at time t. (b) The selected bulb has lasted three months; find the probability that it is of type A.
6-11 We are given three systems with reliabilities p₁ = 1 − q₁, p₂ = 1 − q₂, p₃ = 1 − q₃, and we combine them in various ways forming a new system S with reliability p = 1 − q. Find the form of S such that: (a) p = p₁p₂p₃; (b) q = q₁q₂q₃; (c) p = p₁(1 − q₂q₃); (d) q = q₁(1 − p₂p₃).
6-12 The hazard rate of a system is β(t) = ct/(1 + ct); find its reliability R(t).
6-13 Find and sketch the reliability R(t) of a system if its hazard rate equals β(t) = 6U(t) + 2U(t − t₀). Find the mean time to failure of the system.
6-14 (a) Show that if w = min(x, y), then
    P{x ≥ t | w ≤ t} = (F_y(t) − F_xy(t, t))/(F_x(t) + F_y(t) − F_xy(t, t))
(b) Two independent systems S_x and S_y are connected in series. The resulting system S_w failed prior to t. Find the probability p that S_x is working at time t. System S_w fails at time t. Find the probability p₁ that S_x is still working.
6-15 We connect n independent systems with hazard rates β_i(t) in series forming a system S with hazard rate β(t). Show that β(t) = β₁(t) + · · · + β_n(t).
6-16 Given four independent systems with the same reliability R(t), form a new system with reliability 1 − [1 − R²(t)]².
6-17 Find the reliability of the bridge of Fig. 6.16.
6-18 (a) Find the structure function of the systems of Fig. 6.14. (b) Given four components with state variables y_i, form a system with structure function
    ψ(y₁, y₂, y₃, y₄) = 1 if y₁ + y₂ + y₃ + y₄ ≥ 2, and 0 otherwise
7
Sequences of Random
Variables
We extend all concepts introduced in the earlier chapters to an arbitrary
number of RVs and develop several applications including sample spaces.
measurement errors. and order statistics. In Section 7-3 we introduce the
notion of a random sequence, the various interpretations of convergence,
and the central limit theorem. In the last section. we develop the chi-square.
Student t, and Snedecor distributions. In the appendix. we establish the
relationship between the chi-square distribution and various quadratic forms
involving normal RVs, and we discuss the noncentral character of the results.
7-1
General Concepts
All concepts developed earlier in the context of one and two RVs can be readily extended to an arbitrary number of RVs. We give here a brief summary. Unless otherwise stated, we assume that all RVs are of continuous type.
Consider n RVs x₁, . . . , x_n defined as in Section 4-1. The joint distribution of these RVs is the function
    F(x₁, . . . , x_n) = P{x₁ ≤ x₁, . . . , x_n ≤ x_n}   (7-1)
specified for every x_i from −∞ to ∞. This function is increasing as x_i increases, and F(∞, . . . , ∞) = 1. Its derivative
    f(x₁, . . . , x_n) = ∂ⁿF(x₁, . . . , x_n)/(∂x₁ · · · ∂x_n)   (7-2)
is the joint density of the RVs x_i.
All probabilistic statements involving the RVs x_i can be expressed in terms of their joint distribution. Thus the probability that the point (x₁, . . . , x_n) is in a region D of the n-dimensional space equals
    P{(x₁, . . . , x_n) ∈ D} = ∫_D · · · ∫ f(x₁, . . . , x_n) dx₁ · · · dx_n   (7-3)
If we substitute certain variables in F(x₁, . . . , x_n) by ∞, we obtain the joint distribution of the remaining variables. If we integrate f(x₁, . . . , x_n) with respect to certain variables, we obtain the joint density of the remaining variables. For example,
    F(x₁, x₃) = F(x₁, ∞, x₃, ∞)
    f(x₁, x₃) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x₁, x₂, x₃, x₄) dx₂ dx₄   (7-4)
The composite functions
    y₁ = g₁(x₁, . . . , x_n), . . . , y_n = g_n(x₁, . . . , x_n)
specify n RVs y₁, . . . , y_n. To determine the joint density of these RVs, we proceed as in Section 5-3. We solve the system
    g₁(x₁, . . . , x_n) = y₁, . . . , g_n(x₁, . . . , x_n) = y_n   (7-5)
for x_i in terms of y_i. If this system has no solution for certain values of y_i, then f_y(y₁, . . . , y_n) = 0 for these values. If it has a single solution (x₁, . . . , x_n), then
    f_y(y₁, . . . , y_n) = f(x₁, . . . , x_n)/|J(x₁, . . . , x_n)|   (7-6)
where
    J(x₁, . . . , x_n) = det[∂g_i/∂x_j]    i, j = 1, . . . , n   (7-7)
is the Jacobian of the transformation (7-5). If it has several solutions, we add the corresponding terms as in (4-78).
We can use the preceding result to find the joint density of r < n functions y₁, . . . , y_r of the n RVs x_i. To do so, we introduce n − r auxiliary variables y_{r+1} = x_{r+1}, . . . , y_n = x_n, find the joint density of the n RVs y₁, . . . , y_n using (7-6), and find the marginal density of y₁, . . . , y_r by integration as in (7-4).
The RVs x_i are called (mutually) independent if the events {x_i ≤ x_i} are independent for every x_i. From this it follows that
    F(x₁, . . . , x_n) = F(x₁) · · · F(x_n)
    f(x₁, . . . , x_n) = f(x₁) · · · f(x_n)   (7-8)
Any subset of a set of independent RVs is itself a set of independent RVs. Suppose that the RVs x₁, x₂, and x₃ are independent. In this case,
    f(x₁, x₂, x₃) = f(x₁)f(x₂)f(x₃)
Integrating with respect to x₃, we obtain f(x₁, x₂) = f(x₁)f(x₂). This shows that the RVs x₁ and x₂ are independent. Note that if the RVs x_i are independent in pairs, they are not necessarily independent.
Suppose that the RV y_i is a function g_i(x_i) depending on the RV x_i only. Reasoning as in (5-26), we can show that if the RVs x_i are independent, the RVs y_i = g_i(x_i) are also independent.
Note, finally, that the mean of the RV z = g(x₁, . . . , x_n) equals
    E{g(x₁, . . . , x_n)} = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(x₁, . . . , x_n) f(x₁, . . . , x_n) dx₁ · · · dx_n
as in (5-30). From this it follows that
    E{Σ_k a_k g_k(x₁, . . . , x_n)} = Σ_k a_k E{g_k(x₁, . . . , x_n)}   (7-9)
The covariance μ_ij of the RVs x_i and x_j equals [see (5-34)]
    μ_ij = E{(x_i − η_i)(x_j − η_j)} = E{x_i x_j} − η_i η_j
where η_i = E{x_i}. The n-by-n matrix
    C = [μ_ij] = ( μ₁₁  μ₁₂  · · ·  μ₁ₙ
                   μ₂₁  μ₂₂  · · ·  μ₂ₙ
                   · · ·
                   μₙ₁  μₙ₂  · · ·  μₙₙ )   (7-10)
is called the covariance matrix of the n RVs x_i.
We shall say that the RVs x_i are uncorrelated if μ_ij = 0 for every i ≠ j. In this case, if
    z = a₁x₁ + · · · + a_nx_n
then
    σ_z² = a₁²σ₁² + · · · + a_n²σ_n²   (7-11)
where σ_i² = μ_ii is the variance of x_i.
Note, finally, that if the RVs x_i are independent, then
    ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g₁(x₁) · · · g_n(x_n) f(x₁) · · · f(x_n) dx₁ · · · dx_n = ∫_{−∞}^{∞} g₁(x₁)f(x₁) dx₁ · · · ∫_{−∞}^{∞} g_n(x_n)f(x_n) dx_n
Hence,
    E{g₁(x₁) · · · g_n(x_n)} = E{g₁(x₁)} · · · E{g_n(x_n)}   (7-12)
As in (5-56), the expression
    Φ(s₁, . . . , s_n) = E{exp(s₁x₁ + · · · + s_nx_n)}   (7-13)
will be called the joint moment function of the RVs x_i. From (7-12) it follows that if the RVs x_i are independent, then
    Φ(s₁, . . . , s_n) = E{e^{s₁x₁}} · · · E{e^{s_nx_n}} = Φ(s₁) · · · Φ(s_n)   (7-14)
where Φ(s_i) is the moment function of x_i.
Normal RVs
We shall say that the n RVs x_i are jointly normal if the RV
    z = a₁x₁ + · · · + a_nx_n
is normal for any a_i. Introducing a shift if necessary, we shall assume that E{x_i} = 0. Reasoning as in (5-98) and (5-99), we can show that the joint density and the joint moment function of the RVs x_i are exponentials the exponents of which are quadratics in x_i and s_i, respectively. Specifically,
    Φ(s₁, . . . , s_n) = exp{(1/2) Σ_{i,j=1}^{n} μ_ij s_i s_j}   (7-15)
    f(x₁, . . . , x_n) = (1/√((2π)ⁿ Δ)) exp{−(1/2) Σ_{i,j=1}^{n} γ_ij x_i x_j}   (7-16)
where μ_ij are the elements of the covariance matrix C of x_i, γ_ij are the elements of the inverse C⁻¹ of C, and Δ is the determinant of C. We shall verify (7-16) for n = 2. The proof of the general case will not be given.
If n = 2, then μ₁₁ = σ₁², μ₁₂ = rσ₁σ₂, μ₂₂ = σ₂², and
    C = ( σ₁²      rσ₁σ₂
          rσ₁σ₂    σ₂²  )    C⁻¹ = (1/Δ) ( σ₂²      −rσ₁σ₂
                                           −rσ₁σ₂    σ₁²  )    Δ = σ₁²σ₂²(1 − r²)
In this case, the sum in (7-16) equals
    γ₁₁x₁² + 2γ₁₂x₁x₂ + γ₂₂x₂² = (1/Δ)(σ₂²x₁² − 2rσ₁σ₂x₁x₂ + σ₁²x₂²)
in agreement with (5-99).
If the matrix C is diagonal, that is, if μ_ij = 0 for i ≠ j, then γ_ij = 0 for i ≠ j and γ_ii = 1/σ_i²; hence,
    f(x₁, . . . , x_n) = (1/((2π)^{n/2} σ₁ · · · σ_n)) exp{−(1/2)(x₁²/σ₁² + · · · + x_n²/σ_n²)}   (7-17)
Thus if the RVs x_i are normal and uncorrelated, they are independent.
Suppose, finally, that the RVs y_i are linearly dependent on the normal RVs x_i. Using (7-6), we conclude that the joint density of the RVs y_i is an exponential with a quadratic in the exponent as in (7-16); hence, the RVs y_i are jointly normal.
Conditional Distributions
In Section 6-2 we showed that the conditional density of y assuming x = x, defined as a limit, equals
    f(y|x) = f(x, y)/f(x)
Proceeding similarly, we can show that the conditional density of the RVs x_n, . . . , x_{k+1}, assuming x_k = x_k, . . . , x₁ = x₁, equals
    f(x_n, . . . , x_{k+1} | x_k, . . . , x₁) = f(x₁, . . . , x_n)/f(x₁, . . . , x_k)   (7-18)
For example,
    f(x₃|x₂, x₁) = f(x₁, x₂, x₃)/f(x₁, x₂)
Repeated application of (7-18) leads to the chain rule:
    f(x₁, . . . , x_n) = f(x_n|x_{n−1}, . . . , x₁) · · · f(x₂|x₁) f(x₁)   (7-19)
We give next a rule for removing variables on the left or on the right of
the conditional line. Consider the identity

f(x3, x2 | x1) = f(x1, x2, x3) / f(x1)

Integrating with respect to x2, we obtain

∫ f(x3, x2 | x1) dx2 = (1/f(x1)) ∫ f(x1, x2, x3) dx2 = f(x1, x3)/f(x1) = f(x3 | x1)

Thus to remove one or more variables on the left of the conditional line, we
integrate with respect to these variables. This is the relationship between
marginal and joint densities [see (7-4)] extended to conditional densities.
We shall next remove the right variables x2 and x3 from the conditional
density f(x4 | x3, x2, x1). Clearly,

f(x4 | x3, x2, x1) f(x3, x2 | x1) = f(x4, x3, x2 | x1)

We integrate both sides with respect to x2 and x3. The integration of the right
side removes the left variables x2 and x3, leaving f(x4 | x1). Hence,

∫∫ f(x4 | x3, x2, x1) f(x3, x2 | x1) dx3 dx2 = f(x4 | x1)

Thus to remove one or more variables on the right of the conditional line, we
multiply by the conditional density of these variables, assuming the remaining
variables, and integrate. The following case, known as the Chapman-Kolmogorov
equation, is of special interest:

∫ f(x3 | x2, x1) f(x2 | x1) dx2 = f(x3 | x1)
The conditional mean of the RV y assuming xi = xi equals

E{y | x1, . . . , xn} = ∫ y f(y | x1, . . . , xn) dy     (7-20)

This integral is a function φ(x1, . . . , xn) of the xi. Using this function, we form
the RV

φ(x1, . . . , xn) = E{y | x1, . . . , xn}

The mean of this RV equals

∫ ··· ∫ φ(x1, . . . , xn) f(x1, . . . , xn) dx1 ··· dxn

Inserting (7-20) into this equation and using the identity

f(x1, . . . , xn, y) = f(y | x1, . . . , xn) f(x1, . . . , xn)

we conclude that

E{E{y | x1, . . . , xn}} = ∫ ··· ∫ y f(x1, . . . , xn, y) dx1 ··· dxn dy = E{y}     (7-21)

The function φ(x1, . . . , xn) is the generalization of the regression
curve φ(x) introduced in (6-39), and it is the nonlinear LMS predictor of y in
terms of the RVs xi (see Section 11-4).
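Equation (7-21), the statement that the mean of the conditional mean equals the unconditional mean, lends itself to a quick simulation check. The sketch below (Python; an illustration only, and the particular densities are my own choice, not taken from the text) draws x, then y conditionally on x, and compares E{y} estimated directly with the average of the conditional means φ(x) = E{y | x}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed toy model: x uniform in (0, 1); given x = x, y is normal
# with conditional mean phi(x) = 2x + 1 and unit variance.
x = rng.uniform(0.0, 1.0, size=n)
y = rng.normal(loc=2*x + 1, scale=1.0)

phi = 2*x + 1                      # conditional mean E{y | x}
print(y.mean())                    # direct estimate of E{y}
print(phi.mean())                  # E{E{y | x}}; matches by (7-21), both near 2
```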
Sampling
We are given an RV x with distribution F(x) and density f(x), defined on an
experiment 𝒮. Using this RV, we shall form n independent and identically
distributed (i.i.d.) RVs

x1, . . . , xi, . . . , xn     (7-22)

with distribution equal to the distribution F(x) of the RV x:

F1(x) = ··· = Fn(x) = F(x)     (7-23)

These RVs are defined not on the original experiment 𝒮 but on the product
space

𝒮n = 𝒮 × ··· × 𝒮

consisting of n independent repetitions of 𝒮. As we explained in Section 3-1,
the outcomes of this experiment are sequences of the form

ζ = ζ1 ··· ζi ··· ζn     (7-24)

where ζi is any one of the elements of 𝒮. The ith RV xi in (7-22) is so
constructed that its value xi(ζ) equals the value x(ζi) of the given RV x:

xi(ζ1 ··· ζi ··· ζn) = x(ζi)     (7-25)

Thus the values of xi depend only on the outcome ζi of the ith trial, and its
distribution equals

Fi(x) = P{xi ≤ x} = P{x ≤ x} = F(x)     (7-26)

This yields

E{xi} = E{x} = η     (7-27)

Note further that the events {xi ≤ xi} are independent because 𝒮n is a
product space of n independent trials; hence, the n RVs so constructed are
i.i.d. with joint distribution

F(x1, . . . , xn) = F(x1) ··· F(xn)     (7-28)

Thus, starting from an RV x defined on an experiment 𝒮, we formed the
product space 𝒮n and the n RVs xi. We shall call the set 𝒮n the sample space and
the n RVs xi a random sample of size n. This construction is called a sampling
of a population. In the context of the theory, "population" is a model
concept. This concept might be an abstraction of an existing population or of
the repetition of a real experiment.
As we show in Part Two, the concept of sampling is fundamental in
statistics. We give next an illustration in the context of a simple problem.
Sample Mean
The arithmetic mean

x̄ = (x1 + ··· + xn)/n     (7-29)

of the samples xi is called their sample mean. The RVs xi are independent
with the same mean and variance; hence, they are uncorrelated and [see (7-11)]

E{x̄} = (η + ··· + η)/n = η     σx̄² = (σ² + ··· + σ²)/n² = σ²/n     (7-30)

Thus x̄ has the same mean as the original RV x, but its variance is σ²/n. We
shall use this observation to estimate η.
The RV x is defined on the experiment 𝒮. To estimate η, we perform the
experiment once and observe the number x = x(ζ). What can we say about η
from this observation? As we know from Tchebycheff's inequality (4-115),

P{η - 10σ ≤ x ≤ η + 10σ} ≥ .99     (7-31)

This shows that the event {η - 10σ ≤ x ≤ η + 10σ} will almost certainly occur
at a single trial, and it leads to the conclusion that the observed value x of the
RV x is between η - 10σ and η + 10σ or, equivalently, that

x - 10σ < η < x + 10σ     (7-32)

This conclusion is useful only if 10σ << η. If this is not the case, a single
observation of x cannot give us an adequate estimate of η. To improve the
estimation, we repeat the experiment n times and form the arithmetic average
x̄ of the samples xi of x. In the context of the model, x̄ is the observed value
of the sample mean x̄ of x obtained by a single performance of the experiment 𝒮n.
As we know, the mean of x̄ is η, and its variance equals σ²/n; hence,

P{η - 10σ/√n ≤ x̄ ≤ η + 10σ/√n} ≥ .99

Replacing x by x̄ and σ by σ/√n in (7-31), we conclude with probability .99
that the observed sample mean x̄ is between η - 10σ/√n and η + 10σ/√n
or, equivalently, the unknown η is between x̄ - 10σ/√n and x̄ + 10σ/√n.
Therefore, if n is sufficiently large, we can claim with near certainty that η ≈
x̄. This topic is developed in detail in Chapter 9.
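The 1/n shrinkage of the variance in (7-30) is easy to see numerically. The sketch below (Python; the exponential population and the sample size are illustrative choices, not part of the text) draws many samples of size n, computes their sample means, and compares the empirical variance of x̄ with σ²/n.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                      # population variance (exponential with mean 2)
n, trials = 25, 100_000

# Each row is one random sample of size n; row means are realizations of x-bar
samples = rng.exponential(scale=2.0, size=(trials, n))
xbar = samples.mean(axis=1)

print(xbar.var())                 # empirical variance of the sample mean
print(sigma2 / n)                 # theoretical value sigma^2 / n = 0.16
```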
7-2
Applications
We start with a problem taken from the theory of measurements. The distance c between two points is measured at various times with instruments of
different accuracies, and the results of the measurements are n numbers xi.
What is our best estimate of c?
One might argue that we should accept the reading obtained with the
most accurate instrument, ignoring all other readings. Another choice might
be the weighted average of all readings, with the more accurate instruments
assigned larger weights. The choice depends on a variety of factors involving
the nature of the errors and the optimality criteria. We shall solve this
problem according to the following model.
The ith measurement is a sum

xi = c + νi     (7-33)

where νi is the measurement error. The RVs νi are independent with mean
zero and variance σi². The assumption that E{νi} = 0 indicates that the
instruments do not introduce systematic errors. We thus have n RVs xi with
mean c and variance σi², and our problem is to find the best estimate of c in
terms of the n numbers σi, which we assume known, and the n observed
values xi of the RVs xi. If the instruments have the same accuracy, that is, if
σi = σ = constant, then, following the reasoning of Section 7-1, we use as
the estimate of c the arithmetic mean of the xi. This, however, is not best if the
accuracies differ. Guided by (7-30), we shall use as our estimate the value of
an RV ĉ with mean the unknown c and variance as small as possible. In the
terminology of Chapter 9, this RV will be called the unbiased minimum
variance estimator of c. To simplify the problem, we shall make the additional
assumption that ĉ is the weighted average

ĉ = γ1x1 + ··· + γnxn     (7-34)

of the n measurements xi. Thus our problem is to find the n constants γi such
that E{ĉ} = c and Var ĉ is minimum. Since E{xi} = c, the first condition yields

γ1 + ··· + γn = 1     (7-35)

From (7-11) and the independence of the RVs xi it follows that the
variance of ĉ equals

V = γ1²σ1² + ··· + γn²σn²     (7-36)
Hence, our objective is to minimize the sum in (7-36) subject to the constraint (7-35). From those two equations it follows that

V = γ1²σ1² + ··· + γi²σi² + ··· + (1 - γ1 - ··· - γ_{n-1})²σn²

This is minimum if

∂V/∂γi = 2γiσi² - 2(1 - γ1 - ··· - γ_{n-1})σn² = 0     i = 1, . . . , n - 1

And since the expression in parentheses equals γn, the equation yields

γiσi² = γnσn²

Combining with (7-36), we obtain

γi = V/σi²     1/V = 1/σ1² + ··· + 1/σn²     (7-37)

We thus reach the reasonable conclusion that the best estimate of c is
the weighted average

ĉ = (x1/σ1² + ··· + xn/σn²) / (1/σ1² + ··· + 1/σn²)     (7-38)

where the weights γi are inversely proportional to the variances of the instrument errors.
Example 7.1
The length c of an object is measured with three instruments. The resulting measurements are

x1 = 84     x2 = 85     x3 = 87

and the standard deviations of the errors equal 1, 1.2, and 1.5, respectively. Inserting
into (7-38), we conclude that the best estimate of c equals

ĉ = (x1 + x2/1.44 + x3/2.25) / (1 + 1/1.44 + 1/2.25) ≈ 84.9     •
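A direct computation of the weighted estimate (7-38) takes a few lines. The sketch below (Python; an illustration of the formula, not part of the original text) forms the weights 1/σi², normalizes them, and returns both the estimate and its variance V from (7-37), using the data of Example 7.1.

```python
import numpy as np

def min_variance_estimate(x, sigma):
    """Unbiased minimum-variance weighted estimate (7-38) and its variance (7-37)."""
    x, sigma = np.asarray(x, float), np.asarray(sigma, float)
    w = 1.0 / sigma**2                 # unnormalized weights 1/sigma_i^2
    V = 1.0 / w.sum()                  # variance of the estimate, (7-37)
    c_hat = V * (w * x).sum()          # weighted average, (7-38)
    return c_hat, V

# Data of Example 7.1
c_hat, V = min_variance_estimate([84, 85, 87], [1.0, 1.2, 1.5])
print(round(c_hat, 2), round(V, 2))    # about 84.95 and 0.47
```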
We assumed that the constants σi (instrument errors) are known. However, as we see from (7-38), what we need is only their ratios. As the next
example suggests, this case is not uncommon.

Example 7.2
A pendulum is set into motion at time t = 0 starting from a vertical position as in Fig.
7.1. Its angular motion is a periodic function θ(t) with period T = 2c. We wish to
measure c.
To do so, we measure the first 10 zero crossings of θ(t) using a measuring
instrument with variance σ². The ith measurement is the sum

ti = ic + δi     E{δi} = 0     E{δi²} = σ²

[Figure 7.1]

where δi is the measurement error. Thus ti is an RV with mean ic and variance the
unknown constant σ². The results of the measurement are as follows:

ti = 10.5  20.1  29.6  39.8  50.2  61  69.5  79.1  89.5  99.8

To reduce this problem to the measurement problem considered earlier, we
introduce the RVs

νi = δi/i     xi = ti/i = c + νi

Clearly, E{νi} = 0, E{νi²} = σ²/i²; hence,

E{xi} = c     σ_{xi}² = σ²/i²

Finally,

xi = ti/i = 10.5  10.05  9.87  9.95  10.04  10.17  9.93  9.89  9.94  9.98

Inserting into (7-38) and canceling σ², we obtain

ĉ = (x1 + 4x2 + ··· + 100x10) / (1 + 4 + ··· + 100) ≈ 9.97

Thus the estimate ĉ of c does not depend on σ. •
Random Sums
Given an RV n of discrete type taking the values 1, 2, . . . and a sequence of
RVs x1, x2, . . . , we form the sum

s = Σ_{k=1}^{n} xk     (7-39)

This sum is an RV defined as follows: For a specific ζ, the RV n takes the
value n = n(ζ), and the corresponding value s = s(ζ) of s is the sum of the
numbers xk(ζ) from 1 to n. Thus the outcomes ζ of the underlying experiment
determine not only the values of xk but also the number of terms of the sum.
We maintain that if the RVs xk have the same mean E{xk} = ηx and they
are independent of n, then

E{s} = ηx E{n}     (7-40)

To prove this, we shall use the identity [see (7-21)]

E{s} = E{E{s | n}}

If n = n, then s is a sum of n RVs, and (7-9) yields

E{s | n} = E{ Σ_{k=1}^{n} xk | n } = Σ_{k=1}^{n} E{xk | n} = Σ_{k=1}^{n} E{xk}

The last equality followed from the assumption that the RVs xk are independent
of n. Thus

E{s | n} = ηx n     E{E{s | n}} = E{ηx n} = ηx E{n}

We show next that if the RVs xk are uncorrelated with variance σx², then

E{s²} = ηx² E{n²} + σx² E{n}     (7-41)

Clearly,

s² = Σ_{k=1}^{n} Σ_{m=1}^{n} xk xm     E{s² | n} = Σ_{k=1}^{n} Σ_{m=1}^{n} E{xk xm}

The last summation contains n terms with k = m and n² - n terms with
k ≠ m. And since

E{xk xm} = E{xk²} = σx² + ηx²   for k = m     E{xk xm} = E{xk}E{xm} = ηx²   for k ≠ m

we conclude that

E{s² | n} = (σx² + ηx²)n + ηx²(n² - n) = ηx²n² + σx²n

Taking expected values of both sides, we obtain (7-41).
Note, finally, that under the stated assumptions, the variance of s
equals

σs² = E{s²} - E²{s} = ηx² σn² + σx² E{n}     (7-42)
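Formulas (7-40) and (7-42) can be checked by simulation. In the sketch below (Python; the Poisson count and exponential terms are my own illustrative choices, anticipating Example 7.3), n is Poisson and the xk are i.i.d. and independent of n; the empirical mean and variance of s agree with ηx E{n} and ηx²σn² + σx²E{n}.

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 50_000
lam = 4.0                         # E{n} = Var{n} = lam for a Poisson count
eta_x, var_x = 2.0, 4.0           # exponential terms: mean 2, variance 4

n = rng.poisson(lam, size=trials)
s = np.array([rng.exponential(scale=eta_x, size=k).sum() for k in n])

print(s.mean(), eta_x * lam)                        # (7-40): both near 8
print(s.var(), eta_x**2 * lam + var_x * lam)        # (7-42): both near 32
```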
Example 7.3
The thermal energy w = mv²/2 of a particle is an RV having a gamma density with
parameters

b = 3/2     c = 1/(kT)

where T is the absolute temperature of the particle and k ≈ 1.38 × 10⁻²³ joule/degree
is the Boltzmann constant. From this and (5-65) it follows that

E{w} = b/c = 3kT/2     σw² = b/c² = 3k²T²/2

The number n_t of particles emitted from a radioactive substance in t seconds is a
Poisson RV with parameter a = λt. Find the mean and the variance of the emitted
energy

e = Σ_{i=1}^{n_t} wi

As we know,

E{n_t} = a = λt     σ_{n_t}² = a = λt

Inserting into (7-40) and (7-42), we obtain

E{e} = (3kT/2) λt     σe² = (3kT/2)² λt + (3k²T²/2) λt = (15k²T²/4) λt     •
Order Statistics
We are given a sample of n RVs xi defined on the sample space 𝒮n of repeated
trials as in (7-25). For a specific ζ ∈ 𝒮n, the RVs xi take the values xi = xi(ζ).
Arranging the n numbers xi in increasing order, we obtain the sequence

x_{r1} ≤ x_{r2} ≤ ··· ≤ x_{rn}     (7-43)

We next form the n RVs yi such that their values yi = yi(ζ) equal the
ordered numbers x_{ri}:

y1 = x_{r1} ≤ y2 = x_{r2} ≤ ··· ≤ yn = x_{rn}     (7-44)

Note that for a specific i, the values xi(ζ) of the ith RV xi may occupy
different positions in the ordering as ζ ranges over the space 𝒮n. For example,
the minimum y1(ζ) might equal x2(ζ) for some ζ but x8(ζ) for some other
ζ. The RVs yi so constructed are called the order statistics of the sample xi.
As we show in Section 9-4, they are used in nonparametric estimation of
percentiles.
The kth RV yk in (7-44) is called the kth order statistic. We shall determine
its density fk(y) using the identity

fk(y) dy = P{y < yk ≤ y + dy}     (7-45)

The event 𝒜 = {y < yk ≤ y + dy} occurs if k - 1 of the RVs xi are less
than y, n - k are larger than y + dy, and, consequently, one is in the interval
(y, y + dy). The RVs xi are the samples of an RV x defined on an experiment
𝒮. The sets (Fig. 7.2)

𝒜1 = {x ≤ y}     𝒜2 = {y < x ≤ y + dy}     𝒜3 = {y + dy < x}

are events in 𝒮, and their probabilities equal

P(𝒜1) = p1 = Fx(y)     P(𝒜2) = p2 = fx(y) dy     P(𝒜3) = p3 = 1 - Fx(y + dy)

where Fx(x) is the distribution and fx(x) is the density of x. The events 𝒜1,
𝒜2, 𝒜3 form a partition of 𝒮; therefore, if 𝒮 is repeated n times, the probability
that the events 𝒜i will occur ki times equals [see (3-41)]

n!/(k1! k2! k3!) p1^{k1} p2^{k2} p3^{k3}     k1 + k2 + k3 = n     (7-46)

Clearly, 𝒜 is an event in the sample space 𝒮n, and it occurs if 𝒜1 occurs k - 1
times, 𝒜2 occurs once, and 𝒜3 occurs n - k times. With

k1 = k - 1     k2 = 1     k3 = n - k

(7-46) yields

P{y < yk ≤ y + dy} = n!/((k - 1)! 1! (n - k)!) Fx^{k-1}(y) fx(y) dy [1 - Fx(y + dy)]^{n-k}

Comparing with (7-45), we conclude that

fk(y) = n!/((k - 1)!(n - k)!) Fx^{k-1}(y) [1 - Fx(y)]^{n-k} fx(y)     (7-47)
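Formula (7-47) is straightforward to verify numerically. The sketch below (Python; the uniform population and the particular k, n are illustrative choices) simulates the kth order statistic of a sample and compares its empirical mean with the mean obtained by numerical integration of (7-47).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
n, k = 5, 2                        # sample size and order (illustrative)
trials = 100_000

# Empirical: kth smallest value of each uniform(0,1) sample of size n
samples = rng.uniform(size=(trials, n))
yk = np.sort(samples, axis=1)[:, k - 1]

# Theoretical density (7-47) for the uniform case: F(y) = y, f(y) = 1 on (0, 1)
y = np.linspace(0, 1, 1001)
fk = comb(n, k) * k * y**(k - 1) * (1 - y)**(n - k)   # n!/((k-1)!(n-k)!) y^(k-1) (1-y)^(n-k)

print(yk.mean(), np.trapz(y * fk, y))     # both near k/(n+1) = 1/3
```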
The joint distributions of the order statistics can be determined similarly.
We shall carry out the analysis for the maximum yn = xmax and the
minimum y1 = xmin of the xi.

[Figure 7.2: the events 𝒜1, 𝒜2, 𝒜3 on the x axis]
EXTREME ORDER STATISTICS We shall determine the joint density of the
RVs

z = yn = xmax     w = y1 = xmin     z > w

using the identity

fzw(z, w) dz dw = P{z < z ≤ z + dz, w < w ≤ w + dw}     (7-48)

The event

𝒜 = {z < z ≤ z + dz, w < w ≤ w + dw}     (7-49)

occurs iff the smallest of the RVs xi is in the interval (w, w + dw), the largest
is in the interval (z, z + dz), and, consequently, all others are between w +
dw and z. To find P(𝒜), we introduce the sets

𝒜1 = {x ≤ w}     𝒜2 = {w < x ≤ w + dw}     𝒜3 = {w + dw < x ≤ z}
𝒜4 = {z < x ≤ z + dz}     𝒜5 = {x > z + dz}

These sets form a partition of the space 𝒮, and their probabilities equal

p1 = Fx(w)     p2 = fx(w) dw     p3 = Fx(z) - Fx(w + dw)     p4 = fx(z) dz     p5 = 1 - Fx(z + dz)     (7-50)

Clearly, the set 𝒜 in (7-49) is an event in the sample space 𝒮n, and it occurs
iff 𝒜1 does not occur at all, 𝒜2 occurs once, 𝒜3 occurs n - 2 times, 𝒜4 occurs
once, and 𝒜5 does not occur at all. With

k1 = 0     k2 = 1     k3 = n - 2     k4 = 1     k5 = 0

it follows from (3-41) and (7-50) that

P{z < z ≤ z + dz, w < w ≤ w + dw} = n!/(n - 2)! p2 p3^{n-2} p4
    = n(n - 1) fx(w) dw [Fx(z) - Fx(w + dw)]^{n-2} fx(z) dz

for z > w and zero otherwise. Comparing with (7-48), we conclude that

fzw(z, w) = n(n - 1) fx(z) fx(w) [Fx(z) - Fx(w)]^{n-2}     z > w
fzw(z, w) = 0     z < w     (7-51)

Integrating with respect to w and z respectively, we obtain the marginal
densities fz(z) and fw(w). These densities can be found also from (7-47):
Setting k = n and k = 1 in (7-47), we obtain

fz(z) = n fx(z) Fx^{n-1}(z)     fw(w) = n fx(w) [1 - Fx(w)]^{n-1}     (7-52)
Range and Midrange Using (7-51), we shall determine the densities of the
RVs

r = z - w     s = (z + w)/2

The first is the range of the sample, the second the midpoint between the
maximum and the minimum. The Jacobian of this transformation equals 1.
Solving for z and w, we obtain z = s + r/2, w = s - r/2, and (5-88) yields

frs(r, s) = fzw(s + r/2, s - r/2)     (7-53)
Example 7.4
Suppose that x is uniform in the interval (0, c). In this case, Fx(x) = x/c for 0 < x < c;
hence,

fx(z) = 1/c     Fx(z) - Fx(w) = (z - w)/c

in the shaded area 0 ≤ w ≤ z ≤ c of Fig. 7.3 and zero elsewhere. Inserting into (7-51),
we obtain

fzw(z, w) = n(n - 1)(z - w)^{n-2} / c^n     0 ≤ w ≤ z < c     (7-54)

and zero elsewhere. This yields

fz(z) = n(n - 1)/c^n ∫_0^z (z - w)^{n-2} dw = n z^{n-1}/c^n     0 ≤ z ≤ c

fw(w) = n(n - 1)/c^n ∫_w^c (z - w)^{n-2} dz = n (c - w)^{n-1}/c^n     0 ≤ w ≤ c

For n = 3 the curves of Fig. 7.3 result. Note that

E{z} = nc/(n + 1)     E{w} = c/(n + 1)

[Figure 7.3: the densities fz(z), fw(w), and fr(r) for n = 3]

From (7-53) and (7-54) it follows that

frs(r, s) = n(n - 1) r^{n-2} / c^n

for 0 ≤ s - r/2 ≤ s + r/2 ≤ c and zero elsewhere. Hence,

fr(r) = n(n - 1)/c^n r^{n-2} ∫_{r/2}^{c-r/2} ds = n(n - 1) r^{n-2}(c - r)/c^n     0 ≤ r ≤ c

fs(s) = n(n - 1)/c^n ∫_0^{2s} r^{n-2} dr = n(2s)^{n-1}/c^n     0 ≤ s ≤ c/2
fs(s) = n(n - 1)/c^n ∫_0^{2c-2s} r^{n-2} dr = n(2c - 2s)^{n-1}/c^n     c/2 ≤ s ≤ c

Note that

E{r} = (n - 1)c/(n + 1)     E{s} = c/2     σr² = 2(n - 1)c²/((n + 1)²(n + 2))     σs² = c²/(2(n + 1)(n + 2))     (7-55)     •
Sums of Independent Random Variables
We showed in Section 5-3 that the density of the sum z = x + y of two
independent RVs x and y is the convolution

fz(z) = ∫ fx(z - y) fy(y) dy     (7-56)

of their respective densities [see (5-96)]. Convolution is a binary operation
between two functions, and it is written symbolically in the form

fz(z) = fx(z) * fy(z)

From the definition (7-56) it follows that the operation of convolution is
commutative and associative. This can also be deduced from the identity
[see (5-92)]

Φz(s) = Φx(s) Φy(s)     (7-57)

relating the corresponding moment functions.
Repeated application of (7-57) leads to the conclusion that the density
of the sum

z = x1 + ··· + xn     (7-58)

of n independent RVs xi equals the convolution

fz(z) = f1(z) * ··· * fn(z)     (7-59)

of their respective densities fi(x). From the independence of the RVs e^{sxi} it
follows as in (7-12) that

E{e^{sz}} = E{e^{s(x1 + ··· + xn)}} = E{e^{sx1}} ··· E{e^{sxn}}

Hence,

Φz(s) = Φ1(s) ··· Φn(s)     (7-60)

where Φi(s) is the moment function of xi.
Example 7.5
Using (7-60), we shall show that the convolution of n exponential densities

f1(x) = ··· = fn(x) = c e^{-cx} U(x)

equals the Erlang density

fz(z) = c^n z^{n-1}/(n - 1)! e^{-cz} U(z)     (7-61)

Indeed, the moment function of an exponential equals

Φi(s) = c ∫_0^∞ e^{-cx} e^{sx} dx = c/(c - s)

Hence,

Φz(s) = Φi^n(s) = c^n/(c - s)^n     (7-62)

and (7-61) follows from (5-63) with b = n.
The function fz(z) in (7-56) involves actually infinitely many integrals,
one for each z, and if the integrand consists of several analytic pieces, the
computations might be involved. In such cases, a graphical interpretation of
the integral is a useful technique for determining the integration limits: To
find fz(z) for a specific z, we form the function fx(-y) and shift it z units to the
right; this yields fx(z - y). We then form the product fx(z - y) fy(y) and
integrate. The following example illustrates this.
Example 7.6
The RVs xi are uniformly distributed in the interval (0, 1). We shall evaluate the
density of the sum z = x1 + x2 + x3. As we showed in Example 5.17, the density of
the sum y = x1 + x2 is a triangle as in Fig. 7.4. Since z = y + x3, the density of z is the
convolution of the triangle fy(y) with the pulse f3(x). Clearly, fz(z) = 0 outside the
interval (0, 3) because fy(y) = 0 outside the interval (0, 2) and f3(x) = 0 outside the
interval (0, 1). To find fz(z) for 0 < z < 3, we must consider three cases:
If 0 ≤ z ≤ 1, then

fz(z) = ∫_0^z y dy = z²/2

If 1 ≤ z ≤ 2, then

fz(z) = ∫_{z-1}^1 y dy + ∫_1^z (2 - y) dy = -z² + 3z - 3/2

Finally, if 2 ≤ z ≤ 3, then

fz(z) = ∫_{z-1}^2 (2 - y) dy = z²/2 - 3z + 9/2

In all cases, fz(z) equals the area of the shaded region shown in Fig. 7.4. Thus fz(z)
consists of three parabolic pieces. •

[Figure 7.4: the triangle fy(y), the shifted pulse f3(z - y), and the shaded region whose area equals fz(z)]
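The piecewise-parabolic density of Example 7.6 can be reproduced either by simulating the sum or by evaluating the closed-form pieces. The sketch below (Python; an illustration, not part of the text) does both and compares the empirical density near a few points with the formula.

```python
import numpy as np

def f_z(z):
    """Density of x1 + x2 + x3, xi uniform on (0, 1): the three parabolic pieces."""
    z = np.asarray(z, float)
    out = np.zeros_like(z)
    out = np.where((0 <= z) & (z <= 1), z**2 / 2, out)
    out = np.where((1 < z) & (z <= 2), -z**2 + 3*z - 1.5, out)
    out = np.where((2 < z) & (z <= 3), z**2 / 2 - 3*z + 4.5, out)
    return out

rng = np.random.default_rng(4)
z_samples = rng.uniform(size=(500_000, 3)).sum(axis=1)

# Empirical density near a few points, compared with the formula
for z0 in (0.5, 1.5, 2.5):
    h = 0.05
    emp = np.mean(np.abs(z_samples - z0) < h) / (2*h)
    print(z0, round(emp, 3), round(float(f_z(z0)), 3))
```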
Binomial Distribution Revisited We shall reestablish the binomial distribution
in terms of the sum of n samples xi of an RV x. Consider an event 𝒜 with
probability P(𝒜) = p, defined in an experiment 𝒮. The zero-one RV associated
with this event is by definition an RV x such that

x(ζ) = 1   if ζ ∈ 𝒜     x(ζ) = 0   if ζ ∉ 𝒜

Thus x takes the values 1 and 0 with probabilities p and q, respectively.
Repeating the experiment 𝒮 n times, we form the sample space 𝒮n and the
samples xi. We maintain that the sum

z = x1 + ··· + xn

has a binomial distribution. Indeed, the moment function of the RVs xi equals

Φi(s) = E{e^{sxi}} = p e^s + q

Hence,

Φz(s) = Φ1(s) ··· Φn(s) = (p e^s + q)^n

From this and (5-66) it follows that z has a binomial distribution. We note
again that E{xi} = p, σi² = pq; hence, E{z} = np, σz² = npq.
7-3
Central Limit Theorem
Given n independent RVs xi, we form the sum

z = x1 + ··· + xn     (7-63)

The central limit theorem (CLT) states that under certain general conditions,
the distribution of z approaches a normal distribution:

Fz(z) ≈ G((z - ηz)/σz)     (7-64)

as n increases. Furthermore, if the RVs xi are of continuous type, the density
of z approaches a normal density:

fz(z) ≈ 1/(σz√(2π)) e^{-(z - ηz)²/(2σz²)}     (7-65)

Under appropriate normalization, the theorem can be stated as a limit:
If z0 = (z - ηz)/σz, then

F_{z0}(z) → G(z)     f_{z0}(z) → (1/√(2π)) e^{-z²/2}     (7-66)

for the general and for the continuous case, respectively.
No general statement can be made about the required size of n for a
satisfactory approximation of Fz(z) by a normal distribution. For a specific n,
the nature of the approximation depends on the form of the densities fi(x). If
the RVs xi are i.i.d., the value n = 30 is adequate for most applications. In
fact, if fi(x) is sufficiently smooth, values of n as low as 5 can be used. The
following example illustrates this.
Example 7.7
The RVs xi are uniformly distributed in the interval (0, 1), and

E{xi} = 1/2     σi² = 1/12

We shall compare the density of z with the normal approximation (7-65) for
n = 2 and n = 3.

n = 2:   ηz = 2/2 = 1     σz² = 2/12 = 1/6     fz(z) ≈ √(3/π) e^{-3(z-1)²}

n = 3:   ηz = 3/2     σz² = 3/12 = 1/4     fz(z) ≈ √(2/π) e^{-2(z-3/2)²}

As we showed in Example 7.6, fz(z) is a triangle for n = 2 and consists of three
parabolic pieces for n = 3. In Fig. 7.5, we compare these densities with the corresponding
normal approximations. Even for such small values of n, the approximation
error is small. •

[Figure 7.5: the exact densities of z for n = 2 and n = 3 and their normal approximations]
The central limit theorem (7-66) can be expressed as a property of
convolutions: The convolution of a large number of positive functions is
approximately a normal curve [see (7-59)].
The central limit theorem is not always true. We cite next sufficient
conditions for its validity:

If the RVs xi are i.i.d. and E{xi³} is finite, the distribution F0(z) of their
normalized sum z0 tends to N(0, 1).
If the RVs xi are such that |xi| < A < ∞ and σi > c1 > 0 for every i, then
F0(z) tends to N(0, 1).
If E{xi³} < B < ∞ for every i and

σ² = σ1² + ··· + σn² → ∞   as n → ∞     (7-67)

then F0(z) tends to N(0, 1).
The proofs will not be given.
Example 7.8
If

fi(x) = e^{-x} U(x)

then

E{xi^n} = ∫_0^∞ x^n e^{-x} dx = n!

Thus E{xi³} = 6 < ∞; hence, the central limit theorem applies. In this case, ηz =
nE{xi} = n, σz² = nσi² = n. And since the convolution of n exponential densities
equals the Erlang density (7-61), we conclude from (7-65) that

z^{n-1}/(n - 1)! e^{-z} ≈ 1/√(2πn) e^{-(z-n)²/(2n)}

for large n. In Fig. 7.6, we show both sides for n = 5 and n = 10. •
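The comparison in Example 7.8 is easy to reproduce. The sketch below (Python; an illustration only) evaluates the Erlang(n) density and its normal approximation at a few points for n = 10.

```python
import numpy as np
from math import factorial

def erlang(z, n):
    """Erlang density (7-61) with c = 1: z^(n-1) e^(-z) / (n-1)!"""
    return z**(n - 1) * np.exp(-z) / factorial(n - 1)

def normal_approx(z, n):
    """Normal approximation (7-65) with eta = n, sigma^2 = n."""
    return np.exp(-(z - n)**2 / (2*n)) / np.sqrt(2*np.pi*n)

n = 10
for z in (6.0, 10.0, 14.0):
    print(z, round(erlang(z, n), 4), round(normal_approx(z, n), 4))
```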
[Figure 7.6: the Erlang density and its normal approximation for n = 5 and n = 10]
The De Moivre-Laplace Theorem The (7-64) form of the central limit theorem
holds for continuous and discrete type RVs. If the RVs xi are of discrete
type, their sum z is also of discrete type, taking the values zk, and its density
consists of points fz(zk) = P{z = zk}. In this case, (7-65) no longer holds
because the normal density is a continuous curve. If, however, the numbers
zk form an arithmetic progression, the numbers f(zk) are nearly equal to the
values of the normal curve at these points. As we show next, the De
Moivre-Laplace theorem (3-27) is a special case.

• Definition. We say that an RV x is of lattice type if it takes the values xk =
ka where k is an integer. If the RVs xi are of lattice type with the same a, their
sum

z = x1 + ··· + xn

is also of lattice type. We shall examine the asymptotic form of fz(z) for the
following special case.
Suppose that the RVs xi are i.i.d., taking the values 1 and 0 as in (7-60).
In this case, their sum z has a binomial distribution:

P{z = k} = (n choose k) p^k q^{n-k}     k = 0, 1, . . . , n

and

E{z} = np     σz² = npq → ∞   as n → ∞

Hence, we can use the approximation (7-64). This yields

Fz(z) = Σ_{k ≤ z} (n choose k) p^k q^{n-k} ≈ G((z - np)/√(npq))     (7-68)

From this and (4-48) it follows that

P{np - 3√(npq) ≤ z ≤ np + 3√(npq)} ≈ .997     (7-69)

In Fig. 7.7a, we show the function Fz(z) and its normal approximation.
Clearly, Fz(z) is a staircase function with discontinuities at the points z = k.
If n is large, then in the interval (k - 1, k) between discontinuity points, the
normal density is nearly constant. And since the jump of Fz(z) at z = k equals
P{z = k}, we conclude from (7-68) (Fig. 7.7b) that

(n choose k) p^k q^{n-k} ≈ 1/√(2πnpq) e^{-(k - np)²/(2npq)}     (7-70)

as in (3-27).

[Figure 7.7: (a) the staircase distribution Fz(z) and its normal approximation; (b) the binomial probabilities and the normal curve]
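A quick numerical check of (7-70): the sketch below (Python; the values of n and p are illustrative) compares exact binomial probabilities with the normal-curve values at a few k near np.

```python
from math import comb, sqrt, pi, exp

n, p = 100, 0.45
q = 1 - p

def binom_pmf(k):
    return comb(n, k) * p**k * q**(n - k)

def laplace_approx(k):
    # Right side of (7-70)
    return exp(-(k - n*p)**2 / (2*n*p*q)) / sqrt(2*pi*n*p*q)

for k in (40, 45, 50):
    print(k, round(binom_pmf(k), 4), round(laplace_approx(k), 4))
```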
Multinomial Distribution Suppose that [𝒜1, . . . , 𝒜r] is a partition of 𝒮
with pi = P(𝒜i). We have shown in (3-41) that if the experiment 𝒮 is performed
n times and the events 𝒜i occur ki times, then

P{𝒜i occurs ki times} = n!/(k1! ··· kr!) p1^{k1} ··· pr^{kr}

For large n, this can be approximated by

exp{ -(1/2)[ (k1 - np1)²/(np1) + ··· + (kr - npr)²/(npr) ] }     (7-71)

We maintain that for r = 2, (7-71) reduces to (7-70). Indeed, in this case,

k1 + k2 = n     p1 + p2 = 1

Hence,

(k1 - np1)²/(np1) + (k2 - np2)²/(np2) = (k1 - np1)²(1/(np1) + 1/(np2)) = (k1 - np1)²/(np1p2)     (7-72)

And since k1 = k, p1 = p, p2 = 1 - p1 = q, (7-70) results.
Sequences and Limits
An infinite sequence

x1, . . . , xn, . . .

of RVs is called a random process. For a specific ζ, the values xn(ζ) of xn form
a sequence of numbers. If this sequence has a limit, that limit depends in
general on ζ. The limit of xn(ζ) might exist for some outcomes but not for
others. If the set of outcomes for which it does not exist has zero probability,
we say that the sequence xn converges almost everywhere. The limit of xn
might be an RV x or a constant number. We shall consider next sequences
converging to a constant c.
Convergence in the MS Sense The mean E{(xn - c)²} of (xn - c)² is a
sequence of numbers. If this sequence tends to zero

E{(xn - c)²} → 0   as n → ∞     (7-73)

we say that the random sequence xn tends to c in the mean-square sense.
This can be expressed as a limit involving the mean ηn and the variance σn²
of xn.

• Theorem. The sequence xn tends to c in the MS sense iff

ηn → c     σn → 0   as n → ∞     (7-74)

• Proof. This follows readily from the identity

E{(xn - c)²} = (ηn - c)² + σn²

Convergence in Probability Consider the events {|xn - c| > ε} where ε
is an arbitrary positive number. The probabilities of these events form a
sequence of numbers. If this sequence tends to zero

P{|xn - c| > ε} → 0   as n → ∞     (7-75)

for every ε > 0, we say that the random sequence xn tends to c in probability.

• Theorem. If xn converges to c in the MS sense, it converges to c in probability.

• Proof. Applying (4-116) to the positive RV |xn - c|², we conclude, replacing
Aη by ε², that

P{|xn - c|² ≥ ε²} = P{|xn - c| ≥ ε} ≤ E{|xn - c|²}/ε²     (7-76)

Hence, if E{|xn - c|²} → 0, then P{|xn - c| > ε} → 0.
Convergence in Distribution This form of convergence involves the
limit of the distributions Fn(x) = P{xn ≤ x} of the RVs xn. We shall say that
the sequence xn converges in distribution if

Fn(x) → F(x)   as n → ∞     (7-77)

for every point of continuity of F(x). In this case, xn(ζ) need not converge for
any ζ. If the limit F(x) is the distribution of an RV x, we say that xn tends to x
in distribution. From (7-77) it follows that if xn tends to x in distribution, the
probability P{xn ∈ D} that xn is in a region D tends to P{x ∈ D}. The RVs xn
and x need not be related in any other way.
The central limit theorem is an example of convergence in distribution:
The sum zn = x1 + ··· + xn in (7-63) is a sequence of RVs converging in
distribution to a normal RV.
Law of Large Numbers We showed in Section 3-3 that if an event 𝒜 occurs
k times in n trials and P(𝒜) = p, the probability that the ratio k/n is in the
interval (p - ε, p + ε) tends to 1 as n → ∞ for any ε. We shall reestablish this
result as a limit in probability of a sequence of RVs.
We form the zero-one RVs xi associated with the event 𝒜 as in Section 7-2,
and their sample mean

x̄n = (x1 + ··· + xn)/n

Since E{xi} = p and σi² = pq, it follows that

E{x̄n} = p     σ_{x̄n}² = npq/n² = pq/n → 0   as n → ∞

Thus the sequence x̄n satisfies (7-73); hence, it converges in the MS sense
and, therefore, in probability to its mean p. Thus

P{|x̄n - p| > ε} → 0   as n → ∞

And since x̄n equals the ratio k/n, we conclude that

P{|k/n - p| > ε} → 0   as n → ∞

for every ε > 0. This is the law of large numbers (3-37) expressed as a limit.
7-4
Special Distributions of Statistics
We discuss next three distributions of particular importance in mathematical
statistics.
Chi-Square Distribution
The chi-square density

f(x) = 1/(2^{m/2} Γ(m/2)) x^{m/2 - 1} e^{-x/2} U(x)     (7-78)

is a special case of the gamma density (4-50). The term Γ(m/2) is the gamma
function given by [see (4-51)]

Γ(m/2) = (k - 1)!     m = even = 2k
Γ(m/2) = (1·3 ··· (2k - 1)) √π / 2^k     m = odd = 2k + 1     (7-79)

We shall use the notation χ²(m) to indicate that an RV x has a chi-square
distribution.* The constant m is called the number of degrees of freedom.
The notation χ²(m, x) will mean the function f(x) in (7-78). In Fig.
7.8, we plot this function for m = 1, 2, 3, and 4.

* For brevity we shall also say: The RV x is χ²(m).

[Figure 7.8: the χ²(m, x) densities for m = 1, 2, 3, 4]

With b = m/2 and c = 1/2, it follows from (5-64) that

E{x^n} = m(m + 2) ··· (m + 2n - 2)     (7-80)

Hence,

E{x} = m     E{x²} = m(m + 2)     σx² = 2m     (7-81)

Note [see (4-54) and (5-63)] that

E{1/x} = γ ∫_0^∞ x^{m/2 - 2} e^{-x/2} dx = 1/(m - 2)     m > 2     (7-82)

Φ(s) = 1/√((1 - 2s)^m)     (7-83)

The χ² density with m = 2 degrees of freedom is an exponential

f(x) = (1/2) e^{-x/2} U(x)     Φ(s) = 1/(1 - 2s)     (7-84)

and for m = 1 it equals

f(x) = 1/√(2πx) e^{-x/2} U(x)     Φ(s) = 1/√(1 - 2s)     (7-85)

because Γ(1/2) = √π.

Example 7.9
We shall show that the square of an N(0, 1) RV has a χ² density with one degree of
freedom. Indeed, if x is N(0, 1) and

y = x²

then [see (4-84)]

fy(y) = (1/√y) fx(√y) U(y) = 1/√(2πy) e^{-y/2} U(y)     (7-86)     •
The following property of χ² densities will be used extensively. If the
RVs x and y are χ²(m) and χ²(n), respectively, and are independent, their sum

z = x + y   is χ²(m + n)     (7-87)

• Proof. As we see from (7-83),

Φx(s) = 1/√((1 - 2s)^m)     Φy(s) = 1/√((1 - 2s)^n)

Hence [see (7-14)]

Φz(s) = Φx(s) Φy(s) = 1/√((1 - 2s)^{m+n})     (7-88)

From this it follows that the convolution of two χ² densities is a χ² density.

Fundamental Property The sum of the squares of n independent N(0, 1)
RVs xi has a chi-square distribution with n degrees of freedom:

If Q = Σ_{i=1}^n xi²   then Q is χ²(n)     (7-89)

• Proof. As we have shown in (7-86), the RVs xi² are χ²(1). Furthermore,
they are independent because the RVs xi are independent. Repeated application
of (7-88) therefore yields (7-89).
The RVs xi are N(0, 1); hence [see (5-55)],

E{xi²} = 1     E{xi⁴} = 3     Var xi² = 2

From this it follows that

E{Q} = n     σQ² = 2n     (7-90)

in agreement with (7-81).
As an application, note that if the RVs wi are N(η, σ) and i.i.d., the RVs
xi = (wi - η)/σ are N(0, 1) and i.i.d., as in (7-89); hence, the sum

Q = Σ_{i=1}^n ((wi - η)/σ)²   is χ²(n)     (7-91)
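The fundamental property (7-89)/(7-91) can be checked by simulation. The sketch below (Python; the values of n are illustrative) sums squares of standard normal samples and compares the empirical mean and variance of Q with n and 2n from (7-90).

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 8, 200_000

x = rng.standard_normal(size=(trials, n))
Q = (x**2).sum(axis=1)            # sum of squares of n N(0,1) RVs: chi-square(n)

print(Q.mean(), n)                # E{Q} = n       (7-90)
print(Q.var(), 2*n)               # var{Q} = 2n    (7-90)
```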
QUADRATIC FORMS The sum

Q = Σ_{i,j=1}^n aij xi xj     (7-92)

is called a quadratic form of order n in the variables xi. We shall consider
quadratic forms like these where the RVs xi are N(0, 1) and independent, as
in (7-89). An important problem in statistics is the determination of the
conditions that the coefficients aij must satisfy such that the sum in (7-92)
has a χ² distribution. The following theorem facilitates in some cases the
determination of the χ² character of Q.
• Theorem 1. Given three quadratic forms Q, Q1, and Q2 such that

Q = Q1 + Q2     (7-93)

it can be shown that if
(a) the RV Q has a χ²(n) distribution
(b) the RV Q1 has a χ²(r) distribution
(c) Q2 ≥ 0
then the RVs Q1 and Q2 are independent and Q2 has a χ² distribution with
n - r degrees of freedom.
The proof of this difficult theorem follows from (7A-6); the details,
however, will not be given. We discuss next an important application.
Sample Mean and Sample Variance Suppose that xi is a sample from an
arbitrary population. The RVs

x̄ = (1/n) Σ_{i=1}^n xi     s² = 1/(n - 1) Σ_{i=1}^n (xi - x̄)²     (7-94)

are the sample mean and the sample variance, respectively, of the sample xi.
As we have shown,

E{x̄} = η     σx̄² = σ²/n

where η is the mean and σ² is the variance of xi. We shall show that

E{s²} = σ²     (7-95)

• Proof. Clearly,

(xi - η)² = (xi - x̄ + x̄ - η)² = (xi - x̄)² + 2(xi - x̄)(x̄ - η) + (x̄ - η)²

Summing from 1 to n and using the identity Σ_{i=1}^n (xi - x̄) = 0, we conclude that

Σ_{i=1}^n (xi - η)² = Σ_{i=1}^n (xi - x̄)² + n(x̄ - η)²     (7-96)

Since the mean of x̄ is η and its variance σ²/n, we conclude, taking expected
values of both sides, that

nσ² = E{ Σ_{i=1}^n (xi - x̄)² } + σ²

and (7-95) results.
We now introduce the additional assumption that the samples xi are
normal.
• Theorem 2. The sample mean x̄ and the sample variance s² of a normal
sample are two independent RVs. Furthermore, the RV

(n - 1)s²/σ²   is χ²(n - 1)     (7-97)

• Proof. Dividing both sides of (7-96) by σ², we obtain

Σ_{i=1}^n ((xi - η)/σ)² = ((x̄ - η)/(σ/√n))² + Σ_{i=1}^n ((xi - x̄)/σ)²     (7-98)

The RV x̄ is normal with mean η and variance σ²/n. Hence, the first RV in
parentheses on the right side of (7-98) is N(0, 1), and its square is χ²(1).
Furthermore, the RV (xi - η)/σ is N(0, 1); hence, the left side of (7-98) is
χ²(n) [see (7-89)]. Theorem 2 follows, therefore, from theorem 1.
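Theorem 2 is easy to probe numerically. The sketch below (Python; the values of n, η, σ are illustrative) draws many normal samples, forms (n - 1)s²/σ², and compares its mean and variance with those of a χ²(n - 1) RV; it also reports the sample correlation between x̄ and s², which should be near zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 10, 100_000
eta, sigma = 5.0, 2.0

x = rng.normal(eta, sigma, size=(trials, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)              # sample variance with the 1/(n-1) factor
q = (n - 1) * s2 / sigma**2             # should be chi-square with n-1 degrees of freedom

print(q.mean(), n - 1)                  # mean = n-1
print(q.var(), 2*(n - 1))               # variance = 2(n-1)
print(np.corrcoef(xbar, s2)[0, 1])      # near 0: x-bar and s^2 are independent
```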
Note, finally, that

E{s²} = σ²     σ_{s²}² = 2σ⁴/(n - 1)     (7-99)

This follows from (7-97) and (7-81) because, with Q2 = (n - 1)s²/σ² the χ²(n - 1) RV in (7-97),

E{s²} = σ²/(n - 1) E{Q2}     σ_{s²}² = (σ²/(n - 1))² · 2(n - 1) = 2σ⁴/(n - 1)

Student t Distribution
We shall say that an RV x has a Student t(m) distribution with m degrees of
freedom if its density equals

f(x) = γ / √((1 + x²/m)^{m+1})     γ = Γ((m + 1)/2) / (√(πm) Γ(m/2))     (7-100)

This density will be denoted by t(m, x). The importance of the t distribution
is based on the following.

• Theorem. If the RVs z and w are independent, z is N(0, 1), and w is χ²(m),
the ratio

x = z / √(w/m)   is t(m)     (7-101)

• Proof. From the independence of z and w it follows that

fzw(z, w) ∝ e^{-z²/2} (w^{m/2 - 1} e^{-w/2}) U(w)

To find the density of x, we introduce the auxiliary variable y = w and form
the system

x = z√(m/w)     y = w

This yields z = x√(y/m), w = y; and since the Jacobian of the transformation
equals √(m/w), we conclude from (5-83) that

fxy(x, y) = √(y/m) fzw(x√(y/m), y) ∝ √y e^{-x²y/(2m)} (y^{m/2 - 1} e^{-y/2})

for y > 0 and 0 for y < 0. Integrating with respect to y, we obtain

fx(x) ∝ ∫_0^∞ y^{(m-1)/2} exp{ -(y/2)(1 + x²/m) } dy     (7-102)

With

q = (y/2)(1 + x²/m)

(7-102) yields

fx(x) ∝ 1/√((1 + x²/m)^{m+1}) ∫_0^∞ q^{(m-1)/2} e^{-q} dq ∝ 1/√((1 + x²/m)^{m+1})

and (7-101) results.
Note that the mean of x is zero and its variance equals

E{x²} = m E{z²} E{1/w} = m/(m - 2)     (7-103)

This follows from (7-82) and the independence of z and w.
• Corollary. If x̄ and s² are the sample mean and the sample variance,
respectively, of an N(η, σ) sample xi [see (7-94)], the ratio

t = (x̄ - η)/(s/√n)   is t(n - 1)     (7-104)

• Proof. We have shown that the RVs

z = (x̄ - η)/(σ/√n)     w = (n - 1)s²/σ²

are independent, the first is N(0, 1), and the second is χ²(n - 1). From this
and (7-101) it follows that the ratio

z/√(w/(n - 1)) = [(x̄ - η)/(σ/√n)] / √((n - 1)s²/(σ²(n - 1))) = (x̄ - η)/(s/√n)

is t(n - 1).
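The corollary (7-104) can be checked by simulation. The sketch below (Python; the sample size and population parameters are illustrative) forms the ratio t = (x̄ - η)/(s/√n) for many normal samples and compares its empirical variance with (n - 1)/(n - 3), the value that (7-103) gives for a t(n - 1) RV; its empirical mean is near zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 10, 200_000
eta, sigma = 3.0, 1.5

x = rng.normal(eta, sigma, size=(trials, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)
t = (xbar - eta) / (s / np.sqrt(n))     # the ratio in (7-104), a t(n-1) RV

print(t.mean())                          # near 0
print(t.var(), (n - 1) / (n - 3))        # both near 9/7, from (7-103) with m = n-1
```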
Snedecor F Distribution
An RV x has a Snedecor distribution if its density equals

f(x) = γ x^{k/2 - 1} / √((1 + kx/m)^{k+m}) U(x)     (7-105)

This is a two-parameter family of functions denoted by F(k, m). Its importance
is based on the following.

• Theorem. If the RVs z and w are independent, z is χ²(k), and w is χ²(r), the
ratio

x = (z/k)/(w/r)   is F(k, r)     (7-106)

• Proof. From the independence of z and w it follows that

fzw(z, w) ∝ (z^{k/2 - 1} e^{-z/2})(w^{r/2 - 1} e^{-w/2}) U(z) U(w)
We introduce the auxiliary variable y = w and form the system x =
rz/(kw), y = w. The solution of this system is z = kxy/r, w = y, and its Jacobian
equals r/(kw). Hence [see (5-83)],

fxy(x, y) ∝ x^{k/2 - 1} y^{(k+r)/2 - 1} exp{ -(y/2)(1 + kx/r) }

for x > 0, y > 0, and 0 otherwise. Integrating with respect to y, we obtain

fx(x) ∝ x^{k/2 - 1} ∫_0^∞ y^{(k+r)/2 - 1} exp{ -(y/2)(1 + kx/r) } dy

With

q = (y/2)(1 + kx/r)     y = 2q/(1 + kx/r)

we obtain

fx(x) ∝ x^{k/2 - 1}/(1 + kx/r)^{(k+r)/2} ∫_0^∞ q^{(k+r)/2 - 1} e^{-q} dq

and (7-106) results.
Note that the mean of the RV x in (7-106) equals [see (7-81) and (7-82)]

E{x} = (r/k) E{z} E{1/w} = (r/k) · k/(r - 2) = r/(r - 2)     (7-107)
Furthermore, if

x is F(k, r)   then   1/x is F(r, k)     (7-108)

Percentiles The u-percentile of an RV x is by definition a number x_u such
that

u = P{x ≤ x_u} = P{x > x_{1-u}}     (7-109)

The u-percentiles of the χ²(m), t(m), and F(k, r) distributions will be denoted by

χu²(m)     tu(m)     Fu(k, r)

respectively.
We maintain that

Fu(k, r) = 1/F_{1-u}(r, k)     (7-110)

F_{2u-1}(1, r) = tu²(r)     (7-111)

Fu(k, r) ≈ (1/k) χu²(k)   for large r     (7-112)

• Proof. If the RV x is F(k, r), then

u = P{x ≤ Fu(k, r)} = P{1/x ≥ 1/Fu(k, r)}

and (7-110) follows from (7-108) and (7-109).
If x is N(0, 1), then x² is χ²(1). From this and (7-101) it follows that if y is
t(r), then y² is F(1, r); hence,

2u - 1 = P{|y| ≤ tu(r)} = P{y² ≤ tu²(r)} = P{y² ≤ F_{2u-1}(1, r)}

and (7-111) results.
Note, finally, that if w is χ²(r), then [see (7-81)]

E{w/r} = 1     Var(w/r) = 2/r

Hence, w/r → 1 as r → ∞. This leads to the conclusion that the RV x in (7-106)
tends to z/k as r → ∞, and (7-112) follows because z is χ²(k).
Appendix: Chi-Square Quadratic Forms
Consider the sum

Q = Σ_{i,j=1}^n aij xi xj     (7A-1)

where xi are n independent N(0, 1) random variables and aij are n² numbers
such that aij = aji. We shall examine the conditions that the numbers aij must
satisfy so that the sum Q has a χ² distribution. For notational simplicity, we
shall express our results in terms of the matrices

A = [a11 ··· a1n; . . . ; an1 ··· ann]     X = [x1; . . . ; xn]     X* = [x1, . . . , xn]

Thus A is a symmetrical matrix, X is a column vector, and X* is its transpose.
With this notation,

Q = X* A X

We define the expected value of a matrix with elements wij as the
matrix with elements E{wij}. This yields

XX* = [xi xj]     E{XX*} = I_n     (7A-2)

where I_n is the identity matrix. The last equation follows from the assumption
that E{xi xj} = 1 if i = j and 0 if i ≠ j.
Diagonalization It is known from the properties of matrices that if A is a
symmetric matrix with eigenvalues λi, we can find a matrix T such that the
product TAT* equals a diagonal matrix D:

TAT* = D     D = diag(λ1, . . . , λn)     (7A-3)

From this it follows that the matrix T is unitary; that is, T* equals its inverse
T⁻¹, and hence its determinant equals 1. Using this matrix, we form the
vector

Z = TX = [z1; . . . ; zn]

We maintain that the components zi of Z are normal, independent, with
zero mean, and variance 1.

• Proof. The RVs zi are linear functions of the normal RVs xi; hence, they are
normal. Furthermore, E{zi} = 0 because E{xi} = 0 by assumption. It suffices,
therefore, to show that E{zi zj} = 1 for i = j and 0 otherwise. Since

Z* = X*T*     ZZ* = TXX*T*     T* = T⁻¹

we conclude that

E{ZZ*} = T E{XX*} T* = TT⁻¹ = I_n     (7A-4)

Hence, the RVs zi are i.i.d. and N(0, 1).
The following identity is an important consequence of the diagonalization
of the matrix A. The quadratic form Q in (7A-1) can be written as a sum
of squares of normal independent RVs. Indeed, since X = T*Z and X* =
Z*T, (7A-3) yields

Q = Z* TAT* Z = Z* D Z = Σ_{i=1}^n λi zi²     (7A-5)
• Fundamental Theorem. A quadratic form generated by a matrix A has a
χ² distribution with r degrees of freedom iff r of the eigenvalues λi of A equal
1 and the other n - r equal zero. Rearranging the order of the λi as necessary,
we can state the theorem as follows:

Q is χ²(r)   iff   λi = 1 for 1 ≤ i ≤ r   and   λi = 0 for r < i ≤ n     (7A-6)

• Proof. From (7A-5) it follows that if the eigenvalues λi satisfy (7A-6), then

Q = Σ_{i=1}^r zi²     (7A-7)

Hence [see (7-87)], Q is χ²(r). Conversely, if λi ≠ 0 or 1 for some i, then the
corresponding term λi zi² is not χ², and the sum Q in (7A-5) is not χ².
The following consequence of (7A-7) justifies the term degrees of freedom
used to characterize a χ² distribution. Suppose that Q = Σ wi² where wi
are n RVs linearly dependent on xi. If Q is χ²(r), then at most r of these RVs
are linearly independent.
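The eigenvalue criterion (7A-6) can be illustrated with the centering matrix A = I - (1/n)J (J the all-ones matrix), which is the quadratic form behind the sample variance: its eigenvalues are one 0 and n - 1 ones. The sketch below (Python; an illustration, not part of the text) checks the eigenvalues and compares the simulated moments of Q = X*AX with those of χ²(n - 1).

```python
import numpy as np

rng = np.random.default_rng(8)
n, trials = 6, 100_000

A = np.eye(n) - np.ones((n, n)) / n          # symmetric and idempotent (A @ A == A)
eig = np.linalg.eigvalsh(A)
print(np.round(eig, 6))                      # one eigenvalue 0, the rest equal 1

X = rng.standard_normal(size=(trials, n))
Q = np.einsum('ti,ij,tj->t', X, A, X)        # Q = X* A X for each trial

r = n - 1
print(Q.mean(), r)                           # E{Q} = r
print(Q.var(), 2*r)                          # var{Q} = 2r, as for chi-square(r)
```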
Noncentral Distributions
Consider the sum

Q0 = Q0(yi) = Σ_{i=1}^n yi²     (7A-8)

where yi are n independent N(ηi, 1) RVs. If ηi ≠ 0 for some i, then the RV Q0
does not have a χ² distribution. Its distribution will be denoted by

χ²(n, e)   where   e = Σ_{i=1}^n ηi² = Q0(ηi)     (7A-9)

and will be called noncentral χ² with n degrees of freedom and eccentricity e.
We shall determine its moment function Φ(s).
• Theorem

Φ(s) = E{e^{sQ0}} = 1/√((1 - 2s)^n) exp{ se/(1 - 2s) }     (7A-10)

• Proof. The moment function of yi² equals

E{e^{s yi²}} = 1/√(2π) ∫_{-∞}^{∞} e^{s yi²} e^{-(yi - ηi)²/2} dyi

The exponent of the integrand equals

s yi² - (yi - ηi)²/2 = -(1 - 2s)/2 (yi - ηi/(1 - 2s))² + s ηi²/(1 - 2s)

Hence [see (3A-1)],

Φi(s) = E{e^{s yi²}} = 1/√(1 - 2s) exp{ s ηi²/(1 - 2s) }

Since the RVs yi are independent, we conclude from the convolution theorem
(7-14) that Φ(s) is the product of the moment functions Φi(s); this yields
(7A-10). From (7A-10) it follows that the χ²(n, e) distribution depends not
on the ηi separately but only on the sum e of their squares.

• Corollary

E{Q0} = n + e     (7A-11)

• Proof. This follows from the moment theorem (5-62). Differentiating
(7A-10) with respect to s and setting s = 0, we obtain Φ'(0) = n + e, and
(7A-11) results.
Centering In (7A-6), we have established the χ² character of certain quadratic
forms. In the following, we obtain a similar result for noncentral χ²
distributions.
Using the RVs yi in (7A-8), we form the sum

Q1(yi) = Σ_{i,j=1}^n aij yi yj

where aij are the elements of the quadratic form Q in (7A-1). We maintain
that if Q is χ²(r), then

Q1(yi) is χ²(r, e)   where   e = Q1(ηi) = Σ_{i,j=1}^n aij ηi ηj     (7A-12)

• Proof. The RVs xi = yi - ηi are i.i.d. as in (7A-1), and Q1(yi) = Q(xi + ηi).
The RVs zi in (7A-6) are linear functions of xi. A change, therefore, from xi to
xi + ηi results in a change from zi to zi + θi where θi = E{zi}. From this and
(7A-7) it follows that

Q1(yi) = Q(xi + ηi) = Σ_{i=1}^r (zi + θi)²

This is a quadratic form as in (7A-8); hence, its distribution is χ²(r, e) as in
(7A-12).
Application We have shown in (7-97) that if x̄ is the sample mean of
the RVs xi, the sum

Q = Σ_{i=1}^n (xi - x̄)²   is χ²(n - 1)

With yi = xi + ηi, ȳ = x̄ + η̄, E{ȳ} = η̄, it follows from (7A-12) that the sum

Q1 = Σ_{i=1}^n (yi - ȳ)²   is χ²(n - 1, e)   with   e = Σ_{i=1}^n (ηi - η̄)²     (7A-13)

Note, finally, [see (7A-11)] that

E{Q1} = (n - 1) + e     (7A-14)

Noncentral t and F Distributions Noncentral distributions are used in hypothesis
testing to determine the operating characteristic function of various
tests. In these applications, the following forms of noncentrality are relevant.
A noncentral t distribution with n degrees of freedom and eccentricity e
is the distribution of the ratio

z / √(w/n)

where z and w are two independent RVs, z is N(e, 1), and w is χ²(n). This is
an extension of (7-101).
A noncentral F(k, r, e) distribution is the distribution of the ratio

(z/k) / (w/r)

where z and w are two independent RVs, z is noncentral χ²(k, e), and w is χ²(r).
This is an extension of (7-106).
Problems
7-1 A matrix C with elements μij is called nonnegative definite if

Σ_{i,j} ci cj μij ≥ 0

for every ci and cj. Show that if C is the covariance matrix of n RVs xi, it is
nonnegative definite.
7-2 Show that if the RVs x, y, and z are such that rxy = ryz = 1, then rxz = 1.
7-3 The RVs x1, x2, x3 are independent with densities f1(x1), f2(x2), f3(x3). Show
that the joint density of the RVs

y1 = x1     y2 = x1 + x2     y3 = x1 + x2 + x3

equals the product f1(y1) f2(y2 - y1) f3(y3 - y2).
7-4 (a) Using (7-13), show that

∂⁴Φ(s1, s2, s3, s4)/(∂s1 ∂s2 ∂s3 ∂s4) = E{x1x2x3x4}   for s1 = s2 = s3 = s4 = 0

(b) Show that if the RVs xi are normal with zero mean and E{xixj} = μij, then

E{x1x2x3x4} = μ12μ34 + μ13μ24 + μ14μ23
7-5 Show that if the RVs xi are i.i.d. with sample mean x̄, then E{(xi - x̄)²} =
(n - 1)σ²/n.
7-6 The length c of an object is measured three times. The first two measurements,
made with instrument A, are 6.19 cm and 6.25 cm. The third measurement,
made with instrument B, is 6.21 cm. The standard deviations of the instruments
are 0.5 cm and 0.2 cm, respectively. Find the unbiased linear minimum
variance estimate of c.
Consider the sum [see also (7-39))
c
7-7
where
P{n
= k} = P4
k
= 1, 2, . . .
and the avs x1 are i.i.d. with density £(x). Using moment functions, show that
if n is independent ofx1, then
f:<z> = k-1
L" P•fx•'<;.)
7-8
where}141(x) is the convolution of.f..(x) with itself k times.
A plumber services N customers. The probability that in a given month a
particular customer needs service equals p. The duration of each service is an
av x1 with density ce-~~u(x). The total service time during the month is thus
the random sum z = x1 + · · • + x, where n is an av with binomial distribution. (a) Show that
c"
L (N)
p"qN-n
z"-le-~:
n
(n - 1)!
/,(z) = N
,.~ 0
'
(b) The plumber charges $40 per hour. Find his mean monthly income if c
112, N = 1,000 and p = .1.
=
The avs X; are i.i.d., and each is uniform in the interval (9, 11). Find and sketch
the density oftheir sample mean i for n = 2 and n = 3.
7·10 The avs x1 are i.i.d., and N(.,, u). Show that
7-9
I ~"
(a) if y = - \}2
lx; - Til
L
t•l
n
.
(b) ifz =
1
2
E{y} = u
then
"
(n _ 1) ~ (x;- x 1• 1)2
then
0'2
'
=
(
Tr
- 1) -u2
2
n
E{z} = u 2
7·11 Then avs x1 are i.i.d. with distribution F(x) = I - R(x) and z = max x1, w
min X;.
R,.(w) = R"(w)
(a) Show that
=
(b) Show that if
f(x)
= e-IA-IIU(x -
8)
then
F,.lw)
= e-III•·-IIU(w -
8)
(c) Find F,(r.) and F,..(w) if /(x) = ce-~~u(x). (d) Find F,(z) and F,..(w) ifx has
a geometric distribution as in (4-66).
7·11 The n avs x1 are i.i.d. and uniformly distributed in the interval c - 0.5 < x <
c + 0.5. Show that ifz = max x~o w = min x1, then (a) P{w < c < z} = 1 - 1/2";
_ {n(n ...: 1)(z - w)"- 2
c - 0.5 < w < z < c + 0.5
(b)
fz,.(z. w) - o
otherwise
7-13 Show that if Y• is the kth-order statistic of a sample of size n of x and .t.~ is the
median of x. then
7-14 The height of children in a certain grade is an RV x uniformly distributed
between 55 and 65 inches. We pick at random five students and line them up
according to height. Find the distribution of the height t of the child in the
center.
7-15 A well pipe consists of four sections. Each section is an RV X; uniformly
distributed between 9.9 and 10.1 feet. Using (7-64). find the probability that the
length x = x1 + x2 ... x3 + x.. of the pipe is between 39.8 and 40.2 feet.
7-16 The Rvs X; are N(O, u) and independent. Using the CLT. show that if y = xi +
· + x!. then for large n,
/y(y)
,J
= u· 41Tn exp {- _I_
I\'
4nu 4 •
- na-2)2}
7-17 (Random walk) We toss a fair coin and take a step of size c to the right if heads
shows, to the left if tails shows. At the nth toss. our position is an RV x,. taking
the values me where m = n, n- 2, . . . , -n.
(a) Show that
P{x,. = me}
n· I
_m+ll
= ( k) 2"
k-
2
(b) Show that for large"·
P{x,.
I
= me} ,.., Vn1ii2 t•
.,
""'·"
(c) Find the probability {x50 > 6c} that at the 50th step, x,. will exceed 6c.
7-18 <Proof of the CLT) Given n i.i.d. RVS x; with zero mean and variance a- 2, we
form their moment functions cJ>(s) = E{e,.·}. 'i'(s) =In Cl>(s). Show that (Taylor
series)
,
'i'(s) =
~- s 2
Ifz = (1/'Vn) (x 1 + · · ·
-t
higher power of s
then «<>:(s)
+ x,.),
a-2
'i'.(s) = - s2
·
2
-
= «<>"(s!Vn)
I
powers of-.,..
Vn
Using the foregoing, show that f:(r.) tends to an N(O, u) density as n- or:.
7-19 The n avs x; are i.i.d. with mean 71 and variance a- 2. Show that if i is their
sample mean and g(x) is a differentiable function, then for large n, the RV y =
g(i) is nearly normal with Tly = g'(Tj) and rry = lg'(Tj)lutVii.
7-lO The RVs x; are positive, independent, andy equals their product:
Y = Xt' ' 'X"
Show that for large n, the density of y is approximately lognormal (see Problem 4-22)
/y(y) ""
cyVi;
exp { -
!. (lny - b)} U(y)
2
2
where b i~ the mean and c2 the variance of the RV
z
= In y = L In x;
This result leads to the following form of the central limit theorem: The distribution of the product y of n independent positive avs X; tends to a lognormal
distribution as n- x.
7·21 Show that if the RV x has a x2(7) distribution, then E{ 1/x} = 1/5.
7-22 (Chi distribution) Show that ifx has a x2(m) distribution andy = Vi, then
/y(y) = 'YY •e·y!i2U(y)
2"'2r~n/2)
'Y =
7-23 Show that if the RVs xi are independent N(TJ. u) and
(xi - .i)
1 "
s2 = n _ 1 ~ - u -
2
then
E{s} =
7·24 The n RVs Xi i.i.d. with density /(x) =
sample mean i, and show that
E
{J}x = n__!!£__
- I
E
'JIn
ce·····~·(x).
{I}
Xi
2
)'(n/2)u
_ 1 r((n _ 0121
Find the density of their
n2c2
= (n - I)(n - 2)
7-25 The avs xi are N(O, u) and independent. Find the density of the sum z =xi ...
X~+ xi.
7-26 The components v.~, v,., and v~ of the velocity v = vv: + v~ - v: of a particle
are three independent avs with zero mean and variance u 2 ;. kTim. Show that
v has a Maxwell distribution
I~-1r v2e·rrl2a·u(v)
''
r(v) = _
Ju
0'3
7-27 Show that if the avs x andy are independent, xis N(O, l/2) andy is }( 2(5), then
the RV z = 4x2 + y is x2(6).
7-28 The RVs z and ware independent, z is N(O, u), and w is x2(9). Show that ifx =
3z/Vw, then,h(x) - 1/(90'2 + x2)'.
7·29 The avs z and w are independent, z is x2(4), and w is x2(6). Show that if x =
z/2w, then,h(x)- [x/(1 + 2x)')U(x).
7-30 The avs x andy are independent, xis N(TJ.~, ul, andy is N(TJ., u). We form
their sample means i, y and variances s~ and ~ using samples ·of length n and
m, respectively. (a) Show that if ci2 = as~ + ~ is the LMS unbiased estimator
of u 2, then
~
v1
2u 4
2u 4
a = --b = --where v 1 = - V• = - Vt + v2
v, + v2
n - I
·
m - I
are the variances of
and
respectively. Hint: Use (7-38) and (7-97).
(b) Show that ifw = i - y,.,. = Tlx- .,,., and
s:
S:.
cY., = (as~ + b~) (~
-
!)
then the av(w - Tlw)lcts has a t distribution with n + m - 2 degrees of
freedom.
7-31 Show that the quadratic form Q in (7A-1) has a ]( 2 distribution if the matrix A is
idempotent, that is, if A 2 = A.
7-32 The RVs x and y are independent, and their distributions are noncentral
x2(m, e 1) and xlCn, e2), respectively. Show that their sum z = x + y has a noncentral x2(m + n, e 1 + e2) distribution.
PART TWO
STATISTICS
8
The Meaning of Statistics
Statistics is part of the theory of probability relating theoretical concepts to
reality. As in all probabilistic predictions, statistical statements concerning
the real world are only inferences. However, statistical inferences can be
accepted as near certainties because they are based on probabilities that are
close to 1. This involves a change from statements dealing with an arbitrary
space ~ to statements involving the space ~/, of repeated trials. In this
chapter, we present the underlying reasoning. In Section 8-2, we introduce
the basic areas of statistics, including estimation. hypothesis testing, Bayesian statistics, and entropy. In the final section, 8-3, we comment on the
interaction of computers and statistics. We explain the meaning of random
numbers and their computer genenttion and conclude with a brief comment
on the significance of simulation in Monte Carlo methods.
8-1
Introduction
Probability is a mathematical discipline developed on the basis of an abstract
model. and its conclusions are deductions based on the axioms. Statistics
deals with the applications of the theory to real problems, and its conclu-
235
236
CHAP.
8
THE MEANING OF STATISTICS
sions are inferences based on observations. Statistics consists of two parts:
analysis and design. Analysis, or mathematical statistics, is part of the theory of probability involving RVS generated mainly by repeated trials. A major
task of analysis is the construction of events the probability of which is close
to 0 or to 1. As we shall see, this leads to inferences that can be accepted as
near certainties. Design, or applied statistics, deals with the selection of
analytical methods that are best suited to particular problems and with the
construction of experiments that can be adequately described by theoretical
models. This book covers only mathematical statistics.
The connection between probabilistic concepts and reality is based on
the empirical formula

n𝒜 ≈ np     (8-1)

relating the probability p = P(𝒜) of an event 𝒜 to the number n𝒜 of successes
of 𝒜 in n trials of the underlying physical experiment 𝒮. This formula can be
used to estimate the model parameter p in terms of the observed number n𝒜.
If p is known, it can be used to predict the number n𝒜 of successes of 𝒜 in n
future trials. Thus (8-1) can be viewed as a rudimentary form of statistical
analysis: The ratio

p̂ = n𝒜/n     (8-2)
is the point estimate of the parameter p.
Suppose that 𝒮 is the polling experiment and 𝒜 is the event {Republican}. We
question 620 voters and find that 279 voted Republican. We then
conclude that p̂ = 279/620 = .45. Using this estimate, we predict that 45% of
all voters are Republican.
Formula (8-1) is only an approximation. A major objective of statistics
is to replace it by an exact statement about the value n𝒜 - np of the approximation
error. Since p is a model parameter, to find such a statement we must
interpret also the numbers n and n𝒜 as model parameters. For this purpose,
we form the product space

𝒮n = 𝒮 × ··· × 𝒮

consisting of the n repetitions of the experiment 𝒮 (see Section 3-1), and we
denote by n𝒜 the number of successes of 𝒜. We next form the set

𝒞 = {np - 3√(npq) < n𝒜 < np + 3√(npq)}     (8-3)

This set is an event in the space 𝒮n, and its probability equals [see (7-69)]

P{np - 3√(npq) < n𝒜 < np + 3√(npq)} = .997     (8-4)

We can thus claim with probability .997 that n𝒜 will be in the interval np ±
3√(npq). This is an interval estimate of n𝒜.
We shall use (8-4) to obtain an interval estimate of p in terms of its
point estimate p̂ = n𝒜/n. From (8-4) it follows with simple algebra that

P{(p̂ - p)² < (9/n) p(1 - p)} = .997
Denoting by p1 and p2 the roots of the quadratic

(p̂ - p)² = (9/n) p(1 - p)     (8-5)

we conclude from (8-4) that

P{p1 < p < p2} = .997     (8-6)

We can thus claim with probability .997 that the unknown parameter p is in
the interval (p1, p2). Thus using statistical analysis, we replaced the empirical
point estimate (8-2) of p by the precise interval estimate (8-6).
In the polling experiment, n = 620, p̂ = .45, and (8-5) yields

(p - .45)² = (9/620) p(1 - p)     p1 = .39     p2 = .51
Note In news broadcasts, this result is phrased as follows: "A poll showed
that forty-five percent of all voters are Republican; the margin of error is
±6%." The number 6 is the difference from 45 to the endpoints 39 and 51 of
the interval (39, 51). The fact that the result is correct with probability .997 is
not mentioned.
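The roots p1 and p2 of (8-5) are easily computed for any n and p̂. The sketch below (Python; an illustration only) rearranges (8-5) into a standard quadratic in p and solves it for the polling data n = 620, p̂ = .45.

```python
import numpy as np

def interval_estimate(p_hat, n, k=3.0):
    """Roots p1 < p2 of (p_hat - p)^2 = (k^2/n) p(1 - p), as in (8-5) with k = 3."""
    a = k**2 / n
    # Rearranged: (1 + a) p^2 - (2 p_hat + a) p + p_hat^2 = 0
    coeffs = [1 + a, -(2*p_hat + a), p_hat**2]
    p1, p2 = sorted(np.roots(coeffs).real)
    return p1, p2

p1, p2 = interval_estimate(0.45, 620)
print(round(p1, 2), round(p2, 2))      # 0.39 and 0.51, as in the text
```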
The change from the empirical formula (8-1) to the precise formula
(8-4) did not solve the problem of relating p to real quantities. Since P(𝒞) is
also a model concept, its relationship to the real world must again be based
on (8-1). This relationship now takes the following form. If we repeat the
experiment 𝒮n a large number of times, in 99.7% of these cases the number
n𝒜 of successes of 𝒜 will be in the interval np ± 3√(npq). There is, however,
a basic difference between this and (8-1). Unlike P(𝒜), which could be any
number between 0 and 1, the probability of the event 𝒞 is almost 1. We can
therefore expect with near certainty that the event 𝒞 will occur in a single
performance of the experiment 𝒮n. Thus the change from the event 𝒜 with
an arbitrary probability to an event 𝒞 with P(𝒞) ≈ 1 leads to a conclusion
that can be accepted as near certainty.
The foregoing observations are relevant, although not explicitly stated,
in most applications of statistics. We rephrase them next in the context of
statistical inference.
=
Statistics and Induction
Suppose that we know from past observations the probability P(M) of an event M. What conclusion can we draw about the occurrence of this event in a single performance of the underlying experiment? We shall answer this question in two ways, depending on the size of P(M). We shall give one answer if P(M) is a number distinctly different from 0 or 1 (for example, .6) and a different answer if P(M) is close to 0 or 1 (for example, .997). Although the boundary between the two probabilities is not sharply defined (.9 or .9999?), the answers are fundamentally different.
Case 1  We assume, first, that P(M) = .6. In this case, the number .6 gives us only some degree of confidence that the event M will occur. Thus the known probability is used merely as a measure of our state of knowledge about the possible occurrence of M. This interpretation of P(M) is subjective and cannot be verified experimentally. At the next trial, the event M will either occur or not occur. If it does not, we will not question the validity of the assumption that P(M) = .6.
Case 2  Suppose, next, that P(M) = .997. Since .997 is close to 1, we expect with near certainty that the event M will occur. If it does not occur, we shall question the assumption that P(M) = .997.
Mathematical statistics can be used to change case 1 to case 2. Suppose that A is an event with P(A) = .45. As we have noted, no reliable prediction about A in a single performance of 𝒮 is possible (case 1). In the space 𝒮_n of 620 repetitions of 𝒮, the set B = {242 < n_A < 316} is an event with P(B) = .997 ≈ 1. Hence (case 2), we can predict with near certainty that if we perform 𝒮_n once, that is, if 𝒮 is repeated 620 times, the number n_A of successes of A will be between 242 and 316. We have thus changed subjective knowledge about the occurrence of A based on the given information that P(A) = .45 to an objective prediction about B based on the derived probability P(B) ≈ .997. Note that both conclusions are inductive inferences. The difference between the two, although significant, is only quantitative. As in case 1, the conclusion that B will occur at a single performance of the experiment 𝒮_n is not a logical certainty but only an inference. In the last analysis, no prediction about the future can be accepted as a logical necessity.
8-2  The Major Areas of Statistics
We introduce next the major areas of statistics stressing basic concepts. We
shall make frequent use of the following:
Percentiles  The u-percentile of an RV x is a number x_u such that u = F(x_u). The percentiles of the normal, the χ², the t, and the F distributions are listed in the tables in the appendix. The u-percentile of the N(0, 1) distribution is denoted by z_u. If x is N(η, σ), then

    x_u = η + z_u σ        because        F(x) = G((x - η)/σ)

With γ = 1 - δ a given constant, it follows that (see Fig. 8.1)

    P{η - z_{1-δ/2} σ < x < η + z_{1-δ/2} σ} = γ        (8-7)
[Figure 8.1]
Sample Mean  Consider an RV x with density f(x), defined on an experiment 𝒮. A sample of x of length n is a sequence of n independent and identically distributed (i.i.d.) RVs x₁, . . . , x_n with density f(x) defined on the sample space 𝒮_n = 𝒮 × ··· × 𝒮 as in Section 7-1. The arithmetic mean x̄ of the x_i is called the sample mean of x. Thus [see (7-30)]

    x̄ = (1/n) Σ_{i=1}^{n} x_i        η_x̄ = η        σ_x̄² = σ²/n        (8-8)

If we perform the corresponding physical experiment n times, we obtain the n numbers x₁, . . . , x_n. These numbers are the values x_i = x_i(ζ) of the samples x_i. They will be called observations; their arithmetic mean x̄ = x̄(ζ) is the observed sample mean.
From the CLT theorem (7-64) it follows that for large n, the RV x̄ is approximately normal. In fact, if f(x) is smooth, this is true for n as small as 10 or even less. Applying (8-7) to the RV x̄, we obtain the basic formula

    P{η - z_{1-δ/2} σ/√n < x̄ < η + z_{1-δ/2} σ/√n} = γ        (8-9)
Estimation
Estimation is the most important topic in statistics. In fact, the underlying
ideas form the basis of most statistical investigations. We introduce this
central topic by means of specific illustrations.
We wish to measure the diameter θ of a rod. The results of the measurements are the values of the RV x = θ + ν where ν is the measurement error. We know from past observations that ν is a normal RV with 0 mean and known variance. Thus x is an N(θ, σ) RV with known σ, and our task is to find θ. This is a problem in parameter estimation: The distribution of the RV x is a function F(x, θ) of known form depending on an unknown parameter θ. The problem is to estimate θ in terms of one or more observations (measurements) of the RV x.
We buy an electric motor. We are told that its life length is a normal RV x with known η and σ. On the basis of this information, we wish to estimate the life length of our motor. This is a problem in prediction. The distribution F(x) of the RV x is completely known, and our problem is to predict its value x = x(ζ) at a single performance of the underlying experiment (the life length of a specific motor).
[Figure 8.2  (a) F(x, θ), θ unknown: estimate θ.  (b) F(x) known: predict x.]
In both examples, we deal with a classical estimation problem. In the first example, the unknown θ is a model parameter, and the data are observations of real experiments. We thus proceed from the observations to the model (Fig. 8.2a). In the second example, the model is completely known, and the number to be estimated is the value x of a real quantity. In this case, we proceed from the model to the observation (Fig. 8.2b). We continue with the analysis.
PREDICTION  We are given an RV x with known density f(x), and we wish to predict its value x = x(ζ) at the next trial. Clearly, x can be any number in the range of x; hence, it cannot be predicted, only estimated. Thus our problem is to find a constant c that minimizes in some sense a function L(x - c) of the estimation error x - c. This problem was considered in Section 6-3. We have shown that if our criterion is the minimization of the MS error E{(x - c)²}, then c equals the mean η_x of x; if it is the minimization of E{|x - c|}, then c equals the median x.₅ of x. The constant c so obtained is a point estimate of x. A point estimate is used if we wish to find a number c that is close to x on the average. In certain applications, we are interested only in a single value of the RV x. Our problem then is to find two tolerance limits c₁ and c₂ for the unknown x. For example, we buy a door from a factory and wish to know with reasonable certainty that its length x is between c₁ and c₂. For our purpose, the minimization of the average prediction error is not a relevant objective. Our objective is to find not a point estimate but an interval estimate of x.
To solve this problem, we select a number γ close to 1 and determine the constants c₁ and c₂ so as to minimize the length c₂ - c₁ of the interval (c₁, c₂) subject to the condition that

    P{c₁ < x < c₂} = γ        (8-10)
If this condition is satisfied and we predict that x will be in the interval (c₁, c₂), then our prediction is correct in 100γ% of the cases. We shall find c₁ and c₂ under the assumption that the density f(x) is unimodal (has a single maximum). Suppose, first, that f(x) is also symmetrical about its mode x_max as in Fig. 8.3. In this case, x_max = η and c₂ - c₁ is minimum if c₁ = η - a, c₂ = η + a, where a is a constant that can be determined from (8-10). As we see from Fig. 8.3a,

    δ = 1 - γ        a = η - x_{δ/2} = x_{1-δ/2} - η        (8-11)
[Figure 8.3]
For an arbitrary unimodal density, c₂ - c₁ is minimum if f(c₁) = f(c₂) as in Fig. 8.3b (Problem 9-9). This condition, combined with (8-10), leads by trial and error to a unique determination of the constants c₁ and c₂. For computational simplicity we shall use the constants (Fig. 8.3c)

    c₁ = x_{δ/2}        c₂ = x_{1-δ/2}        (8-12)

For asymmetrical densities, the length of the resulting interval (c₁, c₂) is no longer minimum. The constant γ = 1 - δ is called the confidence coefficient and the tolerance interval (c₁, c₂) the γ confidence interval of the prediction.
The value of γ is dictated by two conflicting requirements: If γ is close to 1, the estimate is reliable, but the size c₂ - c₁ of the confidence interval is large; if γ is reduced, c₂ - c₁ is reduced, but the estimate is less reliable. The commonly used values of γ are .9, .95, .99, and .999.
Example 8.1
The life length of tires of a certain type is a normal RV with η = 25,000 miles and σ = 3,000 miles. We buy a set of such tires and wish to find the .95-confidence interval of their life.
From the normality assumption it follows that

    P{η - a < x < η + a} = 2G(a/σ) - 1 = .95
This yields G(a/σ) = .975, a/σ = z.₉₇₅, a = z.₉₇₅ σ ≈ 2σ; hence,

    P{19,000 < x < 31,000} = .95

Since .95 is close to 1, we expect with reasonable confidence that in a single performance of the experiment, the event {η - a < x < η + a} will occur; that is, the life length of our tires will be between 19,000 and 31,000 miles. •
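A short numerical sketch of this prediction interval, using SciPy's normal percentile function in place of the tables (rounding z.₉₇₅ to 2 reproduces the 19,000 to 31,000 figure):

```python
from scipy.stats import norm

def prediction_interval(eta, sigma, gamma=0.95):
    """Interval eta +/- a with a = z_{1-delta/2} * sigma, as in Example 8.1."""
    a = norm.ppf(1 - (1 - gamma) / 2) * sigma
    return eta - a, eta + a

print(prediction_interval(25_000, 3_000))   # about (19120, 30880)
```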
PARAMETER ESTIMATION  In the foregoing discussion, we used statistics to change empirical statements to exact formulas involving events with probabilities close to 1. We show next that the underlying ideas can be applied to parameter estimation. To be concrete, we shall consider the problem of estimating the mean η of an RV with known variance σ². An example is the measurement problem.
The empirical estimate of the mean η = E{x} of an RV x is its observed sample mean x̄ [see (4-90)]:

    η ≈ x̄ = (1/n) Σ_{i=1}^{n} x_i        (8-13)

This is an approximate formula relating the model concept η to the real observations x_i. We shall replace it by a precise probabilistic statement involving the approximation error x̄ - η. For this purpose, we form the sample mean x̄ of x and the event B = {η - a < x̄ < η + a} where a is a constant that can be expressed in terms of the probability of B. Assuming that x̄ is normal, we conclude as in (8-9) that if P(B) = γ = 1 - δ, then a = z_{1-δ/2} σ/√n. This yields

    P{η - z_{1-δ/2} σ/√n < x̄ < η + z_{1-δ/2} σ/√n} = γ        (8-14)

or, equivalently,

    P{x̄ - z_{1-δ/2} σ/√n < η < x̄ + z_{1-δ/2} σ/√n} = γ        (8-15)

We have thus replaced the empirical statement (8-13) by the exact probabilistic equation (8-15). This equation leads to the following interval estimate of the parameter η: The probability that η is in the interval

    x̄ - z_u σ/√n < η < x̄ + z_u σ/√n        u = 1 - δ/2

equals γ = 1 - δ. As in the prediction problem, γ is the confidence coefficient of the estimate.
Note that equations (8-14) and (8-15) are equivalent; however, their statistical interpretations are different. The first is a prediction: It states that if we predict that x̄ will be in the fixed interval η ± z_u σ/√n, our prediction will be correct in 100γ% of the cases. The second is an estimation: It states that if we estimate that the unknown number η is in the random interval x̄ ± z_u σ/√n, our estimation will be correct in 100γ% of the cases. We thus conclude that if γ is close to 1, we can claim with near certainty that η is in the interval x̄ ± z_u σ/√n. This claim involves the average x̄ of the n observations x_i in a single performance of the experiment 𝒮_n. Using statistics, we have as in (8-4) reached a conclusion that can be accepted as near certainty.
Example 8.2
We wish to estimate the diameter η of a rod using a measuring instrument with zero mean error and standard deviation σ = 1 mm. We measure the rod 64 times and find that the average of the measurements equals 40.3 mm. Find the .95-confidence interval of η.
In this problem,

    γ = .95        1 - δ/2 = .975        z.₉₇₅ ≈ 2        n = 64        σ = 1

Inserting into (8-15), we obtain the confidence interval 40.3 ± 0.25 mm. •
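Equation (8-15) is easy to evaluate numerically; here is a sketch for the data of Example 8.2, with SciPy's percentile function standing in for the normal tables:

```python
from math import sqrt
from scipy.stats import norm

def mean_interval(x_bar, sigma, n, gamma=0.95):
    """Confidence interval x_bar +/- z_{1-delta/2} * sigma / sqrt(n), as in (8-15)."""
    z = norm.ppf(1 - (1 - gamma) / 2)
    half = z * sigma / sqrt(n)
    return x_bar - half, x_bar + half

print(mean_interval(40.3, 1.0, 64))   # about (40.055, 40.545); with z ~ 2 this is 40.3 +/- 0.25
```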
Hypothesis Testing
Hypothesis testing is an important area of decision theory based on statistical considerations. Does smoking decrease life expectancy? Do IQ scores depend on parental education? Is drug A more effective than drug B? The investigation starts with an assumption about the values of one or more parameters of a probabilistic model 𝒮₀. Various factors of the underlying physical experiment are modified, and the problem is to decide whether these modifications cause changes in the model parameters, thereby generating a new model 𝒮₁. Suppose that the mean blood pressure of patients treated with drug A is η₀. We change from drug A to drug B and wish to decide whether this results in a decrease of the mean blood pressure.
We shall introduce the underlying ideas in the context of the following problem. The mean cholesterol count of patients in a certain group is 240, and the standard deviation equals 32. A manufacturer introduces a new drug with the claim that it decreases the mean count from 240 to 228. To test the claim, we treat 64 patients with the new drug and observe that the resulting average count is reduced to 230. Should we accept the claim?
In terms of RVs, this problem can be phrased as follows: We assume that the distribution of an RV x is either a function F₀(x) with η₀ = 240 and σ₀ = σ = 32 or a function F₁(x) with η₁ = 228 and σ₁ = σ = 32. The first assumption is denoted by H₀ and is called the null hypothesis; the second is denoted by H₁ and is called the alternative hypothesis. In the drug problem, H₀ is the hypothesis that the treatment has no effect on the count. Our task is to establish whether the evidence supports the alternative hypothesis H₁. As we have noted, the sample mean x̄ of x is a normal RV with standard deviation σ/√n = 4. Its mean equals η₀ = 240 if H₀ is true and η₁ = 228 if H₁ is true. In Fig. 8.4, we show the density of x̄ for each case. The values of x̄ are concentrated near its mean. It is reasonable, therefore, to reject H₀ iff x̄ is to the left of some constant c. This leads to the following test: Reject the null hypothesis
[Figure 8.4]
iff x̄ < c. The region x̄ < c of rejection of H₀ is denoted by R_c and is called the critical region of the test. To complete the test, we must specify the constant c. To do so, we shall examine the nature of the resulting errors.
Suppose, first, that H₀ is true. If x̄ is in the critical region, that is, if x̄ < c, we reject H₀ even though it is true. Our decision is thus wrong. We then say that we committed a Type I error. The probability of such an error is denoted by α and is called the Type I error probability or significance level of the test. Thus

    α = P{x̄ < c | H₀} = ∫_{-∞}^{c} f_x̄(x, η₀) dx        (8-16)†

Suppose next that H₁ is true. If x̄ > c, we do not reject H₀ even though H₁ is true. Our decision is wrong. In this case, we say that we committed a Type II error. The probability of such an error is denoted by β and is called the Type II error probability. The difference P = 1 - β is called the power of the test. Thus

    β = P{x̄ > c | H₁} = ∫_{c}^{∞} f_x̄(x, η₁) dx        (8-17)

For a satisfactory test, it is desirable to keep both errors small. This is not, however, possible because if c moves to the left, α decreases but β increases; if c moves to the right, β decreases but α increases. Of the two errors, the control of α is more important.
Note  In general, the purpose of a test is to examine whether the available evidence supports the rejection of the null hypothesis; it is not to establish whether H₀ is true. If x̄ is in the critical region R_c, we reject H₀. However, if x̄ is not in R_c, we do not conclude that H₀ is true. We conclude merely that the evidence does not justify rejection of H₀. Let us clarify with a simple example. We wish to examine whether a coin is loaded. To do so, we toss it 100 times and observe that heads shows k times. If k = 15, we reject the null hypothesis because 15 is much smaller than 50. If k = 49, we conclude that the evidence does not support the rejection of the fair-coin hypothesis. The evidence, however, does not lead to the conclusion that the coin is fair. We could have as well concluded that p = .49.
† The expression P{x̄ < c | H₀} is not a conditional probability. It is the probability that x̄ < c under the assumption that H₀ is true.
To carry out a test, we select α and determine c from (8-16). This yields

    α = F_x̄(c, η₀) = G((c - η₀)/(σ/√n))        (8-18)

The resulting β is obtained from (8-17):

    β = 1 - F_x̄(c, η₁) = 1 - G((c - η₁)/(σ/√n))        (8-19)

Example 8.3
In the drug problem,

    η₀ = 240        η₁ = 228        σ = 32        n = 64        x̄ = 230

We wish to test the hypothesis H₀ that the new drug is not effective, with significance level .05. In this case,

    α = .05        z_α = -1.645        c = 233.4        β = 1 - G(1.35) = .089

Since x̄ = 230 < c, we reject the null hypothesis; the new drug is recommended. The power of the test is P = 1 - β ≈ .911 •
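The same computation can be sketched in a few lines of code; the function and argument names are ours, and SciPy's normal distribution replaces the tables:

```python
from scipy.stats import norm

def left_tail_test(x_bar, eta0, eta1, sigma, n, alpha=0.05):
    """Reject H0 iff x_bar < c, with c fixed by the significance level as in (8-18)."""
    se = sigma / n**0.5
    c = eta0 + norm.ppf(alpha) * se          # critical value
    beta = 1 - norm.cdf((c - eta1) / se)     # Type II error probability, (8-19)
    return x_bar < c, c, beta

reject, c, beta = left_tail_test(x_bar=230, eta0=240, eta1=228, sigma=32, n=64)
print(reject, round(c, 1), round(beta, 3))   # True  233.4  0.088
```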
Fundamental Note  There is a basic conceptual difference between parameter estimation and hypothesis testing, although the underlying analysis is the same. In parameter estimation, we have a single model and use the observations to estimate its parameters. Our estimate involves only precise statistical considerations. In hypothesis testing, we have two models: a model 𝒮₀ representing the null hypothesis and a model 𝒮₁ representing the alternative hypothesis (Fig. 8.5). We start with the assumption that 𝒮₀ is the correct model and use the observations to decide whether this assumption must be rejected. Our decision is not based on statistical considerations alone. Mathematical statistics leads only to the following statements:

    If 𝒮₀ is the true model, then P{x̄ < c} = α
    If 𝒮₁ is the true model, then P{x̄ > c} = β        (8-20)

These statements do not, and indeed cannot, lead to a decision. A decision involves the selection of the critical region and is based on other considerations, often subjective, that are outside the scope of mathematical statistics.
[Figure 8.5  Hypothesis testing]
Bayesian Statistics
In the classical approach to the estimation problem, the parameter θ of a distribution F(x, θ) is viewed as an unknown constant. In this approach, the estimate of θ is based solely on the observed values x_i of the RV x. In certain applications, θ is not totally unknown. If, for example, θ is the probability of heads in the coin experiment, we expect that its possible values are close to .5 because most coins are fair. In Bayesian statistics, the available prior information about θ is used in the estimation process. In this approach, the unknown parameter θ is viewed as the value of a random variable 𝛉, and the distribution of x is interpreted as the conditional distribution F_x(x|θ) of x assuming 𝛉 = θ. The prior information is used to assign somehow a density f_θ(θ) to the RV 𝛉, and the problem is to estimate the value θ of 𝛉 in terms of the observed value x of x and the density f_θ(θ) of 𝛉. The problem of estimating the unknown parameter θ is thus changed to the problem of estimating the value θ of an RV 𝛉. In other words, in Bayesian statistics, estimation is changed to prediction.
We shall illustrate with the measurement problem. We measure a rod of diameter θ; the results are the values x_i = θ + ν_i of the sum 𝛉 + ν where ν is the measurement error. We wish to estimate θ. If we interpret θ as an unknown number, we have a classical estimation problem. Suppose, however, that the rod is picked from a production line. In this case, its diameter θ can be interpreted as the value of an RV 𝛉 modeling the diameters of all rods. This is now a problem in Bayesian estimation involving the RVs 𝛉 and x = 𝛉 + ν. As we have seen in Section 6-3, the LMS estimate θ̂ of θ is the regression line

    θ̂ = E{𝛉 | x} = ∫ θ f_θ(θ|x) dθ        (8-21)

Here θ̂ is the point estimate of our rod and x is its measured value. To find θ̂, it suffices to find the conditional density f_θ(θ|x) of 𝛉 assuming x = x. As we know [see (6-32)],

    f_θ(θ|x) = f_x(x|θ) f_θ(θ) / f_x(x)        (8-22)

The function f_θ(θ) is the unconditional density of 𝛉, which we assume known. This function is called the prior (before the measurement). The conditional density f_θ(θ|x) is called the posterior (after the measurement). The conditional density f_x(x|θ) is assumed known. In the measurement problem, it can be expressed in terms of the density f_ν(ν) of the error. Indeed, if 𝛉 = θ, then x = θ + ν; hence,

    f_x(x|θ) = f_ν(x - θ)

Finally, f_x(x) can be obtained from the total probability theorem (6-31):

    f_x(x) = ∫ f_x(x|θ) f_θ(θ) dθ        (8-23)
The conditional density f_x(x|θ), considered as a function of θ, is called the likelihood function. Omitting factors that do not depend on θ, we can write (8-22) in the following form:

    Posterior ∝ likelihood × prior        (8-24)

This is the basis of Bayesian estimation.
We conclude with the model interpretation of the densities: The prior density f_θ(θ) models the diameters of all rods coming out of the production line. The posterior density f_θ(θ|x) models the diameters of all rods of measured diameter x. The conditional density f_x(x|θ) models all measurements of a particular rod of true diameter θ. The unconditional density f_x(x) models all measurements of all rods.
Note  In Bayesian estimation, the underlying model is a product space 𝒮 = 𝒮_θ × 𝒮_x, where 𝒮_θ is the space of the RV 𝛉 and 𝒮_x is the space of the RV x (Fig. 8.6). In the measurement problem, 𝒮_θ is the space of all rods and 𝒮_x is the space of all measurements of a particular rod. The product space 𝒮 is the space of all measurements of all rods. The number θ has two meanings: It is the value of the RV 𝛉 in the space 𝒮_θ; it is also a parameter specifying the density f_x(x|θ) = f_ν(x - θ) in the space 𝒮_x.
The Controversy.  Bayesian statistics is a topic of continuing controversy between those who interpret probability "objectively," as a measure of averages, and those who interpret it "subjectively," as a measure of belief. The controversy centers on the meaning of the prior distribution F_θ(θ). For the objectivists, F_θ(θ) is interpreted in terms of averages as in (4-25); for the subjectivists, F_θ(θ) is a measure of our state of knowledge concerning the unknown parameter θ.
According to the objectivists, parameter estimation can be classical or Bayesian, depending on the nature of the problem. The classical approach is used if θ is a single number (the diameter of a single rod, for example). The Bayesian approach is used if θ is one of the values of an RV 𝛉 (the diameters of all rods in the production line). The subjectivists use the Bayesian approach in all cases. They assign somehow a prior to the unknown θ even if little or nothing is known about θ. A variety of methods have been proposed for doing so; however, they are of limited interest.
[Figure 8.6  Bayesian estimation: observe x, predict θ]
The practical difference between the two approaches depends on the
number n of available samples. If n is small. the two methods can lead to
very different results; however, the results are not very reliable for either
method. As n increases, the role of the prior decreases, and for large n it has
no effect on the result. The following example is an illustration.
Example 8.4
We toss a coin n times, and heads shows k times. Estimate the probability p of heads.
In the classical approach to estimation, p is an unknown parameter, and its empirical estimate is the ratio k/n.
In the Bayesian approach, p is the value of an RV p with density f_p(p), and the LMS estimate p̂ of p equals

    p̂ = γ ∫₀¹ p · p^k (1 - p)^{n-k} f_p(p) dp        (8-25)

This follows from (6-21) and (6-54) (see Problem 6-9). For small values of n, the estimate depends on f_p(p). For large n, the term p^k(1 - p)^{n-k} is a sharp, narrow function centered at the point k/n (Fig. 6.3). This shows that the right side of (8-25) approaches k/n regardless of the form of f_p(p). Note that if f_p(p) is constant, then [see (6-24)]

    p̂ = (k + 1)/(n + 2) ≈ k/n for large n

Thus, for large n, the Bayesian estimate of p equals its classical estimate. •
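A numerical sketch of (8-25): the posterior mean is computed on a grid for an arbitrary prior (the grid-based integration and the sample numbers are our illustration):

```python
import numpy as np

def bayes_estimate(k, n, prior=lambda p: 1.0, grid=10_001):
    """LMS (posterior-mean) estimate of p, formula (8-25), by numerical integration."""
    p = np.linspace(1e-9, 1 - 1e-9, grid)
    w = p**k * (1 - p)**(n - k) * prior(p)   # likelihood x prior
    return float((p * w).sum() / w.sum())    # normalizing constant cancels

print(bayes_estimate(7, 10))       # 0.667 = (7 + 1)/(10 + 2) for a constant prior
print(bayes_estimate(700, 1000))   # 0.700, essentially the classical estimate k/n
```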
Entropy
Given a partition A = [A₁, . . . , A_N] consisting of N events A_i, we form the sum

    H(A) = - Σ_{i=1}^{N} p_i ln p_i        p_i = P(A_i)        (8-26)

This sum is called the entropy of the partition A. Thus entropy is a number associated with a partition as probability is a number associated with an event. Entropy is a fundamental concept in statistics. It is used to complete the specification of a partially known model in terms not of observations but of a principle. The following problem is an illustration.
The average number of faces up of a given die equals 4.5. On the basis of this information, we wish to estimate the probabilities p_i = P{f_i} of the six faces {f_i}. This leads to the following problem: Find six numbers p_i such that

    p₁ + 2p₂ + ··· + 6p₆ = 4.5        p₁ + p₂ + ··· + p₆ = 1        (8-27)

This problem is ill-posed; that is, it does not have a unique solution because there are six unknowns and only two equations. Suppose, however, that we wish to find one solution. What do we do? Among the infinitely many solutions, is there one that is better in some sense than the others? To answer this question, we shall invoke the following principle: Among all possible solutions of (8-27), choose the one that maximizes the entropy H(A) = -(p₁ ln p₁ + ··· + p₆ ln p₆) of the partition A consisting of the six events {f_i}. This is called the principle of maximum entropy.
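Numerically, the maximizing probabilities can be found by using the standard fact that an entropy maximizer under a mean constraint has the exponential form p_i ∝ e^{λi}; the root-finding shortcut below is our choice, not part of the text:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_for(lam):
    """Mean face value when p_i is proportional to exp(lam * i)."""
    w = np.exp(lam * faces)
    return float((faces * w).sum() / w.sum())

lam = brentq(lambda l: mean_for(l) - 4.5, -5.0, 5.0)   # solve the mean constraint of (8-27)
p = np.exp(lam * faces); p /= p.sum()
print(np.round(p, 3), round(float((faces * p).sum()), 3))   # increasing probabilities, mean 4.5
```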
What is the justification of this principle? The pragmatic answer is that it leads to results that agree with observations. Conceptually, the principle is often justified by the interpretation of entropy as a measure of uncertainty: As we know, the probability of an event A is used as a measure of our uncertainty about its occurrence. If P(A) = .999, we are practically certain that A will occur; if P(A) = .2, we are reasonably certain that it will not occur; our uncertainty is maximum if P(A) = .5. Guided by this, we interpret the entropy H(A) of a partition A as a measure of uncertainty not about a single event but about the occurrence of any event of A. This is supported by the following properties of entropy (see Section 12-1): H(A) is a nonnegative number. If P(A_k) = 1 for some k, then P(A_i) = 0 for every i ≠ k and H(A) = 0; in this case, our uncertainty is zero because at the next trial, we are certain that only the event A_k will occur. Finally, H(A) is maximum if all events of A have the same probability; our uncertainty is then maximum.
The notion of entropy as a measure of uncertainty is subjective. In
Section 12-3, we give a different interpretation to H(A) based on the concept
of typical sequences. This concept is related to the relative frequency interpretation of probability, and it leads to an objective meaning of the notion of
entropy. We introduce it next in the context of a partition consisting of two
events.
TYPICAL SEQUENCES AND RELATIVE FREQUENCY  Consider the partition A = [A, Ā] consisting of an event A and its complement Ā. If we repeat the experiment n times, we obtain sequences s_j of the form

    s_j = A Ā A ··· Ā        j = 1, . . . , 2ⁿ        (8-28)

Each sequence s_j is an event in the space 𝒮_n, and its probability equals

    P(s_j) = p^k q^{n-k}        (8-29)

where k is the number of successes of A. The total number of sequences of the form (8-28) equals 2ⁿ. We shall show that if p is not .5 and n is large, only a small number of the sequences of the form (8-28) is likely to occur. Our reasoning is based on the familiar approximation k ≈ np. This says that of all 2ⁿ sequences, only the ones for which k is close to np are likely to occur. Such sequences are denoted by t_j and are called typical; all other sequences are called atypical. If t_j is a typical sequence, then k ≈ np and n - k ≈ n - np = nq. Inserting into (8-29), we obtain

    P(t_j) = p^k q^{n-k} ≈ p^{np} q^{nq}        (8-30)

Since p = e^{ln p} and q = e^{ln q}, this yields

    P(t_j) ≈ e^{np ln p + nq ln q} = e^{-nH(A)}        (8-31)

where H(A) = -(p ln p + q ln q) is the entropy of the partition A = [A, Ā].
We denote by 𝒯 the set {k ≈ np} consisting of all typical sequences. As we noted, it is almost certain that k ≈ np; hence, we expect that almost every observed sequence is typical. From this it follows that P(𝒯) ≈ 1.
Denoting by n_t the number of typical sequences, we conclude that

    P(𝒯) = n_t P(t_j) ≈ 1        hence        n_t ≈ e^{nH(A)}        (8-32)

This fundamental formula relates the number of typical sequences to the entropy of the partition [A, Ā], and it shows the connection between entropy and relative frequency. We show next that it leads to an empirical justification of the principle of maximum entropy.
If p = .5, then H(A) = -(.5 ln .5 + .5 ln .5) = ln 2 and n_t ≈ e^{n ln 2} = 2ⁿ. In this case, all sequences of the form (8-28) are typical. For any other value of p, H(A) is less than ln 2. From this it follows that if n is large,

    n_t ≈ e^{nH(A)} ≪ 2ⁿ        (8-33)

Thus if n is large, the number 2ⁿ of all possible sequences is vastly larger than the number e^{nH(A)} of typical sequences. This result is fundamental in coding theory. Equation (8-33) shows that H(A) is maximum iff n_t is maximum. The principle of maximum entropy can thus be stated in terms of the observable number n_t: If we choose the p_i so as to maximize H(A), the resulting number of typical sequences is maximum. As we explain in Chapter 12, this leads to the empirical interpretation of the maximum entropy principle.
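A quick numerical check of the exponential rate in (8-32): we count only the sequences with k equal to the integer nearest np (a narrower notion of "typical" than the text's, chosen for illustration) and compare (1/n) ln n_t with H(A):

```python
from math import comb, log

p = 0.3
H = -(p * log(p) + (1 - p) * log(1 - p))
for n in (50, 200, 1000, 5000):
    k = round(n * p)
    rate = log(comb(n, k)) / n        # (1/n) * ln(number of sequences with exactly k successes)
    print(n, round(rate, 4), "-> H =", round(H, 4))
```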
Concluding Remarks
All statistical statements are probabilistic, based on the assumption that the data are samples of independent and identically distributed (i.i.d.) RVs. The
theoretical results lead to useful practical inferences only if the underlying
physical experiment meets this assumption. The i.i.d. assumption is often
phrased as follows:
1. The trials must be independent.
2. They must be performed under essentially equivalent conditions.
These conditions are empirical and cannot be verified precisely. Nevertheless, the experiment must be so designed that they are somehow met. For
certain applications, this is a simple task (coin tossing) or requires minimum
effort (card games). In other applications, it involves the use of special
techniques (polling). In physical sciences, the i.i.d. condition follows from
theoretical considerations supported by experimental evidence (statistical
mechanics). In the following chapters, we develop various techniques that
can be used to establish the validity of the i.i.d. conditions. This, however,
requires an infinite number of tests: We must show that the events {x₁ ≤ x₁}, . . . , {x_n ≤ x_n} are independent for every n and for every x_i. In specific
applications, we select only a small number of tests. In many cases, the
validity of the i.i.d. conditions is based on our experience.
Numerical data obtained from repeated trials of a physical experiment
form a sequence of numbers. Such a sequence will be called random if the experiment satisfies the i.i.d. condition. The concept of a random sequence of physically generated numbers is empirical because the i.i.d. condition applied to real experiments is an empirical concept. In the next section, we examine the problem of generating random numbers using computers.
8-3  Random Numbers and Computer Simulation
Computer simulation of experimental data is an important discipline based on computer generation of sequences of random numbers (RNs). It has applications in many fields, including the use of statistical methods in the numerical solution of deterministic problems; the analysis of random physical phenomena by simulation; and the use of random numbers in the design of random experiments, in tests of computer algorithms, in decision theory, and in other areas. In this section, we introduce the basic concepts, stressing the meaning and generation of random numbers. As a motivation, we start with an explanation of the Monte Carlo method in the evaluation of definite integrals.
Suppose that a physical quantity is modeled by an RV u uniformly distributed in the interval (0, 1) and that x = g(u) is a function of u. Since f_u(u) = 1 for u in the interval (0, 1) and 0 elsewhere, the mean of x equals

    η_x = E{g(u)} = ∫₀¹ g(u) f_u(u) du = ∫₀¹ g(u) du        (8-34)

As we know [see (4-90)], the mean of x can be expressed in terms of its empirical average x̄. Inserting this approximation into (8-34), we obtain

    ∫₀¹ g(u) du ≈ (1/n) Σ_{i=1}^{n} x_i = (1/n) Σ_{i=1}^{n} g(u_i)        (8-35)

where u_i are the observed values of u in n repetitions of the underlying physical experiment. The approximation (8-35) is an empirical statement relating the model parameter η_x to the experimental data u_i. It is based on the empirical interpretation (1-1) of probability and is valid if n is large and the data u_i satisfy the i.i.d. condition. This suggests the following method for evaluating statistically a deterministic integral. The data u_i, no matter how they are obtained, are RNs; that is, they are numbers having certain properties. If, therefore, we can develop a method for generating such numbers, we have a method for evaluating the integral in (8-35).
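A minimal Monte Carlo sketch of (8-35), with NumPy's uniform generator standing in for the physical data u_i and g(u) = e^u as our example integrand:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo(g, n=100_000):
    """Approximate the integral of g over (0, 1) by the average (1/n) * sum g(u_i)."""
    u = rng.random(n)        # u_i, uniform in (0, 1)
    return g(u).mean()

print(monte_carlo(np.exp))   # about 1.718 = e - 1, the exact value of the integral
```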
Random Numbers
"What are RNs? Can they be generated by a computer·? Arc there truly
rc1ndom sequences of numbers'!" Such questions were raised from the early
years of computer simulation, and to this day they do not have a generc11ly
accepted answer. The reason is that the term random sequence has two very
252
CHAP.
8
THE MEANING OF STATISTICS
different meanings. The first is empirical: RNs are real (physical) sequences
of numbers generated either as observations of a physical quantity. or by a
computer. The second is conceptual: RNs are mental constructs.
Consider, for example, the following, extensively quoted, definitions.*
D. H. Lehmer (1951): A random sequence is a vague notion embodying the ideas of a sequence in which each term is unpredictable to the uninitiated and whose digits pass a certain number of tests, traditional with statisticians and depending somewhat on the uses to which the sequence is to be put.
J. M. Franklin (1962): A sequence of numbers is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution.
These definitions are fundamentally different. Lehmer's is empirical:
The terms vague, unpredictable to the uninitiated, and depending somewhat
on uses are heuristic characterizations of sequences of real numbers. Franklin's is conceptual: infinite sequences, independent samples, and uniform
distribution are model concepts. Nevertheless, although so different, both
are used to define random number sequences.
To overcome this conceptual ambiguity, we shall proceed as in the
interpretation of probability (Chapter 1). We shall make a clear distinction
between RNs as physically generated numbers and RNs as a theoretical
concept.
THE DUAL INTERPRETATION OF RNs  Statistics is a discipline dealing with averages of real quantities. Computer-generated random numbers are used to apply statistical methods to the solution of various problems. Results obtained with such numbers are given the same interpretation as statistical results involving real data. This is based on the assumption that computer-generated RNs have the same properties as numbers obtained from real experiments. It suggests therefore that we relate the empirical and the theoretical properties of computer-generated RNs to the corresponding properties of random data.
Empirical Definition A sequence of real numbers will be called random if it
has the same properties as a sequence of numerical data obtained from a
random experiment satisfying the i.i.d. condition.
As we have repeatedly noted, the i.i.d. condition applied to real data is a
heuristic statement that can be claimed only as an approximation. This
vagueness cannot, however, be avoided no matter how a sequence of random numbers is specified. The above definition also has the advantage that it
is phrased in terms of concepts with which we are already familiar. We can
thus draw directly on our experience with random experiments and use the well-established tests of randomness for testing the i.i.d. condition.
* Knuth, D. E., The Art of Computer Programming (Reading, MA: Addison-Wesley, 1969).
These considerations are relevant also in the conceptual definition of RNs: All statistical results are based on model concepts developed in the context of a probabilistic space. It is natural therefore to define RNs in terms of RVs.
Conceptual Definition  A sequence of numbers x_i is called random if it equals the samples x_i = x_i(ζ) of a sequence of i.i.d. RVs x_i.
This is essentially Franklin's definition expressed directly in terms of RVs. It follows from this definition that in applications involving computer-generated numbers we can rely solely on the theory of probability. We conclude with the following note.
In computer simulation of random phenomena we use a single sequence of numbers. In the theory of probability we use a family of sequences x_i(ζ) forming a sequence x_i of RVs. From the early years of simulation, various attempts were made to express the theoretical properties of RNs in terms of the properties of a single sequence of numbers. This is in principle possible; however, to be used as the theoretical foundation of the statistical applications of RNs, it must be based on the interpretation of probability as a limit. This approach was introduced by Von Mises early in the century [see (1-7)] but has not been generally accepted. As the developments of the last 50 years have shown, Kolmogoroff's definition is preferable. In the study of the properties of RNs, a new theory is not needed.
GENERATION OF RNs  A good source of random numbers is a properly selected physical experiment. Repeated tosses of a coin generate a random sequence of zeros (heads) and ones (tails). We expect from our past experience that this sequence satisfies the i.i.d. requirements for randomness. Tables of random numbers generated by random experiments are available; however, they are not suitable for computer applications: They require excessive memory, access is slow, and implementation is too involved. An efficient source of RNs is a simple algorithm. The problem of designing a good algorithm for generating RNs is old. Most algorithms used today are based on the following solution proposed in 1948.
Lehmer's Algorithm  Select a large prime number m and an integer a between 2 and m - 1. Form the sequence

    z_n = a z_{n-1} mod m        n ≥ 1        (8-36)*

Starting with a number z₀ ≠ 0, we obtain a sequence of numbers z_n such that

    1 ≤ z_n ≤ m - 1

From this it follows that at least two of the first m numbers of the sequence z_n
* The notation A = B mod m means that A equals the remainder of the division of B by m. For example, 20 mod 13 = 7; 9 mod 13 = 9.
will be equal. Therefore, for n ≥ m, the sequence will be periodic with period m₀ where

    m₀ ≤ m - 1

Example 8.5
We shall illustrate with m = 13. Suppose first that a = 5. If z₀ = 1, then z_n equals 1, 5, 12, 8, 1, . . . ; if z₀ = 2, then z_n equals 2, 10, 11, 3, 2, . . . The sequences so generated are periodic with period m₀ = 4.
Suppose next that a = 6. In this case, the sequence 1, 6, 10, 8, 9, 2, 12, 7, 3, 5, 4, 11, 1, . . . with the maximum period m₀ = m - 1 = 12 results. •
A periodic sequence is not random. However, if in the problems for
which it is intended the required number of samples is smaller than m,
periodicity is irrelevant. It is therefore desirable to select a such that the
period m₀ of z_n is as large as possible. We discuss next the properties that the multiplier a must satisfy so that the resulting sequence z_n has the maximum period m₀ = m - 1.
• Theorem. If m₀ = m - 1, then a is a primitive root of m; that is,

    a^{m-1} = 1 mod m        aⁿ ≠ 1 mod m for 1 < n < m - 1        (8-37)

• Proof. From (8-36) it follows by a simple induction that

    z_n = z₀ aⁿ mod m

Since z_n has the maximum period m - 1, it takes all values between 1 and m - 1; we can therefore assume that z₀ = 1. If a^{m₀} = 1 mod m and m₀ < m - 1, then z_{m₀} = a^{m₀} mod m = 1 = z₀. Thus m₀ is a period; this, however, is impossible; hence, (8-37) is true.
Note that if m is a prime number, (8-37) is also a sufficient condition for maximum period. Thus the sequence z_n generated by the recursion equation (8-36) has maximum period iff a is a primitive root of m.
Suppose that m = 13. In this case, 6 is a primitive root of m because the smallest integer n such that 6ⁿ = 1 mod m is 12 = 13 - 1. However, 5 is not a primitive root because 5⁴ = 1 mod 13.
To complete the specification of the RN generator (8-36), we must select the integers m and a. A value for m, suggested many years ago* and used extensively today, is the number

    m = 2³¹ - 1 = 2,147,483,647

This number is prime and is large enough for most applications. In the selection of the constant a, our first requirement is that the resulting sequence z_n have maximum period. For this to occur, a must satisfy (8-37). Over half a billion numbers satisfy (8-37), but most of them yield poor RNs. To arrive at a satisfactory choice, we subject the multipliers that yield maximum period to a variety of tests. Each test reduces the potential choices
* D. H. Lehmer, "Mathematical Methods in Large Scale Computing Units," Annu. Comput. Lab. Harvard Univ. 26 (1951).
until all standard tests are passed. We are thus left with a relatively small number of choices. Some of those are then subjected to special tests and are used in particular applications. Through a combination of additional tests and experience in solving problems, we arrive at a small number of multipliers. Such a multiplier is the constant

    a = 7⁵ = 16,807

The resulting RN generator is*

    z_n = 16807 z_{n-1} mod 2147483647        (8-38)
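A direct sketch of the recursion (8-36), checked against Example 8.5 and against the constants of (8-38); the function name is ours:

```python
def lehmer(z0, a, m, count):
    """Generate count terms of z_n = a * z_{n-1} mod m, starting from z0."""
    z, out = z0, []
    for _ in range(count):
        z = (a * z) % m
        out.append(z)
    return out

print(lehmer(1, 5, 13, 8))                # 5, 12, 8, 1, 5, ...  period 4
print(lehmer(1, 6, 13, 13))               # 6, 10, 8, 9, 2, 12, 7, 3, 5, 4, 11, 1, 6  period 12
print(lehmer(1, 16807, 2**31 - 1, 3))     # 16807, 282475249, 1622650073
```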
General Algorithms  We describe next a number of more complex algorithms for generating RNs. We should stress that complexity does not necessarily improve the quality of a generator. The algorithm in (8-38) is very simple, but the quality of the resulting sequence is very high.
Equation (8-36) is a first-order linear congruential recursion; that is, z_n depends only on z_{n-1}, and the dependence is linear. A general nonlinear congruential recursion is an equation of the form

    z_n = f(z_{n-1}, . . . , z_{n-r}) mod m

The recursion is linear if

    z_n = (a₁ z_{n-1} + ··· + a_r z_{n-r} + c) mod m

The special case

    z_n = (z_{n-1} + z_{n-r}) mod m        (8-39)

is of particular interest. It is simple and efficient, and if r is large, it might generate good sequences with period larger than m.
Note, finally, that the assumption that m is prime complicates the evaluation of the product a z_{n-1} mod m. The computation is simplified if m = 2^r where r is the available word length. This, however, weakens the randomness of the resulting sequence. In such cases, algorithms of the form

    z_n = (a z_{n-1} + c) mod m        (8-40)

are used.
TESTS OF RANDOMNESS  The sequence z_i in (8-38) is modeled by a sequence of discrete type RVs z_i taking all integer values from 1 to m - 1. The sequence

    u_i = z_i / m        (8-41)

is modeled essentially by a sequence of continuous type RVs u_i taking all values from 0 to 1. If the RVs u_i are i.i.d. and uniformly distributed, we shall say that the numbers u_i are random and uniformly distributed or, simply, random. The objective of testing is to establish whether a given sequence of RNs is random. The i.i.d. condition requires an infinite number of tests. We
* S. K. Park and K. W. Miller, "Random Number Generators: Good Ones Are Hard to Find," Communications of the ACM 31, no. 10 (October 1988).
must show that

    P{u_i ≤ u} = u        0 < u < 1
    P{u₁ ≤ u₁, u₂ ≤ u₂} = P{u₁ ≤ u₁} P{u₂ ≤ u₂}
    P{u₁ ≤ u₁, u₂ ≤ u₂, u₃ ≤ u₃} = P{u₁ ≤ u₁} P{u₂ ≤ u₂} P{u₃ ≤ u₃}        (8-42)

and so on. In real problems, we can have only a finite number of tests. We cannot therefore claim that a particular sequence is random but another is not. Tests lead merely to plausible inferences. We might conclude, for example, that a given sequence is reasonably random for certain applications or that this sequence is more random than that.
There are two kinds of tests: theoretical and statistical. We shall explain with an example. We wish to test whether the RNs z_i generated by the algorithm (8-38) are the samples of an RV with mean m/2. In a theoretical test, we reason, using the properties of prime numbers, that in a sequence of m - 1 samples each integer will appear once. This yields the average

    (1/(m - 1)) Σ_{i=1}^{m-1} z_i = (1 + 2 + ··· + (m - 1))/(m - 1) = m/2

In a statistical test, we generate n numbers and form their average z̄ = Σ z_i / n. If our assumption is correct, then z̄ ≈ m/2.
In a theoretical test, no samples are used. All conclusions are exact statements based on mathematical reasoning; however, for most tests the analysis is difficult. Furthermore, all results involve averages over the entire period. Thus they might not hold for various subsequences of z_i. For example, the fact that each integer appears once does not guarantee that the sequence is uniform.
In an empirical test, we generate n numbers z_i where n is reasonably large but much smaller than m, and we use the numbers z_i to form various averages. The underlying theory is simple for most tests. However, the computations are time-consuming, and the conclusions are probabilistic, leading to subjective decisions.
All empirical tests are tests of statistical hypotheses. The theory is developed in Chapter 10. Let us look at some illustrations.
Distribution  We wish to test whether the RNs u_i are the samples of an RV u with uniform distribution. To do so, we use either the χ² test (10-85) or the Kolmogoroff-Smirnov test (10-44).
Independence  We wish to test whether the subsequences

    x_i = u_{2i}        y_i = u_{2i+1}

are the samples of two independent RVs x and y. To do so, we form two partitions involving the events {α_i ≤ x ≤ β_i} and {γ_j ≤ y ≤ δ_j} and apply the χ² test (10-78).
Tests based directly on (8-42) are, in general, complex. Most standard tests use (8-42) indirectly. For example, to test the hypothesis that the RVs x and y are independent, we test the weaker hypothesis that they are uncorrelated, using as an estimate of their correlation coefficient r the empirical ratio r̂ in (10-40). If the test fails, we reject the independence hypothesis. We give next two rather special illustrations of indirect testing.
Gap Test  Given a uniform RV u and two constants α and β such that 0 < α < β < 1, we form the event

    A = {α < u < β}        P(A) = β - α = p

We repeat the underlying experiment and observe sequences of the form

    Ā A Ā Ā A ···        (8-43)

where A appears in the ith position if the event A occurs at the ith trial. We next form an RV x the values of which equal the gap lengths in (8-43), that is, the number of times Ā appears between successive A's. The RV x so constructed has a geometric distribution as in (4-66):

    P{x = r} = p_r = (1 - p)^r p        r = 0, 1, . . .        (8-44)

We shall use (8-44) to test whether a given sequence u_i is random. If the numbers u_i are the samples of u, then α < u_i < β iff the event A occurs. Denoting by n_r the number of gaps of length r in the resulting sequence (8-43), we expect with near certainty that n_r ≈ p_r n where n is the length of the sequence. This empirical statement is rephrased in Section 10-4 as a goodness-of-fit problem: To examine whether the numbers n_r fit the probabilities p_r in (8-44), we apply the χ² test (10-61). If the test fails, we conclude that u_i is not a good RN sequence.
Spectral Test  Here is a simplified development of this important test, limited to its statistical version. Given an RV u with uniform distribution, we form its moment-generating function

    Φ(s) = E{e^{su}} = ∫₀¹ e^{su} du = (e^s - 1)/s

With s = jω, this yields (Fig. 8.7)

    |Φ(jω)| = 2|sin(ω/2)| / |ω|

[Figure 8.7  |Φ(jω)| = 2|sin(ω/2)|/|ω|]
Hence, for any integer r,

    Φ(j2πr) = 1 for r = 0        Φ(j2πr) = 0 for r ≠ 0        (8-45)

We shall use this result to test whether a given sequence u_i = z_i/m is uniform. For this purpose, we approximate the mean of the function e^{j2πru} by its empirical average:

    Φ(j2πr) = E{e^{j2πru}} ≈ (1/n) Σ_{i=1}^{n} e^{j2πr z_i/m}        (8-46)

If the uniformity assumption is correct, the right side is small compared to 1 for every integer r ≠ 0.
We shall now test for the independence and uniformity of the sequences x_i = u_{2i} and y_i = u_{2i+1}. If two RVs u and v are uniform and independent, their joint moment function equals

    Φ_{uv}(s₁, s₂) = E{e^{s₁u + s₂v}} = Φ(s₁) Φ(s₂)

From this and (8-46) it follows that

    Φ(j2πr₁, j2πr₂) = 1 for r₁ = r₂ = 0        and 0 otherwise

Proceeding as in (8-45), we conclude that if the subsequences z_{2i} and z_{2i+1} of an RN sequence z_i are independent, then

    (1/n) Σ_i e^{j2π(r₁ z_{2i} + r₂ z_{2i+1})/m} ≈ 1 for r₁ = r₂ = 0        and ≈ 0 otherwise        (8-47)

The method can be extended to an arbitrary number of subsequences.
Note Hypothesis testing is based on a number of untested assumptions. The
theoretical results are probabilistic statements, and the conclusions involve the
subjective choice of various parameters. Applied to RN sequences, testing
leads, therefore, only to plausible inferences. In the final analysis, specific RN
generators are adopted for general use not only because they pass standard
tests but also because they have been successfully applied to many problems.
RNs with Arbitrary Distributions
All RN sequences z_i generated by congruential algorithms are integers with uniform distribution between 1 and m - 1. The sequence

    u_i = z_i / m        (8-48)

is essentially of continuous type, and the corresponding RV u is uniform in the interval (0, 1). The RV a + bu is uniform in the interval (a, a + b), and the RV 1 - u is uniform in the interval (0, 1). Their samples are the sequences a + bu_i and 1 - u_i, respectively.
[Figure 8.8]
We shall use the sequence u_i or, equivalently, 1 - u_i to generate RN sequences with arbitrary distributions. Algorithms generating nonuniform RNs directly are not available. We shall discuss two general methods and several special techniques.
PERCENTILE-TRANSFORMATION METHOD  If z_i and w_i are the samples of two RVs z and w and if w = g(z), then w_i = g(z_i). Using this observation, we shall generate an RN sequence x_i with distribution a given function F(x), in terms of the RN sequence u_i in (8-48). The proposed method is based on the following theorem.
• Theorem. If x is an RV with distribution F(x) and

    u = F(x)        (8-49)

then u is uniformly distributed in the interval (0, 1); that is, F_u(u) = u for 0 ≤ u ≤ 1.
• Proof. From (8-49) and the monotonicity of F(x) it follows that the events {u ≤ u} and {x ≤ x} are equal. Hence (Fig. 8.8),

    F_u(u) = P{u ≤ u} = P{x ≤ x} = F(x) = u

and the proof is complete.
Denoting by F^{(-1)}(u) the inverse of the function u = F(x), we conclude from (8-49) that

    x = F^{(-1)}(u)        (8-50)

From this it follows that if u_i are the samples of u, then

    x_i = F^{(-1)}(u_i)        (8-51)

is an RN sequence with distribution F(x). Thus, to form a sequence x_i with distribution F(x), it suffices to form the inverse F^{(-1)}(u) of F(x) and to compute its values for u = u_i.
Example 8.6
We wish to generate a sequence of RNs with distribution

    F(x) = 1 - e^{-x/λ}        x > 0        (8-52)

The inverse of F(x) is the function x = -λ ln(1 - u). If u is uniform in (0, 1), then 1 - u is also uniform in (0, 1). Hence, the sequence

    x_i = -λ ln u_i        (8-53)

has an exponential distribution as in (8-52). •
[Figure 8.9]
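A sketch of the percentile-transformation rule (8-53); the rate λ = 2 and the sample size are our choices:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
u = rng.random(100_000)            # uniform RNs u_i
x = -lam * np.log(u)               # exponential RNs by (8-53)
print(x.mean(), x.var())           # near lam = 2 and lam**2 = 4
```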
Discrete Type RNs  We wish to generate an RN sequence x_i taking the m values a_k with probabilities p_k. The corresponding distribution is a staircase function with discontinuities at the points a₁ < ··· < a_m (Fig. 8.9a); its inverse is a staircase function with discontinuities at the points c₁ < ··· < c_m = 1 where (Fig. 8.9b)

    c_k = F(a_k) = p₁ + ··· + p_k        k = 1, . . . , m

Applying (8-51), we obtain the following rule for generating x_i:

    Set x_i = a_k iff c_{k-1} ≤ u_i < c_k        c₀ = 0        (8-54)
Example 8.7
(a) Binary RNs. The sequence

    x_i = 0 if 0 < u_i < p        x_i = 1 if p < u_i < 1

takes the values 0 and 1 with probabilities p and 1 - p, respectively.
(b) Decimal RNs. The sequence

    x_i = k if k/10 < u_i < (k + 1)/10        k = 0, 1, . . . , 9

takes the values 0, 1, . . . , 9 with equal probability.
(c) Bernoulli RNs. If

    c_k = Σ_{r=0}^{k} C(n, r) p^r (1 - p)^{n-r}        k = 0, 1, . . . , n

then the sequence x_i in (8-54) has a Bernoulli distribution with parameters n and p.
(d) Poisson RNs. If

    c_k = e^{-λ} Σ_{r=0}^{k} λ^r / r!        k = 0, 1, . . .

then the sequence x_i in (8-54) has a Poisson distribution with parameter λ. •
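Rule (8-54) in code; the values and probabilities below are our example, and NumPy's searchsorted carries out the comparison of u_i with the cumulative constants c_k:

```python
import numpy as np

rng = np.random.default_rng(2)

def discrete_rns(values, probs, n):
    """Set x = a_k iff c_{k-1} <= u < c_k, with c_k the cumulative probabilities of (8-54)."""
    c = np.cumsum(probs)                    # c_1, ..., c_m  (c_m = 1)
    u = rng.random(n)
    return np.asarray(values)[np.searchsorted(c, u, side='right')]

x = discrete_rns([0, 1, 2], [0.2, 0.5, 0.3], 100_000)
print([round(float(np.mean(x == v)), 3) for v in (0, 1, 2)])   # near 0.2, 0.5, 0.3
```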
From F_x(x) to F_y(y)  We have an RN sequence x_i with distribution F_x(x), and we wish to generate another RN sequence y_i with distribution F_y(y). As we know [see (8-49)], the sequence u_i = F_x(x_i) is uniform. Applying (8-51), we conclude that the numbers

    y_i = F_y^{(-1)}(u_i) = F_y^{(-1)}(F_x(x_i))        (8-55)

are the values of an RN sequence with distribution F_y(y) (Fig. 8.10).
REJECTION METHOD  The determination of the inverse x = F^{(-1)}(u) of a function F(x) by a computer is a difficult task involving the solution of the equation F(x) = u for every u. It is therefore desirable to avoid using the transformation method if the function F^{(-1)}(u) is not known or cannot be computed efficiently. We shall now develop a method for realizing an arbitrary F(x) that avoids the inversion problem. The method is based on the properties of conditional distributions and their empirical interpretation as averages of subsequences.
Conditional Distributions  Given an RV x and an event M, we form the conditional distribution

    F_x(x|M) = P{x ≤ x | M} = P{x ≤ x, M} / P(M)        (8-56)

[see (6-8)]. The empirical interpretation of F_x(x|M) is the relative frequency of the occurrence of the event {x ≤ x} in the subsequence

    y_i = x_{k_i}

of trials in which the event M occurs. From this it follows that y_i is a sequence of RNs with distribution

    F_y(x) = P{y ≤ x} = F_x(x|M)        (8-57)

We shall use this result to generate a sequence y_i of RNs with distribution a given function.
[Figure 8.10]
• Rejection Theorem. Given an RV u uniform in the interval (0, 1) and independent of x, and a function r(x) such that

    0 ≤ r(x) ≤ 1

we form the event M = {u ≤ r(x)}. We maintain that

    f_x(x|M) = (1/c) f_x(x) r(x)        c = ∫ f_x(x) r(x) dx        (8-58)

• Proof. The density of u equals 1 in the interval (0, 1) by assumption; hence,

    f_{xu}(x, u) = f_x(x) f_u(u) = f_x(x)        0 ≤ u ≤ 1

As we know,

    f_x(x|M) dx = P{x < x ≤ x + dx | M} = P{x < x ≤ x + dx, M} / P(M)        (8-59)

The set of points on the xu plane such that u ≤ r(x) is the shaded region of Fig. 8.11. The event {x < x ≤ x + dx} consists of all outcomes such that x is in the vertical strip (x, x + dx). The intersection of these two regions is the part of the vertical strip in the shaded region u ≤ r(x). In this region, f_x(x) is constant; hence,

    P{x < x ≤ x + dx, M} = f_x(x) r(x) dx        P(M) = ∫ f_x(x) r(x) dx = c

Inserting into (8-59), we obtain (8-58).
We shall use the rejection theorem to construct an RN sequence y_i with a specified density f_y(y). The function f_y(y) is arbitrary, subject only to the mild condition that f_y(x) = 0 for every x for which f_x(x) = 0. We form the function

    r(x) = a f_y(x) / f_x(x)

where a is a constant such that 0 ≤ r(x) ≤ 1. Clearly,

    ∫ r(x) f_x(x) dx = a ∫ f_y(x) dx = a

[Figure 8.11]
and (8-58) yields

    f_x(x|M) = f_y(x)        M = {u ≤ a f_y(x)/f_x(x)}        (8-60)

From (8-60) it follows that if x_i and u_i are the samples of x and u, respectively, then the desired sequence y_i is formed according to the following rejection rule:

    Set y_i = x_i if u_i ≤ a f_y(x_i)/f_x(x_i)
    Reject x_i otherwise        (8-61)
Example 8.8
We wish to generate a sequence y_i with a truncated normal distribution starting from a sequence x_i with exponential distribution:

f_y(y) = √(2/π) e^(-y²/2) U(y)        f_x(x) = e^(-x) U(x)

In this problem,

r(x) = a f_y(x)/f_x(x) = a √(2/π) e^(-x²/2 + x)

With a = √(π/2e), (8-61) yields the following rejection rule:

Set y_i = x_i    if u_i ≤ e^(-(x_i - 1)²/2)
Reject x_i otherwise    •
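A short sketch of this rejection rule in Python with NumPy (an assumed choice of language and library; the acceptance threshold follows Example 8.8):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.exponential(scale=1.0, size=n)   # x_i with density exp(-x) U(x)
u = rng.uniform(size=n)                  # u_i uniform in (0, 1), independent of x_i

# Rejection rule of Example 8.8: keep x_i when u_i <= exp(-(x_i - 1)**2 / 2).
keep = u <= np.exp(-0.5 * (x - 1.0) ** 2)
y = x[keep]

# y_i should follow the truncated normal density sqrt(2/pi) exp(-y**2/2) U(y).
print(len(y) / n)                      # acceptance rate, near a = sqrt(pi/(2e)) ~ 0.76
print(y.mean(), np.sqrt(2 / np.pi))    # both near 0.798
```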
SPECIAL METHODS The preceding methods are general. For specific distributions, faster and more accurate methods are available, some of them tricky. Here are several illustrations, starting with some simple observations.

We shall use superscripts to identify several RVs. Thus x^1, . . . , x^m are m RVs and x_i^1, . . . , x_i^m are their samples. If x_i is an RN sequence with distribution f(x), then any subsequence of x_i is an RN sequence with distribution f(x). From this it follows that the RNs

x_i^1 = x_{mi-m+1}    x_i^2 = x_{mi-m+2}    . . .    x_i^m = x_{mi}    (8-62)

are the samples of the m i.i.d. RVs x^1, . . . , x^m; their distribution equals the function f(x).
Transformations If z is a function

z = g(x^1, . . . , x^m)    (8-63)

of the m RVs x^k, then the numbers z_i = g(x_i^1, . . . , x_i^m) form an RN sequence with distribution f_z(z). To generate an RN sequence z_i whose distribution is a given function f_z(z), it suffices, therefore, to find g such that the distribution of the resulting z equals f_z(z).
Example 8.9
We wish to generate an RN sequence z_i with gamma distribution:

f_z(z) ~ z^(m-1) e^(-z/λ) U(z)    (8-64)

We know (see Example 7.5) that if the RVs x^1, . . . , x^m are i.i.d. with exponential distribution, their sum

z = x^1 + · · · + x^m

has a gamma distribution. From this it follows (see Example 8.6) that

z_i = -λ (ln u_i^1 + · · · + ln u_i^m)    (8-65)    •
Example 8.10
(a) Chi-square RNs. We wish to generate an RN sequence z_i with distribution χ²(n):

f_z(z) ~ z^(n/2-1) e^(-z/2) U(z)

If n = 2m, then this is a special case of (8-64) with λ = 2; hence,

z_i = -2 ∑_{k=1}^{m} ln u_i^k

To realize z_i for n = 2m + 1, we observe that if y is χ²(2m), w is N(0, 1), and y and w are independent, then [see (7-87)] the sum z = y + w² is χ²(2m + 1); hence,

z_i = -2 ∑_{k=1}^{m} ln u_i^k + (w_i)²

where u_i^k are uniform RNs and w_i are RNs with normal distribution.
(b) Student t RNs. If the RN sequences z_i and w_i are independent with distributions N(0, 1) and χ²(n), respectively, then [see (7-101)] the ratio

x_i = z_i/√(w_i/n)

is an RN sequence with distribution t(n).
(c) Snedecor RNs. If the RN sequences z_i and w_i are independent with distributions χ²(m) and χ²(n), respectively, then [see (7-106)] the ratio

x_i = (z_i/m)/(w_i/n)

is an RN sequence with distribution F(m, n).
Example 8.11
If x^1, . . . , x^m are m RVs taking the values 0 and 1 with probabilities p and q, respectively, their sum has a binomial distribution. From this it follows that if x_i is a binary sequence as in Example 8.7, then the sequence

z_1 = x_1 + · · · + x_m        z_2 = x_{m+1} + · · · + x_{2m}        . . .

has a binomial distribution.    •
Mixing We wish to realize an RN sequence x_i whose density is a given function f(x). We assume that f(x) can be written as a sum

f(x) = p_1 f_1(x) + · · · + p_m f_m(x)    (8-66)
where f_k(x) are m densities and p_k are m positive numbers such that p_1 + · · · + p_m = 1. We develop a method based on the assumption that we know how to realize RN sequences x_i^k with the densities f_k(x).

We introduce the constants

c_0 = 0        c_k = p_1 + · · · + p_k        k = 1, . . . , m

and form the events

A_k = {c_{k-1} ≤ u < c_k}        k = 1, . . . , m

where u is an RV uniform in the interval (0, 1) with samples u_i. We maintain that the sequence x_i can be realized by mixing the sequences x_i^k according to the following rule:

Set x_i = x_i^k    if c_{k-1} ≤ u_i < c_k    (8-67)

• Proof. We must show that the density h(x) of the RN sequence x_i so constructed equals the function f(x) in (8-66). The sequence x_i^k is a subsequence of the sequence x_i conditioned by the event A_k; hence, its density equals h(x | A_k). This yields

h(x | A_k) = f_k(x)

because the distribution of x_i^k equals f_k(x) by assumption. From the total probability theorem (6-8) it follows that

h(x) = h(x | A_1)P(A_1) + · · · + h(x | A_m)P(A_m)

But P(A_k) = c_k - c_{k-1} = p_k; hence, h(x) = f(x).
Example 8.12
We wish to construct an RN sequence x_i with a Laplace distribution

f(x) = (1/2) e^(-|x|) = (1/2) e^(-x) U(x) + (1/2) e^(x) U(-x)

(Fig. 8.12). This is a special case of (8-66) with

f_1(x) = e^(-x) U(x)        f_2(x) = f_1(-x)        p_1 = p_2 = .5

The functions f_1(x) and f_2(x) are the densities of the RVs -ln v and ln v, respectively, where v is uniform in the interval (0, 1) and independent of u. Inserting into (8-67), we obtain the following rule:

Set x_i = -ln v_i    if 0 ≤ u_i < .5
Set x_i = ln v_i    if .5 ≤ u_i < 1.0    •
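A minimal sketch of this mixing rule in Python with NumPy (an assumed choice; u_i selects the branch, an independent v_i supplies the exponential RNs):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

u = rng.uniform(size=n)   # mixing variable u_i
v = rng.uniform(size=n)   # independent uniform v_i used to form the exponential RNs

# Rule (8-67) for Example 8.12: -ln v_i on the first half of u, +ln v_i on the second.
x = np.where(u < 0.5, -np.log(v), np.log(v))

# x_i should have the Laplace density f(x) = 0.5 exp(-|x|): mean 0, E|x| = 1.
print(x.mean(), np.mean(np.abs(x)))
```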
Figure 8.12
Normal RNs We discuss next some of the many methods for realizing a normal RN sequence z_i. The percentile-transformation method cannot be used because the normal distribution cannot be efficiently inverted.

Rejection and Mixing. In Example 8.8 we constructed a sequence x_i with density the truncated normal curve

f_1(x) = √(2/π) e^(-x²/2) U(x)

The normal density f(z) can be written as a sum

f(z) = (1/2) f_1(z) + (1/2) f_1(-z)

Applying (8-67), we obtain the following rule for realizing z_i:

Set z_i = x_i    if 0 ≤ u_i < .5
Set z_i = -x_i    if .5 ≤ u_i < 1.0
Polar Coordinates If the RVs z and w are N(0, 1) and independent and

z = r cos φ        w = r sin φ

then (see Problem 5-27) the RVs r and φ are independent, φ is uniform in the interval (-π, π), and r has a Rayleigh density

f_r(r) = r e^(-r²/2)

Thus φ = π(2u - 1) where u is uniform in the interval (0, 1). From (4-78) it follows (see also Problem 5-27) that if x has an exponential density e^(-x) U(x) and r = √(2x), then r has a Rayleigh density. Combining with (8-63), we conclude that if the sequences u_i and v_i are uniform and independent, the RN sequences

z_i = √(-2 ln v_i) cos π(2u_i - 1)        w_i = √(-2 ln v_i) sin π(2u_i - 1)

are normal and independent.
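The polar-coordinate construction takes only a few lines; the sketch below, in Python with NumPy (an assumption, as before), realizes the two sequences and checks that they are nearly uncorrelated with unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.uniform(size=n)
v = rng.uniform(size=n)

r = np.sqrt(-2.0 * np.log(v))        # Rayleigh RNs formed from the exponential -ln v
phi = np.pi * (2.0 * u - 1.0)        # phase uniform in (-pi, pi)

z = r * np.cos(phi)
w = r * np.sin(phi)

# z_i and w_i should be independent N(0, 1) sequences.
print(z.mean(), z.std(), np.corrcoef(z, w)[0, 1])
```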
Central Limit Theorem If u^1, . . . , u^m are m independent uniform RVs and m >> 1, then the sum

z = u^1 + · · · + u^m

is approximately normal [see (7-66)]. In fact, the approximation is very good even if m is as small as 10. From this it follows that if

u_i^1 = u_{mi-m+1}, . . . , u_i^m = u_{mi}

are m subsequences of a uniform RN sequence u_i as in (8-62), then the sequence

z_1 = u_1 + · · · + u_m        z_2 = u_{m+1} + · · · + u_{2m}        . . .

is approximately normal.
Mixing In the computer literature, a number of elaborate algorithms have been proposed for the generation of normal RNs, based on the expansion of f(z) into a sum of simpler functions, as in (8-67). In Fig. 8.13, we show such an expansion. The major part of f(z) consists of r rectangles f_1, . . . , f_r. These functions, properly normalized, are densities that can be easily realized in terms of a uniform RN sequence. The remaining functions f_{r+1}, . . . , f_{r+m} are approximated by simpler curves; however, since their contribution is small, the approximation need not be very accurate.

Figure 8.13
Pairs of RNs We conclude with a brief note on the problem of generating pairs of RNs (x_i, y_i) with a specified joint distribution. This problem can, in principle, be reduced to the problem of generating one-dimensional RNs if we use the identity

f(x, y) = f(x) f(y|x)

The implementation, however, is not simple. The normal case is an exception. The joint normal density is specified in terms of five parameters: η_x, η_y, σ_x, σ_y, and the covariance μ_xy. This leads to the following method for generating pairs of normal RN sequences.

Suppose that the RVs z and w are independent N(0, 1) with samples z_i and w_i and

x = a_1 z + b_1 w + c_1        y = a_2 z + b_2 w + c_2    (8-68)

As we know, the RVs x and y are jointly normal, and

η_x = c_1        η_y = c_2        σ_x² = a_1² + b_1²        σ_y² = a_2² + b_2²        μ_xy = a_1 a_2 + b_1 b_2

By a proper choice of the coefficients in (8-68), we can thus obtain two RVs x and y with an arbitrary normal distribution. The corresponding RN sequences are

x_i = a_1 z_i + b_1 w_i + c_1        y_i = a_2 z_i + b_2 w_i + c_2

At the end of this section we discuss the generation of RNs with multinomial distributions.
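One convenient choice of the coefficients in (8-68) is sketched below in Python with NumPy (an assumption; the target means, standard deviations, and correlation are illustrative values, not taken from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Target parameters (illustrative values).
eta_x, eta_y = 1.0, -2.0
sig_x, sig_y = 2.0, 0.5
r = 0.8                              # correlation coefficient
mu_xy = r * sig_x * sig_y            # covariance

# One choice of coefficients satisfying the relations below (8-68):
a1, b1, c1 = sig_x, 0.0, eta_x
a2 = mu_xy / sig_x
b2 = np.sqrt(sig_y**2 - a2**2)
c2 = eta_y

z = rng.standard_normal(n)
w = rng.standard_normal(n)
x = a1 * z + b1 * w + c1
y = a2 * z + b2 * w + c2

print(x.std(), y.std(), np.cov(x, y)[0, 1])   # ~sig_x, ~sig_y, ~mu_xy
```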
The Monte Carlo Method
A major application of the Monte Carlo method is the evaluation of multidimensional integrals using statistical methods. We shall discuss this important topic in the context of a one-dimensional integral. We assume, introducing suitable scaling and shifting if necessary, that the interval of integration is (0, 1) and the integrand is between 0 and 1. Thus our problem is to estimate the integral

I = ∫_0^1 g(u) du        where 0 ≤ g(u) ≤ 1    (8-69)
Method 1 Suppose that u is an RV uniform in the interval (0, 1) and x = g(u) is a function of u. As we know, I is the mean of the RV x = g(u):

I = E{g(u)} = η_x

Hence, our problem is to estimate the mean of x. To do so, we apply the estimation techniques introduced in Section 8-2 (which will be developed in detail in Chapter 9). However, instead of real data, we use as samples of u the computer-generated RNs u_i. This yields the empirical estimate

I ≈ (1/n) ∑_{i=1}^{n} x_i = (1/n) ∑_{i=1}^{n} g(u_i)    (8-70)

To evaluate the quality of this approximation, we shall interpret the RNs x_i = g(u_i) as the values of the samples x_i = g(u_i) of the RV x = g(u). With

x̄ = (1/n) ∑_{i=1}^{n} x_i

we conclude from (8-8) that

E{x̄} = η_x = I        σ_x̄² = σ_x²/n = σ_1²

where

σ_x² = E{x²} - η_x² = ∫_0^1 g²(u) du - [∫_0^1 g(u) du]²    (8-71)

Thus the average x̄ in (8-70) is the point estimate of the unknown integral I. The corresponding interval estimate is obtained from (8-15), and it leads to the following conclusion: If we compute the average in (8-70) a large number of times, using a different set of RNs u_i each time, in 100(1 - α)% of the cases, the correct value of I will be between x̄ - z_{1-α/2} σ_x/√n and x̄ + z_{1-α/2} σ_x/√n.
Method 2 Consider two independent RVs u and v, uniform in the interval (0, 1). Their joint density equals 1 in the square 0 ≤ u < 1, 0 ≤ v < 1; hence, the probability mass in the region v ≤ g(u) (shaded in Fig. 8.14) equals I. From this it follows that the probability p = P(A) of the event A = {v ≤ g(u)} equals

p = P{v ≤ g(u)} = I

This leads to the following estimate of I. Form n independent pairs (u_i, v_i) of uniform RNs. Count the number n_A of times that v_i ≤ g(u_i). Use the approximation

p = ∫_0^1 g(u) du ≈ n_A/n
Figure 8.14
To find the variance of the estimate, we observe that the RV n_A has a binomial distribution; hence, the ratio n_A/n is an RV with mean p = I and variance σ_2² = I(1 - I)/n.

Note, finally, that

σ_2² - σ_1² = (1/n) ∫_0^1 g(u)[1 - g(u)] du > 0

Thus the first method is more accurate. Furthermore, it requires only one RN sequence u_i.
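The two estimates are easy to compare numerically. The sketch below, in Python with NumPy (an assumption), estimates I = ∫_0^1 u² du = 1/3 both ways; the integrand g is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def g(u):
    # Any integrand with 0 <= g(u) <= 1 on (0, 1); here I = 1/3.
    return u * u

u = rng.uniform(size=n)
v = rng.uniform(size=n)

I1 = g(u).mean()            # Method 1: sample mean of g(u_i), as in (8-70)
I2 = np.mean(v <= g(u))     # Method 2: fraction of points under the curve

print(I1, I2, 1 / 3)
# Estimated standard errors; Method 1 is the smaller of the two:
print(g(u).std() / np.sqrt(n), np.sqrt(I2 * (1 - I2) / n))
```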
COMPUTERS IN STATISTICS A computer is used in statistics in two fundamentally different ways. It is used to perform various computations and to
store and analyze statistical data. This function involves mostly standard
computer programs, and the fact that the problems originate in statistics is
incidental. The second use entails the numerical solution of various problems that originate in statistics but are actually deterministic, by means of
statistical methods; they are thus statistical applications of the Monte Carlo
method. The underlying principle is a direct consequence of the empirical
interpretation of probability: All parameters of a probability space are deterministic, they can be estimated in terms of data obtained from real experiments, and the same estimates can be used if the data are replaced by
computer-generated RNs.
Next we shall apply the Monte Carlo method to the problem of estimating the distribution F(x) and the percentiles x_u of an RV x.

Distributions To estimate the distribution

F(x) = P{x ≤ x}

of x for a specific x, we generate n RNs x_i with distribution F(x) and count the number n_x of entries such that x_i ≤ x. The desired estimate of F(x) [see (4-24)] is

F(x) ≈ n_x/n    (8-72)
This method is based on the assumption that we can generate the RNs x_i. We can do so if x is a function of other RVs with known distributions. The method is used even if F(x) is of known form but its evaluation is complicated, it is not tabulated, or access to existing tables is not convenient.
Example 8.13
We wish to estimate the values of a chi-square distribution with m degrees of freedom. As we know [see (7-89)], if z^1, . . . , z^m are m independent N(0, 1) RVs, then the RV

x = (z^1)² + · · · + (z^m)²

is χ²(m). It therefore suffices to form m normal RN sequences z_i^k. To do so, we generate a single normal sequence z_i and form m subsequences z_i^k = z_{mi-m+k} as in (8-62). The sum

x_i = (z_i^1)² + · · · + (z_i^m)²

is an RN sequence with distribution χ²(m). Using the sequence x_i so generated, we estimate the chi-square distribution from (8-72). Another method for generating x_i is discussed in Example 8.10. •
Percentiles The u-percentile of a function F(x) is a number x_u such that

F(x_u) = u

Whereas F(x) is a probability, x_u is not a probability; therefore, it cannot be estimated empirically. As in the inversion problem, it can only be approximated by trial and error: Select x and determine F(x); if F(x) < u, try a larger value for x; if F(x) > u, try a smaller value; continue until F(x) is close enough to u.

In certain applications, the problem is not to find x_u but to establish whether a given number x is larger or smaller than the unknown x_u. In such problems, we need not find x_u. Indeed, from the monotonicity of F(x) it follows that

x > x_u iff F(x) > u        x < x_u iff F(x) < u    (8-73)

It thus suffices to find F(x) and to compare it to the given value of u. We discuss next an important application.
Computer Simulation in Hypothesis Testing Suppose that

q = g(x^1, . . . , x^m)

is a function of m RVs x^k. We observe the values x^k of these RVs and form the corresponding value q = g(x^1, . . . , x^m) of the RV q. We wish to establish whether the number q so obtained is between the percentiles q_α and q_β of q, where α and β are two given numbers:

q_α < q < q_β        α = F_q(q_α)        β = F_q(q_β)    (8-74)

In hypothesis testing, q is called a test statistic and (8-74) is the null hypothesis.

To solve this problem we must determine q_α and q_β. This, however, can be avoided if we use (8-73): From the monotonicity of F_q(q) it follows that (8-74) is true iff

α < F_q(q) < β    (8-75)
To establish (8-74), it suffices therefore to establish (8-75). This involves the determination of F_q(q) where q is the value of q obtained from the experimental data x^k. The function F_q(q) can be determined in principle in terms of the distribution of the RVs x^k. This, however, might be a difficult task, particularly if many parameters are involved. In such cases, a computer simulation is used based on (8-72):

We start with m RN sequences x_i^k simulating the samples of the RVs x^k and form the RN sequence

q_i = g(x_i^1, . . . , x_i^m)        i = 1, . . . , n    (8-76)

The numbers q_i are the computer-generated samples of the RV q. Hence, their distribution is the unknown function F_q(q). We next count the number n_q of samples such that q_i ≤ q and form the ratio n_q/n. This ratio is the desired estimate of F_q(q). Thus, (8-75) is true iff

α < n_q/n < β    (8-77)
Note that we have here two kinds of random numbers. The first consists of the data x^k obtained from a physical experiment. These data are used to form the value

q = g(x^1, . . . , x^k, . . . , x^m)

of the test statistic q. The second consists of the computer-generated sequences x_i^k. These sequences are used to determine the sequence q_i in (8-76) and the value F_q(q) of the distribution of q from (8-72).
Example 8.14
We have a partition A = [A_1, . . . , A_m] consisting of the m events A_r, and we wish to test the hypothesis that the probabilities P(A_r) of these events equal m given numbers p_r. To do so, we perform the underlying physical experiment N times, and we observe that the event A_r occurs k^r times, where

k^1 + · · · + k^m = N        p_1 + · · · + p_m = 1

Using the data k^r, we form the sum

q = ∑_{r=1}^{m} (k^r - Np_r)²/(Np_r)    (8-78)

Our objective is to establish whether the number q so obtained is smaller than the u-percentile q_u of the RV

q = ∑_{r=1}^{m} (k^r - Np_r)²/(Np_r)    (8-79)

This RV is called Pearson's test statistic, and the resulting test the chi-square test (see Section 10-4). As we have explained,

q < q_u    iff    F_q(q) < u    (8-80)

To solve the problem, it suffices therefore to find F_q(q) and compare it to u. For large N, the RV q has approximately a χ²(m - 1) distribution [see (10-63)]. For moderate values of N, however, its determination is difficult. We shall find it using computer simulation.
The RVs k^r have a multinomial distribution as in (3-41). It suffices therefore to generate m RN sequences

k_i^1, . . . , k_i^m        k_i^1 + · · · + k_i^m = N

with such a distribution. We do so at the end of this section.
Using the sequences k_i^r so generated, we form the samples

q_i = ∑_{r=1}^{m} (k_i^r - Np_r)²/(Np_r)        i = 1, 2, . . . , n    (8-81)

of Pearson's test statistic q, and we denote by n_q the number of entries q_i such that q_i < q. The ratio n_q/n is the desired estimate of F_q(q). Thus (8-80) is true iff n_q/n < u.

Note that q is a number determined from (8-78) in terms of the experimental data k^r, and q_i is a sequence determined numerically from (8-81) in terms of the computer-generated RNs k_i^r. •
RNs with Multinomial Distribution In the test of Example 8.14 we made use of the multinomial vector sequence

K_i = [k_i^1, . . . , k_i^m]

This sequence forms the samples of m multinomially distributed RVs

K = [k^1, . . . , k^m]

of order N. To carry out the test, we must generate such a sequence. This can be done as follows:

Starting with a sequence u_i of RNs uniformly distributed in the interval (0, 1), we form N subsequences

U_i = [u_i^1, . . . , u_i^j, . . . , u_i^N]        i = 1, 2, . . .    (8-82)

as in (8-62). These sequences are the samples of the i.i.d. RVs

u^1, . . . , u^j, . . . , u^N

From this it follows that

P{p_1 + · · · + p_{r-1} ≤ u^j < p_1 + · · · + p_r} = p_r    (8-84)

The vector U_i in (8-82) consists of N components. We denote by k_i^r the number of components u_i^j such that, for a specific i,

p_1 + · · · + p_{r-1} < u_i^j < p_1 + · · · + p_{r-1} + p_r    (8-85)

Comparing with (8-84), we conclude, after some thought, that the sequence

k_i^1, . . . , k_i^r, . . . , k_i^m

so generated has a multinomial distribution of order N as in (3-41).
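The construction (8-82)-(8-85), together with the simulation of Example 8.14, is sketched below in Python with NumPy (an assumed choice; the probabilities p_r, the order N, and the observed value q are illustrative, not from the text).

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])     # hypothesized probabilities p_r (illustrative)
N = 40                            # order of the multinomial, i.e. trials per experiment
n = 20_000                        # number of simulated vector samples K_i
c = np.cumsum(p)                  # c_r = p_1 + ... + p_r

# One row per sample K_i: N uniform components classified as in (8-85).
u = rng.uniform(size=(n, N))
cells = np.searchsorted(c, u)     # index r of the interval containing each u_i^j
K = np.stack([(cells == r).sum(axis=1) for r in range(len(p))], axis=1)

# Samples of Pearson's statistic (8-81) and the estimate (8-72) of its distribution.
q_i = ((K - N * p) ** 2 / (N * p)).sum(axis=1)
q_observed = 4.1                  # hypothetical value computed from experimental data
print(np.mean(q_i <= q_observed)) # estimate of F_q(q_observed), to compare with u
```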
9
Estimation
Estimation is a fundamental discipline dealing with the specification of a
probabilistic model in terms of observations of real data. The underlying
theory is used not only in parameter estimation but also in most areas of
statistics, including hypothesis testing. The development in this chapter consists of two parts. In the first part (Sections 9-1 to 9-4), we introduce the
notion of estimation and develop various techniques involving the commonly used parameters. This includes estimates of means, variances, probabilities, and distributions. In the last part (Sections 9-5 and 9-6), we develop general methods of estimation. We establish the Rao-Cramér bound, and we introduce the notions of efficiency, sufficiency, and completeness.
9-1
General Concepts
Suppose that the distribution of an RV x is a function F(x, θ) of known form depending on an unknown parameter θ, scalar or vector. Parameter estimation is the problem of estimating θ. To solve this problem, we repeat the underlying physical experiment n times and denote by x_i the observed values
of the RV x. We shall find a point estimate and an interval estimate of θ in terms of these observations.

A (point) estimate is a function θ̂ = g(X) of the observation vector X = [x_1, . . . , x_n]. Denoting by X = [x_1, . . . , x_n] the sample vector of x (its components are now the RVs x_i), we form the RV θ̂ = g(X). This RV is called the (point) estimator of θ. A statistic is a function of the sample vector X. Thus an estimator is a statistic.

We shall say that θ̂ is an unbiased estimator of θ if E{θ̂} = θ; otherwise, θ̂ is called biased, and the difference E{θ̂} - θ is the bias. In general, the estimation error θ̂ - θ decreases as n increases. If it tends to 0 in probability (see Section 7-3) as n → ∞, then θ̂ is a consistent estimator. The sample mean x̄ of x is an unbiased estimator of its mean η = E{x}, and its variance equals σ²/n; hence, E{(x̄ - η)²} = σ²/n → 0. From this it follows that x̄ tends to η in the MS sense, therefore also in probability. In other words, x̄ is a consistent estimator of η.

In parameter estimation, it is desirable to keep the error θ̂ - θ small in some sense. This requirement leads to a search for a statistic g(X) having as density a function centered near the unknown θ. The optimum choice of g(X) depends on the error criterion. If we use the LMS criterion, the optimum θ̂ is called best.
• Definition. A statistic θ̂ = g(X) is the best estimator of θ if the function g(X) is so chosen as to minimize the MS error

e = E{[θ - g(X)]²} = ∫ · · · ∫ [θ - g(X)]² f(x_1, θ) · · · f(x_n, θ) dx_1 · · · dx_n    (9-1)
Unlike the nonlinear prediction problem (see Section 6-3), the problem of determining the best estimator does not have a simple solution. The reason is that the unknown in (9-1) is not only the function g(X) but also the parameter θ. In Section 9-6, we determine best estimators for certain classes of distributions. The results, however, are primarily of theoretical interest. For most applications, θ̂ is expressed as the empirical estimate of the mean of some function of x. This approach is simple and in many cases leads to best estimates. For example, we show in Section 9-6 that if x is normal, then its sample mean x̄ is the best estimator of η.
An interval estimate of a parameter θ is an interval of the form (θ_1, θ_2), where θ_1 = g_1(X) and θ_2 = g_2(X) are functions of the observation vector X. The corresponding interval (θ_1, θ_2) is the interval estimator of θ. Thus an interval estimator is a random interval, that is, an interval the endpoints of which are two statistics θ_1 = g_1(X) and θ_2 = g_2(X).

• Definition. We shall say that (θ_1, θ_2) is a γ-confidence interval of θ if

P{θ_1 < θ < θ_2} = γ    (9-2)

where γ is a given constant. This constant is called the confidence coefficient. The difference δ = 1 - γ is the confidence level of the estimate. The statistics θ_1 and θ_2 are called confidence limits.
If γ is a number close to 1, we can expect with near certainty that the unknown θ is in the interval (θ_1, θ_2). This expectation is correct in 100γ% of the cases.

Parameter Transformation Suppose that (θ_1, θ_2) is a γ confidence interval of a parameter θ and q(θ) is a function of θ. This function specifies the parameter τ = q(θ). We maintain that if q(θ) is a monotonically increasing function of θ, the statistics τ_1 = q(θ_1) and τ_2 = q(θ_2) are the γ confidence limits of τ. Indeed, the events {θ_1 < θ < θ_2} and {τ_1 < τ < τ_2} are equal; hence,

P{τ_1 < τ < τ_2} = P{θ_1 < θ < θ_2} = γ    (9-3)

If q(θ) is monotonically decreasing, the corresponding interval is (τ_2, τ_1).
The objective of interval estimation is to find two statistics 81 and ~
such as to minimize in some sense the length 8:! - 81 of the estimation
interval subject to the constraint (9-2). This problem does not have a simple
solution. In the applications of this chapter, the statistics 81 and ~ are
expressed in terms of various point estimators with known distributions. The
results involve percentiles of the normal. the chi-square. the Student t and
the Snedecor F distributions introduced in Section 7-4 and tabulated at the
back of the book.
9-2
Expected Values
We start with the estimation of the mean η of an RV x. We shall use as its point estimate the average

x̄ = (1/n) ∑_{i=1}^{n} x_i

of the observations x_i and as interval estimate the interval (x̄ - a, x̄ + a). To find a, we need to know the distribution of the sample mean x̄ of x. We shall assume that x̄ is a normal RV. This assumption is true if x is normal; it is approximately true in general if n is sufficiently large (central limit theorem).

Suppose first that the variance of x is known. From the normality assumption it follows as in (8-15) that

P{x̄ - z_u σ/√n < η < x̄ + z_u σ/√n} = γ    (9-4)
where (Fig. 9.1)

u = (1 + γ)/2 = 1 - δ/2

Figure 9.1
Unless otherwise stated, it will be assumed that u and γ are so related. Equation (9-4) shows that the interval

x̄ - z_u σ/√n < η < x̄ + z_u σ/√n    (9-5)

is a γ confidence interval of η; in fact, it is the smallest such interval. Thus to find the γ confidence interval of the mean η of an RV x, we proceed as follows:

1. Observe the samples x_i of x and form their average x̄.
2. Select a number γ = 2u - 1 close to 1.
3. Find the percentile z_u of the normal distribution.
4. Form the interval x̄ ± z_u σ/√n.
As in the prediction problem, the choice of the confidence coefficient γ is dictated by two conflicting requirements: If γ is close to 1, the estimate is reliable, but the size 2z_u σ/√n of the confidence interval is large; if γ is reduced, z_u is reduced, but the estimate is less reliable. The final choice is a compromise based on the applications. The commonly used values of γ are .9, .95, .99, and .999. The corresponding values of u are .95, .975, .995, and .9995, yielding the percentiles (Table 1)

z_.95 = 1.645        z_.975 = 1.960        z_.995 = 2.576        z_.9995 = 3.291
Note that z_.975 ≈ 2. This leads to the slightly conservative estimate

x̄ ± 2σ/√n        for γ = .95    (9-6)

Example 9.1
We wish to estimate the weight w of a given object. The error of the available scale is an N(0, σ) RV ν with σ = 0.06 oz. Thus the scale readings are the samples of the RV x = w + ν.
Figure 9.2
(a) We weigh the object four times, and the results are 16.02, 16.09, 16.13, and 16.16 oz. Their average x̄ = 16.10 is the point estimate of w. The .95 confidence interval is obtained from (9-6):

x̄ ± 2σ/√n = 16.10 ± 0.06

(b) We wish to obtain the confidence interval x̄ ± 0.02. How many times must we weigh the object? Again using (9-6), we obtain 2σ/√n = 0.02; hence, n = 36. •
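A sketch of part (a) in Python with NumPy and SciPy (an assumed choice of tools; the data are those of Example 9.1):

```python
import numpy as np
from scipy.stats import norm

x = np.array([16.02, 16.09, 16.13, 16.16])   # the four scale readings
sigma = 0.06                                 # known std of the scale error, in oz
gamma = 0.95
z_u = norm.ppf((1 + gamma) / 2)              # u = (1 + gamma)/2, so z_u ~ 1.96

xbar = x.mean()
half = z_u * sigma / np.sqrt(len(x))
print(xbar, xbar - half, xbar + half)        # about 16.10 +/- 0.06, as in (9-6)
```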
One-Sided Intervals In certain applications, the objective is to establish whether the unknown η is larger or smaller than some constant. The resulting confidence intervals are called one-sided.

As we know (Fig. 9.2a),

P{x̄ < η + a} = G(a/(σ/√n)) = γ        if a = z_γ σ/√n

Hence,

P{η > x̄ - z_γ σ/√n} = γ    (9-7)

This leads to the right γ confidence interval η > x̄ - z_γ σ/√n.
Similarly, the formula (Fig. 9.2b)

P{η < x̄ - z_δ σ/√n} = γ = 1 - δ    (9-8)

leads to the left γ confidence interval η < x̄ + z_γ σ/√n because z_δ = -z_γ.
For these estimates we use the percentiles

z_.9 = 1.282        z_.95 = 1.645        z_.99 = 2.326        z_.999 = 3.090
Example 9.2
A drug contains a harmful substance H. We analyze four samples and find that the amount of H per gallon equals .41, .46, .50, and .54 oz. The analysis error is an N(0, σ) RV with σ = 0.05 oz. On the basis of these observations, we wish to state with confidence coefficient .99 that the amount of H per gallon does not exceed c. Find c.
In this problem,

x̄ = .478        σ = 0.05        n = 4        z_.99 = 2.326

and (9-8) yields c = x̄ + z_γ σ/√n = 0.536 oz. Thus we can state with confidence coefficient .99 that, on the basis of the measurements, the amount of H does not exceed 0.54 oz. •
Tchebycheff Inequality If the RV x is not normal and the number n of observations is not large, we cannot use (9-4) because then x̄ is not normal. To find a confidence interval for η, we must first find the distribution of x̄. This involves the evaluation of n - 1 convolutions. To avoid this difficulty, we use Tchebycheff's inequality. Setting k = 1/√δ in (4-115) and replacing x by x̄ and σ by σ/√n, we obtain

P{x̄ - σ/√(nδ) < η < x̄ + σ/√(nδ)} ≥ 1 - δ = γ    (9-9)

Thus the interval x̄ ± σ/√(nδ) contains the γ confidence interval of η. If we therefore claim that η is in this interval, the probability that our claim is wrong will not exceed δ regardless of the form of F(x) or the size n of the given sample.

Note that if γ = .95, then 1/√δ = 4.47 and (9-9) yields the interval x̄ ± 4.47σ/√n. Under the normality assumption, the corresponding interval is x̄ ± 2σ/√n.
UNKNOWN VARIANCE If σ is unknown, we cannot use (9-4). To find an interval estimate of η, we introduce the sample variance:

s² = (1/(n - 1)) ∑_{i=1}^{n} (x_i - x̄)²

As we know [see (7-99)], the RV s² is an unbiased estimator of σ² and its variance tends to 0 as n → ∞; hence, s ≈ σ for large n. Inserting this approximation into (9-5), we obtain the estimate

x̄ - z_u s/√n < η < x̄ + z_u s/√n    (9-10)

This is satisfactory for n > 30. For smaller values of n, the probability that η is in the interval is somewhat smaller than γ. In other words, the exact γ confidence interval is larger than (9-10).

To determine the exact interval estimate of η, we assume that the RV x is normal and form the ratio

(x̄ - η)/(s/√n)

Under the normality assumption, this ratio has a Student t distribution with
n - 1 degrees of freedom [see (7-104)]. Denoting by t_u its percentile, we obtain

P{-t_u < (x̄ - η)/(s/√n) < t_u} = 2u - 1

With 2u - 1 = γ = 1 - δ, this yields

P{x̄ - t_u s/√n < η < x̄ + t_u s/√n} = γ    (9-11)

Hence, the exact γ confidence interval of η (Fig. 9.3) is

x̄ - t_u s/√n < η < x̄ + t_u s/√n    (9-12)
For n > 20, the t(n) distribution is approximately N(0, σ) with variance σ² = n/(n - 2) [see (7-103)]. This yields

t_u(n) ≈ z_u √(n/(n - 2))        for n > 20

The determination of the interval estimate of η when σ is unknown involves the following steps.

1. Observe the samples x_i of x, and form the sample mean x̄ and the sample variance s².
2. Select a number γ = 2u - 1 close to 1. Find the percentile t_u of the t(n - 1) distribution.
3. Form the interval x̄ ± t_u s/√n.
Figure 9.3

Example 9.3
We wish to estimate the mean η of the diameter of rods coming out of a production line. We measure 10 units and obtain the following readings in millimeters:

10.23    10.22    10.15    10.23    10.26    10.15    10.26    10.19    10.14    10.17

Assuming normality, find the .99 confidence interval of η.
In this problem, n = 10, t_.995(9) = 3.25, and

x̄ = (1/10) ∑_{i=1}^{10} x_i = 10.2

Inserting into (9-11), we obtain the interval 10.2 ± 0.05 mm. •
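The steps above are easy to automate. A minimal sketch in Python with NumPy and SciPy (an assumed choice of tools), applied to the rod data of Example 9.3:

```python
import numpy as np
from scipy.stats import t

def mean_interval(x, gamma):
    """gamma-confidence interval for the mean when sigma is unknown, as in (9-12)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                        # sample standard deviation
    t_u = t.ppf((1 + gamma) / 2, df=n - 1)   # percentile of t(n - 1)
    half = t_u * s / np.sqrt(n)
    return xbar - half, xbar + half

rods = [10.23, 10.22, 10.15, 10.23, 10.26, 10.15, 10.26, 10.19, 10.14, 10.17]
print(mean_interval(rods, 0.99))             # about 10.2 +/- 0.05 mm
```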
Reasoning similarly, we conclude that the one-sided confidence intervals are given by (9-8) provided that we replace z_γ by t_γ and σ by s. This yields

η > x̄ - t_γ s/√n        η < x̄ + t_γ s/√n    (9-13)
Example 9.4
A manufacturer introduces a new type of electric bulb. He wishes to claim with confidence coefficient .95 that the mean time to failure η exceeds c days. For this purpose, he tests 20 bulbs and observes the following times to failure (in days; several entries are not legible):

. . .    83    . . .    79    85    81    84    73    91    73    . . .

Assuming that the time to failure of the bulbs is normal, find c.
In this problem,

x̄ = (1/20) ∑ x_i = 80.05        s² = (1/19) ∑ (x_i - x̄)² = 29.74

Thus

s = 5.454        n = 20        t_.95(19) = 1.73

and (9-13) yields c = x̄ - t_.95 s/√n = 77.94. •
MEAN-DEPENDENT VARIANCE In a number of applications, the distribution of x is specified in terms of a single parameter θ. In such cases, the mean η and the variance σ² of x can be expressed in terms of θ; hence, they are functionally related. Thus we cannot use the preceding results. To estimate η, we must develop special techniques for each case. We illustrate with two examples. We assume in both cases that n is large. This leads to the assumption that the sample mean x̄ is normal.
Exponential Distribution Suppose first that

f(x) = (1/λ) e^(-x/λ) U(x)

(Fig. 9.4). As we know, η = λ and σ = λ = η. From this and the normality assumption it follows that the RV x̄ is normal with mean λ and variance λ²/n.
Figure 9.4
We shall use this observation to find the γ confidence interval of λ. From (9-4) it follows with η = σ = λ that

P{x̄ - z_u λ/√n < λ < x̄ + z_u λ/√n} = γ

Rearranging terms, we obtain

P{ x̄/(1 + z_u/√n) < λ < x̄/(1 - z_u/√n) } = γ    (9-14)

and the interval

x̄/(1 ± z_u/√n)        γ = 2u - 1

results.
Example 9.5
Suppose that the duration of telephone calls in a certain area is an exponentially distributed RV with mean η. We monitor 100 calls and find that the average duration is 1.8 minutes.
(a) Find the .95 confidence interval of η.
With x̄ = 1.8 and z_u ≈ 2, (9-14) yields

x̄/(1 ± 2/√n) = (1.5, 2.25)

(b) In this problem, F(x) = 1 - e^(-x/η); hence, the probability p that a telephone call lasts more than 2.5 minutes equals

p = 1 - F(2.5) = e^(-2.5/η)

Find the .95 confidence interval of p.
The number p is a monotonically increasing function of the parameter η. We can therefore use (9-3). In our case, the confidence limits of η are 1.5 and 2.25;
hence, the corresponding limits of p are

p_1 = e^(-2.5/1.5) = .19        p_2 = e^(-2.5/2.25) = .33

We can thus claim with confidence coefficient .95 that the percentage of calls lasting more than 2.5 minutes is between 19% and 33%. •
Poisson Distribution Consider next a Poisson-distributed RV with parameter λ:

P{x = k} = e^(-λ) λ^k/k!        k = 0, 1, . . .

In this case [see (4-110)], η = λ and σ² = λ = η; hence, the sample mean x̄ of x is approximately N(λ, √(λ/n)). This approximation holds, of course, only for the distribution of x̄. Since x̄ is of the discrete type, its density consists of points.

With σ = √λ, (9-4) yields

P{x̄ - z_u √(λ/n) < λ < x̄ + z_u √(λ/n)} = γ    (9-15)

Unlike (9-14), this does not lead readily to an interval estimate. To find such an estimate, we note that (9-15) can be written in the form

P{(λ - x̄)² < (z_u²/n) λ} = γ = 2u - 1    (9-16)

The points (x̄, λ) that satisfy the inequality are in the interior of the parabola

(λ - x̄)² = (z_u²/n) λ    (9-17)

From this it follows that the γ confidence interval of λ is the vertical segment (λ_1, λ_2) of Fig. 9.5. The endpoints of this interval are the roots of the quadratic (9-17).

Figure 9.5
Note that in this problem, the normal approximation holds even for moderate values of n provided that nλ > 25. The reason is that the RV nx̄ = x_1 + · · · + x_n is Poisson-distributed with parameter nλ; hence, the normality approximation is based on the size of nλ.

Example 9.6
The number of monthly fatal accidents in a region is a Poisson RV with parameter λ. In a 12-month period, the reported accidents per month were

4    2    5    2    7    3    1    6    3    8    4    5

Find the .95 confidence interval of λ.
In this problem, x̄ = 4.25, n = 12, and z_u ≈ 2. Inserting into (9-17), we obtain the equation

(λ - 4.25)² = (4/12) λ

The roots of this equation are λ_1 = 3.23, λ_2 = 5.59. We can therefore claim with confidence coefficient .95 that the mean number of monthly accidents is between 3.23 and 5.59. •
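The endpoints are the roots of a quadratic in λ, so they can be computed directly. A sketch in Python with NumPy (an assumed choice), using the numbers of Example 9.6:

```python
import numpy as np

xbar, n, z_u = 4.25, 12, 2.0     # the numbers of Example 9.6

# Roots of (lambda - xbar)**2 = (z_u**2 / n) * lambda, i.e.
# lambda**2 - (2*xbar + z_u**2/n)*lambda + xbar**2 = 0.
b = 2 * xbar + z_u**2 / n
disc = np.sqrt(b**2 - 4 * xbar**2)
lam1, lam2 = (b - disc) / 2, (b + disc) / 2
print(lam1, lam2)                # roughly 3.2 and 5.6, matching the example up to rounding
```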
Probabilities
We wish to estimate the probability p = P(A) of an event A. To do so, we repeat the experiment n times and denote by k the number of successes of A. The ratio p̂ = k/n is the point estimate of p. This estimate tends to p in probability as n → ∞ (law of large numbers). The problem of finding an interval estimate of p is equivalent to the problem of estimating the mean of an RV x with mean-dependent variance.

We introduce the zero-one RV associated with the event A. This RV takes the values 1 and 0 with

P{x = 1} = p        P{x = 0} = q = 1 - p

Hence,

η_x = p        σ_x² = pq

The sample mean x̄ of x equals k/n. Furthermore, η_x̄ = p, σ_x̄² = pq/n, as in (8-8).

Large n The RV x̄ is of discrete type, taking the values k/n. For large n, its distribution approaches a normal distribution with mean p and variance pq/n. It therefore follows from (9-4) that

P{x̄ - z_u √(pq/n) < p < x̄ + z_u √(pq/n)} = γ    (9-18)

We cannot use this expression directly to find an interval estimate of the unknown p because p appears in the variance term pq/n. To avoid this difficulty, we introduce the approximation pq ≈ 1/4. This yields the interval

x̄ ± z_u/(2√n)        x̄ = k/n    (9-19)
Figure 9.6
This approximation is too conservative because p(1 - p) ≤ 1/4, and it is tolerable only if p is close to 1/2. We mention it because it is used sometimes.

The unknown p is close to k/n; therefore, a better approximation results if we set p ≈ x̄ in the variance term of (9-18). This yields the interval

x̄ ± z_u √(x̄(1 - x̄)/n)        γ = 2u - 1    (9-20)

for the unknown parameter p.

We shall now find an exact interval. To do so, we write (9-18) in the form

P{(p - x̄)² < z_u² p(1 - p)/n} = γ    (9-21)

The points (x̄, p) that satisfy the inequality are in the interior of the ellipse

(p - x̄)² = (z_u²/n) p(1 - p)    (9-22)

From this it follows that the γ-confidence interval of p is the vertical segment (p_1, p_2) of Fig. 9.6. The endpoints of this segment are the roots of the quadratic (9-22).
Example 9.7
In a local poll, 400 women were asked whether they favor abortion; 240 said yes. Find the .95 confidence interval of the probability p that women favor abortion.
In this problem, x̄ = .6, n = 400, and z_u ≈ 2. The exact confidence limits p_1 and p_2 are the roots of the quadratic

(p - .6)² = (1/100) p(1 - p)

Solving, we obtain p_1 = .550, p_2 = .647. This result is usually phrased as follows: "Sixty percent of all women favor abortion. The margin of error is ±5%." The
"Sixty percent of all women favor abortion. The margin of error is ±.5%." The
SEC.
9-2
EXPECTED VALUES
285
approximations (9-19) and (9-20) yield the intervals
x- -+ - -•.2:r. . : . .6-+ . I
2vll
.t
~
-:-r:I .vr=:-·.
.r-o · .l'1
-· .6 +: .049
VII
•
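The exact limits in (9-22) are again roots of a quadratic. A short sketch in Python with NumPy (an assumed choice; the function name is mine), checked against Example 9.7:

```python
import numpy as np

def proportion_interval(k, n, z_u=2.0):
    """Roots of (p - xbar)**2 = (z_u**2/n) p (1 - p), the exact interval of (9-22)."""
    xbar = k / n
    a = 1 + z_u**2 / n
    b = 2 * xbar + z_u**2 / n
    c = xbar**2
    disc = np.sqrt(b**2 - 4 * a * c)
    return (b - disc) / (2 * a), (b + disc) / (2 * a)

print(proportion_interval(240, 400))   # about (.55, .65), as in Example 9.7
```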
Note that (9-18) can be used to predict the number k of successes of A if p is known. It leads to the conclusion that, with confidence coefficient γ, the ratio x̄ = k/n is in the interval

p - z_u √(p(1 - p)/n) < k/n < p + z_u √(p(1 - p)/n)    (9-23)

This interval is the horizontal segment (x̄_1, x̄_2) of Fig. 9.6.

Example 9.8
We receive a box of 100 fuses. We know that the probability that a fuse is defective equals .2. Find the .95 confidence interval of the number k = nx̄ of good fuses.
In this problem, p = .8, z_u ≈ 2, n = 100, and (9-23) yields the interval

np ± z_u √(np(1 - p)) = 80 ± 8    (9-24)

Thus the number of good fuses is between 72 and 88. •
Small n For small n, the determination of the confidence interval of p is conceptually and computationally more difficult. We shall first solve the prediction problem. We assume that p is known, and we wish to find the smallest interval (k_1, k_2) such that if we predict that the number k of successes of A is between k_1 and k_2, our prediction will be correct in 100γ% of the cases. The number k is the value of the RV

k = nx̄ = x_1 + · · · + x_n    (9-25)

This RV has a binomial distribution; hence, our problem is to find an interval (k_1, k_2) of minimum size such that

γ = P{k_1 ≤ k ≤ k_2} = ∑_{k=k_1}^{k_2} C(n, k) p^k q^(n-k) = 1 - δ    (9-26)

This problem can be solved by trial and error. A simpler solution is obtained if we search for the largest integer k_1 and the smallest integer k_2 such that

∑_{k=0}^{k_1} C(n, k) p^k q^(n-k) < δ/2        ∑_{k=k_2}^{n} C(n, k) p^k q^(n-k) < δ/2    (9-27)

To find k_1 for a specific p, we add terms on the left side starting from k = 0 until we reach the largest k_1 for which the sum is less than δ/2. Repeating this for every p from 0 to 1, we obtain a staircase function k_1(p), shown in Fig. 9.7 as a smooth curve. The function k_2(p) is determined similarly. The functions k_1(p) and k_2(p) depend on n and on the confidence coefficient γ. In Fig. 9.8, we show them for n = 10, 20, 50, 100 and for γ = .95 and .99, using as horizontal variable the ratio k/n. The curves k_1(p)/n and k_2(p)/n approach the two branches of the ellipse of Fig. 9.6.
Figure 9.7
Example 9.9
A fair coin is tossed 20 times. Find the .95 confidence interval of the number k of heads.
The intersection points of the line p = .5 with the n = 20 curves of Fig. 9.8 are k_1/n = .26 and k_2/n = .74. This yields the interval 6 ≤ k ≤ 14. •
We turn now to the problem of estimating the confidence interval of p. From the foregoing discussion it follows that

P{k_1(p) ≤ k ≤ k_2(p)} = γ    (9-28)
Figure 9.8
This shows that the set of points (k, p) that satisfy the inequality is the shaded region of Fig. 9.7 between the curves k_1(p) and k_2(p); hence, for a specific k, the γ confidence interval of p is the vertical segment (p_1, p_2) between these curves.

Example 9.10
We examine n = 50 units out of a production line and find that k = 10 are defective. Find the .95 confidence interval of the probability p that a unit is defective.
The intersection of the line x̄ = k/n = .2 with the n = 50 curves of Fig. 9.8 yields p_1 = .1 and p_2 = .33; hence, .1 < p < .33. If we use the approximation (9-20), we obtain the interval .12 < p < .28. •
Bayesian Estimation
In Bayesian estimation, the unknown parameter is the value θ of an RV θ with known density f_θ(θ). The available information about the RV x is its conditional density f_x(x|θ). This is a function of known form depending on θ, and the problem is to predict the RV θ, that is, to find a point and an interval estimate of θ in terms of the observed values x_i of x. Thus the problem of Bayesian estimation is equivalent to the prediction problem considered in Sections 6-3 and 8-2. We review the results.

In the absence of any observations, the LMS estimate θ̂ of θ is its mean E{θ}. To improve the estimate, we form the average x̄ of the n observations x_i. Our problem now is to find a function φ(x̄) so as to minimize the MS error E{[θ - φ(x̄)]²}. We have shown in (6-54) that the optimum φ(x̄) is the conditional mean

θ̂ = E{θ | x̄} = ∫ θ f_θ(θ | x̄) dθ    (9-29)

The function f_θ(θ | x̄) is the conditional density of θ assuming x̄ = x̄, called the posterior density. From Bayes' formula (6-32) it follows (Fig. 9.9) that

f_θ(θ | x̄) = γ f_x̄(x̄ | θ) f_θ(θ)        γ = 1/f_x̄(x̄)    (9-30)

The unconditional density f_x̄(x̄) equals the integral of the product f_x̄(x̄ | θ) f_θ(θ). Thus to find the LMS Bayesian point estimate of θ, it suffices to determine the posterior density of θ from (9-30) and to evaluate the integral in (9-29). The function f_x̄(x̄ | θ) can be expressed in terms of f_x(x | θ). For large n, it is approximately normal.

The γ confidence interval of θ is an interval (θ̂ - a_1, θ̂ + a_2) of minimum length a_1 + a_2 such that the area of f_θ(θ | x̄) in this interval equals γ:

P{θ̂ - a_1 < θ < θ̂ + a_2 | x̄} = γ    (9-31)

These results hold also for discrete-type RVs provided that the relevant densities are replaced by point densities and the corresponding integrals by sums; Example 9.12 is an illustration.
Figure 9.9  (prior × likelihood = posterior)
Example 9.11
The diameter of rods coming out of a production line is a normal RV θ with density

f_θ(θ) = (1/(σ_0 √(2π))) e^(-(θ - θ_0)²/(2σ_0²))    (9-32)

We receive one rod and wish to estimate its diameter θ. In the absence of any measurements, the best estimate of θ is the mean θ_0 of θ. To improve the estimate, we measure the rod n times and obtain the samples x_i = θ + ν_i of the RV x = θ + ν. We assume that the measurement error ν is an N(0, σ) RV. From this it follows that the conditional density of x assuming θ = θ is N(θ, σ) and the density of its sample mean x̄ is N(θ, σ/√n). Thus

f_x̄(x̄ | θ) = (1/(σ√(2π/n))) e^(-n(x̄ - θ)²/(2σ²))    (9-33)

Inserting (9-32) and (9-33) into (9-30), we conclude, omitting the fussy details, that the posterior density f_θ(θ | x̄) of θ is a normal curve with mean θ̂ and standard deviation σ̂ (see Fig. 9.9), where

σ̂² = σ_0² (σ²/n)/(σ_0² + σ²/n)        θ̂ = (σ̂²/σ_0²) θ_0 + (n σ̂²/σ²) x̄    (9-34)

This shows that the Bayesian estimate θ̂ of θ is a weighted average of the prior estimate θ_0 and the classical estimate x̄. Furthermore, as n increases, θ̂ tends to x̄. Thus, as the number of measurements increases, the effect of the prior becomes negligible.

From (8-9) and the normality of the posterior density f_θ(θ | x̄) it follows that

P{θ̂ - σ̂ z_u < θ < θ̂ + σ̂ z_u | x̄} = γ = 2u - 1    (9-35)

This shows that the Bayesian γ confidence interval of θ is the interval θ̂ ± z_u σ̂. The constants θ̂ and σ̂ are obtained from (9-34). This interval tends to the classical interval x̄ ± z_u σ/√n as n → ∞. •
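A sketch of (9-34)-(9-35) in Python with NumPy (an assumed choice of language; the prior, error level, and data below are illustrative numbers, not taken from the text):

```python
import numpy as np

def bayes_normal(x, theta0, sigma0, sigma):
    """Posterior mean and standard deviation (9-34) for the model of Example 9.11."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    var_hat = (sigma0**2 * sigma**2 / n) / (sigma0**2 + sigma**2 / n)
    theta_hat = (var_hat / sigma0**2) * theta0 + (n * var_hat / sigma**2) * xbar
    return theta_hat, np.sqrt(var_hat)

# Illustrative numbers: prior N(10, 0.2), measurement error sigma = 0.1, four readings.
theta_hat, sd = bayes_normal([10.12, 10.18, 10.15, 10.09],
                             theta0=10.0, sigma0=0.2, sigma=0.1)
print(theta_hat, theta_hat - 2 * sd, theta_hat + 2 * sd)   # ~.95 Bayesian interval (9-35)
```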
Bayesian Estimation of Probabilities In Bayesian estimation, the unknown probability of an event A is the value p of an RV p with known density f_p(p). The available information is the observed value k of the number k of successes of the event A in n trials, and the problem is to find the estimate of p in terms of k. This problem was discussed in Section 6-1. We reexamine it here in the context of RVs.

In the absence of any observation, the LMS estimate p̂ of p is its mean

p̂ = E{p} = ∫_0^1 p f_p(p) dp    (9-36)

in agreement with (6-15). The updated estimate p̂ of p is the conditional mean

p̂ = E{p | k} = ∫_0^1 p f_p(p | k) dp    (9-37)

as in (6-21). To find p̂, it thus suffices to find the posterior density f_p(p | k). With A = {k = k} it follows from (6-14) that

f_p(p | k) = γ P{k = k | p} f_p(p)        γ = 1/P{k = k}    (9-38)

where

P{k = k | p} = C(n, k) p^k q^(n-k)

and

P{k = k} = ∫_0^1 P{k = k | p} f_p(p) dp = ∫_0^1 C(n, k) p^k q^(n-k) f_p(p) dp

Thus to find the Bayesian estimate p̂ of p, we determine f_p(p | k) from (9-38) and insert it into (9-37). The result agrees with (6-21) because the term C(n, k) cancels.
Example 9.12
We have two coins. The first is fair, and the second is loaded with P{h} = .35. We pick one of the coins at random and toss it 10 times. Heads shows 6 times. Find the estimate p̂ of the probability p of heads.
In this problem, p takes the values p_1 = .5 and p_2 = .35, each with probability 1/2; hence, the prior density f_p(p) consists of two points as in Fig. 9.10a, and the prior estimate of p equals p_1/2 + p_2/2 = .425. To find the posterior estimate p̂, we observe that

P{k = 6} = C(10, 6) × (1/2) × (.5^6 × .5^4 + .35^6 × .65^4)

Inserting into (9-38) and canceling the factor C(10, 6) × (1/2), we obtain

f_p(p_1 | k) = .5^10/(.5^10 + .35^6 × .65^4) ≈ .75        f_p(p_2 | k) ≈ .25

Thus the posterior density of p consists of two points as in Fig. 9.10b, and

p̂ = .75 × .5 + .25 × .35 = .4625    •

Figure 9.10
Difference of Two Means
We consider, finally, the problem of estimating the difference η_x - η_y of the means η_x and η_y of two RVs x and y defined on an experiment S. It appears that this problem is equivalent to the problem of estimating the mean of the RV

w = x - y

considered earlier. The problem can be so interpreted only if the experiment S is repeated n times and at each trial both RVs are observed, yielding the pairs (Fig. 9.11a)

(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)    (9-39)

The corresponding samples of w are the n numbers w_i = x_i - y_i. The point estimate of the mean η_w = η_x - η_y of w is the difference w̄ = x̄ - ȳ, and the corresponding γ confidence interval equals [see (9-5)]

x̄ - ȳ - z_u σ_w/√n < η_x - η_y < x̄ - ȳ + z_u σ_w/√n    (9-40)

where σ_w² = σ_x² + σ_y² - 2μ_xy is the variance of w.

This approach can be used only if the available observations are paired samples as in (9-39). In a number of applications, the two RVs must be sampled sequentially. At a particular trial, we can observe either the RV x or the RV y, not both. Observing x n times and y m times, we obtain the values (Fig. 9.11b)

x_1, . . . , x_n, y_{n+1}, . . . , y_{n+m}    (9-41)

of the n samples x_i of x and of the m samples y_i of y. In this interpretation, the n + m RVs x_i and y_k are independent. Let us look at two illustrations of the need for sequential sampling.
Figure 9.11  (paired samples; sequential samples)
1. Suppose that x is an RV with mean η_x. We introduce various changes in the underlying physical experiment and wish to establish whether these changes have any effect on η_x. To do so, we repeat the original experiment n times and the modified experiment m times. We thus obtain the n + m samples in (9-41). The first n numbers are the samples x_i of the RV x representing the original experiment, and the next m numbers y_i are the m samples of the RV y representing the modified experiment. This problem is basic in hypothesis testing.
2. A system consists of two components. The time to failure of the first component is the RV x and of the second, the RV y. To estimate the difference η_x - η_y, we test a number of systems. If at each test we can determine the times to failure x_i and y_i of both components, we can obtain the paired samples (x_i, y_i). Suppose, however, that the components are connected in series. In this case, we cannot observe both failure times at a single trial. To estimate the difference η_x - η_y, we must use the independent samples in (9-41).
We introduce the RV

w̄ = x̄ - ȳ        where x̄ = (1/n) ∑_{i=1}^{n} x_i        ȳ = (1/m) ∑_{i=1}^{m} y_i    (9-42)

The RV w̄ is defined by (9-42); it is not a sample mean. Clearly,

η_w̄ = η_x̄ - η_ȳ = η_x - η_y

Hence, w̄ is an unbiased point estimator of η_x - η_y. To find an interval estimate, we must determine the variance of w̄. From the independence of the samples x_i and y_i it follows that

σ_x̄² = σ_x²/n        σ_ȳ² = σ_y²/m        σ_w̄² = σ_x²/n + σ_y²/m    (9-43)
Under the normality assumption, this leads to the γ confidence interval

x̄ - ȳ - σ_w̄ z_u < η_x - η_y < x̄ - ȳ + σ_w̄ z_u    (9-44)

This estimate is used if σ_x and σ_y are known. If they are unknown and n is large, we use the approximations

σ_x² ≈ s_x² = (1/(n - 1)) ∑_{i=1}^{n} (x_i - x̄)²        σ_y² ≈ s_y² = (1/(m - 1)) ∑_{i=1}^{m} (y_i - ȳ)²

in the evaluation of σ_w̄.
Example 9.13
Suppose that x models the math grades of boys and y the math grades of girls in a senior class. We examine the grades of 50 boys and 100 girls and obtain x̄ = 80, ȳ = 82, s_x² = 32, s_y² = 36. Find the .99 confidence interval of the difference η_x - η_y of the grade means η_x and η_y.
In this problem, n = 50, m = 100, z_u = z_.995 ≈ 2.58. Inserting the approximation

σ_w̄² ≈ s_x²/50 + s_y²/100 = 1

into (9-44), we obtain the interval -2 ± 2.58. •
The estimation of the difference η_x - η_y when σ_x and σ_y are unknown and the numbers n and m are small is more difficult. If σ_x ≠ σ_y, the problem does not have a simple solution because we cannot find an estimator the density of which does not depend on the unknown parameters. We shall solve the problem only for

σ_x = σ_y = σ

The unknown σ² can be estimated either by s_x² or by s_y². The estimation error is reduced, however, if we use some function of s_x and s_y. We shall select as the estimate σ̂² of σ² the sum a s_x² + b s_y², where the constants a and b are so chosen as to minimize the MS error. Proceeding as in (7-38), we obtain (see Problem 7-29)

σ̂² = [(n - 1)s_x² + (m - 1)s_y²]/(n + m - 2)    (9-45)

The variance of w̄ equals σ_w̄² = σ²(1/n + 1/m). To find its estimate, we replace σ² by σ̂². This yields the estimate

σ̂_w̄² = [(n - 1)s_x² + (m - 1)s_y²]/(n + m - 2) × (1/n + 1/m)    (9-46)

We next form the RV

[w̄ - (η_x - η_y)]/σ̂_w̄    (9-47)

As we show in Problem 7-29, this RV has a t distribution with n + m - 2 degrees of freedom. Denoting by t_u its u-percentile, we conclude that

P{-t_u < [w̄ - (η_x - η_y)]/σ̂_w̄ < t_u} = γ = 2u - 1

Hence,

x̄ - ȳ - t_u σ̂_w̄ < η_x - η_y < x̄ - ȳ + t_u σ̂_w̄    (9-48)

Thus to estimate the difference η_x - η_y of two RVs x and y sequentially sampled, we proceed as follows:
1. Observe the n + m samples x_i and y_i and compute their sample means x̄, ȳ and variances s_x², s_y².
2. Determine σ̂ from (9-45) and σ̂_w̄ from (9-46).
3. Select the coefficient γ = 2u - 1 and find the percentile t_u(n + m - 2) from Table 3.
4. Form the interval x̄ - ȳ ± t_u σ̂_w̄.
Example 9.14
Two machines produce cylindrical rods with diameters modeled by two normal RVs x and y. We wish to estimate the difference of their means. To do so, we measure 14 rods from the first machine and 18 rods from the second. The resulting sample means and standard deviations, in millimeters, are as follows:

x̄ = 85        ȳ = 85.05        s_x = 0.5        s_y = 0.3

Inserting into (9-46), we obtain

σ̂_w̄² = [13 s_x² + 17 s_y²]/30 × (1/14 + 1/18) = 0.02

Thus σ̂_w̄ = 0.14, n + m - 2 = 30, t_.9(30) = 1.31, and (9-48) yields the estimate

-0.24 < η_x - η_y < 0.14    •
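The four steps above translate directly into code. A sketch in Python with NumPy and SciPy (assumed tools; the function name is mine), applied to the rod data of Example 9.14; the example's percentile t_.9(30) = 1.31 corresponds to γ = 2(.9) - 1 = .8:

```python
import numpy as np
from scipy.stats import t

def pooled_t_interval(n, m, xbar, ybar, sx, sy, gamma):
    """Interval (9-48) for eta_x - eta_y with equal unknown variances."""
    var_hat = ((n - 1) * sx**2 + (m - 1) * sy**2) / (n + m - 2)    # (9-45)
    sd_w = np.sqrt(var_hat * (1 / n + 1 / m))                      # (9-46)
    t_u = t.ppf((1 + gamma) / 2, df=n + m - 2)
    d = xbar - ybar
    return d - t_u * sd_w, d + t_u * sd_w

print(pooled_t_interval(14, 18, 85.0, 85.05, 0.5, 0.3, gamma=0.8))  # about (-0.24, 0.14)
```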
9-3
Variance and Correlation
We shall estimate the variance v = σ² of a normal RV x. We assume first that its mean η is known, and we use as point estimate of v the sum

v̂ = (1/n) ∑_{i=1}^{n} (x_i - η)²    (9-49)

where x_i are the observed samples of x. As we know, the corresponding estimator v̂ is unbiased, and its variance tends to 0 as n → ∞; hence, it is consistent.

To find an interval estimate of v, we observe that the RV nv̂/σ² has a χ² distribution with n degrees of freedom [see (7-92)]. Denoting by χ_u²(n) its u-percentiles, we conclude (Fig. 9.12) that

P{χ_{δ/2}²(n) < nv̂/σ² < χ_{1-δ/2}²(n)} = γ = 1 - δ    (9-50)

This yields the γ confidence interval

nv̂/χ_{1-δ/2}²(n) < σ² < nv̂/χ_{δ/2}²(n)    (9-51)

The percentiles of the χ² distribution are listed in Table 2.

Figure 9.12
We note that the interval so obtained does not have minimum length because the χ² density is not symmetrical; as noted in (8-12), it is used for convenience.

If η is unknown, we use as point estimate of v the sample variance

s² = (1/(n - 1)) ∑_{i=1}^{n} (x_i - x̄)²

Again, this is a consistent estimate of v. To find an interval estimate, we observe that the RV (n - 1)s²/σ² has a χ²(n - 1) distribution; hence,

P{χ_{δ/2}²(n - 1) < (n - 1)s²/σ² < χ_{1-δ/2}²(n - 1)} = γ    (9-52)

This yields the interval estimate

(n - 1)s²/χ_{1-δ/2}²(n - 1) < σ² < (n - 1)s²/χ_{δ/2}²(n - 1)    (9-53)

Example 9.15
The errors of a given scale are modeled by a normal RV ν with zero mean and variance σ². We wish to find the .95 confidence interval of σ². To do so, we weigh a standard object 11 times and obtain the following readings in grams:

98.92    98.68    98.85    97.07    98.98    99.36    98.70    99.33    99.31    98.84    99.20

These numbers are the samples of the RV x = w + ν where w is the true weight of the given object.
(a) Suppose, first, that w = 99 g. In this case, the mean w of x is known; hence, we can use (9-51) with

v̂ = (1/11) ∑_{i=1}^{11} (x_i - 99)² = 0.626

Inserting the percentiles χ_.025²(11) = 3.82 and χ_.975²(11) = 21.92 of χ²(11) into (9-51), we obtain the interval

11(0.626)/21.92 = 0.31 < σ² < 11(0.626)/3.82 = 1.8

Note that the corresponding interval for the standard deviation σ [see (9-3)] is

√0.31 = .56 < σ < √1.8 = 1.34

(b) We now assume that w is unknown. From the given data, we obtain

x̄ = (1/11) ∑_{i=1}^{11} x_i = 99.02        s² = (1/10) ∑_{i=1}^{11} (x_i - 99.02)² = 0.622
In this case, χ_.025²(10) = 3.25 and χ_.975²(10) = 20.48. Inserting into (9-53), we obtain the estimate

0.303 < σ² < 1.91    •
Covariance and Correlation
A basic problem in many scientific investigations is the determination of the
existence of a causal relationship between observable variables: smoking
and cancer, poverty and crime, blood pressure and salt. The methods used
to establish whether two quantities X and Y are causally related are often
statistical. We model X and Y by two RVs x and y and draw inferences about the causal dependence of X and Y from the statistical dependence of x and y. Such methods are useful; however, they might lead to wrong conclusions if they are interpreted literally: Roosters crow every morning and then the sun rises; but the sun does not rise because roosters crow every morning.

To establish the independence of two RVs, we must show that their joint distribution equals the product of their marginal distributions. This involves the estimate of a function of two variables. A simpler problem is the estimation of the covariance μ_xy of the RVs x and y. If μ_xy = 0, x and y are uncorrelated but not necessarily independent. However, if μ_xy ≠ 0, x and y are not independent. The size of the covariance of two RVs is a measure of their linear dependence, but this measure is scale-dependent (is 2 a high degree of correlation or 200?). A normalized measure is the correlation coefficient r = μ_xy/(σ_x σ_y) of x and y. As we know, |r| ≤ 1, and if |r| = 1, the RVs x and y are linearly dependent, that is, y = ax + b; if r = 0, they are uncorrelated.
Since μ_xy = E{(x - η_x)(y - η_y)}, the empirical estimate of μ_xy is the sum (1/n) ∑(x_i - η_x)(y_i - η_y). This estimate is used if the means η_x and η_y are known. If they are unknown, we use as estimate of μ_xy the sample covariance

μ̂_xy = (1/(n - 1)) ∑_{i=1}^{n} (x_i - x̄)(y_i - ȳ)    (9-54)

The resulting estimator μ̂_xy is unbiased (see Problem 9-26) and consistent.

We should stress that since r is a parameter of the joint density of the RVs x and y, the observations x_i and y_i used for its estimate must be paired samples, as in (9-39). They cannot be obtained sequentially, as in (9-41).

We estimate next the correlation coefficient r, using as its point estimate the ratio

r̂ = μ̂_xy/(s_x s_y) = ∑(x_i - x̄)(y_i - ȳ)/√(∑(x_i - x̄)² ∑(y_i - ȳ)²)    (9-55)

This ratio is called the sample correlation coefficient of the RVs x and y. To find an interval estimate of r, we must determine the density f_r̂(r) of the RV r̂. The function f_r̂(r) vanishes outside the interval (-1, 1) (Fig. 9.13a) because |r̂| ≤ 1 (Problem 9-27). However, its exact form cannot be found easily. We give next a large-sample approximation.
Figure 9.13
Fisher's AuxiUary Variable
z=
We introduce the transformations
1
1+f
_ t
2 In I
t
e2z-1
I
= e2z +
(9-56)
It can be shown that for large n, the distribution of the RV z so constructed is
approximately normal (Fig. 9.13b) with mean and variance
1+r
1
I
Tit"" 2ln 1 _ r
(9-57)
(7'2"" - -
n- 3
l
The proof of this theorem will not be given.
From (9-57) and (8-7) it follows that
P {Tit -
~
< z < Tit + ~} = 'Y = 2u n-3
n-3
1
In this expression, Zu is not the u-percentile of the RV z in (9-56); it is the
normal percentile.
To continue the analysis, we shall replace in (9-57) the unknown r by its
empirical estimate (9-55). This yields
A
Tit -
1
1 +;
2 In 1 _
(9-58)
1
From (9-58) and the monotonicity of the transformations (9-56) it follows [see also (9-49)] that r₁ < r < r₂ with confidence coefficient γ, where

$$r_1 = \frac{e^{2z_1}-1}{e^{2z_1}+1} \qquad r_2 = \frac{e^{2z_2}-1}{e^{2z_2}+1} \qquad z_{1,2} = \hat{\eta}_z \mp \frac{z_u}{\sqrt{n-3}} \tag{9-59}$$
Thus to find the γ confidence interval (r₁, r₂) of the correlation coefficient r, we compute r̂ from (9-55), η̂_z from (9-58), and r₁ and r₂ from (9-59).
Example 9.16
We wish to estimate the correlation coefficient r of SAT scores, modeled by the RV x, and freshman rankings, modeled by the RV y. For this purpose, we examine the relevant records of 52 students and obtain the 52 paired samples (xᵢ, yᵢ). We then compute the fraction in (9-55) and find r̂ = .6. This is the empirical point estimate of r. We shall find its .95 confidence interval. Inserting the number r̂ = .6 into (9-58), we obtain η̂_z ≈ .693, and with z_u = 2, (9-59) yields

z₁ = .41    z₂ = .98    r₁ = .39    r₂ = .75

Hence, .39 < r < .75 with confidence coefficient .95. •
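As a check on Example 9.16, here is a minimal computational sketch of the procedure (9-55) to (9-59); the function name and the choice z_u = 2 for γ = .95 are ours, not part of the text.

```python
import math

def fisher_z_interval(r_hat, n, z_u=2.0):
    """Return (r1, r2), the confidence interval (9-59) for the correlation r."""
    eta_z = 0.5 * math.log((1 + r_hat) / (1 - r_hat))          # (9-58)
    half = z_u / math.sqrt(n - 3)
    z1, z2 = eta_z - half, eta_z + half
    to_r = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)  # inverse of (9-56)
    return to_r(z1), to_r(z2)

print(fisher_z_interval(0.6, 52))   # about (0.39, 0.75), as in Example 9.16
```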
9-4
Percentiles and Distributions
Consider an RV x with distribution F(x). The u-percentile of x is the inverse of F(x); that is, it is a number x_u such that F(x_u) = u (Fig. 9.14a). In this section, we estimate the functions x_u and F(x) in terms of the n observations xᵢ of the RV x. In both cases, we assume that nothing is known about F(x). In this sense, the estimates are distribution-free.
Percentiles
We write the observations xᵢ in ascending order and denote by yᵢ the ith number so obtained. The resulting RVs yᵢ are the order statistics

y₁ ≤ y₂ ≤ ··· ≤ yₙ

introduced in (7-44). In particular, y₁ equals the minimum and yₙ the maximum of the xᵢ. The determination of the interval estimate of x_u is based on the
following theorem.
• Theorem. For any k and r,

$$P\{y_k < x_u < y_{k+r}\} = \sum_{m=k}^{k+r-1}\binom{n}{m}u^m(1-u)^{n-m} \tag{9-60}$$
[Figure 9.14: (a) the u-percentile x_u of the distribution F(x); (b) the ordered observations bracketing the median in Example 9.18.]
• Proof. From the definition of the order statistics it follows that y_k < x_u iff at least k of the samples xᵢ are less than x_u; similarly, y_{k+r} > x_u iff at most k + r − 1 of the samples xᵢ are less than x_u. Therefore, y_k < x_u < y_{k+r} iff at least k and at most k + r − 1 of the samples xᵢ are less than x_u. In other words, in the sample space, the event {y_k < x_u < y_{k+r}} occurs iff the number of successes of the event {x ≤ x_u} is at least k and at most k + r − 1. This yields (9-60) because P{x ≤ x_u} = u.
This theorem leads to the following γ confidence interval of x_u: We select k and r such as to minimize the length y_{k+r} − y_k of the interval (y_k, y_{k+r}) subject to the condition that the sum in (9-60) equals γ. The solution is obtained by trial and error involving the determination of γ for various values of k and r.
Example 9.17
We have a sample of size 4 and use as estimate of x_u the interval (y₁, y₄). In this case, γ equals the sum in (9-60) for m from 1 to 3. Hence,

γ = P{y₁ < x_u < y₄} = 4u(1 − u)³ + 6u²(1 − u)² + 4u³(1 − u)

For u = .5, .4, .3, we obtain γ = .875, .845, .752. •
If n is large, we can use the approximation (3-35) in the evaluation of γ. This yields

$$\gamma \simeq G\left(\frac{k + r - 0.5 - nu}{\sqrt{nu(1-u)}}\right) - G\left(\frac{k + 0.5 - nu}{\sqrt{nu(1-u)}}\right) \tag{9-61}$$
For a given γ, the length of the interval (y_k, y_{k+r}) is minimum if its center is y_{n_u}, where n_u is the integer closest to nu. Setting k + 0.5 = n_u − m and k + r − 0.5 = n_u + m in (9-61), we obtain

$$\gamma = P\{y_{n_u-m} < x_u < y_{n_u+m}\} \simeq 2G\left(\frac{m}{\sqrt{nu(1-u)}}\right) - 1 \tag{9-62}$$

This yields the γ = 1 − δ confidence interval

$$y_{n_u-m} < x_u < y_{n_u+m} \qquad m = z_{1-\delta/2}\sqrt{nu(1-u)} \tag{9-63}$$

Example 9.18
We have 64 samples of x, and we wish to find the .95 confidence interval of its median x₍.₅₎. In this problem,

n = 64    u = .5    n_u = nu = 32    z_{1−δ/2} ≈ 2    m = 8

and (9-63) yields the estimate y₂₄ < x₍.₅₎ < y₄₀. We can thus claim with probability .95 that the unknown median is between the 24th and the 40th ordered observation of x (Fig. 9.14b). •
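A short sketch of the large-sample rule (9-63), applied to the data of Example 9.18; the function and variable names are ours.

```python
import math

def percentile_interval(n, u, z=2.0):
    """Indices (k_lo, k_hi) of order statistics with y_klo < x_u < y_khi,
    confidence about 2G(z) - 1, using the normal approximation (9-63)."""
    n_u = round(n * u)                            # integer closest to nu
    m = math.ceil(z * math.sqrt(n * u * (1 - u)))
    return n_u - m, n_u + m

print(percentile_interval(64, 0.5))   # (24, 40): the 24th and 40th ordered samples
```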
Distribution
The point estimate of F(x) is its empirical estimate

$$\hat{F}(x) = \frac{k_x}{n} \tag{9-64}$$
where k_x is the number of samples xᵢ that do not exceed x [see (4-25)]. The function F̂(x) has a staircase form with discontinuities at the points xᵢ. If x is of continuous type, almost certainly the samples xᵢ are distinct and the jump of F̂(x) at xᵢ equals 1/n. It is convenient, however, if nᵢ samples are close, to bunch them together into a single sample xᵢ of multiplicity nᵢ. The corresponding jump at xᵢ is then nᵢ/n.
We shall find two interval estimates of F(x). The first will hold for a specific x, the second for every x.

Variable-length estimate  For a specific x, the function F(x) is the probability p = P{x ≤ x} of the event 𝒜 = {x ≤ x}. We can therefore apply the earlier results involving estimates of probabilities. For large n, we shall use the approximation (9-20) with p replaced by F(x) and p̂ by F̂(x). This yields the γ confidence interval

$$\hat{F}(x) \pm a \qquad a = \frac{z_u}{\sqrt{n}}\sqrt{\hat{F}(x)\bigl(1 - \hat{F}(x)\bigr)} \tag{9-65}$$

for the unknown F(x). In this estimate, the length 2a of the confidence interval depends on x.
Kolmogorov estimate  The empirical estimate of F(x) is a function F̂(x) of x depending on the samples xᵢ of the RV x. It specifies therefore a random family of functions F̂(x), one for each set of samples xᵢ. We wish to find a number c, independent of x, such that

P{|F̂(x) − F(x)| ≤ c} ≥ γ

for every x. The constant γ is the confidence coefficient of the desired estimate F̂(x) ± c of F(x).

The difference |F̂(x) − F(x)| is a function of x, and its maximum (or least upper bound) is a number w (Fig. 9.15) depending on the samples xᵢ. It specifies therefore the RV

$$w = \max_{-\infty < x < \infty}|\hat{F}(x) - F(x)| \tag{9-66}$$
[Figure 9.15: the distribution F(x), the empirical estimate F̂(x), and the maximum distance w between them.]
From the definition of w it follows that w < c iff |F̂(x) − F(x)| < c for every x; hence,

$$P\{\max_x|\hat{F}(x) - F(x)| \le c\} = P\{w \le c\} = F_w(c) = \gamma \tag{9-67}$$

where F_w(w) is the distribution of the RV w. To find c, it suffices therefore to find F_w(w).
The function F_w(w) does not depend on the distribution F(x) of x. This unexpected property of the maximum distance w between the curves F̂(x) and F(x) can be explained as follows: The difference F̂(x) − F(x) does not change if the x axis is subjected to a nonlinear transformation or, equivalently, if the RV x is replaced by any other RV y = g(x). To determine F_w(w), we can assume therefore without loss of generality that F(x) = x for 0 ≤ x ≤ 1 [see (4-68)]. Even with this simplification, however, the exact form of F_w(w) is complicated. For most purposes, it is sufficient to use the following approximation due to Kolmogorov:

$$F_w(w) \simeq 1 - 2e^{-2nw^2} \qquad \text{for } w > 1/\sqrt{n} \tag{9-68}$$
This yields

γ = 1 − δ = F_w(c) ≈ 1 − 2e^{−2nc²}

Hence, the γ confidence interval of F(x) is

$$\hat{F}(x) \pm c \tag{9-69}$$

Thus we can state with confidence coefficient γ that the unknown distribution is a function F(x) located in the zone bounded by the curves F̂(x) + c and F̂(x) − c.
Example 9.19
The IQ scores of 40 students, rounded off to multiples of 5, are as follows:

xᵢ    75   80   85   90   95   100  105  110  115  120  125
nᵢ     1    2    3    5    6    8    6    4    2    2    1

Find the .95 confidence interval of the distribution F(x) of the scores.
In this problem, we have 11 samples xᵢ with multiplicity nᵢ, and

F̂(x) = .025  .075  .150  .275  .425  .625  .775  .875  .925  .975  1.000

for xᵢ ≤ x < xᵢ₊₁, i = 1, . . . , 11, where x₁₂ = ∞. With δ = .05, equation (9-69) yields
the interval

$$\hat{F}(x) \pm c \qquad c = \sqrt{\frac{1}{80}\ln\frac{2}{.05}} = .217$$

For example, for 90 ≤ x < 95, the interval .425 ± .217 results. •
Kolmogorov's test leads to reasonable estimates only if n is large. As the last example suggests, n should be at least of the order of 100 (see also Problem 9-31). If this is true, the approximation error in (9-68) is negligible.
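The following sketch assembles the Kolmogorov band of (9-68) and (9-69); the helper names are ours, and the formula for c simply inverts 2e^{−2nc²} = δ.

```python
import math

def kolmogorov_band(samples, delta=0.05):
    """Return c and the empirical distribution F_hat, so that F_hat(x) +/- c
    covers F(x) with confidence about 1 - delta (reasonable for large n)."""
    n = len(samples)
    c = math.sqrt(math.log(2 / delta) / (2 * n))     # from 2*exp(-2*n*c**2) = delta
    xs = sorted(samples)
    F_hat = lambda x: sum(v <= x for v in xs) / n    # empirical estimate (9-64)
    return c, F_hat

# With n = 40 and delta = .05 this gives c of about 0.21, close to the
# value .217 quoted in Example 9.19.
```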
9-5
Moments and Maximum Likelihood
The general parameter estimation problem can be phrased as follows: The density of an RV x is a function f(x, θ₁, . . . , θ_r) depending on r ≥ 1 parameters θᵢ taking values in a region Θ called the parameter space. Find a point in that space that is close in some sense to the unknown parameter vector (θ₁, . . . , θ_r). In the earlier sections of this chapter, we developed special techniques involving the commonly used parameters. The results were based on the following form of (4-89): If a parameter θ equals the mean of some function q(x) of x, then we use as its estimate θ̂ the empirical estimate of the mean of q(x):
$$\theta = E\{q(\mathbf{x})\} \qquad \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} q(x_i) \tag{9-70}$$
In this section, we develop two general methods for estimating arbitrary
parameters. The first is based on (9-70).
Method of Moments
The moment m_k of an RV x is the mean of x^k. It can therefore be estimated from (9-70) with q(x) = x^k. Thus

$$\hat{m}_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k \tag{9-71}$$

The parameters θᵢ of the distribution of x are functions of the moments m_k because

$$m_k = \int_{-\infty}^{\infty} x^k f(x, \theta_1, \ldots, \theta_r)\,dx \tag{9-72}$$

To find these functions, we assign to the index k the values 1 to r or some other set of r integers. This yields a system of r equations relating the unknown parameters θᵢ to the moments m_k. The solution of this system expresses θᵢ as functions θᵢ = yᵢ(m₁, . . . , m_r) of m_k. Replacing m_k by their estimates m̂_k, we obtain

$$\hat{\theta}_i = y_i(\hat{m}_1, \ldots, \hat{m}_r) \tag{9-73}$$

These are the estimates of θᵢ obtained with the method of moments.
Note that if θ̂ is the estimate of a parameter θ so obtained, the estimate of a function τ(θ) of θ is τ(θ̂).
Example 9.20
Suppose that x has the Rayleigh density

$$f(x, \theta) = \frac{x}{\theta^2}e^{-x^2/2\theta^2}\,U(x)$$

To estimate θ, we set k = 1 in (9-72). This yields

$$m_1 = \frac{1}{\theta^2}\int_0^{\infty} x^2 e^{-x^2/2\theta^2}\,dx = \theta\sqrt{\frac{\pi}{2}}$$

Hence, θ̂ = m̂₁√(2/π) = x̄√(2/π). •
Example 9.21
The RV x is normal with mean 2. We shall estimate its standard deviation σ. Since m₂ = η² + σ² = 4 + σ², we conclude that σ̂ = √(m̂₂ − 4). •
Example 9.22
We wish to estimate the mean η and the variance σ² of a normal RV. In this problem, m₁ = η and σ² = m₂ − m₁²; hence,

η̂ = m̂₁ = x̄    σ̂² = m̂₂ − (x̄)² •
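A brief sketch of the moment estimates derived in Examples 9.20 through 9.22; the helper names are ours.

```python
import math

def moment(xs, k):
    """Empirical k-th moment (9-71)."""
    return sum(x ** k for x in xs) / len(xs)

def rayleigh_theta(xs):                  # Example 9.20: m1 = theta*sqrt(pi/2)
    return moment(xs, 1) * math.sqrt(2 / math.pi)

def normal_sigma_known_mean(xs, eta=2):  # Example 9.21: m2 = eta**2 + sigma**2
    return math.sqrt(moment(xs, 2) - eta ** 2)

def normal_mean_var(xs):                 # Example 9.22
    m1, m2 = moment(xs, 1), moment(xs, 2)
    return m1, m2 - m1 ** 2
```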
Method of Maximum Likelihood
We shall introduce the method of maximum likelihood (ML) starting with the following prediction problem: Suppose that the RV x has a known density f(x, θ). We wish to predict its value x at the next trial; that is, we wish to find a number x̂ close in some sense to x. The probability that x is in the interval (x, x + dx) equals f(x, θ)dx. If we decide to select x̂ so as to maximize this probability, we must set x̂ = x_max, where x_max is the mode of x. In this problem, θ is specified, and x_max is the value of x for which the density f(x, θ), plotted as a function of x, is maximum (Fig. 9.16a).

In the estimation problem, θ is unknown. We observe the value x of x, and we wish to find a number θ̂ close in some sense to the unknown θ. To do so, we plot the density f(x, θ) as a function of θ, where x is the observed value of x. The curve so obtained will be called the likelihood function of θ. The value θ = θ_max of θ for which this curve is maximum is the ML estimate of θ. Again, the probability that x is in the interval (x, x + dx) equals f(x, θ)dx. For a given x, this is maximum if θ = θ_max.

[Figure 9.16: (a) the density f(x, θ) as a function of x; (b) the likelihood function f(x, θ) as a function of θ.]

We repeat: In the prediction problem, θ is known and x_max is the value of x for which the density f(x, θ) is maximum. In the estimation problem, x is
known and θ_max is the value of θ for which the likelihood function f(x, θ) is maximum.
Note  Suppose that θ = θ(τ) and f(x, θ(τ)) is maximum for τ = τ̂. It then follows that f(x, θ) is maximum for θ = θ(τ̂). From this we conclude that if τ̂ is the ML estimate of a parameter τ, the ML estimate of a function θ = θ(τ) of τ equals θ(τ̂).
Example 9.23
The time to failure of type A bulbs is an RV x with density

$$f(x, \theta) = \theta^2 x e^{-\theta x}\,U(x)$$

In this example, the density of x (Fig. 9.16a) is maximum for x = 1/θ, and its likelihood function (Fig. 9.16b) is maximum for θ = 2/x. This leads to the following estimates:
Prediction: The ML estimate x̂ of the life length x of a particular bulb equals 1/θ.
Estimation: The ML estimate θ̂ of the parameter θ in terms of the observed life length x of a particular bulb equals 2/x. •
We shall now estimate θ in terms of the n observations xᵢ. The joint density of the corresponding samples xᵢ is the product

$$f(X, \theta) = f(x_1, \theta)\cdots f(x_n, \theta) \qquad X = [x_1, \ldots, x_n]$$

For a given θ, f(X, θ) is a function of the n variables xᵢ. For a given X, f(X, θ) is a function of the single variable θ. If X is the vector of the observations xᵢ, then f(X, θ) is called the likelihood function of θ. Its logarithm

$$L(X, \theta) = \ln f(X, \theta) = \sum_i \ln f(x_i, \theta) \tag{9-74}$$

is the log-likelihood function of θ. The ML estimate θ̂ of θ is the value of θ for which the likelihood function is maximum. If the maximum of f(X, θ) is in the interior of the domain Θ of θ, then θ̂ is a root of the equation

$$\frac{\partial f(X, \theta)}{\partial\theta} = 0 \tag{9-75}$$
In most cases, it is also a root of the equation

$$\frac{\partial L(X, \theta)}{\partial\theta} = \sum_i \frac{1}{f(x_i, \theta)}\,\frac{\partial f(x_i, \theta)}{\partial\theta} = 0 \tag{9-76}$$

Thus to find θ̂, we solve either (9-75) or (9-76). The resulting solution depends on the observations xᵢ.
Example 9.24
The RV x has an exponential density θe^{−θx}U(x). We shall find the ML estimate θ̂ of θ.
In this example,

$$f(X, \theta) = \theta^n e^{-\theta(x_1+\cdots+x_n)} = \theta^n e^{-\theta n\bar{x}} \qquad L(X, \theta) = n\ln\theta - \theta n\bar{x}$$

Hence,

$$\frac{\partial L}{\partial\theta} = \frac{n}{\theta} - n\bar{x} = 0 \qquad \hat{\theta} = \frac{1}{\bar{x}}$$

Thus the ML estimator of θ equals 1/x̄. This estimator is biased because E{x̄} = η_x = 1/θ and E{1/x̄} ≠ 1/η_x. •
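A numerical check (ours) of Example 9.24: a crude grid search over the log-likelihood L(X, θ) = n ln θ − θn x̄ lands on the closed-form estimate 1/x̄. The sample values below are hypothetical.

```python
import math

samples = [0.7, 1.9, 0.4, 2.3, 1.1]           # hypothetical observations
n = len(samples)
xbar = sum(samples) / n

def log_likelihood(theta):
    return n * math.log(theta) - theta * n * xbar

# grid search over theta in (0, 5); the maximizer agrees with 1/xbar
best = max((log_likelihood(t / 1000), t / 1000) for t in range(1, 5000))
print(best[1], 1 / xbar)
```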
In Examples 9.23 and 9.24, the likelihood function was differentiable at θ = θ̂, and its maximum was determined from (9-76) by differentiation. In the next example, the maximum is a corner point, and θ̂ is determined by inspection.
Example 9.25
The RV x is uniform in the interval (0, θ), as in Fig. 9.17a. We shall find the ML estimate θ̂ of θ.
The joint density of the samples xᵢ equals

$$f(X, \theta) = \frac{1}{\theta^n} \qquad \text{for } 0 < x_1, \ldots, x_n < \theta$$

and it is 0 otherwise. The corresponding likelihood function is shown in Fig. 9.17b, where z is the maximum of the xᵢ. As we see from the figure, f(X, θ) is maximum at the corner point z = max xᵢ of the sample space; hence, the ML estimate of θ equals the maximum of the observations xᵢ. This estimate is biased (see Example 7.4) because E{x_max} = nθ/(n + 1). However, the estimate is consistent. •
[Figure 9.17: (a) the uniform density f(x, θ) on (0, θ); (b) the likelihood function f(X, θ), maximum at the corner point z = max xᵢ.]
[Figure 9.18: the likelihood function of p, whose maximum may occur at an endpoint of the interval 0 ≤ p ≤ 1.]
Example 9.26
We shall find the ML estimate p̂ of the probability p of an event 𝒜. For this purpose, we form the zero-one RV x associated with the event 𝒜. The joint density of the corresponding samples xᵢ equals p^k q^{n−k}, where k is the number of successes of 𝒜. This yields

$$L(X, p) = k\ln p + (n - k)\ln q \qquad \frac{\partial L}{\partial p} = \frac{k}{p} - \frac{n-k}{q} = 0$$

Solving for p, we obtain

$$k(1 - p) - (n - k)p = 0 \qquad \hat{p} = \frac{k}{n}$$

This holds if p is an interior point of the parameter space 0 ≤ p ≤ 1. If p = 0 or 1, the maximum is an endpoint (Fig. 9.18). However, even in this case, p̂ = k/n. •
The determination of the ML estimates θ̂ᵢ of several parameters θᵢ proceeds similarly. We form their likelihood function f(X, θ₁, . . . , θ_r), and we determine its maxima or, equivalently, the maxima of the log-likelihood

$$L(X, \theta_1, \ldots, \theta_r) = \ln f(X, \theta_1, \ldots, \theta_r)$$
Example 9.27
We shall find the ML estimates η̂ and v̂ of the mean η and the variance v = σ² of a normal RV.
In this case,

$$f(X, \eta, v) = \frac{1}{(\sqrt{2\pi v})^n}\exp\left\{-\frac{1}{2v}\sum_i (x_i - \eta)^2\right\} \qquad L(X, \eta, v) = -\frac{n}{2}\ln(2\pi v) - \frac{1}{2v}\sum_i (x_i - \eta)^2$$

$$\frac{\partial L}{\partial\eta} = \frac{1}{v}\sum_i (x_i - \eta) = 0 \qquad \frac{\partial L}{\partial v} = -\frac{n}{2v} + \frac{1}{2v^2}\sum_i (x_i - \eta)^2 = 0$$

Solving the system, we obtain

$$\hat{\eta} = \bar{x} \qquad \hat{v} = \frac{1}{n}\sum_i (x_i - \bar{x})^2 \tag{9-77}$$

The estimate η̂ is unbiased, but the estimate v̂ is biased because E{v̂} = (n − 1)σ²/n. However, both estimates are consistent. •
Asymptotic Properties of ML Estimators  The ML method can be used to estimate any parameter. For moderate values of n, the estimate is not particularly good: It is generally biased, and its variance might be large. Furthermore, the determination of the distribution of θ̂ is not simple. As n increases, the estimate improves, and for large n, θ̂ is nearly normal with mean θ and minimum variance. This is based on the following important theorem.

• Theorem. For large n, the distribution of the ML estimator θ̂ of a parameter θ approaches a normal curve with mean θ and variance 1/nI, where

$$I = E\left\{\left[\frac{\partial}{\partial\theta}L(\mathbf{x}, \theta)\right]^2\right\} \qquad L(\mathbf{x}, \theta) = \ln f(x, \theta) \tag{9-78}$$
In Section 9-6, we show that the variance of any estimator of θ cannot be smaller than 1/nI. From this and the theorem it follows that the ML estimator of a parameter θ is asymptotically the best estimator of θ. The number I in (9-78) is called the information about θ contained in x. This concept is important in information theory. Using integration by parts, we can show (Problem 9-36) that

$$I = -E\left\{\frac{\partial^2}{\partial\theta^2}L(\mathbf{x}, \theta)\right\} \tag{9-79}$$
In many cases, it is simpler to evaluate I from (9-79).
The theorem is not always true. For its validity, the likelihood function
must be differentiable. This condition is not too restrictive, however. The
proof of the theorem is based on the central limit theorem but it is rather
difficult and will be omitted. We shall merely demonstrate its validity with an
example.
Example 9.28
Given an RV x with known mean η, we wish to find the ML estimate v̂ of its variance v = σ². As in (9-77),

$$\hat{v} = \frac{1}{n}\sum_i (x_i - \eta)^2$$

The RV nv̂/σ² has a χ²(n) distribution; hence,

$$E\{\hat{v}\} = \sigma^2 \qquad \sigma_{\hat{v}}^2 = \frac{2\sigma^4}{n}$$

Furthermore, for large n, v̂ is normal because it is the sum of the independent RVs (xᵢ − η)². Thus, asymptotically, the ML estimate v̂ of σ² is normal with mean σ² and variance 2σ⁴/n. We shall show that this agrees with the theorem. In this problem,

$$\frac{\partial L(x, v)}{\partial v} = -\frac{1}{2v} + \frac{(x - \eta)^2}{2v^2} \qquad \frac{\partial^2 L(x, v)}{\partial v^2} = \frac{1}{2v^2} - \frac{(x - \eta)^2}{v^3}$$

and (9-79) yields

$$I = E\left\{-\frac{1}{2v^2} + \frac{(x - \eta)^2}{v^3}\right\} = \frac{1}{2v^2} = \frac{1}{2\sigma^4}$$

in agreement with the theorem. •
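A small simulation (ours) illustrating the theorem for the setting of Example 9.28: with the mean known, the ML estimate of the variance v has empirical variance close to 1/(nI) = 2v²/n.

```python
import random, statistics

random.seed(0)
n, sigma, eta = 400, 2.0, 0.0
estimates = []
for _ in range(2000):
    xs = [random.gauss(eta, sigma) for _ in range(n)]
    estimates.append(sum((x - eta) ** 2 for x in xs) / n)   # ML estimate of v

print(statistics.variance(estimates))   # close to 2*sigma**4/n = 0.08
```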
9-6
Best Estimators and the Rao-Cramer Bound
All estimators considered earlier were more or less empirical. In this section, we examine the problem of determining best estimators. The best estimator of a parameter θ is a statistic θ̂ = g(X) minimizing the MS error

$$e = E\{(\hat{\theta} - \theta)^2\} = \int (g(X) - \theta)^2 f(X, \theta)\,dX \tag{9-80}$$
In this equation, θ is the parameter to be estimated, f(X, θ) is the joint density of the samples xᵢ, and g(X) is the function to be determined. To simplify the problem somewhat, we shall impose the condition that the estimator θ̂ be unbiased:

$$E\{\hat{\theta}\} = \theta \tag{9-81}$$

This condition is only mildly restrictive. As we shall see, in many cases, the best estimators are also unbiased. For unbiased estimators, the MS error e equals the variance of θ̂. Hence, best unbiased estimators are also minimum-variance unbiased estimators.
The problem of determining an unbiased best estimator is difficult because not only the function g(X) but also the parameter θ is unknown. In fact, it is even difficult to establish whether a solution exists. We show next that if a solution does exist, it is unique.
• Theorem. If θ̂₁ and θ̂₂ are two unbiased minimum-variance estimators of a parameter θ, then θ̂₁ = θ̂₂.

• Proof. The variances of θ̂₁ and θ̂₂ must be equal because otherwise the one or the other would not be best. Denoting by σ² their common variance, we conclude that the statistic

$$\hat{\theta} = \frac{1}{2}(\hat{\theta}_1 + \hat{\theta}_2)$$

is an unbiased estimator of θ, and its variance equals

$$\sigma_{\hat{\theta}}^2 = \frac{1}{4}(\sigma^2 + \sigma^2 + 2r\sigma^2) = \frac{\sigma^2}{2}(1 + r)$$

where r is the correlation coefficient of θ̂₁ and θ̂₂. This shows that if r < 1, then σ_θ̂ < σ, which is impossible because θ̂₁ is best; hence, r = 1. And since the RVs θ̂₁ and θ̂₂ have the same mean and variance, we conclude as in (5-46) that θ̂₁ = θ̂₂.
We continue with the search for best estimators. We establish a lower
bound for the MS error of all estimators and develop the class of distributions for which this bound is reached. This material is primarily of theoretical interest.
Regularity  The density of an RV x satisfies the area condition

$$\int_{-\infty}^{\infty} f(x, \theta)\,dx = 1$$
The limits of integration may be finite, but in most cases they do not depend on the parameter θ. We can then deduce, differentiating with respect to θ, that

$$\int_{-\infty}^{\infty} \frac{\partial f(x, \theta)}{\partial\theta}\,dx = 0 \tag{9-82}$$

We shall say that the density f(x, θ) is regular if it satisfies (9-82). Thus f(x, θ) is regular if it is differentiable with respect to θ and the limits of integration in (9-82) do not depend on θ. Most densities of interest are regular; there are, however, exceptions. If, for example, x is uniform in the interval (0, θ), its density is not regular because θ is a boundary point of the range 0 ≤ x ≤ θ of x. In the following, we consider only regular densities.
Information  The log-likelihood of the RV x is by definition the function L(x, θ) = ln f(x, θ). From (9-82) it follows that

$$\int_{-\infty}^{\infty} \frac{\partial L(x, \theta)}{\partial\theta}\,f(x, \theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial f(x, \theta)}{\partial\theta}\,dx = 0$$

This shows that the mean of the RV ∂L(x, θ)/∂θ equals 0. Denoting its variance by I, we conclude that

$$I = E\left\{\left[\frac{\partial L(\mathbf{x}, \theta)}{\partial\theta}\right]^2\right\} \tag{9-83}$$

The number I is the information about θ contained in x [see also (9-78)].
Consider, next, the likelihood function L(X, θ) = ln f(X, θ) of the sample X = [x₁, . . . , xₙ]. Its derivative equals

$$\frac{\partial L(X, \theta)}{\partial\theta} = \frac{1}{f(X, \theta)}\,\frac{\partial f(X, \theta)}{\partial\theta} = \sum_i \frac{\partial L(x_i, \theta)}{\partial\theta}$$

The RVs ∂L(xᵢ, θ)/∂θ have zero mean and variance I. Furthermore, they are independent because they are functions of the independent RVs xᵢ. This leads to the conclusion that

$$E\left\{\frac{\partial L(X, \theta)}{\partial\theta}\right\} = 0 \qquad E\left\{\left[\frac{\partial L(X, \theta)}{\partial\theta}\right]^2\right\} = nI \tag{9-84}$$

The number nI is the information about θ contained in X.
We turn now to our main objective, the determination of the greatest lower bound of the variance of any estimator θ̂ = g(X) of θ. We assume first that θ̂ is an unbiased estimator. From this assumption it follows that

$$E\{\hat{\theta}\} = \int g(X)\,f(X, \theta)\,dX = \theta$$

Differentiating with respect to θ, we obtain

$$\int g(X)\,\frac{\partial f(X, \theta)}{\partial\theta}\,dX = 1$$

This yields the identity

$$E\left\{g(X)\,\frac{\partial L(X, \theta)}{\partial\theta}\right\} = 1 \tag{9-85}$$
The relationships just established will be used in the proof of the following fundamental theorem. The proof is based on Schwarz's inequality [see (5-42)]: For any z and w,

$$E^2\{zw\} \le E\{z^2\}E\{w^2\} \tag{9-86}$$

Equality holds iff z = cw.
THE RAO-CRAMER BOUND  The variance of an unbiased estimator θ̂ = g(X) of a parameter θ cannot be smaller than the inverse 1/nI of the information nI contained in the sample X:

$$\sigma_{\hat{\theta}}^2 = E\{[g(X) - \theta]^2\} \ge \frac{1}{nI} \tag{9-87}$$

Equality holds iff

$$\frac{\partial L(X, \theta)}{\partial\theta} = nI\,[g(X) - \theta] \tag{9-88}$$
• Proof. Multiplying the first equation in (9-84) by θ and subtracting from (9-85), we obtain

$$1 = E\left\{[g(X) - \theta]\,\frac{\partial L(X, \theta)}{\partial\theta}\right\}$$

We square both sides and apply (9-86) to the RVs g(X) − θ and ∂L(X, θ)/∂θ. This yields [see (9-84)]

$$1 \le E\{[g(X) - \theta]^2\}\,E\left\{\left[\frac{\partial L(X, \theta)}{\partial\theta}\right]^2\right\} = \sigma_{\hat{\theta}}^2\,nI$$

and (9-87) results.
To prove (9-88), we observe that (9-87) is an equality iff g(X) − θ equals c ∂L(X, θ)/∂θ. This yields

$$1 = E\left\{c\left[\frac{\partial L(X, \theta)}{\partial\theta}\right]^2\right\} = cnI$$

Hence, c = 1/nI, and (9-88) results.
Suppose now that θ̂ is a biased estimator of the parameter θ with mean E{θ̂} = τ(θ). If we interpret θ̂ as an estimator of τ(θ), our estimator is unbiased. We can therefore apply (9-87) subject to the following modifications: We replace the function ∂L(X, θ)/∂θ by the function

$$\frac{\partial L[X, \theta(\tau)]}{\partial\tau} = \frac{\partial L(X, \theta)}{\partial\theta}\,\frac{1}{\tau'(\theta)}$$

and the information nI about θ contained in X by the information

$$E\left\{\left[\frac{\partial L[X, \theta(\tau)]}{\partial\tau}\right]^2\right\} = \frac{1}{[\tau'(\theta)]^2}\,E\left\{\left[\frac{\partial L(X, \theta)}{\partial\theta}\right]^2\right\} = \frac{nI}{[\tau'(\theta)]^2}$$
about τ(θ) contained in X. This yields the following generalization of the Rao-Cramer bound.

• Corollary. If θ̂ = g(X) is a biased estimator of a parameter θ and E{θ̂} = τ(θ), then

$$\sigma_{\hat{\theta}}^2 = E\{[g(X) - \tau(\theta)]^2\} \ge \frac{[\tau'(\theta)]^2}{nI} \tag{9-89}$$

Equality holds iff

$$\frac{\partial L(X, \theta)}{\partial\theta} = \frac{nI}{\tau'(\theta)}\,[g(X) - \tau(\theta)] \tag{9-90}$$
EFFICIENT ESTIMATORS AND DENSITIES OF EXPONENTIAL TYPE  We shall say that θ̂ is the most efficient estimator of a parameter θ if it is unbiased and its variance equals the bound 1/nI in (9-87). If θ̂ is biased with mean τ(θ) and its variance equals the bound in (9-89), it is the most efficient estimator of the parameter τ(θ).

The Rao-Cramer bound applies only to regular densities. If f(x, θ) is regular, the most efficient estimator of θ is also the best estimator.

The class of distributions that lead to most efficient estimators satisfy the equality condition (9-88) or (9-90). This condition leads to the following class of densities.
• Definition. We shall say that a density f(x, θ) is of the exponential type with respect to the parameter θ if it is of the form

$$f(x, \theta) = h(x)\exp\{a(\theta)q(x) - b(\theta)\} \tag{9-91}$$

where the functions a(θ) and b(θ) depend only on θ and the functions h(x) and q(x) depend only on x.
We shall show that the class of exponential type distributions generates
the class of most efficient estimators.
• Theorem. If the density f(x, θ) is of the form (9-91), the statistic

$$\hat{\theta} = g(X) = \frac{1}{n}\sum_i q(x_i) \tag{9-92}$$

is the most efficient estimator of the parameter

$$\tau(\theta) = \frac{b'(\theta)}{a'(\theta)} \tag{9-93}$$

The corresponding Rao-Cramer bound equals

$$\sigma_{\hat{\theta}}^2 = \frac{[\tau'(\theta)]^2}{nI} = \frac{\tau'(\theta)}{na'(\theta)} \tag{9-94}$$

• Proof. From (9-91) it follows that

$$\ln f(x, \theta) = \ln h(x) + a(\theta)q(x) - b(\theta)$$

and (9-74) yields

$$\frac{\partial L(X, \theta)}{\partial\theta} = a'(\theta)\sum_i q(x_i) - nb'(\theta) = na'(\theta)[g(X) - \tau(\theta)]$$
This function satisfies (9-90) with

$$\frac{nI}{\tau'(\theta)} = na'(\theta)$$

Hence, I = a'(θ)τ'(θ). Inserting into (9-89), we obtain (9-94). Note that the converse is also true: If L(X, θ) satisfies (9-90), then f(x, θ) is of the exponential type. We give next several illustrations of exponential-type distributions.
Normal  (a) The normal density is of the exponential type with respect to its mean because

$$f(x, \eta) = \frac{1}{\sqrt{2\pi v}}\exp\left\{-\frac{1}{2v}(x^2 - 2x\eta + \eta^2)\right\}$$

This density is of the form (9-91) with

$$a(\eta) = \frac{\eta}{v} \qquad b(\eta) = \frac{\eta^2}{2v} \qquad q(x) = x$$

In this case,

$$\hat{\eta} = g(X) = \frac{1}{n}\sum_i x_i = \bar{x}$$

Hence, the sample mean x̄ is the most efficient estimator of η.
(b) The normal density is also of the exponential type with respect to its variance because

$$f(x, v) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\ln\sqrt{v} - \frac{1}{2v}(x - \eta)^2\right\}$$

This is also of the form (9-91) with

$$a(v) = -\frac{1}{2v} \qquad b(v) = \ln\sqrt{v} \qquad q(x) = (x - \eta)^2$$

In this case, τ(v) = v; hence, the statistic v̂ = Σ(xᵢ − η)²/n is the best estimator of v. The variance of the estimation [see (9-94)] equals

$$\sigma_{\hat{v}}^2 = \frac{\tau'(v)}{na'(v)} = \frac{2v^2}{n} = \frac{2\sigma^4}{n}$$
Exponential  The exponential density

$$f(x, \theta) = \theta e^{-\theta x} = \exp\{-\theta x + \ln\theta\} \qquad x > 0$$

is of the exponential type with a(θ) = −θ, b(θ) = −ln θ, and q(x) = x. Hence, the statistic Σxᵢ/n = x̄ is the most efficient estimator of the parameter

$$\tau(\theta) = \frac{b'(\theta)}{a'(\theta)} = \frac{1}{\theta}$$
The Rao-Cramer bound also holds for discrete-type RVs. Next we give two illustrations of most efficient estimators and point densities of the exponential type.
Poisson  The Poisson density

$$f(x, \theta) = e^{-\theta}\frac{\theta^x}{x!} = \frac{1}{x!}\exp\{x\ln\theta - \theta\} \qquad x = 0, 1, \ldots$$
is of the exponential type with

$$a(\theta) = \ln\theta \qquad b(\theta) = \theta \qquad q(x) = x \qquad \tau(\theta) = \frac{b'(\theta)}{a'(\theta)} = \theta$$

Hence, the statistic Σ q(xᵢ)/n = x̄ is the most efficient estimator of θ, and its variance equals 1/na'(θ) = θ/n. Note that the corresponding measure of information about θ contained in the sample X equals nI = na'(θ)τ'(θ) = n/θ.
Probability  If p = P(𝒜) is the probability of an event 𝒜 and x is the zero-one RV associated with 𝒜, then x is an RV with point density

$$f(x, p) = p^x(1 - p)^{1-x} = \exp\{x\ln p + (1 - x)\ln(1 - p)\} \qquad x = 0, 1$$

This is an exponential density with

$$a(p) = \ln\frac{p}{1-p} \qquad b(p) = -\ln(1 - p) \qquad q(x) = x \qquad \tau(p) = p$$

Hence, the ratio Σxᵢ/n = k/n is the most efficient estimator of p.
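A simulation sketch (ours, not from the text) of the Poisson case just discussed: the sample mean, which is the most efficient estimator, has variance close to the Rao-Cramer bound θ/n.

```python
import math, random, statistics

random.seed(1)
theta, n = 3.0, 50

def poisson(lam):
    # Knuth's multiplication method; adequate for this illustration
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

means = [sum(poisson(theta) for _ in range(n)) / n for _ in range(3000)]
print(statistics.variance(means), theta / n)   # both close to 0.06
```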
Sufficient Statistics and Completeness
We turn now to the problem of determining the best estimator θ̂ of an arbitrary parameter θ. If the density f(x, θ) is of the exponential type, then θ̂ exists, and it equals the most efficient estimator g(X) in (9-92). Otherwise, θ̂ might not exist, and if it does, there is no general method for determining it. We show next that for a certain class of densities, the search for the best estimator is simplified.
• Definition. If the joint density f(X, θ) of the n samples xᵢ is of the form

$$f(X, \theta) = H(X)\,J[y(X), \theta] \tag{9-95}$$

where the functions H(X) and y(X) do not depend on θ, the function y(X) is called a sufficient statistic of the parameter θ.
From the definition, it follows that if y(X) is a sufficient statistic, then ky(X) or, in fact, any function of y(X) is a sufficient statistic. If f(x, θ) is a density of exponential type as in (9-91), then

$$f(X, \theta) = \prod_i h(x_i)\exp\left\{a(\theta)\sum_i q(x_i) - nb(\theta)\right\}$$

is of the form (9-95); hence, the sum y(X) = Σ q(xᵢ) is a sufficient statistic of θ.
The importance of sufficient statistics in parameter estimation is based on the following theorem.

• Sufficiency Theorem. If z is an arbitrary statistic and y is a sufficient statistic of the parameter θ, then the regression line

$$\psi(y) = E\{\mathbf{z}\mid y\} = \int_{-\infty}^{\infty} z\,f_z(z\mid y)\,dz$$

is also a statistic; that is, it does not depend on θ.
• Proof. To prove this theorem, it suffices to show that the function f_z(z|y) does not depend on θ. As we know,

$$f_z(z\mid y)\,dz = \frac{f_{yz}(y, z)\,dy\,dz}{f_y(y)\,dy} = \frac{P\{(y, z)\in D_{yz}\}}{P\{y\in D_y\}} \tag{9-96}$$

where D_yz is a differential region and D_y a vertical strip in the yz plane (Fig. 9.19a). The transformation y = y(X), z = z(X) maps the region D_yz of the yz plane on the region A_yz of the sample space and the region D_y on the region A_y (Fig. 9.19b). The numerator and the denominator of (9-96) equal the integral of the density f(X, θ) in the regions A_yz and A_y respectively. In these regions, y(X) = y is constant; hence the term J(y, θ) can be taken outside the integral. This yields [see (9-95)]

$$P\{(y, z)\in D_{yz}\} = J(y, \theta)\int_{A_{yz}} H(X)\,dX \qquad P\{y\in D_y\} = J(y, \theta)\int_{A_y} H(X)\,dX$$

Inserting into (9-96) and canceling J(y, θ), we conclude that the function f_z(z|y) does not depend on θ.
• Corollary 1. From the Rao-Blackwell theorem (6-58) it follows that if z is an unbiased estimator of θ, then the statistic θ̂ = E{z|y} is also unbiased and its variance is smaller than σ_z². Thus if we know an unbiased estimator z of θ, using the sufficiency theorem, we can construct another unbiased estimator E{z|y} with smaller variance.

• Corollary 2. The best estimator of θ is a function ψ(y) of its sufficient statistic.

• Proof. Suppose that z is the best estimator of θ. If z does not equal ψ(y), then the variance of the statistic θ̂ = E{z|y} is smaller than σ_z². This, however, is impossible; hence, z = ψ(y).
[Figure 9.19: (a) the regions D_yz and D_y in the yz plane; (b) their images A_yz and A_y in the sample space.]
It follows that to find the best estimator of θ, it suffices to consider only functions of its sufficient statistic y. This simplifies the problem, but even with this simplification, there is no general solution. However, if the density f_y(y, θ) of the sufficient statistic y satisfies certain conditions related to the uniqueness problem in transform theory, finding the best estimator simply entails finding an unbiased estimator, as we show next.
COMPLETENESS  Consider the integral

$$Q(\theta) = \int q(y)\,k(y, \theta)\,dy \tag{9-97}$$

where k(y, θ) is a given function of y and θ is a parameter taking values in a region Θ. This integral assigns to any function q(y) for which it converges a function Q(θ) defined for every θ in Θ. This function is called the transform of q(y) generated by the kernel k(y, θ). A familiar example is the Laplace transform generated by the kernel e^{−θy}. We shall say that the kernel k(y, θ) is complete if the transform Q(θ) has a unique inverse transform. By this we mean that a specific Q(θ) is the transform of one and only one function q(y). We show next that the notion of completeness leads to a simple determination of the best estimator of θ.
• Definition. A sufficient statistic y is called complete if its density f_y(y, θ) is a complete kernel.
• Theorem. If θ̂ = q(y) is a function of the complete statistic y and its mean equals θ:

$$E\{\hat{\theta}\} = \int q(y)\,f_y(y, \theta)\,dy = \theta \tag{9-98}$$

then θ̂ is the best unbiased estimator of θ.

• Proof. Suppose that z is the best unbiased estimator of θ. As we have seen, z = ψ(y); hence,

$$E\{\mathbf{z}\} = \int \psi(y)\,f_y(y, \theta)\,dy = \theta \tag{9-99}$$

The last two equations show that θ is the transform of the functions q(y) and ψ(y) generated by the kernel f_y(y, θ). This kernel is complete by assumption; hence, θ has a unique inverse. From this it follows that q(y) = ψ(y); therefore, θ̂ = z.
These results lead to the following conclusion: If y is a sufficient and complete statistic of θ, then to find the best unbiased estimator of θ, it suffices to find merely an unbiased estimator w. Indeed, starting from w, we form the RV z = E{w|y}. This RV is an unbiased function of y; hence, according to the sufficiency theorem, it is the best estimator of θ.

These conclusions are based on the completeness of y. The problem of establishing completeness is not simple. For exponential-type distributions, completeness follows from the uniqueness of the inversion of the Laplace
transform. In other cases, special techniques must be used. Here is an illustration.
Example 9.29
We are given an RV x with uniform distribution in the interval (0, θ) (Fig. 9.20a), and we wish to find the best estimator θ̂ of θ.
In this case,

$$f(X, \theta) = \frac{1}{\theta^n} \qquad 0 < x_1, \ldots, x_n < \theta$$

and zero otherwise. This density can be expressed in terms of the maximum z = x_max and the minimum w = x_min of the xᵢ. Indeed, f(X, θ) = 0 iff w < 0 or z > θ; hence,

$$f(X, \theta) = \frac{1}{\theta^n}\,U(w)\,U(\theta - z) \tag{9-100}$$

where U is the unit-step function. This density is of the form (9-95) with y(X) = z; hence, the function y = x_max = z is a sufficient statistic of θ, and its density (see Example 7.6) equals

$$f_y(y, \theta) = \frac{n}{\theta^n}\,y^{n-1} \qquad 0 < y < \theta \tag{9-101}$$

as in Fig. 9.20b. Next we show that y is complete. It suffices to show that if

$$Q(\theta) = \frac{n}{\theta^n}\int_0^{\theta} q(y)\,y^{n-1}\,dy$$

then Q(θ) has a unique inverse q(y). For this purpose, we multiply both sides by θⁿ and differentiate with respect to θ. This yields

$$n\theta^{n-1}Q(\theta) + \theta^n Q'(\theta) = n\,q(\theta)\,\theta^{n-1}$$

Hence, q(θ) = Q(θ) + (θ/n)Q'(θ) is the unique inverse of Q(θ).
To complete the determination of θ̂, we must find an unbiased estimator of θ. From (9-101) it follows that E{y} = nθ/(n + 1). This leads to the conclusion that the statistic

$$\hat{\theta} = \frac{n+1}{n}\,y = \frac{n+1}{n}\,x_{\max}$$

is an unbiased estimator of θ. And since it is a function of the complete statistic y, it is the best estimator. •
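A simulation sketch (ours) comparing the best estimator ((n + 1)/n)x_max of the example above with the moment estimator 2x̄; both are unbiased, but the first has much smaller variance.

```python
import random, statistics

random.seed(2)
theta, n = 1.0, 20
best, moments = [], []
for _ in range(5000):
    xs = [random.uniform(0, theta) for _ in range(n)]
    best.append((n + 1) / n * max(xs))
    moments.append(2 * sum(xs) / n)

print(statistics.mean(best), statistics.variance(best))        # ~1.0, ~theta**2/(n*(n+2))
print(statistics.mean(moments), statistics.variance(moments))  # ~1.0, ~theta**2/(3*n)
```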
[Figure 9.20: (a) the uniform density f(x, θ) on (0, θ); (b) the density f_y(y, θ) of the sufficient statistic y = x_max.]
Let us turn, finally, to the use of completeness in establishing the independence between the sufficient statistic y and certain other statistics.

• Theorem. If z is a statistic such that its density is a function f_z(z) that does not depend on θ and y is a complete statistic of θ, then the RVs z and y are independent.

• Proof. It suffices to show that f_z(z|y) = f_z(z). The area of the density f_y(y, θ) of y equals 1. From this and the total probability theorem (6-31) it follows that

$$f_z(z) = \int_{-\infty}^{\infty} f_z(z\mid y)\,f_y(y, \theta)\,dy = \int_{-\infty}^{\infty} f_z(z)\,f_y(y, \theta)\,dy \tag{9-102}$$

The density f_z(z) does not depend on θ by assumption, and the conditional density f_z(z|y) does not depend on θ because y is a sufficient statistic of θ (sufficiency theorem). And since the kernel f_y(y, θ) is complete, we conclude from (9-102) that f_z(z|y) = f_z(z).

Using the properties of quadratic forms, we have shown in Section 7-4 that the sample mean x̄ and the sample variance s² of a normal RV are independent. This result follows directly from the theorem just proved. Indeed, x̄ is a complete statistic of the mean of x because its density is normal and a normal kernel is complete. Furthermore, the density of s² does not depend on η. Hence, x̄ and s² are independent.
Problems
9-1  The diameter of cylindrical rods coming out of a production line is a normal RV x with σ = 0.1 mm. We measure n = 9 units and find that the average of the measurements is x̄ = 91 mm. (a) Find c such that with a .95 confidence coefficient, the mean η of x is in the interval x̄ ± c. (b) We claim that η is in the interval (90.95, 91.05). Find the confidence coefficient of our claim.
9-2  The length of a product is an RV x with σ = 1 mm and unknown mean. We measure four units and find that x̄ = 203 mm. (a) Assuming that x is a normal RV, find the .95 confidence interval of η. (b) The distribution of x is unknown. Using Tchebycheff's inequality, find c such that with confidence coefficient .95, η is in the interval 203 ± c.
9-3  We know from past records that the life length of type A tires is an RV x with σ = 5,000 miles. We test 64 samples and find that their average life length is x̄ = 25,000 miles. Find the .9 confidence interval of the mean of x.
9-4  We wish to determine the length a of an object. We use as an estimate of a the average x̄ of n measurements. The measurement error is approximately normal with zero mean and standard deviation 0.1 mm. Find n such that with 95% confidence, x̄ is within ±0.2 mm of a.
9-5  An object of length a is measured by two persons using the same instrument. The instrument error is an N(0, σ) RV where σ = 1 mm. The first person
measures the object 25 times, and the average of the measurements is x̄ = 12 mm. The second person measures the object 36 times, and the average of the measurements is ȳ = 12.8 mm. We use as point estimate of a the weighted average c = ax̄ + bȳ. (a) Find a and b such that c is the minimum-variance unbiased estimate of a as in (7-37). (b) Find the .95 confidence interval of a.
9-6  In a statewide math test, 17 students obtained the following scores:
49 57 64 72 75 77 78 79 81 81 82 84 85 87 89 93 96
Assuming that the scores are approximately normal, find the .95 confidence interval of their mean (a) using (9-10); (b) using (9-12).
9-7  A grocer weighs 10 boxes of cereal, and the results yield x̄ = 420 g and s = 12 g for the sample mean and sample standard deviation, respectively. He then claims with 95% confidence that the mean weight of all boxes exceeds c g. Assuming normality, find c.
9-8  The RV x is uniformly distributed in the interval θ − 2 < x < θ + 2. We observe 100 samples xᵢ and find that their average equals x̄ = 30. Find the .95 confidence interval of θ.
9-9  Consider an RV x with density f(x) = xe^{−x}U(x). Predict with 95% confidence that the next value of x will be in the interval (a, b). Show that the length b − a of this interval is minimum if a and b are such that f(a) = f(b), P{a < x < b} = .95. Find a and b.
9-10  (Estimation-prediction) The time to failure of electric bulbs of brand A is a normal RV with σ = 10 hours and unknown mean. We have used 20 such bulbs and have observed that the average x̄ of their time to failure is 80 hours. We buy a new bulb of the same brand and wish to predict with 95% confidence that its time to failure will be in the interval 80 ± c. Find c.
9-11  The time to failure of an electric motor is an RV x with density βe^{−βx}U(x). (a) Show that if x̄ is the sample mean of n samples of x, then the RV 2nβx̄ has a χ²(2n) distribution. (b) We test n = 10 motors and find that x̄ = 300 hours. Find the left .95 confidence interval of β. (c) The probability p that a motor will be good after 400 hours equals p = P{x > 400} = e^{−400β}. Find the .95 confidence interval p > p₀ of p.
9-12  Suppose that the time between arrivals of patients in a dentist's office constitutes samples of an RV x with density θe^{−θx}U(x). The 40th patient arrived 4 hours after the first. Find the .95 confidence interval of the mean arrival time η = 1/θ.
9-13  The number of particles emitted from a radioactive substance in 1 second is a Poisson-distributed RV with mean λ. It was observed that in 200 seconds, 2,550 particles were emitted. Assuming that the numbers of particles in nonoverlapping intervals are independent, find the .95 confidence interval of λ.
9-14  Among 4,000 newborns, 2,080 are male. Find the .99 confidence interval of the probability p = P{male}.
9-15  In an exit poll, of 900 voters questioned, 360 responded that they favor a particular proposition. On this basis, it was reported that 40% of the voters favor the proposition. (a) Find the margin of error if the confidence coefficient of the results is .95. (b) Find the confidence coefficient if the margin of error is ±2%.
9-16  In a market survey, it was reported that 29% of respondents favor product A. The poll was conducted with confidence coefficient .95, and the margin of error was ±4%. Find the number of respondents.
9-17  We plan a poll for the purpose of estimating the probability p of Republicans in a community. We wish our estimate to be within ±.02 of p. How large should our sample be if the confidence coefficient of the estimate is .95?
9-18  A coin is tossed once, and heads shows. Assuming that the probability p of heads is the value of an RV p uniformly distributed in the interval (0.4, 0.6), find its Bayesian estimate (9-37).
9-19  The time to failure of a system is an RV x with density f(x, θ) = θe^{−θx}U(x). We wish to find the Bayesian estimate θ̂ of θ in terms of the sample mean x̄ of the n samples xᵢ of x. We assume that θ is the value of an RV θ with prior density f_θ(θ) = ce^{−cθ}U(θ). Show that

$$\hat{\theta} = \frac{n+1}{c + n\bar{x}}$$
9-20  The RV x has a Poisson distribution with mean θ. We wish to find the Bayesian estimate θ̂ of θ under the assumption that θ is the value of an RV θ with prior density f_θ(θ) ∼ θ^b e^{−cθ}U(θ). Show that

$$\hat{\theta} = \frac{n\bar{x} + b + 1}{n + c}$$
9-21  Suppose that x is the yearly starting income of teachers with a bachelor's degree and y is the corresponding income of teachers with a master's degree. We wish to estimate the difference η_x − η_y of their mean incomes. We question n = 100 teachers with a bachelor's degree and m = 50 teachers with a master's degree and find the following averages:
x̄ = 20K    ȳ = 24K    s_x = 3.1K    s_y = 4K
Assuming that the RVs x and y are normal with the same variance, find the .95 confidence interval of η_x − η_y.
9-22  Suppose that the IQ scores of children in a certain grade are the samples of an N(η, σ) RV x. We test 10 children and obtain the following averages: x̄ = 90, s = 5. Find the .95 confidence interval of η and of σ.
9-23  The RVs xᵢ are i.i.d. and N(0, σ). We observe that x₁² + ··· + x₁₀² = 4. Find the .95 confidence interval of σ.
9-24  The readings of a voltmeter introduce an error ν with mean 0. We wish to estimate its standard deviation σ. We measure a calibrated source V = 3 volts four times and obtain the values 2.90, 3.15, 3.05, and 2.96. Assuming that ν is normal, find the .95 confidence interval of σ.
9-25  We wish to estimate the correlation between freshman grades and senior grades. We examine the records of 100 students and we find that their sample correlation coefficient is r̂ = .45. Using Fisher's auxiliary variable, find the .9 confidence interval of r.
9-26  Given the n paired samples (xᵢ, yᵢ) of the RVs x and y, we form their sample means x̄ and ȳ and the sample covariance

$$\hat{\mu}_{11} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Show that E{μ̂₁₁} = μ₁₁.
9-27  (a) (Cauchy-Schwarz inequality) Show that

$$\left(\sum_{i=1}^{n} a_i b_i\right)^2 \le \sum_{i=1}^{n} a_i^2\,\sum_{i=1}^{n} b_i^2$$

(b) Show that if

$$\hat{r}^2 = \frac{\left[\sum_i (x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_i (x_i - \bar{x})^2\,\sum_i (y_i - \bar{y})^2}$$

then r̂² ≤ 1.
9-28  Given the 16 samples
xᵢ = 93  75  40  73  61  42  68  64  78  54  87  84  71  49  72  58
of the RV x, find the probability that its median is between 68 and 75 (a) exactly from (9-60); (b) approximately from (9-61).
9-29  The RVs y₁, . . . , y₅ are the order statistics of the five samples x₁, . . . , x₅ of x. Find the probability P{y₁ < x_u < y₅} that the u-percentile x_u of x is between y₁ and y₅ for u = .5, .4, and .3.
9-30  We use as estimate of the median x₍.₅₎ of an RV x the interval (y_k, y_{k+1}) between the order statistics closest to the center of the range (y₁, yₙ): k ≤ n/2 ≤ k + 1. Using (9-61), show that

$$P\{y_k < x_{.5} < y_{k+1}\} \simeq \frac{2}{\sqrt{2\pi n}}$$

9-31  We use as the estimate of the distribution F(x) of an RV x the empirical distribution F̂(x) in (9-67) obtained with n samples xᵢ. We wish to claim with 90% confidence that the unknown F(x) is within .02 from F̂(x). Find n.
9-32  The RV x has the Erlang density f(x) = c⁴x³e^{−cx}U(x). We observe the samples xᵢ = 3.1, 3.4, 3.3. Find the ML estimate ĉ of c.
9-33  The RV x has the truncated exponential density f(x) = ce^{−c(x−x₀)}U(x − x₀). Find the ML estimate ĉ of c in terms of the n samples xᵢ of x.
9-34  The time to failure of a bulb is an RV x with density ce^{−cx}U(x). We test 80 bulbs and find that 200 hours later, 62 of them are still good. Find the ML estimate of c.
9-35  The RV x has a Poisson distribution with mean θ. Show that the ML estimate of θ equals x̄.
9-36  Show that if L(x, θ) = ln f(x, θ) is the likelihood function of an RV x, then

$$E\left\{\left[\frac{\partial L(\mathbf{x}, \theta)}{\partial\theta}\right]^2\right\} = -E\left\{\frac{\partial^2 L(\mathbf{x}, \theta)}{\partial\theta^2}\right\}$$
9-37  The time to failure of a pump is an RV x with density θe^{−θx}U(x). (a) Find the information I about θ contained in x. (b) We have shown in Example 9.24 that the ML estimator θ̂ of θ equals 1/x̄. Show that for n > 2,

$$E\{\hat{\theta}\} = \frac{n\theta}{n-1} \qquad \sigma_{\hat{\theta}}^2 = \frac{n^2\theta^2}{(n-1)^2(n-2)}$$
9-38  Show that if y is the best estimator of a parameter θ and z is an arbitrary statistic with zero mean, then y and z are uncorrelated.
9-39  The RV x has the gamma density θ²xe^{−θx}U(x). Find the best estimator of the parameter 1/θ.
9-40  The RV x has the Weibull density f(x, θ) = (cx^{c−1}/θ)e^{−x^c/θ}U(x). Show that the most efficient estimate of θ is the sum

$$\hat{\theta} = \frac{1}{n}\sum_i x_i^c$$
9-41  Show that if the function f(x, θ) is of the exponential type as in (9-91), the ML estimator of the parameter τ = b'(θ)/a'(θ) is the best estimator of τ.
9-42  Show that if xᵢ are the samples of an RV x with density e^{−(x−θ)}U(x − θ), the RV w = min xᵢ is a sufficient and complete statistic of θ.
9-43  Show that if the density f(x, θ) has a sufficient statistic y [see (9-95)] and the density f_y(y, θ) of y is known for θ = θ₀, then it is known for every θ.
9-44  If f(x) = θe^{−θx}U(x) and n = 2, then the sum y = x₁ + x₂ is a sufficient statistic of θ. Show that if z = x₁, then E{z|y} = y/2, in agreement with the sufficiency theorem.
9-45  Suppose that xᵢ are the samples of an N(η, 5) RV x. (a) Show that the sum y = x₁ + ··· + xₙ is a sufficient statistic of η. (b) Show that if aᵢ are n constants such that a₁ + ··· + aₙ = 0 and z = a₁x₁ + ··· + aₙxₙ, then the RVs y and z are independent.
10
Hypothesis Testing
Hypothesis testing is part of decision theory. It is based on statistical considerations and on other factors, often subjective, that are outside the scope of statistics. In this chapter, we deal only with the statistical part of the theory. In the first two sections, we develop the basic concepts using commonly used parameters, including tests of means, variances, probabilities, and distributions. In the next three sections, we present a variety of applications, including quality control, goodness-of-fit tests, and analysis of variance. The last section deals with optimality criteria, sequential testing, and likelihood ratios.
10-1
General Concepts
A hypothesis is an assumption. A statistical hypothesis is an assumption about the values of one or more parameters of a statistical model. Hypothesis testing is a process of establishing the validity of a hypothesis. In hypothesis testing, we are given an RV x modeling a physical quantity. The distribution of x is a function F(x, θ) depending on a parameter θ. We wish to test the
322
CHAP.
10
HYPOTHESIS TESTING
hypothesis that 8 equals a given number 80 • This problem is fundamental in
many areas of applied statistics. Here arc several illustrcltions.
I.
We know from past experience that under certain experimental
conditions, the parameter 8 equals 80 • We modify various factors
of the experiment. and we wish to establish whether these modifications have any effect on the value of 8. The modifications might
be intentional (we try a new fertilizer), or they might be beyond our
control (undesirable changes in a production process).
2. The hypothesis that 8 = 80 might be the result of a theory to be
verified.
3. The hypothesis might be a standard that we have established (expected productivity of a worker) or a desirable objective.
Terminology  The assumption that θ = θ₀ will be denoted by H₀ and will be called the null hypothesis. The assumption that θ ≠ θ₀ will be denoted by H₁ and will be called the alternative hypothesis. The set of values that θ might take under the alternative hypothesis will be denoted by Θ₁. If Θ₁ consists of a single point θ = θ₁, the hypothesis H₁ is called simple; otherwise, it is called composite. Typically, Θ₁ is one of the following three sets: θ ≠ θ₀, θ > θ₀, or θ < θ₀. The null hypothesis is in most cases simple.
THE TEST  The purpose of hypothesis testing is to establish whether experimental evidence supports rejecting the null hypothesis. The available evidence consists of the n samples X = [x₁, . . . , xₙ] of the RV x. Suppose that under the null hypothesis the joint density f(X, θ₀) of the samples xᵢ is negligible in a certain region D_c of the sample space, taking significant values only in the complement D_a of D_c. It is reasonable then to reject H₀ if X is in D_c and to accept it if X is in D_a. The set D_c is called the critical region of the test, and the set D_a is the region of acceptance of H₀. The terms "accept" and "reject" will be interpreted as follows: If X ∈ D_c, the evidence supports the rejection of H₀; if X ∉ D_c, the evidence does not support the rejection of H₀. The test is thus specified in terms of the set D_c. The choice of this set depends on the nature of the decision errors. There are two types of errors, depending on the location of the observation vector X. We shall explain the nature of these errors and their role in the selection of the critical region of the test.
Suppose, first, that H₀ is true. If X ∈ D_c, we reject H₀ even though it is true, a Type I error. The Type I error probability is denoted by α and is called the significance level of the test. Thus

$$\alpha = P\{X\in D_c\mid H_0\} \tag{10-1}$$

The difference 1 − α equals the probability that we accept H₀ when true.
Suppose, next, that H₀ is false. If X ∉ D_c, we accept H₀ even though it is false, a Type II error. The Type II error probability depends on the value of θ. It is thus a function β(θ) of θ called the operating characteristic (OC) of
the test. Thus

$$\beta(\theta) = P\{X\notin D_c\mid H_1\} \tag{10-2}$$

The difference

$$P(\theta) = 1 - \beta(\theta) = P\{X\in D_c\mid H_1\} \tag{10-3}$$

equals the probability that we reject H₀ when it is false. The function P(θ) is called the power of the test. For brevity, we shall often identify the two types of errors by the expressions α error and β(θ) error, respectively.

To design an optimum test, we assign a value to α and select the critical region D_c so as to minimize the resulting β. If we succeed, the test is called most powerful. The critical region of such a test usually depends on θ. If it happens that a most powerful test is obtained with the same D_c for every θ ∈ Θ₁, the test is called uniformly most powerful.
Note  Hypothesis testing belongs to decision theory. Statistical considerations lead merely to the following conclusions:

If H₀ is true, then P{X ∈ D_c} = α
If H₀ is false, then P{X ∉ D_c} = β(θ)   (10-4)

Guided by this, we reach a decision:

Reject H₀ iff X ∈ D_c   (10-5)

This decision is not based only on (10-4). It takes into account our prior knowledge concerning the validity of H₀, the consequences of a wrong decision, and possibly other, often subjective, factors.
Test Statistic
The critical region is a set D_c in the sample space. If it is properly chosen, the test is more powerful. This involves a search in the n-dimensional space. We shall use a simpler approach. Prior to any experimentation, we select a function g(X) and form the RV q = g(X). This RV will be called the test statistic. The function g(X) may depend on θ₀ and on other known parameters, but it must be independent of θ. Only then is g(X) a known number for a specific X. The test of a hypothesis involving a test statistic is simpler. The decision whether to reject H₀ is based not on the value of the vector X but on the value of the scalar q = g(X).

To test the hypothesis H₀ using a test statistic, we find a region R_c on the real line, and we reject H₀ iff q is in R_c. The resulting error probabilities are

$$\alpha = P\{q\in R_c\mid H_0\} = \int_{R_c} f_q(q, \theta_0)\,dq \qquad \beta(\theta) = P\{q\notin R_c\mid H_1\} = \int_{R_a} f_q(q, \theta)\,dq$$

where R_a is the region of acceptance of H₀. The density f_q(q, θ) of the test statistic can be expressed in terms of the function g(X) and the joint density
f(X, θ) of the samples xᵢ [see (7-6)]. The critical region R_c is determined as before: We select α and search for a region R_c minimizing the resulting OC function β(θ).

In the next section, we design tests based on empirically chosen test statistics. In general, such tests are not most powerful no matter how R_c is chosen. In Section 10-6, we develop the conditions that a test statistic must satisfy in order that the resulting test be most powerful (Neyman-Pearson criterion) and show that many of the empirically chosen test statistics meet these conditions.
10-2
Basic Applications
In this section, we develop tests involving the commonly used parameters. The tests are based on the value of a test statistic q = g(X). The choice of the function g(X) is more or less empirical. Optimality criteria are developed in Section 10-6. In the applications of this section, the density f_q(q, θ) of the test statistic has a single maximum at q = q_max. To be concrete, we assume that f_q(q, θ) is concentrated on the right of q_max if θ > θ₀ and on its left if θ < θ₀, as in Fig. 10.1.

[Figure 10.1: the critical regions for the alternatives (a) θ ≠ θ₀, (b) θ > θ₀, (c) θ < θ₀.]
Our problem can be phrased as follows: We have an RV x with distribution F(x, θ). We wish to test the hypothesis θ = θ₀ against one of the alternative hypotheses θ ≠ θ₀, θ > θ₀, or θ < θ₀. In all three cases, we shall use the same test statistic. To carry out the test, we select a function g(X). We form the RV q = g(X) and determine its density f_q(q, θ). We next choose a value for the α error and determine the critical region R_c. We compute the resulting OC function β(θ). If β(θ) is unacceptably large, we increase the
number n of the samples xᵢ. Finally, we observe the sample vector, compute the value q = g(X) of the test statistic, and reject H₀ iff q is in the region R_c.
To complete the test, we must determine its critical region R_c. The result depends on the nature of the alternative hypothesis. We consider each of the three cases θ ≠ θ₀, θ > θ₀, θ < θ₀, and we determine the corresponding OC function.
1. Suppose that H₁ is the hypothesis θ ≠ θ₀. In this case, R_c consists of the two half-lines q < c₁ and q > c₂, as in Fig. 10.1a. The corresponding α error is the area under the two tails of the density f_q(q, θ₀):

$$\alpha = \int_{-\infty}^{c_1} f_q(q, \theta_0)\,dq + \int_{c_2}^{\infty} f_q(q, \theta_0)\,dq \tag{10-6}$$

The constants c₁ and c₂ are so chosen as to minimize the length c₂ − c₁ of the interval (c₁, c₂). This leads to the condition f_q(c₁, θ₀) = f_q(c₂, θ₀); however, the computations required to determine c₁ and c₂ are involved. To simplify the problem, we choose c₁ and c₂ such that the area of f_q(q, θ₀) under each of its tails equals α/2. This yields

$$\int_{-\infty}^{c_1} f_q(q, \theta_0)\,dq = \int_{c_2}^{\infty} f_q(q, \theta_0)\,dq = \frac{\alpha}{2} \tag{10-7a}$$

Denoting by q_u the u-percentile of the test statistic q, we conclude that c₁ = q_{α/2}, c₂ = q_{1−α/2}. The resulting OC function β(θ) equals the area of f_q(q, θ) in the interval (c₁, c₂).
2. Suppose that H₁ is the hypothesis θ > θ₀. The critical region is now the half-line q > c as in Fig. 10.1b, and the corresponding α error equals the area of the right tail of f_q(q, θ₀):

$$\alpha = \int_{c}^{\infty} f_q(q, \theta_0)\,dq \qquad c = q_{1-\alpha} \tag{10-7b}$$

The OC function β(θ) equals the area of f_q(q, θ) in the region q < c.
3. Suppose that H₁ is the hypothesis θ < θ₀. The critical region is the half-line q < c of Fig. 10.1c, and the α error equals

$$\alpha = \int_{-\infty}^{c} f_q(q, \theta_0)\,dq \qquad c = q_{\alpha} \tag{10-7c}$$

The OC function is the area of f_q(q, θ) in the region q > c.
Summary  To test the hypothesis θ = θ₀ against one of the alternative hypotheses θ ≠ θ₀, θ > θ₀, or θ < θ₀, we proceed as follows:
1. Select the statistic q = g(X) and determine its density f_q(q, θ).
2. Assign a value to the significance level α and determine the critical region R_c for each case.
3. Observe the sample X and compute the function q = g(X).
4. Reject H₀ iff q is in the region R_c.
5. Compute the OC function β(θ) for each case.

Here is the decision process and the resulting OC function β(θ) for each case. The numbers q_u are the u-percentiles of the test statistic q under hypothesis H₀.

H₁: θ ≠ θ₀    Accept H₀ iff c₁ ≤ q ≤ c₂    c₁ = q_{α/2}, c₂ = q_{1−α/2}

$$\beta(\theta) = \int_{c_1}^{c_2} f_q(q, \theta)\,dq \tag{10-8a}$$

H₁: θ > θ₀    Accept H₀ iff q ≤ c    c = q_{1−α}

$$\beta(\theta) = \int_{-\infty}^{c} f_q(q, \theta)\,dq \tag{10-8b}$$

H₁: θ < θ₀    Accept H₀ iff q ≥ c    c = q_α

$$\beta(\theta) = \int_{c}^{\infty} f_q(q, \theta)\,dq \tag{10-8c}$$
Notes
1. The test of the simple hypothesis θ = θ₀ against the alternative simple hypothesis θ = θ₁ > θ₀ is a special case of (10-8b); the test against θ = θ₁ < θ₀ is a special case of (10-8c).
2. The test of the composite null hypothesis θ ≤ θ₀ against the alternative θ > θ₀ is identical to (10-8b); similarly, the test of the composite null hypothesis θ ≥ θ₀ against θ < θ₀ is identical to (10-8c). For all three tests, the constant α is the Type I error probability only if θ = θ₀ when H₀ is true.
3. For the determination of the critical region of the test, knowledge of the density f_q(q, θ) for θ = θ₀ is necessary. For the determination of the OC function β(θ), knowledge of f_q(q, θ) for every θ ∈ Θ₁ is necessary.
4. If the statistic q generates a certain test, the same test can be generated by any function of q.
5. We show in Section 10-6 that under the stated conditions, all tests in (10-8) are most powerful. Furthermore, the tests against θ > θ₀ or θ < θ₀ are uniformly most powerful. This is not true for the test against θ ≠ θ₀. For a specific θ > θ₀, for example, the critical region q > c yields a Type II error smaller than the integral in (10-8a).
6. The OC function β(θ) equals the Type II error probability. If it is too large, we increase α to its largest tolerable value. If β(θ) is still too large, we increase the number n of samples.
7. Tests based on (10-8) require knowledge of the percentiles q_u of the test statistic q for various values of u. However, all tests can be carried out in terms of the distribution F_q(q, θ) of q. Indeed, suppose that we wish to test the hypothesis θ = θ₀ against θ ≠ θ₀. From the monotonicity of distributions it follows that
Hence, the test (10-8a) is equivalent to the following test:
Determine the value F_q(q, θ₀), where q is the observed value of the test statistic.

   Accept H₀ iff α/2 < F_q(q, θ₀) < 1 − α/2

This approach is used in tests based on computer simulation (see Section 8-3).
Mean   We have an RV x with mean η, and we wish to test the hypothesis

   H₀: η = η₀     against     H₁: η ≠ η₀, η > η₀, or η < η₀

Assuming that the variance of x is known, we use as the test statistic the RV

   q = (x̄ − η₀)/(σ/√n)        (10-9)

where x̄ is the sample mean of x. With the familiar assumptions, the RV x̄ is N(η, σ/√n). From this it follows that under hypothesis H₀, the test statistic q is N(0, 1); hence, its percentile q_u equals the u-percentile z_u of the standard normal distribution. Setting q_u = z_u in (10-8), we obtain the critical regions of Fig. 10.2. To find the corresponding OC functions, we must determine the density of q under hypothesis H₁. In this case, x̄ is N(η, σ/√n), and q is N(η_q, 1) where

   η_q = (η − η₀)/(σ/√n)        (10-10)

Since z_{1−u} = −z_u, (10-8) yields
H₁: η ≠ η₀     Accept H₀ iff z_{α/2} ≤ q ≤ −z_{α/2}
               β(η) = G(−z_{α/2} − η_q) − G(z_{α/2} − η_q)        (10-11a)

H₁: η > η₀     Accept H₀ iff q ≤ z_{1−α}
               β(η) = G(−z_α − η_q)        (10-11b)

H₁: η < η₀     Accept H₀ iff q ≥ z_α
               β(η) = 1 − G(z_α − η_q)        (10-11c)

The OC functions β(η) are shown in Fig. 10.2 for each case.
[Figure 10.2: the OC functions β(η) for the three alternatives.]

Example 10.1
We receive a gold bar of nominal weight 8 oz., and we wish to test the hypothesis that its actual weight is indeed 8 oz. against the hypothesis that it is less than 8 oz. To do so, we measure the bar 10 times. The results of the measurements are the values

   xᵢ = 7.86  7.90  7.93  7.95  7.96  7.97  7.98  8.01  8.02  8.04

of the RV x = η + ν, where η is the actual weight of the bar and ν is the measurement error, which we assume normal with zero mean and σ = 0.1. The test will be performed with significance level α = .05.
In this problem,

   H₀: η = η₀ = 8     H₁: η < 8
   x̄ = 7.96     σ/√n = 0.032     q = (x̄ − η₀)/(σ/√n) = −1.25     z_α = −1.645

Since −1.25 is not in the critical region q < −1.645, we accept the null hypothesis. The resulting OC function equals

   β(η) = 1 − G(−1.645 − (η − 8)/0.032)     •
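As a numerical cross-check of this one-sided test of the mean, here is a minimal Python sketch; the use of SciPy for the normal percentile and the function and variable names are my own additions, not part of the text.

```python
import math
from scipy.stats import norm

x = [7.86, 7.90, 7.93, 7.95, 7.96, 7.97, 7.98, 8.01, 8.02, 8.04]
eta0, sigma, alpha = 8.0, 0.1, 0.05

n = len(x)
xbar = sum(x) / n
q = (xbar - eta0) / (sigma / math.sqrt(n))   # test statistic (10-9); about -1.25 here
z_alpha = norm.ppf(alpha)                    # critical value for H1: eta < eta0

print("reject H0:", q < z_alpha)             # reject iff q falls in the critical region

def beta(eta):
    # OC function beta(eta) = 1 - G(z_alpha - eta_q), as in (10-11c)
    eta_q = (eta - eta0) / (sigma / math.sqrt(n))
    return 1 - norm.cdf(z_alpha - eta_q)
```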
If the variance of x is unknown, we use as test statistic the RV

   q = (x̄ − η₀)/(s/√n)        (10-12)

where s² is the sample variance of x. Under hypothesis H₀, this RV has a t(n − 1) distribution. We can therefore use (10-8), provided that we set q_u equal to the t_u(n − 1) percentile. To find β(η), we must determine the distribution of q for η ≠ η₀. This is a noncentral Student t distribution introduced in Chapter 7 [see (7A-9)].
Example 10.2
The mean η of past SAT scores in a school district equals 560. A class of 25 students is taught by a new instructor. The students take the test, and their scores xᵢ yield a sample mean x̄ = 569 and a sample standard deviation s = 30. Assuming normality, test the hypothesis η = 560 against the hypothesis η ≠ 560 with significance level α = .05. In this problem,

   H₀: η = η₀ = 560     H₁: η ≠ η₀
   q = (569 − 560)/(30/√25) = 1.5     t_{1−α/2}(24) = t_{.975}(24) = 2.06

Thus q is in the interval (−2.06, 2.06) of acceptance of H₀; hence, we accept the hypothesis that the average scores did not change. •
EQUALITY OF TWO MEANS   We have two normal RVs x and y, and we wish to test the hypothesis that their means are equal:

   H₀: η_x = η_y        (10-13)

As in the problem of estimating the difference of two means, this test has two aspects.

Paired Samples   Suppose, first, that both RVs can be observed at each trial. We then have n pairs of samples (xᵢ, yᵢ), as in (9-39). In this case, the RV

   w = x − y

is also observable, and its samples equal wᵢ = xᵢ − yᵢ. Under hypothesis H₀, the mean η_w = η_x − η_y of w equals 0. Hence, (10-13) is equivalent to the hypothesis H₀: η_w = 0. We thus have a special case of the test of a mean considered earlier. Proceeding as in (10-9), we form the sample mean w̄ = x̄ − ȳ and use as test statistic the ratio

   q = w̄/(σ_w/√n) = (x̄ − ȳ)/(σ_w/√n)     where     σ_w² = σ_x² + σ_y² − 2μ_xy        (10-14)

Under hypothesis H₀, this RV is N(0, 1); hence, the test of H₀ against one of the alternative hypotheses η_w ≠ 0, η_w > 0, η_w < 0 is specified by (10-11), provided that we replace q by the ratio √n(x̄ − ȳ)/σ_w and η_q by √n(η_x − η_y)/σ_w.
If σ_w is unknown, we use as test statistic the ratio

   q = w̄/(s_w/√n)        (10-15)

where s_w² is the sample variance of the RV w, and proceed as in (10-12).
Example 10.3
The RVs x and y are defined on the same experiment 𝒮. We know that σ_x = 4, σ_y = 6, and μ_xy = 13.5, and we wish to test the hypothesis η_x = η_y against the hypothesis η_x ≠ η_y. Our decision will be based on the values of the 100 paired samples (xᵢ, yᵢ). We average these samples and find x̄ = 90.7, ȳ = 89.
In this problem, n = 100, 1 − α/2 = .975,

   σ_w² = 36 + 16 − 27 = 25     w̄ = 1.7     w̄/(σ_w/√n) = 3.4

Since 3.4 > z_{.975} = 2, we reject the hypothesis that η_x = η_y. •
Sequential Samples   We assume now that the RVs x and y cannot be sampled in pairs. This case arises in applications involving RVs that are observed under different experimental conditions. In this case, the available observations are n + m independent samples, as in (9-41). Using these samples, we form their sample means x̄, ȳ and the RV

   w̄ = x̄ − ȳ        (10-16)
Known Variances   Suppose, first, that the parameters σ_x and σ_y are known. To test the hypothesis η_x = η_y, we form the test statistic

   q = w̄/σ_w̄     where     σ_w̄² = σ_x²/n + σ_y²/m        (10-17)

If H₀ is true, then η_w = η_x − η_y = 0; hence, the RV q is N(0, 1). We can therefore use (10-11) with

   η_q = η_w/σ_w̄ = (η_x − η_y)/σ_w̄
Example 10.4
A radio transmitter transmits a signal of frequency η. We wish to establish whether the values η_x and η_y of η on two consecutive days are different. To do so, we measure η 20 times the first day and 10 times the second day. The average of the first day's readings xᵢ = η_x + νᵢ equals 98.3 MHz, and the average of the second day's readings yᵢ = η_y + νᵢ equals 98.32 MHz. The measurement errors νᵢ are the samples of a normal RV ν with η_ν = 0 and σ_ν = 0.04 MHz. Based on these measurements, can we reject with significance level .05 the hypothesis that η_x = η_y?
In this problem, n = 20, m = 10, z_{.975} ≈ 2, x̄ = 98.3, ȳ = 98.32,

   σ_w̄ = σ_ν √(1/n + 1/m) = 0.015     w̄ = −0.02     |w̄|/σ_w̄ = 1.3

Since 1.3 is inside the acceptance interval (−2, 2), we accept the null hypothesis. •
Unknown Variances   Suppose that the parameters σ_x and σ_y are unknown. We shall carry out the test under the assumption that σ_x = σ_y = σ. As we have noted in Section 9-2, only then is the distribution of the test statistic independent of σ. Guided by (9-47), we compute the sample variances s_x² and s_y² of x and y, respectively, and form the test statistic

   q = w̄/σ̂_w̄     where     σ̂_w̄² = [(n − 1)s_x² + (m − 1)s_y²]/(n + m − 2) · (1/n + 1/m)        (10-18)

is the estimate of the unknown variance σ_w̄² of the RV w̄ = x̄ − ȳ [see (9-46)]. Under the null hypothesis, this statistic is identical to the estimator in (9-47) because if η_x = η_y, then η_w = 0. Therefore, its density is t(n + m − 2). This leads to the following test: Form the n + m samples xᵢ and yⱼ; compute x̄, ȳ, s_x², s_y²; compute σ̂_w̄ from (10-18); use the test (10-11) with

   q = (x̄ − ȳ)/σ̂_w̄     η_q = (η_x − η_y)/σ̂_w̄

Replace the normal percentile z_u by t_u(n + m − 2).
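The pooled two-sample procedure can be sketched in Python as follows; this is only an illustration of (10-18), with SciPy supplying the t percentile and an invented helper name.

```python
import math
from scipy.stats import t

def two_sample_t_test(x, y, alpha=0.05):
    """Pooled two-sample test of eta_x = eta_y assuming equal, unknown variances."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)   # sample variances
    sy2 = sum((v - ybar) ** 2 for v in y) / (m - 1)
    # pooled estimate of the variance of xbar - ybar, as in (10-18)
    var_w = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2) * (1 / n + 1 / m)
    q = (xbar - ybar) / math.sqrt(var_w)
    c = t.ppf(1 - alpha / 2, n + m - 2)               # two-sided percentile
    return q, c, abs(q) <= c                          # accept H0 iff |q| <= c
```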
Example 10.5
The RVs x and y model the IQ scores of boys and girls in a school district. We wish to compare their means. To do so, we examine 40 boys and 30 girls and find the following averages:

   Boys: x̄ = 95, s_x = 6     Girls: ȳ = 93, s_y = 8

Assuming that the RVs x and y are normal with equal variances, we shall test the hypothesis H₀: η_x = η_y against H₁: η_x ≠ η_y with α = .05.
In this problem, n = 40, m = 30, t_{.975}(68) ≈ 2,

   σ̂_w̄² = (39s_x² + 29s_y²)/68 · (1/40 + 1/30)     σ̂_w̄ ≈ 1.64     q = (95 − 93)/1.64 = 1.22

Since 1.22 < 2, we accept the hypothesis that the mean IQ scores of boys and girls are equal. •
Mean-Dependent Variances   The preceding results of this section cannot be used if the distributions of x and y depend on a single parameter. In such problems, we must develop special techniques for each case. We shall consider next exponential distributions and Poisson distributions.

Exponential Distributions   Suppose that

   f_x(x) = (1/θ₀) e^{−x/θ₀} U(x)     f_y(y) = (1/θ₁) e^{−y/θ₁} U(y)

In this case, η_x = θ₀ and η_y = θ₁; hence, our problem is to test the hypothesis H₀: θ₁ = θ₀ against H₁: θ₁ ≠ θ₀. Clearly, the RVs x/θ₀ and y/θ₁ have a χ²(2) distribution. From this it follows that the sum nx̄/θ₀ of the n samples xᵢ/θ₀ has a χ²(2n) distribution and the sum mȳ/θ₁ of the m samples yᵢ/θ₁ has a χ²(2m) distribution [see (7-87)]. Therefore, the ratio

   (nx̄/2nθ₀)/(mȳ/2mθ₁) = θ₁x̄/θ₀ȳ

has a Snedecor F(n, m) distribution.
We shall use as test statistic the ratio q = x̄/ȳ. Under hypothesis H₀, q has an F(n, m) distribution. This leads to the following test.
1. Form the sample means x̄ and ȳ and compute their ratio.
2. Select α and find the α/2 and 1 − α/2 percentiles of the F(n, m) distribution.
3. Accept H₀ iff F_{α/2}(n, m) < x̄/ȳ < F_{1−α/2}(n, m).

Example 10.6
We wish to examine whether two shipments of transistors have the same mean time to failure. We select 9 units from the first shipment and 16 units from the second and find that

   x₁ + · · · + x₉ = 324 hours     y₁ + · · · + y₁₆ = 380 hours

Assuming exponential distributions, test the hypothesis θ₁ = θ₀ against θ₁ ≠ θ₀ with α = .1.
In this problem,

   F_{.05}(9, 16) = 0.334     F_{.95}(9, 16) = 2.54     x̄ = 36     ȳ = 23.75

Since x̄/ȳ = 1.52 is between 0.334 and 2.54, we accept the hypothesis that θ₁ = θ₀. •
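The ratio test for two exponential means can be sketched as below; this is only an illustration that follows the F(n, m) prescription stated in the text, and the function name and SciPy usage are my own assumptions.

```python
from scipy.stats import f

def exponential_mean_ratio_test(x, y, alpha=0.10):
    """Test eta_x = eta_y for two exponential samples using q = xbar/ybar."""
    n, m = len(x), len(y)
    q = (sum(x) / n) / (sum(y) / m)
    lo = f.ppf(alpha / 2, n, m)        # F_{alpha/2}(n, m)
    hi = f.ppf(1 - alpha / 2, n, m)    # F_{1-alpha/2}(n, m)
    return q, (lo, hi), lo < q < hi    # accept H0 iff q falls inside (lo, hi)
```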
Probabilities
Given an event 𝒜 with probability p = P(𝒜), we shall test the hypothesis p = p₀ against p ≠ p₀, p > p₀, or p < p₀, using as test statistic the number k of successes of 𝒜 in n trials. We start with the first case:

   H₀: p = p₀     against     H₁: p ≠ p₀        (10-19)

As we know, k is a binomially distributed RV with mean np and variance npq. This leads to the following test: Select the significance level α. Compute the largest integer k₁ and the smallest integer k₂ such that

   Σ_{k=0}^{k₁} C(n, k) p₀^k q₀^{n−k} < α/2     Σ_{k=k₂}^{n} C(n, k) p₀^k q₀^{n−k} < α/2        (10-20)

Determine the number k of successes of 𝒜 in n trials. Accept H₀ iff k₁ ≤ k ≤ k₂. The resulting OC function equals

   β(p) = P{k₁ ≤ k ≤ k₂ | H₁} = Σ_{k=k₁}^{k₂} C(n, k) p^k q^{n−k}

For large n, we can use the normal approximation. This yields

   k₁ = np₀ + z_{α/2}√(np₀q₀)     k₂ = np₀ − z_{α/2}√(np₀q₀)
   β(p) = G((k₂ − np)/√(npq)) − G((k₁ − np)/√(npq))        (10-21)
One-Sided Tests   We shall now test the hypothesis

   H₀: p = p₀     against     H₁: p > p₀        (10-22)

The case against p < p₀ is treated similarly. We determine the smallest integer k₂ such that

   Σ_{k=k₂}^{n} C(n, k) p₀^k q₀^{n−k} < α

and we accept H₀ iff k < k₂. For large n,

   k₂ = np₀ + z_{1−α}√(np₀q₀)     β(p) = G((k₂ − np)/√(npq))        (10-23)

Note that (10-22) is equivalent to the test of the composite hypothesis H₀′: p ≤ p₀ against H₁′: p > p₀.
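A sketch of the exact two-sided version (10-20) is shown below; the search for k₁ and k₂ with SciPy's binomial CDF and the helper name are illustrative assumptions rather than the book's procedure verbatim.

```python
from scipy.stats import binom

def binomial_two_sided_test(k, n, p0, alpha=0.05):
    """Exact two-sided test of p = p0 from k successes in n trials, following (10-20)."""
    # k1: largest integer with P{k <= k1 | p0} < alpha/2
    k1 = max((j for j in range(n + 1) if binom.cdf(j, n, p0) < alpha / 2), default=-1)
    # k2: smallest integer with P{k >= k2 | p0} < alpha/2
    k2 = min((j for j in range(n + 2) if 1 - binom.cdf(j - 1, n, p0) < alpha / 2),
             default=n + 1)
    return k1, k2, k1 <= k <= k2            # accept H0 iff k1 <= k <= k2
```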
Example 10.7
We toss a given coin 100 times and observe that heads shows 64 times. Does this evidence support rejection of the fair-coin hypothesis with significance level .05? In this problem, np₀ = 50, √(np₀q₀) = 5, z_{.025} = −2; hence, the region of acceptance of H₀ is the interval

   np₀ ± 2√(np₀q₀) = 50 ± 10

Since 64 is outside this interval, we reject the fair-coin hypothesis. The resulting OC function equals

   β(p) = G((60 − 100p)/(10√(pq))) − G((40 − 100p)/(10√(pq)))     •
Rare Events   Suppose now that p₀ << 1. If n is so large that np₀ >> 1, we can use (10-21). We cannot do so if np₀ is of the order of 1. In this case, we use the Poisson approximation

   C(n, k) p^k q^{n−k} ≈ e^{−λ} λ^k/k!     λ = np        (10-24)

Applying (10-24) to (10-19) and (10-20), we obtain the following test.
Set λ₀ = np₀ and compute the largest integer k₁ and the smallest integer k₂ such that

   e^{−λ₀} Σ_{k=0}^{k₁} λ₀^k/k! < α/2     e^{−λ₀} Σ_{k=k₂}^{∞} λ₀^k/k! < α/2        (10-25)

Accept H₀ iff k₁ ≤ k ≤ k₂. The resulting OC function equals

   β(λ) = e^{−λ} Σ_{k=k₁}^{k₂} λ^k/k!     λ = np        (10-26)

The one-sided alternatives p > p₀ and p < p₀ lead to similar results. Here is an example.
Example 10.8
A factory has been making radios using process A. Of the completed units, 0.6% are defective. A new process is introduced, and of the first 1,000 units, 5 are defective. Using α = 0.1 as the significance level, test the hypothesis that the new process is better than the old one.
In this problem, n = 1,000, λ₀ = 6, and the objective is to test the hypothesis H₀: p = p₀ = .006 against H₁: p < p₀. To do so, we determine the largest integer k₁ such that

   e^{−λ₀} Σ_{k=0}^{k₁} λ₀^k/k! < α        (10-27)

and we accept H₀ iff k > k₁. In our case, k = 5, α = .1, and (10-27) yields k₁ = 2 < 5; hence, we accept the null hypothesis. •
EQUALITY OF TWO PROBABILITIES   In the foregoing discussion, we assumed that the value p₀ of p under the null hypothesis was known. We now assume that p₀ is unknown. This leads to the following problem: Given two events 𝒜₀ and 𝒜₁ with probabilities p₀ = P(𝒜₀) and p₁ = P(𝒜₁), respectively, we wish to test the hypothesis H₀: p₁ = p₀ against H₁: p₁ ≠ p₀. To do so, we perform the experiment n₀ + n₁ times, and we denote by k₀ the number of successes of 𝒜₀ in the first n₀ trials and by k₁ the number of successes of 𝒜₁ in the following n₁ trials. The observations are sequential; hence, the RVs k₀ and k₁ are independent. Sequential sampling is essential if 𝒜₀ models a physical event under certain experimental conditions (a defective component in a manufacturing process, for example), and 𝒜₁ models the same physical event under modified conditions.
We shall use as our test statistic the RV

   q = k₀/n₀ − k₁/n₁        (10-28)

and, to simplify the analysis, we shall assume that the samples are large. With this assumption, the RV q is normal with

   η_q = p₀ − p₁     σ_q² = p₀q₀/n₀ + p₁q₁/n₁        (10-29)

Under the null hypothesis, p₀ = p₁; hence,

   η_q = 0     σ_q² = p₀q₀(1/n₀ + 1/n₁)        (10-30)

This shows that we cannot use q to determine the critical region of the test because the numbers p₀ and q₀ are unknown. To avoid this difficulty, we replace in (10-30) the unknown parameters p₀ and q₀ by their empirical estimates p̂₀ and q̂₀ obtained under the null hypothesis.
To find p̂₀ and q̂₀, we observe that if p₀ = p₁, then the events 𝒜₀ and 𝒜₁ have the same probability. We can therefore interpret the sum k₀ + k₁ as the number of successes of the same event 𝒜₀ or 𝒜₁ in n₀ + n₁ trials. This yields the empirical estimates

   p̂₀ = (k₀ + k₁)/(n₀ + n₁) = 1 − q̂₀     σ̂_q² = p̂₀q̂₀(1/n₀ + 1/n₁)        (10-31)

where σ̂_q is the corresponding estimate of σ_q. Thus under the null hypothesis, the RV q is N(0, σ̂_q); hence, its u-percentile equals z_u σ̂_q. Applying (10-8a) to our case, we obtain the following test:

1. Determine the numbers k₀ and k₁ of successes of the events 𝒜₀ and 𝒜₁, respectively.
2. Compute the sample q = k₀/n₀ − k₁/n₁ of q.
3. Compute σ̂_q from (10-31).
4. Accept H₀ iff z_{α/2} σ̂_q ≤ q ≤ −z_{α/2} σ̂_q.

Under hypothesis H₁, the RV q is normal with η_q and σ_q as in (10-29). Assigning specific values to p₀ and p₁, we determine the OC function of the test from (10-8a).
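A minimal sketch of this two-proportion test follows; the helper name and the use of SciPy are illustrative assumptions.

```python
import math
from scipy.stats import norm

def two_proportion_test(k0, n0, k1, n1, alpha=0.05):
    """Large-sample test of p1 = p0 using the pooled estimate (10-31)."""
    q = k0 / n0 - k1 / n1                       # test statistic (10-28)
    p_hat = (k0 + k1) / (n0 + n1)               # pooled probability estimate
    sigma_q = math.sqrt(p_hat * (1 - p_hat) * (1 / n0 + 1 / n1))
    c = -norm.ppf(alpha / 2) * sigma_q          # = z_{1-alpha/2} * sigma_q
    return q, c, abs(q) <= c                    # accept H0 iff |q| <= c
```

Called with the exit-poll data of Example 10.9, two_proportion_test(99, 200, 45, 120) gives q = 0.12 and a threshold of about 0.113, so H₀ is rejected, in line with the example.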
Example 10.9
In a national election, we conducted an exit poll and found that of 200 men, 99 voted Republican and of 120 women, 45 voted Republican. Based on these results, we wish to test the hypothesis that the probability p₁ of female voters equals the probability p₀ of male voters with significance level α = .05.
In this problem,

   q = 99/200 − 45/120 = .495 − .375 = .12     p̂₀ = (99 + 45)/(200 + 120) = .45     q̂₀ = .55
   σ̂_q = √(.45 × .55 × (1/200 + 1/120)) = 0.057     z_{.975} ≈ 2

Since 0.12 > z_{.975} σ̂_q = 0.114, the hypothesis that p₁ = p₀ is rejected, but barely. •
Poisson Distributions
The RV x is Poisson-distributed with mean λ. We wish to test the hypothesis H₀: λ = λ₀ against H₁: λ ≠ λ₀. To do so, we form the n samples xᵢ and use as test statistic their sum:

   q = x₁ + · · · + xₙ

This sum is a Poisson-distributed RV with mean nλ. We next determine the largest integer k₁ and the smallest integer k₂ such that

   e^{−nλ₀} Σ_{k=0}^{k₁} (nλ₀)^k/k! < α/2     e^{−nλ₀} Σ_{k=k₂}^{∞} (nλ₀)^k/k! < α/2        (10-32)

The left sides of these inequalities equal P{q ≤ k₁ | H₀} and P{q ≥ k₂ | H₀}, respectively. This leads to the following test: Find the sum q = x₁ + · · · + xₙ of the observations xᵢ; accept H₀ iff k₁ ≤ q ≤ k₂. The resulting OC function equals

   β(λ) = e^{−nλ} Σ_{k=k₁}^{k₂} (nλ)^k/k!        (10-33)

For large n, we can use the normal approximation. Under hypothesis H₀, η_q = nλ₀ and σ_q² = nλ₀. Hence [see (10-11)],

   Accept H₀ iff nλ₀ − z_{1−α/2}√(nλ₀) ≤ q ≤ nλ₀ + z_{1−α/2}√(nλ₀)        (10-34)
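The large-sample form (10-34) translates directly into code; the following is only a sketch, with SciPy used for the normal percentile and an invented function name.

```python
import math
from scipy.stats import norm

def poisson_mean_test(samples, lam0, alpha=0.05):
    """Large-sample test of lambda = lam0 for Poisson data, as in (10-34)."""
    n = len(samples)
    q = sum(samples)                              # test statistic: sum of the samples
    half_width = -norm.ppf(alpha / 2) * math.sqrt(n * lam0)
    lo, hi = n * lam0 - half_width, n * lam0 + half_width
    return q, (lo, hi), lo <= q <= hi             # accept H0 iff q is inside the interval
```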
Example 10.10
The weekly highway accidents in a certain region are the values of a Poisson-distributed RV with mean λ = λ₀ = 5. We wish to examine whether a change in the speed limit from the present 55 mph to 65 mph will have any effect on λ. To do so, we monitor the number xᵢ of weekly accidents over a 10-week period and obtain

   xᵢ = 5  2  9  6  3  7  9  6  8  ...

Choosing α = .05, we shall test the hypothesis λ = 5 against λ ≠ 5.
In this problem, z_{α/2} = −2, nλ₀ = 50,

   nλ₀ ± 2√(nλ₀) ≈ 50 ± 15     q = Σ xᵢ = 56

Since 56 is between 35 and 65, we accept the λ = 5 hypothesis. •
EQUALITY OF TWO POISSON DISTRIBUTIONS   We have two Poisson-distributed RVs x and y with means λ₀ and λ₁, respectively, and we wish to test the hypothesis H₀: λ₁ = λ₀ against H₁: λ₁ ≠ λ₀, λ₁ > λ₀, or λ₁ < λ₀, where λ₀ and λ₁ are two unknown constants. Our test will be based on the n₀ + n₁ samples xᵢ and yᵢ, obtained sequentially.
The exact solution of this problem is rather involved because the variances of x and y depend on their means. To simplify the analysis, we shall consider only large samples. This leads to two simplifications: We can assume that the RVs x̄ and ȳ are normal, and we can replace the variances σ_x² = λ₀ and σ_y² = λ₁ by their empirical estimates.
Under hypothesis H₀, λ₁ = λ₀; hence, the RVs x and y have the same distribution. We can therefore interpret the n₀ + n₁ observations xᵢ and yᵢ as the samples of the same RV x or y. The resulting empirical estimate of λ₀ is

   λ̂₀ = (Σᵢ xᵢ + Σᵢ yᵢ)/(n₀ + n₁)

Proceeding as in (10-17), we form the sample means x̄ and ȳ of the RVs x and y and use as test statistic the ratio

   q = (x̄ − ȳ)/σ̂_w̄     where     σ̂_w̄² = λ̂₀(1/n₀ + 1/n₁)        (10-35)

is the empirical estimate of the variance of the difference w̄ = x̄ − ȳ. From this it follows that if H₀ is true, then q is approximately N(0, 1). We can therefore use the test (10-11).
In particular, (10-11a) yields the following test of the hypothesis H₀: λ₁ = λ₀ against H₁: λ₁ ≠ λ₀. Compute x̄, ȳ, and σ̂_w̄;

   Accept H₀ iff |x̄ − ȳ|/σ̂_w̄ < −z_{α/2}
Example 10.11
The numbers of absent boys and girls per week in a certain school district are the values xᵢ and yᵢ of two Poisson-distributed RVs x and y with means λ₀ and λ₁, respectively. We monitored the boys for 10 weeks and the girls for 8 weeks; the weekly counts sum to Σ xᵢ = 180 and Σ yᵢ = 112, so that x̄ = 18 and ȳ = 14. We shall test the hypothesis λ₁ = λ₀ against λ₁ ≠ λ₀ with significance level α = .05.
In this problem, −z_{α/2} ≈ 2,

   λ̂₀ = 292/18 = 16.2     σ̂_w̄² = 16.2(1/10 + 1/8)     σ̂_w̄ = 1.91
   q = (x̄ − ȳ)/σ̂_w̄ = 4/1.91 = 2.09 > 2

Hence, we reject the null hypothesis, but barely. •
Variance and Correlation
Given an N(η, σ) RV x, we wish to test the hypothesis

   H₀: σ = σ₀     against     H₁: σ ≠ σ₀, σ > σ₀, or σ < σ₀

Suppose, first, that the mean η of x is known. In this case, we use as test statistic the RV

   q = Σᵢ₌₁ⁿ (xᵢ − η)²/σ₀²        (10-36)

Under hypothesis H₀, the RV q is χ²(n). We can therefore use the tests (10-8) where q_u is now the χ²_u(n) percentile. To find the corresponding OC function β(σ), we must determine the distribution of q under hypothesis H₁. As we show in Problem 10-10, this is a scaled version of the χ²(n) density.
Example 10.12
We wish to compare the accuracies of two measuring instruments. The error ν₀ of the first instrument is an N(0, σ₀) RV where σ₀ = 0.1 mm, and the error ν₁ of the second is an N(0, σ) RV with unknown σ. Our objective is to test the hypothesis H₀: σ = σ₀ = 0.1 mm against H₁: σ ≠ σ₀ with α = .05. To do so, we measure a standard object of length η = 8 mm 10 times and record the samples

   xᵢ = 8.15  7.93  8.22  8.04  7.85  7.95  8.06  8.12  7.86  7.92

of the RV x = η + ν. Inserting the results into (10-36), we obtain

   q = Σᵢ₌₁¹⁰ (xᵢ − 8)²/0.01 = 14.64

From Table 2 we find the percentiles χ²_{.025}(10) = 3.25 and χ²_{.975}(10) = 20.48. Since 14.64 is in the interval (3.25, 20.48) of acceptance of H₀, we accept the hypothesis that σ = σ₀. •
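A short sketch of this chi-square variance test is given below; the function name and SciPy usage are my own illustrative choices.

```python
from scipy.stats import chi2

def variance_test_known_mean(x, eta, sigma0, alpha=0.05):
    """Two-sided test of sigma = sigma0 for normal data with known mean, as in (10-36)."""
    n = len(x)
    q = sum((v - eta) ** 2 for v in x) / sigma0 ** 2   # chi-square(n) under H0
    lo = chi2.ppf(alpha / 2, n)
    hi = chi2.ppf(1 - alpha / 2, n)
    return q, (lo, hi), lo <= q <= hi                  # accept H0 iff q is inside
```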
If η is unknown, we use as test statistic the sum

   q = Σᵢ₌₁ⁿ (xᵢ − x̄)²/σ₀²     where     x̄ = (1/n) Σᵢ xᵢ        (10-37)

Under hypothesis H₀, the RV q is χ²(n − 1). We can therefore again use (10-8) where q_u is now the χ²_u(n − 1) percentile.
Example 10.13
Suppose now that the length η of the measured object in Example 10.12 is unknown. Inserting the 10 measurements xᵢ into (10-37), we obtain

   x̄ = 8.01     q = Σᵢ₌₁¹⁰ (xᵢ − x̄)²/0.01 = 14.54

From Table 2, we find that χ²_{.025}(9) = 2.70 and χ²_{.975}(9) = 19.02. Since 14.54 is in the interval (2.70, 19.02), we accept the hypothesis that σ = σ₀. •
Equality of Two Variances   The RV x is N(η_x, σ_x), and the RV y is N(η_y, σ_y). We shall test the hypothesis

   H₀: σ_x = σ_y     against     H₁: σ_x ≠ σ_y, σ_x > σ_y, or σ_x < σ_y

Suppose, first, that the two means η_x and η_y are known. In this case, we use as test statistic the ratio

   q = [(1/n) Σᵢ₌₁ⁿ (xᵢ − η_x)²] / [(1/m) Σᵢ₌₁ᵐ (yᵢ − η_y)²]        (10-38)

where xᵢ are the n samples of x and yᵢ are the m samples of y obtained sequentially. From (7-106) it follows that under hypothesis H₀, the RV q has a Snedecor F(n, m) distribution. We can therefore use the tests (10-8) where q_u is the F_u(n, m) percentile.
If the means η_x and η_y are unknown, we use as test statistic the ratio

   q = s_x²/s_y²        (10-39)

where s_x² is the sample variance of x obtained with the n samples xᵢ and s_y² is the sample variance of y obtained with the m samples yᵢ. From (7-106) it follows that if σ_x = σ_y, then the RV q is F(n − 1, m − 1). We can therefore apply the tests (10-8) where q_u is the F_u(n − 1, m − 1) percentile.
Example 10.14
We wish to compare the accuracies of two voltmeters. To do so, we measure an unknown voltage 10 times with the first instrument and 17 times with the second instrument. We compute the sample variances and find that

   s_x = 8 μV     s_y = 6.4 μV

Using these data, we shall test the hypothesis H₀: σ_x = σ_y against H₁: σ_x ≠ σ_y with significance level α = .1. In this problem,

   q = s_x²/s_y² = 1.56     n = 10     m = 17

From Table 4 we find

   F_{.95}(9, 16) = 2.54     F_{.05}(9, 16) = 1/F_{.95}(16, 9) = 0.334

Since 1.56 is in the interval (0.334, 2.54), we accept the hypothesis that σ_x = σ_y. •
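The variance-ratio test (10-39) can be sketched as follows; the function name and SciPy usage are illustrative assumptions.

```python
from scipy.stats import f

def variance_ratio_test(sx2, n, sy2, m, alpha=0.10):
    """Two-sided F test of sigma_x = sigma_y from sample variances, as in (10-39)."""
    q = sx2 / sy2                                # F(n-1, m-1) under H0
    lo = f.ppf(alpha / 2, n - 1, m - 1)
    hi = f.ppf(1 - alpha / 2, n - 1, m - 1)
    return q, (lo, hi), lo <= q <= hi            # accept H0 iff q is inside

# variance_ratio_test(64, 10, 40.96, 17) reproduces the numbers of Example 10.14.
```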
Correlation   We wish to investigate whether two RVs x and y are uncorrelated. With r their correlation coefficient, our problem is to test the hypothesis H₀: r = 0 against H₁: r ≠ 0. To solve this problem, we perform the underlying experiment n times and obtain the n paired samples (xᵢ, yᵢ). With these samples, we form the estimate r̂ of r as in (9-55) and the corresponding value z of Fisher's auxiliary variable z as in (9-56):

   r̂ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √[Σᵢ (xᵢ − x̄)² Σᵢ (yᵢ − ȳ)²]     z = (1/2) ln[(1 + r̂)/(1 − r̂)]        (10-40)

We shall use as test statistic the RV

   q = z√(n − 3)        (10-41)

Under hypothesis H₀, the RV z is N(0, 1/√(n − 3)) [see (9-57)]. Hence, the test statistic q is N(0, 1). We can therefore use the test (10-11) directly.
Example 10.15
We wish to examine whether the freshman grades x and the senior grades y are correlated. For this purpose, we examine the grades xᵢ and yᵢ of n = 67 students and compute r̂, z, and q from (10-40) and (10-41). The results are

   r̂ = 0.462     z = 0.5     q = 4

To test the hypothesis r = 0 against r ≠ 0, we apply (10-8a). Since q = 4 > |z_{α/2}| = 2, we conclude that freshman and senior grades are correlated. •
Distributions
We wish to examine whether the distribution function F(x) of an RV x equals a given function F₀(x). Later we present a method based on the χ² test. The following method is based on the empirical distribution (9-64).

KOLMOGOROFF-SMIRNOV TEST   Our purpose is to test the hypothesis

   H₀: F(x) = F₀(x)     against     H₁: F(x) ≠ F₀(x)        (10-42)

For this purpose, we form the empirical estimate F̂(x) of F(x) as in (9-64) and use as test statistic the maximum distance between F̂(x) and F₀(x):

   q = max |F̂(x) − F₀(x)|        (10-43)

If H₀ is true, E{F̂(x)} = F₀(x); hence q is small. It is therefore reasonable to reject H₀ iff the observed value q of q is larger than some constant c. To complete the test, it thus suffices to find c such that P{q > c | H₀} = α. Under hypothesis H₀, the statistic q equals the RV w in (9-66). Hence [see (9-68)],

   α = P{q > c | H₀} ≈ 2e^{−2nc²}

This yields the following test: Using the samples xᵢ of x, form the empirical estimate F̂(x) of F(x); plot the difference F̂(x) − F₀(x) (Fig. 10.3), and evaluate q from (10-43):

   Accept H₀ iff q < √[(1/2n) ln(2/α)] = c        (10-44)
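A minimal sketch of the decision rule (10-44) follows; F0 is assumed to be a user-supplied distribution function (for example a SciPy cdf), and the helper name is invented.

```python
import math

def ks_test(samples, F0, alpha=0.05):
    """Kolmogoroff-Smirnov test of F(x) = F0(x), using the bound in (10-44)."""
    x = sorted(samples)
    n = len(x)
    # maximum distance between the empirical distribution and F0
    q = max(max(abs((i + 1) / n - F0(v)), abs(i / n - F0(v)))
            for i, v in enumerate(x))
    c = math.sqrt(math.log(2 / alpha) / (2 * n))
    return q, c, q < c                     # accept H0 iff q < c
```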
[Figure 10.3: the difference F̂(x) − F₀(x) plotted against x for the data of Example 10.16.]
Example 10.16
We shall test the hypothesis that the tabulated IQ scores of Example 9.19 are the samples of a normal RV x with η = 100 and σ = 10. In the following table, we list the bunched samples xⱼ, the corresponding values of the empirical distribution F̂(x), the normal distribution F₀(x), and their distance Δ(x) = |F̂(x) − F₀(x)| (see also Fig. 10.3).

   xⱼ        75     80     85     90     95     100    105    110    115    120    125
   F̂(xⱼ)   .025   .075   .150   .275   .425   .625   .775   .875   .925   .975   1.00
   F₀(xⱼ)  .006   .023   .067   .159   .308   .500   .691   .841   .933   .977   .994
   Δ(xⱼ)   .019   .052   .083   .116   .117   .125   .084   .034   .008   .002   .006

As we see from this table, Δ(x) is maximum for x = 100. Thus

   c = √[(1/80) ln(2/.05)] = .217     q = Δ(100) = .125

Since .125 < .217, we accept the hypothesis that the RV x is N(100, 10). •
EQUALITY OF TWO DISTRIBUTIONS   We are given two independent RVs x and y with continuous distributions, and we wish to test the hypothesis

   H₀: F_x(w) = F_y(w)     against     H₁: F_x(w) ≠ F_y(w)        (10-45)

We shall give a simple test based on the special case p₀ = .5 of the test (10-19) of the equality of two probabilities.

Sign Test   From the independence of the RVs x and y it follows that if H₀ is true, then

   f_x(x) f_y(y) = f_x(y) f_y(x)

This shows that the joint density f_xy(x, y) is symmetrical with respect to the line x = y; hence, the probability masses above and below this line are equal. In other words,

   P{x > y} = P{x < y} = .5        (10-46)

Thus under hypothesis H₀, the probability p = P(𝒜) of the event 𝒜 = {x > y} equals .5. To test (10-45), we shall test first the hypothesis

   H₀′: p = .5     against     H₁′: p ≠ .5        (10-47)

To do so, we proceed as in (10-20) with p₀ = .5: Select a significance level α′, and compute the largest integer k₁ and the smallest integer k₂ such that

   Σ_{k=0}^{k₁} C(n, k)(.5)ⁿ < α′/2     Σ_{k=k₂}^{n} C(n, k)(.5)ⁿ < α′/2        (10-48)

In this case, k₂ = n − k₁ because C(n, k) = C(n, n − k). For large n, we can use the normal approximation (10-21). With p₀ = .5, this yields

   k₁ = n/2 − z_{1−α′/2} √n/2     k₂ = n/2 + z_{1−α′/2} √n/2        (10-49)

Form the n paired samples (xᵢ, yᵢ), ignoring all trials such that xᵢ = yᵢ. Determine the number k of trials such that xᵢ > yᵢ. This number equals the number of times the event 𝒜 occurs.

   Reject H₀′ iff k < k₁ or k > k₂        (10-50)

If H₀′ is false, then H₀ is also false. This leads to the following test of H₀: Compute k, k₁, and k₂ as earlier.

   Reject H₀ iff k < k₁ or k > k₂        (10-51)

Note   The tests of H₀ and H₀′ are not equivalent because H₀′ might be true even when H₀ is false. In fact, the corresponding error probabilities α, β and α′, β′ are different. Indeed, a Type I error occurs if H₀′ is rejected when true. In this case, H₀ might or might not be true; hence, α < α′. A Type II error occurs if H₀′ is not rejected when false. In this case, H₀ is also false; however, H₀ might be false even when H₀′ is true; hence, β > β′.
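The large-sample sign test can be sketched as below; the function name and SciPy usage are illustrative assumptions.

```python
import math
from scipy.stats import norm

def sign_test(k, n, alpha_prime=0.05):
    """Large-sample sign test (10-49)-(10-51): k trials with x > y out of n untied pairs."""
    half = -norm.ppf(alpha_prime / 2) * math.sqrt(n) / 2
    k1, k2 = n / 2 - half, n / 2 + half
    return (k1, k2), not (k1 <= k <= k2)      # True means reject H0
```

For instance, sign_test(54, 80) gives an interval close to the (31, 49) of Example 10.17 and rejects H₀.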
Example 10.17
The RVs x and y model the salaries of equally qualified male and female employees in a certain industry. Using (10-50), we shall test the hypothesis that their distributions are equal. The significance level α of the test is not to exceed the value α′ = .05.
To do so, we interrogate 84 pairs of employees and find that xᵢ > yᵢ for 54 pairs, xᵢ < yᵢ for 26 pairs, and xᵢ = yᵢ for 4 pairs. Ignoring the last four cases, we have n = 80 and

   n/2 ± z_{1−α′/2} √n/2 = 40 ± 8.94

Hence, k₁ = 31 and k₂ = 49. Since k = 54 > 49, we reject the null hypothesis. •
We have assumed that the RVs xᵢ, yᵢ are i.i.d.; this is not, however, necessary. The sign test can be used even if the distributions of xᵢ and yᵢ depend on i. Here is an illustration.
Example 10.18
We wish to compare the effectiveness of two fertilizers. To do so, we select 50 farm plots, divide each plot into two parcels, and fertilize one half with the first fertilizer and the other half with the second. The resulting yields are the values (xᵢ, yᵢ) of the RVs (xᵢ, yᵢ). The distributions of these RVs vary from plot to plot. We wish to test the hypothesis that for each plot, xᵢ and yᵢ have the same distribution. Our decision is based on the following data: xᵢ > yᵢ for 12 plots and xᵢ < yᵢ for 38 plots. We proceed as in (10-49) with α′ = .05. This yields the interval

   n/2 ± z_{1−α′/2} √n/2 = 25 ± 7.07

Since k = 12 < 18 = k₁, we reject the null hypothesis. •
10-3
Quality Control
In a manufacturing process, the quality of the plant output usually depends
on the value of a measurable characteristic: the diameter of a shaft, the
inductance of a coil, the life length of a system component. Due to a variety
of "random" causes, the characteristic of interest varies from unit to unit. Its values are thus the samples of an RV x. Under normal operating conditions, this RV satisfies certain specifications. For example, its mean equals a given number η₀. If these specifications are met during the production process, we say that the plant is in control. If they are not, the plant is out of control. A plant might go out of control for a variety of reasons: machine failure, faulty material, operator errors. The purpose of quality control is to detect whether, for whatever reasons, the plant goes out of control and, if it does, to find the errors and eliminate their causes. In many cases, this involves interruption of the production process.
Quality control is a fundamental discipline in industrial engineering that involves a variety of control methods depending on the nature of the product, the cost of inspection, the consequences of a wrong decision, the complexity of the analysis, and many other factors. We shall introduce the first principles of statistical quality control. Our analysis is a direct application of hypothesis testing.
The problem of statistical quality control can be phrased as follows: The distribution of the RV x is a known function F(x, θ) depending on a parameter θ. This parameter could, for example, be the mean of x or its variance. When the plant is in control, θ equals a specified number θ₀. The null hypothesis H₀ is the assumption that the plant is in control, that is, that θ = θ₀. The alternative hypothesis H₁ is the assumption that the plant is out of control, that is, that θ ≠ θ₀. In quality control, we test the null hypothesis periodically until it is rejected. We then conclude with a certain significance level that the plant is out of control, stop the production process, and remove the cause of the error. The test proceeds as follows.
We form a sequence xᵢ of independent samples of x. These samples are the measurements of production units selected in a variety of ways. Usually they form groups consisting of n units each. The units of each group are picked either sequentially or at random, forming the samples

   x₁, . . . , xₙ, . . . , x_{mn+1}, . . . , x_{mn+n}, . . .

Using the n samples of each group, we form a test of the hypothesis H₀. We thus have a sequence of tests designed as in Section 10-2. The testing stops when the hypothesis H₀ is rejected.
Control Test Statistic
The decision whether to reject the hypothesis that the plant is in control is based on the values of a properly selected test statistic. Proceeding as before, we choose a significance level α and a function q = g(x₁, . . . , xₙ) of the n samples xᵢ. We determine an interval (c₁, c₂) such that

   P{c₁ < q < c₂ | H₀} = 1 − α        (10-52)

We form the samples

   qₘ = g(x_{m1}, . . . , x_{mn})

of q. If qₘ is between c₁ and c₂, we conclude that the plant is in control. If qₘ < c₁ or qₘ > c₂, we reject the null hypothesis; that is, we conclude that the plant is out of control. The constant c₂ is called the upper control limit (UCL); the constant c₁ is called the lower control limit (LCL). In Fig. 10.4 we show the values of qₘ as the test progresses. The process terminates when qₘ is outside the control interval (c₁, c₂). The graph so formed is called the control chart.
[Figure 10.4: a control chart; the values qₘ are plotted against m together with the control limits UCL, LCL and the warning limits (95% and 99% bands).]
This test forms the basis of quality control. To apply it to a particular problem, it suffices to select the function g(x₁, . . . , xₙ) appropriately. This we did in Section 10-2 for most parameters of interest. Special cases involving means, variances, and probabilities are summarized at the conclusion of this section.

Warning Limits   If the time between groups is long, it is recommended that the testing be speeded up when qₘ approaches the control limits. In such cases, we select a test interval (w₁, w₂) with α′ > α:

   P{w₁ < qₘ < w₂ | H₀} = 1 − α′

and we speed up the process if qₘ is in the shaded area of Fig. 10.4.
Control Error   In quality control, two types of errors occur. The first is a false alarm: The plant is in control, but qₘ crosses the control limits. The second is a faulty control: The plant goes out of control at m = m₀, but qₘ remains between the control limits, crossing the limits at a later time.

False Alarm   The control limits might be crossed even when the plant is in control. Suppose that they are crossed at the mth step (Fig. 10.5a). Clearly, m is a random number. It therefore defines an RV z taking the values 1, 2, . . . . We maintain that

   P{z = m | H₀} = (1 − α)^{m−1} α        (10-53)

Indeed, z = m iff the statistic q is in the interval (c₁, c₂) in the first m − 1 tests and it crosses to the critical region at the mth step. Hence, (10-53) follows from (10-52). Thus z has a geometric distribution, and its mean [see (4-109)] equals

   η_z = 1/α        (10-54)

This is the mean number of tests until a false alarm is given, and it suggests that if qₘ crosses the control limits for m > η_z, the plant is probably still in control. Such rare crossings might be interpreted merely as warnings.
[Figure 10.5: (a) a false alarm; the limits are crossed at the mth test (z = m); (b) faulty control; the plant goes out of control and the limits are crossed k tests later (w = k).]
Faulty Control   Suppose that the plant goes out of control at the mth test but qₘ crosses the control limits not immediately but k tests later (Fig. 10.5b). The plant is then in operation for k test intervals while out of control. Again, k is a random number specifying the RV w. This RV takes the values 1, 2, . . . with probability

   P{w = k | H₁} = β^{k−1}(θ)[1 − β(θ)]        (10-55)

where θ is the new value of the control parameter and

   β(θ) = P{c₁ < q < c₂ | H₁}

is the OC function of the test. Thus w has a geometric distribution with p = 1 − β(θ) = P(θ), and its mean equals

   η_w = 1/P(θ)        (10-56)

where P(θ) is the power of the test. If θ is distinctly different from θ₀, η_w is small. In fact, if β(θ) < .5, then η_w < 2. As the difference |θ − θ₀| decreases, η_w increases; for θ close to θ₀, η_w approaches the mean false alarm length η_z.
Control of the Mean
We shall construct a chart for controlling the mean η of x. Suppose, first, that its standard deviation σ is known. We can then use as the test statistic the ratio q in (10-9):

   q = (x̄ − η₀)/(σ/√n)     qₘ = (x̄ₘ − η₀)/(σ/√n)     x̄ₘ = (1/n) Σᵢ₌₁ⁿ x_{m,i}

When the process is in control, η = η₀; hence, q is N(0, 1), and the control limits are ±z_{α/2}. If we use as the test statistic the sample mean x̄ₘ of the samples x_{m,i} of the mth group, we obtain the chart of Fig. 10.6a with control limits

   LCL = η₀ − z_{1−α/2} σ/√n     UCL = η₀ + z_{1−α/2} σ/√n        (10-57)

[Figure 10.6: control charts for the mean; (a) the sample mean x̄ₘ with control limits η₀ ± 3σ/√n; (b) the statistic qₘ = (x̄ₘ − η₀)/(sₘ/√n) with control limits ±t_{α/2}(n − 1).]

These limits depend on α. In the applications, it is more common to specify not the significance level α but the control interval directly. A common choice is the 3σ interval η₀ ± 3σ/√n. Since

   P{η₀ − 3σ/√n < x̄ < η₀ + 3σ/√n | H₀} = .997

the resulting significance level α equals .003. The mean false alarm length of this test equals 1/α ≈ 333.
Example 10.19
A factory manufactures cylindrical shafts. The diameter of the shafts is a normal RV with σ = 0.2 mm. When the manufacturing process is in control, η = η₀ = 10 mm. Design a chart for controlling η such that the mean false alarm length η_z equals 100. Use 25 samples for each test.
In this problem, α = 1/η_z = .01, z_{.995} = 2.576, and n = 25. Inserting into (10-57), we obtain the interval

   10 ± 2.576 × 0.2/√25 = 10 ± 0.1     •
If σ is unknown, we use as the test statistic the RV q in (10-12). Thus

   qₘ = (x̄ₘ − η₀)/(sₘ/√n)     sₘ² = 1/(n − 1) Σᵢ₌₁ⁿ (x_{m,i} − x̄ₘ)²

The corresponding chart is shown in Fig. 10.6b. Since the test statistic q has a t(n − 1) distribution, the control limits are

   LCL = t_{α/2}(n − 1)     UCL = −t_{α/2}(n − 1)

Example 10.20
Design a test for controlling η when σ is unknown. The design requirements are

   CL = ±3     False alarm length η_z = 100

In this problem, α = 1/η_z = .01; hence, the only unknown is the number n of samples. From Table 3 we see that t_{.995}(n − 1) ≈ 3 if n − 1 = 13; hence, n = 14. •
In quality control, the tests are often one-sided. The corresponding
charts have then one control limit, upper or lower. Let us look at two
illustrations.
Control of the Standard Deviation
In a number of applications, it is desirable to keep the variability of the output product below a certain level. A measure of variability is the variance σ² of the control variable x. The plant is in control if σ is below a specified level σ₀; it is out of control if σ > σ₀. This leads to the one-sided test H₀: σ = σ₀ against H₁: σ > σ₀ or, equivalently, H₀: σ² = σ₀² against H₁: σ² > σ₀². To carry out this test, we shall use as the test statistic the sum in (10-37):

   q = Σᵢ₌₁ⁿ (xᵢ − x̄)²/σ₀²     qₘ = Σᵢ₌₁ⁿ (x_{m,i} − x̄ₘ)²/σ₀²

Under hypothesis H₀, the RV q has a χ²(n − 1) distribution. From this it follows that if P{q > c | H₀} = α, then c = χ²_{1−α}(n − 1); hence, the UCL equals χ²_{1−α}(n − 1), and the chart of Fig. 10.7a results.
[Figure 10.7: (a) control chart for qₘ with UCL = χ²_{1−α}(n − 1); (b) the corresponding chart for √qₘ, used to control the standard deviation.]

Note   Suppose that q is a test statistic for the parameter θ and the corresponding control limits are c₁ and c₂. If τ = r(θ) is a monotonic increasing function of θ, then r(q) is a test statistic of the parameter τ, and the corresponding control limits are r(c₁) and r(c₂). This is a consequence of the fact that the events {c₁ < q < c₂} and {r(c₁) < r(q) < r(c₂)} are equal. Applying this to the parameter θ = σ² with r(θ) = √θ, we obtain the chart of Fig. 10.7b for the control of the standard deviation σ of x.
CONTROL OF DEFECTIVE UNITS   In a number of applications, the purpose of quality control is to keep the proportion p of defective units coming out of a production line below a specified level p₀. A defective unit can be identified in a variety of ways, depending on the nature of the product. It might be a unit that does not perform an assigned function: a fuse, a switch, a broken item. It might also be a measurable characteristic whose value is not between specified tolerance limits; for example, a resistor with tolerance limits 1,000 ± 5% ohms is defective if R < 950 or R > 1050. In such problems, p = P{defective} is the probability that a unit is defective, and the objective is to test the hypothesis H₀: p ≤ p₀ against H₁: p > p₀, where p₀ is the probability of defective parts when the plant is in control. This test is identical to the test H₀′: p = p₀ against H₁′: p > p₀.
Proceeding as in (10-22), we use as the test statistic the number x of successes of the event 𝒜 = {defective} in n trials: q = x and qₘ = xₘ. Thus xₘ is the number of defective units found at the mth inspection. As we know, the RV x has a binomial distribution. This leads to the conclusion that the UCL is the smallest integer k₂ such that

   P{x ≥ k₂ | H₀} = Σ_{k=k₂}^{n} C(n, k) p₀^k q₀^{n−k} < α

For large n [see (10-23)],

   UCL = k₂ = np₀ + z_{1−α} √(np₀q₀)

and the chart of Fig. 10.8 results. Thus if at the mth inspection, the number xₘ of defective parts exceeds k₂, we decide with significance level α that the plant is out of control.
[Figure 10.8: control chart for the number xₘ of defective units, with UCL = k₂ = np₀ + z_{1−α}√(np₀q₀).]
10-4
Goodness-of-Fit Testing
The objective of goodness-of-fit testing is twofold.
1.
It seeks to establish whether data obtained from a real experiment
fit a known theoretical model: Is a die fair? Is Mendel's theory of
heredity valid? Does the life length of a transistor have a Weibull
distribution?
2. It seeks to establish whether two or more sets of data obtained
from the same experiment, performed under different experimental conditions, fit the same theoretical model: Is the proportion of
Republicans among all voters the same as among male voters? Is
the distribution of accidents with a 55-mph speed limit the same as
with a 65-mph speed limit? Is the proportion of daily faulty units in
a production line the same for all machines?
Goodness of fit is part of hypothesis testing based on the following seminal problem: We have a partition

   A = [𝒜₁, . . . , 𝒜ₘ]

consisting of m events 𝒜ᵢ (Fig. 10.9), and we wish to test the hypothesis that their probabilities pᵢ = P(𝒜ᵢ) have m given values p₀ᵢ:

   H₀: pᵢ = p₀ᵢ, all i     against     H₁: pᵢ ≠ p₀ᵢ, some i        (10-58)

This is a generalization of the test (10-19) involving a single event.
[Figure 10.9: the partition A of the space into the m events 𝒜ᵢ.]
To carry out the test, we repeat the experiment n times and denote by kᵢ the number of successes of 𝒜ᵢ.

Pearson's Test Statistic   We shall use as the test statistic the exponent of the normal approximation (7-71) of the generalized binomial distribution:

   q = Σᵢ₌₁ᵐ (kᵢ − np₀ᵢ)²/np₀ᵢ        (10-59)

This choice is based on the following considerations: The RVs kᵢ have a binomial distribution with

   E{kᵢ} = npᵢ     σ²_{kᵢ} = npᵢqᵢ        (10-60)

Hence, the ratio kᵢ/n → pᵢ as n → ∞. This leads to the conclusion that under hypothesis H₀, the difference |kᵢ − np₀ᵢ| is small, and it increases as |pᵢ − p₀ᵢ| increases.
The test proceeds as follows: Observe the numbers kᵢ and compute the sum in (10-59), select a significance level α, and find the percentile q_{1−α} of q.

   Accept H₀ iff Σᵢ₌₁ᵐ (kᵢ − np₀ᵢ)²/np₀ᵢ < q_{1−α}        (10-61)

Note that the computations in (10-59) are simplified if we expand the square and use the identities Σ p₀ᵢ = 1, Σ kᵢ = n. This yields

   q = Σᵢ₌₁ᵐ kᵢ²/np₀ᵢ − n        (10-62)
CHI-SQUARE TEST   To carry out the test, we must determine the percentile q_u of Pearson's test statistic. This is, in general, a difficult problem involving the determination of a function of u depending on many parameters. In Section 8-3, we solve this problem numerically using computer simulation. In this section we use an approximation based on the assumption that n is large.

• Theorem. If n is large, then under hypothesis H₀, Pearson's test statistic q is χ²(m − 1).        (10-63)

This is based on the following facts. For large n, the RVs kᵢ are normal. Hence [see (10-60)], if pᵢ = p₀ᵢ, the RVs (kᵢ − np₀ᵢ)²/np₀ᵢq₀ᵢ are χ²(1). However, they are not independent because Σ (kᵢ − np₀ᵢ) = 0. Using these facts, we can show, proceeding as in the proof of (7-97), that q can be written as the sum of the squares of m − 1 independent N(0, 1) RVs; the details, however, are rather involved [see (7A-6)].
We shall verify the theorem for m = 2. In this case [see (7-72)],

   k₁ + k₂ = n     p₀₁ = 1 − p₀₂     |k₁ − np₀₁| = |k₂ − np₀₂|

Hence,

   q = (k₁ − np₀₁)²/np₀₁ + (k₂ − np₀₂)²/np₀₂ = (k₁ − np₀₁)²/np₀₁p₀₂        (10-64)

This shows that for m = 2, q is χ²(1), in agreement with (10-63).
From (10-63) it follows that if n is large, Pearson's test takes the form

   Accept H₀ iff q < χ²_{1−α}(m − 1)        (10-65)

A decision based on (10-65) is called a chi-square test.
The Type II error probability depends on pᵢ and can be expressed in terms of the noncentral χ² distribution (see the Appendix to Chapter 7).
Example 10.21
We wish to test the hypothesis that a die is fair. To do so, we roll it 450 times and observe that the ith face shows kᵢ times, where

   kᵢ = 66  60  84  72  81  87

In this problem, n = 450, p₀ᵢ = 1/6, np₀ᵢ = 75, and

   q = Σᵢ₌₁⁶ (kᵢ − 75)²/75 = 7.68

Since χ²_{.95}(5) = 11 > 7.68, we accept the fair-die hypothesis with α = .05. •
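The chi-square test (10-59) and (10-65) is easily sketched in code; the helper name and SciPy usage are illustrative assumptions.

```python
from scipy.stats import chi2

def chi_square_test(k, p0, alpha=0.05):
    """Pearson chi-square test of p_i = p0_i from observed counts k_i, as in (10-65)."""
    n = sum(k)
    q = sum((ki - n * pi) ** 2 / (n * pi) for ki, pi in zip(k, p0))
    c = chi2.ppf(1 - alpha, len(k) - 1)      # percentile with m-1 degrees of freedom
    return q, c, q < c                        # accept H0 iff q < c

# chi_square_test([66, 60, 84, 72, 81, 87], [1/6] * 6) reproduces q = 7.68 of Example 10.21.
```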
The next example illustrates the use of the chi-square test in establishing the validity of a scientific theory, Mendel's theory of heredity.

Example 10.22
Peas are divided into the following four classes depending on their color and their shape: round-yellow, round-green, angular-yellow, angular-green. According to the theory of heredity, the probabilities of the events of each class are

   p₀ᵢ = 9/16  3/16  3/16  1/16

To test the validity of the theory, we examine 556 peas and find that the number of peas in each class equals

   kᵢ = 315  108  101  32

Using these observations, we shall test the hypothesis that pᵢ = p₀ᵢ with α = .05.
In this problem, n = 556, np₀ᵢ = 312.75, 104.25, 104.25, 34.75, and m = 4; hence,

   q = Σᵢ₌₁⁴ (kᵢ − np₀ᵢ)²/np₀ᵢ = 0.470

Since χ²_{.95}(3) = 7.81 is much larger than 0.470, the evidence strongly supports the null hypothesis. •
Refined Data   In general, the variations of a random experiment are due to causes that remain the same from trial to trial. In certain cases, we can isolate certain factors that might affect the data. This is done to refine the data and to improve the test. The next example illustrates this.

Example 10.23
In Mendel's theory, the ratio of yellow to green peas is 3:1. We wish to test the hypothesis that the theory is correct. We pick 478 peas from 10 plants and observe that 355 of them are yellow. What is our decision?
The problem is to test the hypothesis p = p₀ = .75, where p is the probability that a pea is yellow. To solve this problem, we use the chi-square test with m = 2 or, equivalently, the test (10-19). In our case, k = 355, n = 478, np₀q₀ = 89.6, and (10-64) yields

   q = (k − np₀)²/np₀q₀ = 0.136 < χ²_{.95}(1) = 3.84

Hence, we accept the 3:1 hypothesis with α = .05.
Refinement   There might be statistical differences among plants. This possibility leads to a refined chi-square test: We group the data according to plant and observe that the jth plant has nⱼ peas where

   nⱼ = 36  39  19  97  37  26  45  53  64  62

of which

   kⱼ = 25  32  14  70  24  20  32  44  50  44

are yellow. Using the data kⱼ and nⱼ of the jth plant, we form the 10 statistics

   qⱼ = (kⱼ − nⱼp₀)²/nⱼp₀q₀

and we obtain

   qⱼ = 0.59  1.03  0.02  0.34  2.03  0.05  0.36  1.82  0.37  0.54

To use the differences among plants, we shall use as the test statistic the sum

   q = Σⱼ₌₁¹⁰ qⱼ = Σⱼ₌₁¹⁰ (kⱼ − nⱼp₀)²/nⱼp₀q₀ = 7.19

The corresponding RVs qⱼ are independent, and each is χ²(1); hence, their sum q is χ²(10). Since χ²_{.95}(10) = 18.3 > 7.19, we accept the 3:1 hypothesis. •
352
CHAP.
10 HYPOTHESIS TESTING
Incomplete Null Hypothesis
We have assumed that under the null hypothesis, them probabilities p; = po;
are known. In a number of applications, they are specified only partially.
The following cases are typical.
Cast 1 Only m - r of the m numbers p 01 are known.
Cast 2 The m numbers p 01 satisfy m - r equations:
f/>j(Po~o • • • • Pom) = 0
j = I, . . . , m - r
(10-66)
These equations might, for example. be the results of a scientific theory.
Case 3 Them numbers p 01 are functions of r unknown parameters 8i:
po; = t/1;(8,, . . . • 8,)
i = I, . . . • m
(10-67)
In all cases, we have r < m unknown pantmeters.
MODIFIED PEARSON'S STATISTIC To carry out the test, we find the maxi-
mum likelihood (ML) estimates of the r unknown parameters and use the
given constraints to determine the ML estimates p01 of all m probabilities p 01 •
In case I, we find the ML estimates of the r unknown probabilities po;. In
case 2, we find the ML estimates of r of the m numbers po; and use the m - r
equations in (10-66) to find the ML estimates of the remaining m - r probabilities. In case 3, we find the ML estimates (Ji of the r parameters 8j and use
them equations in (10-66) to find the ML estimates p01 of them numbers po;.
Inserting the estimates so obtained into (10-59), we obtain the modified
Pearson's statistic,
(10-68)
where p01 = po; for the values of i for which p 01 is known. To complete the
test, we need the distribution of q.
• Tlttonm. If n is large. then under hypothesis H 0 , the statistic
q
is
x2Cm - r - I)
(10-69)
The proof is based on the fact that the m parameter. po; satisfy r constntints
and their sum equals I.
This theorem leads to the following test: Find the ML estimates po;
using the techniques of Section 9-5. Compute the sum q in ( 10-68):
Accept Ho
itT
q < xf-a<m - r - I)
(10-70)
A decision based on (10-70) is called a modified chi-square test.
The foregoing is an approximation. An exact determination of the distribution of q can be obtained by computer simulation.
Example 10.24
The probabilities pᵢ of the six faces {fᵢ} of a die are unknown. We participate in a game betting on {even}. To examine our chances, we count the number kᵢ of times fᵢ shows in 120 rolls. The results are

   kᵢ = 18  16  15  24  22  25     for i = 1, . . . , 6

On the basis of these data, we shall test the hypothesis

   H₀: P{even} = .5     against     H₁: P{even} ≠ .5

This is case 2 of a modified Pearson's test with n = 120, m = 6, r = 1. Under the null hypothesis, we have the constraints

   P{even} = p₀₂ + p₀₄ + p₀₆ = 0.5     Σᵢ₌₁⁶ p₀ᵢ = 1        (10-71)

We use as free parameters the probabilities p₀₁, p₀₂, p₀₃, and p₀₄. To find their ML estimates, we determine the maximum of the density

   f(p₀₁, . . . , p₀₆) = γ (p₀₁)^{k₁} · · · (p₀₆)^{k₆}

subject to (10-71). With L = ln f, (9-76) yields

   ∂L/∂p₀₁ = k₁/p₀₁ − k₅/p₀₅ = 0     ∂L/∂p₀₂ = k₂/p₀₂ − k₆/p₀₆ = 0
   ∂L/∂p₀₃ = k₃/p₀₃ − k₅/p₀₅ = 0     ∂L/∂p₀₄ = k₄/p₀₄ − k₆/p₀₆ = 0        (10-72)

The solutions of the six equations in (10-71) and (10-72) yield the ML estimates of p₀ᵢ:

   p̂₀₁ = .164   p̂₀₂ = .123   p̂₀₃ = .136   p̂₀₄ = .185   p̂₀₅ = .200   p̂₀₆ = .192

Inserting into (10-68), we obtain q̂ = 0.834. Since χ²_{.95}(4) = 9.49 > 0.834, we conclude that the evidence strongly supports the null hypothesis with α = .05.
One might argue that we could have applied (10-19) to the event {even} with k = k₂ + k₄ + k₆. We could have; however, the resulting Type II error probability would be larger. •
Test of Independence and Contingency Tables
We are given two events ℬ and 𝒞 with probabilities

   b = P(ℬ)     c = P(𝒞)

and we wish to test the hypothesis that they are independent:

   H₀: P(ℬ ∩ 𝒞) = bc     against     H₁: P(ℬ ∩ 𝒞) ≠ bc        (10-73)

To do so, we form the four events ℬ ∩ 𝒞, ℬ ∩ 𝒞̄, ℬ̄ ∩ 𝒞, and ℬ̄ ∩ 𝒞̄ shown in Fig. 10.10a. These events form a partition of 𝒮; hence, we can use the chi-square test, properly interpreted.
[Figure 10.10: (a) the four events ℬ ∩ 𝒞, ℬ ∩ 𝒞̄, ℬ̄ ∩ 𝒞, ℬ̄ ∩ 𝒞̄; (b) the same events written as 𝒜ᵢⱼ with probabilities pᵢⱼ and numbers of successes kᵢⱼ, that is, a contingency table.]
As preparation for the generalization of this important problem, we introduce the following notations. We identify the four events by 𝒜ᵢⱼ where i = 1, 2 and j = 1, 2, their probabilities by pᵢⱼ, and their numbers of successes by kᵢⱼ. For example,

   𝒜₁₂ = ℬ ∩ 𝒞̄     p₁₂ = P(𝒜₁₂)

The number k₁₂ equals the number of successes of the event 𝒜₁₂, that is, the number of times ℬ occurs but 𝒞 does not occur. Figure 10.10b is a redrawing of Fig. 10.10a using these notations. This diagram is called a contingency table.
We know that P(ℬ̄) = 1 − b and P(𝒞̄) = 1 − c. Furthermore, if the events ℬ and 𝒞 are independent, then the events ℬ and 𝒞̄, ℬ̄ and 𝒞, and ℬ̄ and 𝒞̄ are also independent. From this it follows that under hypothesis H₀,

   p₁₁ = bc     p₁₂ = b(1 − c)     p₂₁ = (1 − b)c     p₂₂ = (1 − b)(1 − c)        (10-74)

Applying (10-65) to the four-element partition

   A = [𝒜₁₁, 𝒜₁₂, 𝒜₂₁, 𝒜₂₂]

we obtain the following test of (10-73):

   Accept H₀ iff Σᵢ₌₁² Σⱼ₌₁² (kᵢⱼ − npᵢⱼ)²/npᵢⱼ < χ²_{1−α}(3)        (10-75)

Example 10.25
It is known that in a certain district, 52% of all voters are male and 40% are Republican. We wish to test whether the proportion of Republicans among all voters equals the proportion of Republicans among male voters, that is, whether the events ℬ = {male} and 𝒞 = {Republican} are independent. For this purpose, we poll 200 voters and find that

   k₁₁ = 35     k₁₂ = 71     k₂₁ = 43     k₂₂ = 51

In this problem, n = 200, b = .52, c = .4. Hence,

   p₁₁ = .208     p₁₂ = .312     p₂₁ = .192     p₂₂ = .288

Inserting into the sum in (10-75), we obtain q ≈ 3.54. Since χ²_{.95}(3) = 7.81 > 3.54, we accept the hypothesis of independence with α = .05. •
Suppose now that the probabilities b = P(ℬ) and c = P(𝒞) are unknown. This is case 3 of the incomplete null hypothesis with θ₁ = b and θ₂ = c. In our case, the constraints (10-67) are the four equations (10-74) expressing the probabilities pᵢⱼ in terms of the r = 2 unknown parameters b and c. The ML estimates of b and c are their empirical estimates n_ℬ/n and n_𝒞/n, respectively, where n_ℬ = k₁₁ + k₁₂ is the number of successes of the event ℬ = 𝒜₁₁ ∪ 𝒜₁₂ and n_𝒞 = k₁₁ + k₂₁ is the number of successes of the event 𝒞 = 𝒜₁₁ ∪ 𝒜₂₁. Thus

   b̂ = (k₁₁ + k₁₂)/n     ĉ = (k₁₁ + k₂₁)/n     n = k₁₁ + k₁₂ + k₂₁ + k₂₂

Replacing in (10-74) the probabilities b and c by their ML estimates b̂ and ĉ, we obtain the ML estimates p̂ᵢⱼ of pᵢⱼ. To complete the test, we form the modified Pearson's sum q̂ as in (10-68). In our case, m − r − 1 = 1, and the following test results:

   Accept H₀ iff q̂ = Σᵢ₌₁² Σⱼ₌₁² (kᵢⱼ − np̂ᵢⱼ)²/np̂ᵢⱼ < χ²_{1−α}(1)        (10-76)

Example 10.26
Engineers might have a bachelor's, master's, or doctor's degree. We wish to test whether the proportion of bachelor's degrees among all engineers is the same as the proportion among all civil engineers, that is, to test whether the events ℬ = {bachelor} and 𝒞 = {civil} are independent. For this purpose, we interrogate 400 engineers and obtain the contingency table of Fig. 10.11a.
In this problem, n = 400, n_ℬ = 54 + 186 = 240, n_𝒞 = 54 + 26 = 80,

   b̂ = .6     ĉ = .2     p̂₁₁ = .12     p̂₁₂ = .48     p̂₂₁ = .08     p̂₂₂ = .32

and (10-76) yields q̂ = 2.34. Since χ²_{.95}(1) = 3.84 > 2.34, we accept the assumption of independence with α = .05. •
GENERAL CONTINGENCY TABLES   Refining the objective of Example 10.26, we shall test the hypothesis that each of the three events B_1 = {bachelor's}, B_2 = {master's}, B_3 = {doctor's} is independent of each of the four events C_1 = {civil}, C_2 = {electrical}, C_3 = {mechanical}, C_4 = {chemical}. For this purpose, we interrogate the same 400 engineers and list the data in the table of Fig. 10.11b. This refinement is a special case of the following problem.
We are given two partitions

B = [B_1, . . . , B_u]        C = [C_1, . . . , C_v]

consisting of the u events B_i and the v events C_j with respective probabilities

b_i = P(B_i), 1 ≤ i ≤ u        c_j = P(C_j), 1 ≤ j ≤ v

We wish to test the hypothesis H_0 that each of the events B_i is independent of each of the events C_j. For this purpose, we form a partition A consisting
Figure 10.11

(a)
          C      C'
B        54    186    240
B'       26    134    160
         80    320    400

(b)
         C_1    C_2    C_3    C_4
B_1      54    101     50     35    240
B_2      20     44     33     23    120
B_3       6     15     12      7     40
         80    160     95     65    400
of the m = uv events

A_{ij} = B_i ∩ C_j

(Fig. 10.12) and denote by p_{ij} their probabilities. Under the null hypothesis, p_{ij} = b_i c_j; hence, our problem is to test the hypothesis

H_0: p_{ij} = b_i c_j, all i, j    against    H_1: p_{ij} ≠ b_i c_j, some i, j    (10-77)

We perform the experiment n times and observe that the event A_{ij} occurs k_{ij} times. This is the number of times both events B_i and C_j occur. For example, in the table of Fig. 10.11b, k_{32} is the number of electrical engineers with a doctor's degree.
Suppose, first, that the probabilities b_i and c_j are known. In this case, the null hypothesis is completely specified because then p_{ij} = b_i c_j. Applying the chi-square test (10-63) to the partition A, we obtain the following test:

Accept H_0 iff  q = \sum_{i=1}^{u}\sum_{j=1}^{v} \frac{(k_{ij} - n b_i c_j)^2}{n b_i c_j} < \chi^2_{1-\alpha}(uv - 1)    (10-78)
If the probabilities b_i and c_j are unknown, we determine their ML estimates b̂_i and ĉ_j and apply the modified chi-square test (10-70). The number n_{B_i} of occurrences of the event B_i is the sum of the entries k_{ij} of the ith row of the contingency table of Fig. 10.12, and the number n_{C_j} of occurrences of the event C_j equals the sum of the entries k_{ij} of the jth column. Hence,

\hat b_i = \frac{n_{B_i}}{n} = \frac{1}{n}\sum_{j=1}^{v} k_{ij} \qquad \hat c_j = \frac{n_{C_j}}{n} = \frac{1}{n}\sum_{i=1}^{u} k_{ij}    (10-79)

In (10-79), there are actually r = u + v − 2 unknown parameters because Σ b_i = 1 and Σ c_j = 1. Hence,

m − r − 1 = uv − (u + v − 2) − 1 = (u − 1)(v − 1)

Inserting into (10-70), we obtain the following test:

Accept H_0 iff  q = \sum_{i=1}^{u}\sum_{j=1}^{v} \frac{(k_{ij} - n \hat b_i \hat c_j)^2}{n \hat b_i \hat c_j} < \chi^2_{1-\alpha}[(u - 1)(v - 1)]    (10-80)
Figure 10.12  (The partition A arranged as a contingency table: the entry in row i, column j is the event A_{ij} = B_i ∩ C_j.)
Example 10.27
We shall apply (10-80) to the data in Fig. 10.11b. In this problem, n = 400, and (10-79) yields

b̂_i = .6  .3  .1        ĉ_j = .2  .4  .2375  .1625

The resulting estimates p̂_{ij} = b̂_i ĉ_j of p_{ij} equal

p̂_{1j} = .12   .24   .1425   .0975
p̂_{2j} = .06   .12   .0712   .0488
p̂_{3j} = .02   .04   .0238   .0162

Hence, q ≈ 5.94. Since u = 3, v = 4, (u − 1)(v − 1) = 6, and χ²_{.95}(6) = 12.59 > 5.94, we conclude with significance level α = .05 that the evidence is consistent with the hypothesis that the proportions of engineers with bachelor's, master's, and doctor's degrees are the same among the four groups considered. •
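The arithmetic of Example 10.27 is easy to automate. The sketch below is an illustration only, not part of the original text: it evaluates the modified Pearson statistic (10-80) for the table of Fig. 10.11b and compares it with the chi-square percentile.

```python
import numpy as np
from scipy.stats import chi2

# Contingency table of Fig. 10.11b: rows = degree (B_i), columns = specialty (C_j).
k = np.array([[54, 101, 50, 35],
              [20,  44, 33, 23],
              [ 6,  15, 12,  7]], dtype=float)

n = k.sum()                      # total number of observations (400)
b_hat = k.sum(axis=1) / n        # ML estimates of the row probabilities, as in (10-79)
c_hat = k.sum(axis=0) / n        # ML estimates of the column probabilities
p_hat = np.outer(b_hat, c_hat)   # estimated cell probabilities under H0

q = ((k - n * p_hat) ** 2 / (n * p_hat)).sum()      # modified Pearson sum (10-80)
u, v = k.shape
threshold = chi2.ppf(0.95, (u - 1) * (v - 1))       # chi^2_{1-alpha}[(u-1)(v-1)]

print(f"q = {q:.2f}, threshold = {threshold:.2f}")  # about 5.94 and 12.59
print("accept H0" if q < threshold else "reject H0")
```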
Goodness of Fit of Distributions
In Section 10-2, we presented a method for testing the hypothesis that the distribution F(x) equals a given function F_0(x) for all x:

H_0: F(x) = F_0(x), all x    against    H_1: F(x) ≠ F_0(x), some x    (10-81)

In this section, we have a more modest objective. We wish to test the hypothesis that F(x) = F_0(x) only at a set of m − 1 points a_i (Fig. 10.13):

H_0': F(a_i) = F_0(a_i), all i    against    H_1': F(a_i) ≠ F_0(a_i), some i    (10-82)

Figure 10.13
If x is of the discrete type, taking the values a_i, then the two tests are equivalent. If x is of the continuous type and H_0' is rejected, then H_0 is rejected; however, if H_0' is accepted, our conclusion is only that the evidence does not support the rejection of H_0.
The simplified hypothesis H_0' is equivalent to (10-58); hence, we can apply the chi-square test. To do so, we form the m events

A_i = \{a_{i-1} < x \le a_i\}    1 ≤ i ≤ m    (10-83)

The probabilities p_i = P(A_i) of these events equal

P(A_i) = F(a_i) - F(a_{i-1})    a_0 = -∞    a_m = +∞    (10-84)

and under hypothesis H_0', p_i = p_{0i} = F_0(a_i) − F_0(a_{i−1}). Hence, (10-82) is equivalent to (10-58). This equivalence leads to the following test of H_0': Compute the m probabilities p_{0i} = F_0(a_i) − F_0(a_{i−1}). Determine the n samples x_i of x, and count the number k_i of samples that are in the interval (a_{i−1}, a_i]. Form Pearson's statistic (10-59).

Reject H_0' iff  q > \chi^2_{1-\alpha}(m - 1)    (10-85)
Example 10.28
We wish to test the hypothesis that the time to failure x of a system has an exponential distribution F_0(x) = 1 − e^{−x/12}, x > 0. Proceeding as in (10-83), we select the points 3, 6, 9 and form the events

{x ≤ 3}    {3 < x ≤ 6}    {6 < x ≤ 9}    {x > 9}

In this problem,

p_{01} = F_0(3) = .221        p_{02} = F_0(6) − F_0(3) = .173
p_{03} = F_0(9) − F_0(6) = .134        p_{04} = 1 − F_0(9) = .472

We next determine the times to failure x_i of n = 200 systems. We observe that k_i are in the ith interval, where

k_i = 53    42    35    70

Inserting into (10-59), we obtain

q = \sum_{i=1}^{4} \frac{(k_i - n p_{0i})^2}{n p_{0i}} = 12.22 > \chi^2_{.99}(3) = 11.34

Hence, we reject H_0', and therefore also H_0, with α = .01. •
Note   The significance level α used in (10-85) is the Type I error probability of the test of (10-82). The corresponding error α' of the test of (10-81) is smaller. An increase in the number m of intervals brings α' closer to α, decreasing thereby the resulting Type II error. This, however, results in a decrease of the number of samples k_i in each interval (a_{i−1}, a_i], thereby weakening the validity of the large-n approximation (10-63). In most cases, the condition k_i > 3 is adequate. If this condition is not met in a particular test, we combine two or more intervals into one until it is.
INCOMPLETE NULL HYPOTHESIS   We have assumed that the function F_0(x) is known. In a number of cases, the distribution of x is a function F_0(x, θ_1, . . . , θ_r) of known form depending on r unknown parameters θ_i. In this case, the probabilities p_{0i} satisfy the m equations p_{0i} = ψ_i as in (10-67), where

\psi_i(\theta_1, \ldots, \theta_r) = F_0(a_i, \theta_1, \ldots, \theta_r) - F_0(a_{i-1}, \theta_1, \ldots, \theta_r)

This is case 3 of the incomplete null hypothesis. To complete the test, we find the ML estimates θ̂_i of θ_i and insert them into (10-84). This yields the ML estimates p̂_{0i} of p_{0i}. Using these estimates, we form the modified Pearson's statistic (10-68), and we apply (10-70) in the following form:

Reject H_0' iff  q > \chi^2_{1-\alpha}(m - r - 1)    (10-86)
Example I 0.29
We wish to test the hypothesis that the number of particles emitted from a radioactive substance in to seconds is a Poisson-distributed RV with parameter 8 = Ato:
84
P{x = k} = e· 9 k!
k = 0, I. . . .
(10-87)
We shall carry out the analysis under the assumption that the numbers of
particles in nonoverlapping intervals are independent. In Rutherford's experiment,
2,612 intervals were considered, and it was found that in n4 of these intervals, there
were k particles where
k= 0
n4 =51
2
203
383
3
525
4
5
532
408
6
273
7
8
139
49
9
27
10
20
II
4
12
>12
0
Thus in 57 intervals, there were no particles, in 203 intervals there was only one
particle, and in no one interval there were more than 12 particles (Fig. 10.14). We
next form the 12 events
{x = 0}. {x = 1}•...• {x = to}. {x 2:: II}
and determine the ML estimates po; of their probabilities p,~. To do so, we find the
ML estimate {J of8. As we know. {J equals the empirical estimate i ofx (see Problem
Figure 10.14
t
t
soo r
nk
I
100 r
t
0
2
3
4
s
7
,• •
8
9
10
•I I •12
k
36()
CHAP.
10 HYPOTHESIS TESTING
9-35). The RV x takes the values k = 0, I, 2, . . . with multiplicity nk; hence.
.
_ I 12
12
9 = x = - ~ knt = 3.9
n = ~ n4 = 2,612
n k=O
k=O
Replacing 9 by 8 in (10-87). we obtain
•A
• = e -i !_
POi
k!
•
. = k. + I = I • . . . • II
· = ,. ' (
POi
I
8 = 3.9. this yields
Po;= .020 .079 .154 .012 .200 .195
811
81~
m + 1:!! )
i = 12
and with
.152 .099 .055 .021 .012 .005 .003
To find Pearson's test statistic q. we observe that the numbers k, in ( 10-68) are for this
problem the numbers n;. 1• The resulting sum equals
"'
q=~
i•l
(
.
)'
-?POi-= 10.67
n;.,po;
n;-t
We have m = 12 events and r = I unknown (namely. the parameter 9): hence. cj is
x2(10). Since x:95(10) = 18.31 > q. we conclude that the evidence does not support
the rejection of the Poisson distribution.
•
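A sketch of the computation in Example 10.29, for illustration only; the counts are those listed above, and the last cell pools the observations with k ≥ 11.

```python
import numpy as np
from scipy.stats import chi2, poisson

counts = np.array([57, 203, 383, 525, 532, 408, 273, 139, 49, 27, 10, 4, 2])  # n_k, k = 0..12
k = np.arange(len(counts))

n = counts.sum()                              # 2,612 intervals
theta_hat = (k * counts).sum() / n            # ML estimate = sample mean, about 3.9

# Cell probabilities for the events {x = 0}, ..., {x = 10}, {x >= 11}
p_hat = poisson.pmf(np.arange(11), theta_hat)
p_hat = np.append(p_hat, 1.0 - p_hat.sum())

obs = np.append(counts[:11], counts[11:].sum())          # pool the counts for k >= 11
q = ((obs - n * p_hat) ** 2 / (n * p_hat)).sum()         # modified Pearson sum (10-68)

df = 12 - 1 - 1                                          # m - r - 1 = 10
print(f"q = {q:.2f}, chi2_0.95({df}) = {chi2.ppf(0.95, df):.2f}")
```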
10-5
Analysis of Variance
The values of an RV x are usually random. In a number of cases, they also depend on certain observable causes. Such causes are called factors. Each factor is partitioned into several characteristics called groups. Analysis of variance (ANOVA) is the study of the possible effects of the groups of each factor on the mean η_x of x. ANOVA is a topic in hypothesis testing. The null hypothesis is the assumption that the groups have no effect on the mean of x; the alternative hypothesis is the assumption that they do. The test (10-13) of the equality of two means is a special case. In ANOVA, we study various tests involving means; the topic, however, is called analysis of variance. As we shall presently explain, the reason is that the tests are based on ratios of variances.
We shall illustrate with an example. Suppose that the RV x represents
the income of engineers. We wish to study the possible effects of education,
sex, and marital status on the mean of x. In this problem, we have three
factors. Factor 1 consists of the groups bachelor's degree, master's degree,
and doctor's degree. Factor 2 consists of the groups male and female. Factor
3 consists of the groups married and unmarried.
In an ANOVA investigation, we have one-factor tests, two-factor tests, or many-factor tests. In a one-factor test, we study the effects of the groups of a single factor on the mean of x. We refine the underlying experiment, forming a different RV for each group. We thus have m RVs x_j, where m is the number of groups in the factor under consideration. In a two-factor test, we form mr RVs x_{jk}, where m is the number of groups of factor 1 and r is the number of groups of factor 2.
In a one-factor problem where x is the income of engineers and the factor is education, the RVs x_1, x_2, x_3 represent the income of engineers with a bachelor's, a master's, and a doctor's degree, respectively. If we wish to include the possible effect of sex, we have a two-factor test. In this case, we have six RVs x_{jk} where j = 1, 2, 3 and k = 1, 2; for example, x_{31} represents the income of male engineers with a doctor's degree.
In the course of this investigation, we shall use samples of the RVs x_j and x_{jk}. To avoid mixed subscripts, we shall identify the samples with superscripts, retaining the subscripts for the various groups. Thus x_j^i will mean the ith sample of the RV x_j; similarly, x_{jk}^i will mean the ith sample of the RV x_{jk}. The corresponding sample means will be identified by overbars. Thus
The corresponding sample means will be identified by overbars. Thus
-I~;-
Xj
= -ni L,-
'Xjk
'Xj
1
= -nikI~;
2.. 'Xjk
,_ •
(10-88)
where ni and nik are the number of samples of the Rvs xi and xi4 , respectively.
THE ANOVA PRINCIPLE   We have shown in (7-95) that if the RVs z_j are uncorrelated with the same mean and variance and

Q_0 = \sum_{j=1}^{m} (z_j - \bar z)^2    then    E\{Q_0\} = (m - 1)\sigma^2    (10-89)

Furthermore, if they are normal, then the ratio Q_0/σ² has a χ²(m − 1) distribution. The following result is a simple generalization.
Given m uncorrelated RVs w_j with the same variance σ² and means η_j, we form the sum

Q = \sum_{j=1}^{m} (w_j - \bar w)^2 \qquad \bar w = \frac{1}{m}\sum_{j=1}^{m} w_j    (10-90)

• Theorem

E\{Q\} = (m - 1)\sigma^2 + e    (10-91)

where

e = \sum_{j=1}^{m} (\eta_j - \bar\eta)^2 \qquad \bar\eta = \frac{1}{m}\sum_{j=1}^{m} \eta_j = E\{\bar w\}    (10-92)

Furthermore, if the RVs w_j are normal and e = 0, then the ratio Q/σ² has a χ²(m − 1) distribution.
• Proof. This result was established in (7A-14); however, because of its importance, we shall prove it directly. The RVs z_j = w_j − η_j are uncorrelated, with zero mean and variance σ²; furthermore, w_j − w̄ = (z_j − z̄) + (η_j − η̄). Inserting into (10-90), we obtain

Q = \sum_{j=1}^{m} \left[ (z_j - \bar z) + (\eta_j - \bar\eta) \right]^2

We next expand the square and take expected values. This yields

E\{Q\} = E\left\{ \sum_{j=1}^{m} (z_j - \bar z)^2 \right\} + \sum_{j=1}^{m} (\eta_j - \bar\eta)^2

because E{(z_j − z̄)(η_j − η̄)} = 0, and (10-91) follows from (10-89). If e = 0, then Q = Q_0; hence, Q/σ² is χ²(m − 1). If e ≠ 0, then the distribution of Q/σ² is noncentral χ²(m − 1, e).
This theorem is the essence of all ANOVA tests; they are based on the following consequence of (10-91). Suppose that σ̂² is an unbiased, consistent estimator of σ². If n is sufficiently large, then Q is concentrated near its mean (m − 1)σ² + e, and σ̂² is concentrated near its mean σ². Hence, if e = 0, that is, if the RVs w_j have the same mean, then the ratio Q/(m − 1)σ̂² is close to 1, and it increases as e increases.
One-Factor Tests

The m RVs

x_1, . . . , x_j, . . . , x_m

represent the m groups of a factor. We assume that they are normal with the same variance. We shall test the hypothesis that their means η_j = E{x_j} are equal:

H_0: \eta_1 = \cdots = \eta_m    against    H_1: \eta_i \ne \eta_j, \text{ some } i, j    (10-93)

For this purpose, we sample the jth RV n_j times and obtain the samples x_j^i (Fig. 10.15). The total number of samples equals

N = n_1 + · · · + n_m

We next form the sum

Q_1 = \sum_{j=1}^{m} n_j (\bar x_j - \bar x)^2 = \sum_{j=1}^{m} (w_j - \bar w)^2    (10-94)

Figure 10.15  (The samples x_j^i of the RVs x_1, . . . , x_m arranged in columns.)
where x̄_j is the average of the samples x_j^i of x_j as in (10-88), w_j = \sqrt{n_j}\,\bar x_j, and

\bar x = \frac{1}{m}\sum_{j=1}^{m} \bar x_j \qquad \bar w = \frac{1}{m}\sum_{j=1}^{m} w_j

The RVs x̄_j are independent, with mean η_j and variance σ²/n_j. From this it follows that the RVs w_j are independent, with mean η_j√n_j and variance σ². Hence [see (10-91)],

E\{Q_1\} = (m - 1)\sigma^2 + e \qquad e = \sum_{j=1}^{m} n_j (\eta_j - \bar\eta)^2    (10-95)

The constant e equals the value of Q_1 when the RVs x̄_j and x̄ are replaced by their means η_j and η̄.
To apply the ANOVA principle, we must find an unbiased estimator of σ². For this purpose, we form the sum

Q_2 = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (x_j^i - \bar x_j)^2    (10-96)

For a specific j, the RVs z_i = x_j^i are i.i.d., with variance σ² and sample mean z̄ = x̄_j. Applying (10-89), we obtain

E\left\{ \sum_{i=1}^{n_j} (x_j^i - \bar x_j)^2 \right\} = (n_j - 1)\sigma^2

Hence,

E\{Q_2\} = \sum_{j=1}^{m} (n_j - 1)\sigma^2 = (N - m)\sigma^2    (10-97)

Reasoning as in the ANOVA principle, we conclude that the ratio

q = \frac{Q_1/(m - 1)\sigma^2}{Q_2/(N - m)\sigma^2}    (10-98)

is close to 1 if e = 0, and it increases as e increases. It is reasonable, therefore, to use q as our test statistic. To complete the test, we must find the distribution of q.
• Theorem. Under hypothesis H_0, the RV

q = \frac{(N - m)\, Q_1}{(m - 1)\, Q_2} \quad\text{is}\quad F(m - 1,\; N - m)    (10-99)

• Proof. If H_0 is true, the RVs w_j = x̄_j√n_j are i.i.d., with variance σ². Hence, as in (10-90), the ratio Q_1/σ² is χ²(m − 1). Similarly, the ratio

y_j = \frac{1}{\sigma^2} \sum_{i=1}^{n_j} (x_j^i - \bar x_j)^2 \quad\text{is}\quad \chi^2(n_j - 1)    (10-100)

From this and (7-87) it follows that the sum Q_2/σ² = Σ_j y_j is χ² with Σ(n_j − 1) = N − m degrees of freedom. Furthermore, y_j is independent of x̄_j [see (7-97)], and Q_1 is a function of the x̄_j; hence, the RVs Q_1 and Q_2 are independent. Applying (7-106) to the RVs Q_1/σ² and Q_2/σ², we obtain (10-99).
A one-factor ANOVA test thus proceeds as follows: Observe the n_j samples x_j^i of each RV x_j. Compute x̄_j, x̄, and the sums Q_1 and Q_2.

Accept H_0 iff  \frac{(N - m)\, Q_1}{(m - 1)\, Q_2} < F_{1-\alpha}(m - 1,\; N - m)    (10-101)

where F_u(m − 1, N − m) is the u-percentile of the Snedecor distribution.
OC Function   The Type II error probability is a function β(e) depending only on e. To find it, we must determine the distribution of q under the alternative hypothesis. The distribution of Q_2/σ² is χ²(N − m) even if H_0 is not true, because the value of Q_2 does not change if we replace the RVs x_j^i by x_j^i − η_j. If e ≠ 0, the distribution of Q_1/σ² is a noncentral χ²(m − 1, e) with eccentricity e (see the Appendix to Chapter 7). Hence, the ratio q has a noncentral F(m − 1, N − m, e) distribution.
Example 10.30
A factory acquires four machines producing electric motors. We wish to test the hypothesis that the mean times to failure of the motors are equal. We monitor n_j = 7, 5, 6, 4 motors from the four machines and record their times to failure x_j^i, in weeks. The resulting sample means are

x̄_1 = 7.0    x̄_2 = 6.8    x̄_3 = 6.9    x̄_4 = 8.0    x̄ = 7.175

In this problem, m = 4, N = 22, and

Q_1 = 4.09    Q_2 = 5.68    q = \frac{18\, Q_1}{3\, Q_2} = 4.32

Since F_{.95}(3, 18) = 3.16 < 4.32, we reject the null hypothesis with α = .05. •

Example 10.31
We wish to test the hypothesis that parental education has no effect on the IQ of children. We have here three groups: grammar school (GS), high school (HS), and college (C). We select 10 children from each group and observe their IQ scores x_j^i. In this problem, m = 3, n_1 = n_2 = n_3 = 10, N = 30, and the resulting averages are

x̄_1 = 94.4 (GS)    x̄_2 = 97.8 (HS)    x̄_3 = 99.5 (C)    x̄ = 97.23

This yields

Q_1 = 134.9    Q_2 = 5,956    q = \frac{27\, Q_1}{2\, Q_2} ≈ .31

Since F_{.95}(2, 27) = 3.35 > .31, we accept the null hypothesis with α = .05. •
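A minimal sketch of the one-factor test (10-101); the group samples here are made up for illustration and are not the data of the examples.

```python
import numpy as np
from scipy.stats import f

def one_factor_anova(groups, alpha=0.05):
    """Accept/reject H0: all group means are equal, following (10-94)-(10-101)."""
    m = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    N = n.sum()
    means = np.array([np.mean(g) for g in groups])
    grand = means.mean()                                   # x-bar as defined in the text

    Q1 = np.sum(n * (means - grand) ** 2)                  # between-group sum (10-94)
    Q2 = sum(np.sum((np.asarray(g) - mj) ** 2) for g, mj in zip(groups, means))  # (10-96)

    q = (N - m) * Q1 / ((m - 1) * Q2)                      # F statistic (10-99)
    threshold = f.ppf(1 - alpha, m - 1, N - m)
    return q, threshold, q < threshold                     # True means "accept H0"

# Hypothetical samples from three groups
groups = [[6.1, 7.2, 6.8, 7.0], [5.9, 6.3, 6.1], [7.5, 7.9, 7.2, 8.1, 7.6]]
print(one_factor_anova(groups))
```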
Two-Factor Tests

In a two-factor test, we have mr RVs

x_{jk}    j = 1, . . . , m    k = 1, . . . , r

The subscript j identifies the m groups of factor 1, and the subscript k identifies the r groups of factor 2. The purpose of the test is to determine the possible effects of each factor on the mean η_{jk} = E{x_{jk}} of x_{jk}.
We introduce the averages

x_{j.} = \frac{1}{r}\sum_{k=1}^{r} x_{jk} \qquad x_{.k} = \frac{1}{m}\sum_{j=1}^{m} x_{jk} \qquad x_{..} = \frac{1}{mr}\sum_{j=1}^{m}\sum_{k=1}^{r} x_{jk}    (10-102)

Thus x_{j.} is the average of all RVs in the jth row of Fig. 10.16, and x_{.k} is the average of all RVs in the kth column. The average of all RVs (global average) is x_{..}.
We shall study the effects of the groups on the means

\eta_{j.} = E\{x_{j.}\} \qquad \eta_{.k} = E\{x_{.k}\} \qquad \eta_{..} = E\{x_{..}\}    (10-103)

of the RVs so formed. For this purpose, we introduce the constants (Fig. 10.17)

\alpha_j = \eta_{j.} - \eta_{..} \qquad \beta_k = \eta_{.k} - \eta_{..} \qquad \gamma_{jk} = \eta_{jk} - \eta_{..} - \alpha_j - \beta_k    (10-104)

Thus α_j is the deviation of the row mean η_{j.} from the global mean η_{..}, β_k is the deviation of the column mean η_{.k} from the global mean η_{..}, and γ_{jk} is the deviation of η_{jk} from the sum η_{..} + α_j + β_k. Each of the sums of the m numbers α_j, the r numbers β_k, and the mr numbers γ_{jk} is 0.
Figure 10.16  (The RVs x_{jk} arranged in a table: rows are the groups of factor 1, columns the groups of factor 2; the row, column, and global averages x_{j.}, x_{.k}, x_{..} border the table.)

Figure 10.17  (The decomposition of η_{jk} into η_{..} + α_j + β_k + γ_{jk}.)
ADDITIVITY   We shall say that the two factors are additive if the corresponding parameters γ_{jk} are 0 or, equivalently, if

\eta_{jk} = \eta_{..} + \alpha_j + \beta_k    (10-105)

In the engineer example, the condition γ_{jk} = 0 means that the variations in average income due to education are the same for male and female engineers, and the variations due to sex are the same for engineers with bachelor's, master's, and doctor's degrees.
As in the one-factor tests, the analysis of a two-factor test is based on the ANOVA principle. We start with the following problem: We assume that factors 1 and 2 of a two-factor problem are additive, and we wish to test the hypothesis that α_j = 0, that is, that the mean η_{jk} of x_{jk} does not depend on j:

H_0: \gamma_{jk} = 0,\ \alpha_j = 0, \text{ all } j    against    H_1: \gamma_{jk} = 0,\ \alpha_j \ne 0, \text{ some } j    (10-106)

One might argue that (10-106) could be treated as a one-factor problem: Ignore factor 2 and test the hypothesis that the groups of factor 1 have no effect on the mean of x_{jk}. We could do so. The reason we choose a two-factor test is that, as in Example 10.23, the possible variations within the groups of factor 2 lead to a more powerful test.
Proceeding as in (10-94), we form the sum

Q_3 = r \sum_{j=1}^{m} (x_{j.} - x_{..})^2 = \sum_{j=1}^{m} (w_j - \bar w)^2 \qquad w_j = x_{j.}\sqrt{r}    (10-107)

This sum is essentially identical to Q_1 if we replace the sample means by the row averages and n_j by r. This yields [see (10-95)]

e = r \sum_{j=1}^{m} (\eta_{j.} - \eta_{..})^2 = r \sum_{j=1}^{m} \alpha_j^2    (10-108)
We next form the sum

Q_4 = \sum_{j=1}^{m}\sum_{k=1}^{r} (x_{jk} - x_{j.} - x_{.k} + x_{..})^2    (10-109)

Under the additivity assumption, γ_{jk} = 0; hence,

\eta_{jk} = \eta_{j.} + \eta_{.k} - \eta_{..}

for both hypotheses. From this it follows that if we subtract the means from all RVs in (10-109), the sum Q_4 remains the same; hence, it has the same distribution under both hypotheses.
Reasoning as in (7-97), we can show that the ratio

\frac{Q_4}{\sigma^2} \quad\text{is}\quad \chi^2[(m - 1)(r - 1)]    (10-110)

The details of the proof are rather involved and will be omitted. Since the mean of an RV with χ²(n) distribution equals n, we conclude that

E\{Q_4\} = (m - 1)(r - 1)\sigma^2    (10-111)

Proceeding as in (10-98), we form the ratio

q = \frac{Q_3/(m - 1)\sigma^2}{Q_4/(m - 1)(r - 1)\sigma^2}    (10-112)

If H_0 is true, then α_j = 0; hence, the eccentricity e in (10-108) is 0. From this it follows that under hypothesis H_0,

\frac{Q_3}{\sigma^2} \quad\text{is}\quad \chi^2(m - 1)    (10-113)

Combining with (10-110), we conclude that if H_0 is true, then the ratio q in (10-112) is F[m − 1, (m − 1)(r − 1)]. This yields the following test of (10-106):

Accept H_0 iff  \frac{(r - 1)\, Q_3}{Q_4} < F_{1-\alpha}[m - 1,\; (m - 1)(r - 1)]    (10-114)
Example 10.32
We shall examine the effects of sex and education on the yearly income x_{jk} of factory workers. In this problem, we have two factors. Factor 1, sex, consists of the two groups M and F. Factor 2, education, consists of the three groups GS, HS, and C. We observe one value of each RV x_{jk} and obtain the following list of incomes, in thousands:

            GS      HS      C       x_{j.}
M           15      18      27      20
F           14      16      24      18
x_{.k}      14.5    17      25.5    x_{..} = 19

Assuming that there is no interaction between the groups (additivity), we shall test the hypothesis that α_j = 0 and the hypothesis that β_k = 0.
(a) H_0: α_j = 0. With m = 2 and r = 3, we obtain

Q_3 = 6    Q_4 = 1    q = \frac{2 Q_3}{Q_4} = 12

Since F_{.95}(1, 2) = 18.5 > q, we conclude with α = .05 that the evidence does not support the rejection of the hypothesis that sex has no effect on income.
(b) H_0: β_k = 0. We proceed similarly, interchanging j and k. We have the same Q_4, but now m = 3 and r = 2. This yields

q = \frac{Q_3}{Q_4} = 133

Since F_{.95}(2, 2) = 19 < q, we reject the hypothesis that education has no effect on income.
Note that in both cases the evidence is limited, and the resulting Type II error probability is large. •
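A sketch of the two tests in Example 10.32, for illustration only; the income table is the one given above.

```python
import numpy as np
from scipy.stats import f

# Incomes in thousands: rows = sex (M, F), columns = education (GS, HS, C)
x = np.array([[15.0, 18.0, 27.0],
              [14.0, 16.0, 24.0]])

m, r = x.shape
row = x.mean(axis=1, keepdims=True)        # x_{j.}
col = x.mean(axis=0, keepdims=True)        # x_{.k}
grand = x.mean()                           # x_{..}

Q3_rows = r * np.sum((row - grand) ** 2)                 # (10-107) for factor 1
Q3_cols = m * np.sum((col - grand) ** 2)                 # same sum with j and k interchanged
Q4 = np.sum((x - row - col + grand) ** 2)                # (10-109)

q_rows = (r - 1) * Q3_rows / Q4                          # test (10-114) for alpha_j = 0
q_cols = (m - 1) * Q3_cols / Q4                          # test for beta_k = 0
print(q_rows, f.ppf(0.95, m - 1, (m - 1) * (r - 1)))     # 12 versus about 18.5
print(q_cols, f.ppf(0.95, r - 1, (m - 1) * (r - 1)))     # 133 versus 19
```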
TEST OF ADDITIVITY   In (10-106), we assumed that the two factors are additive, and we tested the hypothesis that the first factor has no effect on η_{jk}. Now we test the hypothesis that the two factors are additive:

H_0: \gamma_{jk} = 0, \text{ all } j, k    against    H_1: \gamma_{jk} \ne 0, \text{ some } j, k    (10-115)

Unlike (10-106), the test of (10-115) cannot be carried out in terms of a single sample. We shall use n samples for each RV x_{jk}. This yields the mrn samples (Fig. 10.18)

x_{jk}^i    i = 1, . . . , n    j = 1, . . . , m    k = 1, . . . , r

Figure 10.18  (The samples x_{jk}^i arranged by the groups of the two factors.)
The test is based again on the ANOVA principle. We form the sum

Q_5 = n \sum_{j=1}^{m}\sum_{k=1}^{r} (\bar x_{jk} - \bar x_{j.} - \bar x_{.k} + \bar x_{..})^2    (10-116)

where overbars indicate sample averages and dots indicate row and column averages as in (10-102). The sum Q_5 is identical to the sum Q_4 in (10-109) provided that we replace the RVs x_{jk} by \bar x_{jk}\sqrt{n}. From this and (10-110) it follows that if H_0 is true, then

\frac{Q_5}{\sigma^2} \quad\text{is}\quad \chi^2[(m - 1)(r - 1)]    (10-117)

If H_0 is not true, that is, if γ_{jk} ≠ 0, then the ratio Q_5/σ² has a noncentral χ² distribution, and

E\{Q_5\} = (m - 1)(r - 1)\sigma^2 + e    (10-118)

where again, e is the value of Q_5 if all RVs are replaced by their means:

e = n \sum_{j=1}^{m}\sum_{k=1}^{r} (\eta_{jk} - \eta_{j.} - \eta_{.k} + \eta_{..})^2 = n \sum_{j=1}^{m}\sum_{k=1}^{r} \gamma_{jk}^2

We next form the sum

Q_6 = \sum_{j=1}^{m}\sum_{k=1}^{r}\sum_{i=1}^{n} (x_{jk}^i - \bar x_{jk})^2    (10-119)

For specific j and k, the RVs x_{jk}^i are i.i.d., with sample mean \bar x_{jk}; hence [see (7-97)],

\frac{1}{\sigma^2}\sum_{i=1}^{n} (x_{jk}^i - \bar x_{jk})^2 \quad\text{is}\quad \chi^2(n - 1)

The quadratic form Q_6/σ² is the sum of mr such terms; therefore [see (7-87)],

\frac{Q_6}{\sigma^2} \quad\text{is}\quad \chi^2[mr(n - 1)]    (10-120)

Combining with (10-117), we conclude that the ratio

q = \frac{Q_5/(m - 1)(r - 1)}{Q_6/mr(n - 1)} \quad\text{is}\quad F[(m - 1)(r - 1),\; mr(n - 1)]    (10-121)

This leads to the following test of (10-115):

Accept H_0 iff  q < F_{1-\alpha}[(m - 1)(r - 1),\; mr(n - 1)]    (10-122)
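A sketch of the additivity test (10-122), again with hypothetical data (an m × r × n array of samples generated for the illustration).

```python
import numpy as np
from scipy.stats import f

def additivity_test(x, alpha=0.05):
    """x has shape (m, r, n): n samples of each RV x_jk. Test H0: gamma_jk = 0."""
    m, r, n = x.shape
    cell = x.mean(axis=2)                                   # sample averages x-bar_jk
    row = cell.mean(axis=1, keepdims=True)                  # x-bar_j.
    col = cell.mean(axis=0, keepdims=True)                  # x-bar_.k
    grand = cell.mean()

    Q5 = n * np.sum((cell - row - col + grand) ** 2)        # (10-116)
    Q6 = np.sum((x - cell[:, :, None]) ** 2)                # (10-119)

    q = (Q5 / ((m - 1) * (r - 1))) / (Q6 / (m * r * (n - 1)))   # (10-121)
    return q, f.ppf(1 - alpha, (m - 1) * (r - 1), m * r * (n - 1))

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=(2, 3, 4))         # no interaction by construction
q, thr = additivity_test(x)
print(f"q = {q:.2f}, threshold = {thr:.2f}")                # q should fall below the threshold
```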
10-6
Neyman-Pearson, Sequential,
and Likelihood Ratio Tests
In the design of a hypothesis test, we assign a value α to the Type I error probability and search for a region D_c of the sample space minimizing the resulting Type II error probability. Such a test is called most powerful. A simpler approach is the determination of a test based on a test statistic: We select a function q = g(X) of the sample vector X and search for a region R_c of the real line minimizing β. We shall say that the statistic q is most powerful if a test so designed is most powerful. In our earlier discussion, we selected the function g(X) empirically. In general, such a choice of g(X) does not lead to a most powerful test. In the following, we determine the function g(X) such that a test based on the test statistic g(X) is most powerful, and we determine the properties of the corresponding critical region. The analysis will involve the simple hypothesis θ = θ_0 against the simple hypothesis θ = θ_1.
We denote by f(X, θ) = f(x_1, θ) · · · f(x_n, θ) the joint density of the n samples x_i of x, and we form the ratio

r = r(X) = \frac{f(X, \theta_0)}{f(X, \theta_1)}    (10-123)

We shall show that r is a most powerful test statistic.
NEYMAN-PEARSON CRITERION   Suppose that D_c is the critical region of the test of the hypothesis

H_0: \theta = \theta_0    against    H_1: \theta = \theta_1    (10-124)

We maintain that the test is most powerful iff the region D_c is such that

r(X) \le c \text{ for } X \in D_c    and    r(X) > c \text{ for } X \notin D_c    (10-125)

Thus r = c on the boundary of D_c, r < c in the interior of D_c, and r > c outside D_c.
The constant c is specified in terms of the Type I error probability

\alpha = P\{X \in D_c \mid H_0\} = P\{r \le c \mid H_0\}    (10-126)

The resulting Type II error probability equals

\beta = P\{X \notin D_c \mid H_1\} = P\{r > c \mid H_1\}    (10-127)
• Proof. Suppose that D_c' is the critical region of another test and the corresponding error probabilities equal α' and β'. It suffices to show that

if α' = α then β' > β

In Fig. 10.19, we show the sets D_c = A ∪ B and D_c' = A ∪ B', where A = D_c ∩ D_c'. Under hypothesis H_0, the probability masses in the regions B and B' are equal; hence, we can assign to each differential element ΔV_i of B centered at X a corresponding element ΔV_i' of B' centered at X' with the same mass:

f(X, \theta_0)\,\Delta V_i = f(X', \theta_0)\,\Delta V_i'

Under hypothesis H_1, the masses in the regions ΔV_i and ΔV_i' equal f(X, θ_1)ΔV_i and f(X', θ_1)ΔV_i', respectively. Hence, as we replace ΔV_i by ΔV_i', the resulting increase in β equals

\Delta\beta_i = f(X, \theta_1)\,\Delta V_i - f(X', \theta_1)\,\Delta V_i'

Figure 10.19  (The regions A, B, B' and the boundary r = c.)

But

f(X, \theta_0) \le c\, f(X, \theta_1) \qquad f(X', \theta_0) > c\, f(X', \theta_1)

Therefore,

\Delta\beta_i > \frac{1}{c}\left[ f(X, \theta_0)\,\Delta V_i - f(X', \theta_0)\,\Delta V_i' \right] = 0

And since β' = β + Σ Δβ_i, we conclude that β' > β.
The ratio r will be called the NP test statistic. It is a statistic because the constants θ_0 and θ_1 are known. From the Neyman-Pearson criterion it follows that r is a most powerful test statistic. The corresponding critical region is the interval r ≤ c, where c = r_α is the α-percentile of r under H_0 [see (10-126)]. The most powerful test of (10-124) thus proceeds as follows: Observe X and compute the ratio r.

Reject H_0 iff  \frac{f(X, \theta_0)}{f(X, \theta_1)} \le r_\alpha    (10-128)

Note   There is a basic difference between the NP test statistic and the test statistics considered earlier. The NP statistic r is specified in terms of the density of x, and it is in all cases most powerful. In general, this is not true for the empirically chosen test statistics. However, as the following illustrations suggest, for most cases of interest empirical test statistics generate most powerful tests.
To carry out the test in (10-128), we must determine the distribution of r. This we can do, in principle, with the techniques of Section 7-1; however, the computations are not always simple. The problem is simplified if r is of the form

r = \psi(q)    (10-129)

where q is a statistic with known distribution. We can then replace the test (10-128) with a test based on q. Such a test is most powerful, and its critical region is determined as follows: Suppose, first, that the function ψ(q) is monotonically increasing, as in Fig. 10.20a. In this case, r ≤ c iff q ≤ c_α; hence,

\alpha = P\{r \le c \mid H_0\} = P\{q \le c_\alpha \mid H_0\}    (10-130a)

Denoting by q_u the u-percentile of q, we conclude that c_α = q_α. Thus H_0 is rejected iff q ≤ q_α. Suppose, next, that ψ(q) is monotonically decreasing (Fig. 10.20b). We now have

\alpha = P\{r \le c \mid H_0\} = P\{q \ge c_\alpha \mid H_0\} \qquad c_\alpha = q_{1-\alpha}    (10-130b)

and H_0 is rejected iff q ≥ q_{1-α}.

Figure 10.20  (The function ψ(q) in the increasing, decreasing, and general cases.)

The general case is not as simple because the critical region R_c might consist of several segments. For the curve of Fig. 10.20c, R_c consists of the half-line q ≤ c_1 and the interval c_2 ≤ q ≤ c_3; hence,

\alpha = P\{q \le c_1 \mid H_0\} + P\{c_2 \le q \le c_3 \mid H_0\}

To find R_c, we assign values to c, determine the corresponding points c_i, and compute the sum until its value equals α.
Example 10.33   The RV x is N(θ, σ) where σ is a known constant. In this case, the NP ratio r equals

r = \exp\left\{ \frac{1}{2\sigma^2}\left[ \sum_i (x_i - \theta_1)^2 - \sum_i (x_i - \theta_0)^2 \right] \right\} = \exp\left\{ \frac{n}{2\sigma^2}\left[ (\theta_1^2 - \theta_0^2) - 2(\theta_1 - \theta_0)\bar x \right] \right\}    (10-131)

This shows that the NP test statistic is a function r = ψ(x̄) of the sample mean x̄ of x. From this it follows that x̄ is a most powerful test statistic of the mean θ of x. To find the corresponding critical region, we use (10-130): If θ_1 < θ_0, then ψ(x̄) is a monotonically increasing function; hence, we reject H_0 iff x̄ < q_α, where q_u is the u-percentile of x̄ under hypothesis H_0. If θ_1 > θ_0, then ψ(x̄) is monotonically decreasing; hence, we reject H_0 iff x̄ > q_{1-α}.
Note that the test against the simple hypothesis θ = θ_1 < θ_0 is the same as the test against the composite hypothesis θ < θ_0 because the critical region x̄ < q_α does not depend on θ_1. And since the test is most powerful for every θ_1 < θ_0, we conclude that x̄ < q_α is the uniformly most powerful test of the hypothesis θ = θ_0 against θ < θ_0. Similarly, x̄ > q_{1-α} is the uniformly most powerful test of the hypothesis θ = θ_0 against θ > θ_0. •
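A small numerical illustration of Example 10.33, with invented numbers; the point is only that the NP ratio is monotone in the sample mean.

```python
import numpy as np
from scipy.stats import norm

sigma, theta0, theta1, n, alpha = 2.0, 20.0, 24.0, 25, 0.05

rng = np.random.default_rng(1)
x = rng.normal(theta0, sigma, size=n)        # samples drawn under H0 for the illustration
xbar = x.mean()

# Since theta1 > theta0, the most powerful test rejects H0 for large xbar.
q_crit = norm.ppf(1 - alpha, loc=theta0, scale=sigma / np.sqrt(n))   # q_{1-alpha} of xbar under H0
print(f"xbar = {xbar:.2f}, reject H0: {xbar > q_crit}")

# The NP ratio itself, per (10-131); it is a decreasing function of xbar here.
r = np.exp(n / (2 * sigma**2) * ((theta1**2 - theta0**2) - 2 * (theta1 - theta0) * xbar))
print(f"NP ratio r = {r:.3g}")
```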
Exponential Type Distributions   The results of Example 10.33 can be readily generalized. Suppose that the RV x has an exponential type distribution

f(x, \theta) = h(x)\exp\{ a(\theta)\,\phi(x) - b(\theta) \}

as in (9-91). In this case, the NP ratio equals

r = \exp\{ [a(\theta_0) - a(\theta_1)]\,G(X) - [b(\theta_0) - b(\theta_1)] \}    (10-132)

where G(X) = Σ φ(x_i). Thus r is a function ψ(q) of the test statistic q = G(X); hence, q is a most powerful test statistic of the hypothesis θ = θ_0 against θ = θ_1. We shall now examine tests against the composite hypothesis θ < θ_0 or θ > θ_0.
Suppose that a(θ) is a monotonically increasing function of θ. If θ_1 < θ_0 [see (10-132)], the function ψ(q) is also monotonically increasing. From this it follows that the test statistic q generates the uniformly most powerful test of the hypothesis θ = θ_0 against θ < θ_0, and its critical region is G(X) ≤ q_α, where q_u is the u-percentile of G(X) under hypothesis H_0. If θ_1 > θ_0, then ψ(q) is monotonically decreasing, and the test against θ > θ_0 is uniformly most powerful with critical region G(X) ≥ q_{1-α}. If a(θ) is monotonically decreasing, G(X) is again the uniformly most powerful test statistic against the hypothesis θ < θ_0 or θ > θ_0, and the corresponding critical regions are G(X) ≥ q_{1-α} and G(X) ≤ q_α, respectively.
Example 10.34
The RV x is N(η, θ) where η is a known constant. In this case,

f(x, \theta) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x - \eta)^2}{2\theta^2} - \ln\theta \right\}

This is an exponential type density with

a(\theta) = -\frac{1}{2\theta^2} \qquad \phi(x) = (x - \eta)^2 \qquad q = \sum_i (x_i - \eta)^2

Thus a(θ) is a monotonically increasing function of θ; hence, the statistic q generates the uniformly most powerful tests of the hypothesis θ = θ_0 against θ < θ_0 or θ > θ_0. The corresponding critical regions are q ≤ q_α and q ≥ q_{1-α}, respectively. To complete the test, we must find the distribution of q under hypothesis H_0. As we know, if θ = θ_0, then the RV

\frac{1}{\theta_0^2}\sum_i (x_i - \eta)^2 \quad\text{is}\quad \chi^2(n)

Hence, q_u = θ_0² χ²_u(n). This leads to the following tests:

H_1: θ < θ_0    Reject H_0 iff Σ(x_i − η)² < θ_0² χ²_α(n)
H_1: θ > θ_0    Reject H_0 iff Σ(x_i − η)² > θ_0² χ²_{1-α}(n)

We have thus reestablished earlier results [see (10-36)]. However, using the NP criterion, we have shown that the tests are uniformly most powerful. •
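A sketch of the one-sided variance test in Example 10.34, with illustrative numbers only.

```python
import numpy as np
from scipy.stats import chi2

eta, theta0, alpha = 0.0, 1.5, 0.05          # known mean, hypothesized standard deviation

rng = np.random.default_rng(2)
x = rng.normal(eta, 2.0, size=40)            # data actually drawn with a larger spread

q = np.sum((x - eta) ** 2)
threshold = theta0**2 * chi2.ppf(1 - alpha, len(x))   # theta0^2 * chi^2_{1-alpha}(n)
print(f"q = {q:.1f}, threshold = {threshold:.1f}")
print("reject H0 (theta > theta0)" if q > threshold else "accept H0")
```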
Sufficient Statistics   Suppose that the parameter θ has a sufficient statistic y(X); that is, the function f(X, θ) is of the form

f(X, \theta) = H(X)\, J[y(X), \theta]

as in (9-95). In this case, the NP ratio equals

r = \frac{f(X, \theta_0)}{f(X, \theta_1)} = \frac{J[y(X), \theta_0]}{J[y(X), \theta_1]}

Thus r is a function of y(X); hence, the sufficient statistic q = y(X) of θ is a most powerful test statistic of the hypothesis θ = θ_0 against θ = θ_1.

Example 10.35
We have shown in Example 9.29 that if x is uniform in the interval (0, θ), then the maximum q = x_max of its samples x_i is a sufficient statistic of θ, and its distribution [see (9-100)] equals

F_q(q, \theta) = \frac{q^n}{\theta^n}\, U(\theta - q)

From this it follows that q is the most powerful test statistic of θ, and

u = P\{q \le q_u \mid H_0\} = F_q(q_u, \theta_0) = \frac{q_u^n}{\theta_0^n}

Hence, q_u = θ_0 u^{1/n}. Furthermore [see (10-123)],

r = \left( \frac{\theta_1}{\theta_0} \right)^n \frac{U(\theta_0 - q)}{U(\theta_1 - q)}

Since r is a function of q, we shall determine the critical region R_c of the test directly in terms of q. If θ_1 < θ_0, then R_c is the interval q ≤ c, where

\alpha = P\{q \le c \mid H_0\} = \left( \frac{c}{\theta_0} \right)^n \qquad c = \theta_0\,\alpha^{1/n}

The corresponding β error equals

\beta = P\{q > c \mid H_1\} = 1 - \left( \frac{c}{\theta_1} \right)^n = 1 - \alpha\left( \frac{\theta_0}{\theta_1} \right)^n

If θ_1 > θ_0, then R_c is the interval q > c, and

\alpha = P\{q > c \mid H_0\} = 1 - \left( \frac{c}{\theta_0} \right)^n \qquad c = \theta_0 (1 - \alpha)^{1/n}

\beta = P\{q \le c \mid H_1\} = \left( \frac{c}{\theta_1} \right)^n = (1 - \alpha)\left( \frac{\theta_0}{\theta_1} \right)^n    •
Sequential Hypothesis Testing

In all tests considered so far, the number n of samples used was specified in advance, either directly in terms of the complexity of the test or indirectly in terms of the error probabilities α and β. Now we consider a different approach. We continue the test until the data indicate whether we should accept one or the other of the two hypotheses. Suppose that we wish to test whether a coin is fair. If at the 20th toss heads has shown 9 times, we decide that the coin is fair; if heads has shown 16 times, we decide that it is loaded; if heads has shown 13 times, we consider the evidence inconclusive and continue the test. In this approach, the length of the test, that is, the number n of trials required until a decision is reached, is random. It might be arbitrarily large. However, for the same error probabilities, it is on the average smaller than the number required for a fixed-length test. Such a test is called sequential.
We shall describe a sequential test of two simple hypotheses:

H_0: \theta = \theta_0    against    H_1: \theta = \theta_1

The method is based on the NP statistic (10-123). We form the ratio

r_m = \frac{f(X_m, \theta_0)}{f(X_m, \theta_1)} \qquad f(X_m, \theta) = f(x_1, \theta) \cdots f(x_m, \theta)

where f(X_m, θ) is the joint density of the first m samples x_1, . . . , x_m of x. We select two constants c_0 and c_1 such that c_0 < c_1.
At the mth trial, we reject H_0 if r_m < c_0; we accept H_0 if r_m > c_1; we continue the test if c_0 < r_m < c_1. Thus the test is completed at the nth step (Fig. 10.21) iff

c_0 < r_m < c_1 \text{ for every } m < n \quad\text{and}\quad r_n < c_0 \text{ or } r_n > c_1    (10-133)

The ratio r_m can be determined recursively:

r_m = r_{m-1}\,\frac{f(x_m, \theta_0)}{f(x_m, \theta_1)}

To carry out the test, we must express the constants c_0 and c_1 in terms of the error probabilities α and β. This is a difficult task. We shall give only an approximation.

• Theorem. In a sequential test carried out with the constants c_0 and c_1, the Type I and Type II error probabilities α and β satisfy the inequalities

\frac{\alpha}{1 - \beta} \le c_0 \qquad \frac{1 - \alpha}{\beta} \ge c_1    (10-134)
Figure 10.21  (The ratio r_m plotted against m: the test stops when r_m first leaves the band c_0 < r_m < c_1; crossing c_1 accepts H_0, crossing c_0 rejects H_0.)

Figure 10.22  (The region of the (α, β) plane bounded by the lines α + βc_0 = c_0 and α + βc_1 = 1.)
The proof is outlined in Problem 10-30. These inequalities show that the point (α, β) is in the shaded region of Fig. 10.22 bounded by the lines α + βc_0 = c_0 and α + βc_1 = 1. Guided by this, we shall use for c_0 and c_1 the constants

c_0 = \frac{\alpha}{1 - \beta} \qquad c_1 = \frac{1 - \alpha}{\beta}    (10-135)

In the construction of Fig. 10.22, (α, β) is the intersection of the two lines. From (10-134) it follows that with c_0 and c_1 so chosen, the resulting error probabilities are in the shaded region; hence, in most cases, the choice is conservative.
Example 10.36
The RV x is N(η, 2). We shall test sequentially the hypothesis

H_0: \eta = 20    against    H_1: \eta = 24

with α = .05, β = .1.
Denoting by x̄_m the average of the first m samples x_i of x, we conclude from (10-131) with θ_0 = 20, θ_1 = 24, and σ = 2 that

r_m = \exp\left\{ \frac{m}{8}\left[ (24^2 - 20^2) - 2(24 - 20)\bar x_m \right] \right\} = \exp\{ -m(\bar x_m - 22) \}

From (10-135) it follows that

c_0 = \frac{.05}{.9} \qquad c_1 = \frac{.95}{.1} \qquad \ln c_0 = -2.89 \qquad \ln c_1 = 2.25

And since ln r_m = −m(x̄_m − 22), we conclude that c_0 < r_m < c_1 iff −2.89 < −m(x̄_m − 22) < 2.25, that is, iff

22 - \frac{2.25}{m} < \bar x_m < 22 + \frac{2.89}{m}

Thus the test terminates when x̄_m crosses the boundary lines 22 − 2.25/m and 22 + 2.89/m of the uncertainty interval (Fig. 10.23a). We accept H_0 if x̄_m crosses the lower line; we reject H_0 if it crosses the upper line. •

Figure 10.23  (The uncertainty bands of Examples 10.36 and 10.37 plotted against m.)
Example 10.37
Suppose that A is an event with probability p = P(A). We shall test the hypothesis

H_0: p = .4    against    H_1: p = .5

with α = .05, β = .1.
We form the zero-one RV x associated with the event A. This RV is of the discrete type with point density p^x(1 − p)^{1−x}, where x = 0 or 1. Hence, f(X_m, p) = p^{k_m}(1 − p)^{m − k_m}, and

r_m = \left( \frac{p_0}{p_1} \right)^{k_m} \left( \frac{1 - p_0}{1 - p_1} \right)^{m - k_m} = \left( \frac{4}{5} \right)^{k_m} \left( \frac{6}{5} \right)^{m - k_m}

where k_m is the number of successes of A in the first m trials. In this case, r_m is a monotonically decreasing function of the sample mean x̄_m = k_m/m; hence, −2.89 < ln r_m < 2.25 iff

.45 - \frac{5.56}{m} < \bar x_m < .45 + \frac{7.14}{m}

This establishes the boundaries of the uncertainty region of the test as in Fig. 10.23b. •
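A sketch of the sequential test of Example 10.37, by simulation only; the Bernoulli probabilities .4 and .5 are the ones used above.

```python
import numpy as np

p0, p1, alpha, beta = 0.4, 0.5, 0.05, 0.10
ln_c0, ln_c1 = np.log(alpha / (1 - beta)), np.log((1 - alpha) / beta)   # constants (10-135)

def sprt(samples):
    """Return ('reject H0' | 'accept H0', number of trials used)."""
    ln_r = 0.0
    for m, x in enumerate(samples, start=1):
        # recursive update of ln r_m with the density p^x (1-p)^(1-x)
        ln_r += x * np.log(p0 / p1) + (1 - x) * np.log((1 - p0) / (1 - p1))
        if ln_r < ln_c0:
            return "reject H0", m
        if ln_r > ln_c1:
            return "accept H0", m
    return "undecided", len(samples)

rng = np.random.default_rng(3)
print(sprt(rng.random(1000) < 0.5))      # data generated with p = .5: usually rejects H0
print(sprt(rng.random(1000) < 0.4))      # data generated with p = .4: usually accepts H0
```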
Likelihood Ratio Tests

So far, we have considered tests of a simple null hypothesis H_0 against an alternative hypothesis H_1, simple or composite, and we have assumed that the parameter θ was a scalar. Now we shall develop a general method for testing any hypothesis H_0, simple or composite, involving the vector θ = [θ_1, . . . , θ_k].
Consider an RV x with density f(x, θ). We shall test the hypothesis H_0 that θ is in a region Θ_0 of the k-dimensional parameter space against the hypothesis H_1 that θ is in a region Θ_1 of that space:

H_0: \theta \in \Theta_0    against    H_1: \theta \in \Theta_1    (10-136)

The union Θ = Θ_0 ∪ Θ_1 of the sets Θ_0 and Θ_1 is the parameter space.
For a given sample vector X, the joint density f(X, θ) is a function of the parameter θ. The point at which this function attains its maximum as θ ranges over the parameter space Θ will be denoted by θ_m. In the language of Section 9-5, f(X, θ) is the likelihood function and θ_m the maximum likelihood estimate of θ. The point at which f(X, θ) attains its maximum as θ ranges over the set Θ_0 will be denoted by θ_m0. If H_0 is the simple hypothesis θ = θ_0, then the region Θ_0 consists of the single point θ_0 and θ_m0 = θ_0.
With θ_m and θ_m0 so defined, we form the ratio

\lambda = \frac{f(X, \theta_{m0})}{f(X, \theta_m)}    (10-137)
A test of (10-136) based on the corresponding statistic λ is called the likelihood ratio (LR) test. We shall determine its critical region. Clearly, f(X, θ_m0) ≤ f(X, θ_m); hence, for every X,

0 \le \lambda \le 1    (10-138)

From this it follows that the density of λ is 0 outside the interval (0, 1). We maintain that it is concentrated near 1 if θ ∈ Θ_0 and to the left of 1 if θ ∈ Θ_1, as in Fig. 10.24. The data X are the observed samples of an RV x with distribution f(x, θ̄), where θ̄ is a specific unknown value. The point θ_m is the ML estimate of θ̄; therefore (see Section 9-5), it is close to θ̄ if n is large.

Figure 10.24  (The density of λ: concentrated near 1 under H_0, shifted to the left of 1 under H_1.)

Under hypothesis H_0, θ̄ is in the set Θ_0; hence, with probability close to 1, θ_m equals θ_m0 and λ ≈ 1. If θ̄ is in Θ_1, then θ_m is different from θ_m0 and λ is less than 1. These observations form the basis of the LR test. The test proceeds as follows: Observe X and find the extremes θ_m and θ_m0 of the likelihood function f(X, θ).

Reject H_0 iff  \lambda \le c    (10-139)
To complete the test, we must find c. If H_0 is the simple hypothesis θ = θ_0, then c equals the corresponding percentile λ_α of λ. If, however, H_0 is composite, the Type I error probability α(θ) = P{λ ≤ c | H_0} is no longer a constant. In this case, we select a constant α_0 as before and determine c such that α(θ) does not exceed α_0. Thus c is the largest number such that

P\{\lambda \le c \mid \theta\} \le \alpha_0 \quad\text{for every } \theta \in \Theta_0    (10-140)

In many cases, P{λ ≤ c | H_0} is maximum for θ = θ_0. In such cases, c is determined as in the simple hypothesis θ = θ_0; that is, the constant c is such that P{λ ≤ c | θ = θ_0} = α_0.
Example 10.38
The RV x has an exponential density f(x, θ) = θe^{−θx}U(x). We shall test the hypothesis

H_0: \theta \le \theta_0    against    H_1: \theta > \theta_0

using the LR test. In this case (Fig. 10.25a),

f(X, \theta) = \theta^n e^{-\theta n \bar x} \qquad x_i > 0

As we see,

\theta_m = \frac{1}{\bar x} \qquad \theta_{m0} = \begin{cases} 1/\bar x & \bar x \theta_0 > 1 \\ \theta_0 & \bar x \theta_0 < 1 \end{cases}

Figure 10.25  (a: the likelihood f(X, θ) as a function of θ; b: the resulting λ as a function of x̄.)

Thus λ is an increasing function of x̄ (Fig. 10.25b), equal to 1 for x̄θ_0 ≥ 1. Hence, in this case, the LR test is equivalent to a test with test statistic q = x̄. Since λ ≤ c iff x̄ ≤ c_1, the corresponding critical region is x̄ ≤ c_1, where c_1 is such that P{x̄ ≤ c_1 | θ_0} = α_0. •
ASYMPTOTIC PROPERTIES OF LIKELIHOOD RATIOS   In an LR test, we must find the distribution of the LR λ; in general, this is not a simple problem. The problem is simplified if we can find a function q = ψ(λ) with known distribution. If ψ(λ) is monotonically increasing or decreasing, then the LR test λ ≤ c is equivalent to the test q ≤ c_1 or q ≥ c_1. Example 10.38 illustrated this. Of particular interest is the transformation w = −2 ln λ. As the next theorem shows, for large n, the RV w has a χ² distribution regardless of the form of f(x, θ). We shall need the following concept.

Free Parameters   Suppose that the distribution of x depends on k parameters θ_1, . . . , θ_k. We shall say that θ_i is a free parameter if its possible values in the parameter space Θ are noncountable. The number of free parameters of Θ will be denoted by N. We define similarly the number N_0 of free parameters of Θ_0. Suppose that x is a normal RV with mean η and variance σ², where η is any number and σ > 0. In this case, Θ is the half-plane −∞ < η < ∞, σ > 0; hence, N = 2. If under the null hypothesis η = η_0 and σ > 0, then only σ is a free parameter, and N_0 = 1. If η ≥ η_0 and σ > 0, then both parameters are free, and N_0 = 2. Finally, if η = η_0 and σ = σ_0, then N_0 = 0.

• Theorem. If N > N_0 and n is large, then under hypothesis H_0, the statistic

w = -2 \ln \lambda \quad\text{is}\quad \chi^2(N - N_0)    (10-141)

The proof of this rather difficult theorem will not be given. We shall demonstrate its validity with two examples.
Example 10.39
The RV x is normal with mean η and known variance v. We shall test the hypothesis

H_0: \eta = \eta_0    against    H_1: \eta \ne \eta_0    (10-142)

using the LR test, and we shall show that the result agrees with (10-141). In this problem, we have one unknown parameter and, within a constant factor,

f(X, \eta) = \exp\left\{ -\frac{1}{2v}\sum_i (x_i - \eta)^2 \right\}

In this problem, η_m0 = η_0 because H_0 is the simple hypothesis η = η_0. To find η_m, we observe that

\sum_i (x_i - \eta)^2 = \sum_i (x_i - \bar x)^2 + n(\bar x - \eta)^2    (10-143)

Thus f(X, η) is maximum when the term (x̄ − η)² is minimum, that is, when η = x̄; hence, η_m = x̄. From this it follows that

\lambda = \frac{ \exp\left\{ -\frac{1}{2v}\sum_i (x_i - \eta_0)^2 \right\} }{ \exp\left\{ -\frac{1}{2v}\sum_i (x_i - \bar x)^2 \right\} } = \exp\left\{ -\frac{n}{2v}(\bar x - \eta_0)^2 \right\}    (10-144)

Figure 10.26  (λ plotted as a function of x̄; λ ≤ c iff |x̄ − η_0| ≥ c_1.)

In Fig. 10.26, we plot λ as a function of x̄. As we see, λ ≤ c iff |x̄ − η_0| ≥ c_1. This shows that the LR test is equivalent to the test based on the test statistic q of (10-9).
To verify (10-141), we observe that

w = -2 \ln \lambda = \frac{(\bar x - \eta_0)^2}{v/n}

Under hypothesis H_0, the RV x̄ is normal with mean η_0 and variance v/n; hence, w is χ²(1). This agrees with (10-141) because N = 1 and N_0 = 0. •
Example 10.40
We shall again test (10-142), but now we assume that both parameters η and v are unknown. Reasoning as in Example 9.27, we find

\eta_{m0} = \eta_0 \qquad v_{m0} = \frac{1}{n}\sum_i (x_i - \eta_0)^2 \qquad \eta_m = \bar x \qquad v_m = \frac{1}{n}\sum_i (x_i - \bar x)^2    (10-145)

The resulting LR equals

\lambda = \frac{ v_{m0}^{-n/2} \exp\left\{ -\frac{1}{2v_{m0}}\sum_i (x_i - \eta_0)^2 \right\} }{ v_m^{-n/2} \exp\left\{ -\frac{1}{2v_m}\sum_i (x_i - \bar x)^2 \right\} } = \left( \frac{v_m}{v_{m0}} \right)^{n/2}

This yields [see (10-143)]

\lambda = \left( \frac{1}{1 + y^2} \right)^{n/2} \qquad y^2 = \frac{(\bar x - \eta_0)^2}{v_m}

Thus λ is a decreasing function of |y|; hence, λ ≤ c iff |y| ≥ c_1.

Large n   In this example, N = 2 and N_0 = 1; to verify (10-141), we must show that the RV −2 ln λ is χ²(1).
Suppose that H_0 is true. We form the RVs

v_n = \frac{1}{n}\sum_i (x_i - \bar x)^2 \qquad y^2 = \frac{(\bar x - \eta_0)^2}{v_n}

As n increases, the variance of v_n approaches 0 [see (7-99)]. Hence, for large n,

v_n \simeq E\{v_n\} = \frac{n - 1}{n}\, v \simeq v

Furthermore, x̄ − η_0 → 0 as n → ∞. From this it follows that y² ≪ 1 and

-2 \ln \lambda = n \ln(1 + y^2) \simeq n y^2 \simeq \frac{(\bar x - \eta_0)^2}{v/n}

with probability close to 1. This agrees with (10-141) because the RV x̄ − η_0 is normal with 0 mean and variance v/n. •
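A numerical sketch of Example 10.40 and the asymptotic test (10-141), for illustration only.

```python
import numpy as np
from scipy.stats import chi2

def lr_test_mean(x, eta0, alpha=0.05):
    """LR test of H0: eta = eta0 (variance unknown), using w = -2 ln(lambda) ~ chi2(1)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    v_m = np.mean((x - x.mean()) ** 2)            # ML variance estimate over Theta
    v_m0 = np.mean((x - eta0) ** 2)               # ML variance estimate under H0
    w = n * np.log(v_m0 / v_m)                    # -2 ln lambda, since lambda = (v_m / v_m0)^(n/2)
    return w, chi2.ppf(1 - alpha, 1), w > chi2.ppf(1 - alpha, 1)   # True means "reject H0"

rng = np.random.default_rng(4)
x = rng.normal(10.3, 2.0, size=200)
print(lr_test_mean(x, eta0=10.0))
```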
Problems
10-1  We are given an RV x with mean η and standard deviation σ = 2, and we wish to test the hypothesis η = 8 against η = 8.7 with α = .01, using as test statistic the sample mean x̄ of n samples. (a) Find the critical region R_c of the test and the resulting β if n = 64. (b) Find n and R_c if β = .05.
10-2  A new car is introduced with the claim that its average mileage in highway driving is at least 28 miles per gallon. Seventeen cars are tested, and the following mileage is obtained:
      19  20  24  25  26  26.8  27.2  27.5  28  28.2  28.4  29  30  31  32  33.3  35
      Can we conclude with significance level at most .05 that the claim is true?
10-3  The weights of cereal boxes are the values of an RV x with mean η. We measure 64 boxes and find that x̄ = 7.7 oz. and s = 1.5 oz. Test the hypothesis H_0: η = 8 oz. against H_1: η ≠ 8 oz. with α = .1 and α = .01.
10-4  We wish to examine the effects of a new drug on blood pressure. We select 250 patients with about the same blood pressure and separate them into two groups. The first group of 150 patients receives the drug in tablets, and the second group of 100 patients is given identical tablets without the drug. At the end of the test period, the blood pressure of each group, modeled by the RVs x and y, respectively, is measured, and it is found that x̄ = 130, s_x = 5 and ȳ = 135, s_y = 8. Assuming normality, test the hypothesis η_x = η_y against η_x ≠ η_y with α = .05.
10-5  Brand A batteries cost more than brand B batteries. Their life lengths are two RVs x and y. We test 16 batteries of brand A and 26 batteries of brand B and find these values, in hours:
      x̄ = 4.6    s_x = 1.1    ȳ = 4.2    s_y = 0.9
      Test the hypothesis η_x = η_y against η_x > η_y with α = .05.
10-6  Given r RVs x_k with the same variance σ² and with means E{x_k} = η_k, we wish to test the hypothesis

      H_0: \sum_{k=1}^{r} c_k \eta_k = 0    against    H_1: \sum_{k=1}^{r} c_k \eta_k \ne 0

      where c_k are r given constants. To do so, we observe n_k samples x_{ki} of the kth RV x_k and form their respective sample means x̄_k. Carry out the test using as the test statistic the sum y = Σ c_k x̄_k.
10-7  A coin is tossed 64 times, and heads shows 22 times. Test the hypothesis that the coin is fair with significance level .05.
10-8  We toss a coin 16 times, and heads shows k times. If k is such that k_1 ≤ k ≤ k_2, we accept the hypothesis that the coin is fair with significance level α = .05. Find k_1 and k_2 and the resulting β error: (a) using (10-20); (b) using the normal approximation (10-21).
10-9  In a production process, the number of defective units per hour is a Poisson-distributed RV x with parameter λ = 5. A new process is introduced, and it is observed that the hourly defectives in a 22-hour period are
      x_i = 3  0  5  4  2  6  4  1  5  3  7  4  0  8  3  2  4  3  6  5  6  9
      Test the hypothesis λ = 5 against λ < 5 with α = .05.
10-10  Given an N(η, σ) RV x with known η, we test the hypothesis σ = σ_0 against σ > σ_0 using as the test statistic the sum q = \frac{1}{\sigma_0^2}\sum_i (x_i - \eta)^2. Show that the resulting OC function β(σ) equals the area of the χ²(n) density from 0 to χ²_{1-α}(n)\,σ_0²/σ² (Fig. P10.10).

Figure P10.10
10-11  The RVs x and y model the grades of students and their parental income in thousands of dollars. We observe the grades x_i of 17 students and the income y_i of their parents and obtain

      x_i:  50  55  59  63  66  68  69  70  70  72  72  75  79  84  89  93  96
      y_i:  65  17  70  20  45  15  55  30  25  28  42  28  18  28  32  75  32

      Compute the empirical correlation coefficient r̂ of x and y [see (10-40)], and test the hypothesis r = 0 against r ≠ 0.
10-12  It is claimed that the time to failure of a unit is an RV x with density 3x²e^{−x³}U(x) (Weibull). To test the claim, we determine the failure times x_i of 80 units and form the empirical distribution F̂(x) as in (10-43), and we find that the maximum of the distance between F̂(x) and the Weibull distribution F_0(x) equals 0.1. Using the Kolmogorov-Smirnov test, decide whether the claim is justified with α = .1.
10-13  We wish to compare the accuracies of two measuring instruments. We measure an object 80 times with each instrument and obtain the samples (x_i, y_i) of the RVs x = c + ν_a and y = c + ν_b. We then compare x_i and y_i and find that x_i > y_i 36 times, x_i < y_i 42 times, and x_i = y_i 2 times. Test the hypothesis that the distributions F_a(ν) and F_b(ν) of the errors are equal with α = .1.
10-14  The length of a product is an RV x with σ = 0.2. When the plant is in control, η = 30. We select the values α = .05, β(30.1) = .1 for the two error probabilities and use as the test statistic the sample mean x̄ of n samples of x. (a) Find n and design the control chart. (b) Find the probability that when the plant goes out of control and η = 30.1, the chart will cross the control limits at the next test.
10-15  A factory produces resistors. Their resistance is an RV x with standard deviation σ. When the plant is in control, σ = 200; when it is out of control, σ > 200. (a) Design a control chart using 10 samples at each test and α = .01. Use as test statistic the RV y = \sqrt{\sum_i (x_i - \bar x)^2}. (b) Find the probability p that when the plant is in control, the chart will not cross the control limits before the 25th test.
10-16  A die is tossed 102 times, and the ith face shows k_i = 18, 15, 19, 17, 13, and 20 times. Test the hypothesis that the die is fair with α = .05 using the chi-square test.
10-17  A utility proposes the construction of an atomic plant. The county objects, claiming that 55% of the residents oppose the plant, 35% approve, and 10% express no opinion. To test this claim, the utility questions 400 residents and finds that 205 oppose the plant, 150 approve, and 45 express no opinion. Do the results support the county's claim at the 5% significance level?
10-18  A computer prints out 1,000 numbers consisting of the 10 integers j = 0, 1, . . . , 9. The number n_j of times j appears equals
      n_j = 85  110  118  91  78  105  122  94  101  96
      Test the hypothesis that the numbers j are uniformly distributed between 0 and 9, with α = .05.
10-19  The RVs x and y take the values 0 and 1 with P{x = 0} = p_x, P{y = 0} = p_y. A computer generates the following paired samples (x_i, y_i) of these RVs:
      0 0 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 1 0 1 0 1 0 0
      1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1
      Using α = .05, test the following hypotheses: (a) p_x = .5 against p_x ≠ .5; (b) p_y = .5 against p_y ≠ .5; (c) the RVs x and y are independent.
10-20  Suppose that under the null hypothesis, the probabilities p_i satisfy the m equations (10-67). Show that for large n the sum

      \hat q = \sum_{i=1}^{m} \frac{(k_i - n p_i)^2}{n p_i} \quad\text{is minimum if}\quad \sum_{i=1}^{m} \frac{k_i - n p_i}{p_i}\,\frac{\partial p_i}{\partial \theta_j} = 0

      This shows that the ML estimates of the parameters θ_j minimize the modified Pearson's sum \hat q.
10-21  The duration of telephone calls is an RV x with distribution F(x). We monitor 100 calls x_i and observe that x_i < 7 minutes for every i and that the number n_k of calls of duration between k − 1 and k minutes equals 24, 20, 16, 15, 11, 8, 6 for k = 1, 2, . . . , 7, respectively. Test the hypothesis that F(x) = (1 − e^{−θx})U(x) with α = .1. (a) Assume that θ = 0.25. (b) Assume that θ is an unknown parameter, as in (10-67).
10-22  The RVs x_j^i are i.i.d. and N(η, σ). Show that if

      Q = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (x_j^i - \bar x)^2 \quad\text{then}\quad Q = Q_1 + Q_2

      where Q_1 and Q_2 are the sums in (10-94) and (10-96).
10-23  Three sections of a math class, taught by different teachers, take the same test, and their grades (46 in all, Sections 1, 2, 3) are as follows:
      38 42 41 45 51 53 56 46 50 54 57 58 61 60 65 62 66 67 65 68 70 66 72
      73 69 74 75 71 77 80 73 80 81 74 82 84 76 87 92 79 91 83 96 86 90 96
      (a) Assuming normality and equal variance, test the hypothesis that there is no significant difference in the average grades of the three sections with α = .05 [see (10-101)]. (b) Using (10-97) with N = 46 and m = 3, estimate the common variance of the section grades.
10-24  We divide the pupils in a certain grade into 12 age groups and 3 weight groups, and we denote by x_{jk} the grades of pupils in the jth weight group and the kth age group. Selecting one pupil from each of the 36 sets so formed, we observe the following grades:

      k =     1   2   3   4   5   6   7   8   9  10  11  12
      j = 1: 60  66  66  74  77  80  80  82  85  89  92  96
      j = 2: 58  70  69  73  75  77  78  80  83  87  91  93
      j = 3: 58  65  69  72  75  77  79  79  81  84  88  92

      Assuming normality and additivity, test the hypothesis that neither weight nor age has any effect on grades.
10-25  An RV x has the Erlang density

      f(x) = \frac{x^m}{m!}\, e^{-x} U(x)

      Using the NP criterion, test the hypothesis m = 0 against m = 1 (Fig. P10.25) with α = .25 in terms of a single sample x of x. Find the resulting Type II error β.

Figure P10.25

10-26  Suppose that f_r(r, θ) is the density of the NP ratio r = f(X, θ_0)/f(X, θ_1). (a) Show that f_r(r, θ_0) = r f_r(r, θ_1). (b) Find f_r(r, θ) for n = 1 if f(x, θ) = θe^{−θx}U(x), and verify (a).
10-27  Given an RV x with f(x, θ) = θ²xe^{−θx}U(x), we wish to test the hypothesis θ = θ_0 against θ < θ_0 using as test statistic the sum q = x_1 + · · · + x_n. (a) Show that the test

      Reject H_0 iff  2\theta_0 q > \chi^2_{1-\alpha}(4n)

      is most powerful. (b) Find the resulting OC function β(θ).
10-28  Given an RV x with density θx^{θ−1}, 0 < x < 1, and samples x_i, we wish to test the hypothesis θ = 2 against θ = 1. (a) Show that the NP criterion leads to the critical region x_1 · · · x_n < c. (b) Assuming that n is large, express the constant c and the Type II error probability β in terms of α.
10-29  We wish to test the hypothesis H_0 that the distribution of an RV x is uniform in the interval (0, 1) against the hypothesis H_1 that it is N(1.25, 0.25), as in Fig. P10.29, using as our data a single observation x of x. Assuming α = .1, determine the critical region of the test satisfying the NP criterion, and find β.
10-30  We denote by A_m, R_m, and U_m the regions of acceptance, rejection, and uncertainty, respectively, of H_0 at the mth step in sequential testing. (a) Show that the sets A_m and R_m, m = 1, 2, . . . , are disjoint. (b) Show that

      \alpha = \sum_{m=1}^{\infty} \int_{R_m} f(X_m, \theta_0)\, dX_m \le c_0 \sum_{m=1}^{\infty} \int_{R_m} f(X_m, \theta_1)\, dX_m = c_0 (1 - \beta)

      1 - \alpha = \sum_{m=1}^{\infty} \int_{A_m} f(X_m, \theta_0)\, dX_m \ge c_1 \sum_{m=1}^{\infty} \int_{A_m} f(X_m, \theta_1)\, dX_m = c_1 \beta

Figure P10.29
10-31  Given an event A with p = P(A), we wish to test the hypothesis p = p_0 against p ≠ p_0 in terms of the number k of successes of A in n trials, using the LR test. Show that

      \lambda = \frac{n^n\, p_0^k (1 - p_0)^{n-k}}{k^k (n - k)^{n-k}}

10-32  The number x of particles emitted from a radioactive substance in 1 second is a Poisson RV with mean θ. In 50 seconds, 1,058 particles are emitted. Test the hypothesis θ = 20 against θ ≠ 20 with α = .05 using the asymptotic approximation (10-141).
10-33  Using (10-146) and Example 10.40, show that the ANOVA test (10-101) is a special case of the LR test (10-139).
11
The Method of Least Squares
A common problem in many applications is the determination of a function φ(x) fitting in some sense a set of n points (x_i, y_i). The function φ(x) depends on m < n parameters, and the problem is to find these parameters or, equivalently, to solve the overdetermined system y_i = φ(x_i), i = 1, . . . , n. This problem can be given three interpretations. The first is deterministic: The coordinates x_i and y_i of the given points are 2n known numbers. The second is statistical: The abscissa x_i is a known number (controlled variable), but the ordinate y_i is the value of an RV y_i with mean φ(x_i). The third is a prediction: The numbers x_i and y_i are the samples of two RVs x and y, and the objective is to find the best predictor ŷ = φ(x) of y in terms of x. We investigate all three interpretations and show that the results are closely related.

11-1
Introduction
We are given n points
(x_1, y_1), ..., (x_n, y_n)
on a plane (Fig. 11.1) with coordinates the 2n arbitrary numbers x_i and y_i.
These numbers need not be different. We might have several points on a
Figure 11.1
horizontal or vertical line; in fact, some of the points might be identical. Our first objective is to find a straight line y = a + bx that fits "y on x" in the sense that the differences (errors)
y_i − (a + bx_i) = v_i   (11-1)
are as small as possible. The unknowns are the two constants a and b; their determination depends on the error criterion.
This is a special case of the problem of fitting a set of points with a general curve φ(x) depending on a number of unknown parameters. This problem is fundamental in the theory of measurements, and it arises in all areas of applied science. The curve φ(x) could be the statement of a physical law or an empirical function used for interpolation or extrapolation. Suppose that y is the temperature inside the earth at a distance x from the surface and that theoretical considerations lead to the conclusion that y = a + bx where a and b are unknown parameters. To determine a and b, we measure the temperature at n depths x_i. The numbers x_i are controlled variables in the sense that their values are known exactly. The temperature measurements y_i, however, involve errors, as in (11-1). We shall develop techniques for estimating the parameters a and b. These estimates can then be used to determine the temperature y_0 = a + bx_0 at a new depth x_0. An example of empirical curve fitting is the stock market. We denote by y_i the price of a stock at time x_i, and we fit a straight line y = a + bx, or a higher-order curve, through the n observations (x_i, y_i). The line is then used to predict the price y_0 of the stock at some future time x_0.
THREE INTERPRETATIONS  The curve-fitting problem can be given the following interpretations, depending on our assumptions about the points (x_i, y_i).
Deterministic  In the first interpretation, x_i and y_i are viewed as pairs of known numbers. These numbers might be the results of measurements involving random errors; however, this is not used in the curve-fitting process, and the goodness of fit is not interpreted in a statistical sense.
Statistical  In the second interpretation, the abscissas x_i are known numbers (controlled variables), but the ordinates y_i are the observed values of n RVs y_i with expected values a + bx_i.
Prediction  In the third interpretation, x_i and y_i are the samples of two RVs x and y. The sum a + bx is the linear predictor of y in terms of x, and the function φ(x) is its nonlinear predictor. The constants a and b and the function φ(x) are determined in terms of the statistical properties of the RVs x and y.
We shall develop all three interpretations. The results are closely related. The deterministic interpretation is not, strictly, a topic in statistics. It is covered here, however, because in most cases of interest, the data x_i and y_i are the results of observations involving random errors.
REGRESSION LINE  This term is used to characterize the straight line a + bx, or the curve φ(x), that fits y on x in the sense intended. The underlying analysis is called regression theory.
Note  The errors v_i are the deviations of y_i from a + bx_i (Fig. 11.2a). In this case, y = a + bx is a line fitting y on x. We could similarly search for a line x = α + βy that fits x on y. In this case, the errors μ_i = x_i − (α + βy_i) are the deviations of x_i from α + βy_i (Fig. 11.2b). The errors can, of course, be defined in other ways. For example, we can consider as errors the distances d_i of the points (x_i, y_i) from the line Ax + By + C = 0 (Fig. 11.2c). We shall discuss only the first case.
Overdetermined Systems  The curve-fitting problem is a problem of solving somehow a system of n equations involving m < n unknowns. For m = 2, this can be phrased as follows: Consider the system
a + bx_i = y_i    i = 1, ..., n
where x_i and y_i are given numbers and a and b are two unknowns. Clearly, if the points (x_i, y_i) are not on a straight line, this system does not have a
Figure 11.2
solution. To find a and b, we form the system
y_i − (a + bx_i) = v_i    i = 1, ..., n
and we determine a and b so as to minimize in some sense the "errors" v_i.
11-2
Deterministic Interpretation
We are given n points (x_i, y_i), and we wish to find a straight line y = a + bx fitting these points in the sense of minimizing the least squares (LS) error
Q = Σ v_i²   (11-2)*
where v_i = y_i − (a + bx_i). This error criterion is used primarily because of its computational simplicity. We start with two special cases.
Horizontal Line  Find a constant a_0 such that the line (Fig. 11.3a) y = a_0 is the LS fit of the n points (x_i, y_i) in the sense of minimizing the sum
Q_0 = Σ(y_i − a_0)²
This case arises in problems in which y_i are the measured values of a constant a_0 and v_i = y_i − a_0 are the measurement errors. The abscissas x_i might be the times of measurement; their values, however, are not relevant in the determination of a_0.
Clearly, Q_0 is minimum if
∂Q_0/∂a_0 = −2Σ(y_i − a_0) = 0
This yields
a_0 = (1/n) Σ y_i = ȳ   (11-3)
Thus a_0 is the average of the n numbers y_i.
* The notation Σ will mean Σ_{i=1}^{n}.
Figure 11.3
Homogeneous Line  Find a constant b_1 such that the line y = b_1x (Fig. 11.3b) is the best LS fit of the n points (x_i, y_i).
In this case,
Q_1 = Σ(y_i − b_1x_i)²   (11-4)
∂Q_1/∂b_1 = −2Σ(y_i − b_1x_i)x_i = 0   (11-5)
Hence, Q_1 is minimum if
b_1 = Σx_iy_i / Σx_i²   (11-6)
From (11-5) it follows that the LS error equals
Q_1 = Σ(y_i − b_1x_i)y_i = Σv_iy_i
Geometric Interpretation  The foregoing can be given a simple geometric interpretation. We introduce the vectors
X = [x_1, ..., x_n]    Y = [y_1, ..., y_n]    N = [v_1, ..., v_n]
and we denote by (X, Y) the inner product of X and Y and by |X| the magnitude of X. Thus
(X, Y) = Σx_iy_i    (X, X) = Σx_i² = |X|²   (11-7)
Clearly,
Q_1 = |N|²    where    N = Y − b_1X
is the error vector (Fig. 11.4). The length |N| of N depends on b_1, and it is minimum if N is orthogonal to X, that is, if
(N, X) = Σ(y_i − b_1x_i)x_i = 0   (11-8)
in agreement with (11-5). The linear LS fit can thus be phrased as follows: Q_1 is minimum if the error vector N is orthogonal to the data vector X (orthogonality principle) or, equivalently, if b_1X is the projection of Y on X (projection theorem).
GENERAL LINE  Find the LS fit of the points (x_i, y_i) by the line (Fig. 11.3c)
y = a + bx   (11-9)
In this case, the square error
Q = Σ[y_i − (a + bx_i)]²
Figure 11.4
is a function of the two constants a and b, and it is minimum if
∂Q/∂a = −2Σ[y_i − (a + bx_i)] = 0   (11-10a)
∂Q/∂b = −2Σ[y_i − (a + bx_i)]x_i = 0   (11-10b)
This yields the system
na + bΣx_i = Σy_i   (11-11a)
aΣx_i + bΣx_i² = Σx_iy_i   (11-11b)
Denoting by x̄ and ȳ the averages of x_i and y_i, respectively, we conclude from (11-11) that
a = ȳ − bx̄   (11-12a)
b = (nΣx_iy_i − Σx_iΣy_i)/(nΣx_i² − (Σx_i)²) = Σ(x_i − x̄)(y_i − ȳ)/Σ(x_i − x̄)² = Σ(x_i − x̄)y_i/(Σx_i² − n x̄²)   (11-12b)
This yields
y − ȳ = b(x − x̄)   (11-13)
Hence, the regression line y = a + bx passes through the point (x̄, ȳ), and its slope b equals the slope b_1 of the homogeneous line fitting the centered data (x_i − x̄, y_i − ȳ).
Note [see (11-10)] that
Σ[y_i − (a + bx_i)](a + bx_i) = 0
From this it follows that
Q = Σ[y_i − (a + bx_i)]y_i   (11-14)
The LS error Q does not depend on the location of the origin because [see (11-12)]
Q = Σ[(y_i − ȳ) − b(x_i − x̄)]² = Σ(y_i − ȳ)² − b²Σ(x_i − x̄)²   (11-15)
The ratio
Q / Σ(y_i − ȳ)² = 1 − r²    where    r² = [Σ(x_i − x̄)(y_i − ȳ)]² / [Σ(x_i − x̄)² Σ(y_i − ȳ)²]
is a normalized measure of the deviation of the points (x_i, y_i) from a straight line. Clearly (see Problem 9-27),
0 ≤ |r| ≤ 1   (11-16)
and |r| = 1 iff Q = 0, that is, iff y_i = a + bx_i for every i.
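As a computational aside, the fit (11-12), the error (11-15), and the ratio 1 − r² of (11-16) are easily evaluated numerically. The following Python sketch illustrates the formulas; the data arrays are hypothetical and serve only as an example.

# minimal sketch of (11-12), (11-15), and (11-16); data are hypothetical
def fit_line(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                      # sum (x_i - xbar)^2
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                                                # slope, (11-12b)
    a = ybar - b * xbar                                          # intercept, (11-12a)
    syy = sum((yi - ybar) ** 2 for yi in y)
    Q = syy - b * b * sxx                                        # LS error, (11-15)
    r2 = sxy ** 2 / (sxx * syy)                                  # so Q / syy = 1 - r^2
    return a, b, Q, r2

x = [1, 2, 3, 4, 5]            # hypothetical data
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(fit_line(x, y))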
Example 11.1
We wish to fit a line to the points
x_i   2.3  3    3.4  4.2  4.2  5.1  6
y_i   56   52   51   57   61   67   73
x_i   6    7.2  8    9    9.8  11   12
y_i   73   70   82   89   99   86   105
Figure 11.5
Case 1: y = b_1x.  From the data we obtain
Σx_iy_i = 7,345    Σx_i² = 717    b_1 = 10.24    Σy_i² = 78,973
Hence (Fig. 11.5),
Q_1 = Σy_i² − b_1Σx_iy_i = 3,760
Case 2: y = a + bx.  The data yield x̄ = 6.514, ȳ = 73.357. Inserting into (11-12) and (11-14), we obtain
a = 38.67    b = 5.32    Q = 150 •
Multiple Linear Regression
The straight-line fit is a special case of the problem of multiple linear regression involving higher-order polynomials or other controlled variables. In this problem, the objective is to find a linear function
y = c_1w_1 + ··· + c_mw_m   (11-17)
of the m controlled variables w_k fitting the points
w_{1i}, ..., w_{mi}, y_i    i = 1, ..., n
This is the homogeneous linear regression. The nonhomogeneous linear regression is the sum
c_0 + c_1w_1 + ··· + c_mw_m
This, however, is equivalent to (11-17) if we replace the term c_0 by the product c_0w_0 where w_0 = 1 [see also (11-72)].
As the following special cases show, the variables w_k might be functionally related: If w_k = x^k, we have the problem of fitting the points (x_i, y_i) by a polynomial. If w_k = cos ω_kx, a fit by a trigonometric sum results. We discuss these cases first.
Parabola  We wish to fit the n points (x_i, y_i) by the parabola
y = A + Bx + Cx²   (11-18)
This is a special case of (11-17) with
m = 3    w_1 = 1    w_2 = x    w_3 = x²
Our objective is to find the constants A = c_1, B = c_2, C = c_3 so as to minimize the sum
Q = Σ[y_i − (A + Bx_i + Cx_i²)]²   (11-19)
To do so, we set
∂Q/∂A = −2Σ[y_i − (A + Bx_i + Cx_i²)] = 0
∂Q/∂B = −2Σ[y_i − (A + Bx_i + Cx_i²)]x_i = 0   (11-20)
∂Q/∂C = −2Σ[y_i − (A + Bx_i + Cx_i²)]x_i² = 0
This yields the system
nA + BΣx_i + CΣx_i² = Σy_i
AΣx_i + BΣx_i² + CΣx_i³ = Σx_iy_i   (11-21)
AΣx_i² + BΣx_i³ + CΣx_i⁴ = Σx_i²y_i
Solving, we obtain A, B, and C.
From (11-20) it follows that the LS error equals
Q = Σ[y_i − (A + Bx_i + Cx_i²)]y_i   (11-22)
Example 11.2
Fit a parabola to the points (Fig. 11.6)
x_i   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1
y_i   1.31  1.15  0.98  1.27  1.41  1.40  1.60  1.92  1.75  2.04
In this case,
Σx_i = 5.5    Σy_i = 14.83    Σx_i² = 3.85    Σx_i³ = 3.02    Σx_i⁴ = 2.53
Σx_iy_i = 8.98    Σx_i²y_i = 6.68    Σy_i² = 23.03
and (11-21) yields
A = 1.196    B = −0.314    C = 1.193    Q = 0.14 •
GENERAL CASE  Given the n points (w_{1i}, ..., w_{mi}, y_i), we wish to find the m constants c_k such that the sum
Q = Σ[y_i − (c_1w_{1i} + ··· + c_mw_{mi})]²   (11-23)
Figure 11.6
is minimum. This is the problem of fitting the given points by the function in (11-17). In the context of linear equations, the problem can be phrased as follows: We wish to solve the system
c_1w_{1i} + ··· + c_mw_{mi} = y_i    1 ≤ i ≤ n
of n equations involving m < n unknowns c_1, ..., c_m. To do so, we introduce the n errors
y_i − (c_1w_{1i} + ··· + c_mw_{mi}) = v_i    1 ≤ i ≤ n   (11-24)
and we determine c_k such as to minimize the sum in (11-23). Clearly, Q is minimum if
∂Q/∂c_k = −2Σ[y_i − (c_1w_{1i} + ··· + c_mw_{mi})]w_{ki} = 0    1 ≤ k ≤ m   (11-25)
This yields the system
c_1Σw_{1i}² + ··· + c_mΣw_{mi}w_{1i} = Σw_{1i}y_i
.....................
c_1Σw_{1i}w_{mi} + ··· + c_mΣw_{mi}² = Σw_{mi}y_i   (11-26)
Solving, we obtain c_k. The resulting LS error equals
Q = Σv_i² = Σv_iy_i = Σy_i² − c_1Σw_{1i}y_i − ··· − c_mΣw_{mi}y_i   (11-27)
Note that (11-25) is equivalent to the system
Σv_iw_{ki} = 0    k = 1, ..., m   (11-28)
This is the orthogonality principle for the general linear regression.
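The normal equations (11-26) and the error (11-27) can be formed and solved mechanically once the controlled variables w_{ki} are tabulated. The following Python sketch does this with numpy; the synthetic parabola data and the noise level are hypothetical, chosen only to illustrate the parabolic fit (11-18).

# sketch of the normal equations (11-26); rows of W are (w_{1i}, ..., w_{mi})
import numpy as np

def linear_regression(W, y):
    W = np.asarray(W, dtype=float)      # shape (n, m)
    y = np.asarray(y, dtype=float)      # shape (n,)
    G = W.T @ W                         # sums  sum_i w_{ki} w_{ri}
    h = W.T @ y                         # sums  sum_i w_{ki} y_i
    c = np.linalg.solve(G, h)           # coefficients c_1, ..., c_m
    Q = y @ y - c @ h                   # LS error, (11-27)
    return c, Q

# parabolic fit y = A + B x + C x^2 as in (11-18): columns 1, x, x^2
x = np.linspace(0.1, 1.0, 10)
y = 1.2 - 0.3 * x + 1.2 * x**2 + 0.05 * np.random.randn(10)   # synthetic data
W = np.column_stack([np.ones_like(x), x, x**2])
print(linear_regression(W, y))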
Nonlinear Regression
We wish to find a function
y = φ(x)   (11-29)
fitting the n points (x_i, y_i). Such a function can be used to describe economically a set of empirical measurements, to extrapolate beyond the observed data, or to determine a number of parameters of a physical law in terms of noisy observations. The problem has meaning only if we impose certain restrictions on the function φ(x). Otherwise, we can always find a curve fitting the given points exactly. We might require, for example, that φ(x) be smooth in some sense or that it depend on a small number of unknown parameters.
Suppose, first, that the unknown function φ(x) is of the form
φ(x) = c_1q_1(x) + ··· + c_mq_m(x)   (11-30)
where q_k(x) are m known functions. Clearly, φ(x) is a nonlinear function of x. However, it is linear in the unknown parameters c_k. This is thus a linear regression problem and can be reduced to (11-17) with the transformation w_k = q_k(x). A special case is the fit of the points (x_i, y_i) by the polynomial
y = c_1 + c_2x + ··· + c_mx^{m−1}
This is an extension of the parabolic fit considered earlier. Let us look at another case.
Trigonometric Sums
We wish to fit the n points (x_i, y_i) by the sum
y = c_1 cos ω_1x + ··· + c_m cos ω_mx   (11-31)
where the frequencies ω_k are known. This is a special case of (11-17) with
w_k = cos ω_kx    w_{ki} = cos ω_kx_i
Inserting into (11-26), we conclude that the coefficients c_k of the LS fit are the solutions of the system
c_1Σ cos² ω_1x_i + ··· + c_mΣ cos ω_mx_i cos ω_1x_i = Σy_i cos ω_1x_i
.....................
c_1Σ cos ω_1x_i cos ω_mx_i + ··· + c_mΣ cos² ω_mx_i = Σy_i cos ω_mx_i   (11-32)
Sums involving sines and cosines lead to similar results.
Example 11.3
We wish to fit the curve y = c_1 + c_2 cos 2.5x to the points (Fig. 11.7)
x_i   0.1  0.2   0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
y_i   9    10.1  9.1  8.0  7.4  6.9  6    5.0  4.6  4.2
This is a special case of (11-31) with ω_1 = 0 and ω_2 = 2.5. Inserting into (11-32), we obtain the system
nc_1 + c_2Σ cos 2.5x_i = Σy_i
c_1Σ cos 2.5x_i + c_2Σ cos² 2.5x_i = Σy_i cos 2.5x_i
This yields
10c_1 + 1.48c_2 = 70.3        c_1 = 6.562
1.48c_1 + 3.88c_2 = 22.0      c_2 = 3.159 •
Next we examine two examples of nonlinear problems that can be
linearized with a log transformation.
Figure 11.7
Example 11.4
We wish to fit the curve y = γe^{−λx} to the points (Fig. 11.8)
x_i   1   2   3   4   5   6   7   8  9  10
y_i   81  55  42  29  20  15  11  7  5  3
so as to minimize the sum
Σ[y_i − γe^{−λx_i}]²   (11-33)
Figure 11.8
On a log scale, the curve is a straight line ln y = ln γ − λx. This line is of the form z = a + bx where
z = ln y    a = ln γ    b = −λ
Hence, the LS fit is given by (11-12). This yields
λ = −Σ(x_i − x̄) ln y_i / Σ(x_i − x̄)²    ln γ = (1/n)Σ ln y_i + λx̄   (11-34)
Inserting the given data into (11-34), we obtain
λ = 0.355    ln γ = 4.78    y = 119e^{−0.355x}
Note that the constants γ and λ so obtained do not minimize (11-33); they minimize the sum
Σ[z_i − (a + bx_i)]² = Σ[ln y_i − (ln γ − λx_i)]² •
Example 11.5
We wish to fit the curve y = γx^β to the points (Fig. 11.9)
x_i   1    2    3    4    5    6    7    8    9    10
y_i   2.0  3.0  3.8  5.0  6.5  7.0  7.6  8.5  9.1  9.8
This is again a nonlinear problem; however, on a log-log scale, it is a straight line
ln y = ln γ + β ln x
of the form z = a + bw where
z = ln y    a = ln γ    b = β    w = ln x
Figure 11.9
With ln x̄ the average of the numbers ln x_i, (11-12) yields
β = Σ(ln x_i − ln x̄) ln y_i / Σ(ln x_i − ln x̄)² = 0.715
ln γ = (1/n)Σ ln y_i − β ln x̄ = 0.64    y = 1.9x^{0.715}
The constants γ and β so obtained minimize the sum
Σ[ln y_i − (ln γ + β ln x_i)]² •
PERTURBATIONS  In the general curve-fitting problem, the regression curve is a nonlinear function y = φ(x, λ, μ, ...) of known form depending on a number of unknown parameters. In most cases, this problem has no closed-form solution. We shall give an approximate solution based on the assumption that the unknown parameters λ, μ, ... are close to the known constants λ_0, μ_0, ....
Suppose, first, that we have a single unknown parameter λ. In this case, y = φ(x, λ) and our problem is to find λ such as to minimize the sum
Σ[y_i − φ(x_i, λ)]²   (11-35)
If the unknown λ is close to a known constant λ_0, we can linearize the problem (Fig. 11.10) using the approximation (truncated Taylor series)
φ(x, λ) ≈ φ(x, λ_0) + (λ − λ_0)φ_λ(x, λ_0)    φ_λ = ∂φ/∂λ   (11-36)
Figure 11.10
Indeed, with
z = y − φ(x, λ_0)    w = φ_λ(x, λ_0)
the nonlinear equation y = φ(x, λ) is equivalent to the homogeneous linear equation z = (λ − λ_0)w. Our problem, therefore, is to find the slope λ − λ_0 of this equation so as to fit the points
z_i = y_i − φ(x_i, λ_0)    w_i = φ_λ(x_i, λ_0)
in the sense of minimizing the LS error
Σ[y_i − φ(x_i, λ_0) − (λ − λ_0)φ_λ(x_i, λ_0)]² = Σ[z_i − (λ − λ_0)w_i]²
Reasoning as in (11-6), we conclude with b_1 = λ − λ_0 that
λ − λ_0 = Σz_iw_i / Σw_i² = Σ[y_i − φ(x_i, λ_0)]φ_λ(x_i, λ_0) / Σφ_λ²(x_i, λ_0)   (11-37)
Suppose next that the regression curve is a function φ(x, λ, μ) depending on two parameters. The problem now is to find the LS solution of the overdetermined nonlinear system
φ(x_i, λ, μ) = y_i    i = 1, ..., n   (11-38)
that is, to find λ and μ such as to minimize the sum
Σ[y_i − φ(x_i, λ, μ)]²   (11-39)
We assume again that the optimum values of λ and μ are near the known constants λ_0 and μ_0. This assumption leads to the Taylor approximation
φ(x, λ, μ) ≈ φ(x, λ_0, μ_0) + (λ − λ_0)φ_λ(x, λ_0, μ_0) + (μ − μ_0)φ_μ(x, λ_0, μ_0)   (11-40)
where φ_λ = ∂φ/∂λ and φ_μ = ∂φ/∂μ. Inserting into (11-38) and using the transformations
z = y − φ(x, λ_0, μ_0)    w_1 = φ_λ(x, λ_0, μ_0)    w_2 = φ_μ(x, λ_0, μ_0)
we obtain the overdetermined system
z_i = c_1w_{1i} + c_2w_{2i}    c_1 = λ − λ_0    c_2 = μ − μ_0   (11-41)
This system is of the form (11-17); hence, the LS values of c_1 and c_2 are determined from (11-26). The solution can be improved by iteration.
Example 11.6
(a) We wish to fit the curve y = 5 sin λx to the 13 points
x_i   0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5    5.5  6
y_i   0.4  0.6  1.8  3.0  4.6  4.2  5.2  4.0  4.4  3.4  2.0  3.6  1.0
of Fig. 11.11. As we see from the figure, a sine wave with period 12 is a reasonable fit. We can therefore use as our initial guess for the unknown λ the value λ_0 = 2π/12.
In this problem, φ(x, λ) = 5 sin λx, φ_λ(x, λ) = 5x cos λx,
z = y − 5 sin λ_0x    w = 5x cos λ_0x
Figure 11.11
Hence [see (11-37)],
λ = λ_0 + 5Σx_i cos λ_0x_i (y_i − 5 sin λ_0x_i) / (25Σx_i² cos² λ_0x_i) = 0.491
and the curve y = 5 sin 0.491x results.
(b) We now wish to fit the curve y = μ sin λx to the same points, using as initial guesses for the unknown parameters λ and μ the values
λ_0 = π/6    μ_0 = 5
With
φ(x, λ, μ) = μ sin λx    φ_λ = μx cos λx    φ_μ = sin λx
(11-40) yields
μ sin λx ≈ μ_0 sin λ_0x + (μ − μ_0) sin λ_0x + (λ − λ_0)μ_0x cos λ_0x
and the system
y = c_1w_1 + c_2w_2    w_1 = sin λ_0x    w_2 = μ_0x cos λ_0x
results where c_1 = μ, c_2 = λ − λ_0. Inserting into (11-26), we obtain
c_1Σ sin² λ_0x_i + c_2Σ μ_0x_i sin λ_0x_i cos λ_0x_i = Σy_i sin λ_0x_i
c_1Σ μ_0x_i sin λ_0x_i cos λ_0x_i + c_2Σ μ_0²x_i² cos² λ_0x_i = Σy_iμ_0x_i cos λ_0x_i
Hence, c_1 = 4.592, c_2 = −0.0368,
μ = 4.592    λ = 0.487    y = 4.592 sin 0.487x •
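One linearization step of part (b) amounts to solving the 2 × 2 system (11-26) built from w_1 and w_2. The Python sketch below performs that single step; the y values are the scanned data read as decimals, so treat the numerical output as illustrative rather than authoritative.

# sketch of one perturbation step for y = mu*sin(lambda*x) around lambda0 = pi/6, mu0 = 5
import numpy as np

x = np.arange(0, 6.5, 0.5)
y = np.array([0.4, 0.6, 1.8, 3.0, 4.6, 4.2, 5.2, 4.0, 4.4, 3.4, 2.0, 3.6, 1.0])

lam0, mu0 = np.pi / 6, 5.0
w1 = np.sin(lam0 * x)                 # coefficient of c1 = mu
w2 = mu0 * x * np.cos(lam0 * x)       # coefficient of c2 = lambda - lambda0
W = np.column_stack([w1, w2])
c1, c2 = np.linalg.solve(W.T @ W, W.T @ y)
mu, lam = c1, lam0 + c2
print(mu, lam)                        # close to the values quoted in the example

# the step can be iterated: set lambda0 = lam, mu0 = mu and repeat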
11-3
Statistical Interpretation
In Section 11-2, we interpreted the data (x_i, y_i) as deterministic numbers. In many applications, however, x_i and y_i are the values of two variables x and y related by a physical law y = φ(x). It is often the case that the values x_i of the controlled variable x are known exactly but the values y_i of y involve random errors. For example, x_i could be the precisely measured water temperature and y_i the imperfect measurement of the solubility η_i of a chemical substance. In this section, we use the random character of the errors to improve the estimate of the regression line φ(x). For simplicity, we shall assume that φ(x) is the straight line a + bx involving only the parameters a and b. The case of multiple linear regression can be treated similarly.
We shall use the following probabilistic model. We are given n independent RVs y_1, ..., y_n with the same variance σ² and with mean
E{y_i} = η_i = a + bx_i   (11-42)
where x_i are known constants (controlled variables) and a and b are two unknown parameters. We can write y_i as a sum
y_i = a + bx_i + ν_i    E{ν_i} = 0   (11-43)
where ν_i are n independent RVs with variance σ². In the context of measurements, η_i is a quantity to be measured and ν_i is the measurement error. Our objective is to estimate the parameters a and b in terms of the observed values y_i of the RVs y_i. This is thus a parameter estimation problem differing from earlier treatments in that E{y_i} is not a constant but depends on i. We shall estimate a and b using the maximum likelihood (ML) method for normal RVs and the minimum variance method for arbitrary RVs and shall show that the estimates are identical. Furthermore, they agree with the LS solution (11-12) of the deterministic curve-fitting problem.
MAXIMUM LIKELIHOOD  Suppose that the RVs y_i are normal. In this case, their joint density equals
[1/(σ√2π)^n] exp{−(1/2σ²) Σ[y_i − (a + bx_i)]²}   (11-44)
This density is a function of the parameters a and b. We shall find the ML estimators â and b̂ of these parameters.
The right side of (11-44) is maximum if the sum
Q = Σ[y_i − (a + bx_i)]²
is minimum. Clearly, Q equals the LS error in (11-9); hence, it is minimum if the parameters a and b are the solutions of the system (11-11). Replacing in (11-12) the average ȳ of the numbers y_i by the average ȳ of the n RVs y_i, we obtain the estimators
â = ȳ − b̂x̄    b̂ = Σ(x_i − x̄)y_i / Σ(x_i − x̄)²   (11-45)
As we shall see, these estimators are unbiased.
MINIMUM VARIANCE  We assume again that the RVs y_i are independent with the same variance σ², but we impose no restrictions on their distribution. We shall determine the unbiased linear minimum variance estimators (best estimators)
â = Σα_iy_i    b̂ = Σβ_iy_i   (11-46)
of the regression coefficients a and b. Our objective is thus to find the 2n constants α_i and β_i satisfying the following requirements: The expected values of â and b̂ equal a and b, respectively:
E{â} = Σα_iη_i = a    E{b̂} = Σβ_iη_i = b   (11-47)
and their variances
σ_â² = σ²Σα_i²    σ_b̂² = σ²Σβ_i²   (11-48)
are minimum.
• Gauss-Markov Theorem. The best estimators of a and b are given by (11-45) or, equivalently, by (11-46), where
α_i = 1/n − β_i x̄    β_i = (x_i − x̄) / Σ(x_i − x̄)²   (11-49)
• Proof. Since E{y_i} = a + bx_i, (11-47) yields
Σα_i(a + bx_i) = a    Σβ_i(a + bx_i) = b
Rearranging terms, we obtain
(Σα_i − 1)a + (Σα_ix_i)b = 0    (Σβ_i)a + (Σβ_ix_i − 1)b = 0
This must be true for any a and b; hence,
Σα_i = 1    Σα_ix_i = 0   (11-50a)
Σβ_i = 0    Σβ_ix_i = 1   (11-50b)
Thus our problem is to minimize the sums Σα_i² and Σβ_i² subject to the constraints (11-50a) and (11-50b), respectively. The first two constraints contain only α_i, and the last two only β_i. We can therefore minimize each sum separately. Proceeding as in (7-37) we obtain (see Problem 11-8)
α_i = (n x̄ x_i − Σx_k²) / [n(n x̄² − Σx_k²)]    β_i = (x_i − x̄) / (Σx_k² − n x̄²)
and (11-49) follows.
Note that
Σβ_i² = Σ(x_i − x̄)² / [Σ(x_i − x̄)²]² = 1 / Σ(x_i − x̄)²
Σα_iβ_i = (1/n)Σβ_i − x̄ Σβ_i² = −x̄ / Σ(x_i − x̄)²   (11-51)
Σα_i² = Σ(1/n − β_i x̄)² = 1/n + x̄² / Σ(x_i − x̄)²
Variance and Interval Estimates  From (11-51) and (11-48) it follows that
σ_â² = σ²[1/n + x̄²/Σ(x_i − x̄)²]    σ_b̂² = σ²/Σ(x_i − x̄)²   (11-52)
Cov(â, b̂) = σ²Σα_iβ_i = −σ²x̄/Σ(x_i − x̄)²   (11-53)
Furthermore, the sum η̂_k = â + b̂x_k is an unbiased estimator of the sum a + bx_k, and its variance equals
σ_η̂k² = σ²/n + σ²(x_k − x̄)²/Σ(x_i − x̄)²   (11-54)
We shall determine the confidence intervals of the parameters a, b, and η_k under the assumption that n is large. Since the RVs â, b̂, and η̂_k are sums of the independent RVs y_i, we conclude that they are nearly normal; hence, the γ = 2u − 1 confidence intervals of the parameters a, b, and η_k equal
â ± z_u σ_â    b̂ ± z_u σ_b̂    η̂_k ± z_u σ_η̂k   (11-55)
respectively. These estimates can be used if σ² is known. If σ² is unknown, we replace it by its estimate σ̂². To find σ̂², we replace the parameters η_i in the sum Σ(y_i − η_i)² by their estimates. This yields the sum
Σε_i²    ε_i = y_i − η̂_i = y_i − (â + b̂x_i)   (11-56)
It can be shown (see Problem 11-12) that the sum
σ̂² = [1/(n − 2)] Σ(y_i − η̂_i)²   (11-57)
is an unbiased estimate of σ². Replacing the unknown σ² in (11-52) and (11-54) by its estimate σ̂², we obtain the large-sample estimates of σ_â, σ_b̂, and σ_η̂k. The corresponding confidence intervals are given by (11-55). Note, finally, that tests of various hypotheses about a, b, and η_k are obtained as in Section 10-2 (see Problem 11-9).
Regression Line Estimate  Consider the sums η_x = a + bx and y_x = a + bx + ν_x where a and b are the constants in (11-42), x is an arbitrary constant, and ν_x is an RV with zero mean (Fig. 11.12). Thus
y_x = η_x + ν_x    η_x = a + bx   (11-58)
We shall estimate the ordinate η_x of the straight line a + bx for x ≠ x_i in terms of the n data points (x_i, y_i). This problem has two interpretations depending on the nature of the RV y_x.
First Interpretation  The RV y_x is the result of the measurement of the sum η_x = a + bx, and ν_x is the measurement error. This sum might represent a physical quantity, for example, the temperature of the earth at depth x. In this case, the quantity of interest is the estimate η̂_x of η_x. We shall use for η̂_x the sum
η̂_x = â + b̂x   (11-59)
Figure 11.12
where â and b̂ are the estimates of a and b given by (11-46). Reasoning as before, we conclude that η̂_x is an unbiased estimator of η_x, and its variance equals
σ_η̂x² = σ²/n + σ²(x − x̄)²/Σ(x_i − x̄)²   (11-60)
Second Interpretation  The RV y_x represents a physical quantity of interest, the value of a stock at time x, for example, and η_x is its mean. The sum η_x + ν_x relates y_x to the controlled variable x, and the quantity of interest is the value y_x of y_x. We thus have a prediction problem: We wish to predict the value y_x of y_x in terms of the values y_i of the n RVs y_i in (11-43). As the predictor of the RV y_x = η_x + ν_x we shall use the sum
ŷ_x = η̂_x = â + b̂x   (11-61)
Assuming that ν_x is independent of the n RVs ν_i in (11-43), we conclude from (11-60) that the MS value of the prediction error y_x − ŷ_x = ν_x − (η̂_x − η_x) equals
E{(y_x − ŷ_x)²} = E{ν_x²} + σ_η̂x² = σ_ν² + σ²/n + σ²(x − x̄)²/Σ(x_i − x̄)²   (11-62)
The two interpretations of the regression line estimate â + b̂x can thus be phrased as follows: This line is the estimate of the line η_x = a + bx; the variance of this estimation is given by (11-60). It is also the predicted value of the RV y_x = a + bx + ν_x; the MS prediction error is given by (11-62).
Example 11.7
We are given the following 11 points:
x_i   1  2  3   4   5   6   7   8   9   10  11
y_i   7  5  11  10  15  12  16  19  22  20  25
x̄ = 6    ȳ = 14.73
Figure 11.13
Find the point estimates of a, b, and σ. Using the formulas
β_i = (x_i − x̄)/Σ(x_i − x̄)²    b̂ = Σβ_iy_i    â = ȳ − b̂x̄
we find
b̂ = 1.836    â = 3.714
σ̂² = (1/9) Σ_{i=1}^{11} (y_i − 3.714 − 1.836x_i)² = 3.42    σ̂ = 1.85
In Fig. 11.13 we show the points (x_i, y_i) and the estimate η̂_x = 3.714 + 1.836x of a + bx. •
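The point estimates (11-45), the variance estimate (11-57), and the large-sample intervals (11-55) can be computed together, as in the following Python sketch. The data are those read from the scanned table of Example 11.7, so the printed numbers should be taken as illustrative; z_u is the standard normal percentile for the chosen confidence level.

# sketch of the estimates of Section 11-3 for straight-line regression
import math

def regression_estimates(x, y, zu=1.96):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx      # (11-45)
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)                     # sigma^2 estimate, (11-57)
    sa = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))             # (11-52)
    sb = math.sqrt(s2 / sxx)
    return a, b, s2, (a - zu * sa, a + zu * sa), (b - zu * sb, b + zu * sb)   # (11-55)

x = list(range(1, 12))
y = [7, 5, 11, 10, 15, 12, 16, 19, 22, 20, 25]
print(regression_estimates(x, y))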
11-4
Prediction
In the third interpretation of regression, the points (x_i, y_i) are the samples of two RVs x and y, and the problem is to estimate y in terms of x. Multiple regression is the estimation of y in terms of several RVs w_k. This topic is an extension of the nonlinear and linear regression introduced in Sections 5-2 and 6-3.
Linear Prediction
We start with a review of the results of Section 5-2. We wish to find two constants a and b such that the sum ŷ = a + bx is the least mean square (LMS) predictor of the RV y in the sense of minimizing the MS value
Q = E{(y − ŷ)²} = ∫∫ [y − (a + bx)]² f(x, y) dx dy   (11-63)
of the prediction error ν = y − ŷ. Clearly, Q is minimum if
∂Q/∂a = −2E{y − (a + bx)} = 0    ∂Q/∂b = −2E{[y − (a + bx)]x} = 0   (11-64)
This yields the system
a + bη_x = η_y    aη_x + bE{x²} = E{xy}   (11-65)
Solving, we obtain
a = η_y − bη_x   (11-66a)
b = (E{xy} − η_xη_y)/(E{x²} − η_x²) = μ_11/σ_x² = r σ_y/σ_x   (11-66b)
Note that
ν = y − (a + bx) = (y − η_y) − b(x − η_x)    E{ν(a + bx)} = 0
Hence, the LMS error equals
Q = E{ν²} = E{νy} = E{[(y − η_y) − b(x − η_x)]y} = σ_y² − bμ_11 = σ_y²(1 − r²)
Thus the ratio Q/σ_y² is a normalized measure of the LMS prediction error Q, and it equals 0 iff |r| = 1.
If the RVs x and y are uncorrelated, then
r = 0    b = 0    Q = σ_y²    ŷ = a = η_y
In this case, the predicted value of y equals its mean; that is, the observed x does not improve the prediction.
The solution of the prediction problem is thus based on knowledge of the parameters η_x, η_y, σ_x, σ_y, and r. If these parameters are unknown, they can be estimated in terms of the data (x_i, y_i): η_x and η_y are replaced by x̄ and ȳ, and the moments σ_x², μ_11 by the corresponding empirical averages. If these approximations are inserted into the two equations in (11-66), the two equations in (11-12) result. This shows the connection between the prediction problem and the deterministic curve-fitting problem considered in Section 11-2.
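When the second-order moments are given, (11-66) is an explicit formula. The following Python sketch evaluates it; the numerical moments supplied are hypothetical and serve only to show the mechanics.

# sketch of the linear LMS predictor (11-66) from second-order moments
def linear_predictor(eta_x, eta_y, sigma_x, sigma_y, r):
    b = r * sigma_y / sigma_x            # (11-66b)
    a = eta_y - b * eta_x                # (11-66a)
    Q = sigma_y ** 2 * (1 - r ** 2)      # LMS prediction error
    return a, b, Q

a, b, Q = linear_predictor(eta_x=3, eta_y=4, sigma_x=2, sigma_y=8, r=0.5)
print(a, b, Q)                           # predictor is y^ = a + b*x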
GENERALIZATION  We are given m + 1 RVs
w_1, ..., w_m, y   (11-67)
We perform the underlying experiment once and observe the values w_k of the m RVs w_k. Using this information, we wish to predict the value y of the RV y. In the linear prediction problem, the predicted value of y is the sum
ŷ = c_1w_1 + ··· + c_mw_m   (11-68)
and our problem is to find the m constants c_k so as to minimize the MS value
Q = E{[y − (c_1w_1 + ··· + c_mw_m)]²}   (11-69)
of the prediction error ν = y − (c_1w_1 + ··· + c_mw_m).
Clearly, Q is minimum if
∂Q/∂c_k = −2E{[y − (c_1w_1 + ··· + c_mw_m)]w_k} = 0   (11-70)
This yields the system
c_1E{w_1²} + ··· + c_mE{w_mw_1} = E{w_1y}
.....................
c_1E{w_1w_m} + ··· + c_mE{w_m²} = E{w_my}   (11-71)
Solving, we obtain c_k.
The nonhomogeneous linear predictor of y is the sum
γ_0 + γ_1w_1 + ··· + γ_mw_m
This can be considered as a special case of (11-68) if we replace the constant γ_0 by the product γ_0w_0 where w_0 = 1. Proceeding as in (11-71), we conclude that the constants γ_k are the solutions of the system
γ_0 + γ_1E{w_1} + ··· + γ_mE{w_m} = E{y}
γ_0E{w_k} + γ_1E{w_kw_1} + ··· + γ_mE{w_kw_m} = E{w_ky}    k = 1, ..., m   (11-72)
From this it follows that if E{w_k} = E{y} = 0, then γ_0 = 0 and γ_k = c_k for every k.
Orthogonality Principle  Two RVs x and y are called orthogonal if E{xy} = 0. From (11-70) it follows that
E{νw_k} = 0    k = 1, ..., m   (11-73)
Multiplying the kth equation by an arbitrary constant d_k and adding, we obtain
E{ν(d_1w_1 + ··· + d_mw_m)} = 0   (11-74)
Thus in the linear prediction problem, the prediction error ν = y − ŷ is orthogonal to the "data" w_k and to any linear function of the data (orthogonality principle). This result can be used to obtain the system in (11-71) directly, thereby avoiding the need for minimizing Q.
From (11-74) it follows that
E{νŷ} = E{(y − ŷ)ŷ} = 0
Hence, the LMS prediction error Q equals
Q = E{(y − ŷ)²} = E{(y − ŷ)y} = E{y²} − E{ŷ²}   (11-75)
As we see from (11-71), the determination of the linear predictor ŷ of y is based on knowledge of the joint moments of the m + 1 RVs in (11-67). If these moments are unknown, they can be approximated by their empirical estimates
E{w_kw_r} ≈ (1/n) Σ w_{ki}w_{ri}    E{w_ky} ≈ (1/n) Σ w_{ki}y_i
where w_{ki} and y_i are the samples of the RVs w_k and y, respectively. Inserting the approximations into (11-71), we obtain the system (11-26).
Nonlinear Prediction
The nonlinear predictor of an RV y in terms of the m RVs x_1, ..., x_m is a function
φ(x_1, ..., x_m) = φ(X)    X = [x_1, ..., x_m]
minimizing the MS prediction error
e = E{[y − φ(X)]²}   (11-76)
As we have shown in Section 6-3, if m = 1, then φ(x) = E{y|x}. For an arbitrary m, the function φ(X) is given by the conditional mean of y assuming X:
φ(X) = E{y|X} = ∫ y f(y|X) dy   (11-77)
To prove this, we shall use the identity [see (7-21)]
E{z} = E{E{z|X}}   (11-78)
With z = [y − φ(X)]², this yields
e = E{E{[y − φ(X)]²|X}}   (11-79)
The right side is a multiple integral involving only positive quantities; hence, it is minimum if the conditional mean
E{[y − φ(X)]²|X} = ∫ [y − φ(X)]² f(y|X) dy   (11-80)
is minimum. In this integral, the function φ(X) is a constant (independent of the variable of integration y). Reasoning as in (6-52), we conclude that the integral is minimum if φ(X) equals the conditional mean of y.
Note  For normal RVs, nonlinear and linear predictors are identical because the conditional mean E{y|X} is a linear function of the components x_i of X (see Problem 11-16). This is a simple extension of (6-41).
Orthogonality Principle  We have shown that in linear prediction, the error is orthogonal to the data and to any linear function of the data. We show next that in nonlinear prediction, the error y − φ(X) is orthogonal to any function q(X), linear or nonlinear, of the data X. Indeed, from (11-77) it follows, as in (6-49), that
E{[y − φ(X)]q(X)} = E{q(X)E{y − φ(X)|X}}
From (11-77) and the linearity of expected values it follows that
E{y − φ(X)|X} = E{y|X} − E{φ(X)|X} = E{y|X} − φ(X) = 0
Hence [see (11-78)],
E{[y − φ(X)]q(X)} = 0   (11-81)
for any q(X). This shows that the error y − φ(X) is orthogonal to q(X).
Problems
11-1  Fit the lines y = b_1x, y = a + bx, and y = γx^β to the points
x_i   0  2  4  5  5  6  8  8  9   10  11
y_i   1  3  3  5  8  7  7  9  11  13  15
and sketch the results. Find and compare the corresponding errors
Σ(y_i − b_1x_i)²    Σ[y_i − (a + bx_i)]²    Σ(y_i − γx_i^β)²
11-2  Fit the parabola y = A + Bx + Cx² to the following points
x_i   0  1  2  3   4   5   6   7   8   9   10
y_i   0  3  5  10  17  25  40  50  65  80  98
and sketch the results.
11-3  Here are the average grades x_i, y_i, and z_i of 15 students in their freshman, sophomore, and senior year, respectively:
x_i   2.8  2.5  2.2  2.6  3.0  3.6  3.5  3.7  3.6  3.3  3.4  3.2  1.9  2.7  3.4
y_i   1.5  3.1  2.7  3.4  3.9  3.8  4.0  2.4  3.4  4.0  2.6  3.4  2.2  2.9  3.6
z_i   2.6  2.6  2.9  3.1  3.4  2.9  2.5  3.9  3.4  3.8  2.9  3.7  2.5  2.4  3.8
(a) Find the LS fit of the plane ẑ = c_1 + c_2x + c_3y to the points (x_i, y_i, z_i). (b) Use (a) to predict the senior grade ẑ of a student if his grades in the freshman and sophomore years are x = 3.9 and y = 3.8.
11-4  Fit the curve y = a + b sin (πx/10) to the following points
x_i   0  1    2   3     4     5   6     7   8    9  10
y_i   4  7.5  10  12.2  13.5  14  13.6  12  9.9  7  3.9
and sketch the results.
11-5  Fit the curve y = a sin ωx to the points
x_i   0  1  2   3   4  5  6   7   8    9   10  11  12
y_i   2  7  11  12  6  1  −5  −9  −10  −6  −1  2   4
using perturbation with initial guess a_0 = 10, ω_0 = π/5. Sketch the results.
11-6  We measure the angles α, β, γ of a triangle and obtain the n triplets (x_i, y_i, z_i) where x_i − α, y_i − β, z_i − γ are the measurement errors. (a) Find the LS estimates α̂, β̂, γ̂ of α, β, γ deterministically. (b) Assuming that the errors are the samples of three independent N(0, σ_x), N(0, σ_y), N(0, σ_z) RVs, find the ML estimates of α, β, γ.
11-7  The RVs y_i are N(a + bx_i, σ). Show that if
b̂ = Σ(x_i − x̄)y_i / Σ(x_i − x̄)²    â = ȳ − b̂x̄
then E{b̂} = b, E{â} = a.
11-8  (a) Find n numbers α_i such that the sum I = Σα_i² is minimum subject to the constraints Σα_i = 1, Σα_ix_i = 0, where x_i are n given constants. (b) Find β_i such that the sum J = Σβ_i² is minimum subject to the constraints Σβ_i = 0, Σβ_ix_i = 1.
11-9  The n RVs y_i are independent and N(a + bx_i, σ). Test the hypothesis b = b_0 against b ≠ b_0 in terms of the data y_i. Use as the test statistic the sum b̂ = Σβ_iy_i in (11-46). (a) Assume that σ is known. (b) Assume that σ is unknown.
11-10  The RVs y_1, ..., y_n are independent with the same variance and with mean
E{y_i} = A + Bx_i + Cx_i²
where x_i are n known constants. (a) Find the best linear estimates
Â = Σα_iy_i    B̂ = Σβ_iy_i    Ĉ = Σγ_iy_i
of the parameters A, B, and C. (b) Show that if the RVs y_i are normal, then the ML estimates of A, B, and C satisfy the system (11-21).
11-11  Suppose that the temperature of the earth at distance x from the surface equals θ_x = a + bx. We measure θ_x at 10 points x_i and obtain the values y_i = a + bx_i + ν_i where ν_i are the measurement errors, which we assume i.i.d. and N(0, σ). The results, in meters for x and degrees C for θ_x, are
x_i   10    52    110   153   200   245   310   350   450   600
y_i   26.2  27.1  28.6  29.9  31.4  32.6  34.1  35.1  37.5  40.2
(a) Find the best unbiased estimates of a and b and test the hypothesis a = 0 against a ≠ 0 if σ = 1. (b) If σ is unknown, estimate it and find the 0.95 confidence interval of a and b. (c) Estimate θ_x at x = 800 m and find its 0.95 confidence interval if σ = 1.
11-12  (a) Show that the RVs ȳ and b̂ in (11-45) are independent. (b) Show that the RV (1/σ²)Σ(η̂_i − η_i)² is χ²(2). (c) With ε_i = y_i − η̂_i = ν_i − (η̂_i − η_i), as in (11-56), show that the RVs ε_i and η̂_i = â + b̂x_i are uncorrelated. (d) Show that the RV (1/σ²)Σε_i² is χ²(n − 2).
11-13  (Weighted least squares) The RVs y_i are independent and normal, with mean a + bx_i and variance σ_i² = σ²/w_i. Show that the ML estimates â and b̂ of a and b are the solutions of the system
âΣw_i + b̂Σw_ix_i = Σw_iy_i    âΣw_ix_i + b̂Σw_ix_i² = Σw_ix_iy_i
11-14  The RVs x and y are such that
η_x = 3    η_y = 4    σ_x = 2    σ_y = 8    r_xy = 0.5
(a) Find the homogeneous predictor ŷ = ax of y and the MS prediction error Q = E{(y − ax)²}. (b) Find the nonhomogeneous predictor ŷ_0 = γ_0 + γ_1x and the error Q_0 = E{[y − (γ_0 + γ_1x)]²}.
11-15  The RVs x and y are jointly normal with zero mean. Suppose that ŷ = ax is the LMS predictor of y in terms of x and Q = E{(y − ax)²} is the MS error. (a) Show that the RVs y − ax and x are independent. (b) Show that the conditional density f(y|x) is a normal curve with mean ax and variance Q.
11-16  The RVs y, x_1, ..., x_m are jointly normal with zero mean. Show that if ŷ = a_1x_1 + ··· + a_mx_m is the linear MS predictor of y in terms of x_i, then
E{y|x_1, ..., x_m} = a_1x_1 + ··· + a_mx_m
This shows that for normal RVs, nonlinear and linear predictors are identical. The proof is based on the fact that for normal RVs with zero mean, uncorrelatedness is equivalent to independence.
12
Entropy
Entropy is rarely treated in books on statistics. It is viewed as an arcane
subject related somehow to uncertainty and information and associated with
thermodynamics, statistical mechanics, or coding. In this chapter, we argue
that entropy is a basic concept precisely defined within a probabilistic model
and that all its properties follow from the axioms of probability. We show
that like probability, the empirical interpretation of entropy is based on the
properties of long sequences generated by the repetition of a random experiment. This leads to the notion of typical sequences and offers, in our view,
the best justification of the principle of maximum entropy and of the use of
entropy in statistics.
12-1
Entropy of Partitions and Random Variables
Entropy, as a scientific concept. was introduced first in thermodynamics
(Clausius, 1850). Several years later, it was given a probabilistic interpretation in the context of statistical mechanics (Boltzmann, 1877). In 1948. Shannon established the connection between entropy and typical sequences. This
led to the solution of a number of basic problems in coding and data trans-
414
SEC.
12-J
l:.NTROPY OF PARTITIOSS ASD RASDOM VARIABLES
Pl...-1;1
.~·•x
::1!
A
415
= f/,
v
/II AI- ... ~. P;
~n
P;
I - •
Fipre 12.1
mission. Jaynes (1957) used the method of maximum entropy to solve a
number of problems in physics and, more recently, in a variety of other areas
involving the solution of ill-posed problems. In this chapter. we examine the
relationship between entropy and statistics. and we use the principle of
maximum entropy to estimate unknown distributions in terms of known
parameters (see also Section 8-2).
• Definition. Given a probabilistic model 𝒮 and a partition A = [𝒜_1, ..., 𝒜_N] of 𝒮 consisting of the N events 𝒜_i (Fig. 12.1), we form the sum
H(A) = −Σ_{i=1}^{N} p_i ln p_i    p_i = P(𝒜_i)   (12-1)
This sum is called the entropy of the partition A.
Example 12.1
Consider a coin with P{h} = p and P{t} = q. The events 𝒜_1 = {h} and 𝒜_2 = {t} form a partition A = [𝒜_1, 𝒜_2] with entropy
H(A) = −(p ln p + q ln q)
If p = q = .5, then H(A) = ln 2 = 0.693; if p = .25 and q = .75, then H(A) = 0.562. •
Example 12.2
In the fair-die experiment, the elementary events {f_i} form a partition A with entropy
H(A) = −[(1/6) ln (1/6) + ··· + (1/6) ln (1/6)] = ln 6 = 1.79
In the same experiment, the events ℬ_1 = {even} and ℬ_2 = {odd} form a partition B with entropy
H(B) = −[(1/2) ln (1/2) + (1/2) ln (1/2)] = ln 2 = 0.693 •
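The sum (12-1) is a one-line computation once the probabilities p_i are listed. The short Python sketch below reproduces the numbers of Examples 12.1 and 12.2 (natural logarithms throughout).

# sketch of (12-1): entropy of a partition from its event probabilities
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))          # ln 2 = 0.693
print(entropy([0.25, 0.75]))        # 0.562
print(entropy([1/6] * 6))           # ln 6 = 1.79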
In an arbitrary experiment 𝒮, an event 𝒜 and its complement 𝒜̄ form a partition A = [𝒜, 𝒜̄] consisting of the two events 𝒜_1 = 𝒜 and 𝒜_2 = 𝒜̄. The
Figure 12.2
entropy of this partition equals
H(A) = −(p ln p + q ln q)    where p = P(𝒜), q = P(𝒜̄)
In Fig. 12.2, we plot the functions −p ln p, −q ln q, and their sum
φ(p) = −(p ln p + q ln q)    q = 1 − p   (12-2)
The function φ(p) tends to 0 as p approaches 0 or 1 because
p ln p → 0    for p → 0 and for p → 1
Furthermore, φ(p) is symmetrical about the point p = .5, and it is maximum for p = .5. Thus the entropy of the partition [𝒜, 𝒜̄] is maximum if the events 𝒜 and 𝒜̄ have the same probability. We show next that this is true in general. The proof will be based on the following.
The proof will be based on the following.
A Bale Inequality
If c1 is a set of N numbers such that
Ct
+ . . . + CN
=I
c,
<?!
0
then
N
N
i•l
i•l
-l: pdnp, s -l: pdn c,
(12-3)
• Proof. From the convexity of the function In z it follows (Fig. 12.3) that
In z s z - I
(12-4)
Figure 12.3
With z = c_i/p_i, this yields
Σ_{i=1}^{N} p_i ln (c_i/p_i) ≤ Σ_{i=1}^{N} p_i (c_i/p_i − 1) = Σ c_i − Σ p_i = 0
Hence,
0 ≥ Σ_{i=1}^{N} p_i ln (c_i/p_i) = Σ p_i ln c_i − Σ p_i ln p_i
and (12-3) results.
• Theorem. The entropy of a partition consisting of N events is maximum if all events have the same probabilities, that is, if p_i = 1/N:
H(A) = −Σ_{i=1}^{N} p_i ln p_i ≤ −Σ_{i=1}^{N} (1/N) ln (1/N) = ln N   (12-5)
• Proof. Setting c_i = 1/N in (12-3), we obtain
−Σ p_i ln p_i ≤ −Σ p_i ln (1/N) = ln N
From the theorem it follows that
0 ≤ H(A) ≤ ln N   (12-6)
Furthermore, H(A) = ln N iff p_1 = ··· = p_N, and
H(A) = 0    iff    p_i = 1 for i = k and p_i = 0 for i ≠ k   (12-7)
The following two properties of entropy are consequences of the convexity of the function −p ln p of Fig. 12.2.
Figure 12.4  H(A) ≤ H(B)
Property 1  The partitions A and B of Fig. 12.4 consist of N and N + 1 events, respectively. The N − 1 events 𝒜_2, ..., 𝒜_N are the same in both partitions. Furthermore, the events ℬ_a and ℬ_b are disjoint and 𝒜_1 = ℬ_a ∪ ℬ_b; hence,
p_1 = P(𝒜_1) = P(ℬ_a) + P(ℬ_b) = p_a + p_b
We shall show that
H(A) < H(B)   (12-8)
• Proof. From the convexity of the function w(p) = −p ln p it follows that
w(p_a + p_b) < w(p_a) + w(p_b)   (12-9)
Hence,
−p_1 ln p_1 − Σ_{i=2}^{N} p_i ln p_i ≤ −(p_a ln p_a + p_b ln p_b) − Σ_{i=2}^{N} p_i ln p_i
and (12-8) results because the left side equals H(A) and the right side equals H(B).
Example 12.3
The partition A consists of three events with probabilities
p_1 = .55    p_2 = .30    p_3 = .15
and its entropy equals H(A) = 0.975. Replacing the event 𝒜_1 by the events ℬ_a and ℬ_b, we obtain the partition B where p_a = P(ℬ_a) = .38, p_b = P(ℬ_b) = .17. The entropy of the partition so formed equals
H(B) = 1.3148 > H(A)
in agreement with (12-8). •
Property 2  The partitions A and C of Fig. 12.5 consist of N events each. The N − 2 events 𝒜_3, ..., 𝒜_N are the same in both partitions. Furthermore, 𝒜_1 ∪ 𝒜_2 = 𝒞_a ∪ 𝒞_b and
p_1 = P(𝒜_1)    p_2 = P(𝒜_2)    p_a = P(𝒞_a)    p_b = P(𝒞_b)
We shall show that if p_1 < p_a ≤ p_b < p_2, then
H(A) < H(C)   (12-10)
Figure 12.5  H(A) ≤ H(C)
• Proof. Clearly, p_1 + p_2 = p_a + p_b; hence, p_a = p_1 + δ and p_b = p_2 − δ with δ > 0. From the convexity of w(p) it follows that if δ > 0, then
w(p_1) + w(p_2) < w(p_1 + δ) + w(p_2 − δ)
Hence,
−p_1 ln p_1 − p_2 ln p_2 − Σ_{i=3}^{N} p_i ln p_i ≤ −p_a ln p_a − p_b ln p_b − Σ_{i=3}^{N} p_i ln p_i
and (12-10) results because the left side equals H(A) and the right side equals H(C).
Example 12.4
Suppose that A is the partition in Example 12.3 and C is such that
p_a = P(𝒞_a) = .52    p_b = P(𝒞_b) = .33
In this case, H(C) = 0.990 > H(A) = 0.975, in agreement with (12-10). •
We can use Property 2 to give another proof of the theorem (12-5). Indeed, if the events 𝒜_i of A do not have the same probability, we can construct another partition C as in (12-10), with larger entropy. From this it follows that if H(A) is maximum, all the events of A must have the same probability.
Note  The concept of entropy is usually introduced as a measure of uncertainty about the occurrence of the events 𝒜_i of a partition A, and the sum (12-1), used to quantify this measure, is derived from a number of postulates that are based on the heuristic notion of the properties of uncertainty. We follow a different approach. We view (12-1) as the definition of entropy, and we derive the relationship between H(A) and the number of typical sequences. As we show in Section 12-3, this relationship forms the conceptual justification of the method of maximum entropy, and it shows the connection between entropy and relative frequency.
Random Variables
Consider a discrete type RV x taking the values x_i with probability p_i = P{x = x_i}. The events {x = x_i} are mutually exclusive, and their union equals 𝒮. Hence, they form the partition
A_x = [𝒜_1, ..., 𝒜_N]    𝒜_i = {x = x_i}
The entropy of this partition is by definition the entropy H(x) of the RV x:
H(x) = H(A_x) = −Σ_{i=1}^{N} p_i ln p_i    p_i = P{x = x_i}   (12-11)
Thus H(x) does not depend on the values x_i of x; it depends only on the probabilities p_i.
Conversely, given a partition A, we can construct an RV x_A such that
x_A(ζ) = x_i    for ζ ∈ 𝒜_i   (12-12)
where x_i are distinct numbers but otherwise arbitrary. Clearly, {x_A = x_i} = 𝒜_i; hence, H(x_A) = H(A).
Entropy as Expected Value  We denote by f(x) the point density of the discrete type RV x. Thus f(x) is different from 0 at the points x_i and f(x_i) = P{x = x_i} = p_i. Using the function f(x), we construct the RV ln f(x). This RV is a function of the RV x, and its mean [see (4-94)] equals
E{ln f(x)} = Σ_{i=1}^{N} p_i ln f(x_i)
Comparing with (12-11), we conclude that the entropy of the RV x equals
H(x) = −E{ln f(x)}   (12-13)
Example 12.5
In the die experiment, the elementary events {f_i} form a partition A. We construct the RV x_A such that x_A(f_i) = i as in (12-12). Clearly, f(x) is different from 0 at the points x = 1, ..., 6, and f(x_i) = p_i; hence,
H(x_A) = −E{ln f(x_A)} = −Σ_{i=1}^{6} p_i ln p_i = H(A) •
CONTINUOUS TYPE RVs  The entropy H(x) of a continuous type RV cannot be defined directly as the entropy of a partition because the events {x = x_i} are noncountable, and P{x = x_i} = 0 for every x_i. To avoid this difficulty, we shall define H(x) as an expected value, extending (12-13) to continuous type RVs. Note that unlike (12-11), the resulting expression (12-14) is not consistent with the interpretation of entropy as a measure of uncertainty about the values of x; however, as we show in Section 12-2, it leads to useful applications.
• Definition. Denoting by f(x) the density of x, we form the RV −ln f(x). The expected value of this RV is by definition the entropy H(x) of x:
H(x) = −E{ln f(x)} = −∫ f(x) ln f(x) dx   (12-14)
Note that f(x) ln f(x) → 0 as f(x) → 0; hence, the integral is 0 in any region of the x axis where f(x) = 0.
Example 12.6
The RV x has an exponential distribution:
f(x) = αe^{−αx}U(x)
In this case, E{x} = 1/α and ln f(x) = ln α − αx for x > 0; hence,
H(x) = −E{ln α − αx} = −ln α + 1 = ln (e/α) •
Example 12.7
If f(x) is N(η, σ), then
f(x) = [1/(σ√2π)] e^{−(x−η)²/2σ²}    −ln f(x) = ln (σ√2π) + (x − η)²/2σ²
Hence,
H(x) = −E{ln f(x)} = ln (σ√2π) + E{(x − η)²/2σ²}
And since E{(x − η)²} = σ², we obtain
H(x) = ln (σ√2π) + 1/2 = ln (σ√(2πe)) •
Example 12.8
The RV x is uniform in the interval (0, c). Thus f(x) = 1/c for 0 < x < c and 0 elsewhere; hence,
H(x) = −E{ln (1/c)} = −(1/c) ∫_0^c ln (1/c) dx = ln c •
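The closed forms of Examples 12.6 to 12.8 can be checked by evaluating the integral (12-14) on a grid. The Python sketch below does this with a simple midpoint rule; the parameter values are hypothetical and the integration limits are chosen wide enough that the omitted tails are negligible.

# numerical check of (12-14) against the closed-form entropies of Examples 12.6-12.8
import math

def entropy_integral(f, a, b, steps=100000):
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * h
        fx = f(x)
        if fx > 0:
            total -= fx * math.log(fx) * h
    return total

alpha, sigma, c = 2.0, 1.5, 3.0
expo = lambda x: alpha * math.exp(-alpha * x)
norm = lambda x: math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
unif = lambda x: 1.0 / c

print(entropy_integral(expo, 0, 50), 1 - math.log(alpha))                                   # Example 12.6
print(entropy_integral(norm, -20, 20), math.log(sigma * math.sqrt(2 * math.pi * math.e)))   # Example 12.7
print(entropy_integral(unif, 0, c), math.log(c))                                            # Example 12.8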
Joint Entropy  Extending (12-13), we define the joint entropy H(x, y) of two RVs x and y as the expected value of the RV −ln f(x, y) where f(x, y) is the joint density of x and y. Suppose, first, that the RVs x and y are of the discrete type, taking the values x_i and y_j with probabilities
P{x = x_i, y = y_j} = p_ij    i = 1, ..., M    j = 1, ..., N
In this case, the function f(x, y) is different from 0 at the points (x_i, y_j), and f(x_i, y_j) = p_ij. Hence,
H(x, y) = −E{ln f(x, y)} = −Σ_{i,j} f(x_i, y_j) ln f(x_i, y_j)   (12-15)
If the RVs x and y are of the continuous type, then
H(x, y) = −E{ln f(x, y)} = −∫∫ f(x, y) ln f(x, y) dx dy   (12-16)
Example 12.9
If the RVs x and y are jointly normal as in (5-100), then (see Problem 12-15) their joint entropy equals
H(x, y) = ln (2πe σ_1σ_2 √(1 − r²)) •
Note that if the RVs x and y are independent, their joint entropy equals the sum of their "marginal entropies" H(x) and H(y):
H(x, y) = H(x) + H(y)   (12-17)
Indeed, in this case,
f(x, y) = f_x(x)f_y(y)    ln f(x, y) = ln f_x(x) + ln f_y(y)
and (12-17) results.
12-2
Maximum Entropy and Statistics
We shall use the concept of entropy to determine the distribution F(x) of an RV x or of other unknown quantities of a probabilistic model. The known information, if it exists, is in the form of parameters providing only partial specification of the model. It is assumed that no observations are available.
Suppose that x is a continuous type RV with unknown density f(x). We know the second moment E{x²} = θ of x, and we wish to determine f(x). Thus our problem is to find a positive function f(x) of unit area such that
∫ x²f(x) dx = θ   (12-18)
Clearly, this problem does not have a unique solution because there are many densities satisfying (12-18). Nevertheless, invoking the following principle, we shall find a solution.
Principle of Maximum Entropy  The unknown density must be such as to maximize the entropy
H(x) = −∫ f(x) ln f(x) dx   (12-19)
of the RV x.
As we shall see, this condition leads to the unique function
f(x) = γe^{−x²/2θ}
Thus the principle of maximum entropy (ME) leads to the conclusion that the RV x must be normal. This is a remarkable conclusion! However, it is based on a principle the validity of which we have not established.
The justification of the ME principle is usually based on the relationship between entropy and uncertainty. Entropy is a measure of uncertainty; if f(x) is unknown, then our uncertainty about the values of x is maximum; hence, f(x) must be such as to maximize H(x). This reasoning is heuristic because uncertainty is not a precise concept. We shall give another justification based on the relationship between entropy and typical sequences. This
justification is also heuristic; however, it shows the connection between
entropy and relative frequency, a concept that is central in the applications
of statistics. The ME principle has been used in a variety of physical problems, and in many cases, the results are in close agreement with the observations. In the last analysis, this is the best justification of the principle.
Method of Maximum Entropy
All ill-posed problems dealing with the specification of a probabilistic model
can be solved with the method of maximum entropy. However, the usefulness of the solution varies greatly from problem to problem and is greatest in
applications dealing with averages of very large samples (statistical mechanics, for example). We shall consider the problem of determining the distribution of one or more RVs under the assumption that certain statistical averages are known. As we shall see, this problem has a simple analytic solution.
In other applications the ME method might involve very complex computations. Our development is based on the following version of (12-3).
A Basic Inequality
If c(x) is a function such that
∫ c(x) dx = 1    c(x) ≥ 0
and f(x) is the density of an RV x, then
−∫ f(x) ln f(x) dx ≤ −∫ f(x) ln c(x) dx   (12-20)
Equality holds iff f(x) = c(x).
• Proof. From the inequality ln z ≤ z − 1 it follows with z = c(x)/f(x) that
∫ f(x) ln [c(x)/f(x)] dx ≤ ∫ f(x)[c(x)/f(x) − 1] dx = 0
Hence,
0 ≥ ∫ f(x) ln c(x) dx − ∫ f(x) ln f(x) dx
and (12-20) results.
We shall use this inequality to determine the ME solution of various ill-posed problems. The density of x so obtained will be denoted by f_0(x) and will be called the ME density. Thus f_0(x) maximizes the integral
H(x) = −∫ f(x) ln f(x) dx
The corresponding value of H(x) will be denoted by H_0(x).
• Fundamental Theorem. (a) If the mean E{g(x)} = θ of a function g(x) of the RV x is known, its ME density f_0(x) is an exponential
f_0(x) = γe^{−λg(x)}   (12-21)
where γ and λ are two constants such that
γ ∫ e^{−λg(x)} dx = 1    γ ∫ g(x)e^{−λg(x)} dx = θ   (12-22)
• Proof. If f_0(x) is given by (12-21), then
ln f_0(x) = ln γ − λg(x)
The corresponding entropy H_0(x) equals
−∫ f_0(x) ln f_0(x) dx = −∫ f_0(x)[ln γ − λg(x)] dx
Hence,
H_0(x) = −ln γ + λθ   (12-23)
To show that f_0(x) is given by (12-21), it therefore suffices to show that if f(x) is any other function such that
∫ f(x) dx = 1    ∫ g(x)f(x) dx = θ
the corresponding entropy is less than −ln γ + λθ. To do so, we set c(x) = f_0(x) in (12-20). This yields
−∫ f(x) ln f(x) dx ≤ −∫ f(x) ln f_0(x) dx = −∫ f(x)[ln γ − λg(x)] dx
Thus H(x) ≤ −ln γ + λθ, and the proof is complete.
(b) Suppose now that in addition to the information that E{g(x)} = θ, we require that f(x) = 0 for x ∉ R where R is a specified region of the x axis. Replacing in (a) the entire axis by the region R, we conclude that all results hold. Thus
f_0(x) = γe^{−λg(x)} for x ∈ R and f_0(x) = 0 for x ∉ R   (12-24)
and H_0(x) = −ln γ + λθ. The constants λ and γ are again determined from (12-22) provided that the region of integration is the set R.
(c) If no information about f(x) is known, the ME density is a constant. This follows from (12-24) with g(x) = 0. In this case, the problem has a solution only if the region R in which f(x) ≠ 0 has finite length. Thus if no information about f(x) is given, f_0(x) does not exist; that is, we can find an f(x) with arbitrarily large entropy. Suppose, however, that we require that f(x) ≠ 0 only in a region R consisting of a number of intervals of total length c. In this case, (12-24) yields
f_0(x) = 1/c for x ∈ R and f_0(x) = 0 for x ∉ R   (12-25)
Note that the theorem does not establish the existence of a ME density. It states only that if f_0(x) exists, it is the function in (12-21).
Discrete Type RVs  Suppose, finally, that the RV x takes the values x_i with probability p_i. We shall determine the ME values p_{0i} of p_i under the assumption that the mean E{g(x)} = θ of the function g(x) is known. Thus our
problem is to find p_i such as to maximize the entropy
H(x) = −E{ln f(x)} = −Σ p_i ln p_i   (12-26)
subject to the constraints
Σ p_i = 1    Σ p_i g(x_i) = θ   (12-27)
where g(x_i) are known numbers.
We maintain that
p_{0i} = γe^{−λg(x_i)}   (12-28)
where γ and λ are two constants such that
γ Σ e^{−λg(x_i)} = 1    γ Σ e^{−λg(x_i)} g(x_i) = θ   (12-29)
This follows from (12-24) if we use for R the set of points x_i. However, we shall give a direct proof based on (12-3).
• Proof. If p_{0i} is given by (12-28), then ln p_{0i} = ln γ − λg(x_i). Hence, the corresponding entropy H_0(x) equals
−Σ p_{0i} ln p_{0i} = −ln γ Σ p_{0i} + λ Σ p_{0i}g(x_i) = −ln γ + λθ
It therefore suffices to show that if p_i is another set of probabilities satisfying (12-27), the corresponding entropy is less than H_0(x). To do so, we set c_i = p_{0i} in (12-3). This yields
−Σ p_i ln p_i ≤ −Σ p_i ln p_{0i} = −Σ p_i[ln γ − λg(x_i)] = H_0(x)
and the proof is complete.
ILLUSTRATIONS  Suppose, first, that E{x²} = θ. In this case, (12-21) yields
f_0(x) = γe^{−λx²}   (12-30)
Thus if the second moment θ of an RV x is known, x is N(0, √θ). Hence,
γ = 1/√(2πθ)    λ = 1/2θ    H_0(x) = ln √(2πeθ)
If the variance σ² of x is specified, then x is N(η, σ) where η is an arbitrary constant. This follows from (12-21) with g(x) = (x − η)².
Example 12.10
Consider a collection of particles moving randomly in a certain region. If they are in statistical equilibrium, the x component v_x of their velocity can be considered as the sample of an RV v_x with distribution f(v_x). We maintain that if the average kinetic energy
N_x = E{(1/2)mv_x²}
of the particles is specified, the ME density of v_x equals
f_0(v_x) = γ exp{−mv_x²/4N_x}
Indeed, this follows from (12-30) with θ = E{v_x²} = 2N_x/m. Thus v_x is N(0, σ) where σ² = 2N_x/m equals twice the average kinetic energy per unit mass. The same holds for v_y and v_z. •
Suppose that x is an RV with mean E{x} = θ and such that f(x) = 0 for x < 0. In this case, f_0(x) is given by (12-24) where R is the region x > 0 and g(x) = x. This yields
f_0(x) = γe^{−λx}U(x)    γ = λ = 1/θ   (12-32)
Thus the ME density of a positive RV with specified mean is exponential.
Example 12.11
Using the ME method, we shall determine the atmospheric pressure P(z) as a function of the distance z from the ground knowing only the ratio N/m of the energy N over the mass m of a column of air. Assuming statistical equilibrium, we can interpret N as the energy and m as the mass of all particles in a vertical cylinder C of unit cross section (Fig. 12.6). The location z of each particle can be considered as the sample of an RV z with density f(z). We shall show that
f_0(z) = (mg/N) e^{−mgz/N}U(z)   (12-33)
where g is the acceleration of gravity.
The probability that a particle is in a cylindrical segment Δ between z and z + dz equals f(z)dz; hence, the average mass in Δ equals mf(z)dz. Since the energy of a unit mass at distance z from the ground equals gz, we conclude that the energy of the mass in the region Δ equals gzmf(z)dz, and the total energy N equals
N = ∫_0^∞ mgz f(z) dz = E{mgz}
With θ = E{z} = N/mg, (12-33) follows from (12-32). The atmospheric pressure P(z) equals the weight of the air in the cylinder C above z:
P(z) = ∫_z^∞ mg f_0(z′) dz′ = mg e^{−mgz/N} •
Figure 12.6
Figure 12.7
Example 12.12
In a coin experiment, the probability p = P{h} that heads will show is the value of an RV p with unknown density f(p). We wish to find its ME form f_0(p).
(a) Clearly, f(p) = 0 outside the interval (0, 1); hence, if nothing else is known, then [see (12-25)] f_0(p) = 1, as in Fig. 12.7a.
(b) We assume that E{p} = θ = 0.236. In this case, (12-24) yields

f_0(p) = γ e^{−λp}        0 < p < 1            (12-34)

where γ and λ are such that

γ ∫_0^1 e^{−λp} dp = 1        γ ∫_0^1 p e^{−λp} dp = 0.236

Solving for γ and λ, we find γ = 1.1, λ = 1.2 (Fig. 12.7b).  •
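The two conditions in (12-34) generally have to be solved numerically. The following sketch is our illustration, not part of the text: it fixes γ through the area condition and finds λ with a one-dimensional root search, assuming NumPy and SciPy are available.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def me_density_on_unit_interval(theta):
    """Constants of the ME density f0(p) = gamma*exp(-lam*p) on (0, 1) with E{p} = theta."""
    def mean_given_lam(lam):
        # gamma drops out of the ratio: E{p} = int p*exp(-lam*p) dp / int exp(-lam*p) dp
        area = quad(lambda p: np.exp(-lam * p), 0.0, 1.0)[0]
        m = quad(lambda p: p * np.exp(-lam * p), 0.0, 1.0)[0]
        return m / area
    # the implied mean decreases monotonically in lam, so a sign-change bracket suffices
    lam = brentq(lambda l: mean_given_lam(l) - theta, -50.0, 50.0)
    gamma = 1.0 / quad(lambda p: np.exp(-lam * p), 0.0, 1.0)[0]
    return gamma, lam

gamma, lam = me_density_on_unit_interval(0.236)   # theta as in Example 12.12
```

The same routine works for any specified mean in (0, 1); only the value of theta changes.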
Example 12.13
(Brandeis Die†). In the die experiment, we are given the information that the average number of faces up equals 4.5. Using this information, we shall determine the ME values p_0i of the probabilities p_i = P{f_i}. To do so, we introduce the RV x such that x(f_i) = i. Clearly,

E{x} = Σ_{i=1}^{6} i p_i = 4.5

With g(x) = x and x_i = i, it follows from (12-28) that

p_0i = γ e^{−λi}        i = 1, . . . , 6            (12-35)

where the constants γ and λ are such that

γ Σ_{i=1}^{6} e^{−λi} = 1        γ Σ_{i=1}^{6} i e^{−λi} = 4.5

To solve this system, we plot the ratio

η(w) = (w^{−1} + 2w^{−2} + · · · + 6w^{−6}) / (w^{−1} + w^{−2} + · · · + w^{−6})

and we find the value of w such that η(w) = 4.5. This yields w = 0.68 (see Fig. 12.8). Hence, γ = 0.036,

p_01 = .054    p_02 = .079    p_03 = .114    p_04 = .165    p_05 = .240    p_06 = .348  •

† E. T. Jaynes, Brandeis lectures, 1962.
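The root of η(w) = 4.5 can also be located numerically instead of graphically. The short sketch below is ours and assumes NumPy and SciPy; it reproduces the probabilities listed above.

```python
import numpy as np
from scipy.optimize import brentq

i = np.arange(1, 7, dtype=float)

def eta(w):
    """The ratio sum(i * w**-i) / sum(w**-i) used in Example 12.13."""
    t = w ** (-i)
    return np.dot(i, t) / t.sum()

w = brentq(lambda v: eta(v) - 4.5, 0.1, 10.0)   # constraint E{x} = 4.5
t = w ** (-i)
gamma = 1.0 / t.sum()
p0 = gamma * t   # ME probabilities p_0i = gamma * w**(-i), increasing toward face 6
```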
Figure 12.8
GENERALIZATION   Now let us consider the problem of determining the density of an RV x under the assumption that the expected values

θ_k = E{g_k(x)} = ∫_{−∞}^{∞} g_k(x) f(x) dx        k = 1, . . . , n            (12-36)

of n functions g_k(x) of x are known.
Reasoning as in (12-21), we can show that the ME density of x is the function

f_0(x) = γ exp{−λ_1 g_1(x) − · · · − λ_n g_n(x)}            (12-37)

The n + 1 constants γ and λ_k are determined from the area condition

γ ∫_{−∞}^{∞} exp{−Σ_k λ_k g_k(x)} dx = 1            (12-38)

and the n equations [see (12-36)]

γ ∫_{−∞}^{∞} g_k(x) exp{−Σ_j λ_j g_j(x)} dx = θ_k        k = 1, . . . , n            (12-39)

The proof is identical to the proof of (12-21) if we replace the terms λg(x) and λθ by the sums Σ_k λ_k g_k(x) and Σ_k λ_k θ_k, respectively. The resulting entropy equals

H_0(x) = −ln γ + Σ_k λ_k θ_k            (12-40)
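Equations (12-38) and (12-39) form n + 1 nonlinear equations in γ, λ_1, . . . , λ_n. A generic numerical approach, sketched below as our illustration (it assumes NumPy and SciPy and a finite integration interval), eliminates γ through the area condition and searches for the λ_k that reproduce the prescribed moments.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

def solve_me_density(gs, thetas, lam0, support=(-20.0, 20.0)):
    """Find gamma and lam_k so that f0(x) = gamma*exp(-sum_k lam_k*g_k(x)) has E{g_k(x)} = theta_k."""
    thetas = np.asarray(thetas, dtype=float)

    def unnormalized(x, lams):
        return np.exp(-sum(l * g(x) for l, g in zip(lams, gs)))

    def residuals(lams):
        area = quad(lambda x: unnormalized(x, lams), *support)[0]
        moments = [quad(lambda x: g(x) * unnormalized(x, lams), *support)[0] / area
                   for g in gs]
        return np.array(moments) - thetas

    lams = fsolve(residuals, lam0)
    gamma = 1.0 / quad(lambda x: unnormalized(x, lams), *support)[0]
    return gamma, lams

# e.g. constraints E{x} = 1, E{x^2} = 2 (our numbers); a rough starting guess helps fsolve
gamma, lams = solve_me_density([lambda x: x, lambda x: x * x], [1.0, 2.0], [0.0, 0.5])
```

With g_1(x) = x and g_2(x) = x², the routine recovers a normal density, in agreement with Example 12.14 below.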
Example 12.14
Given the first two moments

θ_1 = E{x}        θ_2 = E{x²}

of the RV x, we wish to find f_0(x).
In this problem, g_1(x) = x, g_2(x) = x², and (12-37) yields

f_0(x) = γ e^{−λ_1 x − λ_2 x²}

This shows that f_0(x) is a normal density with mean η = θ_1 and variance σ² = θ_2 − θ_1².  •
Partition Function   The partition function is by definition the integral

Z(λ_1, . . . , λ_n) = ∫_{−∞}^{∞} exp{−Σ_k λ_k g_k(x)} dx            (12-41)

As we see from (12-38), Z = 1/γ. Furthermore,

∂Z/∂λ_k = −∫_{−∞}^{∞} g_k(x) exp{−Σ_j λ_j g_j(x)} dx            (12-42)

Comparing with (12-39), we conclude that

−(1/Z) ∂Z/∂λ_k = θ_k        k = 1, . . . , n            (12-43)

This is a system of n equations equivalent to (12-39); however, it involves only the n parameters λ_k.
Consider, for example, the coin experiment where

E{p} = ∫_0^1 p f(p) dp = θ

is a given number. In this case [see (12-34)],

Z = 1/γ = ∫_0^1 e^{−λp} dp = (1 − e^{−λ})/λ

and with n = 1, (12-43) yields

−(1/Z) ∂Z/∂λ = (1 − e^{−λ} − λe^{−λ}) / (λ(1 − e^{−λ})) = θ

To find f_0(p) for a given θ, it suffices to solve this equation for λ.
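Because the partition function reduces the problem to the n parameters λ_k, a root search on (12-43) is usually the simplest numerical route. The sketch below (ours, assuming NumPy and SciPy) treats the coin case just described.

```python
import numpy as np
from scipy.optimize import brentq

def mean_from_lam(lam):
    """-(1/Z) dZ/dlam for Z = (1 - exp(-lam))/lam, i.e. the mean E{p} implied by lam."""
    if lam == 0.0:
        return 0.5
    e = np.exp(-lam)
    return (1.0 - e - lam * e) / (lam * (1.0 - e))

theta = 0.3                                   # any specified mean in (0, 1); our example value
lam = brentq(lambda l: mean_from_lam(l) - theta, -60.0, 60.0)
gamma = lam / (1.0 - np.exp(-lam))            # gamma = 1/Z
```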
Discrete Type RVs   Consider, finally, a discrete type RV x taking the values x_i with probability p_i. We shall determine the ME values p_0i of p_i under the assumption that we know the n constants

θ_k = E{g_k(x)} = Σ_i p_i g_k(x_i)        k = 1, . . . , n

Reasoning as in (12-28), we obtain

p_0i = (1/Z) exp{−λ_1 g_1(x_i) − · · · − λ_n g_n(x_i)}            (12-44)

where

Z = Σ_i exp{−λ_1 g_1(x_i) − · · · − λ_n g_n(x_i)}            (12-45)

is the discrete form of the partition function. From this and (12-44), it follows that Z satisfies the n equations

−(1/Z) ∂Z/∂λ_k = θ_k        k = 1, . . . , n            (12-46)

Thus to determine p_0i, it suffices to form the sum in (12-45) and solve the system (12-46) for the n parameters λ_k.
In Example 12.13, we assumed that E{x} = η. In this case, n = 1,

−(1/Z) ∂Z/∂λ = (Σ_{i=1}^{6} i e^{−λi}) / (Σ_{i=1}^{6} e^{−λi}) = η

To determine the probabilities p_0i, it suffices to solve the last equation for λ.
12-3
Typical Sequences and Relative Frequency
We commented earlier on the subjective interpretation of entropy as a measure of uncertainty about the occurrence of the events 𝒜_i of a partition A at a single trial. Next we give a different interpretation based on the relationship between the entropy H(A) of A and the number n_t of typical sequences, that is, sequences that are most likely to occur in a large number of trials. This interpretation shows the equivalence between entropy and relative frequency, and it establishes the connection between the model concept H(A) and the real world.
Typical sequences were introduced in Section 8-2 in the context of a partition consisting of an event 𝒜 and its complement 𝒜̄. Here we generalize to arbitrary partitions. As preparation, let us review the analysis of the two-event partition A = [𝒜, 𝒜̄].
In the experiment ℰ_n of repeated trials, the sequence

s_j = 𝒜 𝒜̄ 𝒜̄ · · · 𝒜        j = 1, . . . , 2^n            (12-47)

is an event with probability

P(s_j) = p^k q^{n−k}        p = P(𝒜) = 1 − q            (12-48)

where k is the number of successes of 𝒜. The number n_s of such sequences equals 2^n. We know from the empirical interpretation of probability that if n is large, we expect with near certainty that k ≈ np. We can thus divide the 2^n sequences of the form (12-47) into two groups. The first group consists of all sequences such that k ≈ np. These sequences will be called typical and will be identified by the letter t. The second group, called atypical, consists of all sequences that are not typical.
NUMBER OF TYPICAL SEQUENCES   We show next that the number n_t of typical sequences can be expressed in terms of the entropy

H(A) = −(p ln p + q ln q)

of the partition A. This number is empirical because it is based on the empirical formula k ≈ np. As we shall show, it can be given a precise interpretation based on the law of large numbers.
To determine n_t, we shall first find the probability of a typical sequence t_j. Clearly, P(t_j) is given by (12-48) where now

k ≈ np        n − k ≈ n − np = nq

Inserting into (12-48), we obtain

P(t_j) ≈ p^{np} q^{nq} = e^{np ln p + nq ln q} = e^{−nH(A)}            (12-49)

The union 𝒯 of all typical sequences is an event in the space ℰ_n. This event will almost certainly occur at a single trial of ℰ_n because it consists of all sequences with k ≈ np. Hence, P(𝒯) ≈ 1. And since all typical sequences have the same probability, we conclude that 1 ≈ P(𝒯) = n_t P(t_j). This yields

n_t ≈ e^{nH(A)}            (12-50)

We shall now compare n_t to the number 2^n of all sequences of the form (12-47). If p = .5, then H(A) = ln 2; hence, n_t = e^{n ln 2} = 2^n. If p ≠ .5, then H(A) < ln 2, and for large n,

n_t ≈ e^{nH(A)} ≪ e^{n ln 2} = 2^n            (12-51)

This shows (Fig. 12.9) that if p ≠ .5, then the number n_t of typical sequences is much smaller than the number 2^n of all sequences even though P(𝒯) ≈ 1. Thus if the experiment ℰ_n is repeated a large number of times, most sequences that will occur are typical (Fig. 12.10).
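To get a feel for how sharp the gap in (12-51) is, it is enough to compare the two counts on a logarithmic scale; the short sketch below is ours and uses natural logarithms, so that ln n_t = nH(A).

```python
import numpy as np

def entropy_two_events(p):
    """H(A) = -(p ln p + q ln q) for the two-event partition."""
    q = 1.0 - p
    return -(p * np.log(p) + q * np.log(q))

n, p = 1000, 0.3
ln_typical = n * entropy_two_events(p)   # ln n_t, from (12-50)
ln_all = n * np.log(2.0)                 # ln 2^n
```

The fraction n_t/2^n ≈ e^{−n(ln 2 − H(A))} falls off exponentially with n whenever p ≠ .5.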
Figure 12.9
Figure 12.10   The n_s = 2^n sequences of the experiment ℰ_n: n_t ≈ e^{nH(A)} of them are typical (k ≈ np); the remaining sequences are atypical (k ≠ np).
Note   Since P(𝒯) ≈ 1, most (but not all) atypical sequences have a small probability of occurrence. If p < .5, then p/q < 1; hence, the sequence p^k q^{n−k} is decreasing as k increases. This shows that all atypical sequences with k < np are more likely than the typical sequences. However, the number of such sequences is small compared to n_t (see also Fig. 12.11).
Typical Sequences and Bernoulli Trials   We shall reexamine the preceding concepts in the context of Bernoulli trials using the following definitions. We shall say that a sequence of the form (12-47) is typical if the number k of successes of 𝒜 is in the interval (Fig. 12.11)

k_a < k < k_b        k_a = np − 3√(npq)        k_b = np + 3√(npq)            (12-52)

We shall use the De Moivre-Laplace approximation to determine the probability of the set 𝒯 consisting of all typical sequences so defined. Since k is in the ±3√(npq) interval centered at np, it follows from (3-30) that

P(𝒯) = Σ_{k=k_a}^{k_b} C(n, k) p^k q^{n−k} ≈ 2G(3) − 1 = .997            (12-53)

Figure 12.11
This shows that if the experiment ℰ_n is repeated a large number of times, in 99.7% of the cases we will observe only typical sequences. The number of such sequences is

n_t = Σ_{k=k_a}^{k_b} C(n, k)            (12-54)

To find this sum, we set p = q = 1/2 in (12-53). This yields the approximation

(1/2^n) Σ_{k=k_1}^{k_2} C(n, k) ≈ .997        k_1 = n/2 − 3√n/2        k_2 = n/2 + 3√n/2            (12-55)

and it shows that .997 × 2^n of the 2^n sequences of the form (12-47) are in the interval (k_1, k_2) of Fig. 12.11. If p ≠ .5, then the interval (k_a, k_b) in (12-52) is outside the interval (k_1, k_2); hence n_t is less than .003 × 2^n.
The preceding analysis can be used to give a precise interpretation of (12-52) in the form of a limit. With k_a and k_b as in (12-52), it can be shown that the ratio

(1/n) ln n_t = (1/n) ln Σ_{k=k_a}^{k_b} C(n, k)            (12-56)

tends to H(A) as n → ∞. The proof, however, is not simple.
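The limit just stated is easy to examine numerically for moderate n: sum the binomial coefficients over (k_a, k_b) in log space and compare (1/n) ln n_t with H(A). The sketch below is ours and assumes NumPy and SciPy.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def entropy(p):
    q = 1.0 - p
    return -(p * np.log(p) + q * np.log(q))

def log_typical_count(n, p):
    """ln n_t, where n_t sums C(n, k) over np - 3*sqrt(npq) < k < np + 3*sqrt(npq)."""
    s = 3.0 * np.sqrt(n * p * (1.0 - p))
    ks = np.arange(max(0, int(np.ceil(n * p - s))), min(n, int(np.floor(n * p + s))) + 1)
    log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
    return logsumexp(log_binom)

p = 0.3
for n in (100, 1000, 10000):
    print(n, log_typical_count(n, p) / n, entropy(p))   # the two columns draw together as n grows
```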
ARBITRARY PARTITIONS   Consider an arbitrary partition A = [𝒜_1, . . . , 𝒜_N] consisting of the N events 𝒜_i. In the experiment ℰ_n of repeated trials, we observe sequences of the form

s_j = ℬ_1 · · · ℬ_k · · · ℬ_n        j = 1, . . . , N^n            (12-57)

where ℬ_k is any one of the events 𝒜_i of A. The sequence s_j is an event in the space ℰ_n, and its probability equals

P(s_j) = p_1^{k_1} · · · p_N^{k_N}        p_i = P(𝒜_i)            (12-58)

where k_i is the number of successes of 𝒜_i. If k_i ≈ np_i for every i, then s_j is a typical sequence t_j, and its probability equals

P(t_j) ≈ p_1^{np_1} · · · p_N^{np_N} = e^{np_1 ln p_1 + · · · + np_N ln p_N} = e^{−nH(A)}            (12-59)

where H(A) = −(p_1 ln p_1 + · · · + p_N ln p_N) is the entropy of the partition A. The union 𝒯 of all typical sequences is an event in the space ℰ_n, and for large n, its probability equals almost 1 because almost certainly k_i ≈ np_i. From this and (12-59) it follows that the number n_t of typical sequences equals

n_t ≈ e^{nH(A)}            (12-60)

as in (12-50). If the events 𝒜_i are not equally likely, H(A) < ln N, and (12-60) yields

n_t ≈ e^{nH(A)} ≪ e^{n ln N} = N^n            (12-61)

This shows that the number of typical sequences is much smaller than the number N^n of all sequences of the form (12-57).
Maximum Entropy and Typical Sequences   We shall reexamine the concept of maximum entropy in the context of typical sequences, limiting the discussion to the determination of the probabilities p_i of a partition. As we see from (12-60), the entropy H(A) of the partition A is maximum iff the number of typical sequences generated by the events of A is maximum. Thus the ME principle can be stated as the principle of maximizing the number of typical sequences. Since typical sequences are observable quantities, this equivalence gives a physical interpretation to the concept of entropy.
We comment finally on the relationship between the ME principle and the principle of insufficient reason. Suppose that nothing is known about the probabilities p_i. In this case, H(A) is maximum iff [see (12-5)]

p_1 = · · · = p_N = 1/N

as in (1-8). The resulting number of typical sequences is N^n. If the unknown numbers p_i satisfy various constraints of the form Σ_i p_i g_k(x_i) = θ_k, as in (12-44), no solution can be obtained with the classical method. The ME principle leads to a solution that maximizes the number of typical sequences subject to the given constraints.
Concluding Remarks
In the beginning of our development, we stated that the principal applications of probability involve averages of mass phenomena. This is based on the empirical interpretation p ≈ k/n of the theoretical concept p = P(𝒜). We added, almost in passing, that this interpretation leads to useful results only if the following condition is satisfied: The ratio k/n must approach a constant as n increases, and this constant must be the same for any subsequence of trials. The notion of typical sequences shows that this apparently mild condition imposes severe restrictions on the class of phenomena for which it holds. It shows that of the N^n sequences that we can form with the N elements of a partition A, only e^{nH(A)} are likely to occur; most of the remaining sequences are nearly impossible.
Four Interpretations of Entropy   We conclude with a summary of the similarities between the various interpretations of probability and entropy.

Probability   In Chapter 1, we introduced the following interpretations of probability.
Axiomatic: P(𝒜) is a number p assigned to an event 𝒜 of an experiment ℰ.
Empirical: In n repetitions of the experiment ℰ,

p ≈ k/n            (12-62)

Subjective: P(𝒜) is a measure of our uncertainty about the occurrence of 𝒜 in a single performance of ℰ.
Principle of insufficient reason: If the certain event is the union of N events 𝒜_i of a partition A and nothing is known about the probabilities p_i = P(𝒜_i), then p_i = 1/N.
Entropy   The results of this chapter lead to the following interpretations of H(A):
Axiomatic: H(A) is a number H(A) = −Σ_i p_i ln p_i assigned to a partition A of ℰ.
Empirical: This interpretation involves the repeated performance not of the experiment ℰ but of the experiment ℰ_n. In this experiment, a specific typical sequence t_j is an event with probability e^{−nH(A)}. Applying (12-62) to this event, we conclude that if in m repetitions of ℰ_n the event t_j occurs m_j times and m is large, then P(t_j) = e^{−nH(A)} ≈ m_j/m; hence,

H(A) ≈ −(1/n) ln (m_j/m)            (12-63)

This approximation relates the model concept H(A) to the observation m_j and can be used in principle to determine H(A) experimentally. It is, however, impractical.
Subjective: The number H(A) equals our uncertainty about the occurrence of the events 𝒜_i of A in a single performance of ℰ.
Principle of maximum entropy: The unknown probabilities p_i must be such as to maximize H(A), or equivalently, to maximize the number n_t of typical sequences. This yields p_i = 1/N and n_t = N^n if nothing is known about the probabilities p_i.
Problems
12-1  In the die experiment, P{even} = .4. Find the ME values of the probabilities p_i = P{f_i}.
12-2  In the die experiment, the average number of faces up equals 2.21. Find the ME values of the probabilities p_i = P{f_i}.
12-3  Find the ME density of an RV x if f(x) = 0 for |x| > 1 and E{x} = 0.31.
12-4  It is observed that the duration of telephone calls is a number x between 1 and 5 minutes and its mean is 3 min 37 sec. Find its ME density.
12-5  It is known that the range of an RV x is the interval (8, 10). Find its ME density if η_x = 9 and σ_x = 1.
12-6  The density f(x) of an RV x is such that

∫_{−π}^{π} f(x) dx = 1        ∫_{−π}^{π} f(x) cos x dx = 0.5

Find the ME form of f(x).
12-7  The number x of daily car accidents in a city does not exceed 30, and its mean equals 3. Find the ME values of the probabilities P{x = k} = p_k.
12-8  We are given a die with P{even} = .5 and are told that the mean of the number x of faces up equals 4. Find the ME values of p_i = P{x = i}.
12-9  Suppose that x is an RV with entropy H(x) and y = 3x. Express the entropy H(y) of y in terms of H(x): (a) if x is of discrete type; (b) if x is of continuous type.
12-10  Show that if c(x, y) is a positive function of unit volume and x, y are two RVs with joint density f(x, y), then

−E{ln f(x, y)} ≤ −E{ln c(x, y)}

12-11  Show that if the expected values θ_k = E{g_k(x, y)} of the m functions g_k(x, y) of the RVs x and y are known, then their ME density equals

f(x, y) = γ exp{−λ_1 g_1(x, y) − · · · − λ_m g_m(x, y)}

12-12  Find the ME density of the RVs x and y if E{x²} = 4, E{y²} = 4, and E{xy} = 3.
12-13  Show that if the RVs z and w are jointly normal as in (5-100), then H(z, w) = ln (2πe σ_z σ_w √(1 − r²)).
12-14  (a) The RVs x and y are N(0, 2) and N(0, 3), respectively. Find the maximum of their joint entropy H(x, y). (b) The joint entropy of the RVs x and y is maximum subject to the constraints E{x²} = 4 and E{y²} = 9. Show that these RVs are normal and independent.
12-15  Suppose that x_1 = x + y, y_1 = x − y. Show that if the RVs x and y are of the discrete type, then H(x_1, y_1) = H(x, y), and if they are of the continuous type, then H(x_1, y_1) = H(x, y) + ln 2.
12-16  The joint entropy of n RVs x_i is by definition H(x_1, . . . , x_n) = −E{ln f(x_1, . . . , x_n)}. Show that if the RVs x_i are the samples of x, then H(x_1, . . . , x_n) = nH(x).
12-17  In the experiment of two fair dice, A is a partition consisting of the events 𝒜_1 = {seven}, 𝒜_2 = {eleven}, and 𝒜_3 = (𝒜_1 ∪ 𝒜_2)^c. (a) Find its entropy. (b) The dice were rolled 100 times. Find the number of typical and atypical sequences formed with the events 𝒜_1, 𝒜_2, and 𝒜_3.
12-18  (Coding and entropy). We wish to transmit pictures consisting of rectangular arrays of two-level spots through a binary channel. If we identify the black spots with 0 and the white spots with 1, the required time is T seconds. We are told that 83% of the spots are black and 17% are white. Show that by proper coding of the spots, the time of transmission can be reduced to 0.65T seconds.
Tables
In the following tables, we list the standard normal distribution G(z) and the u-percentiles

z_u        χ²_u(n)        t_u(n)        F_u(m, n)

of the standard normal, the chi-square, the Student t, and the Snedecor F distributions. The u-percentile x_u of a distribution F(x) is the value x_u of x such that (Fig. T.1)

u = F(x_u) = ∫_{−∞}^{x_u} f(x) dx

Thus x_u is the inverse of the function u = F(x). If f(x) is even, F(−x) = 1 − F(x) and x_{1−u} = −x_u. It suffices, therefore, to list F(x) for x ≥ 0 only and x_u for u ≥ .5 only.

Figure T.1

In Table 1a, we list the normal distribution G(z) for 0 ≤ z ≤ 3. For z > 3, we can use the approximation

G(z) ≈ 1 − (1/(z√(2π))) e^{−z²/2}

In Table 1b, we list the z_u-percentiles of the N(0, 1) distribution G(z). The x_u-percentile of the N(η, σ) distribution G((x − η)/σ) is x_u = η + z_u σ.
In Table 2, we list the χ²_u(n) percentiles. This is a number depending on u and on the parameter n. For large n, the χ²(n) distribution F_χ(x) approaches a normal distribution with mean n and variance 2n. Hence,

u = F_χ(χ²_u) ≈ G((χ²_u − n)/√(2n))        χ²_u(n) ≈ n + z_u √(2n)

The following is a better approximation:

χ²_u(n) ≈ ½ (z_u + √(2n − 1))²

In Table 3, we list the t_u(n)-percentiles. For large n, the t(n) distribution F_t(x) approaches a normal distribution with mean zero and variance n/(n − 2). Hence,

u = F_t(t_u) ≈ G(t_u/√(n/(n − 2)))        t_u(n) ≈ z_u √(n/(n − 2))

The F_u(m, n) percentiles depend on the two parameters m and n and are determined in terms of their values for u ≥ .5 because F_u(m, n) = 1/F_{1−u}(n, m). They are listed in Table 4 for u = .95 and u = .99. Note that

F_{2u−1}(1, n) = t²_u(n)        and        F_u(m, n) ≈ (1/m) χ²_u(m)        for n ≫ 1
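In practice these percentiles are rarely read from printed tables; the sketch below (ours, assuming SciPy) computes them directly and also checks the large-n approximations quoted above.

```python
import numpy as np
from scipy import stats

u, n = 0.95, 30

z_u = stats.norm.ppf(u)                      # z_u percentile of N(0, 1)
chi2_exact = stats.chi2.ppf(u, n)            # chi-square percentile
chi2_approx = 0.5 * (z_u + np.sqrt(2 * n - 1)) ** 2
t_exact = stats.t.ppf(u, n)                  # Student t percentile
t_approx = z_u * np.sqrt(n / (n - 2))
f_exact = stats.f.ppf(u, 5, 8)               # F percentile, e.g. F_.95(5, 8)

print(chi2_exact, chi2_approx, t_exact, t_approx, f_exact)
```

The last value reproduces the F_.95(5, 8) = 3.69 entry quoted in Table 4a.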
Table 1a        G(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy

[Tabulated values of G(x) for x = 0.05, 0.10, . . . , 3.00 in steps of 0.05; for example, G(0.10) = .53983, G(1.00) = .84134, G(2.00) = .97726, G(3.00) = .99865.]
Figure T.2
Table 1b        z_u :   u = (1/√(2π)) ∫_{−∞}^{z_u} e^{−y²/2} dy

u      .90      .925     .95      .975     .99      .995     .999     .9995
z_u    1.282    1.440    1.645    1.960    2.326    2.576    3.090    3.291
Table 2        χ²_u(n)        Example: χ²_.95(4) = 9.49

[Tabulated percentiles χ²_u(n) for u = .005, .01, .025, .05, .1, .9, .95, .975, .99, .995 and n = 1, . . . , 20, 22, 24, 26, 28, 30, 40, 50. For larger n, use χ²_u(n) ≈ ½(z_u + √(2n − 1))².]
Table 3        t_u(n)        Example: t_.99(11) = 2.72

[Tabulated percentiles t_u(n) for u = .9, .95, .975, .99, .995 and n = 1, . . . , 20, 22, 24, 26, 28, 30.]

For n ≥ 30:   t_u(n) ≈ z_u √(n/(n − 2))
Table 4a        F_.95(m, n)        Example: F_.95(5, 8) = 3.69

[Tabulated percentiles F_.95(m, n) for m = 1, 2, 3, 4, 5, 6, 8, 10, 20, 30, 40 and n = 1, . . . , 10, 12, 14, 16, 18, 20, 30, 40, 50, 60, 70.]
Table 4b        F_.99(m, n)        Example: F_.99(5, 8) = 6.63

[Tabulated percentiles F_.99(m, n) for the same values of m and n as in Table 4a.]
Answers and Hints
for Selected Problems
CHAPTER3
CHAPTER2
2-2
{4
S I S
{I
<
10}, {7 S
< I}
I S
8}.
{1
< 4},
4} U {8
2-3
(1°)
=- 120
3
2-S
(a)
:4 U ~ u 't:
(f)
(s4
n
3-5
.6, .8, .4, .3
2-,
(~~)1(~~)
2-11
(9018) (10)/('oo·
2
20) == ·318
3-6
3·7
3-10
3-12
3-13
3-14
3-15
3-16
=- .0036
2-12 P(f!/l) = .4
P1
= 20/30;
p~
= 811 I
P(:4) "" I
== 2.994 X 10 ·1 ;
2" - (n ... I)
p = .607
PI
3-2
7
p -= 12M
n '{
2-6
2-16
2-18
2-20
2-22
2-24
PI
3-3
00)
p~
=2 X
10
= .031,
3-1
6
3-l,t'
3-19
p~ -
.016
.5,
p~ - .322
46. 656. 7,776:
PC.7lJ-= .167. PCM =- .842
PI == .HIS.
p~ -= .086.
p~ == .86
PI =- .201
p ... . 164
n = 9604
.924
(a) .919:
(b) .931
p = .06
(a) PI = .29757, P2 -= .26932:
(b) PI -= .29754. p~ = .26938
(a) p -= .011869;
(b) p :. .011857
p = .744
PI -'-
711
CHAPTER4
4-2
4-3
4-5
4-8
4-9
.003, .838,.683, .683
54.3%
x.9~ = 5.624
(a) 6. 7%;
(b) 2.2 months
Differentiate the identity,. f~ e-~· dy = 1 e-•·" with respect to c.
4-10 (a) .04;
(b) I.S x 10" kw
4-11 (a) 4.6%;
(b) 20%
4-11 (a) geometric;
(b)
L" o.s X
.95* = .9Sl0-k
k-10
(b) .186
4-15 (a) .OS;
4-17 Uniform in the interval -9 <
4-19 (a)
4-23
4-24
4-26
4-19
3
x<3
~/,(V,)
~ [.t;({ry) + .t;<- {ry))
(b)
4
(~)
1/.(y) .,. /,( -y)]U(y);
(d) /.(y)U(y); P{y = 0} = f,(O)
Uniform in the interval (3, 5)
a = ::0.5,
b = + 2.5
Elx} = 6
E{P} = 12.1 x 1.004 watts
4-30 (a)
.t;.(y) =
{
I
3~
729 < y < 1.331
0
otherwise
71.• = 1,000, <T~ = 1/3;
(~) £{y} = 1,010
Use the approximation
g(x)"" R(71) + (x - 71)g'C71).
5·9
u~
= u 2;
= 2u4
!n L (X;+
Y;) =! L Xt +! ~ Yt;
•
n
n~
~ L X;Y;
(~ ~ Xt)(~ L )';)
Show that a 1 = ru,lu", a2 = ru,fuy.
11r = o, u). = 8 vTI; a = o, b = 12
:/:
5·13
5-14
u,.. = 11.7 watts
5-16 71 •. = 220 watts;
2/(c: 2 - s 2),
<l»(s)
=
c
71" = 0,
5-18
u, = V1.1c
5-19 Apply (5-73) to the moment function
(5-103).
5-ll /(x): isosceles triangle; base 0 ~ z ~ 2:
/,(s): isosceles triangle: base -I ~ w ~ 1:
/,(s): right triangle; base 0 ~ s ~ I.
S-23 (a) isosceles triangle with base 1.800 ~
R ~ 2.200;
(b) p=.l25
S-14 (a) isosceles triangle with base -20 ~
z ~ 20:
(b) Pz - P{- 5 s z s 4}
5-28 (a) Use (5-92):
(b) /.(z) =
c:2
5·19
J.<z>
J,: !.<z -
y)f,(y) dy
J~ e·.-oertz-al da
Z
>0
= {
;.
c2
fo
('·co~t:· ol da
z<0
S-30 Set z = x/y: show that m2 - m4 =- F:CO):
use (5-107).
(b)
4-31
CHAPTER 5
5-1
5-4
NCO. 112), N(O, 112):
(~) .163
'Y=41Tr;
Use the inequalities F"_,.(x, .v>
(a)
(b)
F,y(x.
5-5
(a)
y) ~
F_.<.v>.
Use the identities
y) = /,Cx ).t;.(y) = f(y. x);
Note that £{yl} = £{xlfz3}.
Show that E{(x - y)2} -= 0.
/Cx.
(b)
S-6
~ f~,Cx).
CHAPTER 6
6-1
6-1
6-3
6-4
6-5
6-6
6-7
6-9
6-10
(a) PI = .55:
(b) P2 = .82
(a) 32.6%;
(b) 32.8 years.
/.(ZIW = 5): /\'(2.5, 3 V2l: £{zlw = 5} = 2.5
PI = .2863,
P2 = .lOSS
cp(x) = x - x2f!
(a) ip(.r) = ..t·;
(b) E{ly - cp(x)x} = 0
Show that E{xy} = E{u}.
It follows from 16-21) and (6-54): 'Y = 513
(a) P{x > t} =- .2St> .,., - .1St> "':
(b) .355
6-11 R(x) = ( I - c·.t)ll•t>·'
7-26 Use (7-89) and Problem 7-25.
7-28 Note that the RV 2x is F(4, 6).
7-30 (a) Use (7-38) and (7-97);
(b) Show that the RV wis normal with
Tlw = Tlx - 'l'h.
= u 2(1/n + 1/m)
and the !lum
n-1
m-1
--::r ~ + -::r ~ is )(2(n + m - 2)
ui.
CHAPTER 7
7-1
CT
Note that E{(c 1x 1
• • • •
-
c,x.)2} ~ 0 for
= A, then [see (7A-3)] 1J2 =
TAT*TAT* = TA 2T" = TAT* = D; hence,
7-31 If A2
any c;.
7-5
Note that
y
7-7
_
(
=X;- X= X;
>..1
1 ".
I - - +- ~ x•.
1)
n
CT
= A.;
7-32 Show that
n ;••·•
E{e"ln} = E{es<a,·····a·1} = «<>;(s)
E{e"} = E{E{en:n}} = E{CI>:(s)}
X
=
7-8
(a)
(b)
2 P•«<>~(s)
k•l
CHAPTER 9
Use (5-63) and Problem 7-7 with
«<>t(s) = cl(s + c);
E{z} = E{E{z,n}} = E{nlc} = Nplc
=
200
7-10
(a)
Use Problem 5-8:
(b) Show that E{(x; 7-13 Note that y 4 <
7-14
7-15
7-16
7-17
7-19
7-20
7-22
7-23
X;- 1) 2}
= 2u 2•
x.~ iff x; s x~ for i ~ k, that
is, iff the event {x :$ x.~} occurs at least k
times.
With n = 5 and k = 3, (7-47) yieldsf,(t) =
3 X 10-3(1 - 55)~(65 - t)2
p = 2G(VJ) - I = .916
Note that E{xn = u 2 •
E{x1} = 3u4 •
(a) Note that x, = me iff heads shows k
times;
(b) Use {3-27);
(c) P{xso > 6c} == G(6!7) == .82
Use the CLT and Problem 4-32.
Apply the CLT to the RVs In X; and use
Problem 4-32.
Note that J;.(y) = 2yx2<n. y2),
Use (7-97) and Problem 7-22.
=
0.115 em;
(b) n = 16
202 < ., < 204:
(b) 2.235 em
9-3 21.400 < ., < 28,600
9-4 n=l
9-S (a) a = 25161, b = 36/61;
(b) 12.47 ± 0.26
9-7 c = 413 g
9-8 29.77 < 8 < 30.23
9-9 a = 0.8. b = 4
9-10 If w = x - i. then u~ = (I .,. l/20)u2; c ==
20.5
9-12 0.076 < Tl < 0.146
9-13 12.255 < A < 13.245
9-1
(a)
9-2
(a)
c
9-14 .50< p < .54
9-15 (a) 3.2%;
(b) y = .78
9-17 n = 2,500
9-18 p = .567
9-21 -5.17 < .,, - .,, < -2.83
9-22 87.7 < f1 < 92.3.
3.44 <
CT
< 9.13
9-23 0.44 < CT < 0.276
9-25 .308 < r < .512
9-26 Use the identity I[(x; - y;) - (i + y)J2 =
I(x; - i)2 ... I< Y; - yl2 + 2(n - Olin .
9-27 (a) Note that I(a - zb;) 2 ~ 0 for all z;
(b) Set x, - i = a; • .V;- y = b;.
9-28 P{y9 s x.~ < Yu = .5598 == .5581
9-30 Note that
0.5 )
G ( \7,;i4 - G
{2
(. -0.5)
\Tni4 = \j;n
9-31 3 = .I, c = .02; n = 3745
9-32 c = 1.29
9-33 c = 1/(x- xo>
9-36 Use the identities
rx ;~ rx ;~
u;. -
83 e.,.,.
ei
and
10-3
10-5
10-6
x
= ~e. )(!14n. x) dx
{3{8) - P{q > dHt} = P{28q
=
I "•
L i4,
nk 1~1
10-7
10-9
k,
= 24, k2 =
J:
-Br
> 28c·lf1}
x214n, x) dx
10-28 Use the lognormal approximation
t.m(63) = -1.67,
1. 00 ~63) = -2.62.
q = -1.6; accept H0
1.975(40) == 2.
u;;;: = 0.336.
q
1.19 < 2; accept Ho
Test 'I'Jr = 0 against 'I'Jr :/: 0 where
i =-
x2{4)
x2(4n) respectively. Show that
a = P{q < c11/o} = P{28oq < 28od
=
{3 = .32;
> 8.41
., .. = r
10-27 Note that the avs 28x and 28q are
CHAPTER 10
n = 129,
J
82
J(y, 8)
x > 8.58,
I. }': :· = 0.
·"
. lit
(b) /,( 8 ''· ) - r(8, - 8o) e
/y(y, 8) = /y(y. 8o) J(y. )
80
9-44 Show that/( y. z) = c· 2e ,.,,. for 0 < z < y.
9-45 Use Problem 9-38.
(a)
(b)
L P; =
10-25 Reject H 0 if x > 1.384: {3 = .40
10-26 (a) f,(r, 80 ) dr = P{r < r s r + dr H0} =
f<X. 80) d'l/.f,{r, 81) dr = P{r < r s
r + dr•H,} = f<X. 81) dV:
0=
dx =
dx
9-37 (a) I= 118 2 ;
(b) Use Problem 7-24.
9-38 Note that the statistic w = y - az is
2a~·: + a2u~.
unbiased with u~ =
9-43 The function J(y. 8) in (9-95) is known, and
10-1
(a) UCL = 93 U;
(b) p = .79
q = 9.76 < x:~c5) = II: accept H0
q = 2.36 < x:9~C2) = 5.99: yes
q = 17.76 < )(~9~(9) = 16.92: reject /10
10-ZO Note that
a2j a2L
(a1.)2
= aBI f + a8 f
aff1
10-15
10-16
10-17
10-18
(Problem 7-201 for the density of the
product x1 • • • x,.
10-32 Show that 81110 = 80 ,
8,. =
w = 2nl8o - .'i ·- .{In (x/80 )].
10-33 Show that the N avs x) are i.i.d. with joint
x.
density /(X) - exp {-
2
u; = u L' ....!cnk
2
2~ Q} where
Q = Q, ... Qz as in Problem 10-22.
4-t
40; reject H0
nAo = 110, k 1 = 110 - 1.645 v'TIO
= 93 >
q = 90; reject the hypothesis that A = 5
10-10 Under hypothesis H 1 , the av qu~u 2 is
x2(n) and
P{q s ciHt} = P
{;~ q < =~ ciH1}
10-U c = .136 > q = .I; accept H 0
10-13 k 1 = 31 < k = 36 < k2 = 41; accept H 0;
no decision about H 0
10-14 (a) G(2- 0.5Vn) - G(-2 - 0.5Vn) =
{3(30.1)- 0.1; find n. Accept H 0 iff
30 - 0.4/Vn < X < 30 + 0.4/Vn;
(b) I - {3(30.1) = 0.9
CHAPTER 11
11·3
11-6
=
Use (11-26) with w 1 I. w 2
y = z.
(a) Maximize the sum
= x, w3 =
~(X; - a) 2 - ~(YI - {3)2
+
~(li - 'Y)2
subject to the constraint a + {3 +
'Y = 1r and show that
ti-x=/3-.V=.Y-z
_ 1r-
-
c.x- 1- u
3
y.
(b)
Maximize the sum
1 ,...
,
u; ~
a)-
(X; -
I
+
1 ,...
u; ~ (y; -
/W
2 Cz;- y)2
+ u~
subject to the same constraint and
show that
a-x = -,~-y -y-i
--,= --,u:;
u~
u:
_ 1T - <i -r y- zJ
-
11-8
2 ur- (2
a1
-;--= 2a,
ua;
11·9
1) -
IJ.
~
CHAPTER 12
U;.t;
- A - p.x; ..:. 0
for any A and p.. Solve the 11 ~· 2
equations 2a; = A + IJ.X;, ~a; = I.
~a;.\"; = 0 for the 11 - 2 unknowns a;.
A, and p.:
(b) Proceeding similarly, solve the
system 2{:J; = A • JJ-'C;. I.{J; = 0.
I.{J;x; """ I.
(a) Accept 1/0 iff [see (11-52))
lhj < CTiJZJ-n;~
where
2 .
(Th -
(b)
Maximize the sum ~w;l,\'; - (a -r bx;))~.
Using the independence of the RVs y - ax
and x, show that E{y - ax·x} = 0, E{(y ax)~} = Q. and E{y - ax:x} = E{y x} aE{x:x} = E{y x} - ax.
11·16 Note that the RVS y - y and X; are
orthogonal; hence, they are independent,
and E{y - y x; • . . . , x,.} = E{y - y}
= 0.
11-13
11-15
3
Use Lagrange multipliers.
Note that .
1A U;-
(a)
~( .,
'(·
I
-
.'C)2 0'
~
h < O';,lt-o.2(n
12-1   p_1 = p_3 = p_5 = .2,  p_2 = p_4 = p_6 = .4/3
12-2   p_1 = .42, p_2 = .252, p_3 = .151, p_4 = .09, p_5 = .054, p_6 = .034
12-3   f(x) = 0.425 e^x for −1 < x < 1
12-4   f(x) = 0.212 e^{···} for 1 < x < 5
12·5
f(x) =
12-6
12·7
12-8
12-9
Accept //0 iff tsce (11-57)1
ll-10
- 2)
~[Y;- (a +
1
- ,
~(X;- X)·
II-
2
hx;)J2
12-12 f(x. y) =
-
•
rr2 ~
n
(a)
Cov(y, b)=
(b)
.!.
~ (,..' - ....
)~
rr2 ~
.,
= ly
-
~.,; = ~Ef
(a ..-
~ {3;
=0
/(x.
t} = 0
I
12-13 Show that
u~
b.t)J2 ... (' -
~( lj; - 71;)2
y)
17'\/7
w2 }
z2
E { --;
- 2r -zw- - ~
u/Yn
-
y)
2
3 xy + y-')}
2 (. x·, - 2
exp { - 7
Note that
(c)
V27T
fl.\") = 7.e A ~·" ' for x < 1T and /Ct) "' 0
for lx > 1T
P4 "' .25 X • 7~ 4 • 0 :5 k :5 30
Pt =- P2 .:.. .064.
p, = P• = .138,
p~ = p,- .298
(a) H!y) -= //(x):
(b) /ICy) "' 1/{x) + In 3
Show that
/(x.
,
11-12
1.~ e" -~,: 2
E {tn dx. y)} :s E {c(x, y)-
where
0'7, ::. ,..
0'/,
b)2
12·14
12-17
0':0'.,
rr;
= 2( I
H(x, y) = 3.734
(a) H(A) = 0.655:
(b) n1 = 2.79 X J02K. nu == 5.36
- r 2)
X
1047
Index
A
Alternative hypothesis, 243
Analysis of variance, 360-69
ANOVA principle, 361
one-factor tests, 362
two-factor tests, 365
additivity, 366
Asymptotic theorems:
central limit, 214-17
lognormal distribution, 231
DeMoivre-Laplace, 70, 76, 216
entropy, 433
law of large numbers, 74, 219
Poisson, 78
Auxiliary variables, 158, 198
Axioms, 10, 32
empirical interpretation, 10
infinite additivity, 33
B
Bayes' formulas, 50, 171, 174
empirical interpretation, 175
Bayesian theory, 246
controversy, 247
estimation, 171, 287-90
law of succession, 173
448
Bernoulli trials, 64, 70
DeMoivre-Laplace, 70
law of large numbers, 74, 219
rare events, 77
Bertrand paradox, 16
Best estimators, 274, 307
Rao-Cramér bound, 309
Beta distribution, 173
Bias, 2
Binomial distribution, 108, 212
large n, 108
mean, 154, 213
moment function, 154
Boole's inequality, 57
Buffon's needle, 141
c
Cartesian product, 24, 60, 64
Cauchy, density, 107, 164, 167
Cauchy-Schwarz inequality, 319
Centered random variables, 146
Central limit theorem, 214-27
lattice type, 216
products. 231
lognormal distribution, 231
proof, 231
sufficient conditions, 215
Certain event, 7
Chain rule, densities, 201
probabilities, 18
Chapman-KolmogorofT, 201
Characteristic functions, 154 (See
also Moment generating
functions)
Chi distribution. 232
Chi-square distribution, 106, 219-23
degree of freedom, 219
fundamental property, 221
moment function, 220
noncentral. 227
eccentricity, 227
quadratic forms, 221, 227
Chi-square tests, 349
contingency tables, 354
distributions, 357
incomplete null hypothesis, 352
independent events, 159
modified, 352
Circular symmetry, 142
Combinations, 26
Complete statistics, 314
Conditional distribution, 168-77,
200
chain rule, 201
empirical interpretation. 175
mean, 178, 201
Conditional failure rate, 188
Conditional probability, 45
chain rule, 48
empirical interpretation. 45
fundamental property, 48
Confidence, coefficient, 241, 274
interval, 241, 274
level, 274
limits, 274
Consistent estimators. 274
Contingency tables, 354
Convergence, 218
Convolution, 160, 211
theorem, 161, 211
Correlation coefficient, 145
empirical interpretation, 148
sample, 295
Countable, 22
Covariance, 145
empirical interpretation, 149
matrix, 199
nonnegative. 229
sample. 295, 318
Critical region, 244, 322
Cumulant, 166
generating function, 166
Cumulative distribution (St•e
Distributions)
Curve fitting (See Least squares)
D
DeMoivre-Laplace theorem, 70, 72,
76, 216
correction, 74
DeMorgan law, 57
transformations, 112-17, 155
Density, 98, 136, 198
circular symmetry, 142
conditional, 169, 177, 201
empirical interpretation, 100
histogram, 100
marginal, 137, 201
point, 99
mass, 101
transformations, 117-21, 156, 198
auxiliary variable, 158 (See also
Distribution)
Dispersion (See Variance)
Distribution, 88, 136, 198
computer simulation, 269
conditional, 168
Bayes' formulas, 174
empirical interpretation, 96, 175
marginal, 137
model formation, 101
fundamental note, 102
properties, 92
Distributions:
beta, 173
binomial, 108
Cauchy, 107, 164, 167
chi, 232
Erlang, 106
exponential, 106
gamma, 105
geometric, 111
hypergeometric, 111
Laplace, 166
lognormal, 133, 231
Distributions (cont.)
Maxwell. 232
multinomial. 217
normal. 103. 163. 200
Pascal. 132
Poisson. 109
Rayleigh. 156. 167
Snedecor F. 224
Student t. 223
uniform. 105
Weibull. 190
zero-one. 94
E
Eccentricity, 227
Efficient estimator, 310
Elementary event, 7, 30
Elements, 19
Empirical interpretation, 18
axioms, 10
conditional probability, 45
density, I00
distribution, 96
events, 31
failure rate, 189
mean, 122
percentiles, 97
Empty set, 22, 29
Entropy, 248, 414-35
as expected value, 420
four interpretations, 434
maximum, 248, 423-30
properties, 418
of random variables, 420-21
in statistics, 422-30 ·
Equally likely condition, 7, 14
Equally likely events, 38
Erlang distribution, 106
Estimation, 239, 273
Bayesian, 287
correlation coefficient, 295
covariance,295
difference of means, 290
distribution, 298
KolmogorotT estimate, 299
maximum likelihood, 302-6
mean, 275
moments, method of, 301
percentiles, 297
probabilities. 283-90
variance, 293
Estimation-prediction, 317
Estimators:
best, 274
consistent, 274
most efficient, 310
Events. 7. 30
certain. 29
elementary, 30
equally likely, 38
impossible. 29
independent, 52, 56
mutually exclusive. 30
Expected value, 122 (See also
Mean)
linearity, 125, 145
Exponential, distribution, 106
mean, 127
type, 310
F
Failure rate, 188
expected, 189
empirical interpretation, 189
Fisher's statistic, 296
Fourier tr.msform, 154
Fractile, 95
Franklin, J. M., 252
G
Galton's law, 183
Gamma distribution, 105
mean, 153
moment function, 153
moments, 153
variance, 153
Gamma function, 105
mean, 153
Gap test, 257
Gauss-Markoff theorem, 404
Gaussian (See Normal)
Geometric distribution, 111
in gap test, 257
mean, 127
Goodness of fit tests, 348-60 (See
also Chi-square tests)
Pearson's test statistics. 349
computer simulation, 372
H
Hazard rate, 188
Histogram, 100
Hypergeometric distribution, 111
series, 67
Hypothesis testing:
computer simulation. 270
correlation coefficient. 338
distributions, 339
chi-square, 339
Kolmogoroff-Smirnov, 339
sign test, 340
mean, 327
equality of two means. 329
Neyman-Pearson test. 370
Poisson, mean, 335
equality of two means, 337
probability, 332
equality of two probabilities,
333
variance, 337
equality of two variances
I
Iff, 21
Impossible event, 29
Independent, events, 52, 56
empirical interpretation, 52
experiments, 140
random variables, 139. 198
trials, 60
Independent identically distributed
(i.i.d.), 202
Infant mortality, 178
Information, 306, 308
Insufficient reason, principle of, 17
J
Jacobian, 158, 198
K
Kolmogoroff. II
Kolmogoroff-Smirnov test. 339
L
Laplace distribution, 166
Laplace transform, 154
Lattice type, 216
central limit theorem, 216
Law of large numbers, 74, 219
Law of succession, 173
Least squares. 388-411
curve fitting. 391-402
linear, 391
nonlinear. 396
perturbation. 400
prediction, 407-11
linear, 408
nonlinear. 410
orthogonality principle. 409,
411
statistical. 402-7
Gauss-Markoff theorem, 404
maximum likelihood, 403
minimum variance. 404
regression line estimate. 405
weighted, 413
Lehmer, D. H., 252, 253, 254n
Likelihood function, 247, 302
log-likelihood, 303
Likelihood ratio test, 378-82
asymptotic form. 380
Line masses, 138
Linear regression (See Regression)
Linearity, 125, 145
Lognormal distribution, 133
central limit theorem, 231
Loss function, 186
M
Marginal, density, 137, 201
Marginal. distribution, 137
Markoff's inequality, 131
Masses, density, 101
normal RVs, 167
point, 101
probability, 33, 138
Maximum entropy, method of, 248,
423-30
known mean, 423, 428
illustrations, 42S-28
atmospheric pressure, 426
Brandeis Die, 427
partition function, 429
Maximum likelihood, 302-6
asymptotic properties, 306
information, 306
Pearson's test statistic, 3S2
Maxwell distribution, 232
Mean, 122
approximate evaluation, 129, lSI
conditional, 178
empirical interpretation, 122
linearity, 12S, 14S
transformations, 124, 144
sample, 203, 222, 238
Measurements, minimum variance,
204
Median. 96, 186
Memoryless systems, 191
Mendel's theory. 350
Minimum variance estimates, 307-16
complete statistics, 314
measurements, 204
Rao-Cramér bound, 309
sufficiency, 312
Model, 5
formation, 6
specification, 36
from distributions, 101
Moment generating function, 152,
154, 199
convolution theorem, 161, 211
independent RVs, 155, 199
moment theorem, 153, 155
Moments, 151, 154
method of, 301
Monte Carlo method, 251, 267
Buffon's needle, 142
distributions, 269
multinomial, 272
Pearson's test statistic, 271
Most powerful tests, 323
Neyman-Pearson criterion, 370
Multinomial distribution. 217
computer generated, 272
Mutually exclusive, 30
N
Neyman-Pearson, criterion. 370
sufficient statistics. 373
test statistic, 371
exponential type distributions.
373
Noncentral distribution, 227-29
eccentricity, 227
Normal curves, 70
area, 81
Normal distribution, 103, 163, 200,
439 (tub/e)
conditional, 179
moment function, 1S2, 200
moments, lSI, 16S
quadrant masses, 167
regression line, 179
Null hypothesis, 243
Null set, 22
0
Operating characteristic (OC) function, 322
Order statistics, 207-11
extremes, 209
range. 209
Orthogonality, 146
Orthogonality principle, 409
least square, 392
nonlinear, 18S, 411
Rao-Blackwell theorem, 185
Outcomes, 7
empirical interpretation, 31
equally likely, 36
p
Paired samples, 290, 329
Parameter estimation (See Estimation)
Partition, 24
Partition function, 429
Pascal, distribution, 132
Pearson's test statistic, 349
incomplete null hypothesis, 352
Percentile curve, 94
empirical interpretation, 97
Percentiles, 238, 439-442 (tables)
Permutations, 25
Point density, 99
Point estimate. 274
Poisson distribution, 109
mean. 128
moment function. 152
Poisson. points, 79, 110
theorem, 78
Posterior. density. 173. 246. 287
probability, 51
Power of a test, 323
Prediction, 149, 181-86, 407-11
Primitive root, 254
Principle of maximum entropy, 248,
422
Prior. density. 173. 246. 288
probability. 51
Probability. the four interpretations.
9-17
Q
Quadratic forms, chi-square, 106,
219-23
Quality control, 342-48
Quantile, 95
Quetelet curve, 97
R
Random interval, 274
Random numbers, 251-67
computer generation, 258-67
Random points, 79
Random process, 217
Random sums, 206
Random variables (Rvs):
definition, 93
functions of, 112, 144, 198
Random walk. 231
Randomness, 9
tests of. 255
Range. 209
Rao-Blackwell theorem, 185
Rao-Cramcr bound, 309
Rare events, 77
Rutherford experiment, 359
Rayleigh distribution, 156, 167
Regression curve. 179, 202 (See
also Prediction)
Galton's law. 183
Regularity. 9
Rejection method, 261
Relative frequency. 4 (See also
Empirical interpretation)
Reliability. 186-94
conditional failure rclte, 188
state variable, 193
structure functions. 193
Repeated trials. 59-64 (See also
Bernoulli trials)
dual meaning, 59
Risk, 186
S
Sample, random variable
correlation coefficient, 295
mean, 203. 222, 238
observed, 239
variance, 222
Sampling, 202
paired, 290, 329
Schwarz's inequality, 147. 319
Sequences of random variables, 217
Sequential hypothesis testing, 374-78
Sign test, 340
Significance level, 322
Snedecor F distribution, 224
noncentral, 229
percentiles, 225, 442 (tables)
Spectral test, 257
Standard deviation, 126
Statistic. 274
complete. 314
sufficient, 312
test, 323
Step function U(t), 102
Structure function, 193
Student t distribution, 223
noncentral, 229
percentiles, 225, 441 (tab/e)
Sufficient statistic, 312
System reliability, 186-94
T
Tables, 437-42
Tchebycheff's inequality, 130
in estimation, 203, 278
Markoff's inequality, 131
Test statistic, 323
Time-to-failure, 187
Total probability, 49, 170
Transformations of random variables, 112, 144, 198
Tree, probability, 63
Trials, 7
repeated, 59-64
Typical sequences, 249, 430-34
u
Uncorrelatedness, 146
Uniform distribution, 105
variance, 127
v
Variance, 125
approximate evaluation, 165
empirical interpretation, 149
sample, 222
Venn diagram, 20
Von Mises, 12, 253
w
Weibull distribution, 190
z
Zero-one random variable, 94
mean, 127