Probability & Statistics

About the Author

Athanasios Papoulis was educated at the Polytechnic University of Athens and at the University of Pennsylvania. He started teaching in 1948 at the University of Pennsylvania, and in 1952 he joined the faculty of the then Polytechnic Institute of Brooklyn. He has also taught at Union College, U.C.L.A., Stanford, and at the TH Darmstadt in Germany.

A major component of his work is academic research. He has consulted with many companies including Burroughs, United Technologies, and IBM, and published extensively in engineering and mathematics, concentrating on fundamental concepts of general interest. In recognition of his contributions, he received the distinguished alumnus award from the University of Pennsylvania in 1973, and, recently, the Humboldt award given to American scientists for internationally recognized achievements.

Professor Papoulis is primarily an educator. He has taught thousands of students and lectured in hundreds of schools. In his teaching, he stresses clarity, simplicity, and economy. His approach, reflected in his articles and books, has been received favorably throughout the world. All of his books have international editions and translations. In Japan alone, six of his major texts have been translated. His book Probability, Random Variables, and Stochastic Processes has been the standard text for a quarter of a century. In 1980, it was chosen by the Institute of Scientific Information as a citation classic.

Every year, the IEEE, an international organization of electrical engineers, selects one of its members as the outstanding educator. In 1984, this prestigious award was given to Athanasios Papoulis with the following citation: For inspirational leadership in teaching through thought-provoking lectures, research, and creative textbooks.

PROBABILITY & STATISTICS

Athanasios Papoulis
Polytechnic University

Prentice-Hall International, Inc.

This edition may be sold only in those countries to which it is consigned by Prentice-Hall International. It is not to be re-exported and it is not for sale in the U.S.A., Mexico, or Canada.

© 1990 by Prentice-Hall, Inc., A Division of Simon & Schuster, Englewood Cliffs, NJ 07632. All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

ISBN 0-13-711730-2
Contents

Preface

PART ONE  PROBABILITY

1  The Meaning of Probability
   1-1  Introduction
   1-2  The Four Interpretations of Probability

2  Fundamental Concepts
   2-1  Set Theory
   2-2  Probability Space
   2-3  Conditional Probability and Independence
   Problems

3  Repeated Trials
   3-1  Dual Meaning of Repeated Trials
   3-2  Bernoulli Trials
   3-3  Asymptotic Theorems
   3-4  Rare Events and Poisson Points
   Appendix: Area Under the Normal Curve
   Problems

4  The Random Variable
   4-1  Introduction
   4-2  The Distribution Function
   4-3  Illustrations
   4-4  Functions of One Random Variable
   4-5  Mean and Variance
   Problems

5  Two Random Variables
   5-1  The Joint Distribution Function
   5-2  Mean, Correlation, Moments
   5-3  Functions of Two Random Variables
   Problems

6  Conditional Distributions, Regression, Reliability
   6-1  Conditional Distributions
   6-2  Bayes' Formulas
   6-3  Nonlinear Regression and Prediction
   6-4  System Reliability
   Problems

7  Sequences of Random Variables
   7-1  General Concepts
   7-2  Applications
   7-3  Central Limit Theorem
   7-4  Special Distributions of Statistics
   Appendix: Chi-Square Quadratic Forms
   Problems

PART TWO  STATISTICS

8  The Meaning of Statistics
   8-1  Introduction
   8-2  The Major Areas of Statistics
   8-3  Random Numbers and Computer Simulation

9  Estimation
   9-1  General Concepts
   9-2  Expected Values
   9-3  Variance and Correlation
   9-4  Percentiles and Distributions
   9-5  Moments and Maximum Likelihood
   9-6  Best Estimators and the Rao-Cramer Bound
   Problems

10  Hypothesis Testing
   10-1  General Concepts
   10-2  Basic Applications
   10-3  Quality Control
   10-4  Goodness-of-Fit Testing
   10-5  Analysis of Variance
   10-6  Neyman-Pearson, Sequential, and Likelihood Ratio Tests
   Problems

11  The Method of Least Squares
   11-1  Introduction
   11-2  Deterministic Interpretation
   11-3  Statistical Interpretation
   11-4  Prediction
   Problems

12  Entropy
   12-1  Entropy of Partitions and Random Variables
   12-2  Maximum Entropy and Statistics
   12-3  Typical Sequences and Relative Frequency
   Problems

Tables
Answers and Hints for Selected Problems
Index

Preface

Probability is a difficult subject. A major reason is uncertainty about its meaning and skepticism about its value in the solution of real problems. Unlike other scientific disciplines, probability is associated with randomness, chance, even ignorance, and its results are interpreted not as objective scientific facts but as subjective expressions of our state of knowledge. In this book, I attempt to convince the skeptical reader that probability is no different from any other scientific theory: All concepts are precisely defined within an abstract model, and all results follow logically from the axioms. It is true that the practical consequences of the theory are only inductive inferences that cannot be accepted as logical certainties; however, this is characteristic not only of statistical statements but of all scientific conclusions.

The subject is developed as a mathematical discipline; however, mathematical subtleties are avoided, and proofs of difficult theorems are merely sketched or, in some cases, omitted. The applications are selected not only because of their practical value
but also because they contribute to the mastery of the theory. The book concentrates on basic topics. It also includes a simplified treatment of a number of advanced ideas.

In the preparation of the manuscript, I made a special effort to clarify the meaning of all concepts, to simplify the derivations of most results, and to unify apparently unrelated concepts. For this purpose, I reexamined the conventional approach to each topic, departing in many cases from traditional methods and interpretations. A few illustrations follow: In the first chapter, the various definitions of probability are analyzed, and the need for a clear distinction between concepts and reality is stressed. These ideas are used in Chapter 8 to explain the difference between probability and statistics, to clarify the controversy surrounding Bayesian statistics, and to develop the dual meaning of random numbers. In Chapter 11, a comprehensive treatment of the method of least squares is presented, showing the connection between deterministic curve fitting, parameter estimation, and prediction. The last chapter is devoted to entropy, a topic rarely discussed in books on statistics. This important concept is defined as a number associated with a partition of a probability space and is used to solve a number of ill-posed problems in statistical estimation. The empirical interpretation of entropy and the rationale for the method of maximum entropy are related to repeated trials and typical sequences.

The book is written primarily for upper division students of science and engineering. The first part is suitable for a one-semester junior course in probability. No prior knowledge of probability is required. All concepts are developed slowly from first principles, and they are illustrated with many examples. The first three chapters involve mostly only high school mathematics; however, a certain mathematical maturity is assumed. The level of sophistication increases in subsequent chapters. Parts I and II can be covered in a two-semester senior/graduate course in probability and statistics.

This work is based on notes written during my stay in Germany as a recipient of the Humboldt award. I wish to express my appreciation to the Alexander von Humboldt Foundation and to my hosts Dr. Eberhard Hansler and Dr. Peter Hagedorn of the TH Darmstadt for giving me the opportunity to develop these notes in an ideal environment.

Athanasios Papoulis

PART ONE  PROBABILITY

1  The Meaning of Probability

Most scientific concepts have a precise meaning corresponding, more or less exactly, to physical quantities. In contrast, probability is often viewed as a vague concept associated with randomness, uncertainty, or even ignorance. This is a misconception that must be overcome in any serious study of the subject. In this chapter, we argue that the theory of probability, like any other scientific discipline, is an exact science, and all its conclusions follow logically from basic principles. The theoretical results must, of course, correspond in a reasonable sense to the real world; however, a clear distinction must always be made between theoretical results and empirical statements.

1-1  Introduction

The theory of probability deals mainly with averages of mass phenomena occurring sequentially or simultaneously: games of chance, polling, insurance, heredity, quality control, statistical mechanics, queuing theory, noise.
It has been observed that in these and other fields, certain averages approach a constant value as the number of observations increases, and this value remains the same if the averages are evaluated over any subsequence selected prior to the observations. In a coin experiment, for example, the ratio of heads to tosses approaches 0.5 or some other constant, and the same ratio is obtained if one considers, say, every fourth toss. The purpose of the theory is to describe and predict such averages in terms of probabilities of events.

The probability of an event A is a number P(A) assigned to A. This number is central in the theory and applications of probability; its significance is the main topic of this chapter. As a measure of averages, P(A) is interpreted as follows: If an experiment is performed n times and the event A occurs n_A times, then, almost certainly, the relative frequency n_A/n of the occurrence of A is close to P(A):

    P(A) ≈ n_A / n    (1-1)

provided that n is sufficiently large. This will be called the empirical or relative frequency interpretation of probability. Equation (1-1) is only a heuristic relationship because the terms almost certainly, close, and sufficiently large have no precise meaning. The relative frequency interpretation cannot therefore be used to define P(A) as a theoretical concept. It can, however, be used to estimate P(A) in terms of the observed n_A and to predict n_A if P(A) is known. For example, if 1,000 voters are polled and 451 respond Republican, then the probability P(A) that a voter is Republican is about .45. With P(A) so estimated, we predict that in the next election, 45% of the people will vote Republican.

The relative frequency interpretation of probability is objective in the sense that it can be tested experimentally. Suppose, for example, that we wish to test whether a coin is fair, that is, whether the probability of heads equals .5. To do so, we toss it 1,000 times. If the number of heads is "about" 500, we conclude that the coin is indeed fair (the precise meaning of this conclusion will be clarified later).

Probability also has another interpretation. It is used as a measure of our state of knowledge or belief that something is or is not true. For example, based on evidence presented, we conclude with probability .6 that a defendant is guilty. This interpretation is subjective. Another juror, having access to the same evidence, might conclude with probability .95 (beyond any reasonable doubt) that the defendant is guilty.

We note, finally, that in applications involving predictions, both interpretations might be relevant. Consider the following weather forecast: "The probability that it will rain tomorrow in New York is .6." In this forecast, the number .6 is derived from past records, and it expresses the relative frequency of rain in New York under similar conditions. This number, however, has no relevance to tomorrow's weather. Tomorrow it will either rain or not rain. The forecast expresses merely the state of the forecaster's knowledge, and it helps us decide whether we should carry an umbrella.
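The relative frequency interpretation (1-1) is easy to explore numerically. The short Python sketch below is not part of the original text; the fairness value p = 0.5, the fixed seed, and the sample sizes are illustrative assumptions. It simulates n tosses of a coin and prints the ratio n_h/n for increasing n.

    import random

    def relative_frequency(p=0.5, sizes=(100, 1_000, 10_000, 100_000)):
        """Simulate coin tosses and report n_h / n for several sample sizes."""
        random.seed(1)  # fixed seed so the run is reproducible
        for n in sizes:
            n_h = sum(1 for _ in range(n) if random.random() < p)  # count heads
            print(f"n = {n:>7}: n_h/n = {n_h / n:.4f}")

    relative_frequency()

The printed ratios fluctuate but settle near .5 as n grows, which is the behavior that (1-1) summarizes; nothing in the simulation, of course, says how large n must be for a given accuracy.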
Concepts and Reality

Students are often skeptical about the validity of probabilistic statements. They have been taught that the universe evolves according to physical laws that specify exactly its future (determinism) and that probabilistic descriptions are used only for "random" or "chance" phenomena, the initial conditions of which are unknown. This deep-rooted skepticism about the "truth" of probabilistic results can be overcome only by a proper interpretation of the meaning of probability. We shall attempt to show that, like any other scientific discipline, probability is an exact science and that all conclusions follow logically from the axioms. It is, of course, true that the correspondence between theoretical results and the real world is imprecise; however, this is characteristic not only of probabilistic conclusions but of all scientific statements.

In a probabilistic investigation the following steps must be clearly distinguished (Fig. 1.1).

Step 1 (Physical). We determine, by a process that is not and cannot be made exact, the probabilities P(A_i) of various physical events A_i.

Step 2 (Conceptual). We assume that the numbers P(A_i) satisfy certain axioms, and by deductive logic we determine the probabilities P(B_i) of other events B_i.

Step 3 (Physical). We make physical predictions concerning the events B_i based on the numbers P(B_i) so obtained.

In steps 1 and 3 we deal with the real world, and all statements are inexact. In step 2 we replace the real world with an abstract model, that is, with a mental construct in which all concepts are precise and all conclusions follow from the axioms by deductive logic. In the context of the resulting theory, the probability of an event A is a number P(A) satisfying the axioms; its "physical meaning" is not an issue.

Figure 1.1  Step 1 (Physical): Observation -> Step 2 (Conceptual): Deduction (Model) -> Step 3 (Physical): Prediction.

We should stress that model formation is used not only in the study of "random" phenomena but in all scientific investigations. The resulting theories are, of course, of no value unless they help us solve real problems. We must assign specific, if only approximate, numerical values to physical quantities, and we must give physical meaning to all theoretical conclusions. The link, however, between theory (model) and applications (reality) is always inexact and must be separated from the purely deductive development of the theory. Let us examine two illustrations of model formation from other fields.

Geometry  Points and lines, as interpreted in theoretical geometry, are not real objects. They are mental constructs having, by assumption, certain properties specified in terms of the axioms. The axioms are chosen to correspond in some sense to the properties of real points and lines. For example, the axiom "one and only one line passes through two points" is in reasonable agreement with our perception of the corresponding property of a real line. Starting from the axioms, we derive, by pure reasoning, other properties that we call theorems. The theorems are then used to draw various useful conclusions about real geometric objects. For example, we prove that the sum of the angles of a conceptual triangle equals 180°, and we use this theorem to conclude that the sum of the angles of a real triangle equals approximately 180°.

Circuit Theory  In circuit theory, a resistor is by definition a two-terminal device with the property that its voltage V is proportional to the current I. The proportionality constant

    R = V / I    (1-2)

is the resistance of the device. This is, of course, only an idealized model of a real resistor, and (1-2) is an axiom (Ohm's law).
A real resistor is a complicated device without clearly defined terminals, and a relationship of the form in (1-2) can be claimed only as a convenient idealization valid with a variety of qualifications and subject to unknown "errors." Nevertheless, in the development of the theory, all these uncertainties are ignored. A real resistor is replaced by a mental concept, and a theory is developed based on (1-2). It would not be useful, we must agree, if at each stage of the theoretical development we were concerned with the "true" meaning of R.

Returning to probability, we note that, in the context of an abstract model (step 2), the probability P(A) is interpreted as a number that satisfies various axioms but is otherwise arbitrary. In the applications of the theory to real problems, however (steps 1 and 3), the number P(A) must be given a physical interpretation. We shall establish the link between model and reality using three interpretations of probability: relative frequency, classical, and subjective. We introduce here the first two in the context of the die experiment.

Example 1.1  We wish to find the probability of the event A = {even} in the single-die experiment. In the relative frequency interpretation, we rely on (1-1): We roll the die n times, and we set P(A) ≈ n_A/n, where n_A is the number of times the event A occurs. This interpretation can be used for any die, fair or not.

In the classical interpretation, we assume that the six faces of the die are equally likely; that is, they have the same probability of showing (this is the "fair die" assumption). Since the event {even} occurs if one of the three outcomes f2, f4, or f6 shows, we conclude that P{even} = 3/6. This conclusion seems logical and is generally used in most games of chance. As we shall see, however, the equally likely condition on which it is based is not a simple consequence of the fact that the die is symmetrical. It is accepted because, in the long history of rolling dice, it was observed that the relative frequency of each face equals 1/6.

Illustrations  We give next several examples of simple experiments, starting with a brief explanation of the empirical meaning of the terms trials, outcomes, and events.

A trial is the single performance of an experiment. Experimental outcomes are various observable characteristics that are of interest in the performance of the experiment. An event is a set (collection) of outcomes. The certain event is the set S consisting of all outcomes. An elementary event is an event consisting of a single outcome. At a single trial, we observe one and only one outcome. An event occurs at a trial if it contains the observed outcome. The certain event occurs at every trial because it contains every outcome.

Consider, for example, the die experiment. The outcomes of this experiment are the six faces f1, . . . , f6; the event {even} consists of the three outcomes f2, f4, and f6; the certain event S consists of all six outcomes. A trial is a single roll of the die. Suppose that at a particular trial, f2 shows. In this case we observe the outcome f2; however, many events occur, namely, the certain event, the event {even}, the elementary event {f2}, and 29 other events!
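The count of 29 "other" events follows because a six-outcome space has 2^6 = 64 events, and a fixed outcome such as f2 belongs to 2^5 = 32 of them (the certain event, {even}, {f2}, and 29 more). A small Python enumeration, added here purely for illustration, confirms the count:

    from itertools import combinations

    faces = ["f1", "f2", "f3", "f4", "f5", "f6"]

    # All events = all subsets of the space, including the impossible event.
    events = [set(c) for r in range(len(faces) + 1) for c in combinations(faces, r)]
    print(len(events))                             # 64 events in total

    occurring = [e for e in events if "f2" in e]   # events that occur when f2 shows
    print(len(occurring))                          # 32: S, {even}, {f2}, and 29 others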
Example 1.2  The coin experiment has two outcomes: heads (h) and tails (t). The event heads = {h} consists of the single outcome h. To find the probability P{h} of heads, we toss the coin n = 1,000 times, and we observe that heads shows n_h = 508 times. From this we conclude that P{h} ≈ .51 (step 1). This leads to the expectation that in future trials, about 51% of the tosses will show heads (step 3). One might argue that this is only an approximation: Since a coin is symmetrical (the "equally likely" assumption), the probability of heads is .5; had we therefore kept tossing, the ratio n_h/n would have approached .5. This is generally true; however, it is based on our long experience with coins, and it holds only for the limited class of fair coins.

Example 1.3  We wish to find the probability that a newborn child is a girl. In this experiment, we have again two outcomes: boy (b) and girl (g). We observe that among 1,000 recently born children, 489 are girls. From this we conclude that P{g} ≈ .49. This leads to the expectation that about 49% of the children born under similar circumstances will be girls. Here again, there are only two outcomes, and we have no reason to believe that they are not equally likely. We should expect, therefore, that the correct value for P{g} is .5. However, as extensive records show, this expectation is not necessarily correct.

Example 1.4  A poll taken for the purpose of determining Republican (r) or Democratic (d) party affiliation specifies an experiment consisting of the two outcomes r and d. A trial is the polling of one person. It was found that among 1,000 voters questioned, 382 were Republican. From this it follows that the probability P{r} that a voter is Republican is about .38, and it leads to the expectation that in the next election, about 38% of the people will vote Republican. In this case, it is obvious that the equally likely condition cannot be used to determine P{r}.

Example 1.5  Daily highway accidents specify an experiment. A trial of this experiment is the determination of the accidents in a day. An outcome is the number k of accidents. In principle, k can take any value; hence, the experiment has infinitely many outcomes, namely all integers from 0 to infinity. The event A = {k = 3} consists of the single outcome k = 3. The event B = {k ≤ 3} consists of the four outcomes k = 0, 1, 2, and 3. We kept a record of the number of accidents in 1,000 days. Here are the numbers n_k of days on which k accidents occurred:

    k     0    1    2    3    4    5    6    7    8    9   10   >10
    n_k  13   80  144  200  194  155  120   75   14    4    1     0

From the table and (1-1) it follows, with n_A = n_3 = 200, n_B = n_0 + n_1 + n_2 + n_3 = 437, and n = 1,000, that

    P(A) = P{k = 3} ≈ .2        P(B) = P{k ≤ 3} ≈ .437
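The two estimates can be reproduced directly from the record. The snippet below is illustrative only, with the counts of Example 1.5 keyed in by hand; it simply applies (1-1) to the tabulated frequencies.

    # Days on which k = 0, 1, ..., 10 accidents occurred (the empty >10 bin is omitted).
    n_k = [13, 80, 144, 200, 194, 155, 120, 75, 14, 4, 1]
    n = sum(n_k)                       # 1,000 days in total

    p_A = n_k[3] / n                   # relative frequency of {k = 3}
    p_B = sum(n_k[:4]) / n             # relative frequency of {k <= 3}
    print(n, p_A, p_B)                 # 1000 0.2 0.437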
Example 1.6  We monitor all telephone calls originating from a station between 9:00 and 10:00 A.M. We thus have an experiment, the outcomes of which are all time instances between 9:00 and 10:00. A single trial is a particular call, and an outcome is the time of the call. The experiment therefore has infinitely many outcomes. We observe that among the last 1,000 calls, 248 occurred between 9:00 and 9:15. From this we conclude that the probability of the event A = {the call occurs between 9:00 and 9:15} equals P(A) ≈ .25. We expect, therefore, that among all future calls occurring between 9:00 and 10:00 A.M., 25% will occur between 9:00 and 9:15.

Example 1.7  The age t at death of a person specifies an experiment with infinitely many outcomes. We wish to find the probability of the event A = {death occurs before 60}. To do so, we record the ages at death of 1,000 persons, and we observe that 682 of them are less than 60. From this we conclude that P(A) ≈ 682/1000 = .68. We should expect, therefore, that 68% of future deaths will occur before the age of 60.

Regularity and Randomness  We note that to make predictions about future averages based on past averages, we must impose the following restrictions on the underlying experiment.

A. Its trials must be performed under "essentially similar" conditions (regularity).
B. The ratio n_A/n must be "essentially" the same for any subsequence of trials selected prior to the observation (randomness).

These conditions are heuristic and cannot easily be tested. As we illustrate next, the difficulties vary from experiment to experiment. In the coin experiment, both conditions can be readily accepted. In the birth experiment, A is acceptable but B might be questioned: If we consider only the subsequence of births of twins where the firstborn is a boy, we might find a different average. In the polling experiment, both conditions might be challenged: If the voting does not take place soon after the polling, the voters might change their preference. If the polled voters are not "typical," for example, if they are taken from an affluent community, the averages might change.

1-2  The Four Interpretations of Probability

The term probability has four interpretations:

1. Axiomatic definition (model concept)
2. Relative frequency (empirical)
3. Classical (equally likely)
4. Subjective (measure of belief)

In this book, we shall use only the axiomatic definition as the basis of the theory (step 2). The other three interpretations will be used in the determination of probabilistic data of real experiments (step 1) and in the applications of the theoretical results to real experiments (step 3). We should note that the last three interpretations have also been used as definitions in the theoretical development of probability; as we shall see, such definitions can be challenged.

Axiomatic  In the axiomatic development of the theory of probability, we start with a probability space. This is a set S of abstract objects (elements) called outcomes. The set S and its subsets are called events. The probability of an event A is by definition a number P(A) assigned to A. This number satisfies the following three axioms but is otherwise arbitrary.

I. P(A) is a nonnegative number:

    P(A) ≥ 0    (1-3)

II. The probability of the event S (certain event) equals 1:

    P(S) = 1    (1-4)

III. If two events A and B have no common elements, the probability of the event A ∪ B, consisting of the outcomes that are in A or in B, equals the sum of their probabilities:

    P(A ∪ B) = P(A) + P(B)    (1-5)
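To make the axioms concrete for a finite space, here is a minimal Python sketch; it is not from the text, and the loaded-die weights are invented for illustration. Probabilities are assigned to the individual outcomes, an event's probability is the sum over its outcomes, and the three axioms then hold by construction.

    # A loaded die: nonnegative weights that sum to 1 (axioms I and II).
    p = {"f1": 0.10, "f2": 0.15, "f3": 0.15, "f4": 0.20, "f5": 0.193, "f6": 0.207}
    assert all(v >= 0 for v in p.values()) and abs(sum(p.values()) - 1) < 1e-12

    def prob(event):
        """P(event) = sum of the probabilities of its outcomes."""
        return sum(p[outcome] for outcome in event)

    even, five = {"f2", "f4", "f6"}, {"f5"}
    # Axiom III: the events are disjoint, so P(even ∪ {f5}) = P(even) + P({f5}).
    print(prob(even | five), prob(even) + prob(five))   # both 0.75, up to rounding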
The resulting theory is useful in the determination of averages of mass phenomena only if the axioms are consistent with the relative frequency interpretation of probability, equation (1-1). This means that if in (1-3), (1-4), and (1-5) we replace all probabilities by the corresponding ratios n_A/n, the resulting equations remain approximately true. We maintain that they do. Clearly, n_A ≥ 0; furthermore, n_S = n because the certain event occurs at every trial. Hence,

    P(A) ≈ n_A / n ≥ 0        P(S) ≈ n_S / n = 1

in agreement with axioms I and II. To show the consistency of axiom III with (1-1), we observe that if the events A and B have no common elements and at a specific trial the event A occurs, the event B does not occur. And since the event A ∪ B occurs when either A or B occurs, we conclude that n_{A∪B} = n_A + n_B. Hence,

    P(A ∪ B) ≈ n_{A∪B} / n = (n_A + n_B) / n ≈ P(A) + P(B)

in agreement with (1-5).

Model Formation  We comment next on the connection between an abstract space S (model) and the underlying real experiment. The first step in model formation is the correspondence between elements of S and experimental outcomes. In Section 1-1 we assumed routinely that the outcomes of an experiment are readily identified. This, however, is not always the case. The actual outcomes of a real experiment can involve a large number of observable characteristics. In the formation of the model, we select from all these characteristics the ones that are of interest in our investigation. We demonstrate with two examples.

Example 1.8  Consider the possible models of the die experiment as interpreted by the three players X, Y, and Z. X says that the outcomes of this experiment are the six faces of the die, forming the space S = {f1, . . . , f6}. In this space, the event {even} consists of the three outcomes f2, f4, and f6. Y wants to bet on even or odd only. He argues, therefore, that the experiment has only the two outcomes even and odd, forming the space S = {even, odd}. In this space, {even} is an elementary event consisting of the single outcome even. Z bets that the die will rest on the left side of the table and f1 will show. He maintains, therefore, that the experiment has infinitely many outcomes consisting of the six faces of the die and the coordinates of its center. The event {even} consists not of one or of three outcomes but of infinitely many.

Example 1.9  In a polling experiment, a trial is the selection of a person. The person might be Republican or Democrat, male or female, black or white, smoker or nonsmoker, and so on. Thus the observable outcomes are a myriad of characteristics. In Example 1.4 we considered as outcomes the characteristics "Republican" and "Democrat" because we were interested only in party affiliation. We would have four outcomes if we considered also the sex of the selected persons, eight outcomes if we included their color, and so on.

Thus the outcomes of a probabilistic model are precisely defined objects corresponding not to the myriad of observable characteristics of the underlying real experiment but only to those characteristics that are of interest in the investigation.

Note  The axiomatic approach to probability is relatively recent* (Kolmogoroff, 1933); however, the axioms and the formal results had been used earlier. Kolmogoroff's contribution is the interpretation of probability as an abstract concept and the development of the theory as a precise mathematical discipline based on measure theory.

* A. Kolmogoroff, "Grundbegriffe der Wahrscheinlichkeitsrechnung," Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol. 2, 1933.

Relative Frequency  The relative frequency interpretation (1-1) of probability states that if in n trials an event A occurs n_A times, its probability P(A) is approximately n_A/n:

    P(A) ≈ n_A / n    (1-6)

provided that n is sufficiently large and the ratio n_A/n is nearly constant as n increases. This interpretation is fundamental in the study of averages, establishing the link between the model parameter P(A), however it is defined, and the empirical ratio n_A/n. In our investigation, we shall use (1-6) to assign probabilities to the events of real experiments.
As a reminder of the connection between concepts and reality, we shall give a relative frequency interpretation of various axioms, definitions, and theorems based on (1-6).

The relative frequency cannot be used as the definition of P(A) because (1-6) is an approximation. The approximation improves, however, as n increases. One might wonder, therefore, whether we can define P(A) as a limit:

    P(A) = lim (n → ∞) n_A / n    (1-7)

We cannot, of course, do so if n and n_A are experimentally determined numbers because in any real experiment the number n of trials, although it might be large, is always finite. To give meaning to the limit, we must interpret (1-7) as an assumption used to define P(A) as a theoretical concept. This approach was introduced by Von Mises* early in the century as the foundation of a new theory based on (1-7). At that time the prevailing point of view was still the classical, and his work offered a welcome alternative to the concept of probability defined independently of any observation. It removed from this concept its metaphysical implications, and it demonstrated that the classical definition works in real problems only because it makes implicit use of relative frequencies based on our long experience. However, the use of (1-7) as the basis for a deductive theory has not enjoyed wide acceptance. It has generally been recognized that Kolmogoroff's approach is superior.

* Richard Von Mises, Probability, Statistics, and Truth, English edition, H. Geiringer, ed. (London: G. Allen and Unwin Ltd., 1957).

Classical  Until recently, probability was defined in terms of the classical interpretation. As we shall see, this definition is restrictive and cannot form the basis of a deductive theory. It is, however, important in assigning probabilities to the events of experiments that exhibit geometric or other symmetries. The classical definition states that if an experiment consists of N outcomes and N_A of these outcomes are "favorable" to an event A (i.e., they are elements of A), then

    P(A) = N_A / N    (1-8)

In words, the probability of an event A equals the ratio of the number of outcomes N_A favorable to A to the total number N of outcomes. This definition, as it stands, is ambiguous because, as we have noted, the outcomes of an experiment can be interpreted in several ways. We shall demonstrate the ambiguity and the need for improving the definition with an example.

Example 1.10  We roll two dice and wish to find the probability p that the sum of the faces that show equals 7. We shall analyze this problem in terms of the following models.

(a) We consider as experimental outcomes the N = 11 possible sums 2, 3, . . . , 12. Of these, only the outcome 7 is favorable to the event A = {7}; hence N_A = 1. If we use (1-8) to determine p, we must conclude that p = 1/11.

(b) We count as outcomes the N = 21 pairs 1-1, 1-2, 1-3, . . . , 6-6, not distinguishing between the first and the second die. The favorable outcomes are now N_A = 3, namely the pairs 1-6, 2-5, and 3-4. Again using (1-8), we must conclude that p = 3/21.

(c) We count as outcomes the N = 36 pairs distinguishing between the first and the second die. The favorable outcomes are now the N_A = 6 pairs 1-6, 6-1, 2-5, 5-2, 3-4, 4-3, and (1-8) yields p = 6/36.

We thus have three different solutions for the same problem. Which is correct? One might argue that the third is correct because the "true" number of outcomes is 36. Actually, all three models can be used to describe the experiment. The third leads to the correct solution because its outcomes are "equally likely." For the other two models, we cannot determine p from (1-8).
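A brute-force enumeration, given here as an illustrative sketch rather than as part of the text, makes the same point: only the ordered-pair model of (c) assigns equal weight to alternatives whose observed relative frequencies are in fact equal, and it gives p = 6/36 ≈ .167.

    from itertools import product

    ordered = list(product(range(1, 7), repeat=2))           # model (c): 36 ordered pairs
    favorable = [pair for pair in ordered if sum(pair) == 7]
    print(len(ordered), len(favorable), len(favorable) / len(ordered))   # 36 6 0.1666...

    # Models (a) and (b) count alternatives that are not equally likely:
    sums = {a + b for a, b in ordered}                        # 11 possible sums
    unordered = {tuple(sorted(pair)) for pair in ordered}     # 21 unordered pairs
    print(len(sums), len(unordered))                          # 11 21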
Example 1.10 leads to the following refinement of (1-8): The probability of an event A equals the ratio of the number of outcomes N_A favorable to A to the total number N of outcomes, provided that all outcomes are equally likely. As we shall see, this refinement does not eliminate the problems inherent in the classical definition. We comment next on the various objections to the classical definition as the foundation of a precise theory and on its value in the determination of probabilistic data and as a working hypothesis.

Note  Be aware of the difference between the numbers n and n_A in (1-1) and the numbers N and N_A in (1-8). In the former, n is the total number of trials (repetitions) of the experiment and n_A is the number of successes of the event A. In the latter, N is the total number of outcomes of the experiment and N_A is the number of outcomes that are favorable to A (are elements of A).

CRITICISMS

1. The term equally likely used in the refined version of (1-8) can mean only that the outcomes are equally probable. No other interpretation consistent with the equation is possible. Thus the definition is circular: the concept to be defined is used in the definition. This often leads to ambiguities about the correct choice of N and, in fact, about the validity of (1-8).

2. It appears that (1-8) is a logical necessity that does not depend on experience: "A die is symmetrical; therefore, the probability that 5 will show equals 1/6." However, this is not so. We accept certain alternatives as equally likely because of our collective experience. The probability that 5 will show equals 1/6 not only because the die exhibits geometric symmetries but also because it was observed in the long history of rolling dice that 5 showed in about 1/6 of the trials. In the next example, the equally likely condition appears logical but is not in agreement with observation.

Example 1.11  We wish to find the probability that a newborn baby is a girl. It is generally assumed that p = 1/2 because the outcomes boy and girl are "obviously" equally likely. However, this conclusion cannot be reached as a logical necessity unrelated to observation. In the first place, it is only an approximation. Furthermore, without access to long records, we would not know that the boy-girl alternatives are equally likely regardless of the sex history of the baby's family, the season or place of its birth, or other factors. It is only after long accumulation of records that such factors become irrelevant and the two alternatives are accepted as approximately equally likely.

3. The classical definition can be used only in a limited class of problems. In the die experiment, for example, (1-8) holds only if the die is fair, that is, if its six outcomes have the same probability. If it is loaded and the probability of 5 equals, say, .193, there is no direct way of deriving this probability from the equation. The problem is more difficult in applications involving infinitely many outcomes. In such cases, we introduce as a measure of the number of outcomes length, area, or volume. This makes reliance on (1-8) questionable, and, as the following classic example suggests, it leads to ambiguous solutions.
Example 1.12  Given a circle C of radius r, we select "at random" a chord AB. Find the probability p of the event A = {the length l of the chord is larger than the length r√3 of the side of the inscribed equilateral triangle}. We shall show that this problem can be given at least three reasonable solutions.

First Solution  The center M of the chord can be any point in the interior of the circle C. The point is favorable to the event A if it is in the interior of the circle C1 of radius r/2 (Fig. 1.2a). Thus in this interpretation of "randomness," the experiment consists of all points in the circle C. Using the area of a region as a measure of the points in that region, we conclude that the measure of the total number of outcomes is the area πr² of the circle C and the measure of the outcomes favorable to the event A equals the area πr²/4 of the circle C1. This yields

    p = (πr²/4) / (πr²) = 1/4

Second Solution  We now assume that the end A of the chord AB is fixed. This reduces the number of possibilities but has no effect on the value of p because the number of favorable outcomes is reduced proportionately. We can thus consider as experimental outcomes all points on the circumference of the circle. Since l > r√3 if B is on the 120° arc DBE of Fig. 1.2b, the outcomes favorable to A are all points on that arc. Using the length of the arcs as a measure of the outcomes, we conclude that the measure of the total number of outcomes is the length 2πr of the circle and the measure of the favorable outcomes is the length 2πr/3 of the arc DBE. Hence,

    p = (2πr/3) / (2πr) = 1/3

Third Solution  We assume that the direction of AB is perpendicular to the line FK of Fig. 1.2c. As before, this assumption has no effect on the value of p. Clearly, l > r√3 if the center M of the chord AB is between the points G and H. Thus the outcomes of the experiment are all points on the line FK, and the favorable outcomes are the points on the segment GH. Using the lengths r and r/2 of these segments as measures of the outcomes, we conclude that

    p = (r/2) / r = 1/2

Figure 1.2  (a) The first solution; (b) the second solution; (c) the third solution.

This example, known as the Bertrand paradox, demonstrates the possible ambiguities associated with the classical definition, the meaning of the terms possible and favorable, and the need for a clear specification of all experimental outcomes.
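The three answers correspond to three different sampling procedures, and each can be checked by simulation. The sketch below is illustrative only; r = 1 and 200,000 samples are arbitrary choices. It draws chords according to the three recipes and estimates p for each.

    import math, random

    def chord_length(method, r=1.0):
        if method == "midpoint":            # first solution: random point inside the circle
            while True:
                x, y = random.uniform(-r, r), random.uniform(-r, r)
                d2 = x * x + y * y
                if d2 <= r * r:
                    return 2 * math.sqrt(r * r - d2)
        if method == "endpoints":           # second solution: random endpoint on the circle
            theta = random.uniform(0, 2 * math.pi)
            return 2 * r * math.sin(theta / 2)
        if method == "offset":              # third solution: random distance from the center
            d = random.uniform(0, r)
            return 2 * math.sqrt(r * r - d * d)

    random.seed(0)
    for m in ("midpoint", "endpoints", "offset"):
        n = 200_000
        hits = sum(chord_length(m) > math.sqrt(3) for _ in range(n))
        print(m, round(hits / n, 3))        # roughly 0.25, 0.333, and 0.5

Each recipe is a legitimate reading of "select a chord at random," and each reproduces one of the three analytical answers; the ambiguity lies in the unspecified experiment, not in the arithmetic.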
USES OF THE CLASSICAL DEFINITION

1. In many experiments, the assumption that there are N equally likely alternatives is well established from long experience. In such cases, (1-8) is accepted as self-evident; for example, "If a ball is selected at random from a box containing m black and n white balls, the probability p that the ball is white equals n/(m + n)," and "if a telephone call occurs at random in the time interval (0, T), the probability p that it occurs in the interval (0, t) equals t/T." Such conclusions are valid; however, their validity depends on the meaning of the word random. The conclusion of the call example that p = t/T is not a consequence of the randomness of the call. Randomness in this case is equivalent to the assumption that p = t/T, and it follows from past records of telephone calls.

2. The specification of a probabilistic model is based on the probabilities of various events of the underlying physical experiment. In a number of applications, it is not possible to determine these probabilities empirically by repeating the experiment. In such cases, we use the classical definition as a working hypothesis; we assume that certain alternatives are equally likely, and we determine the unknown probabilities from (1-8). The hypothesis is accepted if its theoretical consequences agree with experience; otherwise, it is rejected. This approach has important applications in statistical mechanics (see Example 2.25).

3. The classical definition can be used as the basis of a deductive theory if (1-8) is accepted not as a method of determining the probability of real events but as an assumption. As we show in the next chapter, a deductive theory based on (1-8) is only a special case of the axiomatic approach to probability, involving only experiments in which all elementary events have the same probability. We should note, however, that whereas the axiomatic development is based on the three axioms of probability, a theory based on (1-8) requires no axioms. The reason is that if we assume that all probabilities satisfy the equally likely condition, all axioms become simple theorems. Indeed, axioms I and II are obvious. To prove (1-5), we observe that if the events A and B consist of N_A and N_B outcomes, respectively, and they are mutually exclusive, their union A ∪ B consists of N_A + N_B outcomes. And since all probabilities satisfy (1-8), we conclude that

    P(A ∪ B) = (N_A + N_B) / N = P(A) + P(B)    (1-9)

Subjective  In the subjective interpretation of probability, the number P(A) assigned to a statement A is a measure of our state of knowledge or belief concerning the truth of A. The underlying theory can be generally accepted as a form of "inductive" reasoning developed "deductively." We shall not discuss this interpretation or its effectiveness in decisions based on inductive inference. We note only that the three axioms on which the theory is based are in reasonable agreement with our understanding of the properties of inductive inference. In our development, the subjective interpretation of probability will be considered only in the context of Bayesian estimation (Section 8-2). There we discuss the use of the subjective interpretation in problems involving averages, and we comment on the resulting controversies between "subjectivists" and "objectivists." Here we discuss a special case of the subjective interpretation of P(A) involving total prior ignorance, and we show that in this case it is formally equivalent to the classical definition.

PRINCIPLE OF INSUFFICIENT REASON  This principle states that if an experiment has N alternatives (outcomes) ζ_i and we have no knowledge about their occurrence, we must assign to all alternatives the same probability. This yields

    P{ζ_i} = 1/N    (1-10)

In the die experiment, we must assume that P{f_i} = 1/6. In the polling experiment, we must assume that P{r} = P{d} = 1/2. Note that (1-10) is equivalent to (1-8). However, the classical definition, on which Equation (1-8) is based, is conceptually different from the principle of insufficient reason. In the classical definition, we know, from symmetry considerations or from past experience, that the N outcomes are equally likely. Furthermore, this conclusion is objective and is not subject to change. The principle of insufficient reason, by contrast, is a consequence of our total ignorance about the experiment. Furthermore, it leads to conclusions that are subjective and subject to change in the face of any evidence.
Concluding Remarks  In this book we present a theory based on the axiomatic definition of probability. The theory is developed deductively, and all conclusions follow logically from the axioms. In the context of the theory, the question "what is probability?" is not relevant. The relevant question is the correspondence between probabilities and observations. This question is answered in terms of the other three interpretations of probability. As a motivation, and as a reminder of the connection between concepts and reality, we shall often give an empirical interpretation (relative frequency) of the various axioms, definitions, and theorems. This portion of the book is heuristic and does not obey the rules of deductive reasoning on which the theory is based.

We conclude with the observation that all statistical statements concerning future events are inductive and must be interpreted as reasonable approximations. We stress, however, that our inability to make exact predictions is not limited to statistics. It is characteristic of all scientific investigations involving real phenomena, deterministic or random. This suggests that physical theories are not laws of nature, whatever that may mean. They are human inventions (mental constructs) used to describe with economy patterns of real events and to predict, but only approximately, their future behavior. To "prove" that the future will evolve exactly as predicted, we must invoke metaphysical causes.

2  Fundamental Concepts

The material in Chapters 2 and 3 is based on the notions of outcomes, events, and probabilities and requires, for the most part, only high school mathematics. It is self-contained and richly illustrated, and it can be used to solve a large variety of problems. In Chapter 2, we develop the theory of probability as an abstract construct based on axioms. For motivation, we also make frequent reference to the physical interpretation of all theoretical results. This chapter is the foundation of the entire theory.

2-1  Set Theory

Sets are collections of objects. The objects of a set are called elements. Thus the set {apple, boy, pencil} consists of the three elements apple, pencil, and boy. The elements of a set are usually placed in braces; the order in which they are written is immaterial. They can be identified by words or by suitable abbreviations. For example, {h, t} is a set consisting of the elements h for heads and t for tails. Similarly, the six faces of a die form the set

    {f1, f2, f3, f4, f5, f6}

In this chapter all sets will be identified by script letters* A, B, C, . . . ; their elements will, in general, be identified by the Greek letter ζ. Thus the expression

    A = {ζ1, ζ2, . . . , ζN}    (2-1)

will mean that A is a set consisting of the N elements ζ1, ζ2, . . . , ζN.

* In subsequent chapters, we shall use script letters to identify only sets representing events of a probability space.
ht, and th. Sets of this form will be used in experiments involving repeated trials. In the set stl, the elements ht and th are different; however, the set {th. hh, ht} equals :A.. In the preceding examples, we identified all sets explicitly in terms of their elements. We shall also identify sets in terms of the properties of their elements. For example, sf. = {all integers from I to 6} (2-2) is the set {1, 2, 3, 4. 5, 6}; similarly. ~ = {all even integers from I to 6} is the set {2, 4, 6}. Venn Diagrams Wc shall assume that all elements under consideration belong to a set ~called space (or universe). For example, if we consider children in a certain school, ~is the set of all children in that school. The set ~ is often represented by a rectangle, and its elements by the points in the rectangle. All other sets under consideration are thus represented by various regions in this rectangle. Such a representation is called a Venn diagram. A Venn diagram consists of infinitely many points; however, the set ~ that it represents need not have infinitely many elements. The diagram is used merely to represent graphically various set operations. • In subsequent chapters, we shall use script letters to identify only sets representing events of a probability space. SEC. 2-1 SF.T THEORY 21 .1\c::l J<igure 2.1 Subsets We shall say that a set 2A is a subset of a set stl if every of~ is also an clement of :A. (Fig. 2.1 ). The notations tA c .'11. stl :J :!A (2-3) will mean that the set 00 is a subset of the set .<A.. For example, if .'11. = {fi. Ji .};} 213 ;, .f,} then :11 C .'11.. In the Venn diagram representation of (2-3), the set :!A is included in the set stl. element u; Equality The notation .'11. = /A will mean that the sets stl and :1l consist of the same elements. To establish the quality of the sets .<A. and ~. we must show that every element of ?A is an clement of stl and every element of stl is an element of til. In other words. stl = ~ iff* ~ c stl and .>4 c 1A We shall say that ~ is a proper subset of .-;1 if :A is a subset of st. but does not equal .<A.. The distinction between a subset and a proper subs~t will not be made always. Unions and Intersections Given two sets .<A. and 33, we form a set consisting of all elements that are either in stl or in ~ or in both. This set is written in the form stl U ?A or stl + .?A and it is called the union of the sets .<A. and ~1l (shaded in Fig. 2.2). Given two sets stl and ~. we form a set consisting of all elements that are in :A. and in ?A. This set is written in the form ,r;zt. n ~ or .-4~ and it is called the intersection of the sets :A. and 3l (shaded in Fig. 2.3). Complement Given a set st.. we form a set consisting of all elements of ':t that arc not in :A. (shaded in Fig. 2.4). This set is denoted by stl and it is called the complement of stl. • Iff is an abbreviation for "if and only if." 22 CHAP. 2 FUNDAMENTAL CONCEPTS :.·J u :II Figure 2.2 Figure 2.3 8 Figure 2.4 Example 2.1 Suppose that ~ is the set of all children in a community, at is the set of children in fifth grade, and 00 is the set of all boys. In this case, ii is the set of children that are not in fifth grade, and ?ii is the set of all girls. The set at u ~ consists of all girls in fifth grade and all the boys in the community. The set at n ~ consists of all boys in fifth grade. • Disjoint S~ts We shall say that the sets .s4. and ~ are disjoint if they have no common elements. 
Disjoint Sets  We shall say that the sets A and B are disjoint if they have no common elements. If two sets are disjoint, their intersection has a meaning only if we agree to define a set without elements. Such a set is denoted by ∅ and is called the empty (or null) set. Thus

    A ∩ B = ∅    iff the sets A and B are disjoint.

We note that a set may have finitely many or infinitely many elements. There are two kinds of infinities: countable and noncountable. A set is called countable if its elements can be brought into one-to-one correspondence with the positive integers. For example, the set of even numbers is countable; this is easy to show. The set of rational numbers is countable; the proof is more difficult. The set of all numbers in an interval is noncountable; this is difficult to show.

PROPERTIES  The following set properties follow readily from the definitions.

    A ∪ S = S        A ∩ S = A        A ∪ ∅ = A        A ∩ ∅ = ∅

If B ⊂ A, then A ∪ B = A and A ∩ B = B.

Transitive Property  If A ⊂ B and B ⊂ C, then A ⊂ C (Fig. 2.5).

Commutative Property  A ∪ B = B ∪ A

Associative Property  (A ∪ B) ∪ C = A ∪ (B ∪ C)

From this it follows that we can omit parentheses in the last two operations.

Distributive Law  (See Fig. 2.6.)

    (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)

Thus set operations are the same as the corresponding arithmetic operations if we replace A ∪ B and A ∩ B by A + B and AB, respectively. With the latter notation, the distributive law for sets yields

    (A + B)(C + D) = A(C + D) + B(C + D) = AC + AD + BC + BD

Figure 2.5  The transitive property.    Figure 2.6  The distributive law.

Partitions  A partition of S is a collection A = [A1, . . . , Am] of subsets A1, . . . , Am of S with the following property (Fig. 2.7): They are disjoint, and their union equals S:

    Ai ∩ Aj = ∅,  i ≠ j        A1 ∪ . . . ∪ Am = S    (2-4)

The set S has, of course, many partitions. If A is a set with complement Ā, then A ∩ Ā = ∅ and A ∪ Ā = S; hence, [A, Ā] is a partition of S.

Figure 2.7  A partition of S.

Example 2.2  Suppose that S is the set of all children in a school. If Ai is the set of all children in the ith grade, then [A1, . . . , A12] is a partition of S. If B is the set of all boys and G = B̄ the set of all girls, then [B, G] is also a partition.

Cartesian Product  Given two sets A and B with elements αi and βj, respectively, we form a new set C, the elements of which are all possible pairs αiβj. The set so formed is denoted by

    C = A × B

and it is called the Cartesian product of the sets A and B. Clearly, if A has m elements and B has n elements, the set C so constructed has mn elements.

Example 2.3  If A = {car, apple, bird} and B = {heads, tails}, then

    C = A × B = {ch, ct, ah, at, bh, bt}

The Cartesian product can be defined even if the sets A and B are identical.
To clarify this concept, here is a list of all permutations of the N = 4 objects a, b, c, d form = 1 and m = 2: m=1 (2-6) a b c: d Pt = 4 m =2 ab ac: ad ba be bd ca cb cd da db de p~ = 12 (2-7) Note that P~ = 3P1. This is so because each term in (2-6) generates 4 - 1 = 3 terms in (2-7). Clearly. 3 is the number of letters remaining after one is selected. This leads to the following generalization. • Theonm P':, = N(N- I)· · · (N- m + I) (2-8) • Proof. Clearly, Pf = N; reasoning as in (2-7) we obtain P~ = N(N- 1). We thus have N(N - I) permutations of the N objects taken 2 at a time. At the end of each permutation so formed, we attach one of the remaining N - 2 objects. This yields (N - 2)Pf! permutations of N objects taken 3 at a time. By simple induction. we obtain P':, = (N - m + l)P~ 1 and (2-8) results. We emphasize that in each permutation, a specific object appears only once, and two permutations arc different even if they consist of the same objects in a different order. Thus in (2-7), ab is distinct from ba; the configuration aa does not appear. Example 2.5 As we see from (2-8). P!0 = 10 X 9 = 90 If the ten objects are the numbers 0, I, . . . • 9. then P~0 is the total number of two-digit numbers excluding 00, II, 22. . . .• 99. • - 26 CHAP. 2 FUNDAMENTAL CONCEPTS = N in (2-8), we obtain P~ = N(N - I) · · · 1 = N! • Corollary. Setting m This is the number of permutations of N objects (the phrase "taken N at a time" is omitted). Example 2.6 Here are the 3! = 3 x 2 permutations of the objects a. b, and c: abc acb bac bca (·ab • cba Combinations Given N distinct objects and a number m s N, we select m of these objects in all possible ways. The number of groups so formed is denoted by C~ and is called "combinations of N objects taken mat a time." Two combinations are different if they differ by at least one object; the order of selection is immaterial. If we change the order of the objects of a particular combination in all possible ways, we obtain m! permutations of these objects. Since there are C~ combinations, we conclude that P~ = m !C~ (2-9) From this and (2-8) it follows that CN = N(N- I)· · · (N- m + 1) m m! N). Multiplying numerator and denomination This fraction is denoted by ( m by (N - m)!, we obtain CN- m- (N)N! m. - m!(N- m)! (2-10) Note, finally, that c~ = c~-m This can be established also directly: Each time we take m out of N objects, we leaveN- m objects. Example 2.7 With N = 4 and m = 2, (2-10) yields 4 _ (4) _ 4 X 3 _ C2 - 2 -IX2- 6 Here are the six combinations of the four objects a, b, c, d taken 2 at a time: d M ~ k M ~ • AppUcations We have N objects forming two groups. Group 1 consists of m identical objects identified by h; group 2 consists of N - m identical objects identified by t. We place these objects inN boxes, one in each box. St::C. 2-1 SET THEORY 27 We maintain that the number x of ways we can do so equals X=(~} (2-11) • Proof. The m objects of group I are placed in m out of the N available boxes and the N - m objects of group 2 in the remaining N - m boxes. This yields (2-11) because there are C~ ways of selecting m out of N objects. Example 2.8 Here are the q = 6 ways of placing the four objects h, h. t. t in the four boxes 8 1 , 8 2 , 83,84: q = 6 • Example 2.9 This example has applications in statistical mechanics <see Example 2.25). We place m identical balls inn> m boxes. We maintain that the number y of ways that we can do so equals y = (~) (2-12) where N "" n - m - I • Proof. The solution of this problem is rather tricky. 
We consider them balls as group I of objects and then - I interior walls separating then boxes as group 2. We thus haveN = n + m - I objects, and (2-12) follows from (2-11). In Fig. 2.8, we demonstrate one of the placements of m = 4 balls in n = 7 boxes. In this case. we have n - I = 6 interior walls. The resulting sequence of N = n - m - I = 10 balls and walls is bwwbbwwwbw, as shown. • Binary Numbers We wish to find the total number z of N-digit binary numbers consisting of m Is and N - mOs. Identifying the Is and Os as the objects of group 1 and group 2. respectively. we conclude from (2-12) that z = (~). Joigure 2.8 b bb bwwbbwwwbw b m =4 n =7 28 CHAP. 2 FUNDAMENTAL CONCEPTS Example 2.10 Here are the (~) = 6 four-digit binary numbers consisting of two Is and two Os: 1100 1010 1001 OliO 0101 0011 • Note From the identity (binomial expansion) ,... (a +b)".= L (~) a'"lr"-"' ... ~o it follows with a = b = I that (2-13) Hence, the total number of N-digit binary numbers equals 2·'·. c.... Subsets Consider a set Y consisting of the N elements C1 , • • • , We maintain that the total number of its subsets, including f/ itself and the empty set 0, equals 2N. • Proof. It suffices to show that we can associate to each subset of f/ one and only one N-digit binary number [see (2-13)]. Suppose that sA is a subset of Y. If J4 contains the element Ct. we write I as the ith binary digit; otherwise, we write 0. We have thus established a one-to-one correspondence between all theN-digit binary numbers and the subsets ofY. Note, in particular, that Y corresponds to II . . . I and 0 to 00 . . . 0. Example 2.11 The set ~ = {a, b, c, d} has four elements and 24 = 16 subsets: 0 {a} {b} {c} {d} {a, b} {a, c} {a, d} {b, d} {c, d} {a, b, c} {a, b, d} {a, c, d} {b, c. d} The corresponding four-digit numbers are as follows: ()()()() OliO 0001 1010 0010 1100 0100 0111 1000 1011 0011 1101 0101 1110 {b, c} {a, b, c, d} 1001 1111 • Genenllzed Combinations We are given N objects and wish to group them into r classes A 1 , , A, such that the ith class A; consists of k1 objects where k 1 + · · · + k, =N The total number of such groupings equals eN "·· ...• k. = N! k, !k2! • . k,! (2-14) • Proof. We select first k 1 out of theN object to form the first class A 1 • As we know, there are (Z) ways to do so. We next select k out of the remain2 ing N - k 2 objects to form the second class A 2 • There are ( N ~ k,) ways to SEC. 2-2 PROBABII.ITY SPACE 29 do so. We then select k3 out of the remaining N- k1 - k2 objects to form the third class Ah and so we continue. After the formation of the A,_, class. there remain N- k 1 - • • • - k,- 1 = k, objects. There is(~:) = I way of selecting k, out of the remaining k, object: hence. there is only one way of forming the last class A,. From the foregoing it follows that c·v == N! <N - k,>! <k, , + k,)! k,. · ..• A. k 1!(N- k,)! k2!(N- k,- k~)! k,.,!k,! and (2-14) results. Note that <2-10) is a special case of{2-14) obtained with r = 2, k 1 = k, h- :.:. N - k • and eN 4,.Az = cv A Example 2.12 We wish to determine the number M of bridge hands that we can deal using a deck with N ,_ 52 cards. In a bridge game. there are r -= 4 hands: each hand consists of 13 cards. With k, = /.:2 = k~ = k• ... 13. (2-14) yields -c'2n. · · · · u -.. M - 52! ..,5,6 02K 13! 13! 13! 13! . ··' x I • Let us look at two other interpretations of (2-14). We haver groups of objects. Group I consists of k 1 repetitions of a particular object h1 ; group 2 consists of k~ repetitions of another object h 2 • and so on. 
These objects are placed into N = k 1 + · · · + k, boxes, one in each box. The total number of ways that we can dO SO equa)S .A,· 2. Suppose that ~ is a set consisting of N clements and that ld, ..... sf, I is a partition of~ formed with the r sets :A; as in (2-4). If the ith set ~; consists of k; elements where k; are given numbers, the total A.· number of such partitions equals c~: I. Ct .. .. .. 2-2 Probability Space In the theory of probability. as in all scientific investigations, it is essential that we describe a physical experiment in terms of a clearly defined model. In this section, we develop the underlying concepts. • Definition. An experimental model is a set ~. The elements C1 of ~ are called outcomes. The subsets of~ are called events. The empty set 0 is called the impossible event and the set~ the certain event. Two events .54. 30 CHAP. 2 FUNDAMENTAL CONCEPTS and ~ are called mutually exdusive if they have no common elements, that is, if .s4. n ~ = 0. An event {C;} consisting of the single outcome C; is called an elemetltary event. Note the important distinction between the element C; and the event {C;}. If ff has N elements, the number of its subsets equals 2.-v. Hence, an experiment with N outcomes has 2·'· events (we include the certain event and the impossible event). Example 2.13 In the single-die experiment, the space ~ consists of six outcomes: 9' = {Jj . . . . ,.f6} It therefore has 26 = 64 events. We list them according to the number of elements in each. (~) = I event without elements. namely, the event 0. ( ~) = 6 events with one element each, namely. the elementary events {Jj }. {ii}•...• {J6}. (~) = 15 events with two elements each, namely, {Ji • ./i}. {Jj.jj}. . . . . {f•. .f6}. ( ~) = 20 events with three elements each. namely. {Jj.Ji./J}. {Jj . ./i . ./4}....• {./4.};./6}. (:) = 15 events with four elements each. namely, {Ji . .fi. JJ •./4}. fli . .h .f,. J.}..... u; ..14. h . .f6}. ( ~) = 6 events with five elements each. namely. {Jj ,Ji.f,,J4,J~} .. ... {Ji.f,.J4.J;.J6}. (:) = I event with six elements, namely. the event!/. In this listing, the various events are specified explicitly in terms of their elements as in (2-1). They can, however. be described in terms of the properties of the elements as in (2-2). We now cite various events using both descriptions: {odd}= Ui .};,/.} {even}= {./i • ./4 •.16} {less than 4} = {jj • .h. h} {even, less than 4} = {.h} Note, finally, that each element of9' belongs to 2·"· 1 = 32 events. For example. jj belongs to the event {odd}, the event {less than 4}. the event {jj • .h }, and 29 other events. • Example 2.14 (a) In the single toss of a coin, the space ~ consists of two outcomes !I = {h, t}. It therefore has 22 = 4 events, namely 0. {h}, {t}, and fl. (b) If the coin is tossed twice, 9' consists of four outcomes: f:i = {hh, ht, til, II} SEC. 2-2 PROBABII.JTY SPACE 31 It therefore has 2-' = 16 events. These include the four elementary events {hh}. {ht}. {th}. and {11}. The event-~~ "" {heads at the first toss} is not an elementary event. It is an event consisting of the two outcomes hlr and lrt. Thus ~ 1 = {hit. lzt}. Similarly. -~~ =- {heads at the second toss} = {lrlt. tlr} :J 1 "" {tails at the first toss} = {th. 11} :l ~ =- {tails at the second toss} '"' {111. 11} The intersection of the events "Jt 1 and ·It~ is the elementary event {hlr}. Thus ~ 1 n ·}{~ = {heads at both tosses} = {lrh} Similarly. '1t 1 n ?j 2 - {heads first. tails second} "" {Itt} :? 
1 n ;j ~ = {tails at both tos!>es} = {II} • l·:mpiricallntc•rprc•wtion o/Omnmln and J·:n·llf\. In the applications of probability to real problems. the underlying experiment is repeated a large number of times. and probabilities arc introduced to describe various averages. A !tingle performance of the experiment will be called a trial. The repetitions of the experiment form rc•peated trials. In the single-die experiment, a trial is the roll of the die once. Repeated trials are repeated rolls. In the polling experiment. a trial is the selection of a voter from a given population. At each trial. we observe one and only one outcome. whatever we have agreed to consider as the outcome for that experiment. The set of all outcomes is modeled by the certain event:;. If the observed outcome~ is an element of the event ."Jl. we say that the event .?'i occurred at that particular trial. Thus at a single trial only one outcome is observed: however. many events occur. namely. all the 2·'· 1 subsets of ;f that contain the particular outcome ~. The remaining 2-"· 1 events do not occur. Example 2.15 We conduct a poll to determine whether a voter is Republican (r) or Democrat (d). In this case. the experiment consists of two outcomes !f = {r. d} and four events: 0. {r}. {d}. :J (b) We wish to determine also whether the voter is male or female. We now have four outcomes: !J .:.: {rm. rf. dm. c{/1 and 16 events. The elementary events are {rm}. {1:/1. {dm}. and {df}. Thus :1t .:. : {Republican} = {rm. rj) ~~ =- {Democmt} -= {dm. ~11 .t4 =-- {male} - {rm. elm} .1-- = {female} = {rj: df} (a) :11. ~1. n .« n .M. . :.: {Republican. male} =- {rm} :It n = {Democr.u. male} - {dm} 'J .+ . : {Republican. female} = {t:/1 n :;. - {Democrat. female} = {df} • Note that the impossible event docs not occur in any trial. The certain event occurs in every trial. If the events .!A and :11 are mutually exclusive and 32 CHAP. 2 FUNDAMENTAL CONCEPTS .54 occurs at a particular trial, 00 does not occur at that trial. If 3l c .54 and ?1\ occurs, .54 also occurs. At a particular trial, either the event sf or its complement .54 occurs. More generally, suppose that [.s.i 1 • • • • , .54"'] is a partition of!:!. It follows from Equation (2-4) that at a particular trial, one and only one event of this partition will occur. Example 2.16 Consider the single-die experiment. The die is rolled and 2 shows. In our terminology, the outcomeJi is observed. In this case, the event {even}, the event {less than 4}. and 30 other events occur, namely, all subsets of~ that contain the element/2 • • Returning to an arbitrary !:! , we denote by na~ the number of times the event .s4. occurs inn trials. Clearly, na~ s n, n!l = n, and n0 = 0. Furthermore, if the events .s4. and 00 are mutually exclusive, then n:.eu~ If the events .s4. and Example 2.17 ~ = n81 + (2-15) n~ have common elements, then n~~~ = n:.e + n~ - na~1~ (2-16) We roll a die 10 times and we observe the sequence J. h Ji t. h h h t. t. h In this sequence, the elementary event {12} occurs 3 times, the event {even} 7 times, and the event {odd} 3 times. Furthermore, if sf ={even} and li ={less than S}, then sf u !i = {Ji,Ji,fl,J.,/6} sf n ~ = {12./.} n, = 7 na = 8 nsiriA = 6 na~u~ = 9 in agreement with (2-16). • Note This interpretation of repeated trials is used to relate probabilities to real experiments as in (2-1). If the experiment is performed n times and the event sf occurs n11 times, then P(sf) =- naln, provided that n is sufficiently large. 
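The counting relations (2-15)–(2-16) and the approximation P(A) ≈ nA/n are easy to watch numerically. A minimal simulation sketch, assuming a fair die and the events of Example 2.17 (the sample size and the Python code are our own choices, not part of the text):

    import random

    n = 10_000
    rolls = [random.randint(1, 6) for _ in range(n)]    # n rolls of a fair die

    A = {2, 4, 6}        # the event "even"
    B = {1, 2, 3, 4}     # the event "less than 5"

    nA   = sum(1 for r in rolls if r in A)
    nB   = sum(1 for r in rolls if r in B)
    nAB  = sum(1 for r in rolls if r in A and r in B)    # trials in which A and B both occur
    nAuB = sum(1 for r in rolls if r in A or r in B)     # trials in which A or B occurs

    print(nAuB == nA + nB - nAB)    # the counting identity (2-16); always True
    print(nA / n)                   # relative frequency of A, near P(A) = 1/2 for large n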
We should point out, however, that repeated trials is also a model concept used to create other models from a given experimental model. Consider the coin experiment. The model of a single toss has only two outcomes. However. if we are interested in establishing averages involving two tosses, our model has four outcomes: hh, ht, th, and tt. The model interpretation is thus fundamentally different from the empirical interpretation of repeated tosses. This distinction is subtle but fundamental. It will be discussed further in chapter 3 and will be used throughout the book. The Axioms A probabilistic model is a set fJ the elements of which arc experimental outcomes. The subsets off! are events. To complete the specification of the model, we shall assign probabilities to all events. • Definition. The probability of an event .54 is a number P(-54) assigned to .54. This number satisfies the following axioms. SEC. 2-2 PROBABILITY SPACE 33 I. It is nonnegative: P(sll.) ~ (2-17) 0 II. The probability of the certain event equals I: P(~) Ill. If the events .<A. and P(.<A. U ~ =I (2-18) are mutually exclusive. then ~) = P(al.) + P(:'A) (2-19) This axiom can be readily generalized. Suppose that the events sf,~. and~ are mutually exclusive. that is, that no two of them have common elements. Repeated application of (2-19) yields P(.<A. U ?A U '€) = P(:il) .,. P(~-Ji) + P(<(!,) This can be extended to any finite number of terms; we shall assume that it holds also for infinite but countably many terms. Thus if the events .s4 1 ••'11. 2 • • • • arc mutually exclusive, then P(sf, U sl2 U • • ·) = P(.<A.d + P(.<A.2) .,. • • • (2-20) This docs not follow from (2-19): it is an additional requirement known as the axiom of infinite additivity. Axiomatic Definition or an Experiment In summary. a model of an experiment is specified in terms of the following concepts: I. A set ~ consisting of the outcomes '; 2. Subsets of ~~ called events 3. A number P(.<A..) assigned to every event <This number satisfies the listed axioms but is otherwise arbitrary.) The letter~~ will be used to identify not only the certain event but also the entire experiment. Probability Masses We shall find it convenient to interpret the probability P(Sif) of an event sf as its probability mass. In Venn diagrams, !:1 is the entire rectangle, and its mass equals I. The mass of the region of the diagram representing an event .<A. equals P(Sif). This interpretation of P(Sif) is consistent with the axioms and can be used to facilitate the interpretation of various results. 34 CHAP. 2 FUNDAMENTAL CONCEPTS PROPERTIES In the development of the theory, all results must be derived from the axioms. We must not accept any statement merely because it appears intuitively reasonable. Let us look at a few illustrations. I. The probability of the impossible event is 0: P(0) = 0 (2-21) • Proof. For any szl, the events szl and 0 are mutually exclusive, hence (axiom Ill) P(szl U 0) = P(szl) + P(0). Furthermore, P(szl U 0) = P(szl) because szl U 0 = szl, and (2-21) results. 2. The probability of :si equals (2-22) P(szl) = I - P(st) • Proof. The events 91. and ~ are mutually exclusive, and their union equals ~, hence, P(~) = P(szl U ~) = P(s4) + P(:si) and (2-22) follows from (2-18). 3. Since P(szl) ~ 0 (axiom 1), it follows from (2-22) that P(szl) s 1. Combining with (2-17), we obtain 0 s P(szl) s 1 (2-23) 4. If :1 C szl, then P(li) s P(~) (2-24) • Proof. Clearly (see Fig. 
2.9), sz1 = sz1 u :~ = :~ Furthermore, the events hence (axiom III) ~ and szl n u (.si n ~> mare mutually exclusive; P(szl) = P(9J) + P(:A n 9J) And since P(szl n ~) ~ 0 (axiom I), (2-24) follows. 5. For any szl and 9J, P(szl U :1) = P(szl) + P(li) - P(szl n ~) (2-25) (2-26) • Proof. This is an extension of axiom Ill. To prove it, we shall express the event szl U 00 as the union of two mutually exclusive Figure 2.9 SEC. 2-2 PROBABILITY SPACE 35 Figure 2.10 events. As we see from Fig. 2.10. stt U ~ = as in (2-25), P(stl U ~) = P(stt) + P(~ Furthermore, stt U (stl. n ~).Hence, n 8) (2-27) ~=~n~=~u~n~=~n~u~n~ (distributive law); and since the events .~ n ?.13 and ~ n ?13 are mutually exclusive, we conclude from axiom III that P(~) Eliminating P(stt = P(stt n B) + P(~ n ?A) (2-28) n ?.e) from (2-27) and (2-28), we obtain (2-26). We note that the properties just discussed can be derived simply in terms of probability masses. J::mpiricallmerpretation. The theory of probability is based on the theorems and on deduction. However, for the results to be useful in the applications, all concepts must be in reasonable agreement with the empirical interpretation P(stt) = n:Ain of P(.si). In Section 1-2. we showed that this agreement holds for the three axioms. Let us look at the empirical interpretation of the properties. I. The impossible event never occurs; hence, P(0) = ne = n 0 2. As we know, n.., + n:;; = n; hence. n.., = 1 - n.,. = 1 - P(stt) n n n 3. Clearly, 0 s n.., s n; hence, 0 s n"ln s 1 in agreement with (2-22). 4. If~ C stt, then na s n"; hence. P(~) = n;t = n - P(li) 5. In general (see (2-16)], P(stt U li) n.o~ua = n:Aua = n,,. n n = na s n n:A n = P(stt) = n.,. ... na - n:Af':A; hence. -r na - n:Ara n n = P(stt) - PC~) - PCstt n ~) 36 CHAP. 2 FUNDAMENTAL CONCEPTS Model Specification An experimental model is specified in terms of the probabilities of all its events. However, because of the axioms, we need not assign probabilities to every event. For example, if we know P(s4), we can find P(Sl) from (2-22). We will show that the probabilities of the events of any experiment can be determined in terms of the probabilities of a minimum number of events. COUNTABLE OUTCOMES C1 , • • • , Suppose, first, that fl consists of N outcomes CN. In this case, the experiment is specified in terms of the proba- bilities Pi = P{C;} of the elementary events {C;}. Indeed, if sA. is an event consisting of the r outcomes '•· , . . . , '•·. it can be written as the union of the corresponding elementary events: sA.= {C.,} u · · · u {Cd <2-29) hence [see (2-20)], P(sA.) = P{C.,} + · · · + P{Cd = P4, + · · · + Pk. (2-30) Thus the probability of any event s4 of Y equals the sum of the probabilities Pi of the elementary events formed with all the elements of s4. The numbers Pt are such that PI + · · · + PN = I Pi~ 0 (2-31) but otherwise arbitrary. The foregoing holds also if~ consists ofcountably many outcomes (see axiom III). It doe~ not hold if the elements of~ are noncountablc (points in an interval, for example). In fact, it is not uncommon that the probabilities of all elementary events of a noncountable space equal zero even though P(Y) = I. Equally Likely Outcomes We shall say that the outcomes of an experiment are equally likely if I PI= • .• = PN = - (2-32) N In the context of an abstract model, (2-32) is only a special assumption. However, for real experiments, the equally likely assumption covers a large number of applications. 
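Equations (2-30)–(2-33) reduce the specification of a countable model to a choice of elementary probabilities pi. A small sketch of that bookkeeping (the helper name and the use of exact fractions are our own), with the fair die as the equally likely special case:

    from fractions import Fraction

    def event_probability(p, event):
        # P(A) is the sum of the elementary probabilities of the outcomes in A, as in (2-30).
        return sum(p[outcome] for outcome in event)

    # Fair die: six equally likely outcomes, as in (2-32).
    p = {face: Fraction(1, 6) for face in range(1, 7)}
    assert sum(p.values()) == 1          # condition (2-31)

    even = {2, 4, 6}
    less_than_4 = {1, 2, 3}

    print(event_probability(p, even))                    # 1/2, i.e., N_A / N as in (2-33)
    print(event_probability(p, even & less_than_4))      # P{2} = 1/6
    print(event_probability(p, even | less_than_4))      # 5/6 = P(A) + P(B) - P(A ∩ B)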
In many problems, this assumption is established empirically in terms of observed frequencies or by "reasoning" based on physical "symmetries." This includes games of chance, statistical mechanics, coding, and many other applications. From (2-30) and (2-32) it follows that if an event sA. consists of NsA outcomes, then P(s4) = NsA (2-33) N This relationship can be phrased as follows: The probability P(sA.) of an event sA. equals the number N sA of elements of sA. divided by the total number N of elements. It appears, therefore, that (2-32) is equivalent to the classical definition of probability [see (1-8)). There is, however, a fundamental differ- SEC. 2-2 37 PROBABILITY SPACE ence. In the axiomatic approach to probability, the equally likely condition of (2-32) is an assumption used to establish the probabilities of an experimental model. In the classical approach, (2-32) is a logical conclusion and is used, in fact, to define the probability of sf.. Let us look at several illustrations of this important special case. Example 2.18 (a) We shall say that a coin is fair if its outcomes II and tare equally likely, that is, if P{h} = P{t} = ~ (b) A coin tossed twice generates the space f:f = {hh, ht, th. tt}. If its four outcomes are equally likely (assumption), then P{hh} = P{ht} = P{th} = P{tt} = 4I (2-34) In this experiment, the event ~. = {heads at the first toss} = {hh, ht} has two outcomes, hence, P(~ 1 ) = 112. (c) A coin tossed three times generates the space ~ = {hhh, hht, hth. htt. thh, tht. tth, ttt} Assuming again that all outcomes are equally likely, we conclude that P{hhh} = · · · = P{ttt} = ~ (2-35) The event ff 2 = {tails at the second toss} = {hth, htt, tth, ttt} has four outcomes hence, P(fJ 2) = 1/2. The event sl = {two heads show} = {hhh, hht, thh} has three outcomes, hence, P(sf.) = 3/8. We show in Section 3-1 that the equally likely assumptions leading to (2-34) and (2-35) are equivalent to the assumption that the coin is fair and the tosses are independent. • Example 2.19 (a) We shall say that a die is fair if its six outcomes/; are equally likely, that is, if = · · · = P{f"} =!6 The event {even}= {12 ,/..,Jf.} has three outcomes. hence, P{even} = P{/1} 3/6. (b) In the experiment with two dice. we have 36 outcomes/;./j. If they are equally likely. {/;./j} = 1/36. The event {II} = {[.,/,. fJ.~} has two outcomes, hence. P{ II} '-= 2/36. The event {7} = Utft.. f,Jj .fds .fsfi. f,[.. • .14/3 } has six outcomes, hence. P{7} "" 6/36. • Example 2.20 (a) If a coin is not fair. then P{h} = p P{t} = q p ... q = I For a theoretical investigation, pis an arbitrary number. If represents a real coin, pis determined empirically, as in (1-r ,.. 38 CHAP. 2 FUNDAMENTAL CONCEPTS (b) A coin is tossed twice. generating a space with four outcomes. We assign to the elementary events the following probabilities: P{hh} = p 2 P{ht} = pq P{th} = qp P{tt} These probabilities are consistent with (2-30) because p2 _. pq _ qp + p2 = (p _ q)2 = 1 = q2 C2-37) The assumption ofC2-37) seems artificial. As we show in Section 3-1, it is equivalent to (2-36) and the independence of the two tosses. In preparation, we note the following. With }ftt 'll2 • 5 1 • 3"2-the events heads at the first, heads at the second, tails at the first, tails at the second toss. 
respectively-we have P<~ 1 ) = P{hh} + P{ht} = p2 ~ pq = p(p- q) = p P('H2) = P{hh} - P{tlr} = p2 + qp = p(p + q) = p P(3"1) = P{th} + P{tt} = qp T q2 = qCp + q) = q P(3"2) ::.: P{ht} "t" P{tt} = pq + q2 = q(p T q) = q The elementary event {hh} is the intersection of the events }t 1 and ~~; hence, PC'Ilt n 'lf2) = P{hh}. Thus PC~1 n 1tz) = P{hh} = p 2 = PC1t,)P(ltz) P('lt 1 n 3 2 ) = P{ht} = pq = PC'It 1 )PC~~) (2-38) P(3" t n 'ltz) = P{th} = qp = PC~ 2JP(3" t> PC~t n 3~):..: P{tt} q~ = P<:ft>PC3"~) = Example 2.21 (a) In Example 2.12Ca), S = {r. d} and P{r} • = p, P{d} = (b) In Example 2.12(b), the space is ~ = {rm, rd. dm, q as in C2-36). df} To specify it, we assign probabilities to its elementary events: P{rm} = P1 P{rf} = P2 P{dm} = PJ P{df} = P• where PI - P2 + Pl • P• = I. Thus Pt is the probability that a person polled is Republican and male. lf9' is the model of an actual poll, then p 1 == n,,.ln where n,,. is the number of male Republicans out of n persons polled. In this experiment, the event ~ = {Republican} consists of two outcomes. Applying (2-31). we obtain PC~) = P{rm} + P{rf} "" P1 ... P2 Similarly, P(.,if) = PI .,. P~ P(~) = p~ + P• PC~)= P2 + P4 • Equally Likely Events The assumption of equally likely outcomes on which (2-32) is based can be extended to events. Suppose that :A 1 , • • • , .s-4.,. are m events of a partition. We shall say that these events are equally likely if their probabilities arc equal. Since the events are mutually exclusive and their union equals'::/, we conclude as in (2-32) that P(.s4 1) = · · · = P(.s4,.) = mI (2-39) Problems involving equally likely outcomes and events are important in a variety of applications, and their solution is often difficult. However, the SEC. 2-2 PRORABJI.ITY SPACE 39 difficulties are mainly combinatorial (counting the number of outcomes in an event). Since our primary objective is the clarification of the underlying theory, we shall not dwell on such problems. We shall give only a few illustrations. Example 2.22 We deal a bridge hand from a well-shuffled deck of cards. Find the probabilities of the following events: .:4 - {hand contains 4 aces} ~ = {4 kings} 't = {4 aces and 4 kings} '.i. = {4 aces or 4 kings} The model of this experiment has ( ~~) outcomc'i. namely. the number of ways that we can take 13 out of 52 objects [sec (2-9)1. In the context of the model, the assumption that the deck is well shuffled means that all outcomes are equally likely. In the event st. there are 4 aces. and the remaining 9 cards are taken from the 48 cards that arc not aces. Thus the number of outcomes in :.4 equals ( ~ ) . This is true also for the event :1\. Hence. Pl:ll) = PWU (~) - - - = .0026 ( 52) 13 The event '€ = .:S n til contains 4 aces and 4 kings. The remaining 5 cards are taken from the remaining 44 cards that are not aces or kings. Hence. (~) PC'-f:) = - - = 1.7 (~i) Finally~ = ,71 U PC':;?;J ~1\: X 10 " hence [see (2-26)]. = P(.!4) + P(i/3) 1 2 (~) (~n - P('O = - - - - - - = .0052 • Example 2.23 (a) A box contains 60 red and 40 black balls. A ball is selected at random. Find the probability Pu that the ball is red. The model of this experiment has 60 · 40 = 100 outcomes. In the context of the model, the random selection is equivalent to the assumption that all outcomes are equally likely. Since there arc 60 red balls. the c·,cnt R = {the select ball is red} has 60 outcomes. Hence. Pu "" P(\/t) = 601100. (b) We select 20 balls from the box. Find the probability p, that 15 of the selected balls are red and 5 black. 
In this case, an outcome is the selection of 20 balls. There are 1 ( : ) ways of selecting 20 out of 100 objects; hence. the experiment has 40 CHAP. 2 FUNDAMENTAL CONCEPTS (~~)equally likely outcomes. There are(~) ways of selecting 15 out of the 60 red balls and ( ~) ways of selecting 5 out of 40 black balls. Hence, there are ( ~) x ( ~) ways of selecting IS red and 5 black balls. Since all outcomes are equally likely, we conclude that (~)X(~) Pb = - - - - = .065 ( ~':) (2-40) This can be readily generalized. Suppose that a set contains L red objects and M black objects (two kinds of elements) where L + M = N. We select n s N of these objects at random. Find the probability p that I of these objects are red and m black where I T m = n. This experiment has ( ~) equally likely outcomes. There are ( ~) ways of selecting I out of the L red objects and (:) ways of selecting m out of the M black objects. Hence, as in (2-40), L! M! I!(L- I)! m!(M- m)! p= N! n!(N- n)! Example 2.24 (2-41) • We deal a 5-card poker hand out of a 52-card well-shuffled deck. Find the probability p that the hand contains 3 spades. This is a special case of(2-41) with N =52, L = 13, and I= 3 if we identify the 13 spades with the red objects and the other 39 cards with the black objects. With M = 39, n = 5, and m = 2, (2-41) yields (1])(3!) (Sf) p = - - - = .082 • In certain applications involving a large number of outcomes, it is not possible to determine empirically the probabilities p; of the elementary events. In such cases, a theoretical model is formed by using the equally likely assumption leading to (2-33) as a working hypothesis. The hypothesis is accepted if its consequences agree with experimental observations. The following is an important applications from statistical mechanics. Example 2.25 We place at random m particles in n ~ m boxes. Find the probability p that the particles are located in m preselected boxes, one in each box. SEC. 2-2 PROBABILITY SPACE 41 The solution to this problem depends on what we consider as outcomes. We shall analyze the following celebrated models. (a) Maxwell-Boltzmann. We assume that them panicles are distinct, and we consider as outcomes all possible ways of placing them into the n boxes. There are n choices for each panicle; hence. the number N of outcomes equals n"'. The number N,4 of favorable outcomes equals the m! ways of placing the m panicles into the m preselected boxes (permutations of m objects). Thus N = n"'. N.,~ = m!. and (2-33) yields m! p = nnr (b) Bose-Ein.stein. We now assume that them panicles are identical. In this case, N equals the number of ways of placing m identical objects into 11 boxes. As we know from (2-12) and from Fig. 2.8. this number equals n- m- I) ( . There is, of course. only one way of placing the m par- m . ticles in the m preselected boxes. Hence, I /) = ---'----- ("+:-') m!(n- I)! <n + m- I)! (c) Fermi-Dirac. We assume again that the panicles are identical and that we place onl.v one panicle in each box. In this case. the number of possibilities equals the number (. n ) of combinations of n objects taken m at a m· time. Of these, only one is favorable: hence. I p=-= (n) m!(n- ml! n! .m One might argue. as indeed it was argued in the early years of statistical mechanics. that only the first of these solutions may be accepted. However. in the absence of direct or indirect experimental evidence, no single model is logically correct. 
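The three hypotheses are easy to compare numerically for small m and n. A sketch under the assumptions of Example 2.25 (the function names and the brute-force check of the first count are our own additions):

    from itertools import product
    from math import comb, factorial

    def maxwell_boltzmann(m, n):
        # Distinct particles, each placed independently in one of the n boxes.
        return factorial(m) / n**m

    def bose_einstein(m, n):
        # Identical particles; placements counted as in (2-12).
        return 1 / comb(n + m - 1, m)

    def fermi_dirac(m, n):
        # Identical particles, at most one per box.
        return 1 / comb(n, m)

    m, n = 3, 5
    target = set(range(m))           # the m preselected boxes

    # Brute-force check of the Maxwell-Boltzmann value: enumerate all n**m
    # placements of the m distinct particles and count the favourable ones.
    placements = list(product(range(n), repeat=m))
    favourable = sum(1 for pl in placements if set(pl) == target)
    print(favourable / len(placements), maxwell_boltzmann(m, n))   # both 3!/5**3 = 0.048

    print(bose_einstein(m, n))    # 1/C(7, 3) ≈ 0.0286
    print(fermi_dirac(m, n))      # 1/C(5, 3) = 0.1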
The models are actually only hypothe.ses: the physicist accepts the panicular model whose consequences agree with observations. • We now consider experiments consisting of a noncountable number of outcomes. We assume, first, that ~consists of all points on the real line ~ = {-x < t < x} (2-42) NONCOUNTABLE OUTCOMES This case arises often: system failure, telephone calls, birth and death, arrival times, and many others. The events of this experiment are all intervals {t 1 s 1 s 12} and their unions and intersections (see "Fundamental Note" later in this discussion). The elementary events are of the form {I;} where I; is any point on the t-axis, and their number is noncountable. Unlike the case of countable elements, ~ is not specified in terms of the probabilities of the elementary events. In fact, it is possible that P{t;} = 0 for every outcome I; 42 CHAP. 2 FUNDAMENTAL CONCEPTS even though ~ is the union of all elementary events. This is not in conflict with (2-20) because in the equation, the events .:A.; are countable. We give next a set of events whose probabilities specify ~completely. To facilitate the clarification of the underlying concept of density, we shall use the mass interpretation of probability. If the experiment ~ has a countable number of outcomes ,;, the probabilities p, = P{,;} of the elementary events {C;} can be viewed as point masses (Fig. 2.11 ). If~ is noncountable as in (2-42) and P{C;} = 0 for every ,;;the probability masses are distributed along the axis and can be specified in terms of the density function a(t) defined as follows. The mass in an interval (1 1 , t 2) equals the probability of the event {1 1 s 1 s 12}. Thus (2-43) The function a(t) can be interpreted as a limit involving probabilities. As we see from (2-43), if lit is sufficiently small, P{t 1 s t s t 1 +lit}= a(t 1)/it, and in the limit ( ) at, . I1m = .11-G P{t 1 s I s t 1 - ~I} A Ql (2-44) We maintain that the function a(t) specifies the experiment ~ completely. Indeed, any event sf of~ is a set~ of points that can be written as a countable union of nonoverlapping (disjoint) intervals (see "Fundamental Note"). From this and (2-20) it follows that P(s4) equals the area under the curve a(t) in the region ~. As we see from (2-30). the area of a(t) in any interval is positive; hence, a(t) ~ 0 for any t. Furthermore, its total area equals P(~) = I. Hence, a(t) is such that a(t) ~0 r. a(t)dt =I (2-45) Note The function a(t) is related to but conceptually different from the density of a random variable, a concept to be developed in Chapter 4. In (2-45) we assumed that ~ is the set of all points on the entire line. In many cases, however, the given set~ is only a region~ of the axis. In such cases, a(t) is specified only for 1 E ~ and its area in the region ~ equals I. The following is an important special case. Fipre 2.11 P; 2-2 Sf.C. PROBABII.ITY SPACE 43 We shall say that Y. is a set of random points in the interval (tl, b) if it consists of all points in this interval and et(t) is nmsttmt. In this case [see (2-45)), et(t) = 1/(b - a) and P{t, ~ I ~ 1~} = f t: ,. I• - /1 et(/) dt = - - - - b - (2-46) ll Fundtm!(•fltal Nott•. In the definition of a probability space. we have assumed tacitly that all subsets of!f are events. We can do so if!J is countable. but if it is noncountable. we cannot assign probabilities to ull its suhs\!ts consistent with axiom Ill. For this reason. events of !f ure all subsets that can he expressed as countable unions and intersections of intervals. 
That this does not include all subsets of ~ is not easy to show. However. this is only of mathematical interest. For most applications. sets that arc not countable unions or intersections of intervals are of no int\!rest. Empirica/lmt•tprt'ttltimr oft~ltl. As we know. the probabilities of the events of a model can he evaluated empirically as relative frequencies. If. therefore. a model parameter can be expressed as the probability of un event. it can be so evaluated. The function a(t) is not a probability: however. its integral is [sec (2-43)1. This fact leads to the following method for evaluating a(t). To find a(t) fort = t;. we form the event .!4; = {t; :S t :S f; - .lt} At a single trial. we observe an outcome t. If the observed t is in the interval ( t;. t, - .11), the event .s4; occurs. Denoting by .l11, the number of such occurrences at 11 trials. we conclude from <2-43) and 11-1) that P(:.i;) = !,t,1.·.11 a(t)dt .ltl, ,. II If .1t is sufficiently small. the integral equals a(t,).lt: hence • .ill a(t) .1t == - ' I II This expresses a(f;) in terms of the number interval (I;. I; + .it). Example 2.26 .i11; (2-47) (2-481 of observed outcomes in the We denote by 1 the age of a person when he dies, ignoring the part of the population that live more than 100 years. The outcomes of the resulting experiment arc all points in the interval (0. 100). The experimental model is thus specified in terms of a function a( I) defined for every t in this interval. This function can be determined from (2-48) if sufficient data are available. We shall asumc that a<r> = 3 x w· 9r~ooo - r> 2 o ~ r < 100 <2-49> (see Fig. 2.12). The probability that a person will die between the ages of 60 and 70 equals P{60::: t ::i } 70 ""3 X 10 91 70, 1\U , t·(IOO- t)·dt- .154 The probability that a person is alive at 60 equals P{t > 60} = 3 x 10 9 J:oo t~CIOO- t)2 dt = .317 44 CHAP. 2 FUNDAMENTAL CONCEPTS a{t) 100 0 Figure 2.12 Thus according to this model. 15.4% of the population dies between the ages of 60 and 70 and 31.7% is alive at age 60: • Example 2.27 Consider a radioactive substance emitting particles at various times 1;. We observe the emissions starting at t = 0. and we denote by t 1 the time of emission of the first particle (Fig. 2.13). Find the probability p thatt 1 is less than a given time 10 • In this experiment. ~ is the axis t > 0. We shall assume that a{t) = >.e-At {2-50) 1~0 From this and (2-43) it follows that p = P{t 1 < t 0 } =A Jo(to e·Aldt =I- e-Ato • Points on the Plane Experiments involving points on the plane or in space can be treated similarly. Suppose, for example, that the experimental outcomes are pairs of numbers (x, y) on the entire plane ~ = {-x <X, y < x} (2-51) or in certain subset of the plane. Events of this experiment are all rectangles and their countable unions and intersections. This includes all nonpathological plane regions ~. To complete the specification of~. we must assign probabilities to these events. We can do as in (2-43): We select a positive function a(x, y), and we assign to the event {(x. y) e a} the probability P{(x, y) E ~} = Jf a(x, y)dtdy (2-52) '.1 The function a(x, y) can be interpreted as surface mass density. Fipre 2.13 0 )( )( f; t SEC. 2-3 CONDITIONAL PROBABILITY AND INDEPENDENCE 45 .t l'igure 2.14 Example 2.28 A point is selected at random from the rectangle~ of fig. 2.14. Find the probability p that it is taken from the trapezoidal region 9:. The model of this experiment consists of all points in 9t. 
The assumption that the points are selected at random is equivalent to the model assumption that the probability density a(x, y) is constant. The area of@t equals 24: hence, a(x, y) = 1124. And since the area of~ equals 9, we conclude from (2-52) that p =..!..If dxdy = 1. 24 24 • <j 2-3 Conditional Probability and Independence Given an event .M. such that P(.M.) =I= 0, we form the ratio P(lA n .M.)/P(.M.) where lA is any event of~. This ratio is denoted by P(lA.I.M.) and is called the "conditional probability of lA assuming .M.." Thus PClAI.M.> = P(lA n .M.) (2-53) P(.M.) 1 The significance of this important concept will be appreciated in the course of our development. Empirical llrtnprewtion. We repeat the experiment" times, and we denote by n.., and n31 r_., the number of occurrences of the events .M. and s4 n .M., respectively. If n is large, then (see (I-I)) P(.M.) "'" n.« n P(s4 n .M.) "" n:4n.fl n Hence. P(s4 n .M.) P(.M.) == nJAn.trln n.ufn = n.ar..tr n.AI. 46 CHAP. 2 FUNDAMENTAL CONCEPTS and (2-53) yields P(sA. .M.) == n..,".tt (2-54) n.11 The event sA. n .M. occurs iff .M. and sA. occur; hence, n.-.n.tt is the number of occurrences of sA. in the subsequence of trials in which .M. occurs. This leads to the relative frequence interpretation of P<~I.M.): The conditional probability of sA. assuming ..« is nearly equal to the relative frequency of the occurrence of sA. in the subsequence of trials in which ..« occurs. This is true if not only n but also n.11 is large. Example 2.29 Given a fair die, we shall determine the probability of 2 assuming even. This is the conditional probability of the elementary even sA. =- {.12} assuming ..« = {even}. Clearly, sA. c ..«; hence, sA. n .M = sA.. Furthermore, P(sA.) = 116 and P(.M.) = 3/6; hence, P(sA. n .M.) 116 I P(.f2/even) = P(.M.) = 112 = 3 Thus the relative frequency of the occurrence of 2 in the subsequence of trials in which even shows equals 1/3. • Example 2.30 In the mortality experiment (Example 2.26), we wish to find the probability that a person will die between the ages of 60 and 70, assuming that the person is alive at 60. Our problem is to find P(~l.«> where sA. = {60 < t s 70} ..« = {t > 60} As we have seen, P(sA.) = .154 and P(.M.) = .317. Since sA. n .M. =sA., we conclude that P(sA.) .154 P<sA.I.«> = PUf.) = .317 = ·486 Thus 15% of all people die between the ages of 60 and 70. However, 48.6% of the people that are alive at 60 die between the ages of 60 and 70. • Example 2.31 A box contains 3 white balls and 2 red balls. We select 2 balls in succession. Find the probability p that the first ball is white and the second red. The probability ofthe event 'W 1 ={white first} equals 315. After the removal of the white ball there remain 2 white and 2 red balls. Hence, the conditional probability of the event ~2 = {red second} assuming 'W 1 equals 2/4. And since 'lt' 1 n ~2 is the event {white first, red second}, we conclude from (2-531 that 6 20 Next let us find a direct solution. The experiment has 20 outcomes, namely. the P~ = 5 x 4 permutations w1w2, WJM'J, w,r, • . . . , r2w2, r2w3. r2r1 of the 5 objects w1 , w2 • w3 , r 1 , r 2 taken 2 at a time. The elementary events are equally likely, and their probability equals 1/20. The event {white first, red second} consists of the 6 outcomes p = P('W, n ffi2) = P(~21'W,)P('W,) -= 42 x 53 = Wt Tt , W2r1, WJTt , Wt r2, W2r2, M'3T3 Hence, its probability equals 6120. • SEC. 2-3 CONDITIONAL PROBABILITY AND INDEPENDENCE 47 In a number of applications. the available information (data). 
although sufficient to specify the model. does not lead directly to the determination of the probabilities of its events but is used to determine conditional probabilities. The next exa!'Jlple is an illustration. Example 2.32 We arc given two boxes. Box I contains 2 red and 8 white cards: box 2 contains 9 red and 6 white cards. We select ut rundom one of the boxes. and we pick at random one of its cards. Find the probability p that the selected card is red. The outcomes of this experiment are the 25 cards contained in both boxes. We denote by :13 1 the event consisting of the 10 cards in box I and by fil 2 the event consisting of the 15 cards in box 2 in Fig. 2.15. From the assumption that a box is selected at rdndom we conclude that P!?A1l = P(;1\2) =- ~ 12-55) The event Jl. = {red} consists of II outcomes. Our problem is to find its probability. We cannot do so directly because the 25 outcomes of~ are not equally likely. They are, however. conditionally equally likely. subject to the condition that a box is selected. As a model concept this means that io' P(l1tltJII) = 9 P(./1 :1!2> "" i5 12-56) Equations (2-55) and (2-56) are not derived. They are assumptions about the model based on the hypothesis that the box and the card arc selected at random. Using these assumptions. we shall derive PfdlJ deductively. Since P<~·tll) I = Pf:/1 n ~~~ /, '"'I·" 2) ·.. P(~l) (,.,. .7) PC11 n ~21 P!~2) it follows from (2-55) and !2-56) that Pfdl. n ;'A ) I 2 X ! .., ..!_ = 10 2 10 .1> n P (m ._, m 2) 9 = 15 X I l -'- 3 )0 The events 'til n :1\ 1 and Jl. n ?A 2 arc mutually exclu!>ivc (see Fig. 2.15). and their union equals the event 't/1.. Hence. PI~> = P!;ifl n :1} 1) -1 PC11 n :1}:> -=- 1~ Thus if we pick at random a card from a randomly selected box, in 40% of the trials the selected card will be red. • l'igure 2.15 .1!. 8w 48 CHAP. 2 FUNDAMENTAL CONCEPTS From (2-53) it follows that P(.stl n ~) = P(.slti~)P(3l) (2-57) Repeated application of this yields P(.stl n ~ n <€) = P(.stliOO n <€) P (~ n <€) = P(.stli~ n '-€)P(~I~)P(~) This is the chain rule for conditional probabilities and can be readily generalized (see Problem 2-14). Fundamental Property We shall now examine the properties of the numbers P(.stli.M.) for a .fixed .M. as .stl ranges over all the events of~. We maintain that these numbers are, indeed, probabilities; that is, they satisfy the axioms. To do so, we must prove the following: I. II. III. If .stl n 00 P(.stli.M.) ~ P<~I.M.> = 0 I (2-58) (2-59) = 0, then P(.stl u OOI.M.> = PC.stli.M.> + P<OOI.M.> (2-60) • Proof. Equation (2-58) follows readily from (2-17) because .stl n .M. is an event. Equation (2-59) is a consequence of that fact that ~ n .M. = .M.. To prove (2-60), we observe that if the sets .sit and 33 are disjoint, their subsets .stl n .M. and ~ n .M. are also disjoint (Fig. 2.16). And since (.srt u ~) n .M. = (.stl n .M.) u (~ n .M.), we conclude from (2-19) that P(.slt u ~I.M.> = P[(.stl U ~) n .M.) = P(.stl n .M.) P(~ n .M.) P(.M.) P(..t{) + P(.M.) and (2-60) results. The foregoing shows that conditional probabilities can be used to create from a given experiment ~ a new experiment conditioned on an event .M. of~. This experiment has the same outcomes C1 and events .stl; as the original experiment~. but its probabilities equal P(.stl1I.M). These probabilities specify a new experiment because, as we have just shown, they satisfy the axioms. Figure 2.16 SEC. 2-3 CONDITIONAl. PROBABILITY AND INDEPENDENCE 49 Note, finally. 
that PC.51ti.M.> can be given the following mass interpretation. In the Venn diagram (fig. 2. 16), the event sl r. Ji. is the part of .5lt in .M., and P<.51ti.M.) is the mass in that region normalized by the factor 1/P(.M.). Total Probability and Bayes' 11zeorem In Example 2.32. we expressed the probability PC :fl.> of the event 9l in terms of the known conditional probabilities PC9iltll,) and P(t.ltl~h). The following is an important generalization. Suppose that [.51t 1 , • • • , .s4m I is a partition of f:J consisting of m events as shown in Fig. 2.17. We maintain that the probability P(tll) of an arbitrary event ?A of f:J can be written as a sum: P(?J3) = P(~l.51t,)PC.sd,) • Proof. Clearly [see (2-4)[. ~ = ze n f:l = ?A n (.51t 1 u · + · · · + P<:A!:Jl,)PC.iim) (2-61) · · u ~,> = <1A n .~·i,) u · · · u (?13 n .s4m) But the events ~ n sll; are mutually exclusive because the events sll; are mutually exclusive. Hence [see (2-20)]. P(iA) = P(::A n :.:d,) - • • · - PC-A r :~,.,) (2-62) And since P(~ n sll;) = P(tilj:A;)P(:4;), (2-61) follows. This is called the theorem of total probability. It is used to evaluate the probability P(~) of an event :13 if its conditional probabilities P(OOI:4;) arc known. Example 2.33 In a political poll, the following results are recorded: Among all voters, 70% are male and 30% are female. Among males, 4()9(. are Republican and 60"k Democratic. Among females. 45% are Republican and 55~ are Democr.nic. Find the probability that a voter selected at random is Republican. In this experiment, f:f ::=. {rm, rf. dm, ~f}. We form the events.« ={male},~ :.::. {female}. and~ = {Republican}. The results of the poll yield the following data PUO ::. .70 P(::i) .... 30 PM!.«) = .40 P(1/. 5) = .45 Hence fsee (2-61)1, P('til.) •·igure 2.17 = P(f1t ..,tf.)P(..tt) + P<;11. 3-)P(.:fJ-= .415 • 50 CHAP. 2 FUNDAMENTAL CONCEPTS • Bayes' Theorem. We show next that (2-63) This is an important result known as Bayes' theorem. It expresses the posterior probabilities P(.slf;IOO) of the events .slf; in terms of their prior probabilities P(sA.;). Its significance will become evident later. • Proof. From (2-53) it follows that P(st·l~> I = P(.slf; n ~) P(~) P(~l.slf·) = P(.slf; n ~) I P(.slf;) Hence, P<stl~> I = P(~l.slf;)P(.slf;) P(~) (2-64) Inserting (2-61) into (2-64), we obtain (2-63). Example 2.34 We have two coins: coin A is fair with P{heads} = J/2, and coin B is loaded with P{heads} = 213. We pick one of the coins at random, we toss it. and heads shows. Find the probability that we picked the fair coin. This experiment has four outcomes: ~ = {ah, at, bh, bt} For example, ah is the outcome "the fair coin is tossed and heads shows." We form the events .s4 = {fair coin} = {ah. ht} ~ = {loaded coin} = {bh. bt} 'Jl = {heads shows} = {ah, bh} The assumption that the coin is picked at random yields P(.s4) = P(~) = 112. The probability of heads assuming that coin A is picked equals P('Jll.s4) = 1/2. Similarly, PC'Jl!OO> = 2/3. This completes the specification of the model. Our problem is to find the probability P(.s4I'Jl) that we picked the fair coin assuming that heads showed. To do so, we use (2-63): P('Jll.s4)P(.s4) P<.s4I'H> = P<'HI.s4)P{.s4) + P<'HI~>P<98) • In the next example, pay particular attention to the distinction between the determination of the model data based on the description of the experiment and the results that follow deductively from the axioms. Example 2.35 We have four boxes. Box I contains 2,000 components, of which I ,900 are good and 100 are defective. 
Box 2 contains SOO components, of which 300 are good and 200 defective. Boxes 3 and 4 each contain I ,000 components, of which 900 are good and SEC. 2-3 CONDITIONAL PROBABILITY AND lNDEPENDI~NCE 100 defective. We select at random one of the boxes and pick component. Cit 51 random a single (a) find the probability Pu that the selected component is defective. (b) The selected component is defective: find the probability p, that it came from box 2. Model Spedfication The space !I of this experiment has 4,000 good (g) clements forming the event Cfi and 500 defective (d) clements forming the event a. The elements in each box form the events ~~ = {1.900g, IOOd} !A~ = {300g. 200d} ~l = {900g. IOOd} :11~ = {900g. IOOd} The information that the boxes are selected at nmdom leads to the assumption that the four events have the same probability. Hence lsee (2-3911. PC~,) = PC?A2 I = PC~~) = PC'til4 I . . :. ~ (2-65) The rclndom selection of a component from the ith box leads to the assumption that all elements in that box are conditionally equally likely. From this it follows that the conditional probability P(a ~.) that a component taken from the ith box is defective equals the proportion of defective components in that box. This is the extension of (2-32) to conditional probabilities. and it yields c; .-... 100 -· 05 P<~ w•d - 2.000 - · p, !Jl ) C:i: J.J~ P<~lflhl = ~~:::0 == P('.£ .I :11~) 200 - 4 - 500 - · = ~~: = .I (2-66) Deduction (a) from (2-62) and the foregoing it follows that the probability P(~) that the selective component is defective equals . I I P('::t) = .05 X 4- .4 X 4 +.I X 4I · .I X 4I = .1625 (b) The probability Pb that the defective component came from box 2 equals P(@hl~>. Hence [see (2-64)], P(~ ~) = P(~!002)P(002J = .4 x .25 = 615 21 P('it) .1625 . Thus the prior probability P(\~h) of selecting box ~ 2 equals .25; the po.fterior probability. assuming that the component is defective, equals .615. • t:mpiricallmerprt'tcltion. We perform the experiment n times. In 25% of the trials, box 2 is selected. If we consider only the n:t trials in which the selected part is defective, then in 61.5% of such trials it came from box 2. Example 2.36 In a large city, it is established that 0.5% of the population has contracted AIDS. The available tests give the correct diagnosis for 80% of healthy persons and for 98% of sick persons. A person is tested and found sick. Find the probability that the diagnosis is wrong, that is, that the person is actually healthy. 52 CHAP. 2 FUNDAMENTAL CONCEPTS We introduce the events sf.= healthy ~ = tested healthy C€ =sick= 3i ~ = tested sick = 9i The unknown probability is P(st'a). From the description of the problem it follows that P(sf.) = .995 P(<€) = .005 P(~lsf.) = .20 P(21~) = .98 Hence [see (2-61)], P(9l) = P(2!sf.)P(sf.) + P(9l ~)P(~) = .2039 This is the probability that a person selected at random will test sick. Inserting into (2-64), we conclude that P(sf.l~) = P(2isf.)P(sf.) = .1990 = 976 P(9l) .2039 . • Independent Events Two events sf. and ~ are called statistically independent if the probability of their intersection equals the product of their probabilities: P(st n ~) = P(sf)P(~) (2-67) 1 The word statistical will usually be omitted. As we see from (2-53), if the events~ and~ are independent, then P(sti~> = P(sf.) P(~lst) = P(~) (2-68) The concept of independence is fundamental in the theory of probability. However, this is not apparent from the definition. Why should a relationship of the form of (2-67) merit special consideration? 
The importance of the concept will become apparent in the context of repeated trials and combined experiments. Let us examine briefly independence in the context of relative frequencies. Empirical Interpretation. The probability P(sf.) of the event sf. equals the relative frequency ""'n of the occurrence of sf. in n trials. The conditional probability P(stla) of sf. assuming~ equals the relative frequence of the occurrence of sf. in the subsequence of n~ trials in which~ occurs [see (2-54)].1fthe events sf. and a are independent, then P(stla> = P(sf.); hence, n.., n = nsAn!A n~ (2-69) Thus if the events sf. and a are independent, the relative frequency of the occurrence of sf. in a sequence of n trials equals its relative frequency in the subsequence of na trials in which a occurs. This agrees with our heuristic understanding of independence. Example 1.37 In this example we use the notion of independence to investigate the possible connection between smoking and lung cancer. We conduct a survey among the following SEC. 2-3 CONDITIONAL PROBABILITY AND INDEPENDENCE. 53 four groups: cancer patients who are smokers (cs). cancer patients who are nonsmokers (c.f). healthy smokers (cs). healthy nonsmokers (cs). The results of the survey show that P(cs) = P1 P(c.'f) = P2 Pies) .:..: p, P(C.f) = P4 We next form the events <t = {cancer patients} -= {cs. d} 2 = {smokers} = {cs. cs} ~ n 2 = {cancer patients. smokers} = {cs} Clearly. P('~) = P1 - P2 P(':/.) - PI ~ P.' P(t n r,t) = PI If P('-f.. n ~) :/: P(<t)P(ra), that is. if P1 :/: lp 1 - p 2)(pl - pd. the events {cancer patients} and {smokers} are statistically dependent. :'~Jote that this reasoning does not lead to the conclusion that there is a causal relationship between lung cancer and smoking. Both factors might result from a common cause (work habits. for example) that has not been considered in the experimental model. • Example 2.38 Two trains, X and Y. arrive at a station at ntndom between 0:00 and 0:20A.M. The times of their arrival are independent. Train X stops for 5 minutes, and train Y stops for 4 minutes. (a) Find the probability p 1 that train X arrives before train Y. (b) Find the probability p 2 that the trains meet. X arrived before train Y. Model Spedfication An outcome of this experiment is a pair of numbers (x, y) where x and y are the arrival times of train X and train Y. respectively. The resulting space !f is the set of points in the square of Fig. 2.18a. The event :4 = {X arrives in the interval (1 1 • 12 )} =- {1 1 :s .\' :s 12} is a venical strip as shown. The assumption that x is a random number in the interval (0. 20) yields (c) Assuming that the trains meet. find the probability p 1 that tntin P(.s4) = lz - 11 20 The event ~ = {Y arrives in the interval (1 3 • 14 )} = {1 3 ~ y :s 14 } is a horizontal strip, and its probability equals P(~) The event .s4 n = l4- l3 20 911 is a rectangle as shown, and its probability equals P(.s4 n ~) = P(.s4)P{911) = (I• - 13 )(rz - 11 ) 20 X 20 This is the model form of the assumed independence of the arrival times. Thus the probability that (x, y) is in a rectangular set equals the area of the rectangle divided by 400. And since any event ~ can be expressed as a countable union of disjoint rectangles, we coqclude that P(~) = area of 2 400 54 CHAP. 2 FUNDAMENTAL CONCEPTS 4 0 (a) 20 ."C (b) Figure 2.18 This concludes the specification of if. We should stress again that the relationships are not derived; they are assumptions based on the description of the problem. 
Deduction (a) We wish to find the probability p 1 of the event~ = {X arrives before Y}. This event cQnsists of all points in~ such that x s y. Thus <e is a triangle. and its area equals 200. Hence, 200 PI = P(~) = 400 = .500 (b) The trains meet iff x s y + 4. because train Y stops for 4 minutes. andy s x + 5, because train X stops for 5 minutes. Thus the trains meet iff the event~= {-5 s x- y s 4} occurs. This event is the region of Fig. 2.18b consisting of two trapezoids. and its area equals 159.5. Hence, ,. 159.5 399 P2 = P(':t) = 400 = · (c) The probability that X arrives before Y (i.e .. that event~ occurred) assuming that the trains met (i.e., that event~ occurred) equals P(<€1~). Clearly, '€ n a is a trapezoid. and its area equals 72. Hence, _ G _ P3 - P<<€1:-t) - P<~ n a> _ ..E._ _ - 1595 - .451 This example demonstrates the value of the precise model specification in the solution of a probabilistic problem. • Example 2.39 P(a) Here we shall use the notion of independence to complete the specification of an experiment formed by combining two unrelated experiments. SEC. 2-3 CONDITIONAL PROBABILITY AND INDEPENDENCE 55 We are given two experiments. The first is a fair die specified by the model :1, = Ui . ... . J~} P{Ji} .., ~ and the second is a fair coin specified by the model :f~ = {h. t} P{lr} = P{t} I ,,. .; We perform both experiments. and we wish to find the probability p that 5 shows on the die and head.f on the coin. If we make the reasonable assumption that the two experiments are independent. it would appear from 12-67) that we should conclude that (2-70) This conclusion is correct; it does not. however. follow from 12-67). In that equation, as in our entire development. we dealt only with subsets of a sin1de space. To accept (2-70) not merely as a heuristic statement but as a conclusion that follows logically from the axioms in a single probabilistic model. we must construct a new experiment ff in which 5 and heads are subsets. This is done as follows. The space ff of the new experiment consists of 12 outcomes. namely. all pairs of objects that we can form taking one from~. and one from :1~: :t = {f,h,Jih •. .. ,f,t,Jf.t} Thus ff = :/1 x f/ 2 is the Cartesian product of the sets :1 1 and :t~ [see 12-5)]. In this experiment, 5 is an event sll consisting of the two outcomes j~lr and f,r; heads is an event tA consisting of the six outcomes jjh • . . . . f6h. Thus sll = {5} = {fch. fit} ~ = {heads} = {jjh, •.. .f,Jt} To complete the specification of:t, we must assign probabilities to its subsets. Since {5} and {heads} are events in the original experiments. we must set P(sll) = !6 P(dll =- !2 In the experiment:/, the event {5 on the die and heads on the coin} is the intersection sll n ~ {hh} of the events sll and ffi. To find the probability of this event, we use the independence of sll and~. This yields P(al n :1\l = 1/6 x 112 = 1112. in agreement with (2-70). • We show next that if the events .'12 and ?A are independent. the events sl and ?13 are also independent: then P(!A n :M = Pla>P<:A> <2-71) If P<:A n ?A> = P<dl>P<11l>. • Proof. As we know, :11. u S4 = ~ Hence, P(dl.) + P(.i) = follows that P(:A n :1:\) = :1\ ""' {.lll n :1u u 1:4 n :-Jl) I and P(OO) = P(sfJ. n :1\) + P(.~ n :1l). From this it P(~) - P<:A n = (I - P(dl.))P(9A) ~) = PH·M - P(:tl)P(:1.1) = P(-;t)P(t/3) and (2-71) results. We can similarly show that the events :A and :13 arc also independent. 56 CHAP. 
2 FUNDAMENTAL CONCEPTS Figure 1.19 Generalization The events .54 1 , • • • , .54, are called mutually statistically independent, or, simply, independent if the probability of the intersection of any number of them in a group equals the product of probabilities of each event in that group. For example. if the events .'ilf 1 • .542 , and .54 3 are independent, then P(-54, P(.s4, n .'ilf2) n .'ilf3) = P(.s4,)P(.s43) P(JA, n .'ilf2 n af3) = P(.s4, )P(.s4:!) P(.'ilf2 n af3) = P(.'ilf2)P(.'ilf3) (2-72) (2-73) Thus the events .s4,, .'ilf2, .'ilf3 are independent, if they are independent in pairs and their intersection satisfies (2-73). As the next example shows, three events might be independent in pairs but not mutually independent. Example 2.40 = P(.'ilfJ)P(-542 )P(af3) The events s4, :1, and '€ are such that (Fig. 2.19) P(af) = P(li) = P('€) = !5 P(af P(af n ~ n '-€) = ..!.. 25 n ~) = P(sl n <€) = P(OO n '€) = 2~ This shows that they are independent in pairs. However, they are not mutually independent because they do not satisfy (2-73). • Problems 2-1 2-2 Show that: (a) if sf u 00 = sf n ~.then sf = ~;(b) sf u (~ n '€) c (s4 u 00) n '€; (c) if af 1 c sf, 00 1 c 00, and sf n ~ = 0, then at, n :1 1 = 0. Iff/={-~< I< oo},s4 = {4 SIS 8}, 00 = {7 sIS 10}, find sf U 00, sf n ~.i', and st n 00. PROBLEMS 1·3 2-4 2·5 2-7 2·9 2-10 2·11 2-12 2·13 2-14 2·15 57 The set 9' consists of the 10 integers I to 10. Find the number of its subsets that contain the integers 1. 3, and 7. (De Morgan's law) Using Venn diagrams, show that stl u 9l = ~ n j and stl n li = :9i u i. Express the following statements concerning the three events stl, li\, and~ in terms of the occurrence of a single event: (a) none occurs; (b) only one occurs; (c) at least one occurs; (d) at most two occur: (e) two and only two occur: (I) stl and fi occur but <(S does not occur. If P(stl) = .6. P(OO) = .3, and P(stl n 9l) = .2. find the probabilities of the events :9i u !i. :9i u i, stl n i. and :si n i. Show that: (a) PC::A U ~ U ~) = P(stl) - PC9l) - P('f.) - P(.ltt n 9l) - P(.ltt n ~) - P(!i n <€) + P(stl n 98 n <(S); (b) (Boote's inequality) P(stl 1 U • • • U stl,) s P(stll) T • • .... P(stl, ). Show that: (a) if P(stl) = P(~) = I, then P<stl n :1.1) = I: (b) P~(.<A n ?A) s P(stl)P(\it) and P<::A n 91\) s [PC.<A) -r PC~J.Ul/2. Find the probability that a bridge hand consists only of cards from 2 to 10. In a raffle, 100 tickets numbered I to 100 are sold. Seven are picked at random, and each wins a prize. (a) Find the probability that ticket number 27 wins a prize. (b) Find the probability that each of the numbers 5. 15, and 40 wins a prize. A box contains 100 fuses, 10 of which are defective. We pick 20 at r.mdom and test them. Find the probability that two will be defective. If P(stl) = .6, P(stl U ~) = .8, and P(stli~J = .5. find PC~). Show that P(stl) = PC::AI.«>P<.M.> + P<stliM>P<Ah Show that PCstl n 00 n ~ n ~> = P(.<A:~ n <€ n ;:cJP<?AI'€ n ~>P<~I~>P<~). Show that: (a) if ::4 n 00 = 0. then P(stl) ,.,. ..~~ '.A) P(:1i) P( ON .:4 U ,,.J = P(.<;4) + P(?A) and P(li lstl) = I _ P(::A) (b) P(Sf I.M.> = I - P(stli.M.). 2-16 We receive 100 bulbs of type A and 200 bulbs of type B. The probability that a bulb will last more than three months is .6 if it is type A and .8 if it is type B. The bulbs are mixed, and one is picked at random. Find the probability p 1 that it will last more than three months: if it does. find the probability p 2 that it is type A. 
2-17 The duration t of a telephone call is an element of the space S = {t ≥ 0}, specified in terms of the function α(t) = (1/c)e^{-t/c}, c = 5 minutes, as in (2-43). Find the probabilities of the events A = {0 ≤ t ≤ 10} and B = {t ≥ 5}; find the conditional probability P(A | B).
2-18 The events A and B are independent, and B ⊂ A. Find P(A).
2-19 Can two events be independent and mutually exclusive?
2-20 Two houses A and B are far apart. The probability that in the next decade A will burn down equals 10^-3, and the probability that B will burn down equals 2 × 10^-3. Find the probability p1 that at least one and the probability p2 that both will burn down.
2-21 Show that if the events A, B, and C are independent, then (a) the events A and B̄ are independent; (b) the events A, B̄, and C̄ are independent.
2-22 Show that 11 equations are needed to establish the independence of four events; generalize to n events.
2-23 Show that four events are independent iff they are independent in pairs and each is independent of the intersection of any of the others.
2-24 A string of Christmas lights consists of 50 independent bulbs connected in series; that is, the lights are on if all bulbs are good. The probability that a bulb is defective equals .01. Find the probability p that the lights are on.

3  Repeated Trials

Repeated trials have two interpretations. The first is empirical: We equate the probability P(A) of an event A defined on an experiment S to the ratio n_A/n, where n_A is the number of successes of A in n repetitions of the underlying physical experiment. The second is conceptual: We form a new experiment S_n = S × ··· × S, the elements of which are sequences ζ1 ··· ζn, where ζi is any one of the elements of S. In the first interpretation, n is large; in the second, n is arbitrary. In this chapter, we use the second interpretation of repeated trials and determine the probabilities of various events in the experiment S_n.

3-1 Dual Meaning of Repeated Trials

In Chapter 1, we used the notion of repeated trials to establish the relationship between a theoretical model and a real experiment. This was based on the approximation P(A) ≈ n_A/n relating the model parameter P(A) to the observed ratio n_A/n. In this chapter, we give an entirely different interpretation to the notion of repeated trials. To be concrete, we start with the coin experiment.

REPEATED TOSSES OF A COIN  The experiment of the single toss of a coin is specified in terms of the space S = {h, t} and the probabilities of its elementary events
P{h} = p    P{t} = q    (3-1)
Suppose that we wish to determine the probabilities of various events involving n tosses of the coin, for example, the probability that in 10 tosses, 7 heads will show. To do so, we must form a new model S_n, the outcomes of which are sequences of the form
ζ1 ··· ζi ··· ζn    (3-2)
where ζi is h or t. The space S_n so formed is written in the form
S_n = S × ··· × S    (3-3)
and is called a Cartesian product. This is a reminder of the fact that the elements of S_n are sequences as in (3-2), where ζi is one of the elements of S. Clearly, there are 2^n such sequences; hence, S_n has 2^n elements. The experiment S_n cannot be specified in terms of S alone; additional information concerning the multiple tosses must be known. We shall presently show that if the tosses are independent, the model S_n is completely specified in terms of S.
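For small n, the Cartesian product S_n can be written out explicitly. The following is a minimal sketch (in Python; the labels "h" and "t" and the helper name toss_space are ours, chosen only for illustration) that constructs the space and confirms that it has 2^n elements:

```python
from itertools import product

# The space S_n = S x ... x S of n coin tosses: all sequences of 'h' and 't'.
def toss_space(n):
    return list(product("ht", repeat=n))

S3 = toss_space(3)
print(len(S3))                          # 2**3 = 8 outcomes
print(["".join(seq) for seq in S3])     # hhh, hht, hth, htt, thh, tht, tth, ttt
print(len(toss_space(10)))              # 2**10 = 1024 outcomes
```

As the text stresses, listing the space is not enough: the probabilities of its elementary events must still be assigned, and it is here that the independence assumption enters.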
Independence in the context of a real coin means that the outcome of a particular toss is not affected by the outcomes of the preceding tosses. This is in general a reasonable assumption. In the context of a theoretical model, independence will be interpreted in the sense of (2-67). As preparation, we discuss first the special case n = 3.

Example 3.1  A coin tossed n = 3 times generates the space S_3 = S × S × S consisting of the 2^3 = 8 outcomes
hhh  hht  hth  htt  thh  tht  tth  ttt
We introduce the events
H_i = {heads at the ith toss}    T_i = {tails at the ith toss}    (3-4)
and we assign to these events the probabilities
P(H_i) = p    P(T_i) = q    (3-5)
This is consistent with (3-1). Using (3-5) and the independence of the tosses, we shall determine the probabilities of the elementary events of S_3. Each of the events H_i and T_i consists of four outcomes. For example,
H_1 = {hhh, hht, hth, htt}    T_1 = {thh, tht, tth, ttt}
The elementary event {hhh} can be written as the intersection of the events H_1, H_2, H_3; hence,
P{hhh} = P(H_1 ∩ H_2 ∩ H_3)
From the independence of the tosses and (2-73), it follows that
P(H_1 ∩ H_2 ∩ H_3) = P(H_1)P(H_2)P(H_3) = p^3
hence, the probability of the event {hhh} equals p^3. Proceeding similarly, we can determine the probabilities of all elementary events of S_3. The result is
P{hhh} = p^3    P{hht} = p^2 q    P{hth} = p^2 q    P{htt} = pq^2
P{thh} = p^2 q    P{tht} = pq^2    P{tth} = pq^2    P{ttt} = q^3
Thus the probability of an elementary event equals p^k q^{3-k}, where k is the number of heads. This completes the specification of S_3. To find the probability of any event in S_3, we add the probabilities of its elementary events as in (2-30). For example, the event A = {two heads in any order} consists of the three outcomes hht, hth, thh; hence,
P(A) = P{hht} + P{hth} + P{thh} = 3p^2 q    (3-6)

A coin tossed n times generates the space S_n consisting of 2^n outcomes. Its elementary events are of the form {k heads in a specific order}. We shall show that
P{k heads in a specific order} = p^k q^{n-k}    (3-7)
We introduce the events H_i and T_i as in (3-4) and assign to them the probabilities p and q as in (3-5). To prove (3-7), it suffices to express the elementary event as an intersection of the events H_i and T_i. Suppose, to be concrete, that the outcome is hth ··· h. In this case,
{hth ··· h} = H_1 ∩ T_2 ∩ H_3 ∩ ··· ∩ H_n
where the right side contains k events of the form H_i and n - k events of the form T_i. From the independence of the tosses and (2-73), it follows that the probability of the right side equals p^k q^{n-k}, as in (3-7).

Example 3.2  Find the probability p_a that the first 10 tosses of a fair coin will show heads and the probability p_b that the first 9 will be heads and the next one tails. In the experiment S_10, the events
A = {10 heads in a row}    B = {9 heads, then tails}
are elementary. With p = q = 1/2 and n = 10, (3-7) yields
P(A) = 1/2^10    P(B) = 1/2^10
Thus, contrary to a common impression, the events A and B are equally rare.

Note  Equation (3-7) seems heuristically obvious: We have k heads and n - k tails; the probability for heads equals p and for tails q; the tosses are independent; hence, (3-7) must be true. However, although heuristics lead in this case to a correct conclusion, it is essential that we phrase the problem in terms of a single model and interpret independence in terms of events satisfying (2-67).
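A quick numerical check of (3-7) and of Example 3.2, together with the probability of 7 heads in 10 tosses mentioned at the start of the section. This is a sketch in Python for a fair coin; the closed form C(n, k) p^k q^{n-k} used in the last line anticipates formula (3-8) derived next.

```python
from itertools import product
from fractions import Fraction
from math import comb

p = q = Fraction(1, 2)          # fair coin
n = 10

# Probability of an elementary event of S_n: p**k * q**(n-k), where k is the
# number of heads, as in (3-7).
def prob(seq):
    k = seq.count("h")
    return p ** k * q ** (n - k)

space = list(product("ht", repeat=n))
assert sum(prob(s) for s in space) == 1        # the elementary probabilities add up to 1

# Example 3.2: ten heads in a row, and nine heads followed by a tail.
print(prob(tuple("hhhhhhhhhh")), prob(tuple("hhhhhhhhht")))     # both equal 1/1024

# The event {7 heads in any order}, summed outcome by outcome.
p7 = sum(prob(s) for s in space if s.count("h") == 7)
print(p7, Fraction(comb(n, 7), 2 ** n))        # 15/128 both ways
```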
We shall now show that the probability that we get k heads (and n - k tails) in any order equals P{k heads in any order} = ( Z) pAq"-4 (3-8) • Proof. The event {k heads in any order} consists of all outcomes formed with k heads and n - k tails. The number of such outcomes equals the number of ways that we can place k heads and 11 - k tails on a line. As we have shown in Section 2-1, this equals the combinations CZ = ( ~) of n 62 CHAP. 3 REPEATED TRIALS objects taken kat a time [see (2-1011. Multiplying by the probability p£q" £of each elementary event. we obtain (3-8). For n = 3 and k = 2. (3-8) yields 1'{2 heads in any order} = ( ~) p1q ::.: 3p1q in agreement with (3-6). Example 3.3 A fair coin is tossed 10 times. Find the probability that heads will show 5 times. In this problem. I p=q=i k -: 5 11 = 10 and (3-8) yields I I 252 . P{5 head s m any ord er} = ( 10) 5 x 2'" x ~ == 1.024 Example 3.4 • We have two coins as in Example 2.34. Coin A is fair. and coin 8 is loaded with P{h} = 2/3. We pick one of the coins at r.mdom. toss it 10 times. and observe that heads shows 4 times. Find the probability that we picked the fair coin. The space of this experiment is a Cartesian product 9'11 x ~ 111 where~~~= {c1, b} is the selection of a coin and U' 10 is the toss of a coin 10 times. Thus this experiment has 2 x 210 outcomes. We introduce the events ~ = {coin A tossed 10 times} ~ = {coin 8 tossed 10 times} '!IJ = {4 heads in any order} Our problem is to find the conditional probability P(:41~t). From the randomness of the coin selection, it follows that P(~) = P(~) = 112. If .<;4 is selected. the probability P(~l.llfl that 4 heads will show is given by (3-81 with p = q = 1/2. Thus " I I P(~~~) = ( 10) 4 X p X ~ . P(':tj:~) = ( 10 4 ) X (3 2 )• X (3 I )" Inserting into Bayes' theorem 12-63). we obtain r. _ P(~~l.?liP(sf) P(:A l~~ - P(!:tls41P!.9'l) + P(~ti:~)P(~) Example 3.5 I + 2 ,.., 3 111 = .783 • A coin with P{h} ::. p is tossed " times. Find the proballility Pu that at the first n - I tosses. heads shows k - I times. and at the 11th toss. heads shows. First Solution In n tosses. the probability of k heads equals pAq"-£. There are {~ =:) ways of obtaining k - I heads at the first 11 - I trials and heads at the 11th trial: hence. Pu = ( nk - I) pq" I £ (3-9) SEC. 3-J DUAL MEANING OF REPEATED TRIALS Second Solution The probability uf k · I heads in (z ~ :) p' IC/'" It I( 63 I trials equals 11 - (J.J0) II The probability of heads at the 11th trial equab I'· Multirlying !3-IOJ hy p (independent tusses). we obtain (3-9). Fur k = 1. equation 13-9) yields fl.,- pet 1 (J-Ill This is the probability thut heads will show ut the nth tm•., hut not before. • .Vot(' Example 3.5 can he given u different interpretation: If we toss the coin an infinite number of times. the probability that heads will show at the nth toss but not before equals pc(' 1• In this interrrctatiun. the underlying experiment is the space !f., of the infinitely many tusses. Probability Tree In fig. ~.I we give a graphical representation of repeated trials. The two horizontal segments under the letter !f 1 represent the two clements lr and t of the experiment ~1' 1 - :1. The probabilities of {h} and {t} arc shown at the end of each segment. The four horiLontal segments under the letter !f~ represent the four clements of the space !f~ - ~f x ~f of the toss of a coin twice. The probabilities of the corresponding elementary events arc shown at the end of each segment. Proceeding similarly. 
we can form a tree representing the experiments !/ 1• • • • • :1, fur any 11. •·igure 3.1 " , hilt lull Ill ~I"' Vt.t: = jill!. Ill, til. II: ~~ = ihllil. illlt. iltil. lut. "" lI ,2 l pq l lilt ~~2 till! till!. tilt. rtil. rrr: til 1pq I tilt I If tr/1 II 11[2 I ((( 64 CHAP. 3 REPEATED TRIALS Dual Meaning of Repeated Trials The concept of repeated trials has two fundamentally different interpretations. The first is empirical. the second conceptual. We shall explain the difference in the context of the coin experiment. In the first interpretation. the experimental model is the toss of a coin once. The space ~consists of the two outcomes lr and t. A trial is a single toss of the coin. The experiment is completely specified in terms of the probabilities P{h} = p and P{t} = q of its elementary events {h} and {t}. Repeated trials are thus used to determine p and q empirically: We toss the real coin n times, and we set p = 111,/n where n,, is the observed number of heads. This is the empirical version of repeated trials. The approximation p == n1,1n is based on the assumption that n is !ilt/fidently large. In the second interpretation. the experiment model is the toss of a coin n times. where n is any number. The space f!, is now the Cartesian product ~n = ~ x · · · x ~consisting of2" outcomes of the form hth . .. h. A single trial is the toss of the coin n times. This is the conceptual interpretation of repeated trials. In this interpretation. all statements are e.mct. and they hold for any n, large or small. If we wish to give a relative frequency interpretation to the probabilities in the space ~~,. we must repeat the experiment of the n tosses of the coin a large number of times and apply ( 1-1 ). 3-2 Bernoulli Trials Using the coin experiment as an illustration. we have shown that if ff = {C 1 • Cz} is an experiment consisting of the two outcomes C1 • C2 and it is repeated n times, the probability that C1 will show k times in a specific order equals p 4qn- 4• and the probability that it will show k times in any order equals p,(k) = ( ~) p 4q" ., p = P{Cd The following is an important generalization. Suppose that~ is an experiment consisting of the elements C;. Repeating ~ n times, we obtain a new experiment ~n =~ X ~~ X • • • X ~ (Cartesian product). The outcomes of this experiment are sequences of the form (3-12) where f; is any one of the elements C; of ~f. Consider an event .54 of~~ with P(J4) = p. Clearly. the complement .54 is also an event with P(~) = I - p = q. and [.54, :s.i I is a partition of~. The ith element f; of the sequence (3-12) is an element of either .54 or ~. Each sequence of the form (3-12) thus genercltes a sequence ~ = .54.54.54 ••• sA (3-13) SEC. 3-2 BERNOULLI TRIALS 65 where we place .stl at the ith position if .stl occurs at the ith trial. that is, if e; E .~;otherwise. we place :si. All sequences of the form of(3-13) are events of the experiment ~fn. Clearly. 00 is an event in the space !f,: we shall determine its probability under the assumption that the repeated trials of!/' are independent. From this assumption it follows as in (3-7) that P(,ql.~ ....~) = P<:A>PC.ii> · • · P<.>'ll = pq · • · p (3-14) If in the sequence ~ the event .stl appears k times, then the event Si appears " - k times and the right side of (3-14) equals pkqn·A. Thus (3-15) P{.-.4 occurs k times in a specific order} == p'q"- 4 We shall next determine the probability of the event '.:JJ = {.~ occurs k times in any order} • Fundamental Theorem. In p,(k) 11 independent trials. 
the probability = P{~ occurs k times in any order} that the event .~ will occur k times (and the event :!i " equals • Proof. There are CZ = ( ~) k times) in any order events of the form {.<fl occurs k times in a specific order}. namely. the ways in which we can place ,<4 k times and .Si" k times in a row (see (2-IO)j. Furthermore, all these events are mutually exclusive, and their union equals the event {.~·1 occurs 1.: times in any order}. Hence. (3-16) follows from 0-151. Example 3.6 A fair die is rolled seven times. Find the probability tH21 that 4 will show twice. The original experiment !/"is the single roll of the die and-'·~ """ {./4}. Thus I 5 Pl.·-~1 -= (, With" = 7 and J.. Pl.'!ll - 6 = 2. 0-16) yields 7! I ~ 5 ' P7121 '· 2!;! ((;) ((,) - .234 Example 3.7 • A pair of fair dice is rolled four times. Find the probability p4(0) that II will not show. In this case. !f is the single roll of two dice and :.4 .., {J;.t~..f,h}. Thus P(.'.~) -=- ., f{; P(.•.4) =- '\4 Jfl 66 CHAP. 3 REPEATED TRIALS With n = 4 and k = 0. {3-16) yields 34) = .796 = (36 4 p~(O) • We discuss next a number of problems that can be interpreted as repeated trials. Example 3.8 Twenty persons arrive in a store between 9:00 and 10:00 A.M. The arrival times are random and independent. Find the probability p, that four of these persons arrive between 9:00 and 9: 10. This can be phrased as a problem in repeated trials. The original experiment 9 is the r.mdom arrival of one person and the repeated trials are the arrivals of the 20 persons. Clearly, sf= {a specific person arrives between 9:00 and 9: I0} is an event in~. and its probability equals PC.vl) = 10 60 [see (2-46)J because the arrival time is random. In the experiment ~t3, of the 20 arrivals, the event {four people arrive between 9:00 and 9: IO} is the same as the event {sf occurs four times}. Hence. 4 Pu Example 3.9 20! (') = p:!C){4) = 4!16! 6 (5)'" 6 = .02 • In a lottery. 2,000 persons take part. each selecting at random a number between I and 1.000. The winning number is 253 (how this number is selected is immaterial). (a) Find the probability p, that no one will win. (b) Find the probability Ph that two persons will win. (a) Interpreting this as a problem in repeated trial~o. we consider as the original experiment ~ the r.mdom selection of a number N between I and 1.000. The space ~ has 1,000 outcomes. and the event sf = {N = 253} is an elementary event with probability P(sf) = P{N = 253} = .001 The selection of 2,000 numbers is the repetition of~ 2.000 times. It follows, therefore. from (3-16) with k = 0 and p = .001 that Pu = (.999)2•000 = e- 2 = .135 (b) If two persons win. sf occurs twice. Hence. Pb = ( 2 ·~) (.001)2(.999) 1•998 = .27 • SEC. Example 3.10 3-2 BERNOULLI TRIALS 67 A box contains K white and N - K black cards. (a) With replacements. We pick a card at random. examine it. and put it back. We repeat this process n times. Find the probability p., that k of the picked cards are white and n - k are black. Since the picked card is put back. the conditions of the experiment remain the same at each selection. We can therefore apply the results of repeated trials. The original experiment is the selection of a single card. The probability that the selected card is white equals KIN. With P= K the probability that k out of the p, = ( N- K N 11 q = -~v- selections arc white equals 1Jn (N)K ' ( 1 - K )" N ' (3-11) (b) Without replllc:emems. We again pick a card from the box. but this time we do not replace it. We repeat this process n times. 
Find the probability Ph that k of the selected cards are white. This time we cannot use (3-16) because the conditions of the experiment change after each selection. To solve the problem. we proceed directly. Since we are interested only in the total number of white cards. the order in which they are selected is immaterial. We can consider. therefore. the n selections as a single outcome of our experiment. In this experiment. the possible outcomes are the number (:)of ways of selecting n out of N objects. Furthermore. there arc ( ~) ways of selecting k out of K white cards. and k ways of selecting n (Nn -·_ K) k out of the N - K black cards. Hence lsee also (2-41)). Ph is given by the hypC'rpeometric series pN) (. N- pN) ( k ll - k Ph .:: - - - - - - · - (3-18) (7r) Note that if k << K and n « N. then after each drawing, the number of white and black cards in the box remains essentially constant; hence. p., == p,. • We determine next the probability P{k 1 s k s k~} that the number k of successes of an event .s4 in n trials is between k. 1 and k~. 68 CHAP. 3 REPEATED TRIALS • Theonm (3-19) where p = P(dl.). • Proof. Clearly, {k 1 s k s k2 } is the union of the events ~4 = {dl. occurs k times in any order} These events are mutually exclusive, and P(OOd = p,.(/<) as in (3-16). Hence, (3-19) follows from axiom Ill on page 33. Example 3.11 An order of 1.000 parts is received. The probability that a part is defective equals .I. Find the probability p, that the total number of defective parts does not exceed 110. In the context of repeated trials. ff is the arrival of a single part. and .91 = {the part is defective} is an event of~~ with P(.si) = .I. The arrival of 1.000 parts generates the space ~,. of repeated trials. and p, is the probability that the number of times .91 occurs is between 0 and 110. With p = .I t1 =0 k2 = 110 n = 1.000 (3-191 yields • Example 3.12 We receive a lot of N mass-produced objects. Of these, K are defective. We select at random n of these objects and inspect them. Based on the inspection results. we decide whether to accept or reject the lot. We use the following acceptance test. Simple Sanaplinc Suppose that among the n sampled objects, k are defective. We choose a number leo, depending on the particular application, and we accept the lot if k s /co. If k > k0 • the lot is rejected. Show that the probability that a lot so tested is accepted equals ~ , .. 0 (PN) (NpN) k n - k K N (3-20) p=- (Z) • Proof. The lot is accepted if k s k0 • It suffices therefore to find the probability p, that among the n sampled components. k will be defective. As we have shown in Example 3.10, Pt is given by the hypergeometric series (3-18). Summing fork from 0 to lco. we obtain (3-20). • We shall now examine the behavior of the numbers p,.(k) = ( ~) ptq"-' (kn) = k!(ll n!- k)! for fixed n, as k increased from 0 to n (see also Problem 3-8). SEC. 3-2 BERNOULLI TRIALS 69 p .. •5 = :!0 II 0 :! 20 km k p,(kl p = .3 = 20 II ~0 k k If p = 112. then p,(k.) is proportional to the binomial coefficients ( ~ ): p,(k) = n I (k) 2" In this case. p,(k) is symmetrical about the midpoint n/2 of the interval (0. n). If n is even. it has a sinl(le maximum fork= k, = n/2; if n is odd. it has two maxima n-1 n t I and k.-= k· = - k = k., = -y- - 2 If p :1= 112. then p,(k.) is not symmetrical: its maximum is reached for k """ np. Precisely. if (11 + I )p is not an integer. then p,(k) has a single maximum fork. = k, = Un + I )p). * If (n + I )pis an integer. 
then p,(k) has two maxima: k = k., = (n + J)p - I and k. ·~ k.~ - (lr + l)p These results are illustrated in Fig. 3.2 for the following cases: I. 2. 3. 4. n = 20 n::... II p= .5 p::. .5 = 2() p= .3 p= .4 II n=9 k., .:.. llp ..:.. 10 k., = (11 ·- l)p = 5 k~ ..:. (n + l)p (II + l)p = 6.3 k, = 16.3) = 6 l)p.:.. 4 (II k, ::. 3 k~.:.:::. 4 .. The significance of the curves shown is explained in Section 3-3. • (x) means ··the largest" integer smaller than x. ·· =6 70 CHAP. 3 REPEATED TRIALS 3-3 Asymptotic Theorems In repeated trials. we are faced with the problem of evaluating the probabilities 11(11 - 1) • • • (11 - k + 1) p,(k) = p4q" 4 (3-21) 1. 2 . . . k and their sum (3-19). For large n. this is a complicated task. In the following. we give simple approximations. The Normal Curves We introduce the function 1 ,![(X) = - - t' ,. ·:~ (3-22) \12; and its integral (3-23) These functions are called (standard) twmwl or Gtm.uian curves. As we show later. they are used extensively in the theory of probability and statistics. Clearly. 1((-X) = ,![(X) Furthermore (see the appendix to this chapter), • J. . . -- yf2; t' ''~dx = 1 (3-24) From this and the evenness of R(XI it follows that G(l) = 1 G(O) =~ G(-x) = ~- G(x) (3-25) In Fig. 3.3 we show the functions R(X) and G(x). and in Table* Ia we tabulate G(x) for 0 s x s 3. For x > 3 we can use the approximation (i(x) == 1 - !X .~[(X) (3-26) De Moivre-Lap/ace Theorem It can be shown that for large n, Pn(k) can be approximated by the samples of the normal curve R(X) properly scaled and shifted: (3-27) • All tables arc at the back of the bunk. SEC. 3-3 ASYMPTOTIC THF.OREMS 7) g(xl Cilxl 0 •·igure 3.3 where tr -= vlijhi T/ -= np (3-28) In the next chapter. we give another interpretation to the function g(x) (density) and to the constants 71 (mean) and tr (standard deviation). The result in (3-271 is known as the l>t• Moivn·-I.ap/ace tlwort•m. We shall not give its ntther difficult proof here: we shall only comment on the range of values of 11 and k for which the approximation is satisfactory. The standard normal curve takes significant values only fur lxl < 3; for lxl > 3. it is negligible. The scaled and shifted version of the interval (- 3. 3) is the interval f) ..: (T/ - 3tr. 71 + 3tr). The rule-of-thumb condition for the validity of {3-27) is as follows: If the interval I> is in the interior of the interval (0. II), 13-271 is a satisfactory approximation for every k in the interval D. In other words. (3-27) can he used if () < lip'- 3 Vllpq < k < ,, I 3 v;;pq < II (3-29) Note. finally. that the approximation is best if p,l k) is nearly symmetrical. that is. if p is close to .5. and it deteriorates if p is close to 0 or to I. We list the exact values of p,.(k) and its approximate values obtained from (3-27) for n = 8 and for p = .5. As we see. even for such moderate values of n, the approximation error is small. 72 CHAP. 3 REPEATED TRIALS k Pa(/c) 2 I 0 3 5 4 6 7 8 .004 .031 .109 .219 .273 .219 .109 .031 .004 approx. .005 .030 .104 .220 .282 .220 .104 .030 .005 Example 3.13 A fair coin is tossed 100 times. Find the probability that heads will show k = 50 and /c = 45 times. In this case. 11 = 100 p = q = .5 llP = 50 11pq = 25 Condition (3-29) yields 0 <50- 15 < k <50 + 15 < 100: hence. the approximation Pn(k) == S~ ,.-!4 ~~~~~· is satisfactory provided that /c is between 35 and 65. 
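The quality of the approximation (3-27) is easy to check numerically. The following sketch (plain Python, using only math.comb; the values of n, p, and k are those of the example) compares the exact binomial probabilities with the normal-curve values:

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.5
q = 1 - p
eta, sigma = n * p, sqrt(n * p * q)      # eta = np and sigma = sqrt(npq), as in (3-28)

for k in (45, 50, 55):
    exact = comb(n, k) * p ** k * q ** (n - k)
    approx = exp(-(k - eta) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
    print(k, round(exact, 5), round(approx, 5))
```

For these values the two columns agree to about three decimal places. Returning to the example: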
Thus, by (3-27),
P{k = 50} ≈ 1/(5√(2π)) = .08    P{k = 45} ≈ (1/(5√(2π))) e^{-1/2} = .048

As we have seen, the probability that the number k of successes of an event A in n trials is between k1 and k2 equals
P{k1 ≤ k ≤ k2} = Σ_{k=k1}^{k2} C(n, k) p^k q^{n-k}
Using the De Moivre-Laplace theorem (3-27), we shall give an approximate expression for this sum in terms of the normal curve G(x) defined in (3-23).

• Theorem
Σ_{k=k1}^{k2} C(n, k) p^k q^{n-k} ≈ G((k2 - np)/√(npq)) - G((k1 - np)/√(npq))    (3-30)
provided that k1 or k2 satisfies (3-29) and
σ = √(npq) >> 1    (3-31)

• Proof. From (3-27) it follows that
Σ_{k=k1}^{k2} C(n, k) p^k q^{n-k} ≈ (1/(σ√(2π))) Σ_{k=k1}^{k2} e^{-(k-η)²/2σ²}
The right side is the sum of the k2 - k1 + 1 samples of the function
f(x) = (1/(σ√(2π))) e^{-(x-η)²/2σ²} = (1/σ) g((x - η)/σ)
for x = k1, k1 + 1, . . . , k2. Since σ >> 1 by assumption, the function f(x) is nearly constant in the interval (k, k + 1) of unit length (Fig. 3.4a); hence, its area in that interval is nearly f(k). Thus
(1/(σ√(2π))) Σ_{k=k1}^{k2} e^{-(k-η)²/2σ²} ≈ (1/(σ√(2π))) ∫_{k1}^{k2} e^{-(x-η)²/2σ²} dx    (3-32)
With the transformation y = (x - η)/σ, dx = σ dy, the right side of (3-32) equals [see (3-23)]
(1/√(2π)) ∫_{(k1-η)/σ}^{(k2-η)/σ} e^{-y²/2} dy = G((k2 - η)/σ) - G((k1 - η)/σ)    (3-33)
and (3-30) results.
The approximation (3-27) on which (3-30) is based holds only if k is in the interval (η - 3σ, η + 3σ). However, outside that interval the numbers p_n(k) and the corresponding values of the exponential in (3-27) are negligible compared with the terms inside the interval. Hence, (3-30) holds so long as the sum contains terms in the interval (η - 3σ, η + 3σ). Note in particular that if k1 = 0, then
G((k1 - np)/√(npq)) = G(-np/√(npq)) < G(-3) ≈ .001 ≈ 0
because [see (3-29)] np > 3√(npq); hence,
Σ_{k=0}^{k2} C(n, k) p^k q^{n-k} ≈ G((k2 - np)/√(npq))    (3-34)
provided that k2 is between η - 3σ and η + 3σ.

Example 3.14  The probability that a voter in an election is a Republican equals .4. Find the probability that among 1,000 voters, the number k of Republicans is between 370 and 430. This is a problem in repeated trials where A = {r} is an event in the experiment S and P(A) = p = .4. With
n = 1,000    k1 = 370    k2 = 430    np = 400    npq = 240
(3-30) yields
P{370 ≤ k ≤ 430} ≈ G(30/√240) - G(-30/√240) = .951

Example 3.15  We receive an order of 10,000 parts. The probability that a part is defective equals .1. Find the probability that the number k of defective parts does not exceed 1,100. With
n = 10,000    k2 = 1,100    p = .1    np = 1,000    npq = 900
(3-34) yields
P{k ≤ 1,100} ≈ G(100/30) = .999

Correction  In (3-32), we approximated the sum of the k2 - k1 + 1 terms on the left by the area of f(x) over k2 - k1 intervals of length 1. If k2 - k1 >> 1, the one term so ignored is negligible, and the approximation (3-30) is satisfactory. For moderate values of k2 - k1, however, a better approximation results if the integration limits k1 and k2 on the right of (3-32) are replaced by k1 - 1/2 and k2 + 1/2. This yields the improved approximation
Σ_{k=k1}^{k2} C(n, k) p^k q^{n-k} ≈ G((k2 + 0.5 - np)/√(npq)) - G((k1 - 0.5 - np)/√(npq))    (3-35)
of (3-30), obtained by replacing the normal curve g(x) by the inscribed staircase function of Fig. 3.4b.

THE LAW OF LARGE NUMBERS  According to the empirical interpretation of the probability p of an event A, we should expect with near certainty that the number k of successes of A in n trials is close to np, provided that n is large enough. Using (3-30), we shall give a probabilistic interpretation of this expectation in the context of the model S_n of repeated trials. As we have shown, the most likely value of k is, as expected, np. However, not only is the probability [see (3-27)]
P{k = np} ≈ 1/√(2πnpq)    (3-36)
that k equals np not close to 1, but it tends to zero as n → ∞. What is almost certain is not that k equals np but that the ratio k/n is arbitrarily close to p for n large enough. This is the essence of the law of large numbers. A precise formulation is given in the theorem below.
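Before the formal statement, a minimal simulation sketch (assuming only Python's standard random module; the seed and sample sizes are arbitrary) illustrates the distinction: the count k wanders away from np in absolute terms, while the ratio k/n settles near p.

```python
import random

random.seed(1)
p = 0.5

# Toss a fair coin n times for increasing n; compare k with np and k/n with p.
for n in (100, 10_000, 1_000_000):
    k = sum(random.random() < p for _ in range(n))
    print(f"n={n:>9}  k-np={k - n * p:>8.0f}  k/n={k / n:.4f}")
```

Typically the absolute deviation k - np grows on the order of √(npq), in line with (3-36), while k/n - p shrinks toward zero; this is what the theorem makes precise.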
Using (3-30), we shall give a probabilistic interpretation of this expectation in the context of the model ~, of repeated trials. As we have shown, the most likely value of k is, as expected, np. However, not only is the probability tsee (3-27)) I r(3-36) vnpq that k equals np not close to I. but it tends to zero as n- oc. What is almost certain is not that k equals np but that the ratio kin is arbitrarily close top for n large enough. This is the essence of the Jaw of large numbers. A precise formulation follows. P{k = np} = ~ SEC. 3-3 ASYMPTOTIC THEOREMS 75 • Theorem. For any positive 1-:. /.; , P{p - 1-: -:: ·· p • t:} > .997 < (3-37) provided that " > 9pqle~. • Proof. With /.; 1 = (p - e)n. P{(p e)n < k < (p "-~ + t:)n}""' "'- (p (i 1 1-:)n. 0-30) yields I (fJ. T ·~'"·- ''1!.1 v llf'Cf li I (p .~ e)ll -_np] v'npq Hence. p { p- ,.; !5 nk !5 (n } ( p + e = G ,.; Vpq)- (i ( -·; iii v,Cf)- '!(j (•: V/;t/) tn I (3-38) If n > 9pql e~. then 1-: v'llfj}q > 3 and 2G(e ~;;)-I >2G(3)- I , 2 x .998ft- I.,.. .997 and 0-37) results. Note. finally. that fii Ui ( ,.; 'J PC/ ) I ,,...., --+ 2G(-x) I ' I Since this is true for any e. we conclude with near certainty that kin tends to 11 -+ ~. This is, in a sense. the theoretical justification of the empirical interpretation ( 1-1) of probability. p as Example 3.16 (a) A fair coin is tossed 900 times. Find the probability that the ratio J.Jn is bet ween .49 <tnd .51. In this prohlcm. 11 - 9(Ml. r. - .01. 1: \:nip~/ - .fl. and IJ-~IU yields P { .49 ··. :,., <.. .51 } ""' :!Cil.fll I - .4515 (b) Find n such that the probability that kin is between .49 and .51 is .95. In this case. (3-38) yields P 1.49 < ;, < 51} ~(i(.ll:! \til I ·- .95 Thus n is the solution of the e4uatiun (i(.(l~ \:111- .975. Frum Table I we sec that (;(1.95).,... .9744 < .975 < (i(~.IMII "' .97725 Using linear interpolatiun. we conclude that hec 14-~))J .02 v/i ~"" 1.96: hence. 11 -· 9.6tMI. • Generalized Bernoulli Trials In Section 3-2 we determined the probability p,(/;.) that an event .tA of an experiment ~f will succeed k times in 11 independent repetitions of rf. Clearly. if .14 succeeds k times. its complement succeeds 11 - /.; times. The fundamental theorem (3-16) can thus be phrased as follows: 76 CHAP. 3 REPEATED TRIALS / f)~ .!i3 J (b) (a) Cl (c) Figure 3.5 The events .<;4, = :A. :A 2 = .~form a partition of !I (Fig. 3.5a). and their probabilities equal p, = p and p 2 = I - p. respectively. In the space ~f, of repeated trials, the probability of the event {.<4 1 occurs k 1 = k times and &1 occurs k. 2 = " - k times} equals p,(k) = p,(kl. k1) = /. ,I1;. I P1' p~: (3-39) "1·"1· Our purpose now is to extend this result to arbitrary partitions. We are given a partition A = 1.-:4, ••..•.<A,) of the space~~. consisting of the revents dl.; (Fig. 3.5b). The probabilities of these events are r numbers P; = P(dl.;) such that P1 + · · · + p, = I Repeating the experiment fJ " times. we obtain the experiment :-1, of repeated trials. A single outcome of ~1, is a sequence ~~~~ ... ~, where ~j is an element of ~f. Since A is a partition. the element ~j belongs to one and only one of the events :A;. Thus at a particular trial. only one of the events of A occurs. Denoting by k; the number of occurrences of .94; in n trials. we conclude that k 1 + · · · + k, = n At the jth trial. the probability that the event :A; occurs equals p;. Furthermore. the trials are independent by assumption. 
Introducing the event ~ = {.<A; occurs k.; times in a specific order} we obtain P(~~) = Pt' · · · p~· (3-40) This is the extension of (3-15) to arbitrary partitions. We shall next determine the probability p,(/.: 1 • • (.:Ji = {.<A; occurs li.; times in any order} • k,) of the event P',· (3-41) • Theorem p 11(k I • • • • • "! . kr ) -- kl! , . , /i.,! P 4I ' • • • • Proof. For a specific set of numbers k 1 , • • • • k,. all events of !f, of the form~ have the same probability. Furthermore. they are mutually exclu- SEC. 3-4 RARE EVENTS AND POISSON POINTS 77 sive, and their union is the event ~1.. To find P(~'i.-) it suffices to find the total number N of such events. Clearly. N equals the number C4, ..... 4, of combinations of" objects grouped in r classes, with k; objects in the ith class. As we have shown in (2-14). the number of such combinations equals I C"4, .....4. -- li.,! ."· . . /;.,! Multiplying hy P<?A>. we obtain (3-41 ). • Corollary. We are given two mutually exclusive events .llt 1 and .lit~ (Fig. 3.5c) with p 1 = P(.<A- 1 ) and p~ = P(.<A~). We wish to find the probability p., that in n trials, the event .s4 1 occurs k, times and the event :A~ occurs k~ times. To solve this problem. we introduce the event .lit_, = .~. u .~~ This event occurs k3 = n - (k, + k2 ) times and its probability equals p 3 = 1 - p 1 - p 2 • Furthermore, the events .~~t .. .llt2, .lit~ form a partition of ~. Hence, p, Example 3.17 = p,( t\JI. • t\~' k'.\ ) = I. I. ll! p4oI p4• p4' ~ .\ "'·"~·"'· Jl. Jl. I (3-42) A fair die is rolled 10 times. Find the probability p, that .fi shows 3 times and even shows 5 times. In this case •.rJl, = {jj}. !A 2 = {even} 3 II = 10 kt = 3 .li.2 = 5 PI = 6 P1 = 6 Clearly. the events .<A, and dl 2 are mutually exclusive. We can therefore apply (3-42) with This yields I)·'( &.3)~( cJ2 - .4 10! ( P•n< 3• 5• 2) = 3!5!2! 6 1 • 3-4 Rare Events and Poisson Points We shall now examine the asymptotic behavior of the probabilities n(n - I) · · · (n - lo; -+ I ) p,(k) = k! .... _ p4q•,-4 under the assumption that the event .lit is rtm•, that is. that p << 1. If n is so large that np >> I. we can use the De Moivre-Laplace approximation (3-27). If, however, n >> I but np is of the order of 1-for example, if n = 1,000 and p = .002-then the approximation no longer holds. In such cases, the following important result can be used. 78 CHAP. 3 REPEATED TRIALS • Poisson Theorem. If p << I and 11 >> I. then. for II. of the order of 11p. ( ") 1\ 4 P4q't (npl 4 ::.- (' -•IP - (3-43) II.! • Proof. The probabilities p,(/\) take significant values for II. near np. Since p << I we can use the approximations np << n. II.<< 11: ll(ll - I) · · · (n - II. + I) == n4 q = I - p ,.,. (' P q" 4 == q'' == e-"P This yields n(n - I) • • • (II - k + I) 11 4 --=-----'---:--:-'------'- p'q" 4 "" _ p'(' ·ttp (3-44) II.! k! and (3-43) results. We note that (3-43) is only an approximation. The formal theorem can be phrdsed as a limit: a' (3-45) ( -") p'q'' '-+(' , _ k k! as n-+ x, p-+ 0, and np-+ a. The proof is based on a refinement of the approximation (3-44). Example 3.18 An order of 2.000 parts is received. The probability that a part is defective equals to-~. Find the probability p, that no component is defective and the probability Ph that there are at most 3 defective components. With n = 2.000. p = 10 \ 11p = ~. (3-43) yields p, "" P{k = 0} = q" = (I - 10 ·'1~·01111 - t' ~ = .406 Pb = P{k s 3} = ±(Z) t•o pkq"· 4 ± = •~o e-• (n~)A = e- 2 {t k. + ~1 + 2.2: 2 + :) 3. 
= .851 • Generalization Consider the r events~; of the partition A of Fig. 3.5. We shall examine the asymptotic behavior of the probabilities p,(k 1 • • • • • k,) in (3-41) under the assumption that the first r - I of these events are rare. that is, that PI << I. . . . • p, I << I Proceeding as in (3-44). we obtain the approximation p,(k, • . . . • k,) =- k n! I1 • • • • k pt• · · · p~· ·, 11I e· "· a~· ,_' ' Q< t· 1 1 t\j' (3-46) where tit = np, • . . • • a,_, = nPr-1. This approximation holds for II.; of the order np;, and it is an equality in the limit as n -+ oc. SF.C. j---1.--J 3-4 RARE EVENTS AND POISSON POINTS 79 k--tb-+: ~~x~~~~x~~x~'----*x--~~~x~------*x---------------~x~~' ··T/~ 11 12 1.1 T/~ 14 Figure 3.6 Poi!l.wm Points The Poisson approximation is of particular importcmce in problems involving random points in time or space. This includes radioactive emissions. telephone calls. and trctffic accidents. In these and other fields. the points are generated by a model ~f, = !/ x · · · x !f of repeated trials where ~~ is the experiment of the rctndom selection of a single point. In the following, we discuss the resulting model as n -+ x.. starting with the single-point experiment ~f. We are given an interval ( ··· T/2. 712) (Fig. 3.6), and we place a point in this interval at rctndom. The probability that the point is in the interval (1 1 , I~) of length Ia = 1~ - 11 equals I,,IT [see (2-46)]. We thus have an experiment ~~ with outcomes all points in the interval (- T/2. 1'/2). The outcomes of the experiment !I, of the repeated trials of ~~ are , points in the interval (- T/2. T/2). We shall find the probability p,(k) that k of these points are in the interval (1 1 , 1~). Clearly. p,(k) is the probability of the event {k points in 1,}. This event occurs if the event {1 1 ~ 1 ~ 12 } of the experiment~~ occurs k times. Hence. P{k points in 1,} - ( ~) PV'-' ,_1'1., We shall now assume that, >> I and T >> t;,. In this case. {1 1 c;:: rctre event. and (3-44) yields • • } I ,. (111)1')4 P{k·pomtsml, ,...c."··· k! (3-47) 1 ~ t~} is a (3-48) This. we repeat, holds only if 1, << 1' and k << 11. Note that this probability depends not on nand T sepamtely but only on their ratio "1' ll."-- (3-49) This rcttio will be called the dc•nsily of the points. As we see from (3-49), A equals the average number of points per unit of time. Next we increase " and 1'. keeping the ratio niT constant. The limit of ~.. as n -+ x will be denoted by ~" and will he called the experiment of Poisson poi111s with density 11.. Clearly. a single outcome of~~.. is a set of 11 points in the interval (- T/2. 1'/2); hence. a single outcome of :-f.. is a set of infinitely many points on the entire 1 axis. In the experiment ~f". the proba- 80 CHAP. 3 REPEATED TRIALS bility that k points will be in an interval (1 1, 11 ) of length 111 equals . . } ., (.\1,)4 P{k Poants 1n 1 = e-ft"-" k! k = 0, I, . . . (3-50) This follows from (3-48) because the right side depends only on the ratio llf1'. Nonoverlapping Intervals Consider the nonoverlapping intervals (1 1 • 11 ) and (1 3 , 14 ) of Fig. 3.6 with lengths 1, = 11 - t 1 and 1, = t4 - t~. In the experiment 9' of placing a single point in the interval (- T/2, T/2). the events .llt 1 ={the point is in t,}, s4.1 ={the point is in r,}, and .!A.t = {.i 1 n ~ 1 } ={the point is outside the intervals r, and r,} form a partition. and their probabilities equal 1, lh TT .A P(.v..,) -= I - respectively. In the space ~1, of placing n points in the interval (- T/2. 
T/2), the event {k, in 1,, k 11 in r,} occurs itT s4. 1 occurs k 1 -= k, times, :A1 occurs k~ = k11 times, and s4. 1 occurs k 3 = n - k, - k, times. Hence [see (3-46)], . . P{k, ant,, k, an r,} n! = k• !k:!!k,! (1")'' T ('h)'~ T (I - t, 1,)'' TT (3-51) From this and (3-46) it follows that if n n- x. (3-52) -=11. T then • k • P{k, m 1,. , an 1, } (.\1,)4.• -AI (,\th)4• = e -AI" k e • -k 1 ,.1 h• (3-53) Note from (3-51) that if Tis finite, the events {k, in 1,} and {kh in r,} are not independent, however, forT- x., these evenb are independent because then [see (3-53) and (3-50)] (3-54) P{k, in t,, k, in 111 } = P{k 11 in t,}P{k, in t11 } Sumnuuy Starting from an experiment involving n points randomly placed in a finite interval, we constructed a model fl consisting of infinitely many points on the entire axis, with the following properties: "& 1. The number of points in an interval of length 111 is an event {k, in t,,} 2. the probability of which is given by (3-50). If two intervals 10 and tb are nonoverlapping, the events {k 11 in 1,} and {kb in lb} are independent. These two properties and the parameter A specify completely the model of Poisson points. ~"& PROBLEMS 81 Appendix Area under the Normal Curve We shall show that r. I c,- .. ,'cb: ~ .J~ a>O (3A-I) • Proof. We set This yields 12 =I', I'. c' .... ·.,·,clxdy In the differential ring AR of Fig. J.7. the integrand equals e-,.,· hecause x! + y 2 = r 2• From this it follows that the intcgml in AR equals e · ,.,· times the area 21rrdr of the ring. Integrating the resulting product for r from 0 to x and setting r! = z. we obtain J! and (3A- I) results. = J.' " 2trre_"'.dr = 1r J.'· e "·c/:. " =~ a Figure 3.7 dr Problems 3-1 A fair coin is tossed 8 times. (a) Find the probability p, that heads shows once. (b) Find the probability P! that heads shows at the sixth toss but not earlier. 82 CHAP. 3 REPEATED TRIALS 3-l 3-J 3-4 3-S 3-6 3-7 3-8 A fair coin is tossed 10 times. Find the probability p that at the eighth toss but not earlier heads shows for the second time. A fair coin is tossed four times. Find the number of outcomes of the space ~~ .. and of the event .<;J ""' {heads shows three times in any order}: find Pt:A). Two fair dice are rolled 10 times. Find the probubility p 1 that 7 will show twice: find the probability p 1 that II will show once. Two fair dice are rolled three times. Find the number of outcomes of the space ~~.I and of the events .9l = {7 shows at the second roll} • .1\ = {II does not show}: find Pl!/11 and PUAI. A string of Christmas lights consists of 20 bulbs connected in series. The probability that a bulb is defective equals .01. (at Find the probability p 1 that the string of lights works. (b) If it does not work. we replace one bulb at a time with a good bulb until the string works: find the probability p 1 that the string works when the fifth bulb is replaced. (d find the probability p1 that we need at most five replacements until the string works. (a) A shipment consists of 100 units. and the probability that a unit is defective is .2. We select 12 units; find the probability p 1 that 3 of them are defective. (b) A shipment consists of 80 good and 20 defective components. We select at random 12 units: find the probability p 1 that 3 of them are defective. Compute the probabilities p,(k I ::. {;) p'q" ' for: (a) " = 20. p = .3. k I. . . .• 20: (b) n = 9. p = .4. k = 0. I. . of two consecutive values of p,lkl equals • 9. 
rlkl "'p,(k- II= 3-10 3-11 3-ll 3-13 3-14 3-IS 3-16 Show that the ratio rCk) kq k + lip and that rlkl increases ask increases from 0 to"· Using this. find the values of k for which p,(k) is maximum. Using the approximation (3-271. find p,(k) for" = 900. p = .4. and k = 340. 350. 360. and 370. The probability that a salesman completes a sale is .2. In one day. he sees 100 customers. Find the probability p that he will complete 16 sales. Compute the two sides of (3-18): (a)for n = 20, p = .4. and k from 5 to II; (b) for n = 20. p :.... .2. and k from I to 7. A fair coin is tossed n times. and heads shows k times. Find the smallest number n such that P{.49 s kin s .51} > .95. A fair die is rolled 720 times. and 5 shows k times. Find the probability that k is a number in the interval 196. 144). The probability that a passenger smokes is .3. A plane has 189 passengers. of whom k are smokers. Find the probability p that 45 s k s 67: (a) using the approximation (3-30); (b) using the approximation (3-35). Of all highway accidents. 52% are minor. 30% are serious. and 18% are fatal. In one day. 10 accidents arc reported. Using (3-4:!). find the probability p that two of the reported accidents are serious and one is fatal. The probability that a child is handicapped is to-·'. Find the probability p 1 that in a school of 1,800 children. one is handicapped. and the probability p 2 that more than two children are handicapped: (a) exactly; (b) using the approximation (3-43). p,(k) 3-9 (~) = 0. (ll - PROBI.F..MS 83 3-17 We receive a 12-digit number by teletype. The probability that a digit is printed wrong equals 10 ·'. Find the probability p that one of the digits is wrong: (a) exactly: (b) using the Poisson approximation. 3-18 The probability that a driver has an accident in a month is .01. Find the probability p 1 that in one year he will hnvc one accident and the probability p~ that he will have at least one accident: (a) exactly: (b) using (3-43). 3-19 Particles emitted from a rudioactive substance form a Poisson set of points with A -· I. 7 per second. Find the probability I' that in :! seconds. fewer than five particles will be emitted. 4 _ _ __ The Random Variable A random variable is a function x(() with domain the set':/ of experimental outcomes ( and with range a set of numbers. Thus xW is a model concept, and all its properties follow from the properties of the experiment ':/. A function y = g(x) of the random variable x is a composite function y(() = g(x(()) with domain the set ':/. Some authors define random variables as functions with domain the real line. The resulting theory is consistent and operationally equivalent to ours. We feel strongly. however, that it is conceptually preferable to interpret x as a function defined within an abstract space ':/, even if':/ is not explicitly used. This approach avoids the use of infinitely dimensional spaces, and it leads to a unified theory. 4-1 Introduction We have dealt so far with experimental outcomes, events, and probabilities. The outcomes are various objects that can be identified somehow, for example, "heads," "red," "the queen of spades." We have also considered experiments the outcomes of which are numbers, for example, "time of a call," "IQ of a child"; however, in the study of events and probabilities, the 84 SEC. 4-1 INTRODUCTION 85 numerical character of such outcomes is only a way of identifying them. In this chapter we introduce a new concept. We assign to each outcome' of an experiment f:f a number x(,). 
This number could be the gain or loss in a game of chance, the size of a product. the voltage of a random source, or any other quantity of interest. We thus establish a relationship between the elements'; of the set~ and various numbers x(,;). In other words, we form a function with domain the set ff of abstract objects ';and with range a set of numbers. Such a function will be called a random variable. Example 4.1 The die experiment has six outcomes. To the outcome Ji we assign the number lOi. We have thus formed a function x such that x(/;) = IOi. In the same experiment, we form another function y such that y</;l = 0 if i is odd andy</;) = I if i is even. In the following table. we list the functions so constructed. f, Ji l ~ x(/;) y(/;) f4 h Ji. 10 20 30 40 50 60 0 I 0 I 0 I The domain of both functions is the set :J. The range of" consists of the six numbers 10, . . . • 60. The range of y consists of the two numbers 0 and I. • To clarify the concept of a random variable, we review briefty the notion of a function. Meaning or a •·unction As we know, a function x = x(t) is a rule of correspondence between the values oft and x. The independent variable t takes numerical values forming a set fl, on the t-axis called the domain of the function. To every tin fl, we assign, according to some rule, a number x(t) to the dependent variable x. The values of x form a set ff, on the x-axis called the range of the function. Thus a function is a mapping of the set fl, on the set fl.,. The rule of correspondence between t and x could be a table, a curve, or a formula. for example, x(t) = t2• The notation x(l) used to represent a function has two meanings. It means the particular number x(t) corresponding to a specific 1; it also means the function x(l), namely, the entire mapping of the set f:l, on the set fix. To avoid this ambiguity, we shall denote the mapping by x, leaving its dependence on 1 understood. Gentrali1.fltlon The definition of a function can be phrased as follows: We are given two sets of numbers f:l, and fl.,. To every 1 E f:!, we assign a number x(t) belonging to the set f:/11 • This leads to the following genercllization. 86 CHAP. 4 THE RANDOM VARIABLE We are given two sets of objects ~a and Y';J consisting of the arbitrary elements a and {3, respectively: a E Y'n {3 E ff(J We say that {3 is a function of a if to every element a of the set ~a we make correspond one element {3 of the set ~fJ. The set :Ja is called the domain of the function and the set f:ffJ its range. Suppose that ';/a is the set of all children in a community and f://J the set of their mothers. The pairing of a child with its mother is a function. We note that to a given a there corresponds a single {3. However, more than one element of the set fJ a might be paired with the same {3 (a child has only one mother, but a mother might have more than one child). Thus the number Np of elements of the set f:/11 is equal to or smaller than the number Na of the elements of the set ~a. If the correspondence is one-to-one, then Na = N11. The Random Variable A random variable (Rv) represents a process of assigning to every outcome C of an experiment f:f a number x(C). Thus an RV xis a function with domain the set f:f of experimental outcomes and range a set of numbers (Fig. 4. I). All Rvs will be written in boldface letters. 
The notation x(C) will indicate the number assigned to the specific outcome and the notation x will indicate the entire function, that is, the rule of correspondence between the elements Cof fJ and the numbers x(C) assigned to these elements. In Example 4.1, x indicates the table pairing the six faces of the die with the six numbers 10, . . . , 60. The domain of this function is the set f:f = {Ji, . . . ,J,.}, and its range is the set {10, . . . , 60}. The expression x(ji) is the number 20. In the same example, y indicates the correspondence between the six faces/; and the two numbers 0 and I. The range ofy is therefore the set {0, 1}. The expression y(Ji) is the number I (Fig. 4.2). c. Events Generated by avs In the study of avs, questions of the following form arise: What is the probability that the RV x is less than a given number Figure 4.1 .s4 st = {x!Sx} = {t1 !Sx!:x2f SEC. x: 0 20 30 40 I I 4-1 50 • • • • \ I INTRODUCTION 87 60 t • X ~ •y Figure 4.2 x? What is the probability that x is between the numbers x 1 and x 2 ? We might, for example, wish to find the probability that the height x of a person selected at random will not exceed certain bounds. As we know, probabilities arc assigned only to events; to answer such questions, we must therefore express the various conditions imposed on x as events. We start with the determination of the probability that the RV x does not exceed a specific number x. To do so. we introduce the notation $l. = {x ::; .t} This notation specifies an event .sll consisting of all outcomes ' such that x(') s x. We emphasize that {x s x} is not a set of numbers; it is a set $l. of experimental outcomes (Fig. 4.1). The probability P(S'l) of this set is the probability that the RV x does not exceed the number x. The notation ?A = {x 1 < X< X2} specifies an event~ consisting of all outcomes ' such that the corresponding values x(C) of the RV x arc between the numbers x 1 and x~. Finally, ~ = {x = xo} is an event consisting of all outcomes ' such that the value x(C) of x equals the number x0 • Example 4.2 We shall illustrate with the avs x and y of Example 4.1. The set {x s 35} consists of the elementsjj ,/2 , and/3 because x(Ji) s 35 only if i = I. 2. or 3. The set of {x s 5} is empty because there is no outcome such that x(Ji) s 5. The set {20 s x :s 35} consists ofthe outcomesJi and/3 because 20 s x(Ji) s 35 only if i = 2 or 3. The set {x = 40} consists of the element~ because x(Ji) = 40 only if i = 4. Finally, {x = 35} is the empty set because there is no outcome such that x(Ji) = 35. Similarly, {y < O} is the empty set because there is no outcome such that y(Ji) < 0. The set {y < I} consists of the outcomesf1 .f,, and f. because y(Ji) < I for i = I, 3, or 5. Finally, {y ;s I} is the cenain event because y(Ji) '$ I for every Ji. • 88 CHAP. 4 THE RANDOM VARIABLE In the definition of an Rv, the numbers x(C) assigned to x can be finite or infinite. We shall assume, however, that the set of outcomes C such that x(C) = :::t:x: has zero probability: P{x = :x:} = 0 P{x = -:x:} = 0 (4-1) With this mild restriction, the definition of an Rv is complete. 4-2 The Distribution Function It appears from the definition of an RV that to determine the probability that x takes values in a set I of the x-axis, we must first determine the event {x E I} consisting of all outcomes Csuch that x(C) is in the set/. To do so, we need to know the underlying experiment~. However, as we show next, this is not necessary. To find P{x E /}it suffices to know the distribution of the Rv x. 
This is a function Fx(x) of x defined as follows. Given a number x, we form the event .54.x = {x s x}. This event depends on the number x; hence, its probability is a function of x. This function is denoted by Fx(x) and is called the cumulative distribution/unction (c.d.f.) of the RV x. For simplicity, we shall call it the distribution function or just the distribution of x. • Definition. The distribution of the RV x is the function Fx(x) = P{x s x} (4-2) defined for every x from -:x: to :x:. In the notation Fx(x), the subscript x identities the RV x and the independent variable x specifies the event {x s x}. The variable x could be replaced by any other variable. Thus Fx(w) equals the probability of the event {x s w}. The distributions of the RVS x, y, and z are denoted by Fx(x), F 1(y), and Fz(t.), respectively. If, however, there is no fear of ambiguity, the subscripts will be omitted and all distributions will be identified by their independent variables. In this notation, the distributions of the avs x, y, and z will be F(x), F(y), and F(z), respectively. Several illustrations follow. EDJDple 4.3 (a) In the fair-die experiment, we define the RV x such that x(Jj) = IOi as in Example 4.1. We shall determine its distribution F,(x) for every x from -:x: to ac, We start with specific values of x F,(200) = P{x s 200} = PU:I) = I F,(45) = P{x s 45} = P{Jj, fi, J3, 14} = 64 SEC. 4-2 THE DISTRIBUTION FUNCTION Fx(X) 89 F>'(y) I I 2 I 6 0 y X Figure 4.3 = P{fi.f,Jj} = ~ F,(30) = P{x s 30} F,(29.99) = P{x s 29.99} = P{Jj. h} = ~ 6I f,(IO.l) = P{x s 10.1} = P{fi} = FA(5) = P{x s 5} = P(0) = 0 Proceeding similarly for any .t. we obtain the staircase function FA(x) of Fig. 4.3a. (b) In the same experiment. the av y is such that y(/;) = {o, i i = I. 3, 5 = 2. 4. 6 In this case, F,(l5) F,.(l) F~.(O) F_,.(- 20) = P{y s 15} = P(9') = I I}""" P(Yl _,_ I = P{y s = P{y s 0} 3 = P{fj ./1. fi} = 6 = P{y s - 20} = P(0) - 0 The function F 1 (y) is shown in Fig. 4.3b • Example 4.4 In the coin experiment, fl = {h, t} P{h} = p P{t} = q We form the av x such that x(h) = I X(f) = 0 The distribution ofx is the staircase function F(x) of Fig. 4.4. Note, in particular, that F(0.9) = P<x s 0.9} = P{t} = q F(4) = P{x s 4} = P(f/) = I F(-5) ..:.: P{x s -5} .:.:. P(0) :..:. 0 F(O) = P(x s 0} = P{t} -= q • 90 CHAP. 4 THE RANDOM VARIABLE F(x) Pf q.--......;...... 0 X Fipre 4.4 Example 4.5 Telephone calls occurring at random and uniformly in the interval (0, 1) specify an experiment ~ the outcomes of which are all points in this interval. The probability that tis in the interval (t 1 , t2 ) equals [see (2-46)] 12- It P{1 1 s 1 s 12} = - T - We introduce the av x such that X(l) = I 0s I s T In this example, the variable 1 has a double meaning. It is the outcome of the experiment ~ and the corresponding value x(l) = 1 of the av x. We shall show that the distribution oh is a ramp, as in Fig. 4.5. To do so, we must find the probability of the event {ll s x} for every x. Suppose, first, that x > T. In this case, X(l) s x for every I in the interval (0, because ll(t) = 1; hence, F(x) = P{ll(l) s x} = P{O s I s 7l = I x> T If 0 s x s T, then x(l) s x for every 1 in the interval (0, x); hence, n = TX 0 s x s T Finally, if x < 0, then {x(l) s x} = 0 because x(l) = 1 ~ 0 for every 1 in ~; hence, F(x) = P{x < x} = P(0) = 0 x<0 • F(x) Example 4.6 = P{x s x} = P{O s 1 s x} The experiment ~ consists of all points in the interval (0, oo). The events of~ are all intervals (11 , 12) and their unions. To specify~. 
it suffices, therefore, to know the probability of the event {1 1 s t s 12 } for every 11 and 12 in~. This probability can be Fipre 4.5 F(x) T X 4-2 SEC. THF. DISTRIBUTION FUNCTION 91 specified in terms of a function a(t) as in (2-43): P{t 1 s t s t 2 } (•· = ),: a(t)dt We shall assume that a(t) = 2e- 21 This yields P { 0 :s t :s to 1l = 2 Joftn e ·u' dt ' = I - e · ·'" (4-3) (a) We form an RV x such that X(t) = t I 2: 0 Thus. as in Example 4.5, t is an outcome of the experiment ~ and the corresponding value of the RV x. We shall show (fig. 4.6a) that ,.-u. x ~ 0 X< 0 0 0, then x(t) s x for every t in the interval (0, x): hence, F<x> = { I - If x 2: F.(x) = P{x s If x < 0, then {x s x} =0 x} = P{O s t s x} = I - e·U. because llt(t) 2: 0 for every t in ~; hence, F,(x) = P{x < (4-4) x} = P(0) =0 Note that whereas (4-3) has a meaning only for to 2: 0. (4-4) is defined for all x. (b) In the same experiment, we define the RV y such that o ~ r s o.s y<r> = { 1 t > 0.5 Thus y takes the values 0 and I and P{y = 0} = P{O :s t :;:: 0.5} "' I 0 P{y = I}~ P{t > 0.5} (' t = 2 Jn(' \ ('-~'dt = e·• From this it follows (fig. 4.6b) that ~~ = {!- y ;:: I ,-o 0-5yS) • y<O Figure 4.6 F.(x) F,.(}') I- 0.5 X (a) ---..J e-• ... )' 0 (b) 92 CHAP. 4 THE RANDOM VARIABLE It is clear from the foregoing examples that if the experiment ~ consists of finitely many outcomes, F(x) is a staircase function. This is also true if~ consists of infinitely many outcomes but x takes finitely many values. PROPERTIES OF DISTRIBUTIONS The following properties are simple consequences of (4-1) and (4-2). F( -~) I. 2. = P{x = -:x:} = 0 F(:x:) = P{x s :x:} =I The function F(x) is monotonically increasing; that is, if x 1 < x2 then F(xd s F<x2) (4-5) (4-6) • Proof. If x 1 < x2 and C is such that x(C) s x, , then x(C) s x2; hence, the event {x < x.} is a subset of the event {x s x2 }. This yields F(xd = P{x < Xt} s P{x < x2} = F(x2> From (4-5) and (4-7) it follows that 0 s F(x) s 1 Furthermore, if F(x0 ) = 0 then F(x) = 0 for every x s xo P{x > x} 3. =I (4-7) (4-8) (4-9) (4-10) - F(x) c. the events {x s x} and {x > x} are mutually exclusive, and their union equals fl. Hence, P{x s x} + P{x > x} = P(~) = I • Proof. For a specific 4. (4-11) • Proof. The events {x s x 1} and {x1 < x s x2} are mutually exclusive, and their union is the event {x s x2 }: {x s x.} U {x, < x s x 2 } = {x s x2 } This yields P{x s xd + P{x, < x s x2} and (4-11) results. 5. = P{x s x;!} The function F(x) might be continuous or discontinuous. We shall examine its behavior at or near a discontinuity point. Consider first the die experiment of Example 4.3a. Clearly, F(x) is discontinuous at x = 30, and 2 3 3 F(30) =F(30.01) = 6 F(29.99) = 6 6 Thus the value of F(x) for x = 30 equals its value for x near 30 to the right but is different from its value for x near 30 to the left. Furthermore, the discontinuity jump of F(x) from 2/6 to 3/6 equals the probability P{x = 30} = 1/6. We maintain that this is true in · general. SEC. 4-2 THE DISTRIBUTION FUNCTION 93 .-------- Po t ______ _ 0 Figure 4.7 Suppose that F(x) is discontinuous at the point x = x 0 • We denote by F(x0 ) and F(x0 ) the limit of F<x> as x approaches x0 from the right and left, respectively (fig. 4.7). 
The difference Po = F(xo > - F(xo > (4-12) is the "discontinuity jump" of Ftc) at the point x0 • As we see from the figure, P{x < xo} = F(xii) P{x s x0 } = F(xii) (4-13) This shows that (4-14) F(xo) = F<xo) Note, finally, that the events {x < x0 } and {x = 0} are mutually exclusive, and their union equals {x s x}. This yields P{x < x0 } + P{x = xo} = P{x s xo} From this and (4-13) it follows that P{x = xo} = F(xo) - F(xii) =Po P{x < xo} = F(xo) (4-15) Continuous, Discrete, and Mixed Type avs We shall say that an RV xis of continuous type if its distribution is continuous for every x (Fig. 4.8). In this case, F(x") = F(x) = F(x- ); hence, P{x = x} = 0 (4-16) Figure 4.8 F(x) Fix) F(x) "'V Pt ..... F<xn X 0 Continuous 0 Xi Discrete :c X 0 ~ixed 94 CHAP. 4 THE RANDOM VARIABLE for every x. Thus if x is of continuous type, the probability that it equals a specific number xis zero for every x. We note that in this case, P{x s x} = P{x < x} = F(x) (4-17) F(x,) We shall say that an av x is of discrete type if its distribution is a staircase function. Denoting by x; the discontinuity points of F(x) and by p; the jumps at x;, we conclude as in (4-12) and (4-14) that P{x = x;} = F(x;) - F(xi) = p; (4-18) Since F( -~> = 0 and F(~) = 1, it follows that if F(x) has N steps, then Pt + ' ' · + PN = 1 (4-19) Thus if x is of discrete type, it takes the values X; with probabilities p;. It might take also other values; however, the set of the corresponding outcomes has zero probability. We shall say that an av xis of mixed type if its distribution is discontinuous but not a staircase. If an experiment ~ has finitely many outcomes, any av x defined on ~ is of discrete type. However, an av x might be of discrete type even if t;J has infinitely many outcomes. The next example is an illustration. P{x1 s x Example 4.7 s x2 } = P{x 1 < x s x2 } = F(x2 ) Suppose that sf is an event of an arbitrary experiment ~f. - We shall say that x.., is the zero-one RV associated with the event sf if CCsf I X.;~(C) = { 0 CE sf Thus x.., takes the values 0 and I (fig. 4.9), and P{x.., = I} = P(sf) = p P{x_. = 0} = P(~) =I - p ne PercentUe Curve The distribution function equals the probability u • = F(x) that the av x does not exceed a given number x. In many cases, we are faced with the inverse problem. We are given u and wish to find the value xu of x such that P{x s Xu} = u. Clearly, Xu is a number that depends on u, and it Figure 4.9 FCx) t 1-p·---"""""-t 0 X SEC. 4-2 THE DISTRIBl:TION ..-uNCTION 95 X Percentile Distribution Figure 4.10 is found by solving the equation (4-20) Thus x, is a function of u called the u-percentile Cor qua11tile or fractile) of the RV x. Empirically. this means that IOOu%- of the observed values of x do not exceed the number x,. The function x, is the inverse of the function u = F(x). To find its graph. we interchange the axes of the F(x) curve (Fig. 4.10). The domain of x, is the interval 0 ::s u ::s I, and its range is the x-axis F(x,) = u - :x: ::S X ::S :x:. Note that if the function F(x) is tabulated. we use interpolation to find the values of x, for specific values of u. Suppose that u is between the tabulated numbers u, and uh: F(x,) = Ua < 11 < uh = F(xh) The corresponding x, is obtained by the straight line approximation Xh- Xu 4 21 X "" X (U - It ) ( • ) ll U lih- U,. II of F(x) in the interval (x,.. xh>· In Fig. 4.1 I. we demonstrate the determina- Jo'igure 4.11 X F(x) 1.60 .94520 1.65 .95053 1.95 .97441 200 .97725 230 .98928 235 .99061 u x,. .95 J.(j4 .975 1.96 .99 2.33 96 CHAP. 
4 THE RANDOM VARIABLE tion of Xu for u = .95, .975, and .99 where we use for F<x> the standard normal curve G(x) [see (3-23)]. Median The .5-percentile x.s is of particular interest. It is denoted by m and is called the median of x. Thus F(m) = .5 m = x.s The Empirkal Distrihutimr. We shall now give the relative frequency interpretation of the function F(x). To do so, we perform the experiment n times and denote by ~; the observed outcome at the ith trial. We thus obtain a sequence t. ' . . . ' t;' . . . ' t,. (4-22) of n outcomes where t; is one of the elements Cof~. The av x provides a rule for assigning to each element Cof~ a number x((). The sequence of outcomes (4-22) therefore generates the sequence of numbers (4-23) Xt, ••• , X;, • • • , X 11 where x; = x(t;) is the value of the av x at the ith trial. We place the numbers x; on the x-axis and form a staircase function F,.(x) consisting of n steps, as in Fig. 4.12. The steps are located at the points X; (identified by dots), and their height equals 1/n. The first step is at the smallest value Xmin of X; and the last at the largest value Xmu. Thus F,.(x) = 0 for X < Xmin F,(x) = I for x ~ Xmu The function F,.(x) so constructed is called the empirical distribution of the RV X. As n increases, the number of steps increases, and their height 1/n tends to zero. We shall show that for sufficiently large n, F,(x) = F(x) (4-24) in the sense of(l-1). We denote by n~ the numberoftrials such thatx; s x. Thus n.• is the number of steps of F,(x) to the left of x; hence. n n F,.(x) = .2. (4-25) Figure 4.12 X SEC. 0 10 m =2 4-2 m =3 X THE DISTRIBUTION FUNCTION 0 97 10 X Figure 4.13 As we know, {x s x} is an event with probability F(x). This event occurs at the ith trial iff x; s x. From this it follows that ntis the number of successes of the event {x s x} inn trials. Applying (1-1) to this event. we obtain F(x) -= P{x s x} == !!l = Fn(.tl n Thus the empirical function F,,(x) can be used to estimate the conceptual function F(x) (see also Section 9-4). In the construction of f'n(x). we assumed that the numbers x; are all different. This is most likely the case if F(x) is continuous. If, however. the RV xis of discrete type taking the N values c 4 • then x; = c4 for some k. In this case, the steps of F,,(x) are at the points c4 , and the height of each step equals min where m is the multiplicity of the numbers x, that equal c4 (Fig. 4.13). Example 4.8 We roll a fair die 10 times and observe the outcomes fihfi.J,f,fdf.J,j;J, The corresponding values of the RV x defined as in Example 4.1 are 10 50 60 40 50 20 60 30 50 30 In Fig. 4.13 we show the distribution F(x) and the empirical distribution Fn(x). • Tire Empirical P(•rc·t•llfil(• rQuett•h•t Curv«'J. Using the 11 numbers X; in (4-23), we form n segments of length X;. We place them in line parallel to they-axis in order of increasing length, distance lin apart (Fig. 4.14). If. for example, xis the length of pine needles. the segments are n needles selected at random. We then form a polygon the corners of which are the endpoints of the segments. For sufficiently large n. this polygon approaches the u-percentile curve Xu of the RV X. Tlte Density Function We can use the distribution function to determine the probability P{x E R} that the RV x takes values in an arbitrary region R of the real axis. To do so, we express Rasa union of nonoverlapping intervals and apply (4-11). We show next that the result can be expressed in terms of the derivative f(x) of F(x). 
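Before turning to the density, the straight-line approximation (4-21) can be illustrated numerically. The Python sketch below is only a sketch; the tabulated pairs are the values of G(x) listed in Fig. 4.11, and the helper percentile is ours.

```python
# Tabulated values (x, G(x)) of the standard normal distribution, as in Fig. 4.11.
table = [(1.60, 0.94520), (1.65, 0.95053),
         (1.95, 0.97441), (2.00, 0.97725),
         (2.30, 0.98928), (2.35, 0.99061)]

def percentile(u):
    """u-percentile x_u by the straight-line approximation of equation (4-21)."""
    for (xa, ua), (xb, ub) in zip(table, table[1:]):
        if ua <= u <= ub:
            return xa + (xb - xa) * (u - ua) / (ub - ua)
    raise ValueError("u is outside the tabulated range")

for u in (0.95, 0.975, 0.99):
    print(u, round(percentile(u), 3))   # about 1.645, 1.960, 2.327
```

The interpolated values agree, to within about 0.01, with the percentiles x.95 = 1.64, x.975 = 1.96, and x.99 = 2.33 read from Fig. 4.11.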
We shall assume, first, that F(x) is continuous and that its derivative exists nearly everywhere. 98 CHAP. 4 THE RANDOM VARIABLE u 0 1/n F~pre4.14 • Dtjinition. The derivative f(x) = dF(x) dx (4-26) of F(x) is called the probability density function (p.d.f.) or the frequency function of the RV x. We discuss next various properties of f(x). Since F(x) increases as x increases, we conclude that (4-27) f(x) ~ 0 Integrating (4-26) from x 1 to x2 , we obtain F(x2> - F(x.> With x2 = -co, this yields Xl = J:.J<f>df F(x) because F( -co) 1 = ., f(x)dx (4-28) (4-29) = 0. Setting x = ~. we obtain r. f(x)dx =1 (4-30) Note, finally, that (4-31) SEC. 4-2 THE DISTRIBUTION FUNCTION 99 /Cx) X (a) (b) Figure 4.15 This follows from (4-28) and (4-17). Thus the area of f(x) in an interval (x 1 , x2) equals the probability that x is in this interval. If x, = x, x2 = x +ax, and ax is sufficiently small. the integral in (4-31) is approximately equal to f(x)ax; hence,· P{x ~ X ~ X + ax} ::= f(x)ax (4-32) From this it follows that the density f(x) can be defined directly as a limit involving probabilities: f(x) = lim P{x ~ 6x....O With Xu X ~ X + ax} (4-33) ax the u-percentile of x, (4-29) yields (Fig. 4. t5a) u = F(x,) = f: f(x)dx Note, finally, that if f(x) is an even function-that is, iff(- x) = f(x) (Fig. 4. 15b)-then I - F(x) = F(-x) Xt-u = -x, (4-34) From this and the table in Fig. 4. tt it follows that ifF (x) is a normal distribution, then Xot = -x99 = -2.3 X.os = - X.9S = -1.6 DISCRETE TYPE RVS Suppose now that F(x) is a staircase function with discontinuities at the points x•. In this case, the RV x takes the values x. with probability (4-35) P{x = x.} = P• = F(x,J - F(xi) The numbers P• will be represented graphically either in terms of F(x) or by vertical segments at the points x• with height equal to P• (Fig. 4.16). Occasionally, we shall also use the notation P• = f<x•> to specify the probabilities P•. The function f(x) so defined will be called point density. It should be understood that its valuesf(x•> are not the derivatives of F(x); they equal the discontinuity jumps of F(x). 100 CHAP.4 THE RANDOM VARIABLE F(x) 1- T Pt 3 3 i i l I 0 I I I 1 2 3 ii i 2 0 X 3 X Figure 4.16 Example 4.9 The experiment r:l is the toss of a fair coin three times. In this case, r:l has eight outcomes, as in Example 3.1. We define the RV x such that its value at a specific outcome equals the number of heads in that outcome. Thus x takes the values 0, 1, 2, and 3, and I 3 3 1 P{x = O} =P{x = I} =P{x = 2} = 8 8 8 P{x = 3} =-8 In Fig. 4.16, we show its distribution F(x) and the probabilities Pk· • The Empiriclll Density (1/istol(rtlm}. We have performed an experiment n times, and we obtained the n values xi of the RV x. In Fig. 4.12, we placed the numbers xi on the x-axis and formed the empirical curve F,.(x). In many cases, this is too detailed; what is needed is not the exact values of xi but their number in various intervals of the x-axis. For example, if x represents yearly income, we might wish to know only the number of persons in various income brackets. To display such information graphically, we proceed as follows. We divide the x-axis into intervals of length 4, and we denote by n~c the number of points x; that are in the kth interval. We then form a staircase functionf,(x), as in Fig. 4.17. The kth step is in the kth interval (c4 , c1c + 4), and Figure 4.17 n~c n~ 0 X 2 3 6 11 15 9 7 3 2 SEC. 4-3 ILLUSTRATIONS 101 its height equals n41n.:1. 
Thus nt f,.,(x) = n.1 c4 s x s c 4 + .1 (4-36) The function/,.(x) is called the histogram of the RV x. The histogram is used to describe economically the data X;. We show next that if n is large and d small, f,(x) approaches the density /(x): j,.(x) == /(x) (4-37) Indeed, the event {c4 s x < c4 + .:1} occurs n4 times in" trials, and its probability equals/Ccdd [see (4-32)). Hence. j'(c, ).;1 == P{c, S x < c4 + .1} -::· '..!! II = (, (.r).:1 . '' Probability Mass Density In Section 2-2 we interpreted the probability P(:A.) of an event sA. as mass associated with :A. We shall now give a similar interpretation of the distribution and the density of an av x. The function F(x) equals the probability of the event {x ~ x}; hence, F(x) can be interpreted as mass along the x-axis from - x to x. The difference F(x2 ) F(x1) is the mass in the interval (x 1 , x2 ), and the difference F(x + 4x) F(x) === f(x)4x is the mass in the interval (x, x + 4x). From this it follows that .f(x) can be interpreted as mass density. If x is of discrete type, taking the values x4 with probability Pk, then the probabilities are point masses p4 located at xk. Finally, if xis of mixed type, it has distributed masses with density f(x), where F'(x) exists, and point masses at the discontinuities of F(:c). 4-3 Illustrations Now we shall introduce various avs with specified distributions. It might appear that to do so, we need to start with the specification of the underlying experiment. We shall show, however, that this is not necessary. Given a distribution <l>(x), we shall construct an experiment ~ and an RV x such that its distribution equals <l>(x) • •. ROM THF. mSTRIBUTIO~ TO THE MODEL We are given a function <l>(x) having all the properties of a distribution: It increases monotonically from 0 to I as x increases from -'JC to x, and it is continuous from the right. Using this function, we construct an experimental model ~ as follows. The outcomes of~ are all points on the t-axis. The events of~ are all intervals and their unions and intersections. The probability of the event {t 1 ~ t ~ t2 } equals P{t1 s t s t 2} = <l>(t2) - 4>(t 1) (4-38) This completes the specification of ~. We next form an av x with domain the space ~and distribution the 102 CHAP. 4 THE RANDOM VARIABLE given function ~(x). To do so. we set (4-39) x(t) = t Thus t has a dual meaning: It is an element of~ identified by the letter t. and it is the value of the RV x corresponding to this element. For a given x, the event {x s x} consists of all elements t such that x(t) s x. Since x(t) = t, we conclude that {x s x} = {t s x}; hence [see (4-38)], F(x) = P{x s x} = P{t s x} = ~(x) (4-40) Note that ~ is the entire t-axis even if ~(x) is a staircase function. In this case, however, all probability masses are at the distontinuity points of ~(x). All other points of the t-axis form a set with zero probability. We have thus constructed an experiment specified in terms of an arbitrary function ~(x). This experiment is. of course, only a theoretical model. Whether it can be used as the model of a real experiment is another matter. In the following illustrations, we shall often identify also various physical problems generating specific idealized distributions. Fundame11tal Note. From the foregoing construction it follows that in the study of a single RV x, we can avoid the notion of an abstract space. We can assume in all cases that the underlying experiment ~ is the real line and its outcomes are the value x of x. This approach is taken by many authors. 
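As a brief aside on the histogram of (4-36), the approximation f_n(x) ≈ f(x) of (4-37) is easy to check by simulation. The sketch below is illustrative only; it assumes NumPy is available and uses the exponential density f(x) = 2e^(−2x)U(x) of Example 4.6 to generate the n observed values x_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 10_000, 0.1                      # number of trials n and bin width Δ

# n observed values of an RV with density f(x) = 2 e^(-2x) U(x), as in Example 4.6
# (an exponential density with c = 2, hence scale 1/2).
x = rng.exponential(scale=0.5, size=n)

# Histogram f_n(x) = n_k / (n Δ) over the intervals (c_k, c_k + Δ), equation (4-36).
edges = np.arange(0.0, 3.0 + delta, delta)
n_k, _ = np.histogram(x, bins=edges)
f_n = n_k / (n * delta)

# Compare a few bins with the density at the bin centers, as in (4-37).
centers = edges[:-1] + delta / 2
f = 2 * np.exp(-2 * centers)
for k in (0, 5, 10):
    print(round(float(centers[k]), 2), round(float(f_n[k]), 2), round(float(f[k]), 2))
```

With n = 10,000 and Δ = 0.1 the histogram and the density typically agree to about two significant figures; the agreement improves as n increases and Δ decreases, provided nΔ remains large.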
We believe, however, that it is preferable to differentiate between experimental outcomes and values of the av x and to interpret all avs as functions with domain an abstract set of objects, limiting the real line to special cases. One reason we do so is to make clear the conceptual difference between outcomes and avs. The other reason involves the study of several, possibly noncountably many avs (stochastic processes). If we use the real line approach, we must consider spaces with many. possibly infinitely many, coordinates. It is conceptually much simpler. it seems to us, to define all avs as functions with domain an abstract set ~. We shall use the following notational simplifications. First, we introduce the step function (Fig. 4.18): U(x) = {~ x~o (4-41) x<O This function will be used to identify distributions that equal zero for x < 0. For example, /(x) = 2e-2.rU(x) will mean that f(x) = 2e-2.r for x ~ 0 and f(x) = 0 for x < 0. Figure 4.18 U(x) 0 X SEC. 4-3 II.LUSTRATIONS 103 The notation f(x)- «f>(x) (4-42) will mean thatf(x) = -ycb(x) where 'Y is a factor that does not depend on x. If j(x) is a density, then 'Y can be found from (4-30). Normal We shall say that an RV xis standard normal or Gaussian if its density is the function (Fig. 4.19) I .,, g(x) = - - e· ·-·- v'21T introduced in (3-22). The corresponding distribution is the function G(x) = _I_ J' v'21T e -~='2 d~ 1 from the evenness of g(x) it follows that G(-x) =I - G(x) Shifting and scaling g(x). we obtain the general normal curves f(x) = -I- e·· 1• ·"r~,,.I .a- = - K - (4-44) u v'21T (T (T (X-TJ) F(x) = -I- J·' e u v'21T -... 2, u-.,lrz,.-d~ = G (·x-TJ) -- (4-45) (T We shall usc the notation N(TJ, u) to indicate that the RV xis normal, as in (4-44). Thus N(O, I) indicates that x is standard normal. u). then From (4-45) it follows that if x is N(TJ, P{x, ~ x s x2 } TJ) - G (XtTJ) = F(x2 ) - F<x.> = G. (X2-u-u- (4-46) l'igure 4.19 .\'(0. I I !\"(3. 2) 0.1 X 104 CHAP. 4 THE RANDOM VARIABLE With x 1 = TJ - kcr, and x2 = TJ + kcr, (4-46) yields P{TJ - kcr s x s TJ + kcr} = G(k) - G( -k) = 2G(k) - 1 (4-47) This is the area of the normal curve (4-44) in the interval (TJ- kcr, TJ + kcr). The following special cases are of particular interest. As we see from Table Ia, G(l) = .8413 G(2) = .9772 G(3) = .9987 Inserting into (4-47), we obtain P{TJ - cr < x s TJ + cr} ""' .683 P{TJ - 2cr < x s TJ + 2cr} = .954 P{TJ - 3cr < x s TJ + 3cr}""' .997 (4-48) We note further (see Fig. 4.11) that P{TJ - 1.96cr < X s TJ + 1.96cr} P{TJ - 2.58cr < x s TJ + 2.58cr} P{TJ - 3.29cr < x s TJ + 3.29cr} = .95 = .99 = .999 (4-49) In Fig. 4.20, we show the areas under the N(TJ, cr) curve for the intervals in (4-48) and (4-49). The normal distribution is of central imponance in the theory and the applications of probability. It is a reasonable approximation of empirical distributions in many problems, and it is used even in cases involving Rvs with domain a finite interval (a, b). In such cases, the approximation is possible if the normal curve is suitably truncated and scaled, or, if its area is negligible outside the interval (a, b). Figure 4.l0 " f!+O X 1-+----------.999'----------....l r ~t--------.99'-------+-1 ....~------.95------.1 'I - 3.26a 'I - 2.S8a 'I - 1.96a " 'I + 1.9611 'I+ 2.S8a 'I+ 3.2611 SEC. Example 4.10 4-3 ILLUSTRATIONS 105 The diameter of cylinders coming out of a production line are the values of a normal av with 11 = 10 em, u = 0.05 em. 
(a) We set as tolerance limits the points 9.9, 10.1, and we reject all units outside the interval (9.9, 10.1) = (1J - 2u. 11 -r 2u) Find the percentage of the rejected units. As we see from (4-48), P{9.9 < x s 10.1} = .954; hence. 4.6% of the units are rejected. (b) We wish to find a tolerance interval (10- c. 10 - <')such that only 1% of the units will be rejected. From (4-49) it follows that P{IO- c < x ~ 10 - d = .99 for c = 2.58u = .129 em. Thus if we increase the size of the tolerance interval from 0.2 em to 0.258 em. we decrease the number of rejected components from 4.6% to 1~. • Uniform We shall say that an RV xis uniform (or uniformly distributed) in the interval (a - c:/2, a -r c/2) if c c I - j(X) = { OC' a-;;SXSCI·T- 2 elsewhere The corresponding distribution is a ramp, as shown in Fig. 4.21. Gamma The RV x has a gamma distribution if f(x) = yxh-le ... U(x) b > 0 c > 0 (4-50) The constant 'Y can be expressed in terms of the following integral: r· na) = Ju •\'a 1e \ d\·• This integral converges for a Clearly (see (3A-I)], 1'(1)= r (4-51) > 0. and it is called the Jo'"e·~'dy= gamma function. I 1 (!) = Jor· -Vy e-''d..,. = 2 r· e-~= dz = 2 · Jo (4-52) y; •·igure 4.21 F!x) /(X) I Uniform c 0 a - c·/2 a + c!2 x 0 a- ct2 a+ c/2 x 106 CHAP. 4 THE RANDOM VARIABLE Replacing a by a + I in (4-51) and integrating by parts, we obtain f(a +I)= a Jo" yo-le-1'dy = af(a) (4-53) This shows that if we know f(a) for I < a < 2 (Fig. 4.22), we can determine it recursively for any a > 0. Note, in particular. that if a = n is an integer, r(n + 1) = nr(n) = n(n - 1) · · · f(l) = n! For this reason, the gamma function is called generalized factorial. Withy = ex, (4-51) yields {" xb-le-•·x dx 'Y lo = 2cb Jo{" yb-le-:v dy = 2cb r(b) (4-54) And since the area of /(x) equals I. we conclude that cb 'Y = f(b) The gamma density has extensive applications. The following special cases are of particular interest. Chi-square f(x) = 1 2"12f xr"I2He-xi2U(x) (i) n: integer Of central interest in statistics. Er/ang /(x) = (n-c" I)! x" .. 1e-rxu(x) n: integer Used in queueing theory, traffic, radioactive emission. Figure 4.22 1 Vi 2 0 SEC. /(.t) 4-3 107 II.I.USTRATIONS F(;o:) Exponential 0 0 X X Jo"igure 4.2..1 f:xponential (Fig. 4.23) /(x) = ce ,.,U(x) F(x) = (I - e •t )Utr) Important in the study of Poisson points. Cauchy We shall introduce this density in terms of the following experiment. A particle leaves the origin in a free motion. Its path is a straight line forming an angle 8 with the horizontal axis (Fig. 4.24). The angle 8 is selected at random in the interval ( -.,/2, .,/2). This specifics an experiment ~ the outcomes of which are all points in that interval. The probability of the event {8 1 ~ 8 ~ fh} equals 8~- lh P{e, ~ 8 ~e.}=-=--_..:....:. • 1r as in (2-46). In this experiment, we define an RV x such that x(8) = a tan fJ Thus x(8) equals the ordinate of the point of intersection of the particle with the vertical line of Fig. 4.24. Clearly, the event {x < x} consists of all outcomes 8 in the interval (-.,/2, cb) where x = a tan cb: hence. 1r } cb + Tr/2 I 1 X F(x) = P{x ~ x} = P { - - ~ 8 ~ cb = = - + - arctan 2 ., 2 1r a Differentiating, we obtain the Cau('hy density: al1r . j (x) = , + ' x- a- (4-55) Figure 4.24 j'(;l:) :o: =-a tan~ Candt} x(O) o~a- a X 108 CHAP. 4 THE RANDOM VARIABLE Binomial We shall say that an RV x has a binomial distribution of order n if it takes the values 0, 1, . . . , n with probabilities = k} = (Z) pkq"- 4 P{x k = 0, 1, . . . 
, n (4-56) Thus x is a discrete type RV, and its distribution F(x) is a staircase function = L (Z) pkq"-k q = 1 - p (4-57) h;, with discontinuities at the points x = k (Fig. 4.25). The density f(x) of x is different from zero only for x = k, and f(k) = ( Z) pkq" 4 k = 0, I, . . . , n F(x) The binomial distribution originates in the experiment ~, of repeated trials if we define an Rv x equal to the number of successes of an event .s4 in n trials. Suppose, first, that ~, is the experiment of the n tosses of a coin. An outcome of this experiment is a sequence t = t •... t, where t; ish or t. We define x such that x(t) = k where k is the number of heads in t. Thus {x = k} is the event {k heads}; hence, (4-56) follows from (3-8). Suppose, next, that ~, is the space of the n repetitions of an arbitrary experiment~ and that .s4 is an event of~ with P(si) = p. In this case. we set x(t) =kif tis an element of the event {d occurs k times}. and (4-56) follows from (3-16). Largen As we have seen in Chapter 3 (De Moivre-Laplace theorem), the binomial probabilities /(k) approach the samples of a normal curve with TJ = np u = vnpq (4-58) Thus [see (3-27)] 1 f(k) = e-lk· ,p)!J1\!jjpq V21rnpq (4-59) Figure 4.25 f(x) F(x) I Binomial n = 25 p = .2 0 s X X SEC. 4-3 ILLUSTRATIONS 109 Note, however, the difference between a binomial and a normal av. A binomial av is of discrete type, and the function f(x) is defined at x = k only. Furthermore, f(x) is not a density; its values f(k) are probabilities. The approximation (4-59) is satisfactory even for moderate values of n. In Fig. 4.25, we show the functions F(x) andf(x) and their normal approximations for n = 25 and p = .2. In this case, Tl = np = 5 and cr = v;;j)q = 2. In the following table, we show the exact values of /(k) and the corresponding values of the normal density N(S, 2). k 0 f(k) 1 2 3 5 4 6 7 8 9 10 11 .004 .024 .071 .136 .187 .196 .163 .111 .062 .029 .012 .004 N(S, 2) .009 .027 .065 .121 .176 .199 .176 .121 .065 .027 .009 .002 Poisson An RV x has a Poisson distribution with parameter a if it takes the values 0, I. 2, . . . with probability P{x = k} = e ·u a" k! k = 0, 1. 2, . (4-60) The distribution of x is a staircase function k F(x) = e-u L !!.. ksc (4-61) k! The discontinuity jumps of F(x) form the sequence (Fig. 4.26) /(k) = e-u a" k! k = 0, I, 2, . . . (4-62) depending on the parameter a. We maintain that if a < I, then /(k) is maximum fork= 0 and decreases monotonically ask~ :lC, If a> 1 and a is l'igure 4.26 f(x) F(x) I Poisson a= 1.5 0 0 '. 110 CHAP. 4 THE RANDOM VARIABLE not an integer, /(k) has a single maximum fork = [a). If a > I and a is an integer,f(k) has two maxima: fork. = a - I and fork = a. All this follows readily if we form the ratio of two consecutive terms in (4-62): f(k - I) a4 1(k - I)! k /(k) = a*lk! = a Large a If a >> I, the Poisson distribution approaches a normal distribution with 71 =a, cr = Va: al. I e a - == - - e-lk-ari2a a >> I (4-63) k! v1mi This is a consequence of the following: The binomial distribution (4-55) approaches the Poisson distribution with a = np if n >> I and p << I [see Poisson theorem (3-43) 1. If n >> I, p << I, and np >> I, both distributions tend to a normal curve, as in (4-63). Poisson Points In Section 3-4 we introduced the space · points specified in terms of the following properties: I. of Poisson The probability that there are k points in an interval (1 1 , length lu = /2 - 11 equals ·---rr- e -At 2. ~,. (Ata)k k = 0.1,... 12 ) of 4 64) (- where A is the "density" of the points. 
If (1,, 12) and (13, l4) are two nonoverlapping intervals, the events {ka points in (I,, t2)} and {kh points in (1 3 • /4)] are independent. Given an interval (1 1 , t 2 ) as here, we define the RV x as follows: An outcome Cof r;J,. is an infinite set of points on the real axis. If k of these points are in the interval (1 1 , 12), then x(C) = k. From (4-64) it follows that this RV is Poisson-distributed with parameter a = Ala where A is the density of the points and Ia = l2 - r,. In the next example we show the relationship between Poisson points and exponential distributions. Example 4.11 Given a set of Poisson points, identified by dots in Fig. 4.27. we select an arbitrary point 0 and denote by w the distance from 0 to the first Poisson point to the right ofO. We have thus created an RV w depending on the set of the Poisson points. We Fapre4.1.7 e Poisson point .. . . . .. ~ ( 0 w •• • SEC. 4-3 Ill ILLUSTRATIONS .... maintain that the RV w has an exponential density: J,.(w) = "e-h"U(w) F.,.(w) = (J - e-A~ )U(w) .(4-65) .. • Proof. It suffices to find the probability P{w s w} of the event {w s w} where·•v is a specified positive number. Clearly. w s w iff there is at least one Poisson Point in the interval (0, w). We denote by x the number of Poisson points in the interval (0, w). As we know, the RV xis Poisson-distributed with parameter""'· Hence, P{x = 0} = e A~ w > 0 And since {w s w} = {x ~ 1}, we conclude that F,..(w) = P{w s w} = P{x ~ I} = I - P{x == O} = I - e-~ ..Differentiating, we obtain (4-65). • Geometric An RV x has a geometric diJtrihution if it takes the values I. 2, 3, . . . with probability P{x = k} = pq'· 1 k = I, 2. 3.... where q = I - p. This is a geometric sequence. and ± pqk-• ' . . (4-66) = _P_ = I • I - q The geometric distribution has its origin in the following application or"Ber-· noulli trials (see Example 3.5): Consider an event at of an experimint ~with P(~) = p. We repeat~ an infinite number of times. and we denote· by x the number of trials until the event .sll. occurs for the first time. Clearly, x i~ an RV defined in the space 9':r. of Bernoulli trials. and. as we have shown in (3-1 f), it has a geometric distribution. 4-1 Hypergeometric The RV x has a lrypergeometric distribution if it takes the values 0, I, . . . , n with probabilities ( K)(N k P{x K. ,:=k) = k} = - - - - (~) k = 0. I. . . . • n (4-67) where N. K, and n are given numbers such that llSKsN Example 4.12 Example 4.13 A set contains K red objects and N - K black objects. We select n s K of these objects and denote by x the number of red objects among the n selections. As we see from (2-41). the RV x so formed has a hypergeometric distribution. • We receive a shipment of 1,000 units, 200 of which are defective. We select at random from this shipment 25 units, test them, and accept the shipment if the defective units are at most 4. Find the probability p that the shipment is accepted. 112 CHAP. 4 THE RANDOM VARIABLE The number of defective components is a hypergeometric av x with N = 1,000 K = 200 n =25 The shipment is accepted if x s 4; hence, ±(2~) (25~ k) 4 p = L P{x = k} = •·o •=o {I~) = .419 This result is used in quality control. • 4-4 Functions of One Random Variable Recall from calculus that a composite function y(t) = g(x(t)) is a function g(x) of another function x(t). The domain of y(t) is the t-axis. A function of an RV is an extension of this concept to functions with domain a probability space~. 
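Before taking up functions of one random variable, the acceptance probability of Example 4.13 is worth a quick numerical check against (4-67). The following Python sketch is illustrative only; the helper p_hyper is ours.

```python
from math import comb

# Example 4.13: N = 1000 units, K = 200 defective, n = 25 tested; the shipment
# is accepted if the number x of defective units among the 25 is at most 4.
N, K, n = 1000, 200, 25

def p_hyper(k):
    """Hypergeometric probability P{x = k} of equation (4-67)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

p_accept = sum(p_hyper(k) for k in range(5))
print(round(p_accept, 3))   # approximately .42, in agreement with .419 in the text
```

The binomial approximation with p = K/N = .2 and n = 25 gives nearly the same value, about .42, as expected since n is small compared with N.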
Given a function g(x) of the real variable x and an av x with domain the space ~, we form the composite function y(C) = g(x(C)) (4-68) This function defines an RV y with domain the set ~. For a specific C; E ~, the value y(C;) of the RV y so formed is given by y; = g(x;), where x; = x(C;) is the corresponding value of the RV x (Fig. 4.28). We have thus constructed a function y = g(x) of the RV x. Distribution of g(x) We shall express the distribution F,(y) of the RV y so formed in terms of the distribution Fx(x) of the RV x and the function g(x). We start with an example. Faaure 4.28 y= g(x) X SEC. 4-4 FUNCTIONS OF ONE RANDOM VARIABLE ((3 We shall find the distribution of the Rv y = x2 starting with the determination of F,.(4). Clearly. y s 4 iff- 2 s x s 2; hence, the events {y s 4} and {-2 s x s ·2} are equal. This yields Fr(4) = P{y s 4} = P{-2 s x :52}= f~,(2)- f~,(-2) We shall next find Fv(-3). The event {y s -3} consists of all outcomes such that y<C> s -3. This event has no elements because y(C) = x2(C) ;:::- 0 for every C; hence. Fv(-3) = P{y s -3} = P(0) = 0 Suppose, finally. that y ;:::- 0 but is otherwise arbitrary. The event {y s :d consists of all outcomes Csuch that the values y( C> of the RV yare on the portion of the parabola g(x)· = :r 2 below the horizontal line L, of Fig. 4.29. This event consists of all outcomes Csuch that {x~<C> s y} where, we repeat, y is a specific positive number. Hence, if y ;:::- 0. then f~,(y) = P{y s y} = P{-yY s xs yY} = F,(Vy)- F,{-yY) (4-69) If y < 0. then {y s y} is the impossible event, and F,.(y) = P{y s y} = 1'<0> = 0 (4-70) With F,.(y) so determined. the density f,.(y) of y is obtained by differentiating F,.(y). Since . dF,,(yY) =_I_ (,(Vv) d,... 2 . vv· · · t1gure 4.29 f~(X) I 0 -yy 0 ..fi ~-v;~xsv;--j .\' X )' 114 CHAP. 4 THE RANDOM VARIABLE (4-69) and (4-70) yield f...< .") Example 4.14 _I_J.<VV> +_I_ r<- = { ~ vY ' . 2 y>O vY J' (4-71) y<O Suppose that }; (x) A = _ I _ t'-.~:2a~ v'21T and y= x2 (T Sincef.(-x) = /.(x). (4-71) yields f..(y) . = _1_/.(vY)U(y) = Vy (T V'I21iY t'-~·J2a!U(y) • We proceed similarly for an arbitrary g(x). To find F,.(y) for a specific y, we find the set Ir of points on the x-axis such that g(x) y. If x e Ir, then g(x) is below the horizontal line Ly (heavy in Fig. 4.30) andy s y. Hence, F,(y) = P{y s y} = P{x E I,} (4-72) Thus to find F,.(y) for a specific y, it suffices to find the set I,. such that g(x) s y and the probability that x is in this set. For y as in Fig. 4.30, Ir is the union of the half line x s x 1 and the interval x2 s x s x 3 ; hence, for that value of y, s F1(y) = P{x E I,} = P{x s = FA(x,) + x1} + P{x2 s x s x3 } Fx(XJ) - FA(x2) The function g(x) of Fig. 4.30 is between the horizontal lines y = Ya and y = y,: Yb < g(x) < Ya for every x (4-73) Since y(C) is a point on that curve, we conclude that if y > Ya, then y(C) s y for every C e ~and if y < y,, then there is no C such that y(C) s y. Hence, tlpre4.30 X -----------Yb -------------- SEC. 4-4 115 FUNCTIONS OF ONF. RANDOM VARIABLE . {P(~) = I f.,.(y) = P(0) = 0 .V > .Vu Y < Yh (4-74) This holds for any g(x) satisfying (4-73). Let us look at several examples. In all cases, the RV x is of continuous type. The examples are introduced not only because they illustrate the determination of the distribution Fv(Y) of the RV y = g(x) but also because the underlying reasoning contributes to the understanding of the meaning of an RV. In the determination of Fv(y), it is essential to differentiate between the RV y and the number y. Illustrations I. 
Linear transformation. (See Fig. 4.31). (a) g(x) = X 2+ 3 Clearly, y s y itT x s Xa = 2y - 6 (Fig. 4.31 a); hence, F1 (y) = P{x s x,} = F,(2y - 6) = --X2 . . 8 In this case, y s y itT x ~ xh = 16 - 2y (Fig. 4.31b); hence, F1 (y) = P{x ~ Xb} = I - f~(l6 - 2y) (b) 2. g(x) Limiter. (See Fig. 4.32.) g(x) = RV { a x >a x -as x sa -a x <-a If y >a, then {y s y} = ~:hence, F_v(Y) = I. If -as y sa, then {y s y} = {x s .v}: hence, F1 (y) = f~(y). If y < -a, then {y s y} = 0; hence, F,.(y) = 0. In this example, F_v(Y) is discontinuous at y = :ta; hence, the y is of the mixed type, and P{y = a}= P{x ~a} = 1 - f~,(a) P{y = -a}= P{x s -a}= Fx(-a) •·igure 4.31 (a) (b) 116 CHAP. 4 THE RANDOM VARIABLE g(x) X -a 0 x a -a 0 a y Fagure 4.31 3. Dead zone. (See Fig. 4.33.) x>c -c: s x s c { X+ C X< -c: lfy ~ 0, then {y s y} = {x s y + c}; hence, Fv(Y) = Fx(Y +c). lfy < 0, then {y s y} = {x s y- c}; hence, F1(y) = Fx(Y- c). X-C: g(x) = 4. 0 Again, F_,(y) is discontinuous at y = 0; hence, the mixed type, and P{y = 0} = P{-c <X< c} = Fx(c) - Fx(-c) Discontinuous transformation. (See Fig. 4.34.) RV y is of +C X ~ 0 x-c: x<O lfy ~ c, then {y s y} = {x s y- c}; hence, F 1 (y) = Fx(Y- c). If -c < y < c, then {y s y} = {x s 0}; hence, F 1 (y) = Fx(O). If y < -c, then {y s y} = {x s y + c}; hence, F_,.(y) = Fx(Y +c). In this illustration, we select for g(x} the distribution f"_.(x) of the RV x (Fig. 4.35). We shall show that the resulting Rv y = F...(x) is uniform in the interval (0, I) for any Ft(x). The function g(x) = f".r(x) is between the lines y = 0 andy = I. (X) = {X g 5. Figure 4.33 g(x) F(x) -c 0 c X 0 y SEC. 4-4 FUNCTIONS OF ONE RANDOM VARIABLE -c.- 0 0 J J7 c Figure 4.34 Hence [see (4-74)), = {~ y>l (4-75) y<O If 0 s; y s; I, then {y s; y} = {x s; x1 } where x1 is such that f~,(x1 ) = y. Thus x1 is they-percentile of x, and F_v(Y) = P{x s Xy} = y 0 s y s; I (4-76) Illustration 5 is used to construct an RV y with uniform distribution starting from an RV x with an arbitrary distribution. Using a related approach, we can construct an RV y = g(x) with a specified distribution starting with an RV x with an arbitrary distribution (see Section 8-3). F_,.(y) Density of g(x) If we know F,.(y ), we can find fv<Y) by differentiation. In many cases, however, it is simpler to express the density f,(y) of the RV y - g(x) directly in terms the density f,(x) of x and the function g(x). To do so, we form the equation (4-77) g(x) = Y where y is a specific number, and we solve for x. The solutions of this equation are the abscissas x; of the intersection points of the horizontal line L, (Fig. 4.36) with the curve g(x). l'igurc 4.35 g(x) = l·"x(xl Fx(X) F<xrl y I 118 CHAP. 4 THE RANDOM VARIABLE dy -L, + dy .,_YJI+------++---~or-Ly Figure 4.36 • Fundamtntlll Thtonm. For a specific y, the density /y(y) is given by I' ( ) Jy y = fx(x,) lg'(x,)l + ... - fx(x;) + ... lg'(x;)l =~ fx(x;) ~ jg'(x;)l where x; are the roots of (4-77): y = g(x1) = · · · = g(x;) = · · · and g'(x;) are the derivatives of g(x) at x = x;. (4-78) (4-79) • Proof. To avoid generalities, we assume that the equation y = g(x) has three roots, as in Fig. 4.36. As we know [see (4-32)], /y(y) dy = P{y < y < y + dy} (4-80) Clearly, y is between y and y + dy iff x is in any one of the intervals (Xt, Xt + dxt) (X2 - ldx2l• X2) (X3, X3 + dx3) where dx, > 0, dx2 < 0, dx3 > 0. Hence, P{y < y < y + dy} = P{x, < x < Xt + dx,} + P{x2 - ldx2! 
< x < x2} + P{x3 < x < X3 + dx3} <+81) This is the probability that y is in the three segments of the curve g(x) between the lines L, and L1 + dy. The terms on the right of (4-81) equal P{x1 < x < Xt + dx 1} = fx(x,)dx, P{x2- ldx2l < x < x2} P{x3 < x < X3 + dx3} = fx(x2>ldx2! = fx(x3) dx3 I dx, = ~( )dy g Xt dx2 = ~( g X2 )dy dx3 1 = ~( g X3 )dy 1 Inserting into (4-81), we obtain I' ( ) d = /x(x,) d + ./x(x2) d + fx(x3) dy JY y y g'(x,) y lg'(x2>l y g'(x3) and (4-78) results. SEC. 4-4 FUNCTIONS OF ONE RANDOM VARIABJ.E 119 Thus to findj;.(y) for a specific y. we find the roots x; of the equation y = g(x) and insert into (4-78). The numbers x; depend. of course, on y. We assumed that the equation y = g(x) has at least one solution. If for some values of)' the line L,. docs not intersect g(x). thcnfvCY) = 0 for these values of y. Note, finally. that if gCx) = y 0 for every x in an interval (a, b) as in Fig. 4.33, then F_,.(y) is discontinuous at y = Yo· Illustrations I. gCx) = ax + h The equation J' = ax+ b has a single solution x 1 = (y - b)/a for every)'. Furthermore. g'Cx) = a: hence. I I /v(}') = lctJr(XJ) = 'lli fr - t l (4-82) (y-b) ThusJ;.(y) is obtained by shifting and scaling the density Jx<x> of x. I g (x) =- 2. X The equation y = 1/x has a single solution x 1 = Jly for every y. Furthermore, -I g '(x) = -, r Hence, (4-83) 3. = ax2 a > 0 If y > 0, the equation y = ax 2 has two solutions: x 1 = yYiQ g(x) and x2 = -ffa (fig. 4.29). Furthermore, = 2ax g'(x1) = 2 vUY g'Cx2) = g'(x) -2 vQY Hence, .t;<y> I = IY· vayf. ( "Va) 2 + 'Y I vayfr (- ~a) 2 <4-84) If y < 0, the equation y = ax 2 has no real roots; hence,J;.(y) = 0 [see also (4-71)]. Example 4.15 xis uniform in the interval (5, 10). as in Fig. 4.37. andy = 4x2• In this case. J.C- vY74) = 0 for every .v > 0 and The RV f. ( ~)- L2 Hence. J;.(y) = r ';:' fV s < -v 4 < 10 elsewhere 100 < y < 400 elsewhere • 120 CHAP. 4 THE RANDOM VARIABLE f,.(y) .OS .oos s 0 0 400 100 y Figure 4.37 4. g(x) = sin x I, the equation y = sin x has no real solutions: hence. /,(y) = 0. If IYI < I, it has infinitely many solutions X; y = sin X; x; =arcsin y Furthermore, If IYI > g'(x;) = cos x; = VI - sin2 :c; = v"i""="? and (4-78) yields f,(y) I ~ = v'J=Y2 ~ /.(x;) (4-85) where fx(x;) are the values of the density /.(x) of x (Fig. 4.38). Suppose now that the RV x is uniform in the interval ( -7r. 1r ). In this case, -1T <X;< 1T Fipre4.38 g(x) 1 X fr -1 0 X1 X 0 y .... X SEC. 4-5 MEAN AND VARIANCE 121 g(x) 0 -s X -s o2 6 x 0 4 7 9 y l'igurc 4.39 for any y; hence, the sum in (4-85) equals 212Tr. This yields lhr J;<y> = { 5. :1 - y2 ·yl < Yl I (4-86) >I g(x) = fAx> as in Fig. 4.35. Clearly, g(x) is between the lines y = 0 andy = I; hence,.fv(y) = 0 for y < 0 andy> I. For 0 s y s I. the equation y = Fx(x) has a single solution x 1 = X~- where x~. is they-percentile of the RV x; that is. y = F(x,.). Furthermore. if'(x) = F.~(x) = fx(x); hence, Osysl Thus the RV y is uniform in the interval (0, I) for any f~,(x) [see also (4-76)]. Point Masses Suppose that x is a discrete type Rv taking the values x" with probability p 4 • In this case, the RV y = g(x) is also of discrete type. taking the values y 4 = g(x4). If y 4 = g(x4) only for one x 4 • then P{y = y,} = P{x = xd = P4 =- frC:cd If, however, y = Y• for x = Xa and x = xh-that is. 
if y, = g(xu) = g(xh)-then the event {y = Y•} is the union of the events {x = xu} and {x = xh}; hence, P{y = yk} = P{x = Xu} + P{x = Xh} = p" + Ph Note, finally, that if x is of continuous type but g(x) is a staircase function with discontinuities at the points x 4 , then y is of discrete type, taking the values y 4 = g(xd. In Fig. 4.39. for example, P{y = 7} = P{2 < x s 6} = f,(6) - F,(2) = .3 This is the discontinuity jump of F~.(y) at y = 7. 4-5 Mean and Variance The properties of an RV x are completely specified in terms of its distribution. In many cases, however, it is sufficient to specify x only partially in terms of 122 CHAP. 4 THE RANDOM VARIABLE certain parameters. For example. knowledge of the percentiles xu of x, given only in increments of .I, is often adequate. The most important numerical parameters of x are its mean and its variance. Mean The mean or expected value or statistical average of an RV x is, by definition, the center of gravity of the probability masses of x (Fig. 4.40). This number will be denoted by E{x} or Tlx or.,. If x is of continuous type with density f(x) (distributed masses), then E{x} = r. xf(x) dx (4-87) I If x is of discrete type taking the values x 4 with probabilities p~;. = f(x4 ) (point masses), then (4-88) A constant c can be interpreted as an RV x taking the value c for every (. Applying (4-88), we conclude that E{c} = cP{x = c} = c Empirical Interpretation. We repeat the experiment n times and denote by .t1 the resulting values of the RV x [see also (4-23)]. We next form the arithmetic average XJ + ... +X11 ~ I (4-89) X= = ~ Xj- n 1 n x of these numbers. We maintain that tends to the statistical average E{x} ofx as n increases: (4-90) E{x} n large x ""' FIIUJe 4.40 fl =f xf(x) dx X SEC. 4-5 MEAN AND VARIANCE 123 We divide the x-axis into intervals of length .1 as in Fig. 4.17 and denote by n~ the number of X; between £'4 and c:4 + a. If .1 is small. then X; == c 4 for ck s X; s c4 - .1. Introducing this approximation into (4-M9). we obtain - = '\.... ~ x 4 ,, =- ,... c·,j.,kd . ~ C't - II <4-91) j. 4 where/,(cd = n41n.1 is the histogr.tm ofx (sec <4-36)1. As we have shown in (4-37),/,(xl = .f(x) for small a; hence. the last sum in (4-91) tends to the integral in (4-871 as n -+ :x: this yields (4-901. We note that x;ln is the area under the empirical percentile curve in an interval of length 1/n (fig. 4.41): hence. the sum in (4-891 equals the total area of the empirical percentile. From the foregoing it follows that the mean E{x} of our RV x equals the algebraic area under its percentile curve x,. namely, I.... xf<x>dx = Jo(' x,du Ftr,) =u Interchanging the x and F(x) axes, we conclude that £{x} equals the difference of the areas of the regions ACD and OAB of Fig. 4.41. This yields £{x} = I: R(x) dx - J:,, F(.r) dx (4-92) where R(x) = I - F(x). Equation (4-92) can be established directly from (4-87) if we use integration by parts. Symmetrical Densities If f<x> is an even function, that is, if!<- x) = f(x), then E{x} = 0. More generally, if f<x) is symmetrical about the number a, that is, if f<a + x) = f<a - x) then E{x} = a This follows readily from (4-87) or, directly, from the mass interpretation of the mean. A density that has no center of symmetry is called skewed. Figure 4.41 X; B 0 X E(x}= ACD- DAB 124 CHAP. 4 THE RANDOM VARIABLE MEAN OF g(x) Given an RV x and a function g(x), we form the RV y = g(x) As we see from (4-87) and (4-88), the mean of y equals E{y} = f. 
L y,./y(y•> or yJ;.(y)dy (4-93) l for the continuous and the discrete case, respectively. To determine E{y} from (4-93), we need to find/y(y). In the following, we show that E{y} can be expressed directly in terms of fx(x) and g(x). • Fundamental Theonm E{g(x)} = r. (4-94) or g(x)fx(x)dx • Proof. The discrete case follows readily: If y 4 -= g(x,.) for only a single .t~;, then L YkP{y = Yk} = L g(xk)P{x = x,.} If Yk = g(x•> for several x., we add the corresponding terms on the right. To prove the continuous case, we assume that g(x) is the curve of Fig. 4.36. Clearly, {y < y < y + dy} = {XJ < X < XJ + dx} U {X2 < X < X2 + dx2 } U {XJ < X < X3 + dx3} where, in contrast to (4-81), all differentials are positive. Multiplying by y = g(x 1) = g(x2 ) = g(x3 ) the probabilities of both sides and using (4-81), we obtain y/y(y) dy = g(x,)fx(x,) dx, + g(x2>fx<x2> dt2 + g(xJ)fx(xJ) dx3 Thus to each differential of the integral in (4-93) corresponds one or more nonoverlapping differentials of the integral in (4-94). As dy covers they-axis, each corresponding dx covers the x-axis; hence, the two integrals are equal. We shall verify this theorem with an example. Example 4.16 The RV xis uniform in the interval (I. 3) (Fig. 4.42), andy= 2x + I. As we see from (4-82), .t;.(y) =~f. (Y; I) From this it follows that y is uniform in the interval (3, 7) and E{y} = r. yf,.(yJ dy = .25 ~ y dy = 5 This agrees with (4-94) because g(x) = 2K + I and f" g(x)/.(x) dx = .5 J Ch + 3 -x I I) dx = 5 • SI::C. fx(X) 4-5 MEAN AND VARIANCE 125 fv(Y) .5 y = 2x ... I ,... .~5 I 0 I 'lx 3 0 X 3 y 7 'ly l'igure 4.42 Unearity From (4-94) it follows that £{ax} =a £{g,(x) - · · · + g,.(x)} r• xJ.~(X) dx = a£{x} (4-95) f. = [g,(x) + · · · - g,.(x)jft(x) dx = £{.1(J(X)} - • · · .,. £{.1( (X)} (4-96) 11 Thus the mean of the sum of n functions g,{x) of the RV x equals the sum of their means. Note, in particular, that E{h} = band (4-97) E{ax + b} = aE{x} + b In Example 4.16, Tl.t = 2 'and Tl:v = 5 = 2.,.. + I, in agreement with (4-97). Variance The variance or dispersion of an RV x is by definition the central moment of inertia of its probability masses. This number is denoted by u 2• Thus for distributed masses, '\ (4-98) and for point masses, (4-99) From (4-98) and (4-94) it follows that u 2 = £{(x - "1)2} = £{x2 } - 2.,E{x} T .,z 126 CHAP. 4 THE RANDOM VARIABLE where 7J = E{x}. This yields = cr, + rr, ,_.{ x-'} c. (4-100) This has a familiar mass interpretation: Clearly, E{x2 } is the moment of inertia of the probability masses with respect to the origin and 71 2 is the moment of inertia of a unit point mass located at x = TJ· Thus (4-100) states that the moment of inertia with respect to the origin equals the sum of the central moment of inertia, plus the moment with respect to the origin, of a unit mass at x = 7J. In probability theory. E{x} is called the first moment, E{x2 } the second moment, and u 2 the second central moment (see also Section 5-2). The square root u of u 2 is called the standard deviation of the RV X. We show next that the variance u~ of the u~ We know that TJ, u~ = E{(y - RV y = ax + b equals = a u~ 2 = a7Jx + b; hence, .,,)2 } = E{[(ax + b) - (4-101) (a1Jx + b)F} = E{a2(x - 1Jx)2 } and (4-101) follows from (4-95). Note, in particular, that the variance of x + b equals the variance x. This shows that a shift of the origin has no effect on the variance. As we show in (4-113), the variance is a measure of the concentration of the probability masses near their mean. 
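The theorem (4-94) and the properties (4-97) and (4-101) can also be checked by simulation for Example 4.16. The sketch below is illustrative only and assumes NumPy is available; the sample size and the Riemann sum are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Example 4.16: x uniform in (1, 3), y = 2x + 1, so f_x(x) = 1/2 on (1, 3).
x = rng.uniform(1.0, 3.0, size=100_000)
y = 2 * x + 1

# Empirical averages, as in (4-89) and (4-90).
print(round(float(y.mean()), 2))                                 # about 5.00 = E{y}
print(round(float(y.var()), 2), round(float(4 * x.var()), 2))    # about 1.33 each, per (4-101)

# E{g(x)} from (4-94): a crude Riemann sum of g(x) f_x(x) over (1, 3).
grid = np.linspace(1.0, 3.0, 2001)
dx = grid[1] - grid[0]
print(round(float(np.sum((2 * grid + 1) * 0.5) * dx), 2))        # about 5.0
```

The printed variances illustrate (4-101) with a = 2; they also show the masses of y concentrated near η_y = 5.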
In fact, if σ = 0, all masses are at a single point, because then f(x) = 0 for x ≠ η [see (4-98)]. Thus

if σ = 0   then   x = η = constant   (4-102)

in the sense that the probability that x ≠ η is zero. Note, in particular, that

if E{x²} = 0   then   x = 0   (4-103)

Illustrations   We determine next the mean and the variance of various distributions. Note that it is often simpler to determine σ² from (4-100).

Normal   The density of a standard normal RV z is even; hence, η_z = 0. We shall show that the variance of z equals 1:

σ_z² = (1/√(2π)) ∫_{−∞}^{∞} z² e^{−z²/2} dz = 1   (4-104)

To do so, we differentiate the identity [see (3A-1)]

∫_{−∞}^{∞} e^{−az²} dz = √π · a^{−1/2}
we conclude that g(x) Since £{(x - g i·' ( "( ) x- 1}.,)2 (4-112) (4-113) This is an approximation based on the truncation (4-112) of the Taylor series expansion of g(x) about the point x = Tic. The approximation can be further improved if we include more terms in (4-112). The result, however, involves knowledge of higher-order moments of x. Example 4.17 g(x) = X-I In this case, , 2 g (X)= - xl I N("'.t) = "'• , 2 NI7J.,)"'- \ 71.\ 130 CHAP. 4 THE RANDOM VARIABLE and (4-113) yields • TCHEBYCHEFFS INEQUALITY The mean 71 of an RV x is the center of gravity of its masses. The variance cr2 is their moment of inertia with respect to.,. We shall show that a mcijor proportion of the masses is concentrated in an interval of the order of cr centered at .,. Consider the points a kcr, b + kcr of Fig. 4.44 where k is an arbitrary constant. If we remove all masses between a and b (Fig. 4.44b) and replace the masses p, = P{x ~ ., - kcr} to the left of a by a point mass p 1 at a and the masses P2 = P{x ~ 71 + kcr} to the right of b by a point mass p 2 at b (Fig. 4.44c). the resulting moment of inertia with respect to ., will be smaller than cr 2 : =., - =., Hence, I Pt + P2 ~ kl From this it follows that I P{lx- Til~ kcr} ~ k 2 (4-114) or, equivalently, I P{71 - kcr < x < 71 + kcr} ~ I - k 2 (4-115) where equality holds iff the original masses consist of two equal point masses. We have thus shown that the probability masses p 1 + p 2 outside the interval (71 - kcr, 71 + kcr) are smaller than llk2 for any k and for any F(x). Note, in particular, that if llk2 = .05-that is, if k = 4.47-then at most 5% of all masses are outside the interval 71 ± 4.47cr regardless of the form of Figure 4.44 b =f1 + ka a =11- ka a ., (a) X ~ a ., {b) 1'--- . b X ··1 ., I a {c) 1P2 b .. .t PROBLEMS f(y) c 13 1 =Aq p c X c 0 c 0 X (b) (a) (c) Figure 4.45 F(x). If F(x) is known, tighter bounds can be obtained; if, for example, xis normal, then Lsee (4-50)) 5% of all masses are outside the interval., ::t 1.96u. Markoff's Inequality Suppose now that x =:::: 0. In this case, F(x) = 0 for x < 0; hence, all masses are on the positive axis. We shall show that a major proportion of the masses is concentrclted in an interval (0, c) of the order of.,. Consider the point c: = ATJ of Fig. 4.45 where 11. is an arbitrary constant. If we remove all masses to the left of c (Fig. 4.45b) and replace the mass,es p = P{x =:::: .\71} to the right of c by a point mass pat c, the moment with respect to the origin will decrease from ., to pc:; hence, pATJ < .,. This yields P{x =:::: ATJ} I <A (4-116) for any A and for any F(x). Note that (4-114) is a special case of (4-116). Indeed, the RV (x - 71) 2 is positive and its mean equals cr 2• Applying (4-116) to the RV (x - 71) 2, we conclude with A = k2 that P{lx - 7112 =:::: k2u2} s ~2 and (4-114) results because the left side equals P{lx - TJI =:::: kcr}. Problems 4-1 The roll of two fair dice specifies an experiment with 36 outcomes.fiJi.. The RVS x and y are such that ifi+k=1 x<Ji.li.) = i ... k y(JiJi.) = { ifH·k=11 and y(JiJi.) = 0 otherwise. (a) Find the probabilities of the events {6.5 < x < 10}, {x > 2}, {5 < y < 25}, {y < 4}. (b) Sketch the distributions Fx(x), Fv(y) and the point densitiesfx~),/y(y). · The RV xis normal with., = 10, cr = 2. Find the probabilities of the following events: {x > 14}, {8 < x < 14}, {lx - 101 < 2}. {x - 5 < 7 < x - 1}. 2~ 4-l 132 CHAP. 
4 THE RANDOM VARIABLE 4-3 4-4 4-5 4-6 4-7 -17 -9 -7 -6 -2 4-8 4-9 4-10 4-11 4-1.2 4-13 4-14 The SAT scores of students in a school district are modeled by a normal av with TJ = 510 and cr = SO. Find the percentage of students scoring between 500 and 600. The distribution ofx equals F(x) = (I - e- 21 )U(x). Find P{x > .5}, P{.3 < x < .7}. Find the u-percentiles ofx for u = .I, .2, . . . , .9. Given F(S) = .940 and F(6) = .956, find the x0..,, percentile of x using linear interpolation. If F(x) = x/5 for 0 s x s 5, find j(x) for all x. Find the corresponding upercentiles x,. for u = .I, .2, . . . , .9. The following is a list of the observed values xi of an RV x: 3 6 II 14 18 23 26 31 34 35 37 39 42 44 48 51 53 61 Sketch the empirical distribution F,.(x) and the percentile curve x,. of x. Sketch the histogramf,(x) for A = Sand A = 10. The life length of a pump is modeled by an av x with distribution F(x) = (I e-•·• )U(x). The median of x equals x0.5 = 50 months. (a) Under a warranty, a pump is replaced if it fails within five months. Find the percentage of units that are replaced under the warranty. (b) How long should the warranty last if under the warranty only 3% of the units are replaced? (a) Show that if/(x) = c:lxe-r"U(x), then F(x) =(I - e-rx- cxe-.-.)U(x). (b) Find the distribution of the Erlang density (page 106). The maximum electric power needed in a community is modeled by an av x with density c:lxe-r•U(x) where c = 5 x 10· 6 per kilowatt. (a) The power available is 106 kw; find the probability for a blackout. (b) Find the power needed so that the probability for a blackout is less than .005. We wish to control the quality of resistors with nominal resistance R = I ,ooon. To do so, we measure all resistors and accept only the units with R between 960 and I ,040. Find the percentage of the units that are rejected. (a) If R is modeled by an N(IOOO, 20) av. (b) If R is modeled by an av uniformly distributed in the interval (950, I ,050). The probability that a manufactured product is defective equals p. We test each unit and denote by x the number of units tested until a defective one is found. (a) Find the distribution of x. (b) Find P{x > 30} if p = .OS. We receive a shipment of200 units, of which 18 are defective. We test 10 units at random, and we model the number k of the units that test defective by an RV x. (a) Find the distribution of x. (b) Find P{x = 2}. (Pascal distribution) The av x equals the number of tosses of a coin until heads shows k times. Show that P{x = k} = (~ =: :) p"q"··• 4-15 The number of daily accidents in a region is a Poisson RV x with parameter a = 3. (a) Find the probability that there are no accidents in a day. (b) Find the probability that there are more than four accidents. (c) Sketch F(x). 4-16 Given an N(l, 2) av x and the functions (Fig. P4.16) g,(x) = {xo lxl s 2 lxl > 2 gz(x) = { ~(s2 ; 2 -2 - x < -2 gl(x) = { -II x<O x>O PROBLEMS 12<x> It (X) 133 IJ(X) 2 -2 0 / 2 0 X X -2 l'igure P4.16 4-17 4-18 4-19 4-10 4-11 4-ll we form the avs y = g 1(x), z = g 2(x). w = g 3(x). (a) Find their distributions. (b) Find the probabilities of the events {y = 0}, {z = 0}. and {w = 0}. The RV xis uniform in the interval (0, 6). Find and sketch the density of the RV y = -2x + 3. The input to a system is the sum x = I0 - 11 where 11 is an N<O, 2) RV, and the output is y = x2• Findf.ly) and F_,.(y). 
Express the density /y(y) ofthe RV y = g(x) in terms of the density f,(x) ofx for the following cases: (a) g(x) = x 3; (b) g(x) = .~; (c) g(x) = 1xl (full-wave rectifier); (d) g(x) = xU(x) (half-wave rectifier). The base of a right triangle equals 5, and the adjacent angle is an RV 8 uniformly distributed in the interval (0, ?TI4). Find the distribution of the length b = 5 tan8 of the opposite side. The RV x is uniform in the interval (0. 1). Show that the density of the RV y = - lnx equals e·-'U(y). Lognormal distribution. The RV xis N('l'/. u) andy = e'-. Show that .t .. ) f,•v = uy v'21T I exp {- 2u· __!_, (lm·. - .,.,>} ., This density is called lognormal. 4-13 Given an Rv x with distribution F,(x), we form the RV y 4-14 4-lS 4-16 4-17 4-18 4-19 Uh·) . = 2F,(x) + 3. Findf,.(y) · and F).(y). Find the constants u and b such that if y ....: ax - band '17.• = 5. u, = 2, then.,.,,. = 0. (7'). = I. . A fair die is rolled once. and x equals the number of faces up. Find .,.,_, and u,. We roll two fair dice and denote by x the number of rolls until 7 shows. Find E{x}. A game uses two fair dice. To participate. you pay $20 per roll. You win SIO if even shows, $42 if 7 shows, and $102 if II shows. The game is fair if your expected gain is $20. b the game fair? Show that for any c. E{(x- d} = <'17. - d +IT;. The resistance of a resistor is an RV R with mean t.ooon and standard deviation 200. It is connected to a voltage source V = II OV. Find approximately the mean of the power P = V 21R dissipated in the resistor, using (4-113). 134 CHAP. 4 THE RANDOM VARIABLE 4-30 The RV x is uniform in the interval (9, 11) and y = x 3• (a) Find /y(_v); (b) find TJ.. and crx: (c) find the mean ofy using three methods: (i) directly from (4-93); (ii) indirectly from (4-94); (iii) approximat~ly from (4-113). · / 4-31 The av x is N(TJ, cr), and the function g(x) is nearly linear in the interval TJ 3cr < x < y + 3cr with derivative g'(x). Show that the RV y = g(x) is nearly normal with mean TJy = g(TJ) and standard deviation cr.,. = lg'(TJ)Icr. . · 4-32 Show that if/x(x) = e""U(x) andy= V2i then/,(y) = ye·).z'2U(y). s _ _ __ Two Random Variables Extending the concepts developed in Chapter 4 to two RVS, we associate a distribution, a density, an expected value, and a moment-generating function to the pair (x, y) where x andy are two RVS. In addition, we introduce the notion of independence of two RVs in terms of the independence of two events as defined in Chapter 2. Finally, we form the composite functions z = g(x, y), w = h(x, y) and express their joint distribution in terms of the joint distribution of the RVS x and y. 5-1 The Joint Distribution Function We arc given two RVS x andy defined on an experiment~. The properties of each of these Rvs are completely specified in terms of their respective distributions f~(x) and Fv(y). However, their joint properties-that is, the probability P{(x, y} E D} that the point (x, y) is in an arbitrary region D of the plane-cannot generally be determined if we know only Fx(x) and F,(y). To find this probability, we introduce the following concept. 135 136 CHAP. 5 TWO RANDOM VARIABLES • Definition. The joint mmulative di.ftrihutimzjimctioll or. simply. thejoilll distributio11 Ft,.(x. y) of the RVs x andy is the probability of the event = {x ~ x} n {x s x. y s .v} {y s y} consisting of all outcomes C such that x(C) s x and y(C) s y. Thus Fn(X, y) = P{x < x. y s y} (5-1) The function F.t.,.(x, y) is defined for every x andy: its subscripts will often be omitted. 
The probability P{<x, y) ED} will be interpreted as mass in the region D. Clearly, F(x, y) is the probability that the point <x. y) is in the quadrant 0 11 of Fig. 5.1; hence, it equals the mass in IJ0 • Guided by this. we conclude that the masses in the regions 0 1 • D2. and D3 are given by P{x s x. Y1 < y s J2} P{x, < x s x2. y s y} P{x, < x S X2, Y1 <yS Y2} = F(x2, Y2) = F<x. Y2> = F(x2 • y) - F(x, Yd -· F(x 1 • y) F(x,, Y2) - F(x2 • .Vd - (5-2) (5-3) + F(x1 , yd (5-4) respectively. JOINT DENSITY If the RVs x and y are of continuous type, the probability masses in the xy plane can be described in terms of the function f(x, y) specified by P{x < x s x + dx, y < y s y + dy} = f(x, y)dxdy (5-5) This is the probability that the Rvs x and y are in a differential rectangle of area dxdy. The functionf(x, y) will be called the joint density function of the RVS ll andy. From (5-5) it follows that the probability that (x, y) is a point in an arbitrary region D of the plane equals P{(x, y) CD}= JJ j(x. y)dxd.Y (5-6) /) Applying (5-6) to the region 0 11 of Fig. 5.1. we obtain F(x, y) = f f. /(a.~) & da d/:l (5-71 This yields f Note, finally, that f(x, y) ~ 0 (x t. ) 'Y = iJ2F(x, y) (5-8) ax iJy t.J<x. y)dxdy This follows from (5-5) and the fact that {x s :x;, = F(:x:, :x:) = I (5-9) y s oo} is the certain event. SEC. 5-1 THE JOINT DISTRIBUTION FUNCTION 137 Yt , Figure S.l MAR<IINAL DISTRIBUTIONS In the study of the joint properties of several RVs, the distributions of each are called marginal. Thus Fx(x) is the marginal distribution and h(x) the marginal density of the RV x. As we show next, these functions can be expressed in terms of the joint distribution ofx andy. We maintain that ~-~.(X)= Fx.v<x. :x:) r. = .f~·<x. y)dy The first equation follows from the fact that {x s x} = {x s x, y s h(X) (5-10) :x:} is the event consisting of all outcomes such that the point (x, y) is on the left of the line L .• of Fig. 5.2a. To prove the second equation, we observe that.fx(x)dx = P{x s x s x + dx} is the probability mass in the shaded region llD of Fig. 5.2b. This yields h(x)dx = fJ /xy(x, y)dxdy = d:c J:.J. . (x. y)dy JJ) We can show similarly that F,(y) = f"xy(:x:, y) /,(y) = f.J:y(X. (5-11) y) dx Figure 5.2 l liD xSx - .....-dx ySy (a) (b) 138 CHAP. 5 TWO RANDOM VARIABLES POINT AND LINE MASSES Suppose that the avs x and y are of discrete type taking the values X; and Yk, respectively, with joint probabilities P{x = x,, y = y,} = f(x;, y,t) =Pile (5-12) In this case, the probability masses are at the points (x1, yd andf(x1 Yt) is a point density. The marginal probabilities P{x = x;} = fx(x;) = p; P{y = yk} = /y(yk) = P1r. (5-13) can be expressed in terms of the joint probabilities Ptlr.. Clearly, P{x = X;} equals the sum of the masses of all points on the vertical line x = x;, and P{y = YA:} equals the sum of the masses of all points on the horizontal line y = Ylr.· Hence, (5-14) p;=LPilr. • Note, finally, that L Pi i Example 5.1 = LIt qlr. = Li,lt Pik =I (5-15) A fair die rolled twice defines an experiment with 36 outcomes/Jj. (a) The avs x and y equal the number of faces up at the first and second roll, respectively: x(Ji/j) = i y(Ji,/j) =j i.j ==I, . . . , 6 (5-16) This yields the 36 points of Fig. .5.3a; the mass of each point equals 1136. (b) We now define x andy such that x(Ji/j) = li- il y(Jijj) == i + i Thus x takes the six values 0, I, . . . , .5 and y takes the 11 values 2, 3, . . . , 12. 
The corresponding marginal probabilities equal

p_i = 6/36, 10/36, 8/36, 6/36, 4/36, 2/36 for the values x = 0, 1, . . . , 5

q_j = 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 for the values y = 2, 3, . . . , 12

[Figure 5.3: (a) the 36 equally likely mass points (i, j) of part (a); (b) the 21 mass points (|i − j|, i + j) of part (b).]

In this case, we have 21 points on the plane (Fig. 5.3b). There are six points on the line x = 0 with masses of 1/36; for example, the mass of the point (0, 4) equals 1/36 because {x = 0, y = 4} = {f₂f₂}. The masses of all other points equal 2/36; for example, the mass of the point (3, 5) equals 2/36 because {x = 3, y = 5} = {f₁f₄, f₄f₁}. •

In addition to continuous and discrete type RVs, joint distributions can be of mixed type involving distributed masses, point masses, and line masses of various kinds. We comment next on two cases involving line masses only.

Suppose, first, that x is of discrete type, taking the values x_i, and y is of continuous type. In this case, all probability masses are on the vertical lines x = x_i (Fig. 5.4a). The mass between the points y₁ and y₂ of the line x = x_i equals the probability of the event {x = x_i, y₁ < y ≤ y₂}.

Suppose, finally, that x is of continuous type and y = g(x). If D is a region of the plane not containing any part of the curve y = g(x), the probability that (x, y) is in this region equals zero. From this it follows that all masses are on the curve y = g(x). In this case, the joint distribution F(x, y) can be expressed in terms of the marginal distribution F_x(x) and the function g(x). For example, with x and y as in Fig. 5.4b, F(x, y) equals the masses on the curve y = g(x) inside the shaded area. This includes the masses on the left of the point A and between the points B and C. Hence,

F(x, y) = F_x(x₁) + F_x(x) − F_x(x₂)

Independent RVs   As we recall [see (2-67)], two events 𝒜 and ℬ are independent if

P(𝒜 ∩ ℬ) = P(𝒜)P(ℬ)

The notion of independence of two RVs is based on this.

• Definition. We shall say that the RVs x and y are statistically independent if the events {x ≤ x} and {y ≤ y} are independent, that is, if

P{x ≤ x, y ≤ y} = P{x ≤ x}P{y ≤ y}   (5-17)

[Figure 5.4: (a) line masses on the vertical lines x = x_i; (b) masses on the curve y = g(x).]

for any x and y. This yields

F_xy(x, y) = F_x(x)F_y(y)   (5-18)

Differentiating, we obtain

f_xy(x, y) = f_x(x)f_y(y)   (5-19)

Thus two RVs are independent if they satisfy (5-17) or (5-18) or (5-19); otherwise, they are "statistically dependent."

From the definition it follows that the events {x₁ ≤ x < x₂} and {y₁ ≤ y < y₂} are independent; hence, the mass in the rectangle D₃ of Fig. 5.1 equals the product of the masses in the vertical strip (x₁ ≤ x < x₂) times the masses in the horizontal strip (y₁ ≤ y < y₂). More generally, if A and B are two point sets on the x-axis and the y-axis, respectively, the events {x ∈ A} and {y ∈ B} are independent. Applying this reasoning to the events {x = x_i} and {y = y_k}, we conclude that if x and y are two discrete type RVs as in (5-12) and independent, then

p_ik = p_i q_k   (5-20)

Note, finally, that if two RVs x and y are "functionally dependent," that is, if y = g(x), they cannot be (statistically) independent.

Independent Experiments   Independent RVs are generated primarily by combined experiments. Suppose that S₁ is an experiment with outcomes ζ₁ and S₂ another experiment with outcomes ζ₂. Proceeding as in Section 3-1, we form the combined experiment (product space) S = S₁ × S₂. The outcomes of this experiment are ζ₁ζ₂, where ζ₁ is any one of the elements of S₁ and ζ₂ is any one of the elements of S₂.
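Example 5.1 can be checked by direct enumeration of the 36 equally likely outcomes f_i f_j. The following sketch is an added illustration in Python (not part of the text); it reproduces the 21 mass points of part (b), the two sets of marginal masses found above, and the fact that in part (a) every joint mass 1/36 equals the product of the marginal masses, anticipating the independence of the two rolls discussed next.

```python
# Exhaustive check of Example 5.1 (added illustration, not from the text).
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 pairs, mass 1/36 each

# Part (b): point masses of (x, y) = (|i - j|, i + j)
joint = {}
for i, j in outcomes:
    key = (abs(i - j), i + j)
    joint[key] = joint.get(key, 0) + 1 / 36

print(len(joint), "points carry mass")                          # 21
print(round(36 * joint[(0, 4)]), round(36 * joint[(3, 5)]))     # 1 and 2 (in 36ths)

# Marginal masses p_i = P{x = i} and q_j = P{y = j}
p = [sum(m for (d, s), m in joint.items() if d == i) for i in range(6)]
q = [sum(m for (d, s), m in joint.items() if s == j) for j in range(2, 13)]
print([round(36 * v) for v in p])    # [6, 10, 8, 6, 4, 2]
print([round(36 * v) for v in q])    # [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]

# Part (a): x = i, y = j; every joint mass 1/36 equals P{x = i} P{y = j} = (1/6)(1/6).
```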
In the experiment ~~. we define the Rvs x andy such that x depends only on the outcomes of ~f 1 andy depends only on the outcomes of~2 • In other words. x(~ 1 ~::) depends only on ~~ and y(~ 1 ~2 ) depends only on ~2 • We can show that if the experiments ~1 1 and ~/2 are independent. the RVs x andy so formed are independent as well. Consider, for example, the RVS x andy in (5-16). In this case,~. is the first roll of the die and ~2 the second. Furthermore, P{x = i} = l/6, P{y = k} = 1/6. This yields P = {x = i, y = k} = 3~ = P{x = i}P{y = k} Hence, in the space are independent. ~ of the two independent rolls of a die, the RVs x and y SEC. 5-J 141 THE JOINT DISTRIBUTION FUNCTION ILJ.t.;STRATIONS The Rvs x and y are independent. x is uniform in the interval (0, a), andy is uniform in the interval (0, b). ! fx(x) = { ao {i 0 s x sa otherwise = 0 s; y s; b · 0 otherwise Thus the probability that xis in the interval (x, x + .1-t) equals tufa, and the probability that y is in the interval (y, y + Ay) equals Aylb. From this and the independence of x and y it follows that the probability that the point (x, y) is in a rectangle with sides tu and Ay included in the region R = {0 s x sa, 0 s y s b} equals tuAylab. This leads to the conclusion that h-(y) ...!..b ={ a f(x, y) 0 s x s a. 0 s y s b (5-21) 0 otherwise The probability that (x, y) is in a region D included in R equals the area of D divided by ab. Example 5.2 A fine needle of length cis dropped "at random" on a board covered with parallel lines distanced apart (fig. 5.5a). We wish to find the probability p that the needle intersects one of the lines. This experiment is called Buffon's needle. We shall first explain the meaning of randomness: We denote by x the angle between the needle and the parallel lines and by y the distance from the center of the needle to the nearest line. We assume that the Rvs ll and y are independent and uniform in the intervals (0, Trl2) and (0, d/2), respectively. Clearly. the needle intersects the lines itT Figure S.S x.....,•./ J' I cJ :! I !1- d ----., ""2 stn. .· \' 12 a<c I ~I y I I L.- 1r I (a) I (' 2 (b) X 142 CHAP. 5 TWO RANDOM VARIABLES Hence. to find p, it suffices to find the area of the shaded region of Fig . .S.Sb and use (5-21) with a = 1r12 and b = d/2. Thus 4 c 2c p = 1rd Jo 2 cos x dx = 1rd r··1 De Monte Carlo Metbod from the empirical interpretation of probability it follows that if the needle is dropped n times. the number n; of times it intersects one of the lines equals n; = np - 2nd1rd. This relationship is used to determine 1r empirically: If n is sufficiently large, 1r = 2ncldn;. Thus we can evaluate approximately the deterministic number 1r in terms of averages of numbers obtained from a random experiment. This technique. known as the Monte Carlo method. is used in performing empirically various mathematical operations, especially estimating complicated integrals in several variables (see Section 8-31. • The RVS x and y are independent and normal. with densities f,(x) ·~ . = ,.• I'2- e····,.:rr I /.(v) = - - " v ~11 In this case, (5-19) yields · f(x, V) ' Example 5.3 .,, . (!.,.. _,. u'\12; = -•-, e·«x!'·'.!"~·,.: (5-22) 21TCT" With f{x, y) as in (5-22), find the probability p that the point (x, y) is in the circle ~sa. As we see from (5-22) and (5-6), P{~ <a}= 2~u~ Jf e·•·•!•Y:~rdxdy '.r·+,,••<u I f." 21TTt'-r'.;tr . . .dr = I = --, 21TO'" 0 .., . 
- e·u··-•r • Circular Symmetry We shall say that a density functioh/(x, y) has circular symmetry if it depends only on the distance of the point (x, y) from the origin, that is, if f(x, y) = «//(r) r = v'x 2 + y 2 (5-23) As we see from (5-22), if the RVs x and y are normal independent with zero mean and equal variance, their joint density has circular symmetry. We show next that the converse is also true. • Theorem. If the Rvs x and y are independent and their joint density is circularly symmetrical as in (5-23), they are normal with zero mean and equal variance. 5-J SEC. 143 THE JOINT I>ISTRIBUTION FUNCTION This remarkable theorem is another example of the importance of the normal distribution. It shows that normality is a consequence of independence and circular symmetry. conditions that are met in many applications. • Proof. From (5-23) and (5-19) it follows that r·.,·· eM·., (5-24) cfJ(r) = j~(.r)/v(v) r "' V x· + yWe shall show that this identity leads to the conclusion that the functions j,(x) and J;-lv) are normal. Difl"erentiating both sides with respect to x and using the identity d "'-( _ dfb(r) or _ X "'-'( ) "~~ r) - - - - - "~~ r ax dr ax r - we obtain (5-25) As we see from (5-24), x<b(r) = xfx(x)J;.(y). From this and (5-25) it follows that I f.~(x) I cf>'(r) r cf>(r) = x f.(x) ~-~ The left side is a function only of r = V x~ + y 2 , and the right side is independent of y. Setting x = 0, we conclude that both sides are constant. Thus d -I [;(x) - - = a = constant dx In /.(:c) = ax x ft(x) In j.(x) = ax2 T + C j.(:c) = c en.•l'2 Since fx(x) is a density, a < 0; hence. x is normal. Reasoning similarly, we conclude that y is also normal with the same a. t"undions of Independent R\"S Suppose now that z is a function of the and that w is a function of the RV y: z = g(x) w = h~) RV x (5-26) We maintain that if the avs x andy are statistically independent, the avs z and w are also statistically independent. • Proof. We denote by A: the set of values of x such that g(x) s z and by B.., the set of values y such that h(y) s w. From this it follows that {z s z} = {x EAt} {w s w} = {y E B.,..} for any z and w. Since x andy are independent, the events {x E A4 } and {y E B,..} are independent; hence, P{z s z. w s w} = P{z s z}P{w s w} (5-27) Note, for example, that the avs x2 and y3 are independent. 144 CHAP. 5 TWO RANDOM VARIABLES 5-2 Mean, Correlation, Moments The properties of two avs x and y are completely specified in terms of their joint distribution. In this section, we give a partial specification involving only a small number of parameters. As a preparation, we define the function g(x, y) of the avs x and y and determine its mean. Given a function g(x, y), we form the composite function z = g(x, y) This function is an RV as in Section 4-1, with domain the set ~ of experimental outcomes. For a specific outcome'; E ~.the value z(,;) ofz equals g(x;, y;) where x; andY; are the corresponding values ofx andy. We shall express the mean of z in terms of the joint density /(x, y) of the avs x and y. As we have shown in (4-94), the mean of the RV g(x) is given by an integral involving the density fx(x). Expressingfx(x) in terms of f(x, y) [see (5-10)], we obtain E{g(x)} = J: g(x)fx(x)dx J: J: g(x)f(x, y)dxdy = (5-28) We shall obtain a similar expression for E{g(x, y)}. The mean of the RV z = g(x, y) equals Tit = r. z.fr.(z) dz (5-29) To find.,,, we must find first the density of z. We show next that, as in (5-28), this can be avoided. • Theorem E{g(x, y)} = r. r. 
g(x, y) f(x, y) dx dy   (5-30)

and if the RVs x and y are of discrete type as in (5-12),

E{g(x, y)} = Σ_{i,k} g(x_i, y_k) f(x_i, y_k)   (5-31)

• Proof. The theorem is an extension of (4-94) to two variables. To prove it, we denote by ΔD_z the region of the xy plane such that z < g(x, y) < z + dz. To each differential dz in (5-29) there corresponds a region ΔD_z in the xy plane where g(x, y) ≅ z and

P{z ≤ z ≤ z + dz} = P{(x, y) ∈ ΔD_z}

As dz covers the z-axis, the corresponding regions ΔD_z are nonoverlapping and they cover the entire xy plane. Hence, the integrals in (5-29) and (5-30) are equal.

It follows from (5-30) and (5-31) that

E{g₁(x, y) + · · · + g_n(x, y)} = E{g₁(x, y)} + · · · + E{g_n(x, y)}   (5-32)

as in (4-96) (linearity of expected values). Note [see (5-28)] that

η_x = ∫∫ x f(x, y) dx dy        σ_x² = ∫∫ (x − η_x)² f(x, y) dx dy

Thus (η_x, η_y) is the center of gravity of the probability masses on the plane. The variances σ_x² and σ_y² are measures of the concentration of these masses near the lines x = η_x and y = η_y, respectively. We introduce next a fifth parameter that gives a measure of the linear dependence between x and y.

Covariance and Correlation   The covariance μ_xy of two RVs x and y is by definition the "mixed central moment":

Cov(x, y) = μ_xy = E{(x − η_x)(y − η_y)}   (5-33)

Since

E{(x − η_x)(y − η_y)} = E{xy} − η_x E{y} − η_y E{x} + η_x η_y = E{xy} − E{x}E{y}

(5-33) yields

μ_xy = E{xy} − E{x}E{y}   (5-34)

The ratio

r = μ_xy / (σ_x σ_y)   (5-35)

is called the correlation coefficient of the RVs x and y. Note that r is the covariance of the centered and normalized RVs

x₀ = (x − η_x)/σ_x        y₀ = (y − η_y)/σ_y

Indeed,

E{x₀} = E{y₀} = 0        σ_{x₀}² = E{x₀²} = 1        σ_{y₀}² = E{y₀²} = 1

E{x₀y₀} = E{ [(x − η_x)/σ_x] [(y − η_y)/σ_y] } = μ_xy/(σ_x σ_y) = r

Uncorrelatedness   We shall say that the RVs x and y are uncorrelated if r = 0 or, equivalently, if

E{xy} = E{x}E{y}   (5-36)

We shall next express the mean and the variance of the sum of two RVs in terms of the five parameters η_x, η_y, σ_x, σ_y, μ_xy. Suppose that

z = ax + by

In this case, η_z = aη_x + bη_y and

E{(z − η_z)²} = E{[a(x − η_x) + b(y − η_y)]²}

Expanding the square and using the linearity of expected values, we obtain

σ_z² = a²σ_x² + b²σ_y² + 2ab μ_xy   (5-37)

Note, in particular, that if μ_xy = 0, then

σ²_{x+y} = σ_x² + σ_y²   (5-38)

Thus if two RVs are uncorrelated, the variance of their sum equals the sum of their variances.

Orthogonality   We shall say that two RVs x and y are orthogonal if

E{xy} = 0   (5-39)

In this case, E{(x + y)²} = E{x²} + E{y²}. Orthogonality is closely related to uncorrelatedness: If the RVs x and y are uncorrelated, the centered RVs x − η_x and y − η_y are orthogonal.

Independence   Recall that two RVs x and y are independent iff

f(x, y) = f_x(x)f_y(y)   (5-40)

• Theorem. If two RVs are independent, they are uncorrelated.

• Proof. If (5-40) is true,

E{xy} = ∫∫ xy f(x, y) dx dy = ∫∫ xy f_x(x)f_y(y) dx dy = ∫ x f_x(x) dx ∫ y f_y(y) dy = E{x}E{y}

The converse is true for jointly normal RVs (see page 163), but generally it is not true: If two RVs are uncorrelated, they are not necessarily independent. Independence is a much stronger property than uncorrelatedness. The first is a pointwise property; the second involves only averages. The following is an illustration of the difference. Given two independent RVs x and y, we form the RVs g(x) and h(y). As we know from (5-27), these RVs are also independent.
From this and the theorem just given it follows that E{g(x)h(y)} = E{g(x)}E{h(y)} (5-41) SEC. 5-2 for any g(x) MEAN, CORRELATION, MOMENTS J47 and g(y). This is not necessarily true if the RVs x and y are uncorrelated; functions of uncorrelated Rvs arc generally not uncorrelated. SCHWARZ'S INEQt;ALITY We show next that the covariance of two Rvs x andy cannot exceed the product CTxCTy. This is based on the following result, known as Schwarz's inequality: (5-42) I Equality holds iff (5-43) Y =CoX • Proof. With c an arbitrary constant. we form the mean of the RV (y - cx) 2 : /(~·) = E{(y- cx)2 } = t. J",, (y- cx)2j(x, y)dxdy (5-44) Expanding the square and using the linearity of expected values. we obtain /(d = E{y:!} - 2cf..'{xy} + c!E{x~} (5-45) Thus /(c) is a parabola (Fig. 5.6). and /(d > 0 for every c; hence, the parabola cannot intersect the x-axis. Its discriminant /) must therefore be negative: and (5-42) results. To prove (5-43). we observe that if J) = 0. then /(d has a real root c11 ; that is. /(co) = E{(y - CoX)2} = 0 This is possible only if the RV y - c·uX equals zero Jscc (4-103)]. as in (5-43). • Corollary. Applying (5-42) to the Rvs x -· 71, andy ·- 71··. we conclude that E2{<x - 71.• )(y - 71,.)} ~ f..'{(x .. 71, )~ }t.'{(y - 71, >2 } Figure 5.6 l(cl D<O v~ 0 c 148 CHAP. 5 TWO RANDOM VARIABLES Hence. , lrl s #Lil s u,,u,. Furthermore [see (5-43)]. lrl = I itT y - .,,. = co( X - I Tl.•) (5-46) To find c,,. we multiply both sides of (5-46) by (x - .,, ) and take expected values. This yields E{(x - .,, )(y - Tl.•· H= <·uE{(x - Tl.• )~} Hence. u,. if r = I cu ru., u ,. = -,-· = u:, .. { - u,. if r =-I u .• We have thus reached the conclll<iion that if lrl = I. then y is a linear function of x and all probability masses are on the line u, - (x - ... r=±I .v - ...., ,. = ± u, .,., ) (T~ As lrl decreases from I to 0. the masses move away from this line (Fig. 5.7). Empirical Interpretation. We repeat the experiment n times and denote by X; and Y; the observed values of the avs x and y at the ith trial. At each of the n points (x~o y 1), we place a point mass equal to 1/n. The resulting pattern is the empirical mass distribution ofx andy. The joint empirical distribution F,.(x, y) of these avs is the sum of all masses in the quadrant Do of Fig. 5.1. The joint empirical density (two-dimensional histogram) is a function f,(x, y) such that the product 4 14 2/,(x, y) equals the number of masses in a rectangle with sides 4 1 and 4 2 containing the point (x, y). For large n, F,.(x, y) =- F(x, y) f,(x, y) = j(x, y) as in (4-24) and (4-37). Figure 5.7 y r=O 0 flx X X 5-2 SEC. MF.AN, CORRELATION, MOMENTS 149 We next form the arithmetic averages .f and .Vas in (4-89) and the empirical estimates -2 cr , '\."" = -nI ~ ; (X· - X-)2 ;;..,.. = !n L; (X; - I ~ n ; -' -)2 CT~ -'- - ~ ()' - ~ ' .V ' .t)(>•; - y) . u;. u~. and Jl..,y. The point (i, f) is the center of gravity of the empirical masses. 0:~ is the second moment with respect to the line x = and is the second moment with respect to the line 'Y = j. The ratio ;;. ...rii;ii,. is a measure of the concentrcttion of the points near the line cr,. - of the parameters x. o-; \'-v= • • ±~(x-.l") (T_, This line approaches the line (5-46) as n--+ x. In Fig. 5.7 we show the empirical masses for various values of r. LINEAR REGRESSION Regression is the determination of a function y = <f>(x) "fitting" the probability masses in the xy plane. 
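The empirical interpretation of r just given, and the difference between uncorrelatedness and independence, can be illustrated numerically. The sketch below is an added illustration (Python with NumPy; the particular distributions, sample size, and seed are arbitrary choices), not part of the text.

```python
import numpy as np

# Added illustration: empirical correlation coefficient r for three pairs of RVs.
rng = np.random.default_rng(1)
n = 100_000

# Independent pair: r is close to 0.
x1, y1 = rng.normal(0, 2, n), rng.uniform(0, 1, n)
print("independent:          r =", round(np.corrcoef(x1, y1)[0, 1], 4))

# Uncorrelated but dependent: x symmetric about 0 and y = x**2.
x2 = rng.uniform(-1, 1, n)
y2 = x2 ** 2
print("y = x^2 (dependent):  r =", round(np.corrcoef(x2, y2)[0, 1], 4))

# Correlated pair: mu_xy / (sigma_x sigma_y) computed directly matches np.corrcoef.
x3 = rng.normal(0, 1, n)
y3 = 0.5 * x3 + rng.normal(0, 1, n)
mu = np.mean(x3 * y3) - np.mean(x3) * np.mean(y3)       # empirical covariance, as in (5-34)
r_hat = mu / (np.std(x3) * np.std(y3))
print("constructed:          r =", round(r_hat, 4),
      "  np.corrcoef gives", round(np.corrcoef(x3, y3)[0, 1], 4))
```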
This can be phrased as a problem in estimation: Find a function <f>(x) such that, if the RV y is estimated by <f>(x), the estimation error y - <f>(x) is minimum in some sense. The general estimation problem is developed in Section 6-3 for two Rvs and in Section 11-4 for an arbitrcuy number of RVs. In this section, we discuss the linear case; that is, we assume that <f>(x) = a + bx. We are given two RVs x andy, and we wish to find two constants a and b such that the line y = a + bx is the best fit of "yon x" in the sense of minimizing the mean-square value e = E{[yof the deviation v (a + bx)2 } = f. f. [y - (a + bx)Jlf(x, y)dxdy (5-48) = y- (a+ bx) ofy from the straight line a+ bx (Fig. 5.8). Figure 5.8 y r =I ..-..-·· ....- x on y X 150 CHAP. 5 TWO RANDOM VARIABLES Clearly, e is a function of a and b, and it is minimum if ile = · ·:! iiCI J• , j·, ' I,.. = -2f;{y - (Cl :~ = -2 r, r. = - 2E{(y - - a+ bx)lf'(x. \')dxd\' . . • + bx)} = 0 (5-49) ly(tl - (tl + bx))xf(x. y)dxdy bx)]x} =0 The first equation yields From this it follows that y- (a+ bx) = (y- 1J,-) - b(x- 1Jx) E{(y- 1J,-)- b(x- 1Jx>l = 0 Inserting into the second equation in (5-49), we obtain E{[(y- 711 ) - b(x - 1Jx)J(x- 1Jx>l = ru_.u>. - bui = 0 Hence, b = ruvlu_.. We thus conclude that the linear LMS (least mean square) fit of y on X is the line y - 1Jy (T =r ~ (x u_. - 1Jx) (5-50) known as regression line (Fig. 5.8). This line passes through the point (1Jx, 711 ), and its slope equals ru1 /u_.. The LMS fit of "x on y" is the line X - 1Jx = r uUx (y - 1Jy) )' passing through the point (1J.., 711 ) with slope u 1 1ru_.. If regression lines coincide with the line (5-46). lrl = 1, the two APPROXIMATE EVALUA110N OF E{g(s, y)} For the determination of the mean of g(x, y), knowledge of the joint density f(x, y) is required. We shall show that if g(x, y) is sufficiently smooth in the region D wheref(x, y) takes significant values, E{g(x, y)} can be expressed in terms of the parameters 1Jx, 1Jy, U;tt Uy, and #Lxy. Suppose, first, that we approximate g(x, y) by a plane: g(x, y) == g(1J;tt 1Jy) ag ag + (x- 1Jx) ax+ (y- 1Jy) ay (all derivatives are evaluated at x = 1Jx andy= 711 ). Since E{x- 1Jx} = 0 and E{y - 717 } = 0, it follows that (5-51) E{g(x, y)} = g(7J.., 711 ) This estimate can be improved if we approximate g(x, y) by a quadratic surface: ag ag g(x, y) ""'g(7Jz, 1Jy) + (x- 1Jx) ax+ (y- 1Jy) ay 1 azg azg 1 azg + 2 (x- 1Jx)2 axz + (x- 1J_.)(y- 71>) axay + 2 (y- 1Jy)2 ayz SEC. 5-2 MF.AN, CORREI.AriON, MOMENTS J5J This yields Example 5.4 Clearly. if2g - 2 = 2\' • ilx Inserting into (5-52), we obtain E{x2y} == .,;.,, - .,,u~ - 2.,,ru,u, • Momellls and Momelll Functions The first moment 71 and the variance u~ give a partial characterization of the distribution F(x) of an RV x. We shall introduce higher-order moments and usc them to improve the specification of F(x). • D~jinitions. The mean m11 = E{x"} = of X11 is the moment of order n of the ILn = E{(x- 71)11 } f, x"f(x) dx RV (5-53) x. The mean = f~ (x- 71)"/(:c)dx (5-54) is the central moment of order n. Clearly, 2 mo = ILo = I m1 = 71 /LI = 0 1L2 = u2 m2 = 1L2 + 71 Similarly, we define the absolute moments £{1xl"} and the moments E{(x - a)11 } with respect to an arbitrary point a. Note that if f(x) is even, then 71 = O. mn = ILn, and P.2n-1 = 0; if f(x) is symmetrical about the point a, then 71 = a and p.211 • 1 = 0. Example 5.5 We shall show that the central moments of a normal RV equal n even _ { I x 3 · · · (n - I )u" IJ.tt - 0 II odd (5-55) • Proof. 
Sincej(x) is symmetrical about its mean. it suffices to find the moments of the centered density I ., . /(x) = - - e·•· _.,. uv'21T 152 CHAP. 5 TWO RANDOM VARIABLES Differentiating the normal integral (3-55) k times with respect to a we obtain f" z21te .z -x <ll dz = 1X3···(2k-ll,/1r V a2k·J 24 With a = l/2u 2 and n = 2k, this equation yields 1 - z"e·::12fr dz = l x 3 · · · (n - l)u" u\12; J" -x And since p., = 0 for n odd, (5-55) results. • MOMENT-GENERATING FUNCTIONS The moment-generating function, or, simply, the moment function, of an RV x is by definition the mean of the function en. This function is denoted by cl>(s), and it is defined for every s for which E{eSX} exists. Thus (5-56) and for discrete type avs cl>(s) = ~ eu'f(x~c) From the definition it follows that E{e<n .. bls} = eh•E{easa} (5-57) = eb)cJ>t(as) Hence, the moment function cl>1(s) of the RV y = ax + b equals «<>,(s) = E{eslaa·hl} = eb•cJ>.,(as) Example 5.6 We shall show that the moment function of aN(.,, u) <l»(s) RV (5-58) equals = e"'etM:12 (5-59) • Proof. We shall find first the moment function <1»0fs) of the Clearly, Xo is N(O, I); hence, RV Xo = (x - 71)/u. Inserting the identity x2 sx - - 2 J = - -2 (x - sZ s)2 ... - 2 into the integral, we obtain = esZ12 J"-x\12; _1_ e·lx-sJlf2dt = ,slf2 last integral equals 1. And since x = ., + Q'Xo, <l»o(.f) because the (5-58). • Example 5.7 The RV x is Poisson-distributed with parameter a P{x = k} = a• e·a k! k = 0, J, . . . (5-59) follows from 5-2 SEC. MEAN, CORRELATION, MOMENTS 153 lnsening into (5-57), we obtain Cl>(s) = e·a 2" 4D0 4 a1 e•• k, (5-60) • == e·ue"'' We shall now relate the moments m, of the RV x to the derivatives at the origin of its moment function. Other applications of the moment function include the evaluation of the density of certain functions of x and the determination of the density of the sum of independent avs. • Moment Theonm. We maintain that E{x"} = m, = «t»lnl(O) (5-61) • Proof. Differentiating (5-56) and (5-57) n times with respect to s, we obtain and for continuous and discrete type avs, respectively. With s from (5-53). = 0, (5-61) follows • Corollary. With n = 0, I, and 2, the theorem yields «1»(0) = I, «1»'(0) = m, = TJ «1»"(0) = m 2 = TJ 2 + u2 (5-62) We shall use (5-62) to determine the mean and the variance of the gamma and the binomial distribution. Example S.H The moment-generating function of the gamma distribution f(x) = 'Yxb-le-•·•u(x) equals Cl>(s) = 'Y i"xbo 1 e-~<"··su dx = (c'Yf{b) - s)b c.h = --.,. (c - s)b (5-63) (see (4-54)]. Thus «~»'"'(s) = .:. .;b(:..:.b_-___.:_1:.. .)·.,....-·_• ..!.;<b:..,.--....:.n.:....._-....:l~)c=-h (c - s)b'" Hence, «~»'"'(O) = E{x"} With n = I and n - I) · · · (b +· n - I) c" (5-64) = 2, this yields E{x} = ~ E{x2 } = b(b ; I) c Example 5.9 = b(b c The av x takes the values 0, I. . . . , n with P{x = k} = ( Z) p•q"-k u-' = b c2 (5-65) • 154 CHAP. 5 TWO RANDOM VARIABLES In this case, (5-66) Clearly, «<>'(s) = n(pr + q)" cl>"(s) = n(n - l)(pr 4>'(0) = np 4>"(0) 1pt>' + q)"- 2p 2 ~ = n2p 2 - + n(pt>' + q)"- 1pr np2 + np Hence, (5-67) • The integral in (5-56) is also called the Laplace transform of the functionf(x). For an arbitrary f(x), this integral exists in a vertical strip R of the complex s plane. If f(x) is a density, R contains the jt.rraxis; hence, the function ci»UOJ) = f.. ei*'Xf(x) dx (5-68) exists for every real OJ. In probability theory, ci»UOJ) is called the characteristic function of the av x. In general, it is called the Fourier transform of the function f(x). 
It can be shown that f(x) =21T I J'_,. " e-1-cJ»UOJ)dOJ (5-69) This important result, known as the inuersion formula, shows that f<x> is uniquely determined in terms of its moment function. Note, finally, that f(x) can be determined in terms of its moments. Indeed, the moments m,. equal the derivatives ofcl>(s) at the origin, and cl>(s) is determined in terms of these derivatives if its Taylor expansion converges for every s. Inserting the resulting cJ»UOJ) into (5-69), we obtainf(x) . .JOINT MOMENTS Proceeding as in (5-53) and (5-54), we introduce the joint moments m., = E{x•y} = r. t,. x•yt(x. y)dxdy (5-70) and the joint central moments P.kr = E{(x- Tl~)'(y- Tl.v)'} = fx f,. (X- Tl.,)"(y- 11v>1<x. y)dxdy Clearly, m1o = ,.,.. #J.IO = 0 mot = ,.,,. m20 , , = 71; + u:; mo2 , , (5-71) ., = 71; + u;, /Lot = 0 #J.l I = IJ..<.Y #J.20 = u; IJ.oz = u; The joint moment func:tioh cl>(s 1, s2) of the avs x and y is by definition cl>(s 1 , s 2 ) = E{E"'•••s~} = fx r. ets,x·.•~''f(x, y)dxdy (5-72) SEC. 5-3 FUNCTIONS OF TWO RANDOM VARIABLES 155 Repeated differentiation yields (5-73) This is the two-dimensional form of the moment theorem (5-61 ). Denoting by <1>,{.\") and <1>,.(.\") the moment functions ofx andy. we obtain the following relationship bet.wccn marginal and joint moment functions: <1>,.(.\"~) ,... <1>,(.\"1) = E{e···} = ct>{.\"1. ()) J::(t-•·Y} <1>(0. ·'"~) (5-74) Note, finally. that if the avs x and y arc incll'pl'ndl'nt. the avs ('''• and ('"'''are independent lsee (5-41)). From this it follows that cl>(.\"1. s~) = J::{e••• }E{t•':> } ...:. <l>,<sj)ct>,.<s!) !5-75) Example 5.10 From (5-75) and (5-59) it follows that if the Rv x and y are normal and independent, with zero mean, then <J)(·'"I •.\·!I ,. ,, ·• tr;.\·~ ''} exp { iI (tT\Si 1 (5-76) • 5-3 Functions of Two Random Variables Given two Rvs x andy and two functions g(x, y) and h(x. y), we form the functions w = h(x, y) z = g(x, y) These functions are composite with domain the set f/; that is, they are RVs. We shall express their joint distribution in terms of the joint distribution of the RVS X and Y• We start with the determination of the marginal distribution Fl(z) of the RV z. As we know. Fl(z) is the probability of the event {z s z}. To find Ft(Z), it suffices, therefore, to express the event {z s z} in terms of the RVs x andy. To do so, we introduce the region Dt of the xy plane such that g(x, y) s z (Fig. 5.9). Clearly, z s z itT g(x, y) s z: hence. P{z ~ z} = P{g(x, y) s z} :.:. P{<x, y) E D:} The region JJ, is the projection on the xy plane of the part of the g(x. y) surface below the plane z = constant: the function F:IZ) equals the probability masses in that region. Example 5.11 The RVS x and y are normal and independent with joint density r ( )- I . lxl· rl)l2cr Jxy x, Y - 21Tw e . We shall determine the distribution of the z =+ Vx2 RV t y2 156 CHAP. 5 TWO RANDOM VARIABLES y y X X {a) I b) Flpre 5.9 If z ~ 0, then Dz is the circle g(x, y) = ~ s z shown in Fig. 5.9. Hence (see Example 5.2), Fz(Z) = 2~(T2 JJ e-lx~•r,Jnal dxdy = I - e-z2f2al D, If z < 0, then {z s z} = 0; hence, F:<z> = 0. Rayleigh Density. Differentiating F:(z), we obtain the function f:<z> = ~ e-rf2alu<z> (5-77) known as the Rayleigh density. • Example 5.12 (a) z = max (x, y) The region Dz of the xy plane such that max (x, y) s :is the set of points such that x s z and y s z (shaded in Fig. 5.10). The probability masses in that region equal F xy(Z, z). Hence, Fz(Z) = FAy(Z, Z) (5-78) (b) z = min (x. 
y) The region Dz ofthexy plane such that min (x, y) s z is the set of points such thatx s z or y s z (shaded in Fig. 5.10). The probability masses in that region equal the masses Fx<z> to the left of the vertical line x = z plus the masses F,(z) below the horizontal line y = z. minus the masses FA1(z. z) in the quadrant <x s z. y s z). Hence, (5-79) • Joint Density of g(x, y) and h(x, y) We shall express the joint density fzw(z, w) of the RVS z = g(x, y) and w = h(x, y) in terms of the joint density /xy(x, y) of the RVS x andy. The following theorem is an extension of (4·78) to two RVs, and it involves the Jacobian J(x, y) of the transformation 1. = g(x, y) w = h(x, y) (S-80) SEC. 5-3 FUNCTIONS OF TWO RANDOM VARIABLES 157 )' z L min (.Y, y) S z max (x, y) S z 0 ·I .\" Figure S.to The function J(x. y) z ~ X is by definition the determinant J(.t. y) ::. iiJ:(:C. y) iiX ~'.(X •.\:) ilx il\' iiJt( X. y) (5-tH) -ily and it is used in the determination of two-dimensional integrals involving change of variables. • Fundamental Theorem. To find./; .. <:. u·). we solve the systems (5-80) for x and y. If this system has no real solutions in some region of the zw plane. ,(.,(z. w) = 0 for every(<.. w) in that region. Suppose. then. that (5-80) has one or more solutions <x1• )';).that is. (5-82) /t(x;. )';) = w In this case. r ( ) .f~,(x,. Ytl J; - w = . + ... + J~,.( :C;. _\';) + ... (5-83) :.. .... IJ<x,. Yt>l IJ<x;. y;>l where (:c;. y;) are all pairs (X;. y;) that satisfy (.5-821. The number of such pairs and their values depend. of course. on the particular values of z and w. • Proof. The system (5-82) transforms the differential rectangle A of Fig. 5.11a into one or more differential pardllelogmms 8 1 of Fig. 5.11b. As we know from calculus. the area I8 11 or the ith parallelogram equals the area lA I = dzdw of the rectangle, divided by J(x;, y,). Thus = h_,.(z. w)dzdw = fx.,(x~o y;)IJ- 1(x;. Y;)ldzdw P{z s z s z + dz. w s w s w + dw} P{(x, y) E 8,} (5-84) And since the event {z s z s z + dz, w s w s w + dw} is the union of the disjoint events {(x;, y1) E 8 1}, (5-83) follows from (5-84). With h_,.(z, w) so determined, the marginal densities ft(z) and j,.(z) can be obtained as in (5-10). 158 CHAP. 5 TWO RANDOM VARIABLES y y M z 0 =z Y; --~~ w g(x, y) 0 X X (a) (b) Figure S.U Auxiliary Variable We can use the preceding theorem to find the density /,.(z) of one function z = g(x, y) of two avs x andy. To do so, we introduce an auxiliary av w = h(x, y), we determine the joint density f: ...(z, w) from (5-83) and the density f:(z) from (5-10): /,.(z) = r. J..... (5-85) (z, w)dw The variable w is selected at our convenience. For example, we can set w x or y. We continue with several illustrations. = Linear Transformadons Suppose first that z=ax+b w=cy+d This system has a single solution z-b w-d Xt Yl = - a c for any z and w, and its Jacobian equals ac. Inserting into (5-83), we obtain 1 w(5-86) f:...,.(z. w) = laclfn -a-· -t-.- =-- (z- b d) Suppose, next, that In this case, the Jacobian equals J(x, y) = ,a, a2 If D = 0, then w = a2zl b, ; hence, the joint distribution ofz ahd w consists of line masses and can be expressed in terms of Fi:.). It suffices, therefore, to assume that D =I= 0. With this assumption, we have a single solution Xt and (5-83) yields ~ Jtw ( z, = W bzz - b,w D Yt = -azz + a,w D ) = ~ (bzz - b,w -azz + a1w) _1_ Jxy D ' D IDI (5_88) SF.C. 
5-3 FUNCTIONS OF TWO RANDOM VARIABLES 159 mSTRIBUTION o•· x + y An important special case is the determination of the distribution of the sum z = x + y. Introducing the auxiliary variable w = y. we obtain from (5-88) with a 1 = b 1 = I. a~ = 0. h~ = I, f,..(z. w) = J.).(z - w, w) (5-89) Hence. /:(z) = f. J.,.(z - w, w)dw (5-90) This result has the following mass interpretation. Clearly. z s z s z + dz iff z s; x ..... y s z + dz, that is, iff the point (x. y) is in the shaded region llD;; of Fig. 5.12. Hence, fz{z)dz = P{z s x + y s z + dz} = P{(x, y) E flD:} (5-91) Thus to find f(z), it suffices to find the masses in the strip llD:. For discrete type RVs, we proceed similarly. Suppose that x and y take the values x; and y4 • respectively. In this case. their sum z = x + y takes the values z = z, = x; + y 4 , and P{z = z,} equals the sum of all point masses on the line x + .v = z,. Example S. 13 The avs x and y are independent. taking the values I. 2. . . .• 6 with probability 1/6 (two fair dice rolled once). In this case. there are 36 point masses on the xy plane, as in Fig. 5.13. and each mass equals 1136. The sum z = x + y takes the values 2. 3, ...• 12. On the line x- y = 5, there are four point masses: hence, P{z"" 5}"'" 4/36. On the line x + y = 12. there is a single point: hence. P{z = 12} = 1136. • The determination of the density of the sum of two independent RVS is often simplified if we use moment functions. As we know. if the Rvs x and y are independent. the Rvs e"' and esy are also independent; hence Isee (5-75)], E{e''"'Y'} = £{e·'11}E{e·'Y} Figure 5.12 y •·igure 5.13 160 CHAP. 5 TWO RANDOM VARIABLES From this it follows that <l>.,.~(s) = cl>~(s)<l>.v(s) (5-92) Thus the moment function of the sum of two independent RVs equals the product of their moment functions. Example 5.14 If the avs x andy are normal. then [see (5-59)) Cl»x(s) = exp {.,~s + ~ u~s2 } "" ...1(s) - - exp {.,).s + !2 u)s 22} Inserting into (5-92). we obtain "" - exp {(71, ..,~·~(s)- + .,,)s + ! (u,2+ uy)s 2 2} (5-93) 2 Clearly. (5-93) is the moment function of a normal av; hence. the sum of two normal independent avs is normal with Til = .,. + Tlr• u: = u~ + u~. • Example 5.15 If the Rvs x andy are Poisson. then [see (5-60)) Cl»,(s) = exp {u,(e' - I)} Cl»,.(.vl = exp {Cl,(t'' - II) Inserting into (5-92). we obtain 4»., _,.(.f 1 = exp {tl, + e~,.)(t' • ·- II} Thus the sum of two independent Poisson Rvs with parameters tively, is a Poisson RV with pantmeter a, -+ a,.. • Example 5.16 (5-94) tl.• and a,. respec- If the avs x and y have a binomial distribution with the same p. then [see (5-66)] e~».(s) = (pe• + q)"' 4»7(s) = (pe• + q)" Hence, cJ»•• ,<s) = (pe• + q)"'•" (5-95) Thus the sum of two independent binomial avs of order m and n. respectively. and with the same p is a binomial RV of order m • n. • Independence and Convolution We shall determine the density of the sum z = x + y of two independent continuous type avs. In this case. ft,<x • .v> = ft(x)[..(y) Inserting into (5-90). we obtain fr.(z) = r. fx<z- w)J,(w)dw (5-96) This integral is called the conuolution of the functions .fx(x) and J,(y). We have thus reached the important conclusion that the density of the sum of two independent avs equals the convolution of their respective densities. SEC. 
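The point masses of z = x + y in Example 5.13 can also be obtained as a discrete convolution of the two marginal mass functions. The sketch below is an added illustration (Python with NumPy), not part of the text.

```python
import numpy as np

# Added illustration: masses of z = x + y for two fair dice (Example 5.13).
px = np.full(6, 1 / 6)           # masses of x on the values 1, ..., 6
py = np.full(6, 1 / 6)           # masses of y on the values 1, ..., 6
pz = np.convolve(px, py)         # masses of z on the values 2, ..., 12

for z, m in zip(range(2, 13), pz):
    print(f"P{{z = {z:2d}}} = {round(36 * m):2d}/36")
# In particular P{z = 5} = 4/36 and P{z = 12} = 1/36, as found above.
```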
5-3 FUNCTIONS OF TWO RANDOM VARIABLES 161 Combining (5-96) and (5-92), we obtain the following mathematical result, known as the convolution theorem: The moment function <l»~(s) of the convolution.fx(z) of two densitiesft(x) and.f..(y) equals the product oftheir moment functions cl»,r(s) and <l»,.(s). · Clearly, if the range of x is the interval (a, b) and the range of y the interval (c. d), the range of their sum z = x + y is the interval (a + c, b + d). From this it follows that if the functions.fx(x) andfv(Y) equal zero outside the intervals (a, b) and (c, d), respectively. their convolution .fx(z) equals zero outside the interval (a + c:, b + d). Note in particular that if .fx(x)=O forx<O and f,.(...,.)=O fory<O then /J<) = {~/.(<- w)f,(w)dw ~>0 (5-97) z<O because .fx(z - w) t:.xample 5.17 = 0 for w > z. From (5-97) it follows that if the RVs x and y arc independent with densities J.(x) =- cu- 0 -"U(x) j;.(y) = oe "-'U(y) then the density of their sum equals .f:(z:) = o 2U(z) J: e-nl: "'1e ...... dw =- o 2ze·o.:u(z) • In the following example, we use the mass interpretation of .fx(z) (see Fig. 5.12) to facilitate the evaluation of the convolution integral. Example 5.18 The avs x and y are independent, and each is uniformly distributed in the interval (0, c). We shall show that the density of their sum is a triangle as in Fig. 5.14. Clearly,f(x, y) = llc2 for every (x, y) in the squareS (shaded area) and /(x, y) = 0 elsewhere. From this it follows that_h(z)dz equals iAD:;Ic 2 where IAD:I is the area of the region z s :c + y s z + dz inside the squareS [see (5-91)). As we see from Fig. 5.14. laD I= {zdz : Hence, (2c- zldz r cl !:<z> = 2c- Oszsc l ---;r JOI~TI.Y oszsc c<z<2c c<z<2c • NORMAl. DISTRIBUTIONS We define joint normality in terms of the normality of a single RV. 162 CHAP. 5 TWO RANDOM VARIABLES y fx(w) = fy(w) 1 c (2c- z) dz 0 Z t----, c 0 X FJgUre5.14 • Dtjinltion. Two RVs x and y are jointly normal itT the sum z=ax+by is normal for every a and b. We show next that this definition specifies completely the moment generating function cl>.ry(s,, s2) = E{es,a·.slf} of x and y. We assume for simplicity that E{x} = E{y} = 0. • Theonm. Two RVS x and y with zero mean are jointly normal itT where u 1 and u 2 are the standard deviations of x and y, respectively, and r is their correlation coefficient. • Proof. The moment function of the RV z «<>z(S) = E{e.sa} = exp Hu~s 2} where = ax + by equals [see (5-59)] u~ = a2ui + 2abru1u 2 + b2u~ Furthermore. «~>:<s) = E{e'"• • ,,,} cl>;(l) , , + 2abra,u2 + b-ui) , ., } = exp { 2I u:"} = exp { 2I (a·uj «1>:(1) = E{e"• • ">} = «<>.u-(a. b) Setting a = ·'"• and b = s2 we obtain (5-98). Conversely. if «~>x.v< s,, S:!) is given by (5-98) and z 2 «<>:(s) = «~>x.v(as, bs) = exp {~ (a2ui = ax + by. then + 2abra10':! + b2u~>} SEC. 5-3 FUNCTIONS OF TWO RANDOM VARIABLES 163 This shows that z is a normal RV, hence (definition) the RVS x andy are jointly normal. Joint Density fn.(.t, y) . From (5-98) and (5-72) it follows that I = 21TUtU2 VI { - (x 2 1 xy y2)} - 2r-- + 2 r 2 exp - 2(/ - r-') 2 U t UtU:! U2 (5-99) The proof is involved. Joint normality can be defined directly: Two Rvs x and y are jointly normal if their joint density equals e·Qh. >'1 where Q(x, y) is a positive quadrdtic function of x andy. that is Q(x. y) = c 1 x~ + c 2xy + <"l.\' 2 + c4x I <'~.\· + c,::: 0 Expressing the five pardmcters 11•· 11~· <r 1• u~. and r in terms of c; and using the fact that the integral of .f'<x. y) equals I. we obtain j~,.(.'(. 
y) = 2 ~0 exp {- 2;r l<d<x - 111 )2 - 2nr,<r~(x · 111 )(y -- 112> + uitv · 112>2!} (5-100) where D = u 1u 2 ~.Note that (5-100) reduces to (5-99) if 11• = 112 = 0. Uncon-elatedness and Independence It follows from (5-100) that if r = 0 then J.~,(x • This shows that if two independent. RVS .v> = .f~(x)f,.(y) arc jointly normal and uncorrclated. they are Mtuginal Normality It follows from (5-98) or directly from the definition with a = I and b = 0, that, if two RVS are jointly normal, they are marginally normal. Linear Transformations If z=ax+by then lsee (5-88)] w=c:x+dy Q 1(z w) = Q (dz - bw -cz + aw) ad - be: ' ad - be If Q(x, y) is a quadratic function of x andy, Q 1(z, w) is a quadratic function of z and w. This leads to the following conclusion: If two RVs z and w are linear functions of two jointly normal RVs x andy. they are jointly normal. To find their joint density, it suffices therefore to know the five parameters 11t• 11 ... , Ut, u.,., and rtw· ' 164 CHAP. 5 TWO RANDOM VARIABLES Example 5.19 The avs ll and y are normal with .,, = 0 .,.. = 2 u.. = 1 Find the density of the av z = 2x + 3y. As we know, z is N(1J,, u,) with .,, = 21J.r + 3"1, = 4 u: = 4u~ + u, = 2 3 r=- 9cr: + 12ru..u,- = 49 8 • GENERAL TRANSFORMATIONS We conclude with an illustration of the fundamental theorem (S-83). We shall determine the distribution of the av X z=- using as auxiliary variable the RV y w = y. The system z=-Xy has a single solution x 1 w=y = zw, y 1 = w, and its Jacobian equals I X - y - y2 J(x, y) - 0 1 =Y Inserting into (S-83), we obtain hw(Z, w) = lwlh1 (zw, w) and (S-85) yields h(z) = Example 5.20 J: lwlf..,(zw,w)dw (S-101) The avs ll and y are jointly normal with zero mean as in (5-99). We shall show that their ratio z = lily bas a Cauchy density centered at z = ru 11u2 : /,.(z) = u tCT2 v'f"="'r2 .,. • Proof. Since.h,(-x, -y) /,(z) = ' (S-102) =/.ry(x, y), it follows from (5-101) that J... w exp {- 2 I u~(z - ru./u2)2 + uf(l - r2) 2Tru1u 2VI - r2 o (z2 z 1 )} ~ - 2r - - - ~ dw u1u2 u2 w2 2(1 - r 2) u, With the integral equals and (5-102) follows. Integrating (5-102), we obtain the distribution l F,(z) 1 UzZ - TUJ = J-• /,.(a)da = 2 + UJ ~It - r2 (5-103) • PROBLEMS 165 Problems 5-1 5-l S-3 S-4 5-S 5·6 5-7 If f(x, y) = ye ·l.r:- 8.<', find: (a) the marginal densitiesf.(x) and,{y(y); (b) the constant y; (e) the probability p = P{x s .5, y s .5}. (a) Express Fx,.(x, y) in terms of F,(x) if: y = 2x: y = -2x; y = x2• (b) Express the probability P{x s x, y > y} in terms of f~,,(x. y). The RVS x andy are of discrete type and independent. taking the values x = n, 11 = 0 •...• 3 andy '"" m. m .:. 0, . . . • 5 with P{x = n} = 1/4, P{y = m} = 116. Find and sketch the point density of their sum z = x ... y. Show that •) < F,,(x) - F,.(y) F( X,) 2 (a) Show that if the avs x andy are independent and F,(w) = F 1 (w) for every w, then P{x s y} = P{x ~ y}. (b) Show that if the avs x and z = x/y are independent, then E{xlfy3} = E{x3 }/E{yl}. Show that if.,.,.. = 'l1yo cr., = cr:., and r.. = I, then x = y in probability. Show that if m, = E{x"} and p., = E{(x - '11)"} then Show that if the RV xis N(O, cr), then I x 3 · · · (n - l)crlk E{lxl"} n = 2k = { 2•k !u2lt. t ~ n = 2k .,. I Find the mean and the variance of the RV y = x2• S-9 Give the empirical interpretation: (a) of the identity E{x.,. y} = E{x} + E{y}; (b) of the fact that, in general, E{xy} :1- E{x}E{y}. 5-10 Show that ifz =ax+ b. w = cy + d, and ac :1- 0, then r~.. 
= r~y· 5·11 Show that if the Rvs x andy are jointly normal with zero mean and equal variance the avs z = x + y and w = x - y are independent. 5-12 Show that if the RVs x 1 and x2 are N(.,, cr) and independent, the RVS - Xt + X2 x=-2- y= (Xt - i)2 - (X2 - 2 i) 2 are independent. 5-13 We denote by a,x + b 1 the LMS fit ofy on x and by a 2y + b2 the LMS fit ofx on y. Show that a 1a2 = r~•• 5-14 The av xis N(O, 2) andy = x3• Find the LMS fit a - bx ofy on x. 5-15 Using the Taylor series approximation ag g(x, y) = g(.,•• '17y) + (x - .,.,.) ilx ag + ()' - .,.,,.) ay show that if z = g(x, y), then 2 CTz = cr,2 (ag) iiX 2 .._ 2(ag} CTy ily 2 ag ag + 2rcr,cry iiX ily 166 CHAP. 5 TWO RANDOM VARIABLES 5·16 The voltage v and the current i are two independent avs with TJ~ = IIOV cr,. = 2V TJ; = 2A cr; = 0.1 A Using Problem 5-15 and (5-52) find approximately the mean TJ,. and the standard deviation cr... of the power w = vi. 5·17 The RV xis uniform in the interval (0, c). (a) Find its moment function cl>(s ). (b) Using the moment theorem (5-62). find Tl• and cr,. 5·18 We say that an av x has a Laplace distribution if j(x) = ~ e ···•. Find the corresponding moment function cl>( s) and determine Tl• and cr, using (5-62). 5·19 Show that if two avs x andy are jointly normal with zero mean, then E{x2y2} = E{x2}E{y2} "'" 2E:{xy} 5-20 The RVs x and y are jointly normal with zero mean. (a) Express the moment function «~>:<s) of the av z =ax- by in terms of the three parameters cr_,, cr,. and'·~· (b) Show that if two avs x andy are such that the sum z =ax+ by is. a normal av for every a and b, then x and y are jointly normal. 5·11 The logarithm 'i'(s) =In t/l(s) of the moment-generating function cl>(s) is called the second moment generating function or the cumulant generating function. The derivatives /c,. = '1'1"1(0) of 'i'(s) are called the cumulants of x. (a) Show that leo = 0 lc, = TJ /c2 = cr 2 /c3 = m3 /c4 = m4 - 3cr 4 where m,. = E{x"}. (b) Find 'i'(s) ifx is a Poisson RV with parameter a. Find the mean and the variance of x using (a). 5·ll The avs x andy are uniform in the interval (0, I) and independent. Find the density of the avs z = x T y. w = x - y, s = lx - y;. 5-23 The resista'lces of two resistors are two independent avs, and each is uniform in the interval (9000, 1,1000). Connected in series. they form a resistor with resistance R = R1 + R2 • (a) Find the density of R. (b) Find the probability p that R is between l,900U and 2,1000. 5-24 The times of arrival of two trains are two independent avs x and y, and each is uniform in the interval (0, 20). (a) Find the density of the RV z = x- y. (b) Find the probability p 2 in Example 2-38 using avs. S·lS Show that if z = xy. then /:(r.) • = J"-x -1Xi. f..· (x. X!)dx 1 5-16 The avs x andy are independent. Express the joint density fz,..(z.. w) of the Rvs z = x2 • w = y3 in terms of f.< x) and J.( y ). Use the result to show that the avs z · and w are independent. 5-17 The coordinates of a point (x, y) on the plane are two avs x and y with joint density f.,.(x, y). The corresponding polar coordinates are r = Vx2 "'" y2 • = arctan l X -7r < t/1 < 1r Show that their joint density equals f,•(r, t/1) = rf.,(r cos t/J, r sin t/J) Special case: Show that if the avs x and y are N(O, cr) and independent, the avs rand f/1 are independent, r has a Rayleigh distribution, and f/1 is uniform in the interval ( -7r, 1r). PROBLEMS 167 5·28 The RVS x and yare independent and z = x + y. 
(a) Find the density of y if /,(x) = ce· "Uix) /:(z) = c~ze •:U(z) (b) Show that if y is uniform in the interval (0, I l. then J;<:.l =F..(:.)- F,(:.- IJ 5-19 Show that if the RVs x and y are independent with exponential densities ce-.-.U(x) and ce-•)·U(y). respectively. their difierence z = x - y has the Laplace density (' 2 e-•:. 5·30 Given two normal RVS x and y with joint density as in (5-99), show that the probability masses m., m~. m 3• m4 in the four quadrants of the xy plane equal I a I a m, = ml = 4 - 21T m2 = m. = 4 - 21T where a ==arcsin r . .l' I Or 4 i1r I 4+ Or 27f r 0 I + 4 0' 2ir .t I 4 0' 2, Figure PS.30 5-31 The RVS x andy are N(O, u) and independent. Show that the Rvs z = x/y and w = Vx2 + y2 are independent. z has a Cauchy density. and w has a Rayleigh density: 6 _ _ __ Conditional Distributions, Regression, Reliability In this chapter, we use the concept of the conditional probability of events to define conditional distributions, densities, and expected values. In the first section, we develop various extensions of Bayes· formula, and we introduce the notion of Bayesian estimation in the context of an unknown probability. Later, we treat the nonlinear prediction problem and its relationship to the regression line defined as conditional mean. In the last section, we present a number of basic concepts related to reliability and system failure. 6-1 Conditional Distributions Recall that the conditional probability of an event .54 assuming .M. is the ratio P<-54I.M.> = P(-54 n .M.) (6-1) P(.M.) defined for every .M. such that P(.M.) > 0. In the following, we express one or both events .54 and .M. in terms of various avs. Unless otherwise stated, it will be assumed that all avs are of continuous type. The discrete case leads to similar results. 168 SEC. 6-1 169 CONDITIONAL DISTRIBUTIONS We start with the definition of the conditional distribution F(xi.M.> ofthe x assuming .M.. This is a function defined as in (4-2) where all probabilities are replaced by conditional probabilities. Thus RV F<xi.M.> = P{x s xi.M.} = P{x ~.~) ..t-t} (6-2) Here, {x s x, .M.} is the intersection of the events {x s x} and .M.; that is, it is an event consisting of all outcomes ' that are in .At and such that xW s x. The derivative f(xi.M.> = dF(xi.M.> (6-3) dx I of F(xi.M.> is the conditional density of x assuming .«. It follows from the fundamental note on page 48 that conditional distributions have all the properties of unconditional distributions. For example, F(x2I.M.> - F<xd.M.> = P{x, <xs x~'··1t} P{x < X S X (6-4) the area of f(x!.M.> equals I. and [sec (4-32)] f<xi.M.> dx Example 6.1 = P{x < x < x + dx'.M.} = + 6.x, .M.) P<.t.f.) (6-5) In the fair-die experiment, the RV xis such that x(/;) = IOi and its distribution is the staircase function of Fig. 6.1. We shall determine the conditional distribution F(x!.M.) where .M. = {f2,J4,Jf,}. If x 2! 60, then {x s x} is the cenain event and {x s x, .«} = .M.; hence. . F(xi.M.) P(.M.) = PU() = I Jo'igure 6.1 F<x> 0 F(xlcven) 10 X 0 20 X 170 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY If 40 s x < 60, then {x s x, .M.} = {h,J4}; hence, F( !..«) = P{Ji,J4} = 2/6 = ~ P(.M.) 3/6 If 20 s x < 40, then {x s x, .M.} = {Ji }; hence. X. 3 = P(.M.) P{Ji} = 1/6 = ! 3/6 3 x • .M.} = 0; hence, F(xiJO = 0. F( '.M.) XI Finally, if x < 20, then {x s • TOTAL PROBABILITY AND BAYES' FORMliLA We have shown [see (2-24)] that if [.s4 1 , • • • , .Sit,] is a partition of f:l, then (total probability theorem) P(~) = P<~l.stldP<.stld for any ll.ll. 
With ~ F(x) +· · ·+ P<~l.!lf,)P(.s-4,) = {x s x}. this yields = F<xl.stl.>P<.9fd + · · · + F<xi.Slf,)P(.~,) (6-6) (6-7) and by differentiation /(x) Example 6.2 = f<xl.SitdP(s4d + · · · + /lx .!lf,)P(.~,) (6-8) Two machines M 1 and M2 produce cylinders at the rate of 3 units per second and 7 units per second, respectively. The diameters of these cylinders are two RVs with densities N(TJ 1 , u) and N(TJ2, u). The daily outputs are combined into a single lot, and the RV x equals the diameters of the cylinders. We shall find the density of x. In this experiment, .Ill, = {C came. from MJ} .!ll2 = {C came from M2} are two events consisting of 30% and 70% of the units, respectively. Thus f<xl.!lt 1 ) is the conditional density of x assuming that the unit came from machine M 1 ; hence, /(xl.!ll 1 ) is N(fl" u). Similarly,f(xl.!ll2) is N(712, u). And since the events .!ll 1 and .!ll2 . form a partition, we conclude from (6-8) with P(.!ll 1) = .3 and P(.!ll2) = .7 that /(x) = _.3_ e-l•-,1)12a2 + _.7_ e·~-,z,ZJ2a2 0' as in Fig. 6.2 • Flpre 6.2 v'21r 0' v'21r SEC. 6-J CONDITIONAl. DISTRIBUTIONS 171 We shall now extend Bayes' formula /'(!1\j.'.'i) /'(:;ij.?\) '::. """'P{:1\'i'""" 1'1::11 to RVs. With :1', -= {x s P{:'llx s Similarly lsce (6-411. . P{.~4lx, x~. (6-9) x~ (6-9) yields P{x !:= xl:;l~ Ftrl::i} "- - · - - Pl::i) -- · -.- P(.·.·l) P{x ::: xt J· lx I < x :<X!~=- Hx1l::l) F<xd::i) -F(.r.) _ H:~ /'{::1) (6- J()) (6-11) Using (6-11 ). we shall define the conditional prutmbility P{:illxl -- P{::1ix -= x} uf an event :,4 assuming x '"'x. If P{x = x} > 0. we can usc (6-9). We cannot do so. however. if x is of continuous type because then P{x =- x~ = 0. In this case. we define P{.'lll.rl as a limit: P{.'lllx - x} - lim P{:;llx l.•-11 With x 1 = x. x1 -= x + ~x. <; x ': x f ~.r} it follows from this and (6-11 1 that -= .(<xl::l)/'(::i) / '(::'I··) • ..• ./hi (6-12) We next multiply both sides by ./hi and integrate. Since the area uff(xj:AI equals 1. we obtain (6-13) This is another form uf the total probability theorem. The corresponding version of Bayes' formula follows frum (6-121: Jhkill = ···, Pt:.tilxlf<~.. J. P<.~·llx!f<xldx (6-14) Bayesian Estimation We are given a coin and we wish to determine the probability p that heads will show. To do so. we toss it n times and observe that heads shows k times. What conclusion can be drawn from this observation about the unknown p? This problem can be given two interpretations. In the first interpretation, pis viewed as an unknown pantmeter. In the second, pis the value of an RV p. These interpretations are examined in detail in Part Two. Here we introduce the second interpretation (Bayesian) in the context of Bayes' formula (6-14). We assume that the probability of heads is an RV p defined in an experiment f:/,.. This experiment could be the random selection of a coin from a large supply of coins. The RV p takes values only in the interval (0, I); hence, its density f(p) vanishes outside this interval. We toss the selected 172 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY coin once, and we wish to find the probability PC~) that heads will show, that is, that the event '3t = {heads} will occur. • Theorem. In the single toss of a randomly selected coin, the probability that heads will show equals the mean of p: P('R) = J: pf(p)dp (6-15) • Proof. The experiment of the single toss of a randomly selected coin is a Cartesian product ~c x ~ 1 where :/1 = {h, 1}. In this experiment, the event '#! = {heads} consists of all pairs ,,.h where ,,. 
is any element of the space :/,.. The probability that a particular coin with p(t.) = p will show equals p. This is the conditional probability of the event :N, assuming p =- p. Thus P{H'jp}.-:.p (6-16) Inserting into (6-13). we obtain (6-15). Suppose that the coin is tossed and heads shows. We then have an updated version of the density ofp, given byf(pi1n. This density is obtained from (6-141 and (6-16): f(pl"/() = 1 f. I pf(p) (6-17) P.f(p)dp Rt:r•:ATEU TRIALS We now consider the toss of a randomly selected coin n times. The space of this experiment is a Cartesian product !f,. x Y, where !f, consists of all sequences of length n formed with h and 1. In the space [f, X if,. dl = {k heads in a specific order} is an event consisting of all outcomes of the form '· hi · · · h where " is an element of !f, and ht · · ·his an element of !f,. As we know from (3-15). /'(.':o'ljp) = p"q"- 4 q =I - p This is the probability that a particular coin with p(?:,.l will show k heads. From this and (6-13) it follows that P(~) = J: p"q"-kj'(p)dp (6-18) = p. tossed" times. (6-19) Equation (6-19) is an extension of (6-15) to repeated trials. The corresponding extension of (6-17) is the conditional density f(pj.slt) = 1 fo p"q" -kj'(p) p"q"-"f(p)dp (6-20) of the RV p assuming that in n trials k heads show. This density is used in the following problem. 6-1 SEC. CONDITIONAL DISTRIBUTIONS 173 We have selected a coin, tossed it n times. and observed k heads. Find the probability that at the next toss heads will show. This is equivalent to the problem of determining the probability of heads of a coin with prior density f(ploSif). We can therefore use theorem (6-15). The unknown probability equals (6-21) Example 6.3 Suppose that p is uniform in the interval (0. 1). In this case • .f(p) = I; hence, the probability of heads (see (6-15)] equals fl pdp = ! Jo 2 We toss the coin n times; if heads shows k times. the updated density of p is obtained from (6-20). Using the identity fl •o _ Jo P P )'r-kd P = k!(n (n - k)! + I)! (6-22) we obtain (n ... I)! 4 -4 ji(p .s4) -- k!(n (6-23) - k)! P q" This function is known as beta density. lnsening into (6-21). we conclude that the probability of heads at the next toss equals (1 _ (n + I)! (1 H _ _ k -.- 1 (6-24) Jo pf(pl.si)dp- k!(n - k)! Jo p q" 4 dp - n • 2 Thus after the observation of k heads, the density of the coin is updated from uniform to beta and the probability of heads from 1/2 to <k _._ 1)/(n + 2). This result is known as the law of succession. • The assumption that the density of the die, prior to any observation, is constant is justified by the subjectivists as a consequence of the principle of insufficient reason (page 17). In the context of our discussion, however, it is only an assumption (and not a very good one, because most dice are fair). Returning to the general case, we shall call the densitiesf(p) andf(ploSif) in (6-20) prior and posterior, respectively. The posterior density f(ploSif) is an updated version of the prior, and its form depends on the observed number of heads. The factor /(p) = pk(l - p)"··k in (6-21) is called the likelihood function (see Section 9-5). This function is maximum for p =kin. In Fig. 6.3 we show the functionsf(p), /(p), and f(p 1st) - l(p lf(p) For moderate values of n, the factor /(p) is smooth, and the product l(plf(p) exhibits two maxima: one near the maximum kl n of l(p) and one near the 174 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION. RELIABILITY 1: /tp) 2: /(p) 3: f<p~:Jll p II II ta) (bl Figure 6.3 maximum of f(p). 
As n increases. the sharpness of /(p) prevails, and the functionf(pl.llf) approaches /(p) regardless of the form of the prior f(p). Thus as n- ~. f(plS~t> approaches a line at p = kin and its mean. that is. the probability of heads at the next toss, tends to kin. This is, in a sense, the model justification of the empirical interpretation kl n of p. Bayes' Formulas The conditional distribution F<xl.M.> of an RV x assuming .M. involves the event {x s x, .M.}. For the determination of F<xl.M.>. knowledge of the underlying experiment is therefore required. However, if .M. is an event that can be expressed in terms of the RV x, then F<xl.M.> can be expressed in terms of the unconditional distribution F(x) ofx. Let us look at several illustrations (we assume that all RVs are of continuous type). Suppose. first. that .M. :. {x -5 11} In this case. F<.ri.M.) is the conditional distribution ofx assuming that x-:; a. Thus ,..( I < ) = P{xP{~ x. x s} a} (6-25) r.t'X-a X~ II where tl is a fixed number and x is a variable ranging from -x to x. The event {x :s: x. x -5 a} consists of all outcomes such that x s x and x :s: a; its probability depends on x. If x ~ a. then {x c:; x. x s tl} = {x s a}: hence. • P{x sa} f<xlx;s:tl)=P.{ }=I · X S ll Sl'.C. 6-2 175 BAYES' FORMULAS y ~a) F(xlx a 0 If x <a. then {x -;: x. x ~ a} - {x < x}: hence. . P{x s x} Fix) f<xlx :sa):.:: - ,{ .---~- - .( l 1· a 1 x <.:; ar Thus F( x jx :s c1) is proportional to F< x) for x :s a. <md fur x > a it equals I <Fig. 6.4). The conditional density f<xlx < a) is uhtained by differentiation. Suppose. next. that ..tf = {a ~ x < h} We shall determine f<x;.M) directly using (6-5): .( xasxs I . b rt d x=-P{ x < x :::; x ········-·....,+ dx. a_<_x_<_-_h...!.} ./ P{a ~ x ·=: h} To do so. we must find the probability of the event { t .,. X < t I c/.t} {x < x < x + dx. a < x :::: h} = { · ' ·(·} · (6-26) a<x<b otherwise Since P{x < x:::: x + dx} = fl.tldx. we conclude from 16-5) that .f<x;a fix) ~ "<; b) .:... Feb.) - /:(a·) fur a< x < b (6-27) and zero otherwise. Example 6.4 x is NITJ. o-1: we shall find its conditional density fix AO assuming.« = :s "< ., + o-}. As we know (see Fig. 4.201, the probability that x is in the interval (TJ - u, ., + o-) equals .683. Setting F(b) - F(a) = .683 in (6-271. we obtain the truncated normal density The RV {., - (T /(xllx - .,1 < o-) = I ·- . e-.. -.,..,•u· .683o- Vi,; for ., - u < x < ., - u and zero otherwise (Fig. 6.5). • Hmpirkal llltl'rprt•ltltion. We wish to determine empirically the conditional distribution F<xla :s x :s b). To do so, we repeat the experiment n times, and we reject all outcomes ' such that x(') is outside the interval (a, b). In the subsequenc~ s 1 of the remaining n 1 trials, the function F(x a :s x s b) has the 176 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY y /(XI lx - 111 <a) Fagure6.S empirical interpretation of an unconditional distribution: For a specific x, it equals the ratio n;/n 1 where n. . is the number of trials of the subsequence s 1 such that x(C) s x. Suppose that our experiment is the manufacture of cylinders and that X;= x(C;) is the diameter of the ith unit. To control the quality of the output, we specify a tolerance interval (a, b) and reject all units that fall outside this interval. The function F<xla < x s b) is the distribution of the accepted units. Note, finally, that if y = g(x) is a function of the RV x, its conditional density /,(yi..«.) is obtained from (4-78), where all densities are replaced by conditional densities. 
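The empirical interpretation just described can be carried out directly by simulation. The sketch below (Python with NumPy; the values of η, σ, the interval, and the sample size are illustrative choices, not taken from the text) rejects all samples of an N(η, σ) RV falling outside (η − σ, η + σ) and compares the relative frequencies in the retained subsequence with the conditional distribution obtained by integrating (6-27), as in Example 6.4.

```python
import numpy as np
from math import erf, sqrt

# Empirical check of (6-27) and Example 6.4: conditional distribution of a
# N(eta, sigma) RV x given a < x <= b, estimated by rejecting samples outside (a, b).
rng = np.random.default_rng(0)
eta, sigma, n = 5.0, 2.0, 200_000
a, b = eta - sigma, eta + sigma

x = rng.normal(eta, sigma, n)
kept = x[(x > a) & (x <= b)]          # the subsequence of trials with a < x <= b
print(len(kept) / n)                  # ~ 0.683, the normalizing constant of Example 6.4

def F(t):                             # N(eta, sigma) distribution function
    return 0.5 * (1 + erf((t - eta) / (sigma * sqrt(2))))

t = 5.5
empirical = np.mean(kept <= t)                 # relative frequency in the retained subsequence
theoretical = (F(t) - F(a)) / (F(b) - F(a))    # conditional distribution from (6-27)
print(empirical, theoretical)                  # the two numbers agree to about two decimals
```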
If, in particular, .M is an event that can be expressed in terms ofx, then/,(yi.M> can be determined in terms offx(x) and the function g(x). Example 6.5 If .M. = {x ;;:: O} and then [see (4-84)J _ /,(ylx ;;:: 0) - I Wy /x(Vy) • 1 - F... (O) U(y) Joint Distributions We shall investigate the properties of the conditional probability P(sfl~> when both events sf and :i are specified in terms of the RVs x and y. We start with the determination of the function F,.(ylx 1 s x s x 2 ). With sf = {y s y} and 00 = {x 1 s x s x 2}. it follows from (6-1) that .... (V IX1 r,. · · S X S X2 ) = P{x1 s x s x 2 • y s y} P{x1 s x s x2} F(x2• y) - F<x1. y) = -'-:::-=-:-.::....;..-....,..;.__:...;.,.::..;. F.,(x2) - F.,(xl) (6- S) 2 We shall use (6-28) to determine the conditional distribution F,.(yix> of y assuming x = x. This function cannot be determined from (6-1) because the event {x = x} has zero probability. It will be defined as a limit: F,.(yix> • = .1.<..:.0 lim F,.(yix s . x s x + .:1x) SEC. 6-2 BAYES' FORMULAS 177 Setting x1 = x and x~ = x + dy in (6-28) and dividing numerator and denominator of the right side hy .1x. we conclude with ~x- 0 that F,.(\·lx) = -.-1- iiF(.x. Yl · f,<x l fiX (6-29) The conditional density .f;<ylxl of y. assuming x ,. x. is the derivative of F,<ylx> with respect toy. Thus . ill-',1\-l.rl j.(\' \') = ____.___:_ ___ \ iJ,\' ' I· I ii~F(x. \') -= -·- -- -·- .t: (.\') tl.\: ily The functionj~Lrlyl is defined simil<1rly. Omitting suhscripts. we conclude from the foregoing and from (5-8) that . I (.\'. .\') .(\'X) I -= f<x. ,.l --: --· .I . .I< x l ..:. . \') ·rtx. ..... If the Rvs x andy are independent. f<x. yl = ./lrl.f<yl .f(_\':xl == f<yl (6-30) :...._ .I(_\') /( X j .\' ) -' .f< X ) For a specific x. the functiun f< x. y I is the intersection ( prc~lilt'l of the surface z = j'(x. yl hy the plane x =-constant. The conditional density.f<ylx). considered as a function ofy. is the profile of.fLr. yl normalized hy the factor I /.f~( X). From (6-30) and the relationship f(y) = f ..J<x. )')dx between marginal and joint densities. it follows that f(y) = J:. (6-31) f(y;x)f(x)dx This is another form of the total probability theorem. The corresponding form of Bayes' formula is f(x•y) · Example 6.6 = f<yix)f(x) f(y) = f<ylx>f<x> (6- 32 ) f..f<y:x)f(x)dx The RVs x andy are normal with zero mean and density (sec (5-99)1 I • (X~ f<x. \')- exp { - , - 2r-·--.:..., • 2(1 - r·) Uj u u Ui X\' \'2)} (6-33) 1 2 We shall show that (6-34) 178 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY • Proof. As we know, f(x) = I , . ~ e-x·t2aj (6-35) u 1V21T Dividing (6-31) by (6-33), we obtain an exponential with exponent I - 2(1 - 2 r 2) (x xy v2 ) x2 u 2 - 2r u 1u, + ~; + ~ I - - • = 2u~ll • I - ( r2) Y - ru2 "'0) x ) 2 and (6-34) results. • Conditional Expected Values The mean E{x} of a continuous type RV x equals the integral in (4-87). The conditional mean E{xi.M.} is given by the same integral wheref(x) is replaced by f<xi.M.). Thus E{xi.M.} = f,. xf<xi.M.>dx (6-36) The empirical interpretation of E{xi.M.} is the arithmetic mean of the samples x1 of x in the subsequence of trials in which the event .M. occurs. Example 6.7 Two light bulbs are bought from the same lot. The first is turned on on May I and the second on June I. If the first is still good on June 1, can it, on the averctge, last longer than the second? Suppose that the density of the time to failure is the functionj(x) of Fig. 6.6. In this case, E{x} = 3.15 month. 
The conditional density f<xlx 2: I) is a triangle, and E{xlx 2: I} = 5 months. Thus the average time to failure after June I of the old bulb is 5 - I = 4 months and of the new bulb 3.15 months. 1bus the old bulb is better than the new bulb! This phenomenon is observed in statistics of populations with high infant mortality: The mean life expectancy a year after birth is larger than at birth. • If .M. is an event that can be expressed in terms of the RV x, the conditional mean E{yi.M.} is specified if f(x, y) is known. Of particular interest is Fipre 6.6 )' .7 -/(X) /(XIX;::: I) f .I 0 'llx s 13 X SI::C. 6-2 BAYES' FORMULAS 179 Figure 6.7 the conditional mean E{ylx} ofy assuming x = x. As we shall see in Section 6-3, this concept is important in mean-square estimation. Setting .M. = {x s x s x + ~x} in (6-36), we conclude with ~x- 0 that = E{ylx} r. yf(y!x)dy (6-37) Similarly, =f. g(y)J(yix)dy E{g(y)lx} (6-38) where f(ylx) is given by (6-30). For a given x, the integral in (6-37) is the center of gravity of all masses in the strip (x, x + dx) of the xy plane (Fig. 6. 7). The locus of these points is the function t/J(x) = f, yf(y!x)dy (6-39) known as the regression curve of y on x. Example 6.8 lfthe avs x andy are jointly normal with zero mean, then Isee (6-34)] (y - rcr2xlcrt)2 } /(ylx) - exp { 2u~(l - r2) For a fixed x, this is a normal density with mean ru2xlu 1: hence. E{y.x} = cb(x) u2 x . = r u, (6-40) If the avs x andy are normal with mean Tlx and Tl~·· respectively, thenf(ylx) is given by (6-34) where y and x are replaced by y - Tly and x - Tlx. respectively. In this case,J<ylx) is a normal density in y with mean rcr2 (6-41) 1/>(x) = TJy + (x - TJ,) u, Thus the regression curve of normal avs is a straight line with slope rcr2/cr 1 passing through the point (TJx• .,,). • 180 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY The mean E{y} of an av y is a deterministic number. The conditional mean E{ylx} is a function t/>(x) of the real variable x. Using this function, we form the composite function t/>(x) as in Section 4-4. This function is an av with domain the space fl. Thus starting from the deterministic function E{ylx}, we have formed the av E{ylx} = t/>(x). • Theorem. The mean of the av E{ylx} equals the mean of y E{E{yjx}} • Proof. As we know from (4-94). E{ylx} = E{fb(x)} = = E{y} t. (642) fb(x)j(x)dx Inserting (6-39) into this and using the fact that f(yjx)/(x) obtain E{t/>(x)} =f.. (t. yf(yjx)dy) f(x)dx =f. t . = f(x, y), we yf(x, y)dxdy This yields (642) because the last integral equals E{y} [see (S-28)]. In certain problems, it is easy to evaluate the function E{ylx }. In such cases, (642) is used to find E{y}. Let us now look at an illustration involving discrete type avs. Example 6.9 The number of accidents in a day is a Poisson av x with parameter a. The accidents are independent, and the probability that an accident is fatal equals p. Show that the number of fatal accidents in a day is a Poisson av y with parameter ap. To solve this problem, it suffices to show that the moment function of y [see (5-60)) equals E{e"} = e"'1"'-ll (6-43) This involves only expected values; we can therefore apply (6-42). • Proof. The RV xis Poisson by assumption; hence, P{x = n} = e·o a" n! n = 0, I, . . . Ifx = n, we haven independent accide.nts during that day, and the probability of each equals p. From this it follows that the conditional distribution of the number y of fatal accidents assuming x = n is a binomial distribution: P{y = klx = n} = (~) p"q"-• k = 0. I, . . . 
, n (6-44) and its moment-function [see (5-66)] equals E{e"lx = n} = (pe' + q)" (6-45) The right side is the value of the, RV (pe' + q)• for x = n. Since s has a Poisson distribution, it follows from (4-94) that the expected value of the RV (pe' + q)S equals SEC. 6-3 x 2 .. ~o NONLINEAR REGRESSION AND PREDICTION :x (pes + q)"P{x = n} = 2 .. -o 181 II (pes + q)" e u a' = e·u eulpr•·ql n. Hence [see (6-42)1 £{elY} = E{E{esYixH = E{(pes + q)"} and (6-43) results. = e"P•'·ull qt • We conclude with the specification of the conditional mean E{g(x, y)!x} of the RV g(x, y) assuming x = x. Clearly, E{g(x, y)lx} is a function of x that can be determined as the limit of the conditional mean E{g(x, y)lx !5 x !5 .\· + dx}. We shall, however, specify it using the interpretation of the mean as an empirical average. The function E{g(x, y)lx} is the average of the samples g(x;, y;) in the subsequence of trials in which x; = x. It therefore equals the average of the samples g(x, y;) of the RV g(x, y). This leads to the conclusion that E{g(x, y)lx} = E{g(x, y)jx} (6-46) ~ote the difference between the RVs g(x, y) and g(x, y). The first is a function of the RVs x andy; the second is a function of the RV y depending on the parameter x. However, as (6-46) shows. both avs have the same mean, assuming x = x. Since g(x, y) is a function ofy (depending also on the parameter x). its conditional mean is given [see (6-38)] by E{g(x. y)lx} = r. g(x, y)f(y 1x)dy (6-47) This integral is a function 6(x) of x; it therefore defines the Rv 6(x) E{g(x, y)lx}. The mean of 9(x) equals r. O<x>.f<x>dx = = r. f . f. r. .lf(X. y>f<ylx>f<x)c/xdy g(x. ylf<x. y)dxdy But the last integml is the mean of lf(X. yl: hence. E{E{g(x. yllx}} = E{lf(X, yl} Note the following special cases of (6-46) and (6-48): E{g,(x)g~<y>lx} f;{lft(X)g~(y)} = = lft(X )f;{lf~(y>lx} = E{g,(x)E{.!!~(yllx}} (6-48) (6-49) 6-3 Nonlinear Regression and Prediction The RV y models the values of a physical quantity in a real experiment, and its distribution is a function F(y) determined from past observations. We wish to predict the value y(() = y of this RV at the next trial (Fig. 6.8a). The outcome ( of the trial is an unknown element of the space ff; hence, y could 182 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY ------- y r.-.__ .__ ........ y!t;> = y c = £{y ~ ,. -~- '! (a) (b) Figure 6.8 be any number in the range ofy. We therefore cannot predict y; we can only estimate it. Suppose that we estimate y by a constant c. The estimation error y - c is the value of the difference y - c, and our goal is to choose c so as to minimize in some sense this error. We shall use as our criterion for selecting c the minimization of the mean-square (MS) error e = E{(y- c)2} = f,, (y - c:)':j(x)dx (6-50) This criterion is reasonable; however, it is used primarily because it leads to a simple solution. At the end of the section, we comment briefly on other criteria. To find c, we shall use the identity (see Problem 4-28) E{(y - d~) = (TJ,. - £')~ + CT~ (6-51) Since 71.•· and u~. are given constants. (6-51) is minimum if c = .., . = J~ "1,\ (6-52) ,.r.(\•)d\' J\ . • .... - Thus the least mean square (LMS) estimate of an mean. RV by a constant is its REGR•:SSION Suppose now that at the next trial we observe the value x(') = x of another RV x. On the basis of this information. we might improve the estimate of y if we use as its predicted value not a constant but a function c/J(x) of the observed x. 
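Before turning to estimates that depend on x, here is a small numerical illustration of (6-50)–(6-52) (Python with NumPy; the exponential population and the grid of candidate constants are arbitrary choices): viewed as a function of the constant c, the mean-square error is smallest at the mean of y, and the identity (6-51) holds for every c.

```python
import numpy as np

# Numerical illustration of (6-50)-(6-52): among all constants c, the
# mean-square error E{(y - c)^2} is smallest for c = E{y}.
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=100_000)   # an arbitrary population for illustration

c_grid = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((y - c) ** 2) for c in c_grid])

c_best = c_grid[np.argmin(mse)]
print(c_best, y.mean())               # argmin of the MS error ~ sample mean, eq. (6-52)

# Identity (6-51): E{(y - c)^2} = (eta_y - c)^2 + sigma_y^2, here for c = 3
c = 3.0
print(np.mean((y - c) ** 2), (y.mean() - c) ** 2 + y.var())   # the two numbers agree
```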
It might be argued that if we know the number x = x(,), we know also' and, hence, the value y = y(') ofy. This is the case, however, only ify is a function of x. In general, x <C•> = x for every C. in the set .sl.r = {x = x }, but the corresponding values y((.) of y might be different (Fig. 6.8b). Thus the observed value x of x does not determine the unknown value y = y( (.) of y. It reduces our uncertainty about y, however, because it tells us that C. is not an arbitrary element of Y but an element of its subset .sl.r. SEC. 6-3 NONI.INEAR REGRESSION AND PRF.DICTION ]83 For example. suppose that the RV y represents the height of all boys in a community and the RV x their weight. We wish to estimate the height y of Jim. The best estimate of y by a number is the mean Tl.•· of y. This is the average of the heights of all boys. Suppose. however. that we weigh Jim and his weight is x. As we shall show. the best estimate of y is now the average f..'{ylx} of all children that have Jim's weight. Again using the LMS criterion. we shall determine the function f/>(x) so as to minimize the mean value e = E{ly- f/>(x)j2} = f~ f~ [y- cb(xll:f<x. y)dxdy (6-53) of the square of the estimation error y - cb(x). • Theorem. The LMS estimate of the RV y in terms of the observed value x of the RV x is the conditional mean f/>(x) = E{ylx} = • Proof. Inserting the identity f<x. (' = = J'. J'. y) r. J:f'Cylx>dy = j(y x)f(x) into (6-53), (6-54) we obtain [y- f/>(x}ff'lyl.,·)flx)dxdy J'Jhl J'. [y d>(x)j~f'<yl.ndydx All integrands are positive: hence, e is minimum if the inner integral on the right is minimum. This integral is of the form (6-50) if cis changed to cb(x) andj(y) is changed to f<ylx>. Therefore. the integral is minimum if f/>(x) is given by (6-52), mutatis mutandis. Changing the function J..(y) in (6-52) to · f<yix), we obtain (6-54). We have thus concluded that the LMS estimate of y in terms of xis the ordinate f/>(x) of the regression curve (6-39). Note that if the RVs x andy arc normal as in (6-31 ). the regression curve is a straight line [see (6-40)]. Thus, for normal Rvs. linear and nonlinear predictors are identical. Here is another example of Rvs with this property. Example 6.10 Suppose that x and y are two Rvs with joint density equal to I in the parallelogram of Fig. 6-9 and zero elsewhere. In this case, f<YIX) is constant in the segment AB of the line Lx of the figure. Since the center of that segment is at y ..:.. .t/2, we conclude that E{ylx} = x/2. • Galton's Law The term regression has its origin in the following observation by the geneticist and biostatistician Sir Francis Galton (1822-1911): "Population extremes regress toward their mean ... In terms of average 184 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY )I f(x) =! 2 Figure 6.9 heights of parents and their adult children, this can be phrased as follows: Children of tall (short) parents are on the average shorter (taller) than their parents. This observation is based on the fact that the height y of children depends not only on the height x of their parents but also on other genetic factors. As a result, the conditional mean of children born of tall (short) parents, although larger (smaller) than the population mean, is smaller (larger) than the height of their parents. This process continues until after several generations, the mean height of descendants of tall (short) parents approaches the population mean. This empirical result is called Galton's law. 
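A simulation makes the law concrete. The sketch below (Python with NumPy) uses the normal regression line (6-41) with η_x = η_y = η and σ_1 = σ_2 = σ, so that E{y|x} = η + r(x − η); the numerical values η = 175 cm, σ = 7 cm, r = 0.5 are illustrative assumptions, not data from the text. Parents more than one standard deviation above the mean have children who are, on the average, taller than the population but shorter than themselves.

```python
import numpy as np

# Regression toward the mean for jointly normal "heights" with equal dispersions:
# E{y | x} = eta + r (x - eta), cf. (6-41). All numerical values are illustrative.
rng = np.random.default_rng(2)
eta, sigma, r, n = 175.0, 7.0, 0.5, 500_000

x = rng.normal(eta, sigma, n)                                          # parents' heights
y = eta + r * (x - eta) + rng.normal(0, sigma * np.sqrt(1 - r**2), n)  # children's heights

tall = x > eta + sigma               # "tall" parents, one standard deviation above the mean
print(x[tall].mean())                # ~ 185.7  average height of the tall parents
print(y[tall].mean())                # ~ 180.3  their children: above eta, below the parents
print(eta)                           #   175.0  population mean
```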
The statistical interpretation of the law can be expressed in terms of the properties of the regression line cp( x) = E {yl x}. We observe first. that if x >.,,then cp(x) < x; if x <.,,then cp(x) > x (6-55) because cp(x) is the mean height of all children whose father's height equals x. This shows that cp(x) is below the line y = x if x > ., and above the line y = x if x < Tl· The evolution of this process to several generations leads to the following property of cp(x ). We start with a group of parents with height x 0 > 71 as in Fig. 6.1 0, and we find the average Yo = cp( x0 ) < x0 of the height of their children. We next form a group of parents with height x, = y 0 and find the average y 1 = cp(x1) < y0 of the height of their children. Continuing this process, we obtain two sequences x,. and y,. such that x,. > cp(x,.) = y,. = x,.. 1 > cp(x,..,) = y,. .. , = x,..2 - 71 n-x Starting with short parents, we obtain similarly two sequences x~ and y~ such that x~ < cp(x~) = y~ = x~ .... < cp(x~.,) = y~., = x~.2 - 71 n- x This completes the properties of a regression curve obeying Galton's law. Today, the term regression curve is used to characterize not only a function obeying Galton's law but any conditional mean E{ylx}. SEC. 6-3 NONLINEAR REGRESSION AND PREDICTION 185 y xi> xj xi I XI I Xo X 1.75 meters Figure 6.10 THE ORTIIOGO~i\1.1'1'\' PRI~CIPU: We have shown in (5-49) that if tl + hx is the best MS fit of y on "· then E{[y - (a t bx))x} = 0 (6-56) This result can be phrased as follows: If tl + hx is the linear predictor ofy in terms of x. then the prediction error y - (a t hx) is onhogonal to x. We show next that if «f,(x) is the nonlinear predictor of y. then the error y - (/,(x) is onhogonal not only to x but to any function q(x) of "· • Theorem. If «f,(x) .:..: E{ylx}. then E{ly - (/,(x})q(x)} -= 0 (6-57) • Proof. From the linearity of expected values. it follows that E{y ·· «f,(x}lx} = l:.'{ylx} - (/,(x} "" 0 Hence tsee (6-49)), E{ly - «f,(x)lq(x)} = f.:{q(x)E{y - «f,(x)lx}} = 0 and (6-57) results. The Rao-Bia~kwell Theorem The following corollary of (6-57) is used in parameter estimation (see page 313). We have shown in (6-32) that the mean .,. of «f,(x) equals the mean.,, of y. We show next that the variance of «f,(x) 186 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABII.ITY does not exceed the variance of y: .,.. = "'~· ..... <,.2 ,.2 (6-58) ") = y - 4>(x) + 4>(x) - .,,. it follows that (y - .,,)2 = (y - 4>(x)J2 + [4>(x) - TJc~~1 2 + 2[y - 4>(x)][ 4>(x) - "~• 1 • Proof. From the identity y - .,, We next take expected values of both sides. With q(x) = 4>(x) - .,,. it follows from (6-57) that the expected value of the last term equals zero. Hence, and (6-58) results. Risk and Loss Returning to the problem of estimating an RV y by a constant c, we note that the choice of cis not unique. For each c, we commit an error y - c, and our objective is to reduce some function L(y - c) of this error. The selection of the form of L(y - c) depends on the applications and is based on factors unrelated to the statistical model. The curve L(y - c) is called the loss function. Its mean value R = E{L(y- c)}= r. L(y- c)f(y)dy is the average risk. If L(y - c) = (y - c)2, then R is the MS error e in (6-50) and c = .,,. Another choice of interest is the loss function L(y - c) = ly - ci. In this case, our problem is to find c so as to minimize the average risk R = E{ly- cl}. Clearly, R = f . IY- clf<y)dy = t. 
(c- y)f(y)dy +I: (y- c)f(y)dy Differentiating with respect to c, we obtain - ~~ = J:,,.f(y)dy- J.." f(y)dy = F(c)- = 2F(c)- 1 = 1/2, that is, if c [I - F(c)] Hence, the average risk E{ly - cl} is minimum if F(c) equals the median of y. 6-4 System Reliability A system is an object made to perform a function. The system is good if it performs that function, defective if it does not. In reliability theory, the state of a system is often interpreted statistically. This interpretation has two related forms. The first is time-dependent: The interval of time from the moment the system is put into operation until it fails is a random variable. For example, the life length of a light buib is the value of a random variable. The second is time-independent: The system is either good with probability SEC. 6-4 SYSTF.M RF.I.IABII.ITY 187 p or defective with probability I - p. In this interpretation. the state of the system is specified in terms of the number p: time is not a factor. For example, for all practical purposes, a bullet is either good or defective. In this section, we deal primarily with time-dependent systems. We introduce the notion of time to failure and explain the meaning of conditional failure rate in the context of conditional probabilities. At the end of the section, we consider the properties of systems formed by the interconnection of components. • Definition. The tinw to failun• or /{/(• length of u system is the time intcrvul from the moment the system is put into operation until it fails. This interval is an RV x ::: 0 with distribution F(t) = P{x s t}. The difference R(t) = I F(l) =-= P{x > t} (6-59) is the sy.'ltt•m rt'liahility. Thus F(l) is the probability that the system fails prior to time 1. and R(l) is the probability that the system functions at time 1. The mean of x is called mt•tm timt• to fai/m·(•. As we sec from (4-92). f:{x} = J.•u xf(x)dx = J.' R(l)dt • u (6-6()) because F(x) = 0 for x < 0. The conditional distribution L'( r I XX > ) __ I - P{x s x. x? t} " -· ,, P{ > (6-61) l is the probability that a system functioning at time 1 will fail prior to timex. Clearly. F(xlx >I) = 0 if x < 1 and _ F(x) - F(l) (6- ) 62 F< X IX > I ) I _ F(l) x > I Differentiating with respect to x, we obtain the conditional density > _ f(x) f(x IX - I) - I _ F(l) x > (6-63) I The product f<xlx::: t)dx is the probability that the system will fail in the time interval (x, x + dx), assuming that it functions at time 1. Example 6.11 Suppose that x has an exponential distribution F(x) =(I - e·o•)U(x) /(x) = ae "'U(x) In this case (Fig. 6.11), f<xlx ::: I) ae·o• = ---=or = ae··ol.c -II e x> 1 (6-64) Thus, in this case, the probability that a system functioning at time t will fail in the interval (x, x + dx) depends only on the difference x- 1. As we shall see, this is true only if/(x) is an exponential. • 188 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY y X Figure 6.11 Conditional Failure Rate The conditional density f<xlx ~ t) is a function of x and t. Its value at x = tis a function of t fl(t) =/Ctlx ~ t) (6-65) I known as the conditional rate offailure or hazard rate. The product {j(t)dt is the probability that a system functioning at time t will fail in the interval (t, t + dt). Sincef(x) = F'(x) = -R'(x), (6-63) yields fl(t) Example 6.12 F'(t) F(t) =I- =- R'(t) R(t) (6-66) The time to failure is an av x uniformly distributed in the interval (0, T) (fig. 6.12). In this case, F(x) = x/T, R(t) = I - x!T, and (6-65) yields Q( ) liT I ,.. 
β(t) = (1/T)/(1 − t/T) = 1/(T − t),   0 ≤ t < T •

Using (6-66), we shall express F(x) in terms of β(t).

• Theorem
    1 − F(x) = R(x) = exp{ −∫₀ˣ β(t) dt }    (6-67)

Figure 6.12 (f(x), β(t), and F(x) for the uniform time to failure)

• Proof. Integrating (6-66) from 0 to x and using the fact that F(0) = 1 − R(0) = 0, we obtain
    −∫₀ˣ β(t) dt = ln R(x)
and (6-67) follows.

To express f(x) in terms of β(t), we differentiate (6-67). This yields
    f(x) = β(x) exp{ −∫₀ˣ β(t) dt }    (6-68)

Note that the function β(t) equals the value of the density f(x|x ≥ t) for x = t; however, β(t) is not a density. A conditional density f(x|ℳ) has all the properties of a density only if the event ℳ is fixed (it does not depend on x). Thus f(x|x ≥ t), considered as a function of x, is a density; however, the function β(t) = f(t|x ≥ t) does not have the properties of densities. In fact, its area is infinite for any F(x). This follows from (6-67) because F(∞) = 1.

EXPECTED FAILURE RATE  The probability that a given unit functioning at time t will fail in the interval (t, t + δ) equals P{t < x ≤ t + δ | x > t}. Hence, for small δ,
    β(t)δ ≈ P{t < x ≤ t + δ | x > t}    (6-69)
This has the following empirical interpretation: Suppose, to be concrete, that x is the time to failure of a light bulb. We turn on n bulbs at t = 0 and denote by n_t the number of bulbs that are still good at time t and by Δn_t the number of bulbs that fail in the interval (t, t + δ). As we know [see (2-54)], the conditional probability β(t)δ equals the relative frequency of failures in the interval (t, t + δ) in the subsequence of trials involving only bulbs that are still good at time t. Hence
    β(t)δ ≈ Δn_t / n_t    (6-70)
Equation (6-69) is a probabilistic statement involving a single component; (6-70) is an empirical statement involving a large number of components. In the following, we give a probabilistic interpretation of (6-70) in terms of N components, where N is any number, large or small.

Suppose that a system consists of N components and that the ith component is modeled by an RV x_i with R_i(t) = P{x_i > t}. The number of units that are still good at time t is an RV n(t) depending on t. We maintain that its expected value equals
    η(t) = E{n(t)} = R_1(t) + ··· + R_N(t)    (6-71)

• Proof. We denote by y_i the zero-one RV associated with the event {x_i > t}. Thus y_i = 1 if x_i > t and y_i = 0 if x_i ≤ t; hence, n(t) = y_1 + ··· + y_N. This yields
    E{n(t)} = E{y_1} + ··· + E{y_N} = P{x_1 > t} + ··· + P{x_N > t}
and (6-71) results.

Suppose now that all components are equally reliable. In this case,
    R_1(t) = ··· = R_N(t) = R(t)    η(t) = N R(t)    (6-72)
and (6-71) yields [see (6-66)]
    β(t) = −R′(t)/R(t) = −η′(t)/η(t)    (6-73)
Thus we have a new interpretation of β(t): The product β(t)dt equals the expected number η(t) − η(t + dt) of failures in the interval (t, t + dt) divided by the expected number η(t) of components that are still good at time t. This interpretation justifies the term expected failure rate used to characterize the conditional failure rate β(t). Note, finally, that (6-73) is the probabilistic interpretation of (6-70).

Weibull Distribution  A special case of particular interest in reliability studies is the function
    β(t) = c t^(b−1)    (6-74)
This is a satisfactory approximation of a failure rate for most applications, at least for values of t near t = 0.
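Relations (6-67) and (6-68) are easily checked numerically for this hazard rate. The sketch below (Python with NumPy; the values of c and b and the integration grid are arbitrary choices) integrates β(t) = ct^(b−1) on a grid, forms the reliability and the failure density, and recovers the hazard back from R(t).

```python
import numpy as np

# Numerical check of (6-67)-(6-68) for the hazard rate beta(t) = c t^(b-1) of (6-74).
c, b = 1.5, 2.0
t = np.linspace(0.0, 3.0, 30001)
dt = t[1] - t[0]

beta = c * t ** (b - 1)              # conditional failure rate (6-74)
H = np.cumsum(beta) * dt             # running integral of beta from 0 to t
R = np.exp(-H)                       # reliability, eq. (6-67)
f = beta * R                         # failure density, eq. (6-68)

# Carrying out the integral of c t^(b-1) gives c t^b / b, so (6-67) becomes exp(-c t^b / b)
print(np.max(np.abs(R - np.exp(-c * t ** b / b))))   # ~ 0, up to the grid error

# Recovering the hazard from R via (6-66): beta = -R'/R
beta_hat = -np.gradient(R, dt) / R
print(np.max(np.abs(beta_hat[1:-1] - beta[1:-1])))   # small finite-difference error
print(f.sum() * dt)                                  # ~ 1: f integrates to F(3), already near 1
```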
The density of the corresponding time to failure is given by [see (6-68)] /(x) = ex" •e·•A•'"V(x) (6-75) This function is called the Weibu/1 density. It depends on the two parameters c and b, and its first two moments equal E{x} = (~r''" r(b; I) E{x2} = (~f2'" r(b; 2) where f(x) is the gamma function. This follows from (4-54) with x" = y. In Fig. 6.13, we show f(x) and {3(t) forb= I. 2, and 3. If b = I, then{3(t) = c = constant and f(x) = c:e··uu(x). This case has the following interesting property. Figure 6.13 f(x) Weibull c = t 0 X 0 SEC. 6-4 SYSTEM RHIABil.ITY 191 Memory/ess Systems We shall say that a system is memoryless if the probability that it fails in an interval (t, x) assuming that it functions at timet, depends only on the length x - t of this interval. This is equivalent to the assumption that f<xlx ~ I) = f<x - t) for every x :::: t (6-76) With x = t, it follows from (6-76) and (6-66) that {3(1) = f(tlx ~ t) = /(0) = constant. Thus a system is memoryless iff fC x) is an exponential density or. equivalently, iff its conditional failure rate is constant. Interconnection of Systems We determine next the reliability of a system S consisting of two or more components. Parallel Connection We shall say that two components s, and s~ are connected in parallel forming a system S if S functions when at least one of the systems 5 1 and s~ functions (fig. 6.14a). Denoting by x,. x~. and z the times to failure of the systems 5 1• s~. and S. respectively. we conclude that z ""'- t if the larger of the numbers x 1 and x~ equals t: hence. z =max (x 1 • x~) <6-77) The distribution F.-<:.> of z can be expressed in terms uf the joint distribution ofx 1 and x~ as in <5-7H). If the systems S 1 and s~ arc independent. that is. if the RVS x1 and x~ are independent. then /-'_.(1) -= F 1 (t}F~(t) <6-78) This follows from (5-7H); however. we shall establish it directly: The event {z < t} occurs if S fails prior to timet. This is the case if both systems fail prior to timet. thut is. if both events {x 1 < t} and {x~ < t} occur. Thus {z < t} is the intersection of the events {x 1 < t} and {x~ < t}. And since these events are independent. (6-7tH follows. We shall say that n systems S; are connected in parallel forming a system S if S functions when at least one of the systems S; functions. Reasoning as in (6-77) and (6-78), we conclude that if the systems S; are indepen- F"IJUre 6.14 z =max (x 1, x 2) (b) '(a) 192 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY dent, then z = max (x,, ... , Xn) (6-79) Series Connection Two systems are connected in series forming a system S if S fails when at least one of the systems 5 1 and s~ fails (fig. 6.14b). Denoting by w the time to failure of the systemS. we conclude that w = 1 if the smaller of the numbers x 1 and x~ equals.t: hence. w = min (x,. x~) (6-80) The reliability R,..(t) = P{w > t} of Scan be determined from (5-79). If the systems s, and 5 2 are independent. then R,.(t) = R,(I)R2(1) (6-811 Indeed. the event {z > t} occurs if S fails after timet. This is the case if both systems fail after t, that is. if both events {x 1 > t} and {x~ > t} occur. Thus {z > t} = {x 1 > t} n {x2 > t}; and since the two events on the right are independent. (6-81) follows. Generalizing. we note that if n independent systems are connected in series forming a system with time to failure w. then w = min (x 1• • • • • x,) R,..(l) = R 1(t) · · · R,(t) (6-82) St11nd-by Connection We put system S 1 into operation, keeping system S2 in reserve. 
When S, fails, we put S2 into operation. When S2 fails, the system S so formed fails (fig. 6.14c). Thus if t 1 and t 2 are the times of operation of s, and S2, then t 1 + t 2 is the time of operation of S. Denoting by 5 its time to failure, we conclude that 5 = X1 + X2 (6-83) The density fs(s) of 5 is obtained from (5-90). If the systems S 1 and S 2 are independent, then [see (5-97)) fs(t) equals the convolution of the densities jj(t) and Ji(t) /s(t) = J:jj(z)Ji(t- z)dz ajj(t) •Ji(t) (6-84) Note, finally, that if Sis formed by the stand-by connection of nindependent systems sit then 5 = X1 + ' • ' + X11 /,(I) = Ji(t) * • ' • * fn(t) (6-85) Example 6.13 We connect n identical, independent systems in series. Find the mean time to failure of the system so formed if their conditional failure rate is constant. In this problem, ~(I) = c R;(l) = e-rt R(t) = R1(1) • · • R 11(1) = e-nrt [see (6-82)]. Hence, E{w} = 1o" e-nrt dt = -I nc • SEC. 6-4 SYSTEM RELIABILITY 193 TIME-INDEPE~DE~T SYSTE!\1S A time-independent system is either good or defective at all times. The probability p = I - q that the system is good is called system reliability. Problems involving interconnections of time-independent or time-dependent systems arc equivalent if time is only a parameter. This is the case for series-parallel but not for stand-by connections. Thus to find the time-independent form of (6-78) and (6-81), we set p = R(t) and q = F(t). System interconnections are represented by linear graphs involving links and nodes. A link represents a component and is closed if the component is good, open if it is defective. A system has an input node and an output node. It is good if it contains one or more connected paths linking the input to the output. To find the reliability p of a system, we must trace all paths from the input to the output and the probability that at least one is connected. In general. this is not a simple task. The problem is simplified if we consider only series-parallel connections. Here are the reliabilities p = I - q of the four systems of Fig. 6-15: Qa = q,q2 Pb = PtP2 q,. = (1 - PtP2)q~ Pd = (I - q,q2)P3 Structure •·unction The state of the system S is specified in terms of an RV y taking the values I and 0 with probabilities P{y = I} =p P{y = 0} = q where p = I - q is the reliability of the system. This RV will be called the state variable. Thus a state variable is the zero-one RV associated with the event {good}. Suppose that S consists of n components S; with state vari- Figure 6.15 q~ =n - PrPz)QJ ~----OQJo-----~ ---o Pr o--o b P& = PtP2 P2 o-- b 194 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY ables y;. Clearly. y is a function y = 1/J(y, •...• y,) of the variables y; called the structure function of S. Here are its values for parallel and series connections: Pal'tllle/ Connection 1/J(y., •.• , y,.) From (6-79) it follows thai (1 - y 1) • = max y; = I - • (1 - y,.) (6-86) Series Connection From (6-82) it follows that (6-87) I/J(y 1• • • • • y,) = min y; = y 1 y~ · · · y, To determine the structure function of a system. we identify all paths from the input to the output and use (6-86) and (6-87). We give next an illustration. Example 6.14 The structure on the left of Fig. 6.16 is called a bridge. It consists of four paths forming four subsystems as shown. From (6-87) it follows that the structure functions of these subsystems equal Y1Y2 YlY4 Y•Y4Ys Y~~Ys respectively. The bridge is good if at least one of the substructures is good. 
Hence [see (6-86)), 1/J(y., Y2• Yl• Y4) = max (YIY2• YlY4• Y•Y4Ys• Y2Y4.Js) = I - (I - Y1Y2KI - Y3Y4)(1 - Y2Y3Ys) The determination of the reliability of this bridge is discussed in Problem 6-17. Flpre 6.16 Y1~Y2 ---<,,~,.>- • PROBI.F.MS 195 Problems 6-1 6-l 6-3 6-4 6-S We are given 10 coins; one has two heads. and nine are fair. We pick one at nmdom and toss it three times. (a) Find the probability p 1 that at the first toss. heads will show. (b) We observe that the first three tosses showed heads; find the probability p 2 that at the next toss. heads will show. Suppose that the age (in years) of the males in a community is an RV x with den!lity .03,.-·03'. (a) Find the percentqe of males between 20 and SO. (b) Find the average age of males between 20 and 50. Given two independent N(O, 2) RVS x andy. we form the Rvs z = 2x + y, w = x - y. Find the conditional density f(z'w) and the conditional mean E{zlw = 5}. If F(x) = (I - e-·2•)U(x). find the conditional probabilities PI = P{x < 513 < X < 6} P2 = P{x > 513 < X < 6} The avs x andy have a uniform joint density in the shaded region of Fig. P6.5 between the parabola y = x(2 - x) and the x-axis. Find the regression line f/>(x) = E{ylx}. Figure P6.S xl2- x) X 6-6 6-7 6-8 6-9 Suppose that x andy are two normal avs such that 11.• = 1Jy = 0, "·• = 2, u 1 = 4, rx1 = .5. (a) Find the regression line E{ylx} = t/J{x). (b) Show that the Rvs x and y - tb(x) are independent. Using the regression line tb(x) = E{y 1x}. we form the RV z = tb(x). Show that if ax is the homogeneous linear MS estimate of the av y in terms of x and Ax is the corresponding estimate of z in terms of x. then A = a. The length of rods coming out of a production line is modeled by an av e uniform in the interval (10, 12). We measure each rod and obtain the av x = e + 11where 11is the error. which we assume uniform in the interval ( -0.2, 0.2) and independent of e. (a) Find the conditional density £<xlc), the joint density £~<x. c), and the marginal density £(x). (b) Find the LMS estimate~= E{elx} of the length c of the received rods in terms of its measured value x. We toss a coin 18 times, and heads shows II times. The probability p of heads 196 CHAP. 6 CONDITIONAL DISTRIBUTIONS, REGRESSION, RELIABILITY is an RV p with density /(p). Show that the LMS estimate p of p equals p = 'Y J~ p'2(• - P>'f<p>dp Find 'Y if /(p) = 1. 6-10 The time to failure of electric bulbs is an RV :11: with density ce-""U(x). A box contains 200 bulbs; of these, 50 are of type A with c = 4 per year and ISO are of type 8 with c = 6 per year. A bulb is selected at random. (a) Using (6-7), find the probability that it will function at time t. (b) The selected bulb has lasted 6-11 6-12 6-13 6-14 three months; find the probability that it is of type A. We are given three systems with reliabilities Pt = I - q" P2 = I - q2, Pl = I q3 , and we combine them in various ways forming a new system S with reliability p = 1 - q. Find the form of S such that: (a) p = PtP2Pl; (b) q = q,q~l; (c) p = p,(l - q~3); (d) q = q,(l - P2Pl). The hazard rate of a system is ~(t) = ct/(1 + ct); find its reliability R(t). Find and sketch the reliability R(t) of a system if its hazard rate equals ~(t) = 6U(t) + 2U(t - t0 ). Find the mean time to failure of the system. (a) Show that if w = min (x, y), then P{ I < }Fy(l)- F!l}(t, I) X <?: I W - I - Fx(t) + F1 (1) - Fxy(l, I) (b) Two independent systems Sx and S1 are connected in series. The resulting 6-15 6-16 6-17 6-18 systemS.., failed prior tot. 
Find the probability p that S" is working at timet. SystemS., fails at timet. Find the probability p 1 that S" is still working. We connect n independent systems with hazard rates ~1 (1) in series forming a system S with hazard rate ~(t). Show that ~,(t) = ~,(1) + · · · + ~,(t). Given four independent systems with the same reliability R(t), form a new system with reliability I - [1 - R2(t)J2, Find the reliability of the bridge of Fig. 6-16. (a) Find the structure function of the systems of Fig. 6-14. (b) Given four components with state variables Y1t form a system with structure function _{I lfi(y,, 12 ' 13 ' Y•>- 0 ify, + Y:- Yl + Y• e: 2 otherwise 7_ _ _ _ Sequences of Random Variables We extend all concepts introduced in the earlier chapters to an arbitrary number of RVs and develop several applications including sample spaces. measurement errors. and order statistics. In Section 7-3 we introduce the notion of a random sequence, the various interpretations of convergence, and the central limit theorem. In the last section. we develop the chi-square. Student t, and Snedecor distributions. In the appendix. we establish the relationship between the chi-square distribution and various quadratic forms involving normal RVs, and we discuss the noncentral characterofthe results. 7-1 General Concepts All concepts developed earlier in the context of one and two Rvs can be readily extended to an arbitrary number of avs. We give here a brief summary. Unless otherwise stated. we assume that all avs are of continuous type. Consider n RVs x1• • • •• x, defined as in Section 4-1. The joint distribution of these Rvs is the function (7-1) F(x 1•• • • , x,,) = P{x 1 s: x 1•• • • • x, <; x,} 197 198 CHAP. 7 SEQUENCES OF RANDOM VARIABLES specified for every X; from - x to x. This function is increasing as X; increases. and F(x, . . . • x) = I. Its derivative ) _ ii"F(x 1• • • • • :c,) /( x, ....• x, (7-2) -"· . . . <1X !I u.t 1 11 is the joint density of the avs X;. All probabilistic statements involving the RVs x; can be expressed in terms of their joint distribution. Thus the probability that the point (x 1• . . , x,.) is in a region D of the n-dimensional space equals P{(x" ...• x,.) ED} = L, · · · Jf(x 1, • • • • x,)dx 1 • • • dx, (7-3) If we substitute certain variables in F<x 1, • • • • x,.) by x. we obtain the joint distribution of the remaining variables. If we integrate f(x 1• • • • , x,) with respect to certain variables. we obtain the joint density of the remaining variables. For example. F(x 1• :c3) = F(x" x, x 3• x) (7-4) /(XJ, XJ) = f.f.J(x,, X2, .l), X4)dx2dx4 The composite functions y, = g,(x" . . . , x,.) . . . y,. = g,.(x,, . . . , x,.) specify the" RVs y, •... , y,. To determine the joint density of these RVs, we proceed as in Section 5-3. We solve the system g,(x., . . . , x,.) = Yt . . . g,(x., . . . , x,.) = Yn (7-5) for X; in terms of)';. If this system has no solution for certain values of y,. then f,.(y 1• • • • , y,.) = 0 for these values. If it has a single solution (x" . , x,.), then .• y,.>=IJ< x, .. . 'x,l (7-6) where J(x" . . . , Xn) = (7-7) is the Jacobian of the transformation (7-5). If it has several solutions. we add the corresponding terms as in (4-78). We can use the preceding result to find the joint density of r < 11 functions y 1• • • • , y, of the n RVs X;. To do so. we introduce n - r auxiliary variables y,. 1 = x,. 1, • • • • y, = x,. find the joint density of the n RVs y, . . . . • y,. using (7-6), and find the marginal density ofy, •... 
, y, by integrcltion as in (7-4). The RVs x; are called (mutually) independent if the events {x s x;} are independent for every x;. From this it follows that SEC. F(xl, •••• Xn) ftxlt ... , Xn) 7-1 = FCxd · GF.NERAI. CONCEPTS · · F<x,) 199 (7-8) = .ftxd · · · .f<xn> Any subset of a set of independent avs is itself a set of independent avs. Suppose that the avs x,. x2 • and x~ are independent. In this case, f(xJ, Xz, X3) = /(xJ)/(x2)/(xJ) Integrating with respect to x3 , we obtainj(xJ, x2 ) = j(x1)/(x2). This shows that the RVS x 1and x 2 are independent. Note that if the RVs x; are independent in pairs, they are not necessarily independent. Suppose that the av y1 is a function g1(x1) depending on the av x1 only. Reasoning as in (5-26), we can show that if the avs x1 are independent, the avs y1 = g 1(x1) are also independent. Note. finally, that the mean of the RV z = .1f(X 1, . . .. X11 ) equals E{g(xlt . . . 'Xn)} = r.... r, g(XJ, • ••• Xn)f(xl •• •• 'Xn)dxl . •• dxn as in (5-30). From this it follows that E{~akgk(XI, . . . , Xn)} = Ia.E{g,(XJ, . . . , Xn)} (7-9) The covariance IJ.ii of the avs x1 and xi equals [sec (5-34)] IJ.ii = E{(x;- Tl;)(xi- TJi)} = £{xixi} - Tli''li where 11; = E{x1}. The n-by-n matrix #J.IIIJ.I2 (7-10) #J.21#J.22 ( #J.ni#Ln2 is called the covariance matrix of the n avs x;. We shall say that the avs x; are uncorrelated if IJ.v = 0 for every i =I= j. In this case. if z = a1x1 -'- · · · + Cl 11X11 then CT~ = a1CTT - • · • •a~~ (7-11) where CT~ = IJ.;; is the variance of X;. Note, finally, that if the avs x1 are independent, then r..... r. = r. gl(Xl) . . . Cn(Xn)f(xl) • .• f(xn)dxl • CJ(XJ)/(XJ)dxl . . . r. . dxn Cn(Xn>f<xn)dxn Hence. E{gJ(XJ) · · · Kn(Xn)} = £{gJ(XJ)} · · · E{g,(x,)) (7-12) As in (5-56). the expression <l>(sJ •... , s,.) = E{exp(SJXJ - · · · + S11X11 )} (7-13) will be called the joint moment function of the avs x1 • From (7 -12) it follows that if the avs x; are independent, then <l>(slt ...• Sn) = E{es•••} · · · E{e'•"•} where <I>( s 1) is the moment function of x1 • = <l>(sJ) · · · <l>(s11 ) (7-14) 200 CHAP. Normal 7 SEQUENCES OF RANDOM VARIABLES RVs We shall say that then RVs x; are jointly normal if the RV z = a1X + · · ·+ llnXn is normal for any a;. Introducing a shift if necessary. we shall assume that E{x;} = 0. Reasoning as in (5-98) and (5-99). we can show that· the joint density and the joint moment function of the Rvs X; are exponentials the exponents of which are quadratics in x; and s;. respectively. Specifically. <l>(s., . . . , s,.) = exp {-21 i, f.LuS;si} (7-15) 1)•1 " 'YuX;X; } (7-16) exp { - -2I L ij-1 where f.Lii are the elements of the covariance matrix C of X;. 'Yii are the elements the inverse c-• of C. and 4 is the determinant of C. We shall verify (7-16) for n = 2. The proof of the general case will not be given. If n = 2, then f.Lu = ui. 1L12 = ru1u2. 1L22 = u~. J<x1. • • •• x,.) I = v'(217') 11 4 c- ( 0'12 f0'10'2 c-l = ! (0'~ -rO'IO'~) = ('YII 4 -ro-10'2 In this case, the sum in (7-16) equals 2 'Y11X1 + 2-y12X1X2 + 2 'Y22X2 o-1 22 = ~1 (0'2X1 - 'Y~I 'YI2) 'Yn 2r0'10'2X1X2 ,, + O'jXi) in agreement with (5-99). If the matrix Cis diagonal, that is, if f.Lii = 0 fori =I= j, then 'Yii = 0 fori =I= j and 'Yu = 1/u~; hence, 1 1 f<x .. ••• , x,.) = V(21T )" exp {- -2 ( 0'1 + · · · + (7-17) 0'1 • • • u,. Thus if the RVs x; are normal and uncorrelated, they are independent. Suppose, finally, that the Rvs y; are linearly dependent on the normal Rvs x;. 
Using (7-6), we conclude that the joint density of the RVS y; is an exponential with a quadratic in the exponent as in (7-16); hence, the RVS y; are jointly normal. xt §)} 0',. Conditional Distributions In Section 6-2 we showed that the conditional density of y assuming x defined as a limit, equals f< Y I ) = /(X. }') x f<x> = x, SEC. 7-1 GENERAL CONCEPTS 201 Proceeding similarly. we can show that the conditional density of the RVs x, •... , "'''assuming x, = x 4 , • • • , x 1 = x 1 equals f (X,, • ••• X4•1 IXA, • •• , _ f(x,. . . • x,) Xt ) - /:;.,.--(....;..._----''Xt, . . • xd (7-18) For example, /(Xt, X2, X~) j( Xt, X2 ) Repeated application of (7-18) leads to the chain rule: f(x,, . .. , x,.) = /(x,.lx.. -t •... , Xt) · · · f<x2lx,)j(Xt) (7-19) We give next a rule for removing variables on the left or on the right of the conditional line. Consider the identity I j(x3, x2lx,) = j(x,)f(x,, x2. x~) ) _ 1 f( X3,X2, Xt - Integrating with respect to x 2 , we obtain I I~ _,.f<x3, x2lx,)dx2 = j(x,) . ,.f(x., x~. x1>dx2 I" f(x,, X3) = f(x,) = j(xllx,) Thus to remove one or more variables on the left of the conditional line, we integrate with respect to these variables. This is the relationship between marginal and joint densities [see (7-4)] extended to conditional densities. We shall next remove the right variables x2 and x3 from the conditional density f(x.lx3 , x2 • x,). Clearly, /(.t.lxh x2. x,)j(x~. x2lx,) = f(x., x~. x2 lx,) We integrate both sides with respect to x 2 and x~. The integration of the right side removes the left variables x2 and xh leavingf(x.lx,). Hence, r. r~f<x.lx.,, x~. x,)j(x~. x~lx.>dx~dx, ""f<x.lxt> Thus to remove one or more variables on the right of the conditional line, we multiply by the conditional density of these variables, assuming the remaining variables, and integrate. The following case. known as the ChClpmtlnKo/mogoro.D'equt~tion. is of special interest: r. f(x,lx~. x.>f<x21x,)dx2 = f<x~ix,) The conditional mean of the RV y assuming x; .::.: x; equals E{ylx,. •... , x,} = J: .vf<ylx,, •. ..• x,)dy <7-20) This integral is a function c/J(x 1• • • • • x,) of x;. Using this function, we form 202 CHAP. 7 SEQUENCES OF RANDOM VARIABLES the RV </>(x, •...• x,) The mean of this RV equals = E{ylx, •...• x,} f"" · · ·J"_,. <b<x.. ...• x,)j(x, • ...• x,) dx 1 • • • dx, Inserting (7-20) into this equation and using the identity f(x, •. ..• x,. y) = f<.vlx, •. ..• x,)j(x, •...• x,) we conclude that E{E{ylx,., ... , x,}} = r. ···r. yf(x., . . . , x,., y)dx, · · · dx,.dy = E{y} (7-21) The function <f>(x1• • • • • x,) is the gener.llization of the regression curve <f>(x) introduced in (6-39). and it is the nonlinear LMS predictor of y in terms of the RVs x1 (see Section 11-4). Sampling We are given an RV x with distribution F(x) and density /(x), defined on an Using this av, we shall form n independent and identically distributed (i.i.d.) RVS experiment~. (7-22) X1t • • • , X;, • • • , X 11 with distribution equal to the distribution F(x) of the RV x: F 1(x) = · · · = F,.(x) = F(x) (7-23) These avs are defined not on the original experiment ~ but on the product space ~11 =~X • • ·X fJ consisting of n independent repetitions of~. As we explained in Section 3-1, the outcomes of this experiment are sequences of the form ~ = ~ •... ~~ ... ~.. (7-24) where ~~ is any one of the elements of ~. 
The ith RV x_i in (7-22) is so constructed that its value x_i(ζ) equals the value x(ζ_i) of the given RV x:
    x_i(ζ_1 ··· ζ_i ··· ζ_n) = x(ζ_i)    (7-25)
Thus the values of x_i depend only on the outcome ζ_i of the ith trial, and its distribution equals
    F_i(x) = P{x_i ≤ x} = P{x ≤ x} = F(x)    (7-26)
This yields
    E{x_i} = E{x} = η    (7-27)
Note further that the events {x_i ≤ x_i} are independent because 𝒮_n is a product space of n independent trials; hence, the n RVs so constructed are i.i.d. with joint distribution
    F(x_1, . . . , x_n) = F(x_1) ··· F(x_n)    (7-28)
Thus, starting from an RV x defined on an experiment 𝒮, we formed the product space 𝒮_n and the n RVs x_i. We shall call the set 𝒮_n the sample space and the n RVs x_i a random sample of size n. This construction is called a sampling of a population. In the context of the theory, "population" is a model concept. This concept might be an abstraction of an existing population or of the repetition of a real experiment. As we show in Part Two, the concept of sampling is fundamental in statistics. We give next an illustration in the context of a simple problem.

Sample Mean  The arithmetic mean
    x̄ = (x_1 + ··· + x_n)/n    (7-29)
of the samples x_i is called their sample mean. The RVs x_i are independent with the same mean and variance; hence, they are uncorrelated, and [see (7-11)]
    E{x̄} = (η + ··· + η)/n = η     σ²_x̄ = (σ² + ··· + σ²)/n² = σ²/n    (7-30)
Thus x̄ has the same mean as the original RV x, but its variance is σ²/n. We shall use this observation to estimate η.

The RV x is defined on the experiment 𝒮. To estimate η, we perform the experiment once and observe the number x = x(ζ). What can we say about η from this observation? As we know from Tchebycheff's inequality (4-115),
    P{η − 10σ ≤ x ≤ η + 10σ} ≥ .99    (7-31)
This shows that the event {η − 10σ ≤ x ≤ η + 10σ} will almost certainly occur at a single trial, and it leads to the conclusion that the observed value x of the RV x is between η − 10σ and η + 10σ or, equivalently, that
    x − 10σ < η < x + 10σ    (7-32)
This conclusion is useful only if 10σ ≪ η. If this is not the case, a single observation of x cannot give us an adequate estimate of η. To improve the estimation, we repeat the experiment n times and form the arithmetic average x̄ of the samples x_i of x. In the context of a model, x̄ is the observed value of the sample mean x̄ of x obtained by a single performance of the experiment 𝒮_n. As we know, the mean of x̄ is η, and its variance equals σ²/n; hence,
    P{η − 10σ/√n ≤ x̄ ≤ η + 10σ/√n} ≥ .99
Replacing x by x̄ and σ by σ/√n in (7-31), we conclude with probability .99 that the observed sample mean x̄ is between η − 10σ/√n and η + 10σ/√n or, equivalently, that the unknown η is between x̄ − 10σ/√n and x̄ + 10σ/√n. Therefore, if n is sufficiently large, we can claim with near certainty that η ≈ x̄. This topic is developed in detail in Chapter 9.

7-2 Applications

We start with a problem taken from the theory of measurements. The distance c between two points is measured at various times with instruments of different accuracies, and the results of the measurements are n numbers x_i. What is our best estimate of c? One might argue that we should accept the reading obtained with the most accurate instrument, ignoring all other readings. Another choice might be the weighted average of all readings, with the more accurate instruments assigned larger weights.
The choice depends on a variety of factors involving the nature of the errors and the optimality criteria. We shall solve this problem according to the following model. The ith measurement is a sum X; :.:: C + II; (7-33) where 111 is the measurement error. The Rvs 111 are independent with mean zero and variance a}. The assumption that £{11;} = 0 indicates that the instruments do not introduce systematic errors. We thus haven RVs x; with mean c and variance crl. and our problem is to find the best estimate of c in terms of the " numbers u;. which we assume known. and the n observed values x 1 of the RVs x;. If the instruments have the same accurclcy. that is. if CT; = u = constant, then. following the reasoning of Section 7-1. we use as the estimate of c the arithmetic mean of x1 • This. however, is not best if the accuracies differ. Guided by (7-30), we shall use as our estimate the value of an RV t with mean the unknown c and variance as small as possible. In the terminololgy of Chapter 9. this RV will be called the unhiused minimum variance e.ftimator of c. To simplify the problem. we shall make the additional assumption that c is the weighted average c = 'YJllJ + ... + 'Y..X.. (7-34) of the n measurements x1• Thus our problem is to find the n constants 'YI such that E{c} = c and Yare is minimum. Since E{x1} =c. the first condition yields 'YI + · • · + 'Yn = I (7-35) From (7-11) and the independence of the avs x1 it follows that the variance of c equals v = 'Yiui + · · · + 'Y;u; <7-36) Hence, our objective is to minimize the sum in (7-36) subject to the constraint of (7-35). From those two equations it follows that V = 'YTUT + · · · + 'Yrur + · · · + (I - 'Y1 - • • • - 'Y,._,)u~ This is minimum if av iJ'Y; = 2'Y;Ui~ - 2(1 - 'YI - • • • - 'Y,.-J}u;;~ =o . = I , ... , n - I And since the expression in parentheses equals 'Y11 , the equation yields 'Y;uT = 'Y,.u~ I SEC. 7-2 APPLICATIONS 205 Combining with (7-36). we obtain v ')'; = ~ v = 1/o-j + . . + 1/o-;, (7-37) We thus reach the reasonable conclusion that the best estimate of cis the weighted aver.lge • x,lo-i + · c= 1/o-j + . · + x,lrT;, . + 1/o-;, (7-38) where the weights')'; are inversely proportional to the variances of the instrument errors. Example 7.1 The length c· of an object is measured with three instruments. The resulting measurements are ·'"• = 84 X2 = 85 .l',l = K7 and the standard deviations of the errors equal I. 1.2. and 1.5. respectively. Inserting into (7-381. we conclude that the best estimate of c equals ~ = X1 + X2fl,44 + X,l/2.25 = UA ~ • c I ... 111.44 + 112.25 ...... 9. We assumed that the constants o-; (instrument errors) are known. However, as we see from (7-38). what we need is only their ratios. As the next example suggests. this case is not uncommon. Example 7.2 A pendulum is set into motion at time 1 = 0 starting from a vertical position as in Fig. 7.1. Its angular motion is a periodic function 8ttl with periud T = 2c. We wish to measure c. To do so. we measure the first 10 zero crossings of 0(11 using a measuring instrument with variance o- 2• The ith measurement is the sum t; = ic + 8; £{8;} = 0 t:{8i} = cr 2 Figure 7.1 206 CHAP. 7 SEQUENCES OF RANDOM VARIABLES where li; is the measurement error. Thus t; is an RV with mean k and variance an unknown constant u~. The results of the measurement are as follows: t; = 10.5 20.1 29.6 39.8 50.2 61 69.5 79.1 89.5 99.8 To reduce this problem to the measurement problem considered earlier. we introduce the Rvs 6. t; "; = i x;=7=c+"; Clearly.£{,;}= 0. 
£{,f} = u~li~: hence. ' £{x;} = c o:.~, = u;~ Finally. t X;=~= I 10.5 10.05 9.87 9.95 10.04 10.17 9.93 9.89 9.94 9.98 Inserting into (7-38) and canceling u~. we obtain • Xt + 4x~ + · · · + IOOx, 0 c= I + 4 + ... + 100 :: 9.91 Thus the estimate c~ of c does not depend on u. • Random Sums Given an RV n of discrete type taking the values I. 2. . RVs x 1• x2 • • • • , we form the sum . and a sequence of • 5 = }: x4 <7-39> 4-1 c. This sum is an RV defined as follows: For a specific the RV n takes the value n = n(C). and the corresponding values == 5(C) of s is the sum of the numbers x4 (C) from I to"· Thus the outcomes Cof the underlying experiment determine not only the values of x4 but also the number of terms of the sum. We maintain that if the Rvs x4 have the same mean E{x4 } = Tlx and they are independent of n, then (7-40) E{s} = TJxE{n} To prove this, we shall use the identity [see (7-21)] E{s} = E{E{5In}} If n = n, then 5 is a sum of n RVs, and (7-9) yields E{5ln} =E {± x.ln} = ±E{x.ln} = ±E{x.} ·-· ·-· ·-· The last equality followed from the assumption that the Rvs x4 are independent of n. Thus E{5ln} = TJJin E{E{5In}} = E{.,Jin} = TJJCE{n} We show next that if the RVS x4 are uncorrelatedwith variance u~. then E{s2} Clearly, 5 2 = .,~E{n2} + u~E{n} • • =~ ~ k•l m=l II Jlkllm E{s2ln} =~ (7-41) II ~ k•l m=l E{Xkllm} SEC. 7-2 APPI.ICATIONS 207 The last summation contains 11 terms with/.. ...,. m and ,~ 11 terms with /.:. :/: m. And since k - Ill . {E{xz} ..:. cr~ + 11~ f..{xAx,} = /•'{ lf.'{ 1 = 11•~ /.:.·I m .xu.x.,r we conclude that E{s2in} · · hri + 71~)n + 71.~(n~ - n) 71~n~ .,. uin Taking expected values of both sides, we obtain (7-41). Note, finally. that under the stated assumptions, the variance of s equals 1.'{ s-'} - c:.1.''{s} = ..,-u· ' n, ... cr·.., ' cr··'' = c:. (7-42) •f.\ .\ •1n Example 7.3 The thermal energy w =· parameters my~/2 of 11 particle is an 3 I' - :; KV having a gamma c"·nsity with I (' . k ., where Tis the absolute temperature of the p11rticle and k t ..n x 10 ~'joule degrees is the Holtzmunn constant. 1-'rum this and 15-(>51 it fulluws that . h /~{w} - ;: - 3H T , h ?tk~1: cr~ - ;.~ - - ... 2 The number n, of particles emitted from a rudio11ctivc substance in t seconds is 11 Poisson RV with parameter a - Jl.t. Find the me11n and the v11riance of the emitted energy ' As we know, E{n,} = tl = 11.1 Inserting into (7-40) and (7-42). we obtain - 3k nt E{l} . 2 . cr·• - CT~ ••• = Cl = 11.1 (3/i.T)~ 3k 2 T~ . . 15k~T2JI. T -2 11. T - - - 11.1 = 2 4 • Order Stati.\'tic We are given a sample of 11 Rvs X; defined on the sample space rt, of repeated trials as in (7-25). For a specific~ E y·,. the Rvs x; take the values X; = x;(e>. Arnmging then numbers X; in increasing order. we obtain the sequence x,, -s x,~ s · · · s x,. (7-43) We next form then RVs y; such that their values Y; == y;(~) equal the ordered numbers x,, y, = x,, !"= Y:! = x,: s · · · c;; y, -=- x,_ <7-44) Note that for a specific i, the values x;(f) of the ith RV X; may occupy different positions in the ordering as eranges over the space ~l,. For example, the minimum y1<e> might equal x~(~) for some~ hut xK(~) for some other ~. The Rvs y; so constructed are called the order sltlti.Hics of the sample X;. 208 CHAP. 7 SEQUENCES OF RANDOM VARIABLES As we show in Section 9-4, they are used in nonparametric estimation of percentiles. The /cth RV y" in (7-44) is called the kth order statistic. We shall determine its density fk(y) using the identity /,(y)dy = P{y < y,. 
s y + dy} (7-45) The event sf = {y < Y• s y + dy} occurs if k- 1 of the RVS X; are less than y, n - k are larger than y + dy, and, consequently, one is in the internal (y, y + dy). The Rvs X; are the samples of an RV x defined on an experiment ~.The sets (Fig. 7.2) ~2 = {y < X S y + dy} ~3 = {y + dy < x} ~. = {x s y} are events in ~, and their probabilities equal P(~,) = Pt = Fx(Y) P(~2) = P2 = ,h(y)dy P(t.e3) = P1 = I - Fx(Y + 4y) where Fx(x) is the distribution and ,h( x) is the density of x. The events ~. , ~ 2 • ~ 3 form a partition of~; therefore, if~ is repeated n times, the probability that the events, ~;will occur k; times equals [see (3-41)] n! k 'k 'k 1 pt•p~zp~' k, + k2 + k3 I • 2• 3• =n (7-46) Clearly, sf is an event in the sample space fin, and it occurs ifliJ, occurs k1 times, ~ 2 occurs once, and \1 3 occurs n - k times. With k1 = k - I k2 = I k3 = n - k (7-46) yields P{y < y" s y + dy} = (k _ l)!~!!(n _ k)! J1- 1(y)/x(Y)dy [I - Fx(Y + dy)Jn-lt. Comparing with (7-45), we conclude that fk(y) = (k- 1)7(!n - k)! p~-'(y)[l - F.(y)]n ·• ,h(y) (7-47) The joint distributions of the order statistics can be determined similarly. We shall carry out the analysis for the maximum Yn = Xmax and the minimum Yt = Xmin of X;. Fipre 7.2 z -----!1,----~. 1-+------983·------ ~2 SEC. 7-2 APPI.ICATIONS 209 EXTREME ORDER STATISTICS We shall determine the joint density of the RVS = y, z using the identity [.,..(z, w)dzdw The event ~ = {z = P{z < <zs z -t- dz, = Y1 w z s z + dz. l1' < 1\' < w s w + dw} z> w w s w ... dw} (7-48) (7-49) occurs iff the smallest of the avs X; is in the interval ( lt'. l1' .,.. dw ), the largest is in the interval (z, z + dz). and, consequently. all others are between w + dw and z. To find PC'-€), we introduce the sets t£ 1 = {x s w} C'i~ = {w < x s w + dw} rh = {w + dw < x s z} ~4 = {z < x s z ~ dz} ri ~ = {x > ~ + dz} These sets form a partition of the space ~~. and their probabilities equal p 1 = F,(w) p 2 = j.(w)dw p~ = F,(z) - F,(w + dw) P4 = f,(z) ch p~ = I - 1-',(z -t· dz) (7-SO) Clearly. the set~ in (7-49) is an event in the sample spaceY,, and it occurs iff!":t 1 does not occur at all, 0:~ occurs once, ~f., occurs n - 2 times, Qt 4 occurs does not occur at all. With once, and as k, = 0 k2 = I k~ = II it follows from (3-41) and (7-50) that 2 - P{z < z s .. + d.. w < w s w ~ = n(n - dl1'} ~· l).{.(w)dw[F,(z)- /.;4 :.: = "~ I 11 (ll - ! F,<w - dwW =0 p.p'! 2p 2)! - ·' 4 ~f,(z)dz z > wand zero otherwise. Comparing with (7-48). we conclude that for I" ("' ·) ... lt J~ ..- = { 11(n - 0 1)/,(z)J.(w)[F,Cz)- f,(wl)"· ~ z >w z< w (7-51 ) Integrating with respect to wand z respectively. we obtain the marginal densities f..(z) and f,..(w). These densities can be found also from (7-47): Setting k = n and k = I in (7-47). we obtain [.(z) f,..(w) = nf,(z)F~ = n.f,(w)ll 1 (z) - F,(wl]"' 1 (7-52) Range and Midrange Using (7-51). we shall determine the densities of the RVS z+w 2 The first is the range of the sample, the second the midpoint between the maximum and the minimum. The Jacobian of this tnmsformation equals I. Solving for z and w, we obtain ~ = ,., + r/2. w ·"" s - r/2. and (5-88) yields r-=z-w J,,,(r. s l = J..... (.'i s=-- + ,. ..\· - ~,.) 2 (7-53) 210 CHAP. 7 Example 7.4 SEQUENCES OF RANDOM VARIABLES Suppose that xis uniform in the interval 10. c·). In this case. F,l:c I= xlc forO< x < ": hence. F <lzl - F <I w I = c·~ - .!!: c I . 1 (W)-=- f <lzl = !c· I (' in the shaded area 0 s w s .: :s c of Fig. 7.3 and zero elsewhere. Inserting into 17-51 ). 
we obtain w) .(,".(<.. . I)(;::- ••·)" = n(n--,---· cc· ~ () < w s;:: < (' 17-54) and 7.ero elsewhere. This yield!; i: , n(n - I) 11 . = ---· ~ " (;:- w)" ·-dw =--~ z" [...ht•) = n(n - _.!1 J.' Cz w)"-~ dz .:... !!.. Cc' w /:(<.) (~ (~ 1 OS<:SC' - w)" 1 0 s w s (' For n = 3 the curves of Fig. 7.3 result. Note that E{z} = _n_ n + I c· E{w} (' =-" + I From (7-53) and (7-54) it follows that •.f) J,,.(r . = nln - I) , , r (~ I c 0 c c X fz(Z) f ... <w> 3 c 3 c z c X /,(r) 3 2c· 0 c z 0 c "' 0 c r SEC. for 0 s .f - 7-2 APPLICATIONS 211 r/2 s: s + r/2 s c and zero elsewhere. Hence. f,(r) = n(n- I) , c" /.(s) = J,.,,2 ,, . 2d.f = n(ll- I) r" f2• r" 2 dr =!!.. (2s)" Jo c" { 2,.. ·2.• 1o c" rl2 , r"-- dr n = -c" (2c 2(C- r) 1 - 2.f )" 1 Note that E{r} . I = II·-,. E{s}.::.. ,." + I IT' 2 2(11 - lk~ I I~( II I :!J • tr; , • (II I . = . . . ..... ... ( 2(11 + 1)(11 + (7-55) • 2)' Sums of Independent Random Variables We showed in Section 5-3 that the density of the sum z .,.. " + y of two independent RVs " and y is the convolution nz) = r..~~(Z - (7-56) y)j;.(y)dy of their respective densities [see (5-96)]. Convolution is a binary operation between two functions. and it is written symbolically in the form i<z> = fx<z> * J;.(z) From the definition (7-56) it follows that the operation of convolution is commutative and associative. This can also be deduced from the identity [see (5-92)] (7-57) fl>:(s) = 4>.,(s)4>.• (s) relating the corresponding moment functions. Repeated application of (7-57) leads to the conclusion that the density of the sum z = x, + ... + x, (7-58) of n independent avs X; equals the convolution i<z> = Ji<z> * · · · * /,(;.) (7-59) of their respective densities Ji<x ). From the independence of the avs follows as in (7-12) that E{en} = E{es<a,-···-a..l} = E{e•••} · · · E{e,.•} Hence, fl>:(s) = 4> 1(s) • • • tl>,(s) where tl>;(s) is the moment function ofx;. en; it (7-60) 212 CHAP. 7 Example 7.5 SEQUENCES OF RANDOM VARIABLES Using (7-60), we shall show that the convolution of n exponential densities Jj(x) = · · · = f,(x) = ce·"U(x) equals the Erlang density zn·l f..(;.) = en (n _ )! e ·•:U(z) (7-61) 1 Indeed. the moment function of an exponential equals cl»·(s) ' = <' i . e ''e"dx = -("- II (' - S Hence, «<»:(s) = cl»i(s) = (<' ('n (7-62) _ s)n and (7-61) follows from (5-63) with h = n. The function.f:(z) in (7-56) involves actually infinitely many integnlls. one for each z. and if the integnmd consists of several analytic pieces. the computations might be involved. In such cases. a grctphical interpretation of the integral is a useful technique for determining the integration limits: To find,h(z) for a specific z. we form the function/,(- y) and shift it z units to the right; this yields j,(z - y). We then form the product/,(<: - y)j..(y) and integrctte. The following example illustrcttes this. Example 7.6 The RVS x1 are uniformly distributed in the interval (0, 1). We shall evaluate the density of the sum z = x1 + x2 +X). As we showed in Example 5.17. the density of the sum y = x1 + x2 is a triangle as in Fig. 7.4. Since z = y + '"· the density of z is the convolution of the trianglef,.(y) with the pulse/3(x). Clearly,f:(l) = 0 outside the interval (0, 3) because.f.(y) = 0 outside the interval (0. 2) and/3(.t) ""' 0 outside the interval (0. 1). To find f..(z) for 0 < z < 3. we must consider three cases: If 0 s z s I. then f:<zl If I s r· z2 = J~ .v dy = 2 z s 2. then f:<ll = Finally, if 2 s J:_, ydy + J;' (2- y)d)• = zs f:<z> -: 2 + 3z ... ~ 3. 
then f 2 = ; 1 {2 - .v) dy = z2 2- 3; + 9 2 In all cases. f:(d equals the area of the shaded region shown in Fig. 7.4. Thus f:{z) consists of three parabolic pieces. • Binomial Distribution Revisited We shall reestablish the binomial distribution in terms of the sum of n samples Xi of an av x. Consider an event s4 with probability P(s4) = p. defined in an experiment~. The zero-one av associ- SEC. 7-2 APPLICATIONS 213 fi(x) 3 4 0 X h<z- J') z 2 )' z- I 0 2 z )' Fipre 7.4 ated with this event is by definition an RV x such that Thus x takes the values 0 and I with probabilities p and q. respectively. Repeating the experiment ~ , times. we form the sample space ~f, and the samples x;. We maintain that the sum z = x, + ... + x, has a binomial distribution. Indeed. the moment function of the <I>;(S) = RVS x; equals <l>,(s) = E{t''"} = P('' t q Hence. <l>:(s) = <1> 1(s) • • • <l>,(s) = (pe' t q)" From this and (5-66) it follows that z has a binomial distribution. We note = pq; hence. E{z} = np. u~ = npq. again that E{x;} = p. ur 214 CHAP. 7 SEQUENCES OF RANDOM VARIABLES 7-3 Central Limit Theorem Given " independent RVs x,. we form the sum z = "' + ... + x, (7-63) The central limit theorem (CI.T> states that under certain general conditions. the distribution of z approaches a normal distribution: (7-64) as" increases. Furthermore. if the RVs x; arc of continuous type. the density of z approaches a normal density: j:(~) == I t' •:-., ,:.~.,: (7-65) . u .. V21T Under appropriate normalization. the theorem can be stated as a limit: If Zo = (z - T'J:)Icr... then F:..(z) . - l:..(z) {;(z) I .. , ~ e : '- - (7-66) 21T for the general and for the continuous case. re!.pectively. No general statement can be made about the required size of 11 for a satisfactory approximation of /·:~(z) by a normal di!.tribution. For a specific"· the nature of the approximation depends on the form of the densities/;(x). If the RVs x; are i.i.d .. the value " = 30 is adequate for most applications. In fact, if/;(x) is sufficiently smooth, values of n as low as 5 can be used. The following example illustrates this. II •• Example 7.7 The Rvs X; II ' ' \. are uniformly distributed in the interval (0. I). and ~ E{x;} = u7 = 1 12 We shall compare the density of z with the normal approximation {7-65) for = 2 and " = 3. n=l n ll .... .,. =-=I 2 rr~ = ~ = ~ f.(;,) == ~! r• 11:-u: n=3 n 3 71:=2=2 , 11 I rr:=,2=4 As we showed in Example 7.6. f.(z) /2 ,, .~. : f:(;;)-=v;('-: 1 1 is a triangle for n = 2 and consists of three SEC. CENTRAl. LIMIT THEOREM 215 z z 2 0 7-3 Figure 7.5 parctbolic pieces for 11 = 3. In Fig. 7.5. we compare these densities with the corresponding normal approximations. Even for such small values of 11. the approximation error is small. • The central limit theorem (7-66) can be expressed as a property of convolutions: The convolution of a large number of positive functions is approximately a normal curve lsee (7-55)). The central limit theorem is not always true. We cite next sufficient conditions for its validity: If the Rvs x; are i.i.d. and E{xl} is finite. the distribution F 11(z) of their normalized sum Zo tends to N(O. I). If the Rvs x; are such that lx;l < A < x and u; > c1 > 0 for every i. then Fo(z), tends to N(O, I). If E{xl} < B < x for every i and u? = ui +· · ·+ u;. -+ x as"-+ x (7-67) then F0(z) tends to NCO. I). The proofs will not be given. Example 7.8 If jj(.t) = e· •U(.t I then t:{x7} ,.,. )·., x"t• • d.t "" 11! II Thus E{x:} = 6 < ~: hence. the centred limit theorem applies. 
In this case, 11: = nE{llj} = n, a~ = nu1 = n. And since the convolution of n exponential densities equals the F.rlang density (7-61). we conclude from (7-65) that .n-l 1 4 (n - I)! 1,-: "'::': _ _ ,,-c:-n.:'2" V21Tii for large n. In Fig. 7.6. we show both sides for n = 5 and n . .,. 10. • 216 CHAP. 7 SEQUENCES OF RANDOM VARIABLES -F.rlang ----~onnal s 0 Fipre 7.6 11ae De Moivre-Lapla~e Theorem The (7-64) form of the central limit theorem holds for continuous and discrete type Rvs. If the RVS x; are of discrete type. their sum z is also of discrete type. taking the values''. and its density consists of points h(z,) = P{z = z.}. In this case. (7-65) no longer holds because the normal density is a continuous curve. If. however. the numbers Zt. form an arithmetic progression. the numbersf(zd are nearly equal to the values of the normal curve at these points. As we show next. the De Moivre-Laplace theorem (3-27) is a special case. • Definition. We say that an RV xis of lattice type if it takes the values xk = ka where k is an integer. If the RVS x1 are of lattice type with the same a. their sum Z = Xt + • • • +X" is also of lattice type. We shall examine the asymptotic form of h(Z) for the following special case. Suppose that the RVS x1 are i.i.d .• taking the values I and 0 as in (7-60). In this case, their sum z has a binomial distribution: P{z = k} = (~) pkq"-• k = 0, I. ...• n and E{z} = np u: = npq - !lC as n - !lC Hence, we can use the approximation (7-64). This yields Fl(z) = ~ (~) pkq"-k ~ G (~) t.sl vnpq (7-68) From this and (4-48) it follows that 3vnpq :s z :s np + 3vnpq} = .997 (7-69) In Fig. 7. 7a, we show the function F"(z) and its normal approximation. Clearly, Fl(z) is a staircase function with discontinuities at the points z = k. If n is large, then in the interval (k - I, k) between discontinuity points, the normal density is nearly constant. And since the jump of Fl(z) at z = k equals P{np - SEC. 7-3 nNTRAL l.IMIT THEOREM 217 k z z np (b) Figure 7.7 P{z = k}, we conclude from (7-68) (Fig. 7.7b) that ( n) e-lk· np)Z:~npq 1 pkqn-4 :::: k (7-70) V21rnpq as in (3-27). Multinomial Distribution Suppose that I.~if 1• • • • , .54,] is a partition of ~ with p 1 = P(.s41). We have shown in (3-41) that if the experiment ~ is performed n times and the events .llf1 occur k1 times, then P{.s41 occurs k1 times} = k I•1 • ~!. k , pt• r• · · · P~' For large n, this can be approximated by I [<k 1 exp { - 2 - np 1) np 1 2 2 + · · · + ..:....<k..:....,_-_np~,..:....) l} np, (7-71) = 2 (7-71) reduces to (7-70). Indeed, in this case, PI+ P2 = I We maintain that for n Hence, 2 2 + - I ) -_ ..:...<k~~--_n....:P....:I'-) (7-72) np1 np2 np1 trP2 np1P2 And since k1 = k. P1 = p, P2 = I - p 1 = q, (7-70) results. (k1 - np1) + -..:...<k..:..~_-_n...:.P...:.l:-)2 _ ('· - 1\.1 - ) ( I npl 2 - Sequenc£'.'i and Limits An infinite sequence XI, • • • 'Xn, • • • ofavs is called a random process. For a specific'· the values Xn(') ofxn form a sequence of numbers. If this sequence has a limit, that limit depends in general on The limit of Xn(C) might exist for some outcomes but not for c. 218 CHAP. 7 SEQUENCES OF RANDOM VARIABLES others. If the set of outcomes for which it does not exist has zero probability. we say that the sequence x, converges almost everywhere. The limit of x, might be an RV x or a constant number. We shall consider next sequences converging to a constant c. Con'tl~rg~nc~ In th~ MS s~ns~ The mean E{(x,. - c)2} of (x,. - c) 2 is a sequence of numbers. If this sequence tends to zero E{(x,. 
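Since Fig. 7.8 is not reproduced here, a short numerical sketch may serve in its place. The following Python fragment (NumPy, SciPy, and Matplotlib are assumed to be available; the grid and plotting range are arbitrary illustrative choices, not part of the text) plots the χ²(m) density for m = 1, 2, 3, 4 and prints the mean m and variance 2m that are derived in the next paragraph.

```python
# Sketch of the chi-square densities chi^2(m, x) of Fig. 7.8, m = 1, 2, 3, 4.
# NumPy/SciPy/Matplotlib are assumed; grid and plotting range are arbitrary.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0.01, 8, 400)          # avoid x = 0, where the m = 1 density blows up
for m in (1, 2, 3, 4):
    plt.plot(x, chi2.pdf(x, df=m), label=f"m = {m}")
    # SciPy's moments agree with E{x} = m and variance 2m derived in the next paragraph
    print(f"m = {m}: mean = {chi2.mean(m):.1f}, variance = {chi2.var(m):.1f}")
plt.xlabel("x")
plt.ylabel("chi-square(m, x)")
plt.legend()
plt.show()
```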
- c)2} ,_,. 0 (7-73) we say that the random sequence x,. tends to c in the mean-square sense. This can be expressed as a limit involving the mean .,,. and the variance u~ ofx,.. • Th~onm. The sequence x, tends to c in the MS sense itT .,,.u,.,_,. 0 ,_ c .. (7-74) • Proof. This follows readily from the identity E{(x,. - c)2} = (TJ,. - c)2 + u~ Con'tl~rg~nc~ In Probability Consider the events {lx,. - cl > 8} where s is an arbitrary positive number. The probabilities of these events form a sequence of numbers. If this sequence tends to zero P{lx,.- cl > 8}(7-75) ,_,. 0 for every 6 > 0, we say that the random sequence x,. tends to c in probability. • Th~onm. If x,. converges to c in the MS sense, it converges to c in proba- bility. • Proof. Applying (4-116) to the positive RV lx,.- cl 2, we conclude, replacing ATJ by 8 2, that 2 P{lx,. - cl2 ~ 8 2} = P{lx,. - cl ~ 8} s E{lx,. ~ cl } (7-76) 6 Hence, if E{lx,. - cl2}- 0, then P{lx,. - cl > 8} - 0. Con'tl~rg~nce In Distribution This form of convergence involves the limit of the distributions F,.(x) = P{x,. s x} of the avs x,.. We shall say that the sequence x,. converges in distribution if F,.(x) ,_,. - F(x) (7-77) for every point of continuity of x. In this case, x,.(C) need not converge for any If the limit F(x) is the distribution of an RV x. we say that x,. tends to x in distribution. From (7-77) it follows that ifx,. tends toxin distribution, the probability P{x,. ED} that x,. is in a region D tends to P{x e D}. The Rvs x,. and x need not be related in any other way. The central limit theorem is an example of convergence in distribution: The sum z,. = x1 + · · · + x,. in. (7-63) is a sequence of Rvs converging in · distribution to a normal RV. c. SEC. 7-4 SPECIAL DISTRIBUTIONS OF STATISTICS 2J9 l.aw of LafJCe Numbers We showed in Section 3-3 that if an event .54 occurs k times in n trials and P(s4) = p. the probability that the ratio kin is in the interval (p - e, p + e) tends to I as n- 2; for any s. We shall reestablish this result as a limit in probability of a sequence of Rvs. We form the zero-one RVs X; associated with the event s4 as in (7-61), and their sample mean i, Since £{x;} = p and X1 +' ''+X, = --'---------"- u? = pq, it follows"that £{i,} =p u! = np!! = pq - .... " ,, ., 0 n... Thus the sequence i,. satisfies (7-73); hence, it converges in the MS sense and, therefore, in probability to its mean p. Thus P{li,. - PI > e} - n-" 0 And since i,. equals the ratio kin, we conclude that P{l~- pi > e} ~x 0 for every e > 0. This is the law of large numbers (3-37) expressed as a limit. 7-4 Special Distributions of Statistics We discuss next three distributions of particular importance in mathematical statistics. Chi-Square Distribution The chi-square density (7-78) is a special case ofthe gamma density (4-50). The term f(m/2) is the gamma function given by [see (4-51)] m =even= 2k m =odd= 2k + 1 (7-79) We shall use the notation x2( m) to indicate that an av x has a chisquare distribution.* The constant m is called the number of degrees of • For brevity we shall also say: The RV xis x2!m). 220 CHAP. 7 SEQUENCES OF RANDOM VARIABLES o.s s 0 Fipre 7.8 freedom. The notation x2(m, x) will mean the functionf(x) in (7-78). In Fig. 7.8, we plot this function form = I, 2, 3, and 4. 
With b = m/2 and c = 1/2, it follows from (5-64) that E{xn} = m(m + 2) · · · (m + 2n - 2) Hence, E{x} = m E{x2} = m(m + 2) u~ =2m Note [see (4-54) and (5-63)] that E{!}x ='Y Jo{"' x"''2-3e-x'2dx =_I_ m - 2 cl>(s) The = V(l m>2 1 (7-80) (7-81) (7-82) (7-83) - 2s)"' x2 density with m = 2 degrees of freedom is an exponential I I f(x) = 2 e·xl2 U(x) cl>(s) = 1 - 2s (7-84) and for m = 1 it equals /(x) because r(l/2) Example 7.9 1 =~ e-"12 U(x) V21TX 1 cl>(s) = VI - 2s (7-85) = v'1T. We shall show that the square of an N(O, I) freedom. Indeed, if RV has a and x2 density with one degree of y = x2 then [see (4-84)) r.(v) J\ • =.v.vJ' ~ r(Yr)U(\') - · = ~ ,.· r'2U(v) V21Ty · (7-86) • SEC. 7-4 SPECIAL DISTRIBUTIONS OF STATISTICS 221 The following property of x2 densities will be used extensively. If the RVs x andy are x2<m) and x2( n ). respectively. and are independent. their sum z "" x + y is x 2(m + ll) (7-87) • Proof. As we see from (7-tm. I <l>x(s) = YO - I cl>y(y) = 2s)m V(l - 2s)n Hence lsee (7-14)) I <l>:(s) = <l>.,(s)cl>,.(.f) :... ~is)"''" From this it follows that the convolution of two density: x2 densities is a x2 (7-88) l'undamen1al Property The sum of the squares of n independent N(O, I) RVs x; has a chi-square distribution with n degrees of freedom: " lfQ = ~ i Xf (7-89) then I • Proof. As we have shown in (7-86). the RVs xi are x2(1). Furthermore. they arc independent because the Rvs x; arc independent. Repeated application of (7-88) therefore yields (7-89). The Rvs X; arc N(O, I); hence [see (5-55)), E{xl} = I E{x1} = 3 Var xi = 2 From this it follows that E{Q} = n (7-90) cr~ = 2n in agreement with (7-81). As an application, note that if the RVs w; are N(71, cr) and i.i.d., the RVs X; = (w; - 71)/u are N(O, I) and i.i.d., as in (7-89); hence, the sum Q = i (W;- 71 ) 2 is }( 2(n) (7-91) CT ;-1 QUADRATIC FORMS The sum n Q =L Qijll; Xj (7-92) i.j•l is called a quadnttic form of order n in the variables x;. We shall consider quadratic forms like these where the Rvs x; are N(O. I) and independent, as in (7-89). An important problem in statistics is the determination of the conditions that the coefficients a;; must satisfy such that the sum in (7-92) have a x2 distribution. The following theorem facilitates in some cases the determination of the x2 chantcter of Q. 222 CHAP. 7 SEQUENCES OF RANDOM VARIABLES • Theortm 1. Given three quadratic forms Q. Q 1, and = Q, + o~ it can be shown that if o Q~ such that <7-93) (a) the RV Q has a x~<n) distribution (b) the RV Q 1 has a x2(r) distribution (c) Q~ :::: 0 then the RVs Q 1 and Q 2 are independent and Q 2 has a x~ distribution with n - r degrees of freedom. The proof of this difficult theorem follows from (7 A-6): the details. however, will not be given. We discuss next an important application. Sample Mean and Sample Variance Suppose that arbitrary population. The Rvs i =! i n ;. 1 i s2 =_I_ X; tr - I ;- 1 X; is a sample from an (7-94) (X;- i):! are the sample mean and the :rumple variance, respectively. of the sample As we have shown. , X;. E{i} ui = = Tl u- -; where., is the mean and u 2 is the variance of x;. We shall show that E{s2} = u 2 (7-95) • Proof. Clearly, (x; - TJ)2 = (X;- i +i - TJ)2 = (X; - i)~ + 2(x, - i)(i - TJ) + (i - TJ)~ n Summing from I ton and using the identity L (X; - i) = 0. we conclude that i-1 L" (x; - TJf j-J = L" (x; - i)2 + n(i - TJ)2 (7-96) j-J Since the mean ofi is., and its variance u 2/n, we conclude, taking expected values of both sides, that nu~ = E{(X; - i)2 } + a~ and (7-95) results. 
We now introduce the additional assumption that the samples x; are normal. • Theortm 2. The sample mean i and the sample variance s2 of a normal sample are two independent Rvs. Furthermore, the Rv is (7-97) SEC. 7-4 SPECIAL DISTRIBUTIONS OF STATISTICS 223 • Proof. Dividing both sides of (7-96) by u2, we obtain ± (X;- .,r = (i- .,r + ±(X;- ir ;~ (7-98) u u/Vn ;-t u The RV i is normal with mean 71 and variance u~ln. Hence, the first RV in parentheses on the right side of (7-98) is N(O, I), and its square is x2(1). Furthermore, the RV (x1 - 71)/u is N(O, I); hence, the left side of (7-98) is x2(n) [see (7-89)). Theorem 2 follows, therefore. from theorem I. Note, finally, that , 2u" (7-99) E{s2} = u 2 u··=-.•. n- I 1 This follows from (7-97) and (7-81) because u2 2 E{s2} = n _ E{Qz} u .~ 1 Student I 2 = (T ( n- ., I) u 2a~ - Distribution We shall say that an RV x has a Student l(m) distribution with m degrees of freedom if its density equals !< ·~) = 'Y This density will be denoted by is based on the following. • Theorem. If the r((m + 1)/21 'Y = ·-,.--- . V7Tm I (m/2) I ~x 2 /m)"'' 1 RVs z l(m. x). (7-100) The importance of the 1 distaibution and ware independent. z is Nm. I). and w is x2<m ), the ratio z X= vwr;;; is I( m) (7-101) • Proof. From the independence of z and w it follows that f..,(z. w) - e .. :z'~(w"'' 2 · •e·"·' 2 )U(w) To find the density of x, we introduce the auxiliary variable y = w and form the system x = zVmiW y = w This yields z = w = y; and since the Jacobian of the transformation equals YmiW, we conclude from (S-83) that xv"YJ'm, /xy(X, y) = ~hw(X~, y)- vye· .• ~~''"'(ymf2 .. le·y12) for y > 0 and 0 for y < 0. Integrating with respect to y, we obtain fx(x) - With Jn(.. ylm-llt2 { 2\' (I + x2)} m dy exp - (7-102) 224 CHAP. 7 SEQUENCES OF RANDOM VARIABLES (7-102) yields f,(x)- f• I V(l + x 1/m)"'' 1 lo , q«"''"'- e-'~dq ·- I V(l + x 21m)"'' 1 and (7-101) results. Note that the mean of x is zero and its variance equals E{xF = m£{z2}E {.!} = m__!!!__ w 2 (7-103) This follows from (7-82) and the independence uf z and w. • Corollary. If i and s2 are the sample mean and the sample variance, respectively, of an N(TJ, a) sample x; [see (7-94)), the ratio t - i - 71 is t(n - I) -s!Vn (7-104) • Proof. We have shown that the avs z i-11 = a!Vn s2 w = (n - I) a2 are independent, the first is N(O, 1), and the second is and (7-101) it follows that the ratio x2(n - 1). From this ~-....,..,.....,... i - 71 / a/Vn (n - l)s2 a 2(n - I) i - 71 =~ is t(n - 1). Snedecor F Distribution An av x has a Snedecor distribution if its density equals x•12-J /(x) = 'Y V(l + kxlm)k-m U(x) (7-105) This is a two-parameter family of functions denoted by F(k, m). Its importance is based on the following. • Theorem. If the RVs z and ware independent. z is x1(k). and w is x2(r). the ratio zlk x =is F(k, r) (7-106) wlr • Proof. From the independence of z and w it follows that J; ...(z. w)- (z41He-:·2)(w'12 •e-""'2} U(z)U(w) We introduce the auxiliary variable y rz/kw. y = w. The solution of this system is z equals rlkw. Hence [see (5-83)). = w and form the system x = = kxy/r, w = y. and its Jacobian SEC. 7-4 SPECIAl. DISTRIBUTIONS Of-" STATISTICS 225 for x > 0. y > 0. and 0 otherwise. Integrating with respect toy. we obtain I ,. . !.. ) (,(x) _ x4;2 1 ,.c4•11·2 1 cxp l-:.... ( 1 ... - x d-~· ' II ' 2 /' • With i , " q-:,...\' (1+-x ,. ) \' .:. . 2q · I ' kxl r we obtain f (X) - • < rA·~· 1 (.' ·--·· • -· · · · ·-., CJ14 ' ' 1 ~ t' ' 1 c/q (I + kx/r)' 4 ··~-. 11 and {7-106) results. 
Note that the mean of the E{x} = RV x in (7- 106) equals Isec <7-81) and (7-82>1 !:. f.'{z}f..' {.!.} k = w 1 !:. _k_ (7-107) -= - - · 2 kr 2 r Furthermore. if x is F(/.;.. rl Percentiles The 11-percentile of an that 11 = The 11-pcrccntiles of the noted by I . ,. then RV - IS X (7-108) x is by definition a number x, such P{x s x,} = P{x > x 1 x~(m). t(m). I '(I'. 1\) ,1 (7-109) and f'(k. rl distributions will be de- t,(m) F,(k. rl respectively. We maintain that F,(k. r) = h,-1(1. r) = t~(rl F,(k., r) :::,. • Proof. If the RV 11 (7-110) F 1 u : r. k) (7-111) ' kI x~<k > (7-112) x is F(k, r). then = P{x s F,(k, r)} = P {~ ;=: F,(!. r)} and (7-110) follows from (7-108) and (7-109). Ifx is NCO, 1). then x2 is x 2(1 ). From this and <7-10 I) it follows that if y is t(r), then y2 is F(l. r): hence. 2u - I = P{ly! s t,(r)} = P{y2 s t~(r)} = P{y2 s F 2,. 1( I, r)} and (7-111) results. Note, finally, that if w is x2(r). then Isec (7-81)) E{!} Var!,. = !,. r = I Hence, wlr- I as r- :lO, This leads to the conclusion that the RV x in (7-106) tends to zlk as r- x, and (7-112) follows because z is x2(k). 226 CHAP. 7 SEQUENCES OF RANDOM VARIABLES Appendix: Chi-Square Quadratic Forms Consider the sum II Q =~ (7A-I) aijXiXj iJ•I where x; are n independent N(O, I) random variables and aii are n2 numbers such that au = ai;. We shall examine the conditions that the numbers au must satisfy so that the sum Q has a x2 distribution. For notational simplicity, we shall express our results in terms of the matrices A = a,,.] [·a·"· a,., .. . a,.,. X= ["': ] X* = [x., . . . , x,.] x,. Thus A is a symmetrical matrix, X is a column vector, and X* is its transpose. With this notation, Q =X* AX We define the expected value of a matrix with elements wu as the matrix with elements E{wii}. This yields xx• = x1x1 · · · [ x,.x, £~*} = [- > (7A-2) where 1,. is the identity matrix. The last equation follows from the assumption that E{x;xi} = I if i = j and 0 if i =I= j. Diagonalization It is known from the properties of matrices that if A is a symmetric matrix with eigenvalues A.;, we can find a matrix T such that the product TAT* equals a diagonal matrix D: TAT* =D D = [.:. : (7A-3) From this it follows that the matrix Tis unitary; that is, T* equals its inverse r-•, and hence its determinant equals 1. Using this matrix, we form the vector z-.n- [] APPENDIX: CHI-SQUARF. QUADRATIC FORMS 227 We maintain that the components z; ofZ are normal, independent, with zero mean, and variance I. • Proof. The avs z; are linear functions of the normal avs X;; hence, they are normal. Furthermore, E{z;} = 0 because £{x;} = 0 by assumption. It suffices, therefore, to show that £{z;zi} = I fori = .i and 0 otherwise. Since Z* = X*T* ZZ* = 7XX*T* T* = T I we conclude that (7 A-4) £{ZZ*} = T £{XX*}T* .:.:. 7T I '- I,, Hence. the avs z; are i.i.d. and N(O, I). The following identity is an important consequence of the diagonalization of the matrix A. The quadratic form Q in (7 A-1) can be written as a sum of squares of normal independent avs. Indeed. since X = T* Z and X* = Z*T. (7A-3) yields " Q = Z* TA T*Z = Z* IJZ = L A;ZJ (7 A-5) , I • f'undamental Theorem. A quadratic form generated by a matrix A has a x2 distribution with r degrees of freedom iff r uf the eigenvalues A; of A equal I and the other n - r equal zero. Rearranging the order of all A; as necessary. we can state the theorem as follows: Q is x2(r) iff Ai ;. . :. { ~~ r<i<11 (7A-6) • Proof. From (7 A-5) it follows that if eigenvalues A, scttisfy (7 A-6). then , , . 
' Q='-z; i (7A-7) I Hence rsee (7-87)]. Q is x~(r). Conversely. if A;:/- 0 or I for some i. then the corresponding term A;ZT is not x 2• and the sum Q in 17A-5) is not x~. The following consequence of (7 A-7) justifies the term dt•gret•s offret'dom used to characterize a x2 distribution. Suppose that Q =- ~ wr where w; are n avs linearly dependent on x;. If Q is x~(r). then ctt most r of these avs are linearly independent. Noncnllral Distrihlllion.\· Consider the sum ,. " ' Qu = Qu(y;) = ' - y; i (7A-8) I where y; are n independent N< 71;. I) avs. If 71; :I= 0 for some i. then the RV Oo docs not have a x2 distribution. Its distribution will he denoted by n where e =~ ,,.. 71~ = Qo(TI;) (7A-9) and will be called noncentrcll x2 with tr degrees of freedom and t•ccellfridty t'. We shall determine its moment function <l>(s). 228 CHAP. 7 SEQUENCES OF RANDOM VARIABLES • Theorem <l>(.f) I = t:{t'•O.·} = ~ exp { -se- } (1 - 2.f)" I - 2.f • Proof. The moment function of YJ equals £{e'Y;}. = I J• • ('"·e . vl1T 1\ - (7A-10) . q t·i2 dy; The exponent of the integrand equals , S\'~ - 1);)~ ( l'; • • I 2 ( ..:..: I)( 2 .l'·I .'1 - - 1), )~ - -- I - 2.\' .f1)J +I -2-f Hence !see (3A-I)J. cl>;(.\') . '} = /:.{('''· I = VI - } 2./"P { I .f1)l - 2s Since the RVS y; arc independent. we conclude from the convolution theorem (7-14) that <l>(.f) is the product of the moment functions <l>;(s); this yields (7 A-10). From (7 A-10) it follows that the x2(n. d distribution depends not on 1); separately but only on the sum ('of their squares. • Corollary E{Qo} = ll + e (7A-11) • Proof. This follows from the moment thet>rcm (5-62). Differentiating (7A-10) with respect to sand settings = 0, we obtain <1>'(0) = n + e. and (7 A-ll) results. Centering In (7 A-6). we have established the x2 charclcter of certain quadratic forms. In the following. we obtain a similar result for noncentrcll x2 distributions. Using the avs y; in (7A-tH. we form the sum II Q,(y;) = L aiJYiYi iJ""I where au are the elements of the quadrcltic form Q in (7A-l). We maintain that if Q is x2(r), then II where e = Q,(1)1) = L aii11i11J (7A-12) iJ-1 • Proof. The avs X;= y;- 1); are i.i.d. as in (7A-1 ), and Q 1(y;) = Q(x; + 1);). The avs z; in (7A-6) are linear functions of x;. A change, therefore, from X; to X;+ 1); results in a change from Zt to Zt + 81 where 8; = £{zt}. From this and (7 A-7) it follows that , Q,(y;) = Q(X; + 1)1) = L (Z; + 8t)2 i•l This is a quadratic form as in (7A-8); hence, its distribution is (7A-12). x2(r, e) as in PROBLEMS 229 Application We have shown in (7-97> thHt if i is the sample mean of the RVS x;. the sum II ~( X; Q = L.. -··IS X-(11 , X)- - I) , I Withy;= x; Q1 = i 71;. y = i + Tj. J::{y}..::: Tj. it follows from (7A-12) that the sum :i ;a. J (y;- y)2 is x2<n- I. e) Note. finally. (sec C7A-11)1 that E{Q.} :.... withe=}: (71;- Tj)2 (7A-13) ,.;J (ll - I) • e (7A-14) Noncentral t and f' ()istributions Noncentral distributions are used in hypothesis testing to determine the operating characteristic function of various tests. In these applications. the following forms of noncentrality are relevant. A noncentralt distribution with " degrees of freedom and eccentrity e is the distribution of the ratio z VwTin where z and ware two independent RVs, z is Nk. 1). and w is x~(,). This is an extension of<7-101). A noncentrcll F(k. r. (')distribution is the distribution of the ratio xlk wlr where z and w arc two independent avs. z is noncentral This is an extension of <7-106). x~<k. e>. and w is x~< r). 
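As a plausibility check of (7A-8)–(7A-11), the following sketch (Python with NumPy assumed; n = 5 and the particular means ηᵢ are arbitrary choices made only for illustration) simulates Q₀ = Σ yᵢ² for independent N(ηᵢ, 1) variables and compares the sample mean of Q₀ with n + e.

```python
# Monte Carlo check of E{Q0} = n + e for the noncentral chi-square sum (7A-8)-(7A-11).
# NumPy assumed; n = 5 and the means eta_i are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
eta = np.array([0.5, -1.0, 2.0, 0.0, 1.5])          # eta_i; eccentricity e = sum(eta_i^2)
n, e = len(eta), float(np.sum(eta**2))

y = rng.normal(loc=eta, scale=1.0, size=(200_000, n))   # each row: one realization of y_1..y_n
Q0 = np.sum(y**2, axis=1)

print("sample mean of Q0:", round(Q0.mean(), 3))     # should be close to n + e
print("n + e            :", n + e)
```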
Problems 7-1 A matrix C with elements P.ii is called nonnegative definite if L C;CjiJ.iJ ~ 0 for iJ 7-2 7·3 every c; and c:;. Show that if C is the covariance matrix of n Rvs x;, it is nonnegative definite. Show that if the RVS x, y. and z are such that r,,,. = '•: _, I, then rx: = I. The RVS x1 , x2 , x3 are independent with densities f,lx,), Ji<x2). Jl(xJ). Show that the joint density of the RVs y, = x, 7-4 Y2 = X2 - x, Yl - X] - X2 equals the product jj(y, >fi<Y2 - Y• )Jl(.Vl - Y2>· (a) Using (7-13), show that a4<11(.flt S2 , SJ, S4) = r.L'{x1x2X3X4 } as,as2aslas. (b) Show that if the RVs x; are normal with zero mean and E{x;x1} = P.ii. then E{x,x2XJX4} = P.I2P.l4 + P.uP.2• ~ P.1•P.21 230 CHAP. 7 SEQUENCES OF RANDOM VARIABLES 7-5 7-6 Show that if the avs x1 are i.i.d. with sample mean i, then E{(x1 - i)2} = (n - l)u 21n. The length c of an object is measured three times. The first two measurements, made with instrument A. are 6.19 em and 6.25 em. The third measurement, made with instrument B. is 6.21 em. The standard derivations of the instru· ments are 0.5 em and 0.2 em, respectively. Find the unbiased linear minimum variance estimate of c. Consider the sum [see also (7-39)) c 7-7 where P{n = k} = P4 k = 1, 2, . . . and the avs x1 are i.i.d. with density £(x). Using moment functions, show that if n is independent ofx1, then f:<z> = k-1 L" P•fx•'<;.) 7-8 where}141(x) is the convolution of.f..(x) with itself k times. A plumber services N customers. The probability that in a given month a particular customer needs service equals p. The duration of each service is an av x1 with density ce-~~u(x). The total service time during the month is thus the random sum z = x1 + · · • + x, where n is an av with binomial distribution. (a) Show that c" L (N) p"qN-n z"-le-~: n (n - 1)! /,(z) = N ,.~ 0 ' (b) The plumber charges $40 per hour. Find his mean monthly income if c 112, N = 1,000 and p = .1. = The avs X; are i.i.d., and each is uniform in the interval (9, 11). Find and sketch the density oftheir sample mean i for n = 2 and n = 3. 7·10 The avs x1 are i.i.d., and N(.,, u). Show that 7-9 I ~" (a) if y = - \}2 lx; - Til L t•l n . (b) ifz = 1 2 E{y} = u then " (n _ 1) ~ (x;- x 1• 1)2 then 0'2 ' = ( Tr - 1) -u2 2 n E{z} = u 2 7·11 Then avs x1 are i.i.d. with distribution F(x) = I - R(x) and z = max x1, w min X;. R,.(w) = R"(w) (a) Show that = (b) Show that if f(x) = e-IA-IIU(x - 8) then F,.lw) = e-III•·-IIU(w - 8) (c) Find F,(r.) and F,..(w) if /(x) = ce-~~u(x). (d) Find F,(z) and F,..(w) ifx has a geometric distribution as in (4-66). 7·11 The n avs x1 are i.i.d. and uniformly distributed in the interval c - 0.5 < x < c + 0.5. Show that ifz = max x~o w = min x1, then (a) P{w < c < z} = 1 - 1/2"; _ {n(n ...: 1)(z - w)"- 2 c - 0.5 < w < z < c + 0.5 (b) fz,.(z. w) - o otherwise PROBLEMS 231 7-13 Show that if Y• is the kth-order statistic of a sample of size n of x and .t.~ is the median of x. then 7-14 The height of children in a certain grade is an RV x uniformly distributed between 55 and 65 inches. We pick at random five students and line them up according to height. Find the distribution of the height t of the child in the center. 7-15 A well pipe consists of four sections. Each section is an RV X; uniformly distributed between 9.9 and 10.1 feet. Using (7-64). find the probability that the length x = x1 + x2 ... x3 + x.. of the pipe is between 39.8 and 40.2 feet. 7-16 The Rvs X; are N(O, u) and independent. Using the CLT. show that if y = xi + · + x!. 
then for large n, /y(y) ,J = u· 41Tn exp {- _I_ I\' 4nu 4 • - na-2)2} 7-17 (Random walk) We toss a fair coin and take a step of size c to the right if heads shows, to the left if tails shows. At the nth toss. our position is an RV x,. taking the values me where m = n, n- 2, . . . , -n. (a) Show that P{x,. = me} n· I _m+ll = ( k) 2" k- 2 (b) Show that for large"· P{x,. I = me} ,.., Vn1ii2 t• ., ""'·" (c) Find the probability {x50 > 6c} that at the 50th step, x,. will exceed 6c. 7-18 <Proof of the CLT) Given n i.i.d. RVS x; with zero mean and variance a- 2, we form their moment functions cJ>(s) = E{e,.·}. 'i'(s) =In Cl>(s). Show that (Taylor series) , 'i'(s) = ~- s 2 Ifz = (1/'Vn) (x 1 + · · · -t higher power of s then «<>:(s) + x,.), a-2 'i'.(s) = - s2 · 2 - = «<>"(s!Vn) I powers of-.,.. Vn Using the foregoing, show that f:(r.) tends to an N(O, u) density as n- or:. 7-19 The n avs x; are i.i.d. with mean 71 and variance a- 2. Show that if i is their sample mean and g(x) is a differentiable function, then for large n, the RV y = g(i) is nearly normal with Tly = g'(Tj) and rry = lg'(Tj)lutVii. 7-lO The RVs x; are positive, independent, andy equals their product: Y = Xt' ' 'X" Show that for large n, the density of y is approximately lognormal (see Problem 4-22) /y(y) "" cyVi; exp { - !. (lny - b)} U(y) 2 2 where b i~ the mean and c2 the variance of the RV z = In y = L In x; 232 CHAP. 7 SEQUENCES OF RANDOM VARIABLES This result leads to the following form of the central limit theorem: The distribution of the product y of n independent positive avs X; tends to a lognormal distribution as n- x. 7·21 Show that if the RV x has a x2(7) distribution, then E{ 1/x} = 1/5. 7-22 (Chi distribution) Show that ifx has a x2(m) distribution andy = Vi, then /y(y) = 'YY •e·y!i2U(y) 2"'2r~n/2) 'Y = 7-23 Show that if the RVs xi are independent N(TJ. u) and (xi - .i) 1 " s2 = n _ 1 ~ - u - 2 then E{s} = 7·24 The n RVs Xi i.i.d. with density /(x) = sample mean i, and show that E {J}x = n__!!£__ - I E 'JIn ce·····~·(x). {I} Xi 2 )'(n/2)u _ 1 r((n _ 0121 Find the density of their n2c2 = (n - I)(n - 2) 7-25 The avs xi are N(O, u) and independent. Find the density of the sum z =xi ... X~+ xi. 7-26 The components v.~, v,., and v~ of the velocity v = vv: + v~ - v: of a particle are three independent avs with zero mean and variance u 2 ;. kTim. Show that v has a Maxwell distribution I~-1r v2e·rrl2a·u(v) '' r(v) = _ Ju 0'3 7-27 Show that if the avs x andy are independent, xis N(O, l/2) andy is }( 2(5), then the RV z = 4x2 + y is x2(6). 7-28 The RVs z and ware independent, z is N(O, u), and w is x2(9). Show that ifx = 3z/Vw, then,h(x) - 1/(90'2 + x2)'. 7·29 The avs z and w are independent, z is x2(4), and w is x2(6). Show that if x = z/2w, then,h(x)- [x/(1 + 2x)')U(x). 7-30 The avs x andy are independent, xis N(TJ.~, ul, andy is N(TJ., u). We form their sample means i, y and variances s~ and ~ using samples ·of length n and m, respectively. (a) Show that if ci2 = as~ + ~ is the LMS unbiased estimator of u 2, then ~ v1 2u 4 2u 4 a = --b = --where v 1 = - V• = - Vt + v2 v, + v2 n - I · m - I are the variances of and respectively. Hint: Use (7-38) and (7-97). (b) Show that ifw = i - y,.,. = Tlx- .,,., and s: S:. cY., = (as~ + b~) (~ - !) then the av(w - Tlw)lcts has a t distribution with n + m - 2 degrees of freedom. 7-31 Show that the quadratic form Q in (7A-1) has a ]( 2 distribution if the matrix A is idempotent, that is, if A 2 = A. 
7-32 The RVs x and y are independent, and their distributions are noncentral x2(m, e 1) and xlCn, e2), respectively. Show that their sum z = x + y has a noncentral x2(m + n, e 1 + e2) distribution. PART TWO STATISTICS 8 _ _ __ The Meaning of Statistics Statistics is part of the theory of probability relating theoretical concepts to reality. As in all probabilistic predictions, statistical statements concerning the real world are only inferences. However, statistical inferences can be accepted as near certainties because they are based on probabilities that are close to 1. This involves a change from statements dealing with an arbitrary space ~ to statements involving the space ~/, of repeated trials. In this chapter, we present the underlying reasoning. In Section 8-2, we introduce the basic areas of statistics, including estimation. hypothesis testing, Bayesian statistics, and entropy. In the final section, 8-3, we comment on the interaction of computers and statistics. We explain the meaning of random numbers and their computer genenttion and conclude with a brief comment on the significance of simulation in Monte Carlo methods. 8-1 Introduction Probability is a mathematical discipline developed on the basis of an abstract model. and its conclusions are deductions based on the axioms. Statistics deals with the applications of the theory to real problems, and its conclu- 235 236 CHAP. 8 THE MEANING OF STATISTICS sions are inferences based on observations. Statistics consists of two parts: analysis and design. Analysis, or mathematical statistics, is part of the theory of probability involving RVS generated mainly by repeated trials. A major task of analysis is the construction of events the probability of which is close to 0 or to 1. As we shall see, this leads to inferences that can be accepted as near certainties. Design, or applied statistics, deals with the selection of analytical methods that are best suited to particular problems and with the construction of experiments that can be adequately described by theoretical models. This book covers only mathematical statistics. The connection between probabilistic concepts and reality is based on the empirical formula n~ == np (8-1) relating the probability p =P(.!4) of an event .!4 to the number n81 of successes of .!4 inn trials of the underlying physical experiment~. This formula can be used to estimate the model parameter pin terms of the observed number n~. If p is known, it can be used to predict the number n~ of successes of .!4 in n future trials. Thus (8-1) can be viewed as a rudimentary form of statistical analysis: The ratio • n~ p=- n (8-2) is the point estimate of the parameter p. Suppose that f:l is the polling experiment and .!4 is the event {Republican}. We question 620 voters and find that 279 voted Republican. We then conclude thatp = 279/620 = .45. Using this estimate, we predict that 45% of all voters are Republican. Formula (8-1) is only an approximation. A major objective of statistics is to replace it by an exact statement about the value n81 - np of the approximation error. Since pis a model parameter, to find such a statement we must interpret also the numbers n and n~ as model parameters. For this purpose, we form the product space ~~~=~X • • • X f:/ consisting of then repetitions of the experiment~ (see Section 3-1), and we denote by n~ the number of successes of .!4. 
We next form the set ~ = {np - 3Viij)q < na~ < np + 3Viij)q} (8-3) This set is an event in the space f:/,, and its probability equals [see (7-69)] P{np - 3vnpq < na~ < np + 3Viij)q} = .997 (8-4) We can thus claim with probability .997 that n41 will be in the interval np :t 3 Viij)q. This is an interval estimate of n41 • We shall use (8-4) to obtain an interval estimate of p in terms of its point estimate p = na~ln. From (8-4) it follows with simple algebra that 9 P{(p - p)2 <- p(l - p)} = .997 n SEC. 8-J INTRODUCTION 237 Denoting by p 1 and P2 the roots of the quadratic (p - p)~ = ~ p(l n - p) (8-5) we conclude from (8-4) that (8-6) P{p, < p < P2} = .997 We can thus claim with probability .997 that the unknown parameter pis in the interval (p 1, p~). Thus using statistical analysis, we replaced the empirical point estimate (8-2) of p by the precise interval estimate (8-6). In the polling experiment, n = 620. p = .45. and (8-5) yields 9 (p - .45)-' = p( I - p) Pt = .39 P2 = .5 I 620 Sot(' In news broadcasts. this result is phrased as follows: "A poll showed that forty-five percent of all voters are Republican; the margin of error is :=6%." The number :6 is the difference from 45 to the endpoints 39 and 51 of the interval (39. 51). The fact that the result is correct with probability .997 is not mentioned. The change from the empirical formula (8-1) to the precise formula (8-4) did not solve the problem of relating p to real quantities. Since P(~) is also a model concept, its relationship to the real world must again be based on (8-1). This relationship now takes the following form. If we repeat the experiment 9-n a large number of times, in 99.7% of these cases, the number n.71 of success of .'iii will be in the interval np 3 vnpq. There is, however, a basic difference between this and (8-1). Unlike P(.s4.), which could be any number between 0 and I, the probability of the event ~ is almost I. We can therefore expect with near certainty that the event \1l will occur in a single performance of the experiment 9'n. Thus the change from the event .s4. with an arbitrary probability to an event~ with P(li) == I leads to a conclusion that can be accepted as near certainty. The foregoing observations are relevant, although not explicitly stated, in most applications of statistics. We rephrase them next in the context of statistical inference. = Statistics and Induction Suppose that we know from past observations the probability P(.M.) of an event .M.. What conclusion can we draw about the occurrence of this event in a single performance of the underlying experiment? We shall answer this question in two ways, depending on the size of P(.M.). We shall give one answer if P(.M.) is a number distinctly different from 0 or 1-for example, .6-and a different answer if P(.M.) is close to 0 or 1-for example, .997. Although the boundary between the two probabilities is not sharply defined (.9 or .9999?), the answers are fundamentally different. 238 CHAP. 8 THE MEANING OF STATISTICS Case 1 We assume, first, that P(.M.) = .6. In this case. the number .6 gives us only some degree of confidence that the event .M. will occur. Thus the known probability is used merely as a measure of our state of knowledge about the possible occurrence of .M.. This interpretation of P(.t.f.) is subjective and cannot be verified experimentally. At the next trial, the event vtt will either occur or not occur. If it does not, we will not question the validity of the assumption that P(.M.) = .6. Case 2 Suppose, next, that P(.M.) 
= .997. Since .997 is close to 1. we expect with near certainty that the event .M. will occur. If it does not occur, we shall question the assumption that P(.M.) = .997. Mathematical statistics can be used to change case I to case 2. Suppose that s4 is an event with P(s4) = .45. As we have noted, no reliable prediction about s4 in a single performance of~ is possible (case 1). In the space~" of 620 repetitions of~. the set~= {242 < n.x < 316} is an event with P(~) = .997 == I. Hence (case 2), we can predict with near certainty that if we perform Y, once, that is, if Y is repeated 620 times, the number na~ of successes of s4 will be between 242 and 316. We have thus changed subjective knowledge about the occurrence of s4 based on the given information that P(s4) = .45 to an objective prediction about a6 based on the derived probability that P(~) == .997. Note that both conclusions are inductive inferences. The difference between the two, although significant, is only quantitative. As in case I, the conclusion that ~ will occur at a single performance of the experiment Y, is not a logical certainty but only an inference. In the last analysis, no prediction about the future can be accepted as logical necessity. 8-2 The Major Areas of Statistics We introduce next the major areas of statistics stressing basic concepts. We shall make frequent use of the following: Percentiles The u-percentile of an RV x is a number x,. such that u = F(x,.). The percentiles of the normal, the x2, the t. and the F distributions are listed in the tables in the appendix. The u-percentile of the NCO, I) distribution is denoted by z,.. If xis N(71, u), then x,. With 'Y = 71 + z,.u because F(x) = G (x ~ 71 ) = I - a a given constant, it follows that (see Fig. 8.1) (8-7) Sample Mean Consider an RV x with density /(x), defined on an experiment ~. A samplt> of x of length n is a sequence of n independent and SEC. 8-2 THE MAJOR AREAS OF STATISTICS z 239 X F"JgUI'e 8.1 identically distributed (i.i.d.) avs x 1• • • • , x,. with density f(x) defined on the sample space ~f, = g. x · · · x 9- as in Section 7-1. The arithmetic mean i of x; is called the sample mean of x. Thus [see (7-30>1 I ~ x=-Lx; n ;. t 71.\ = .,, ' cr-:! =CT~ - ' (8-8) " If we perform the corresponding physical experiment 11 times, we obtain the • • • • x,.. These numbers arc the values x; = x;(C) of the samples X;. They will be called ob.'iervaticms; their arithmetic mean = i(C) is the observed sample• mean. From the CLT theorem (7-64) it follows that for large "· the av i is approximately normal. In fact, if/(x) is smooth. this is true for n as small as 10 or even less. Applying (8-7) to the RV i. we obtain the basic formula n numbers x 1• x (8-9) Estimation Estimation is the most important topic in statistics. In fact, the underlying ideas form the basis of most statistical investigations. We introduce this central topic by means of specific illustrations. We wish to measure the diameter 8 of a rod. The results of the measurements arc the values of the RV x = 8 + " where " is the measurement error. We know from past observations that "is a normal RV with 0 mean and known variance. Thus xis an N(8, a) RV with known a, and our task is to find 8. This is a problem in parameter estimation: The distribution of the RV xis a function F(x. 8) of known form depending on an unknown parameter 8. The problem is to estimate 8 in terms of one or more observations (measurements) of the RV x. We buy an electric motor. 
We are told that its life length is a normal RV x with known 71 and a. On the basis of this information, we wish to estimate the life length of our motor. This is a problem in predictio11. The distribution f'(x) of the RV xis completely known. and our problem is to predict its value x = x(C) at a single performance of the underlying experiment (the life length of a specific motor). 240 CHAP. 8 THE MEANING OF STATISTICS F(x,8) F(:<) 8 unknown known Estimate 8 (a) Predict X X (b) Figure 8.2 In both examples, we deal with a classical esimation problem. In the first example, the unknown 8 is a model parameter, and the data are observations of real experiments. We thus proceed from the observations to the model (Fig. 8.2a). In the second example, the model is completely known, and the number to be estimated is the value x of a real quantity. In this case, we proceed from the model to the observation (Fig. 8.2b). We continue with the analysis. We are given an av x with known density /(x), and we wish to predict its value x = x<C> at the next trial. Clearly, x can be any number in the range of x, hence, it cannot be predicted; it can only be estimated. Thus our problem is to find a constant c such as to minimze in a sense some function L(x - c) of the estimation error x - c. This problem was considered in Section 6-3. We have shown that if our criterion is the minimization of the MS error E{(x- c)2}, then c equals the mean Tlx ofx; if it is the minimization of E{lx- cj}, then c equals the medianx.s ofx. The constant c so obtained is a point estimate of x. A point estimate is used if we wish to find a number c that is close to x on the average. In certain applications, we are interested only in a single value of the av x. Our problem then is to find two tolerance limits c 1 and c2 for the unknown x. For example, we buy a door from a factory and wish to know with reasonable certainty that its length x is between c 1 and c2 • For our purpose, the minimization of the average prediction error is not a relevant objective. Our objective is to find not a point estimate but an interval estimate of x. To solve this problem, we select a number y close to I and determine the constants c 1and c2 such as to minimize the length c2 - c 1 of the interval (c 1 , c2 ) subject to the condition that P{c, < x < c2} = y (8-10) PREDICTION If this condition is satisfied and we predict that x will be in the interval (c 1, c2), then our prediction is correct in l()()y% of the cases. We shall find c 1 and c2 under the assumption that the density /(x) is unimodal (has a single maximum). Suppose, first, thatf(x) is also symmetrical about its mode Xmak as in Fig. 8.3. In this case, xmak = ., and c2 - c1 is minimum if c 1 = 71 -a, c2 = 71 + a where a is a constant that can be determined from (8-10). As we see from Fig. 8.3a, (8-11) 6=1-y a = 7l - Xa12 = Xt-812 - 7l SEC. 0 11 8-2 THF. MAJOR AREAS OF STATISTICS .'( 24) X (a) (b) (c) Figure 8.3 For an arbitrary unimodal density. c2 - c 1 is minimum if/(c 1) = j(c2 ) as in Fig. 8.3b. (Problem 9-9) This condition, combined with (8-10), leads by trial and error to a unique determination of the constants c 1 and c 2 • For computational simplicity we shall usc the constants (Fig. 8.3c) Ct = X312 C:! = Xt AI:! (8-12) For asymmetrical densities, the length of the resulting interval (c 1, c 2 ) is no longer minimum. The constant 'Y = I - 3 is called the confidence coefficient and the tolerance interval (c,. c2) the 'Y confidence interval of the prediction. 
The value of 'Y is dictated by two conflicting requirements: If 'Y is close to I, the estimate is reliable but the size c:2 - c 1 of the confidence interval is large; if 'Y is reduced, c· 2 - c, is reduced but the estimate is less reliable. The commonly used values of 'Yare .9, .95, .99. and .999. Example 8.1 The life length of tires of a certain type is a normal RV with ., = 25,000 miles and u = 3 ;000 miles. We buy a set of such tires and wish to find the .95-confidence interval of their life. From the normality assumption it follows that P{TJ - a < x < TJ + a} = 2G (~) - I = .95 242 CHAP. 8 THE MEANING OF STATISTICS This yields G(alu) = .915, alu = Z.97~· a = z. 97~u == 2u; hence. P{l9,000 <X< 31,000} = .95 Since .95 is close to I, we expect with reasonable confidence that in a single performance of the experiment, the event {.,., - a < x < .,., + a} will occur: that is. the life length of our tires will be between 19,000 and 31,000 miles. • PARAMETER ES11MA110N In the foregoing discussion, we used statistics to change empirical statements to exact formulas involving events with probabilities close to 1. We show next that the underlying ideas can be applied to parameter estimation. To be concrete, we shall consider the problem of estimating the mean., of an RV with known variance u 2• An example is the measurement problem. The empirical estimate of the mean ., = E{x} of an RV x is its observed sample mean x [see (4-90)]: 1 ., ==X=LX; n i-1 II (8-13) This is an approximate formula relating the model concept ., to the real observations x;. We shall replace it by a precise probabilistic statement involving the approximation error x - .,. For this purpose, we form the sample mean i of x and the event ?A = {., - a < i < ., + a} where a is a constant that can be expressed in terms of the probability of ~. Assuming that i is normal, we conclude as in (8-9) that if P(~) = 'Y = I - a. then a = z 1-&12u/Vn. This yields 6'2u} =-y ---:v;- Z1 -6J2U <- < + Z1P { .,-~ X 71 (8-14) or, equivalently, (8-15) We have thus replaced the empirical statement (8-13) by the exact probabilistic equation (8-15). This equation leads to the following interval estimate of the parameter 71: The probability that ., is in the interval a u=J--2 equals 'Y = I - a. As in the prediction problem, 'Y is the confidence coefficient of the estimate. Note that equations (8-14) and (8-15) are equivalent; however, their statistical interpretations are different. The first is a prediction: It states that if we predict that i will be in the .fixed interval., ~ z,p/Vn, our prediction will be correct in 100-y% of the cases.. The second is an estimation: It states that if we estimate that the unknown number ., is in the random interval i ~ Sf.C. 8-2 THE MAJOR ARF.AS OF STATISTICS 243 zp/Vn, our estimation will be correct in l()()y% of the cases. We thus conclude that if y is close to I, we can claim with near certainty that ., is in the interval x ~ z,u/Vn. This claim involves the average x of the n observations x; in a single performance of the experiment ~n· Using statistics, we have as in (8-4) reached a conclusion that can be accepted as near certainty. Example 8.2 We wish to estimate the diameter 1J of a rod using a measuring instrument with zero mean error and standard deviation u = I mm. We measure the rod 64 times and find that the average of the measurements equals 40.3 mm. 
Find the .95-confidence interval of 11· In this problem y = .95 I; 3 = .975 Z.91~ "" 2 n = 64 Inserting into (8-15), we obtain the confidence interval 40.3 ~ u=l 0.25 mm. • Hypothesis Testing Hypothesis testing is an important area of decision theory based on statistical considerations. Docs smoking decrease life expectancy? Do IQ scores depend on parental education? Is drug A more effective than drug 8? The investigation starts with an assumption about the values of one or more parameters of a probabilistic model ~0 • Various factors of the underlying physical experiment are modified, and the problem is to decide whether these modifications cause changes in the model par.lmeters, thereby generating a new model f:f 1• Suppose that the mean blood pressure of patients treated with drug A is Tlo· We change from drug A to drug B and wish to decide whether this results in a decrease of the mean blood pressure. We shall introduce the underlying ideas in the context of the following problem. The mean cholesterol count of patients in a certain group is 240 and the standard deviation equals 32. A manufacturer introduces a new drug with the claim that it decreases the mean count from 240 to 228. To test the claim, we treat 64 patients with the new drug and observe that the resulting average count is reduced to 230. Should we accept the claim? In terms of RVS, this problem can be phrased as follows: We assume that the distribution of an RV xis either a function Fo(x) with Tlo = 240 and u 0 = u = 32 or a function F 1(x) with .,, = 228 and u, = u = 32. The first assumption is denoted by Ho and is called the null hypothesis; the second is denoted by H 1 and is called the alternative hypothesis. In the drug problem, H 0 is the hypothesis that the treatment has no effect on the count. Our task is to establish whether the evidence supports the alternative hypothesis H 1 • As we have noted, the sample mean i ofx is a normal RV with variance u/Vn = 4. Its mean equals Tlo = 240 if H 0 is true and.,, = 228 if H 1 is true. In Fig. 8.4, we show the density of i for each case. The values of i are concentrated near its mean. It is reasonable, therefore, to reject H 0 iff xis to the left of some constant c. This leads to the following test: Reject the null hypothesis 244 CHAP. 8 THE MEANING OF STATISTICS Fipre 8.4 iff i < c. The region x < c of rejection of H0 is denoted by Rc and is called the critical region of the test. To complete the test, we must specify the constant c. To do so, we shall examine the nature of the resulting errors. Suppose, first, that H0 is true. If i is in the critical region, that is, if i < c, we reject H 0 even though it is true. Our decision is thus wrong. We then say that we committed a Type I error. The probability for such an error is denoted by a and is called Type I error probability or significance level of the test. Thus a = P{i < c!Ho} = r. fx~x. Tlo) dx (8-16)t Suppose next that H 1 is true. If i > c, we do not reject H 0 even though H1 is true. Our decision is wrong. In this case, we say that we committed a Type II error. The probability for such an error is denoted by f3 and is called Type II error probability. The difference P = 1 - f3 is called the power of the test. Thus f3 = P{i > ciH1} = L"' fx~x. ,.,, ) dx (8-17) For a satisfactory test, it is desirable to keep both errors small. This is not, however, possible because if c moves to the left, a decreases but f3 increases; if c moves to the right, f3 decreases but a increases. 
Of the two errors, the control of a is more important. Note In general, the purpose of a test is to examine whether the available evidence supports the rejection of the null hypothesis; it is not to establish whether Ho is true. If i is in the critic::al region R~, we reject H0 • However, if i' is not in R,., we do not conclude that Ho is true. We conclude merely that the evidence does not justify rejection of H0 • Let us clarify with a simple example. We wish to examine whether a coin is loaded. To do so, we toss it 100 times and observe that heads shows k times. If k = 15, we reject the null hypothesis because 15 is much smaller than 50. If k = 49, we conclude that the evidence does not support the rejection of the fair-coin hypothesis. The evidence, howt The expression P{i < ciH0} is not a conditional probability. It is the probability that i < c under the assumption that H0 is true. SEC. 8-2 THE MAJOR AREAS OF STATISTICS 245 ever, does not lead to the conclusion that the coin is fair. We could have as well concluded that p = .49. To carry out a test, we select a and determine c from (8-16). This yields a . Tlo) = G (cu!Vn - Tlo) = f;(c, (8-18) The resulting f3 is obtained from (8-17): /3 Example 8.3 = c - "'') I - F.t<c. Til)= 1 - G ( --;;J\1; (8-19) In the drug problem, .,, = 228 CT = 32 n = 64 i = 230 1)o = 240 We wish to test the hypothesis H0 that the new drug is not effective, with significance level .OS. In this case, a = .OS z,.. = -1.64S c = 233.4 /3 = I - G(I.3S) = .089 Since .i = 230 < c, we reject the null hypothesis; the new drug is recommended. The power of the test is P = I - /3 == .911 • Ftmdamellllll Note There is a basic conceptual difference between parameter estimation and hypothesis testing, although the underlying analysis is the same. In parameter estimation, we have a single model and use the observations to estimate its parameters. Our estimate involves only precise statistical considerations. In hypothesis testing, we have two models: A model ~0 representing the null hypothesis and a model~. representing the alternative hypothesis (Fig. 8.S). We start with the assumption that :.1 0 is the correct model and use the observations to decide whether this assumption must be rejected. Our decision is not based on statistical considerations alone. Mathematical statistics leads only to the following statements: If ~0 is the true model, then P{i > d = a (8-20) If~. is the true model, then P{i < c} = /3 These statements do not-indeed, cannot-lead to a decision. A decision involves the selection of the critical region and is based on other considerations, often subjective, that are outside the scope of mathematical statistics. Figure 8.5 Hypothesis testing 246 CHAP. 8 THE MEANING OF STATISTICS Bayesian Statistics In the classical approach to the estimation problem, the parameter 8 of a distribution F(x, 8) is viewed as an unknown constant. In this approach, the estimate of 8 is based solely on the observed values x; of the RV x. In certain applications, 8 is not totally unknown. If, for example, 9 is the probability of heads in the coin experiment. we expect that its possible values are close to .5 because most coins are fair. In Bayesian statistics, the available prior information about 8 is used in the estimation process. In this approach, the unknown parameter 9 is viewed as the value of a random variable 8, and the distribution of x is interpreted as the conditional distribution F. . <xl9> of x assuming 8 = 8. 
The prior information is used to assign somehow a density /,(8) to the RV 9, and the problem is to estimate the value 8 of 8 in terms of the observed value x of x and the density /,(8) of 8. The problem of estimating the unknown parameter 8 is thus changed to the problem of estimating the value 8 of an RV 9. In other words, in Bayesian statistics, estimation is changed to prediction. We shall illustrate with the measurement problem. We measure a rod of diameter 8; the results are the values x; = 8 + 11; of the sum 9 + " where " is the measurement error. We wish to estimate 9. If we interpret 8 as an unknown number, we have a classical estimation problem. Suppose, however, that the rod is picked from a production line. In this case, its diameter 9 can be interpreted as the value of an RV 8 modeling the diameters of all rods. This is now a problem in Bayesian estimation involving the RVS 8 and x = 8 + "· As we have seen in Section 6-3, the LMS estimate iJ of 8 is the regression line iJ = £{6jx} =f. 8f,(8jx)d8 (R-21) Here iJ is the point estimate of our rod and xis its measured value. To find iJ, it suffices to find the conditional density f,(8lx> of 8 assuming x = x. As we know [see (6-32)], Ji< 8l ) = .fx<xl8> /,(9) • ...'· x f....(x) (8-22) The function /,(8) is the unconditional density of 8, which we assume known. This function is called prior (before the measurement). The conditional density / 8(8jx) is called posterior (after the measurement). The conditional density .fx<xl9) is assumed known. In the measurement problem, it can be expressed in terms of the density / 11(11) of the error. Indeed, if 8 = 8, then x = 9 + v; hence, .fx<xl8> = f,(x - 8) Finally,.fx(x) can be obtained from the total probability theorem (6-31): .fx(x) = f . /.(xj8)/,(8) d8 (8-23) The conditional density ft<xl9> considered as a function of 8 is called the Sf.C. 8-2 THF. MAJOR AREAS OF STATISTICS 247 likelihood function. Omitting factors that do not depend on x. we can write (8-22) in the following form: Posterior - likelihood x prior (8-24) This is the basis of Bayesian estimation. We conclude with the model interpretation of the densities: The prior density / 8(0) models the diameters of all rods coming out of the production line. The posterior density fB<6 x) models the diameters of all rods of measured diameter x. The conditional density j~(xj6) models all measurements of a particular rod of true diameter 0. The unconditional density h(.t) models all measurements of all rods. 1\'ot'' In Bayesian estimation, the underlying model is a product space ~ "" ~~.where ~ 8 is the space of the RV Band~~. is the space of the RV x (fig. 8.6). In the measurement problem, rfH is the space of all rods and rt. is the space of all measurements of a particular rod. The product space ~is the space of all measurement'i of all rods. The numb~r 8 ha.. two meanings: It is the value of the RV 8 in the space :tH: it is also a parameter specifying the density J:<xj8) ""fv<x - 61 in the space :1,. :JH x The Controversy. Bayesian statistics is a topic of continuing controversy between those who interpret probability "objectively.·· as a measure of averages, and those who interpret it ··subjectively." as a measure of belief. The controversy centers on the meaning of the prior distribution F 6(0). For the objectivists, 1-'11 (8) is interpreted in terms of averages as in (4-25); for the subjectives, F 8 (8) is a measure of our state of knowledge concerning the unknown parameter 8. According to the objectivists. 
parameter estimation can be classical or Bayesian, depending on the nature of the problem. The classical approach is used if 8 is a single number (the diameter of a single rod. for example). The Bayesian approach is used if 8 is one of the values of an RV 0 (the diameters of all rods in the production line). The subjectivists use the Bayesian approach in all cases. They assign somehow a prior to the unknown 8 even if little or nothing is known about 6. A variety of methods have been proposed for doing so; however, they are of limited interest. Figure 8.6 X Obso:rn: x l'n:di.:tfl l~aycsian ~-stim:ttion 248 CHAP. 8 THE MEANING OF STATISTICS The practical difference between the two approaches depends on the number n of available samples. If n is small. the two methods can lead to very different results; however, the results are not very reliable for either method. As n increases, the role of the prior decreases, and for large nit has no effect on the result. The following example is an illustration. Example 8.4 We toss a coin n times, and heads shows k times. Estimate the probability p of heads. In the classical approach to estimation, pis an unknown parameter, and its empirical estimate is the ratio kin. In the Bayesian approach, p is the value of an Rv p with density /p(p ), and the LMS estimate p of p equals p = 'Y J~ pp•o - P>"-kJ,<p> dp <8-25> This follows from (6-21) and (6-54) (see Problem 6-9). For small values of n, the estimate depends on /,(p). For large n, the term p~l - p)"-• is a sharp narrow function centered at the point kin (Fig. 6.3). This shows that the right side of (8-25) approaches kin regardless of the form of /,(p). Note that if /p(p) is constant, then [see (6-24)] k+ J k p=---n+t .. -xn Thus, for large n, the Bayesian estimate of p equals its classical estimate. • Entropy Given a partition A the sum = [.Sif,, ••• , sfN] consisting of N events sf~o we form N H(A) = - L P1 In P1 i•l Pi = P(sfi) (8-26) This sum is called the entropy of the partition A. Thus entropy is a number associated to a partition as probability is a number associated to an event. Entropy is a fundamental concept in statistics. It is used to complete the specification of a partially known model in terms not of observations but of a principle. The following problem is an illustration. The average number of faces up of a given die equals 4.S. On the basis of this information, we wish to estimate the probabilities Pi = P{Ji} of the six faces {Ji}. This leads to the following problem: Find six numbers Pi such that Pt + 2p2 + · · · + 6p6 = 4.5 (8-27) This problem is ill-posed; that is, it does not have a unique solution because there are six unknowns and only two equations. Suppose, however, that we wish to find one solution. What do we do? Among the infinitely many solutions, is there one that is better in some sense than the others? To answer this question, we shall invoke the following principle: Among all possible solutions of(8-27), choose the one that maximizes the entropy H(A) = -(p 1 In Pt + · · · + P6ln p 6) of the partition A consisting of the six events {Ji}. This is called the principle of maximum entropy. PI + P2 + · · · + P6 = I SEC. 8-2 THE MAJOR ARF.AS OF STATISTICS 249 What is the justification of this principle? The pragmatic answer is that it leads to results that agree with observations. Conceptually. 
the principle is often justified by the interpretation of entropy as a measure of uncertainty: As we know, the probability of an event .s4 is used as a measure of our uncertainty about its occurrence. If P(sA.) = .999. we are prctctically certain that sA. will occur; if P(sA.) = .2, we are reasonably certain that it will not occur; our uncertainty is maximum if P(sA.} = .5. Guided by this. we interpret the entropy H(A) of a partition A as a measure of uncertainty not about a single event but about the occurrence of any event of A. This is supported by the following properties of entropy (see Section 12-1 ): H(A) is a nonnegative number. If P(sA.•J = I for some k, then P(sA.;) = 0 for every i i= k and H(A) = 0; in this case, our uncertainty is zero because at the next trial, we are certain that only the event sA.1c will occur. Finally, H(A) is maximum if all events of A have the same probability; our uncertainty is then maximum. The notion of entropy as a measure of uncertainty is subjective. In Section 12-3, we give a different interpretation to H(A) based on the concept of typical sequences. This concept is related to the relative frequency interpretation of probability, and it leads to an objective meaning of the notion of entropy. We introduce it next in the context of a partition consisting of two events. Consider the partition A = [sA., .<4 Jconsisting of an event sA. and its complement .<4. If we repeat the experiment n times, we obtain sequences s1 of the form TYPICAL SEQt.:ENCES AND RF.LATIVE FR.:QUENCY s1 = sA. .<;{ sA. • • • .<4 = I, . . . • 2" (8-28) Each sequence s1 is an event in the space 9',.. and its probability equals P(si) = plcqn-4 (8-29) where k is the number of successes of .<4. The total number of sequences of the form (8-28} equals 211 • We shall show that if pis not .5 and n is large, only a small number of sequences of the forms (8-28) is likely to occur. Our reasoning is based on the familiar approximation k == np. This says that of all 2" sequences, only the ones for which k is close to np are likely to occur. Such sequences are denoted by t1 and are called typical; all other sequences are called atypical. If t1 is a typical sequence, then k == np, n - k = n - np = nq. Inserting into (8-29), we obtain (8-30) P(ti) = p'q" ·lc == p""q"" 10 10 Since p = e P and q = e q. this yields P(lj) j = e"P In p-llq In q = e-.. 1/(A) (8-31) where H(A} = -(pIn p + q In q) is the entropy of the partition A =[sA., sA.). We denote by fJ the set {k == np} consisting of all typical sequences. As we noted, it is almost certain that k == np; hence, we expect that almost every observed sequence is typical. From this it follows that P(fJ) == I. 250 CHAP. 8 THE MEANING OF STATISTICS Denoting by n, the number of typical sequences, we conclude that P(5"} = n,P{tj} == I n, == enH!AI (8-32) This fundamental formula relates the number of typical sequences to the entropy of the partition [.si, .si], and it shows the connection between entropy and relative frequency. We show next that it leads to an empirical justification of the principle of maximum entropy. lfp = .5, then H(A) = -(.51n .5 + .51n .5) =In 2 and n, ==en In 2 = 2n. In this case, all sequences of the form (8-28) are typical. For any other value of p, H(A) is less than In 2. From this it follows that if n is large, (8-33) Thus if n is large, the number 2n of all possible sequences is vastly larger than the number enH!AI of typical sequences. This result is fundamental in coding theory. 
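As a rough numerical check of (8-32) and (8-33), the short sketch below (an illustration added here, not from the text) computes the entropy H(A) of a two-event partition for a few values of p and compares the approximate number e^{nH(A)} of typical sequences with the total count 2^n for n = 100.

```python
import math

def entropy(p):
    """H(A) = -(p ln p + q ln q), the entropy of the partition [A, A-complement], in nats."""
    q = 1 - p
    return -(p * math.log(p) + q * math.log(q))

n = 100
for p in (0.5, 0.3, 0.1):
    H = entropy(p)
    n_typical = math.exp(n * H)   # approximate number of typical sequences, as in (8-32)
    total = 2.0 ** n              # number of all sequences of length n
    print(f"p = {p}: H(A) = {H:.4f}, typical/total = {n_typical / total:.3e}")
```

For p = .5 the ratio equals 1, and it drops rapidly as p moves away from .5; this is the content of (8-33).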
Equation (8-33) shows that H(A) is maximum iff n_t is maximum. The principle of maximum entropy can thus be stated in terms of the observable number n_t: If we choose the p_i so as to maximize H(A), the resulting number of typical sequences is maximum. As we explain in Chapter 12, this leads to the empirical interpretation of the maximum entropy principle.

Concluding Remarks  All statistical statements are probabilistic, based on the assumption that the data are samples of independent and identically distributed (i.i.d.) RVs. The theoretical results lead to useful practical inferences only if the underlying physical experiment meets this assumption. The i.i.d. assumption is often phrased as follows:

1. The trials must be independent.
2. They must be performed under essentially equivalent conditions.

These conditions are empirical and cannot be verified precisely. Nevertheless, the experiment must be so designed that they are somehow met. For certain applications, this is a simple task (coin tossing) or requires minimum effort (card games). In other applications, it involves the use of special techniques (polling). In the physical sciences, the i.i.d. condition follows from theoretical considerations supported by experimental evidence (statistical mechanics). In the following chapters, we develop various techniques that can be used to establish the validity of the i.i.d. conditions. This, however, requires an infinite number of tests: We must show that the events {x_1 ≤ x_1}, . . . , {x_n ≤ x_n} are independent for every n and for every x_i. In specific applications, we select only a small number of tests. In many cases, the validity of the i.i.d. conditions is based on our experience.

Numerical data obtained from repeated trials of a physical experiment form a sequence of numbers. Such a sequence will be called random if the experiment satisfies the i.i.d. condition. The concept of a random sequence of physically generated numbers is empirical because the i.i.d. condition applied to real experiments is an empirical concept. In the next section, we examine the problem of generating random numbers using computers.

8-3 Random Numbers and Computer Simulation

Computer simulation of experimental data is an important discipline based on the computer generation of sequences of random numbers (RNs). It has applications in many fields, including: use of statistical methods in the numerical solution of deterministic problems; analysis of random physical phenomena by simulation; and use of random numbers in the design of random experiments, in tests of computer algorithms, in decision theory, and in other areas. In this section, we introduce the basic concepts, stressing the meaning and generation of random numbers. As a motivation, we start with the explanation of the Monte Carlo method in the evaluation of definite integrals.

Suppose that a physical quantity is modeled by an RV u uniformly distributed in the interval (0, 1) and that x = g(u) is a function of u. Since f_u(u) = 1 for u in the interval (0, 1) and 0 elsewhere, the mean of x equals

η_x = E{g(u)} = ∫₀¹ g(u) f_u(u) du = ∫₀¹ g(u) du    (8-34)

As we know [see (4-90)], the mean of x can be expressed in terms of its empirical average x̄. Inserting this approximation into (8-34), we obtain

∫₀¹ g(u) du ≈ (1/n) Σ_{i=1}^{n} x_i = (1/n) Σ_{i=1}^{n} g(u_i)    (8-35)

where u_i are the observed values of u in n repetitions of the underlying physical experiment.
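A minimal sketch of this statistical evaluation of an integral, based on (8-35), follows; it is an added illustration in which both the integrand g(u) = u² and the use of Python's built-in uniform generator as a stand-in for the observed data u_i are arbitrary choices.

```python
import random

def g(u):
    return u * u      # integrand chosen only for illustration; the exact integral is 1/3

random.seed(1)
n = 10_000
u = [random.random() for _ in range(n)]    # stand-in for n uniform observations u_i
estimate = sum(g(ui) for ui in u) / n      # empirical average, as in (8-35)
print(f"estimate = {estimate:.4f}, exact value = {1/3:.4f}")
```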
The approximation (8-35) is an empirical statement relating the model parameter.,. to the experimental data u;. It is based on the empirical interpretation (1-1) of probability and is valid if n is large and the data u; satisfy the i.i.d. condition. This suggests the following method for evaluating statistically a deterministic integral. The data u;, no matter how they are obtained, arc R~s; that is. they are numbers having certain properties. If, therefore. we can develop a method for generating such numbers. we have a method for evaluating the integral in <8-35). Random Numbers "What are RNs? Can they be generated by a computer·? Arc there truly rc1ndom sequences of numbers'!" Such questions were raised from the early years of computer simulation, and to this day they do not have a generc11ly accepted answer. The reason is that the term random sequence has two very 252 CHAP. 8 THE MEANING OF STATISTICS different meanings. The first is empirical: RNs are real (physical) sequences of numbers generated either as observations of a physical quantity. or by a computer. The second is conceptual: RNs are mental constructs. Consider, for example. the following, extensively quoted. definitions.· D. H. Lehmer (1951): A random sequence is a vague notion embodying the ideas of a sequence in which each term is unpredictable to the uninitiated and whose digits pass a certain number of tests, traditional with statisticians and depending somewhat on the uses to which the sequence is to be put. J. M. Franklin (1962): A sequence of numbers is random if it has every property that is shared by all infinite sequences of independent samples of random variables from the uniform distribution. These definitions are fundamentally different. Lehmer's is empirical: The terms vague, unpredictable to the uninitiated, and depending somewhat on uses are heuristic characterizations of sequences of real numbers. Franklin's is conceptual: infinite sequences, independent samples, and uniform distribution are model concepts. Nevertheless, although so different, both are used to define random number sequences. To overcome this conceptual ambiguity, we shall proceed as in the interpretation of probability (Chapter 1). We shall make a clear distinction between RNS as physically generated numbers. and RNs as a theoretical concept. THE DUAL INTERPRETATION OF R~s Statistics is a discipline dealing with averages of real quantities. Computer-generated random numbers are used to apply statistical methods to the solution of various problems. Results obtained with such numbers are given the same interpretation as statistical results involving real data. This is based on the assumption that computergenerated RNS have the same properties as numbers obtained from real experiments. It suggests therefore that we relate the empirical and the theoretical properties of computer-generated RNs to the corresponding properties of random data. Empirical Definition A sequence of real numbers will be called random if it has the same properties as a sequence of numerical data obtained from a random experiment satisfying the i.d.d. condition. As we have repeated noted, the i.d.d. condition applied to real data is a heuristic statement that can be claimed only as an approximation. This vagueness cannot, however, be avoided no matter how a sequence of random numbers is specified. The above definition also has the advantage that it is phrased in terms of concepts with which we are already familiar. We can ·Knuth, D. 
E .• Th~ Art ofComput~r Programming (Reading. MA: Addison Wesley. 1969). SEC. 8-3 RANDOM NUMBERS AND COMPUTER SIMULATION 253 thus draw directly on our experience with random experiments and use the well-established tests of randomness for testing the i.d.d. condition. These considerations are relevant also in the conceptual definition of RNs: All statistical results are based on model concepts developed in the context of a probabilistic space. It is natural therefore to define RNS in terms of RVS. Conceptual Definition A sequence of numbers x; is called random if it equals the samples x1 = x1(C) of a sequence of i.d.d. RVS x;. This is essentially Franklin's definition expressed directly in terms of Rvs. It follows from this definition that in applications involving computergenerated numbers we can rely solely on the theory of probability. We conclude with the following note. In computer simulation of random phenomena we use a single sequence of numbers. In the theory of probability we use afamily of sequences x;(C) forming a sequence x; of RVS. From the early years of simulation, various attempts were made to express the theoretical properties of RNS in terms of the properties of a single sequence of numbers. This is in principle possible; however, to be used as the theoretical foundation of the statistical applications of RNS, it must be based on the interpretation of probability as a limit. This approach was introduced by Von Mises early in the century [see (1-7)] but has not been generally accepted. As the developments of the last 50 years have shown, Kolmogoroff's definition is preferable. In the study of the properties of RNS, a new theory is not needed. GENERATION OF RNs A good source of random numbers is a physical experiment properly selected. Repeated tosses of a coin generate a random sequence of zeros (heads) and ones (tails). We expect from our past experience that this sequence satisfies the i.i.d. requirements for randomness. Tables of random numbers generated by random experiments are available; however, they are not suitable for computer applications. They require excessive memory, access is slow, and implementation is too involved. An efficient source of RNS is a simple algorithm. The problem of designing a good algorithm for generating RNS is old. Most algorithms used today are based on the following solution proposed in 1948. Lehmer's Algorithm Select a large prime number m and an integer a between 2 and m - I. Form the sequence z,. = az,.- 1 mod m n~ I (8-36)* Starting with a number Zo =I= 0 we obtain a sequence of numbers z,. such that J:sz,.sm-1 From this it follows that at·least two of the first m numbers of the sequence z,. • The notation A = B mod m means that A equals the remainder of the division of B by m. For example, 20 mod 13 = 7; 9 mod 13 = 9. 254 CHAP. 8 THE MEANING OF STATISTICS will be equal. Therefore. for"~ m. the sequence will be periodic with period mo!S;m-1: mo Example 8.5 !5; m- I We shall illustrate with m = 13. Suppose firstthat a= 5. lfZo = I, then z, equals I. 5. 12, 8, I, . . . ; if Zo = 2, then z, equals 2. 10, It, 3, 2•... The sequences so gcncrcttcd arc periodic with period m 0 = 4. Suppose next that a= 6. In this case, the sequence I, 6, 10, 8. 9. 2, 12, 7, 3. 5. 4, I I, I, . . . with the maximum period m0 = m - I =- 12 results. • A periodic sequence is not random. However, if in the problems for which it is intended the required number of samples is smaller than m, periodicity is irrelevant. 
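The recursion (8-36) and the periods found in Example 8.5 are easy to check numerically. The sketch below is an added Python illustration; it simply iterates z_n = a·z_{n−1} mod m and measures the period.

```python
def lehmer(a, m, z0, count):
    """Multiplicative congruential generator z_n = a * z_{n-1} mod m, as in (8-36)."""
    z, out = z0, []
    for _ in range(count):
        z = (a * z) % m
        out.append(z)
    return out

def period(a, m, z0=1):
    """Smallest n >= 1 with z_n = z_0 again, i.e. the period of the sequence."""
    z = z0
    for n in range(1, m + 1):
        z = (a * z) % m
        if z == z0:
            return n

print(lehmer(5, 13, 1, 5))              # [5, 12, 8, 1, 5] -> period 4, as in Example 8.5
print(lehmer(6, 13, 1, 12))             # the full-period sequence obtained with a = 6
print(period(5, 13), period(6, 13))     # 4 and 12 = m - 1
```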
It is therefore desirable to select a such that the period m₀ of z_n is as large as possible. We discuss next the properties that the multiplier a must satisfy so that the resulting sequence z_n has the maximum period m₀ = m − 1.

• Theorem. If m₀ = m − 1, then a is a primitive root of m; that is,

a^{m−1} = 1 mod m    and    aⁿ ≠ 1 mod m for 1 < n < m − 1    (8-37)

• Proof. From (8-36) it follows by a simple induction that z_n = z₀ aⁿ mod m. Since z_n has the maximum period m − 1, it takes all values between 1 and m − 1; we can therefore assume that z₀ = 1. If a^{m₀} = 1 mod m and m₀ < m − 1, then z_{m₀} = a^{m₀} mod m = 1 = z₀. Thus m₀ is a period; this, however, is impossible; hence, (8-37) is true.

Note that if m is a prime number, (8-37) is also a sufficient condition for maximum period. Thus the sequence z_n generated by the recursion equation (8-36) has maximum period iff a is a primitive root of m. Suppose that m = 13. In this case, 6 is a primitive root of m because the smallest integer n such that 6ⁿ = 1 mod m is 12 = 13 − 1. However, 5 is not a primitive root because 5⁴ = 1 mod 13.

To complete the specification of the RN generator (8-36), we must select the integers m and a. A value for m, suggested many years ago* and used extensively today, is the number

m = 2³¹ − 1 = 2,147,483,647

This number is prime and is large enough for most applications. In the selection of the constant a, our first requirement is that the resulting sequence z_n have maximum period. For this to occur, a must satisfy (8-37). Over half a billion numbers satisfy (8-37), but most of them yield poor RNs. To arrive at a satisfactory choice, we subject the multipliers that yield maximum period to a variety of tests. Each test reduces the potential choices until all standard tests are passed. We are thus left with a relatively small number of choices. Some of those are then subjected to special tests and are used in particular applications. Through a combination of additional tests and experience in solving problems, we arrive at a small number of multipliers. Such a multiplier is the constant

a = 7⁵ = 16,807

The resulting RN generator is

z_n = 16807 z_{n−1} mod 2147483647    (8-38)

* D. H. Lehmer, "Mathematical Methods in Large Scale Computing Units," Annu. Comput. Lab. Harvard Univ. 26 (1951).

General Algorithms  We describe next a number of more complex algorithms for generating RNs. We should stress that complexity does not necessarily improve the quality of a generator. The algorithm in (8-38) is very simple, but the quality of the resulting sequence is very high. Equation (8-36) is a first-order linear congruential recursion; that is, z_n depends only on z_{n−1}, and the dependence is linear. A general nonlinear congruential recursion is an equation of the form

z_n = f(z_{n−1}, . . . , z_{n−r}) mod m

The recursion is linear if

z_n = (a₁ z_{n−1} + · · · + a_r z_{n−r} + c) mod m

The special case

z_n = (z_{n−1} · z_{n−r}) mod m    (8-39)

is of particular interest. It is simple and efficient, and if r is large, it might generate good sequences with period larger than m. Note, finally, that the assumption that m is prime complicates the evaluation of the product a z_{n−1} mod m. The computation is simplified if m = 2^r where r is the available word length. This, however, weakens the randomness of the resulting sequence. In such cases, algorithms of the form

z_n = (a z_{n−1} + c) mod m    (8-40)

are used.
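The generator (8-38) can be implemented directly; Python's unbounded integers make the product a·z_{n−1} straightforward. The following added sketch produces the normalized numbers u_i = z_i/m discussed in the next passage and prints their sample mean, which should be close to .5 if the sequence behaves like uniform data; the seed value is an arbitrary choice.

```python
M = 2**31 - 1          # m = 2,147,483,647, a prime
A = 16807              # a = 7**5, the multiplier of (8-38)

def minimal_standard(seed, count):
    """Return count numbers u_i = z_i / m from the recursion z_n = 16807 z_{n-1} mod m."""
    z, out = seed, []
    for _ in range(count):
        z = (A * z) % M
        out.append(z / M)
    return out

u = minimal_standard(seed=1, count=100_000)
print(f"sample mean = {sum(u) / len(u):.4f}")          # expect a value near 0.5
print(f"min = {min(u):.6f}, max = {max(u):.6f}")       # all values lie strictly in (0, 1)
```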
TESl'S OF RA~DOMNESS The sequence Z; in (8-38) is modeled by a sequence of discrete type RVs z; taking all integer values from 1 to m - I. The sequence Z; u·=, m (8-41) is modeled essentially by a sequence of continuous type Rvs u; taking all values from 0 to 1. If the RVs u; are i.i.d. and uniformly distributed, we shall say that the numbers u1 are random and uniformly distributed or, simply, random. The objective of testing is to establish whether a given sequence of RNS is random. The i.i.d. condition requires an infinite number of tests. We • S. K. Park and K. W. Miller... Random Number Generations: Good Ones Are Hard to Find ... Communications of til~ ACM 31, no. 10 (October 1988). 256 CHAP. 8 THE MEANING OF STATISTICS must show that P{u, S P{u; s u} = u 0< u< I P{u 1 s u., u2 s u2} = P{u 1 s u 1}P{u2 s u 2} u., U2 S U2, u3 S u3} = P{u 1 S u 1}P{u2 S u 2}P{u3 s (8-42) u 3} and so on. In real problems. we can have only a finite number of tests. We cannot therefore claim that a particular sequence is random but another is not. Tests lead merely to plausible inferences. We might conclude, for example, that a given sequence is reasonably random for certain applications or that this sequence is more r.mdom than that. There are two kinds of tests: theoretical and statistical. We shall explain with an example. We wish to test whether the RNs z; generated by the algorithm (8-38) are the samples of an RV with mean m/2. In a theoretical test, we reason using the properties of prime numbers that in a sequence of m - I samples, each integer will appear once. This yields the average _1_~ m- I ,_, 1 m- I= m m- I 2 In a statistical test, we generate n numbers and form their average = I.z1/n. If our assumption is correct, then i:.. m/2. In a theoretical test, no samples are used. All conclusions are exact statements based on mathematical reasoning; however, for most tests the analysis is difficult. Furthermore, all results involve averages over the entire period. Thus they might not hold for various subsequences of z1• For example, the fact that each integer appears once does not guarantee that the sequence is uniform. In an empirical test, we generate n numbers z1 where n is reasonably large but much smaller than m, and we use the numbers Z; to form various averages. The underlying theory is simple for most tests. However, the computations are time-consuming, and the conclusions are probabilistic, leading to subjective decisions. All empirical tests are tests of statistical hypotheses. The theory is developed in Chapter 10. Let us look at some illustrations. Z; =I+ 2 + · · · + z Distribution We wish to test whether the RNs u; are the samples of an RV u with uniform distribution. To do so, we use either the x2 test (10-85) or the Kolmogoroff-Smimov test (10-44). lndepelllknct We wish to test whether the subsequences X; = U2i Y; = U2i•l are the samples of two independent RVs x andy. To do so, we form two partitions involving the events {a; s x s fj;} and {'y1 s y s 31} and apply the x2 test (10-78). Tests based directly on (8-42) are, in general. complex. Most standard tests use (8-42) indirectly. For example, to test the hypothesis that the Rvs x andy are independent, we test the weaker hypothesis that they are uncorre- SEC. 8-3 RANDOM NUMBERS AND COMPt:TF.R SIMt:I.ATION 257 lated using as an estimate of their correlation coefficient r the empirical ratio r in (10-40). If the test fails, we reject the independence hypothesis. We give next two rather special illustrations of indirect testing. 
Gap Test Given a uniform RV u and two constants a and ~ such that 0 < I, we form the event .vl = {a < u < P(sf.) = ~ - a = P We repeat the underlying experiment and observe sequences of the form a<~< m (8-43) where sf. appears in the ith position if the event .<Ji. occurs at the ith trial. We next form an RV x the values of which equal the gap lengths in (8-43), that is. the number of times Si appears between successive .it's. The RV x so constructed has a geometric distribution as in (4-66): P{x = r} = p, = (I - p)'p r = 0. I. . . . (8-44) We shall use (8-44) to test whether a given sequence u; is random. If the numbers u; are the samples of u. then a < u; < {:J iff the event .<4 occurs. Denoting by n, the number of gaps of length r of the resulting sequence (8-43). we expect with near certainty that n, = p,n where n is the length of the sequence. This empirical statement is rephrased in Section 10-4 as a goodness-of-fit problem: To examine whether the numbers n, fit the probabilities p, in (8-44). we apply the x2 test (10-61). If the test fails. we conclude that u; is not a good RN sequence. Spectral Test Here is a simplified development of this important test. limited to its statistical version. Given an RV u with uniform distribution. we form its moment-generating function <l>(s) = E{e'•} = l" e'" du t e' - I = -- s With s = jcu, this yields (Fig. 8. 7) "' . >I I'I"(Jcu = 2lsin cu/21 lwl Figure 8.7 ~!sin w/~1 lw; 258 CHAP. 8 THE MEANING OF STATISTICS Hence, for any integer r, «<>(j27Tr) = {~ r=O r:FO (8-45) We shall usc this result to test whether a given sequence u; = z;lm is uniform. For this purpose, we approximate the mean of the function ei2w•• by its empirical average: «<>(j27Tr) ., } = E{e'·w•• = I ~ - LJ , . e'·w•:.-m (8-46) ni-l If the uniformity assumption is correct, the right side is small compared to I for every integer r :F 0. We shall now test for the independence and uniformity of the se· quences If two avs u and v are uniform and independent, their joint moment function equals «1>,.0 (s,, .t2) = E{e••••J:"} = «<>(s 1)«<>(s:!) From this and (8-46) it follows that r, = r2 = 0 I «1>(j21rr1 • j21rr2) = { h . ot erwase 0 Proceeding as in (8·45), we conclude that if the subsequences Z2; and Z2i+ 1 of an RN sequence Z; are independent. then '• = ,2 = 0 otherwise (8-47) The method can be extended to an arbitr.uy number of subsequences. Note Hypothesis testing is based on a number of untested assumptions. The theoretical results are probabilistic statements, and the conclusions involve the subjective choice of various parameters. Applied to RN sequences, testing leads, therefore. only to plausible inferences. In the final analysis, specific RN generators are adopted for general use not only because they pass standard tests but also because they have been successfully applied to many problems. RNs with Arbitrary Distributions All RN sequences z, generated by congruential algorithms are integers with uniform distribution between I and m - I. The sequence z, u, = - (8·48) m is essentially of continuous type, and the corresponding RV u is uniform in the interval (0, 1). The RV a+ bu is uniform in the interval (a. a+ b), and the av I - u is uniform in the interval (0, I). Their samples are the sequences a + bu; and I - u;, respectively. SEC. 8-3 £ u = F(x) X L 259 RANDOM NUMBERS AND COMPUTER SIMULATION ¥- u X Figure 8.8 :r ,... ( X We shall use the sequence u; or, equivalently. I - u; to generate RN sequences with arbitrary distributions. 
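Returning to the gap test described above, the sketch below is an added Python illustration: it records the gap lengths between successive occurrences of the event A = {α < u < β} and compares their relative frequencies with the geometric probabilities p_r of (8-44). The interval (.3, .7) and the use of Python's built-in generator as the sequence under test are arbitrary choices; a complete test would finish with the chi-square comparison of (10-61).

```python
import random

random.seed(3)
alpha, beta = 0.3, 0.7                 # arbitrary test interval; p = beta - alpha = 0.4
p = beta - alpha
u = [random.random() for _ in range(50_000)]    # sequence under test (illustrative stand-in)

gaps, gap = [], 0
for x in u:
    if alpha < x < beta:               # the event A occurred: record the gap, start a new one
        gaps.append(gap)
        gap = 0
    else:
        gap += 1

for r in range(5):                     # compare observed gap frequencies with (8-44)
    observed = sum(1 for g in gaps if g == r) / len(gaps)
    expected = (1 - p) ** r * p
    print(f"r = {r}: observed {observed:.4f}, expected {expected:.4f}")
```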
Algorithms generating nonuniform RNs directly are not available. We shall discuss two general methods and several special techniques. Pf:RCE~TILE-TRANSFORMATION !\IETHOD If~; and w; arc the samples of two RVS z and wand ifw = g(z), then w; = g(z;). Using this observation. we shall generate an RN sequence .t; with distribution a given function F(x ), in terms of the RN sequence II; in (8-48). The proposed method is based on the following theorem. • Theorem. If x is an RV with distribution F(x) and u = F(x) then u is uniformly distributed in the interval (0, I); that is. F,.(u) us I. (8-49) = u for 0 s • Proof. From (8-49) and the monotonicity of F(x) it follows that the events {u s u} and {x s x} arc equal. Hence (fig. 8.8), F,(u) = P{u s 11} = P{x s x} = F(x) = u and the proof is complete. Denoting by F 1-ll(u) the inverse of the function u = F(x), we conclude from (8-49) that • X= F 1- 11(u) (8-50) From this it follows that if II; are the samples of u, then F' 11(11;) (8-51) is an RN sequence with distribution F(x). Thus, to form a sequence x; with distribution F(x), it suffices to form the inverse p- 11(u) of F(x) and to compute its values for u = u;. X;= Example 8.6 We wish to generate a sequence of RNS with distribution F(x) = I - e·•'A x>0 (8-52) The inverse of F(x) is the function x = -A In (I - u). If u is uniform in (0, 1). then I - 260 CHAP. 8 THE MEANING OF STATISTICS F(x) JC 0 u (a) (b) Flpre 8.9 u is also uniform in (0, 1). Hence, the sequence x1 =-A In u1 bas an exponential distribution as in (8-52). • (8-53) Discrete Type RNs We wish to generate an RN sequence x1 taking them values a" with probabilities p 1" The corresponding distribution is a staircase function with discontinuities at the points a 1 < · · · < am (Fig. 8.9a); its inverse is a staircase function with discontinuities at the points c 1 < · · · < Cm = 1 where (Fig. 8.9b) c" = F(a~c) = Pt + · · · Pk k = I, • . . , m Applying (8-51) we obtain the following rule for generating x1: Set x1 = a, iff c~c-t s u; < c~c co= 0 (8-54) Example 8.7 (a) Binary RNs. The sequence 0 0 < u, < p {I p < U; < I takes the values 0 and I with probabilities p and I - p, respectively. (b) Decimal RNs. The sequence . k k+ I X; = k dT 10 < U; < 10 k = 0, I' . . . ' 9 X;= takes the values 0, I, . . . , 9 with equal probability. (c) Bernoulli RNs. If k = 0, I, . . . ,n then the sequence x 1 in (8-54) has a Bernoulli distribution with parameters n andp. SEC. 8-3 (d) Poisson RANDOM NUMBERS AND COMPliTER SIMUl.ATION RSs. If k then the sequence A. 261 .T; = 0. I •... in (8-54) has a Poisson distribution with parameter • FromF1 (X) toF.•(Y) We have an RN sequence X; with distribution Fx(x) and we wish to generate another RN sequence y; with distribution Fv(y). As we know lsee (8-40)], the sequence u; = F.,(x;) is uniform. Applying (8-51). we conclude that the numbers .V; = F;-: 1(u;) = F. 1IF,(x;)J (8-55) are the values of an RN sequence with distribution F,.(y) (Fig. 8.10). REJECTION METHOD The determination of the inverse x = F- 1(u) of a function F(x) by a computer is a difficult task involving the solution of the equation F(x) = u for every u. It is therefore desirable to avoid using the transformation method if the function F- 1(u) is not known or cannot be computed efficiently. We shall now develop a method for realizing an arbitrary F(x) that avoids the inversion problem. The method is based on the properties of conditional distributions and their empirical interpretation as averages of subsequences. 
Conditional Distributions Given an RV x and an event .M., we form the conditional distribution F.,<xi.M.> = P{x :s xi.M.} = P{x P~•.~) At} (8-56) [sec (6-8)]. The empirical interpretation of Fx<xi.M.> is the relative frequency of the occurrence of the event {x :s x} in the subsequence )'; = Xk, of trials in which the event .M. occurs. From this it follows that y1 is a sequence of RNs with distribution F 7{x} = P{y :s x} = f'.,(x;.-«.) (8-57) We shall use this result to generate a sequence)'; of RNS with distribution a given function. Flpre 8.10 ... X· II; I·~( X)= U _JL r+II· II F,.(y·c .v 262 CHAP. 8 THE MEANING OF STATISTICS • Rejection Theortm. Given an RV u uniform in the interval (0, I) and independent of x and a function rC.t) such that 0 :s r(x) :s I we form the event .M. = {u :s r(x)}. We maintain that .fx<xi.M.> = Z,f,(x)r(x) c = f.J,(x)r(x)dx (8-58) • Proof. The density of u equals I in the interval (0, I) by assumption; hence, .fx,.(x, u) = .fx(x )j,.(u) = .fx(x) O:su:sl As we know " I { dxl 11 } P{x < x :s x + dx, .M.} 8 5 Jx(X .M.) dx = P X < X :S X + .M. = P(.M.) ( - 9) The set of points on the xu plane such that u :s r( x) is the shaded region of Fig. 8.11. The event {x < x :s x + dx} consists of all outcomes such that x is the vertical strip (x, x + dx). The intersection of these two regions is the part of the vertical strip in the shaded region u :s r(x). In this region, .fx(x) is constant; hence, P{x :s x < x + dx, .M.} = .fx(x)r(x)dx P(.ttl = r. fx(x)r(x)dx =c Inserting into (8-59), we obtain (8-58). We shall use the rejection theorem to construct an RN sequence y; with a specified density /y(y). The function.J;.(y) is arbitrary, subject only to the mild condition that/y(x) = 0 for every x for which.fx(x) = 0. We form the function - /y(x) r(x)- a .fx(x) where a is a constant such that 0 :s r(x) :s I. Clearly, r. r(x).fx(x)dx =a r. J;.<x)dx =a F.....-e 8.11 II X dx SEC. 8-3 RANDOM NUMBERS AND COMPUTER SIMULATION 263 and (8-58) yields .fx<xi.M.) =J;.(x) , ={ ""' afv(x)} < u- f,(x) (8-60) From (8-60) it follows that if x; and u; are the samples of x and u, respectively, then the desired sequence y; is formed according to the following rejection rule: = x; 1.f Set Y; u; s h-<x;> a "( ) Jx Xi (8-61) Reject xi otherwise Example 8.8 We wish to generate a sequence Yi with a truncated normal distribution starting from a sequence x; with expontential distribution: /y(y) = 2 e·r'12 U(y) V'2,; /,(:c)= e·xu(xl In this problem, rc ) ~(~) With a i'2 = \; • ~· .•··"2 f2e = \1T e·'·' u:~ = v"iiife, (8-61) yields the following rejection rule: = X; if u1 < e·"· Reject x; otherwise Set Y; 11 :: • SPECIAL :\IF.THODS The preceding methods arc general. For specific distributions, faster and more accurate methods are available, some of them tricky. Here are several illustrations, starting with some simple observations. We shall usc superscripts to identify several RVs. Thus x 1, • • • , x"' are m RVS and x}, ••• , x7' are their samples. If x 1 is an RN sequence with distributionf(x), then any subsequent of x1 is an RN sequence with distribution j(x ). From this it follows that the RNS (8-62) X~ = Xmi-m• ~ • • • X7' = Xm; x} = Xmi-m·l 1 are the samples of the m i.i.d. RVs x , • • • , x"'; their distribution equals the functionf(x). Transformations If z is a function = g(x1, • • • , x"') (8-63) of the m RVS xk, then the numbers Zi = g(x J. . . . • x7') form an RN sequence z with distribution fr.(z). 
To generate an RN sequence Z; with distribution a given function fr.(z) it suffices, therefore, to find g such that the distribution of the resulting z equals /.(z). 264 CHAP. 8 THE MEANING OF STATISTICS Example 8.9 We wish to generate an RN sequence Z; with Gamma distribution: (8-64) /:(z) - zm·le-:'AU(z) x 1, • • • , We know (see Example 7.5) that if the RVS xm are i.i.d. with exponential distribution, their sum z = x 1 + · · ·- xm has a gamma distribution. From this it follows (see Example 8.6) that (8-65) • Example 8.10 (a) Chi-square RNs. We wish to generate an RN sequence z 1 with distribution x2(n): /:(z) - t''2-le·:-z If n = 2m, then this is a special case of (8-64) with A. = Ill -2 ~In 4; = ••• 2; hence, u: To realize z1 for n = 2m + I, we observe that ify is x2<2m), w is NCO, 1), and y and w are independent, then [see (7-87)] the sum z = y + w2 is x2(2m -t- I); hence, Ill -2 ~ In Z; = •=• u: . . (w1)2 where u~ are uniform RNS and w1 are RNS with normal distribution. (b) Student t RNs. If the RN sequences z1 and w; are independent with distributions N(O, I) and x2(n), respectively, then [see (7-101)] the ratio X1 - l; Vw;fn is an RN sequence with distribution t(n). (c) Snedecor RNs. If the RN sequences Z; and w1 are independent with distributions x 2(m) and xl(n), respectively, then [see (7-106)] the ratio Ztlm X;= W;fn is an RN sequence with distribution F(m, n). Example 8.11 • lh 1, • • • , x"' are m RVs taking the values 0 and I with probability p and q, respectively, their sum has a binomial distribution. From this it follows that if X; is a binary sequence as in Example 8. 7, then the sequence Z1 = X1 +· · ·+ has a binomial distribution. X 111 Z2 = Xm+l +•' ' + X2m • Mixina We wish to realize an RN sequence x1 with density a given function f(x). We assume that/(x) can be written as a sum /(x) = p,fj(x) + · · · + p,,.J,.(x) (8-66) SEC. 8-3 RANDOM NUMBERS AND COMPUTER SIMULATION 265 where Jk(x) are m densities and p 1 are m positive numbers such that p 1 + · · · + Pm = 1. We develop a method based on the assumption that we know the RN realizations x' of the densities Jk(x ). We introduce the constants co= 0 c:~c = P1 + · · · + P• k = I •... , m and form the events k=l. . . . ,m where u is an RV uniform in the interval (0, I) with samples u1• We maintain that the sequence x; can be realized by mixing the sequences x' according to the following rule: if (8-67) Set x1 = xf £'4 I !5 U; < £'4 • Proof. We must show that the density h(.lc) of the RN sequence x1 so constructed equals the functionf(x) in (8-66). The sequence x~ is a subsequence of the sequence x1 conditioned by the event .s4.4 ; hence, its density equals h<xlsf~c). This yields h(xi.<Ak) = fi.(x) because the distribution of xf equals Jk(x) by assumption. From the total probability theorem (6-8) it follows that h(X) = h(xisfi)P(.s4.1) + · · · + h(xls4m)P(s4.m) But P(sfk) = c1c- c1c-1 = p~c; hence,h(x) = /(x). Example 8.12 We wish to construct an sequence X; with a Laplace distribution I _ I _ I f(x) = 2 e x! = 2 e xu(x) + 2 e'U{-xl RN (Fig. 8.12). This is a special case of (8-66) with jj(x) = e·xu(x) fz(x) =jj(-xl P1..:.. P2 = .5 The functionsjj(x) andfz(x) are the densities of the avs -In v and In v respectively, where vis uniform in the interval (0, 1). lnsening into {8-67). we obtain the following rule: Set x1 = -In u; if 0 S II;< .5 Set x; = In u1 if .5 S II;< 1.0 • Figure 8.12 /{x) 0 }( 0 }( 266 CHAP. 
8 THE MEANING OF STATISTICS Normal RNs We discuss next some of the many methods for realizing a normal RN sequence Z;. The percentile-transformation method cannot be used because the normal distribution cannot be efficiently inverted. Rejection and Mixing. In Example 8.8 we constructed a sequence x; with density the truncated normal curve = ft(x) j; e-x:' U(x) 2 The normal density /(z) can be written as a sum I I /(z) = 2 ft(z) + 2ft(-z) Applying (8-67), we obtain the following rule for realizing z;: Set z; = x; if 0 s u, < .5 Set Z; = - x; if .5 s u; < 1.0 Polar Coordinates If the avs z and w are NCO, 1) and independent and z = r cos f!, w = r sin f!, then (see Problem 5-27) the avs rand f!, are independent, f!, is uniform in the interval (-17', 1r), and r has a Rayleigh density f,(r) = re-,.:12 Thus f!, = 1T(2 - u) where u is uniform in the interval (0, 1). From (4-78) it follows (see also Problem 5-27) that ifx has an exponential density e-xu(x) and r = vlx, then r has a Rayleigh density. Combining with (8-63), we conclude that if the sequences u; and v; are uniform and independent, the RN sequences ~; = V -2 In Vt cos 17'(2 - w; = Ut) V -2 In Vt sin 7T(2 - u;) are normal and independent. Central Limit Theonm If u 1, avs and m >> 1, then the sum ••• , u"' are m independent uniform z = u1 + · · · + u"' is approximately normal [see (7-66)]. In fact, the approximation is very good even if m is as small as 10. From this it follows that if ut = Umi-m+lt are m subsequences of a uniform RN sequence u1 as in (8-62), then the sequence Zt = Ut + · • • + u, Z2 = U, .. , + • · • + Ullfl is normal. Mixing In the computer literature, a number of elaborate algorithms have been proposed for the generation of normal RNs, based on the expansion of /(z) into a sum of simpler functions, as in (8-67). In Fig. 8.13, we show such an expansion. The· major part of /(z) consists of r rectangles ft, ... ,J,. These functions, properly normalized, are densities that can be SEC. 8-3 RANDOM NUMBF.RS AND COMPUTER SIMIJI.ATJON 267 h 0 ;: Figure 8.13 easily realized in terms of a uniform RN sequence. The remaining functions J, • ., . . . . fr+m are approximated by simpler curves: however, since their contribution is small, the approximation need not be very accurate. Pairs of RNs We conclude with a brief note on the problem of generating pairs of RNS (x;. y;) with a specified joint distribution. This problem can, in principle. be reduced to the problem of generating one-dimensional RNS if we usc the identity f(x, y) = f(x)f(ylx) The implementation. however. is not simple. The normal case is an exception. The joint normal density is specified in terms of five parameters:.,.., 'TIY• ux. u_,.. and p.. This leads to the following method for generating pairs of normal RN sequences. Suppose that the RVs z and w are independent MO. I) with samples z; and w; and (8-68) x = a 1z + b,w + c, y = a2z + b2w + c2 As we know, the Rvs x and y are jointly normal. and ui = ai + h1 u~ = a~ - b~ p. = a1a2 + b1b2 By a proper choice of the coefficients in (8-68), we can thus obtain two RVs x and y with an arbitrary normal distribution. The corresponding RN sequences arc x; = a,z; + b,w; + c, Y; = t12Z; + b~w; ..... c2 At the end of this section we discuss the generation of RNs with multinomial distributions. 'Tix = c, .,, = c2 The Monte Carlo Method A major application of the Monte Carlo method is the evaluation of multidimensional integrals using statistical methods. 
We shall discuss this important topic in the context of a one-dimensional integral. We assume-intro- 268 CHAP. 8 THE MEANING OF STATISTICS ducing suitable scaling and shifting, if necessary-that the interval of integration is (0, 1) and the integrand is between 0 and 1. Thus our problem is to estimate the integral I = I~ g(u)du where 0 s gCu) s 1 (8-69) Method 1 Suppose that u is an RV uniform in the interval (0, I) and x = g(u) is a function of u. As we know, I is the mean of the RV x = g(u): I = E{g(u)} = TJ~ Hence, our problem is to estimate the mean of x. To do so, we apply the estimation techniques introduced in Section 8-2 (which will be developed in detail in Chapter 9). However, instead of real data, we use as samples of u the computer-generated RNs u1 • This yields the empirical estimate 1 II 1 II (8-70) I::<-~ x1 =- ~ g(u;) n i-1 n jcl To evaluate the quality of this approximation, we shall interpret the RNs x 1 = g(u1) as the values of the samples x1 = gCu;) of the RV x = g(u). With 1 II x=-~x; n jcl we conclude from (8-8) that E{i} = TJ~ =I 2 U'i 2 u~ =n = Ut2 where u~ = E{x2} .,~ = I~ g2(u) du - [f~ g(u) du - r (8-71) Thus the average x in (8-70) is the point estimate of the unknown integral/. The corresponding interval estimate is obtained from (8-15), and it leads to the following conclusion: If we compute the average in (8-70) a large number of times, using a different set of RNS u1 each time, in 100(1 - a)% of the cases, the correct value of I will be between + ~.o12u.,!Vn and 1.oJ2UJ:IVn. x x- Method 2 Consider two independent RVs u and v, uniform in the interval (0, 1). Their joint density equals 1 in the square 0 s u < 1, 0 s v < 1; hence, the probability masses in the region v s g(u) (shaded in Fig. 8.14) equals I. From this it follows that the probability p = P($4) of the event s4 = {v s g(u)} equals p = P{v s g(u)} = I This leads to the following estimate of I. Form n independent pairs (u1 , v1) of uniform RNs. Count the number nsa of times that y 1 s g(u1). Use the approximation p = 1 1 o g( u) du nsa ::< - n SEC. 8-3 RANDOM NUMBERS AND COMPUTER SIMULATION 269 u u 0 •·igure 8.14 To find the variance of the estimate. we observe that the RV n31 has a binomial distribution: hence, the ratio n,;~ln is an RV with mean p = I and variance u~ = /(1 - /)ln. Note. finally. that , <T2- I (' u,2 =;; Jo g(u)du II- g(u)]du > 0 Thus the first method is more accurate. Funhermore, it requires only one RN sequence u1• COMPUTERS IN STATISTICS A computer is used in statistics in two fundamentally different ways. It is used to perform various computations and to store and analyze statistical data. This function involves mostly standard computer programs, and the fact that the problems originate in statistics is incidental. The second use entails the numerical solution of various problems that originate in statistics but are actually deterministic, by means of statistical methods; they are thus statistical applications of the Monte Carlo method. The underlying principle is a direct consequence of the empirical interpretation of probability: All parameters of a probability space are deterministic, they can be estimated in terms of data obtained from real experiments, and the same estimates can be used if the data are replaced by computer-generated RNs. Next we shall apply the Monte Carlo method to the problem of estimating the distribution F(x) and the percentiles xu of an RV x. 
Distributions To estimate the distribution F(x) = P{x s x} ofx for a specific x, we generate n RNS x1 with distribution F(x) and count the number n., of entries such that x1 s x. The desired estimate of F(x) [see (4-24)] is nx F(x) ""n (8-72) 270 CHAP. 8 THE MEANING OF STATISTICS This method is based on the assumption that we can generate the RNs We can do so ifx is a function of other Rvs with known distributions. The method is used even if F(x) is of known form but its evaluation is complicated or not tabulated or if access to existing tables is not convenient. X;. Example 8.13 We wish to estimate the values of a chi-square distribution with m degrees of freedom. As we know [see (7-89)], ifz 1, • • • • z"' are m independent N(O. I) avs, then the RV x = (z 1) 2 + · · · - (z"')~ is x2Cm). It therefore suffices to form m normal RN sequences zf. To do so, we generate a single normal sequence Z; and form msubsequences zt = z,.;-,..• as in (8-62). The sum x; = <z}>2 + · · · + <zi') 2 is an RN sequence with distribution x2(m). Using the sequence x; so generated, we estimate the chi-square distribution from (8-72). Another method for generating x; is discussed in Example 8.10. • Percentiles The u-percentile of a function F(u) is a number x, such that F(x,) = u Whereas F(x) is a probability, x, is not a probability; therefore, it cannot be estimated empirically. As in the inversion problem, it can be only approximated by trial and error: Select x and determine F(x); if F(x) < u, try a larger value for x; if F(x) > u, try a smaller value; continue until F(x) is close enough to u. In certain applications, the problem is not to find x,. but to establish whether a given number xis larger or smaller than the unknown X 11 • In such problems, we need not find x,. Indeed, from the monotonicity of F(x) it follows that (8-73) x>x,itTF(x)>u x < x,, itT F(x) < u It thus suffices to find F(x) and to compare it to the given value of u. We discuss next an important application. Computer Simuladon in Hypothesis Testing Suppose that q = g(x 1, • • • , x"') is a function of m RVs x'. We observe the values x' of these Rvs and form the corresponding value q = g(x 1, • • • , x"') of the RV q. We wish to establish whether the number q so obtained is between the percentiles qa and qp of q where a and /3 are two given numbers: a = Fq(qa) /3 = Fq(qfJ) (8-74) In hypothesis testing, q is called a test statistic and (8-74) is the null hypothesis. To solve this problem we must determine q0 and qfJ. This, however, can be avoided if we use (8-73): From the monotonicity of Fq(q) it follows that (8-74) is true iff (8-75) a< Fq(q) < /3 qa < q < qfJ SF.C. 8-3 RANDOM NUMBERS ANI> COMPUTER SIMULATION 271 To establish (8-74), it suffices therefore to establish (8-75). This involves the determination of Fq(q) where q is the value of q obtained from the experimental data x'. The function Fq(q) can be determined in principle in terms of the distribution of the RVs x'. This, however, might be a difficult task particularly if many parameters are involved. In such cases. a computer simulation is used based on (8-72): We start with m RN sequences xi simulating the samples of the RVs x' and form the RN sequence q; = l((x] •. . . , x:") i = 1. . . . . n (8-76) The numbers q; are the computer-generated samples of the RV q. Hence, their distribution is the unknown function Fq(q). We next count the number n" of samples such that q; s q and form the ratio n,fq. This ratio is the desired estimate of Fq(q). Thus. 
(8-75) is true iff (8-77) Note that we have here two kinds of rc1ndom numbers. The first consists of the data x' obtained from a physical experiment. These data are used to form the value q = g(x 1, ••• ,x'• . . . • .t"') of the test statistic q. The second consists of the computer-generated sequences xr. These sequences are used to determine the sequence q; in (8-76) and the value Fiq> of the distribution of q from (8-72). Example 8.14 We have a partition A = [.~ 1 • • • • , .'1l,.J consisting of them events :A, and we wish to test the hypothesis that the probabilities P(~,) of these events equal m given numbers p,. To do so, we perform the underlying physical experiment N times and we observe that the event :A, occurs J.:' times where k1 . . . . . . + k"' = N PI ...... T p"' = I Using the data k'. we form the sum q = i ,., (k' - Np,)~ (8-78) Np, Our objective is to establish whether the number q so obtained is smaller than the upercentile q,. of the RV q ~ = r-1 £J (k'- Np,) 2 Np, (8-79) This RV is called PearJon's test statistic, and the resulting test chi-square test (see Section 10-4). As we have explained. < q, iff F.,(ql < u (8-80) To solve the problem, it suffices therefore, to find F>4(q) and compare it to u. For large N, the RV q has approximately a x2(m - I) distribution tsee (10-63)). For moderate values of N, however, its determination is difficult. We shall find it using computer simulation. q 272 CHAP. 8 THE MEANING OF STATISTICS The avs k' have a multinomial distribution as in 13-41). It suffices therefore to generate m RN sequences k), · · · , k'{' k] .... · · · + k':' = N with such a distribution. We do so at the end of this section. Using the sequences kf so generated, we form the samples q; "' (kj- Np,)2 N = i~l L p, i = I. 2. . . . . n (8-81) of Pearson's test statistic q, and we denote by n9 the number of entries q; such that q; < q. The ratio nqln is the desired estimate of Fq(q). Thus (8-80) is true itT nqlq < u. Note that q is a number determined from (8-78) in terms of the experimental data k', and q; is a sequence determined numerically from (8-81) in terms of the computer-generated RNs kj. • RNS with Multinomial Distribution In the test of Example 8.13 we made use of the multinomial vector sequence Kt = [k}, ••. , kf') This sequence forms the samples of m multinomially distributed avs K = [k 1, • • • , k"') of order N. To carry out the test, we must generate such a sequence. This can be done as follows: Starting with a sequence u; of RNs uniformly distributed in the interval (0, I), we form N subsequences (8-82) U; = [u}, . . . , u{, . . . , ufJ i = I, 2, . . . as iri (8-62). These sequences are the samples of the i.i.d. avs u1, • •• , ui, ... , u.-.· From this it follows that P{p, + · · · + p,. 1 sui< Pt + · · · + p,} = p, (8-84) The vector U; in (8-82) consists of N components. We denote by ki the number of components u{ such that for a specific i + · · · + p,_ t < u{ < Pt + · · · + Pr-t + p, (8-85) Comparing with (8-84), we conclude after some thought, that the sequence Pt k}' ... ' /(;, .•. ' k'!' so generated has a multinomial distribution of order N as in (3-41). g _ _ __ Estimation Estimation is a fundamental discipline dealing with the specification of a probabilistic model in terms of observations of real data. The underlying theory is used not only in parameter estimation but also in most areas of statistics, including hypothesis testing. The development in this chapter consists of two parts. 
In the first part (Sections 9-1 to 9-4), we introduce the notion of estimation and develop various techniques involving the commonly used parameters: means, variances, probabilities, and distributions. In the last part (Sections 9-5 and 9-6), we develop general methods of estimation. We establish the Rao-Cramer bound and introduce the notions of efficiency, sufficiency, and completeness.

9-1 General Concepts

Suppose that the distribution of an RV x is a function F(x, θ) of known form depending on an unknown parameter θ, scalar or vector. Parameter estimation is the problem of estimating θ. To solve this problem, we repeat the underlying physical experiment n times and denote by x_i the observed values of the RV x. We shall find a point estimate and an interval estimate of θ in terms of these observations.

A (point) estimate is a function θ̂ = g(X) of the observation vector X = [x₁, . . . , x_n]. Denoting by X = [x₁, . . . , x_n] the sample vector of x, we form the RV θ̂ = g(X). This RV is called the (point) estimator of θ. A statistic is a function of the sample vector X; thus an estimator is a statistic.

We shall say that θ̂ is an unbiased estimator of θ if E{θ̂} = θ; otherwise, θ̂ is called biased, and the difference E{θ̂} − θ is its bias. In general, the estimation error θ̂ − θ decreases as n increases. If it tends to 0 in probability (see Section 7-3) as n → ∞, then θ̂ is a consistent estimator. The sample mean x̄ of x is an unbiased estimator of its mean η = E{x}, and its variance equals σ²/n; hence, E{(x̄ − η)²} = σ²/n → 0. From this it follows that x̄ tends to η in the MS sense, and therefore also in probability. In other words, x̄ is a consistent estimator of η.

In parameter estimation, it is desirable to keep the error θ̂ − θ small in some sense. This requirement leads to a search for a statistic g(X) whose density is concentrated near the unknown θ. The optimum choice of g(X) depends on the error criterion. If we use the LMS criterion, the optimum θ̂ is called best.

• Definition. A statistic θ̂ = g(X) is the best estimator of θ if the function g(X) is so chosen as to minimize the MS error

e = E{[θ − g(X)]²} = ∫···∫ [θ − g(X)]² f(x₁, θ) ··· f(x_n, θ) dx₁ ··· dx_n    (9-1)

Unlike the nonlinear prediction problem (see Section 6-3), the problem of determining the best estimator does not have a simple solution. The reason is that the unknown in (9-1) is not only the function g(X) but also the parameter θ. In Section 9-6, we determine best estimators for certain classes of distributions. The results, however, are primarily of theoretical interest. For most applications, θ̂ is expressed as the empirical estimate of the mean of some function of x. This approach is simple and in many cases leads to best estimates. For example, we show in Section 9-6 that if x is normal, then its sample mean x̄ is the best estimator of η.

An interval estimate of a parameter θ is an interval of the form (θ₁, θ₂), where θ₁ = g₁(X) and θ₂ = g₂(X) are functions of the observation vector X. The corresponding interval (θ̂₁, θ̂₂) formed from the sample vector is the interval estimator of θ. Thus an interval estimator is a random interval, that is, an interval whose endpoints are the two statistics θ̂₁ = g₁(X) and θ̂₂ = g₂(X).

• Definition. We shall say that (θ₁, θ₂) is a γ-confidence interval of θ if

P{θ₁ < θ < θ₂} = γ    (9-2)

where γ is a given constant, called the confidence coefficient. The difference δ = 1 − γ is the confidence level of the estimate. (A short simulation sketch of the frequency interpretation of (9-2) follows.)
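The defining property (9-2) has a simple frequency interpretation: if the experiment of drawing n samples is repeated many times and the interval is recomputed each time, it covers the true parameter in roughly 100γ% of the repetitions. The following sketch illustrates this for the normal-mean interval x̄ ± z_u σ/√n constructed in Section 9-2. It is an illustration only; the values of η, σ, n, and γ are assumptions, not data from the text.

```python
import numpy as np
from scipy.stats import norm

# Coverage of the gamma-confidence interval for a normal mean with known sigma
# (the interval x_bar +/- z_u*sigma/sqrt(n) of Section 9-2).  All numbers below
# are illustrative assumptions, not values from the text.
rng = np.random.default_rng(0)
eta, sigma, n, gamma = 10.0, 2.0, 25, 0.95
z_u = norm.ppf((1 + gamma) / 2)          # u = (1 + gamma)/2

runs, hits = 10_000, 0
for _ in range(runs):
    x = rng.normal(eta, sigma, n)        # one set of n observations
    half = z_u * sigma / np.sqrt(n)
    hits += (x.mean() - half < eta < x.mean() + half)

print(hits / runs)                       # close to gamma = 0.95
```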
The statistics 8 1 and fh are called C()njidence limits. SF.C. 9-2 EXPECTED VALUES 275 If 'Y is a number close to I, we can expect with near certainty that the unknown 8 is in the interval (8 1 , 82 ). This expectation is correct in 1()()-y% of the cases. Part~meter Trt~n¥ormation Suppose that (8,, 8~) is a 'Y confidence interval of a parameter 8 and q(8) is a function of 8. This function specifies the parameter T = q(8). We maintain that if q(8) is a monotonically increasing function of 8, the statistics .;., = q( 8,) and 1-:! = q( 8:?> are the 'Y confidence limits ofT. Indeed, the events {81 < 8 < ~}and {T1 < T < ~}arc equal; hence, P{ t, < T < ~} = P{ 8, < 8 < 8:!} = 'Y (9-3) If q(8) is monotonically decreasing, the corresponding interval is (f2 • f,). The objective of interval estimation is to find two statistics 81 and ~ such as to minimize in some sense the length 8:! - 81 of the estimation interval subject to the constraint (9-2). This problem does not have a simple solution. In the applications of this chapter, the statistics 81 and ~ are expressed in terms of various point estimators with known distributions. The results involve percentiles of the normal. the chi-square. the Student t and the Snedecor F distributions introduced in Section 7-4 and tabulated at the back of the book. 9-2 Expected Values We start with the estimation of the mean., of an point estimate the average - I RV x. We shall use as its II X=-n LX; ;~I of the observations x; and as interval estimate the interval (i- a, i +a). To find a, we need to know the distribution of the sample mean i ofx. We shall assume that i is a normal RV. This assumption is true if x is normal; it is approximately true in general if n is sufficiently large (central limit theorem). Suppose first that the variance of x is known. From the normality assumption it follows as in (8-1 5) that (9-4) where (Fig. 9.1) I + 'Y 3 u=--=1-- 2 2 276 CHAP. 9 ESTIMATION u = 1-! 2 'Y = 2u- l x- -z - 0 x+z.!!... "..rn "..rn Figure 9.1 Unless otherwise stated, it will be assumed that u and 'Yare so related. Equation (9-4) shows that the interval (9-5) is a 'Y confidence interval of 71; in fact, it is the smallest such interval. Thus to find the 'Y confidence interval of the mean 71 of an RV x, we proceed as follows: Observe the samples x; of x and form their average x. 2. Select a number 'Y = 2u - I close to 1. 3. Find the percentile z, of the normal distribution. 4. Form the interval i ± z,u/Vn. I. As in the prediction problem, the choice of the confidence coefficient 'Y is dictated by two conflicting requirements: If 'Y is close to 1, the estimate is reliable, but the size 2z,u/Vn of the confidence interval is large; if 'Y is reduced, z, is reduced, but the estimate is less reliable. The final choice is a compromise based on the applications. The commonly used values of 'Y are .9, .95, .99, and .999. The corresponding values of u are .925, .975, .995, and .9995 yielding the percentiles (Table 1) Z.m = 1.440 Z.97S = 1.967 Z.99S = 2.576 Z.999S = 3.291 Note that Z.97s == 2. This leads to the slightly conservative estimate x± ~ Example 9.1 for 'Y = .95 (9-6) We wish to estimate the weight w of a given object. The error of the available scale is an N(O, u) av ., with u = 0.06 oz. Thus the scale readings are the samples of the av ll = w + "· SEC. 9-2 EXPECTED VALUES 277 i+z ..!!.. "..;n (a) l'igure 9.2 (a) We weigh the object four times, and the results are 16.02, 16.09, 16.13, and 16.16 oz. Their average = 16.10 is the point estimate of w. 
The .95 confidence interval is obtained from (9-6): x - X± 2u Vn = 16.10 ± 0.06 (b) We wish to obtain the confidence interval i ± 0.02. How many times must we weigh the object? Again using (9-6), we obtain 2u/Vn = 0.02; hence, n = 36. • One-Sided Intervals In certain applications, the objective is to establish whether the unknown 11 is larger or smaller than some constant. The resulting confidence intervals arc called one-sided. As we know (Fig. 9.2a), P{i < 11 + a} = G (u!Vn) = 'Y if a= \1n Hence, p { 11 > i - ~} = 'Y (9-7) This leads to the right 'Y confidence interval., > i - zp!Vn. Similarly, the formula (Fig. 9.2b) P {., <i - \?n} = 'Y = I - 8 (9-8) leads to the left 'Y confidence interval.,< i + Zyu!Vn because Z3 = - Zy· For these estimates we use the percentiles Z.9 = 1.282 Z.9S = 1.645 Z.99 = 2.326 Z.999 = 3.090 Example 9.2 A drug contains a harmful substance H. We analyze four samJ11es and find that the amount of H per gallon equals .41, .46, .SO and .54 oz. The analysis error is an N(O, u) av with u = 0.05 oz. On the basis of these observations. we wish to state with confidence coefficient .99 that the amount of H per gallon does not exceed c. Find c. 278 CHAP. 9 ESTIMATION In this problem, i =- .478 u = 0.05 n =4 z. 99 = 2.326 and (9-8) yields c = i + z.,u/Vn = 0.536 oz. Thus we can state with confidence coefficient .99 that on the basis of the measurements, the amount of II does not exceed 0.54 oz. • Tcbebycbeff Inequality If the RV x is not normal and the number n of observations is not large, we cannot use (9-4) because then i is not normal. To find a confidence interval for.,, we must find first the distribution of i. This involves the evaluation of n - I convolutions. To avoid this difficulty, we use Tchebycheff's inequality. Setting k = I IV'S in (4-115) and replacing x by i and u by u/Vn, we obtain (9-9) I Thus the interval x ± u/vna contains the 'Y confidence interval of '11· If we therefore claim that 11 is in this interval, the probability that our claim is wrong will not exceed a regardless of the form of F(x) or the size n of the given sample. Note that if')'= .95, then 1/Ya = 4.47 and (9-9) yields the interval x::!: 4.47/Yn. Under the normality assumption, the corresponding interval is.i::!: 2/Yn. UNKNOWN VARIANCE If u is unknown, we cannot use (9-4). To find an interval estimate of.,, we introduce the sample variance: s2 I " = --2 (X;- x)2 n- I i=l As we know [see (7-99)], the RV s2 is an unbiased estimator of u 2 and its variance tends to 0 as n - :lO; hence, s == u for large n. Inserting this approximation into (9-5), we obtain the estimate ZuS ZuS x--<.,<x+- Vii v'ii (9-10) This is satisfactory for n > 30. For smaller values of n, the probability that 11 is in the interval is somewhat smaller than 'Y· In other words, the exact 'Y confidence interval is larger than (9-10). To determine the exact interval estimate of.,, we assume that the RV x is normal and form the ratio i-., s/Vn Under the normality assumption, this ratio has a Student t distribution with SEC. 9-2 EXPf.CTED VALUES 279 n - I degrees of freedom [see (7-104)]. Denoting by t, its percentile, we obtain P { -t, < With 2u - I = "Y =I - ~~~ < t,} = 2u - I a, this yields t,s _ t,s } P {x-:r-<'I'J<x+-v: =-y \n VII (9-11) Hence, the exact "Y confidence interval of '11 (Fig. 9.3) is _ t,s _ t,s x--<'I'J<x~- Vn Vn (9-12) For n > 20, the t(n) distribution is approximately N('I'J. u) with variance u 2 = nl(n- 2) [see (7-103)). 
This yields t,(n) == Zu "' \j n _ 2 for n > 20 The determination of the interval estimate of '11 when u is unknown involves the following steps. I. 2. 3. Observe the samples x; of x, and form the !\ample mean .~ and the sample variance sz. Select a number "Y = 2u - I close to I. Find the percentile 1,, of the t(n - I) distribution. Form the interval i :t t,.~!Vn. Figure 9.3 280 CHAP. 9 Example 9.3 ESTIMATION We wish to estimate the mean '1'1 of the diameter of rods coming out of a production line. We measure 10 units and obtain the following readings in millimeters: 10.23 10.22 10.15 10.:!3 10.26 10.15 10.26 10.19 10.14 10.17 Assuming normality, find the .99 confidence interval of '1'1· In this problem, n = 10, t. 99s(9) = 3.25 _ I to ~ Xt = 10.2 10 i=l Inserting into (9-11), we obtain the interval 10.2 :t. 0.05 mm. • X = - Reasoning similarly, we conclude that the one-sided confidence intervals are given by (9-8) provided that we replace z, by t, and cr by s. This yields (9-13) Example 9.4 A manufacturer introduces a new type of electric bulb. He wishes to claim with confidence coefficient .95 that the mean time to failure '1'1 = exceeds c days. For this purpose, he tests 20 bulbs and observes the following times to failure: x ~ M ~ ~ ~ N ~ ~ ~ 83 10 79 85 81 84 73 91 73 Assuming that the time to failure of the bulbs is normal, find c. In this problem, X I I 20 = 20 ~ X; = 80.05 20 s2 = - ~ 19 t~t (Xi - n n X)2 = 29.74 Thus s and (9-13) yields c = 5.454 n = 20 = x- t.95 s!Vn = 77.94. t.9s(l91 = 2.09 • In a number of applications, the distribution of x is specified in terms of a single parameter 8. In such cases, the mean 71 and the variance cr 2 of x can be expressed in terms of 8; hence, they are functionally related. Thus we cannot use the preceding results. To estimate .,, we must develop special techniques for each case. We illustrate with two examples. We assume in both cases that n is large. This leads to the assumption that the sample mean i is normal. MEAN-DEPENDENT VARIANCE Exponential Distribution Suppose first that 1 11 f(x) =- e-·' U(x) A (fig. 9.4). As we know, 71 =A and cr =A= "1· From this and the normality assumption it follows that the RV i is normal with mean A and variance A2/n. SEC. 9-2 EXPECTED VALUES 281 fx(X) I ~ 0 X Figure 9.4 We shall use this observation to find they confidence interval of A. From = u = A. that (9-4) it follows with 71 ZuA ZuA } P {-x-\l,;<A<x+Vn =y Rearranging terms, we obtain i i } p { I + ZuiVn < A < I - zuf'Vn = 'Y (9-14) and the interval X y = 2u - I ± ZuVn results. Example 9.5 Suppose that the duration of telephone calls in a certain area is an exponentially distributed RV with mean.,. We monitor 100 calls and find that the average duration is 1.8 minutes. (a) find the .95 confidence interval of.,. With i = 1.8 and z,. == 2, (9-14) yields - X I ± 2/Vn -=(1~ .,.,~) ... ~·~- (b) In this problem, F(x) = I - e·x'"; hence, the probability p that a telephone call lasts more than 2.5 minutes equals p = I - F(2.51 Find the .95 confidence interval of p. = (' · 2 ·~'" The number p is a monotonically increasing function of the parameter .,. We can therefore use (9-3). In our case, the confidence limits of ., are 1.5 and 2.25; 282 CHAP. 9 ESTIMATION hence, the corresponding limits of p are Pt = e-2.S;u = .19 P2 = e-2.~1225 = .33 We can thus claim with confidence coefficient .95 that the percentage of calls lasting more than 2.5 minutes is between 19% and 33%. • Poisson Distribution Consider next a Poisson-distributed ter A P{x A" = k} = e-A k! 
k = 0, RV with parame- I, . . . In this case [see (4-110)],., =A. and u 2 = A.= 71; hence, the sample mean i of x is approximately N(A, VX). This approximation holds, of course, only for the distribution of x. Since x is of the discrete type, its density consists of points. With u = VX, (9-4) yields P {x- z,. ~<A< i + z,. ~} = 'Y (9-15> Unlike (9-14), this does not lead readily to an interval estimate. To find such an estimate, we note that (9-15) can be written in the form P {<A- i)2 <~A} = 'Y = 2u- I (9-16) The points (.i, A) that satisfy the inequality are in the interior of the parabola , (A - .i)2 = z; A. n (9-17) From this it follows that the 'Y confidence interval of A is the vertical segment (A 1, A2) of Fig. 9.5. The endpoints of this interval are the roots of the quadratic (9-17). FIID'e9.5 SF.C. 9-2 EXPECTED VALUES 283 Note that in this problem. the normal approximation holds even for moderate values of n provided that nA. > 25. The reason is that the av ni = x 1 + · · · + x,. is Poisson-distributed with parameter nA.: hence, the normality approximation is based on the size of nA.. Example 9.6 The number of monthly fatal accidents in a region is a Poisson RV with parameter A. In a 12-month period. the reported accidents per month wc:rc 4 2 5 2 7 3 I 6 3 8 4 5 Find the .95 confidence interval of A. In this problem. i = 4.25. 11 = 12. and :.u ,.. 2. Inserting into 19-17). we obtain the equation . 4 (A - 4.25)· A 12 The roots of this equation are A1 ...: 3.23. A~ = 5.59. We can therefore claim with confidence coefficient .95 that the mean number of monthly accidents is between 3.23 and 5.59. • Probabilities We wish to estimate the probability p = P(sl.) of an event .<A. To do so, we repeat the experiment n times and denote by k the number of successes of st. The ratio fi = kin is the point estimate of p. This estimate tends top in probability as n - x (law of large numbers). The problem of estimating an interval estimate of p is equivalent to the problem of estimating the mean of an av x with mean-dependent variance. We introduce the zero-one av associated with the event .<A. This RV takes the values I and 0 with P{x = I} P{x = 0} = p =q = I - p Hence, 1h =p U7' = pq The sample mean .'i or x equals kin. furthermore. TJr = p. u} = pqln. as in (8-8). Largen The RV i is of discrete type. taking the values kin. For large n. its distribution approaches a normal distribution with mean p and variance pqln. It therefore follows from (9-4) that (9-18) We cannot use this expression directly to find an interval estimate of the unknown p because p appears in the variance term pqln. To avoid this difficulty. we introduce the approximation pq = 114. This yields the interval - k (9-19) x=n 284 CHAP. 9 ESTIMATION Fipre 9.6 This approximation is too conservative because p(l - p) s 1/4, and it is tolerable only if pis close to 1/2. We mention it because it is used sometimes. The unknown p is close to kin; therefore. a better approximation results if we set p ""'i in the variance term of (9-18). This yields the interval y=2u-l (9-20) for the unknown parameter p. We shall now find an exact interval. To do so, we write (9-18) in the form (9-21) The points (i, p) that satisfy the inequality are in the interior of the ellipse - p) .'f = I! (9-22) n n From this it follows that the y-confidence interval of p is the vertical segment (p 1 , p 2 ) of Fig. 9.6. The endpoints of this segment are the roots of the quadratic (9-22). (p _ .i)2 = Example 9.7 z! 
p(l In a local poll, 400 women were asked whether they favor abortion; 240 said yes. Find the .9.5 confidence interval of the probability p that women favor abortion. In this problem, i = .6, n = 400, and z, == 2. The exact confidence limits p 1 and p 2 are the roots of the quadratic I (p - .6)2 = 100 p(l - p) Solving, we obtain p 1 = ..5.50, p 2 = .647. This result is usually phrased as follows: "Sixty percent of all women favor abortion. The margin of error is ±.5%." The SEC. 9-2 EXPECTED VALUES 285 approximations (9-19) and (9-20) yield the intervals x- -+ - -•.2:r. . : . .6-+ . I 2vll .t ~ -:-r:I .vr=:-·. .r-o · .l'1 -· .6 +: .049 VII • Note that (9-18) can be used to predkt the number k of successes of s4 if pis known. It leads to the conclusion that. with confidence coefficient 'Y· the ratio x = kl tr is the interval vp(. _ p) k {p(l-:.. p) (9-23) p - z,. tr <;; < p -t z,. 'J-n-This interval is the horizontal segment Example 9.8 (Xt. x~) of Fig. 9.6. We receive a box of 100 fuses. We know that the probability that a fuse is defective equals .2. Find the .95 confidence interval of the number k = ni of the good fuses. In this problem. p = .8. Zu == 2. n = 100. and (9-23) yields the interval np ::!: z,. v'np(l - p) = 80 ::!: 8 (9-24) Thus the number of good fuses is between 72 and 88. • Small n For small n, the determination of the confidence interval of p is conceptually and computationally more difficult. We shall first solve the prediction problem. We assume that p is known, and we wish to find the smallest interval (k 1 , k2 ) such that if we predict that the number k of successes of .s4 is between k 1 and k2 , our prediction will be correct in l()()y% of the cases. The number k is the value of the RV k = ni = Xt + · · · x., (9-25) This RV has a binomial distribution; hence, our problem is to find an interval (kt, k2 ) of minimum size such that = P{k1 s k s k2} = 2•. (k) p 4q"- 4 = I (9-26) - 6 n This problem can be solved by trial and error. A simpler solution is obtained if we search for the largest integer k 1 and the smallest integer k2 such that 'Y 4 L' •·k, L . 6 " ( n ) pkq" k < _ 6 <_ (9-27) k 2 k-42 k 2 To find k1 for a specific p, we add terms on the left side starting from k = 0 until we reach the largest k 1 for which the sum is less than 6/2. Repeating this for every p from 0 to I, we obtain a staircase function k1(p), shown in Fig. 9.7 as a smooth curve. The function k2(p) is determined similarly. The functions k1(p) and k2(p) depend on nand on the confidence coefficient 'Y· In Fig. 9.8, we show them for n = 10, 20. SO, 100 and for 'Y = .95 and .99 using as horizontal variable the ratio kin. The curves k 1(p)/n and k2(p)ln approach the two branches of the ellipse of Fig. 9.6. 4=0 ( n ) p•q"-k 286 CHAP. 9 ESTIMATION 0 Figure 9.7 Example 9.9 A fair coin is tossed 20 times. Find the .95 confidence interval of the number k of heads. The intersection points of the line p = .SS with then = 20 curves of Fig. 9.8 are k 1/n = .26 and k2/n = .74. This yields the interval 6 !S k s 14. • We turn now to the problem of estimating the confidence interval of p. From the foregoing discussion it follows that P{k,(p) s; k s; k2(p)} = 'Y (9-28) Fipre 9.8 'Y = 0.95 'Y = 0.99 :!0 ~0 10 10 10 20 20 k n k n SEC. 9-2 EXPECTED VAI.UES 287 This shows that the set of points <k. p) that satisfy the inequality is the shaded region of Fig. 9.7 between the curves k1(p) and k 2(p); hence, for a specific k, the 'Y confidence interval of p is the vertical segment (p 1 • p 2 ) between these curves. 
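For small n, the curves k₁(p) and k₂(p) of (9-27) can be tabulated directly from the binomial distribution instead of being read off Fig. 9.8. The sketch below is one way to do this (an illustration, not the book's tables): it scans a grid of p values, computes k₁(p) and k₂(p) for each, and keeps the values of p for which k₁(p) ≤ k ≤ k₂(p). With n = 50 and k = 10 it should reproduce approximately the interval read off the curves in Example 9.10. The grid resolution is an arbitrary choice.

```python
import numpy as np
from scipy.stats import binom

# Small-n confidence interval for p, following the construction (9-27)-(9-28):
# for each p, find the largest k1 with P{k <= k1} < delta/2 and the smallest k2
# with P{k >= k2} < delta/2, and keep the p's for which k1 <= k <= k2.
def p_interval(n, k, gamma=0.95, grid=2001):
    half = (1 - gamma) / 2
    ks = np.arange(n + 1)
    keep = []
    for p in np.linspace(1e-6, 1 - 1e-6, grid):
        cdf = binom.cdf(ks, n, p)
        below = np.nonzero(cdf < half)[0]
        k1 = below[-1] if below.size else -1          # k1(p)
        sf = 1 - np.concatenate(([0.0], cdf[:-1]))    # P{k >= j}, j = 0..n
        above = np.nonzero(sf < half)[0]
        k2 = above[0] if above.size else n + 1        # k2(p)
        if k1 <= k <= k2:
            keep.append(p)
    return keep[0], keep[-1]

print(p_interval(50, 10))    # roughly (0.10, 0.34); compare Example 9.10
```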
Example 9.10 We examine" = 50 units out of a production line and find that k = 10 are defective. Find the .95 confidence interval of the probability p that a unit is defective. The intersection of the line i =kin = .2 with then = 50 curves of Fig. 9.8 yields p 1 = .I and p 2 = .33; hence •. I < p < .33. If we use the approximation (9-20), we obtain the interval .12 < p < .28. • Bayesian Estimation In Bayesian estimation, the unknown parameter is the value 9 of an RV 8 with known density /,(9). The available information about the RV x is its conditional density ,h(xl9). This is a function of known form depending on 9, and the problem is to predict the RV 8, that is. to find a point and an interval estimate of 9 in terms of the observed values x; of x. Thus the problem of Bayesian estimation is equivalent to the prediction problem considered in Sections 6-3 and 8-2. We review the results. In the absence of any observations, the LMS estimate iJ of 8 is its mean £{8}. To improve the estimate. we form the average .f of then observations x;. Our problem now is to find a function c/>(.f) so as to minimize the MS error E{l8- c/>(i)J2}. We have shown in (6-54) that the optimum c/>(.f) is the conditional mean fJ = £{81 i} = f~ OJ4(tJ l.f) d9 (9-29) I The function f,(9l.t> is the conditional density of 9 assuming i = i, called posterior density. From Bayes' formula (6-32) it follows (Fig. 9.9) that 1 /,(9 I i) = -y.fi<il9)/,(9) 'Y = - (9-30) /;:(:c) The unconditional density ,h-(x) equals the integral of the productf,(il9)/8 (9). Thus to find the LMS Bayesian point estimate of 9, it suffices to determine the posterior density of 8 from (9-30) and to evaluate the integral in (9-29). The function .fx-(il9) can be expressed in terms of fx(xi9). For large n. it is approximately normal. The 'Y confidence interval of 9 is an interval (9 - a 1 , 9 - a2) of minimum length a 1 + a2 such that the area of / 11(91i) in this interval equals 'Y P{fJ - a, < 8 < fJ + a2li} = 'Y (9-31) These results hold also for discrete type RVs provided that the relevant densities are replaced by point densities and the corresponding integrals by sums; example 9-12 is an illustration. 288 CHAP. 9 ESTIMATION prior X likelihood = posterior 9 Fipre9.9 Example 9.11 The diameter of rods coming out of a production line is a normal RV with density = _ I _ e· IB-Bu~l2no (9-32) uov'2W We receive one rod and wish to estimate its diameter 8. In the absence of any measurements, the best estimate of 8 is the mean 80 of 6. To improve the estimate. we measure the rod n times and obtain the samples x, =- 8 + II; of the RV x = 8 + "· We assume that the measurement error" is an N(O. u) RV. From this it follows that the conditional density of x assuming 6 = 8 is N(8, ul and the density of its sample mean i is N(8, u/Vn). Thus /,(8) J.r<il8) = I t' ·mi-s.=rza: (9-33) uVf.iiin Inserting (2-28) and (2-29) into (2-26) we conclude omitting the fussy details that the posterior density of j,(8li) of 6 is a normal curve with mean fJ and standard deviation (see Fig. 9.9) where u nu 2 ---..-u~> - u! 1 u! + u 21n , ...." d- 2 nd- 2 8 = 2 80 + - 2 X • 0' 0' , ...." (9-34) X This shows that the Bayesian estimate fJ of 8 is the weighted average of the prior estimate 80 and the classical estimate i. Furthermore, as n increases, (J tends to i. Thus, as the number of measurements increases, the effect of the prior becomes negligible. 
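A small numerical sketch of (9-34) may help fix ideas; the prior parameters, the measurement error, and the sample mean below are assumptions chosen only to illustrate the formulas. It computes the posterior mean and standard deviation and the resulting Bayesian γ interval, and shows the interval approaching the classical x̄ ± z_u σ/√n as n grows.

```python
import numpy as np
from scipy.stats import norm

# Bayesian point and interval estimate of theta from (9-34)-(9-35).  The prior
# (theta0, sigma0), the measurement s.d. sigma, and the sample mean xbar are
# assumptions used only for illustration.
def posterior(theta0, sigma0, sigma, xbar, n):
    w = n * sigma0**2 / (n * sigma0**2 + sigma**2)       # weight given to the data
    theta_hat = (1 - w) * theta0 + w * xbar              # weighted average (9-34)
    sigma_hat = np.sqrt((sigma0**2 * sigma**2 / n) / (sigma0**2 + sigma**2 / n))
    return theta_hat, sigma_hat

theta0, sigma0, sigma, xbar, gamma = 10.0, 0.10, 0.06, 10.08, 0.95
z = norm.ppf((1 + gamma) / 2)
for n in (1, 4, 100):
    th, sd = posterior(theta0, sigma0, sigma, xbar, n)
    print(n, th - z * sd, th + z * sd)    # approaches xbar +/- z*sigma/sqrt(n) as n grows
```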
From (8-9) and the normality of the posterior density f 1(8li) it follows that P{fJ - uz., < 6 < fJ + uz.,li} = 'Y = 2u - I (9-35) This shows that the Bayesian 'Y confidence interval of 8 is the interval (J ~ z.,U The constants fJ and u are obtained from (9-34). This interval tends to the classical interval i ± z.,u/Vn as n - ao. • SEC. 9-2 EXPECTED VALUES 289 Bayesian Estimation of Probabilities In Bayesian estimation, the unknown probability of an event sf. is the vaJue p of an RV p with known density /p(p). The available information is the observed value k of the number k of successes of the event sf. in n trials, and the problem is to find the estimate of p in terms of k. This problem was discussed in Section 6-1. We reexamine it here in the context of Rvs. In the absence of any observation, the LMS estimate p of p is its mean E{p} = J: p/p(p)dp (9-36) in agreement with (6-15). The updated estimate p of pis the conditional mean p= E{plk} = f~ pf,Cpllddp (9-37) as in (6-21). To find p, it thus suffices to find the posterior density f,,(plk). With .-:A. = {k = k} it follows from (6-14) that f,(plk) = yP{k = klp}f,(p) where P{k =kip}= P{k = k} (~) p4q" = f~ P{k = 4 'Y = P{k = k} (9-38) and kip}/p(p) dp ( = f~ ~) p4q" 4f,(p) dp Thus to find the Bayesian estimate p of p, we determine f,,(plk) from (9-38) and insert it into (9-37). The result agrees with (6-21) because the term ( ~) cancels. Example 9.12 We have two coins. The first is fair, and the second is loaded with P{h} = .35. We pick one of the coins at random and toss it 10 times. Heads shows 6 times. Find the estimate fJ of the probability p of heads. In this problem, p takes the values p 1 ""' .5 and p~ = .35 with probability 1/2; hence, the prior density f,(p) consists of two points as in Fig. 9.10a, and the prior l'igure 9.10 fp(PI .75 .5 .5 .15 0 0.35 0.5 p (a) 0 0.35 0.5 p (b) 290 CHAP. 9 ESTIMATION estimate of p equals p,l2 that P{k ~ p 212 = 6} = ( ~) ( = .425. To find the posterior estimate p. we observe 4 2~ 0 4 X - X .356 Inserting into (9-38) and canceling the factor ( ~) = ( ~) X ~0 ) X .00135 we obtain p~qt { .15 i =I = .00135 == .25 i = 2 Thus the posterior density of p consists of two points as in Fig. 9.10b, and p = . 15 X .5 "'" .25 X .35 = .4625 r( Jp P; lk) • Difference of Two Means We consider, finally, the problem of estimating the difference Tlx -.,,of the means Tlx and TIY of two RVs x andy defined on an experiment~. It appears that this problem is equivalent to the problem of estimating the mean of the RV w=x-y considered earlier. The problem can be so interpreted only if the experiment ~ is repeated n times and at each trial both Rvs are observed yielding the pairs (Fig. 9.lla) (x,, y,), (x2, .Y2), ••• , (x,., y,.) (9-39) The corresponding samples of w are the n numbers w; = x; - y;. The point estimate of the mean .,,.. = Tlx - .,,. of w is the difference w= x - y, and the corresponding 'Y confidence interval equals [see (9-5)] - _ zp,.. X - Y - Yn < Tlx - Tly < - - X - )' zp,.. Yn + (9-40) . t he vanance . where cr,..2-- cr .•2 + cr.v2 - 2P.x,. IS of w. This approach can be used only if the available observations are paired samples as in (9-39). In a number of applications, the two Rvs must be sampled sequentially. At a particular trial, we can observe either the RV x or the RV y, not both. Observing x n times andy m times, we obtain the values (Fig. 9.11b) Xt, • • • , Xn, Yn-1 • • • • • Yn-tn (9-41) of then samples x;ofx and of them samples y;ofy. 
In this interpretation, the n + m RVs X; and Y• are independent. Let us look at two illustrations of the need for sequential sampling. Figure 9.11 Sequential samples Paired samples X, Yt Y; y,. Yn+tn SEC. 9-2 EXPECTED VALUES 291 I. Suppose that x is an RV with mean Ylx. We introduce various changes in the underlying physical experiment and wish to establish whether these changes have any effect on TJx. To do so, we repeat the original experiment n times and the modified experiment m times. We thus obtain then + m samples in (9-41). The first n numbers arc the samples x1 of the RV x representing the original experiment, and the next m numbers y 1 arc them samples of the av y representing the modified experiment. This problem is basic in hypothesis testing. 2. A system consists of two components. The time to failure of the first component is the RV x and of the second. the RV y. To estimate the difference TJx - TJ:-, we test a number of systems. If at each test we can determine the times to failure x1 and y1 of both components, we can obtain the paired samples (x1 , yt). Suppose, however, that the components are connected in series. In this case, we cannot observe both failure times at a single trial. To estimate the difference TJx - Tl.r• we must usc the independent samples in (9-41). We introduce the RV I "' (9-42) Yt m i•l The RV wis defined by (9-42); it is not a sample mean. Clearly, TJ;r = TJ.f - TJ.v = TJx - TJ~· Hence, wis an unbiased point estimator of Ylx - .,, • To find an interval estimate, we must determine the variance of w. From the independence of the samples x1 and y1 it follows that w= i - y where i x = u; n ~ ., ., u! I = -n 2 x; 1-1 0'~ ~ y =- 2 ., ., 0'( u~ = -;:- O'f = ,;; u~ + ,;; (9-43) Under the normality assumption, this leads to the 'Y confidence interval .t - Y - O';;;Z, < TJx - TJ, < i - y + O';;;:Z11 (9-44) This estimate is used of O'x and u>. are known. If they are unknown and n is large, we use the approximations u·, == s·, ·' ·' I- LJ ~(x· =n - I i=l ' x-)2 , u~ · , ""' s; · I X. =, LJ m- i-1 -2 (y; - y) in the evaluation of u;;;. Example 9.13 Suppose that x models the math grades of boys and y the math grades of girls in a senior class. We examine the grades of SO boys and 100 girls and obtain x = 80. y = 82, = 32, s~ = 36. Find the .99 confidence interval of the difference 1Jx - .,, of the grctde means 1Jx and TJr. s; 292 CHAP. 9 ESTIMATION In this problem, n = SO, m = 100. zll = z. 99, ""'2.S8. Inserting the approximation s2 u!.. ,.,. ....!. + "' so s2 ..:L =1 100 into (9·44), we obtain the interval -2 ± 2.S8. • The estimation of the difference "111 - .,, when u 11 and u 1 are unknown and the numbers n and m are small is more difficult. If u 11 =I= uv, the problem does not have a simple solution because we cannot find an. estimator the density of which does not depend on the unknown parameters. We shall solve the problem only for U 11 = Uy = (T The unknown u be estimated either by s~ or by s~. The estimation error is reduced, however, if we use some function of sJl and s,. We shall select as the estimate u2 of u 2 the sum as~ + bs~ where the constants a and b are so chosen as to minimize the MS error. Proceeding as in (7-38), we obtain (see Problem 7-29) u2 = (n - 1)s~ + (m - l)s~ (9-45) n + m- 2 2 can The variance of wequals u~ = u 2(1/n + 1/m). To find its estimate, we replace u 2 by u2• This yields the estimate o-t = (n "' (! ..!.) 
1)s~ + (m - 1)s: + n+m-2 n m (9-46) We next form the RV (9-47) As we show in Problem 7-29, this RV has a 1 distribution with n + m - 2 degrees of freedom. Denoting by Ill its u-percentile, we conclude that P {-Ill < w~w"'w < Ill} = 'Y = 2u - I Hence, (9-48) Thus to estimate the difference sampled, we proceed as follows: "'x - .,, of two RVs x and y sequentially 1. Observe the n + m samples Xt and y 1 and compute their sample means x, y and variances s~, s~. 2. Determine {r from (9-45) and cTw from (9-46). 3. Select the coefficient 'Y = .2u - 1 and find the percentile lll(n + m - 2) from Table 3. 4. Form the interval x - y : luUw. SEC. Example 9.14 9-3 VARIANCE AND CORRELATION 293 Two machines produce cylindrical rods with diameters modeled by two normal Rvs x andy. We wish to estimate the difference of their means. To do so. we measure 14 rods from the first machine and 18 rods from the second. The resulting sample means and standard deviations. in millimeters. are as follows: .t = 85 y = 85.05 .f, = 0.5 .~. ::. 0.3 Inserting into (9-46). we obtain •~ =- 13s; ~. 17s; { _!_ Thus uw = 0.14. n -t ~· 30 m - 2 = 30. 1,9(30) _!_ 1 = () ()" 14 + 18 ; . - = 1.31. and IIJ-481 yields the estimate • ·0.24 < .,.,,, - '11.• < 0.14 9-3 Variance and Correlation We shall estimate the variance v = cr 2 of a normal RV x. We assume first that its mean TJ is known, and we use as point estimate of v the sum . V ~ ( = -nI ;-t ~ X; , (9-49) 71)· - where x; are the observed samples of x. As we know. the corresponding estimator v is unbiased, and its variance tends to 0 as n -+ x; hence, it is consistent. To find an interval estimate of v, we observe that the RV nv/u 2 has a x2 distribution with n degrees of freedom (see (7-92)]. Denoting by x~(n) its upercentiles, we conclude (Fig. 9.12) that P {xk<n> < ;~ < xi -&12(n)} = 'Y = I This yields the 'Y confidence interval nO > 2 -2-CT > }(&12(n) The percentiles of the 2 - a (9-50) nO (9-51) Xt-&12(n) x2 distribution are listed in Table 2. Figure 9.12 0 X 294 CHAP. 9 ESTIMATION We note that the interval so obtained does not have minimum length because the x2 density is not symmetrical; as noted in (8-12), it is used for convenience. If 11 is unknown, we use as point estimate of v the sample variance s2 i =~ n I ;., (x; - i)2 Again, this is a consistent estimate of v. To find an interval estimate, we observe that the x2(n - I) distribution; hence, 2 (n - l)s u2 2 P { Xat2Cn - 1) < . 2 < X• RV (n - l)s2/u 2 has a } &'2(n - I) = 'Y (9-52) This yields the interval estimate (n - l)s xi-at2<n - Example 9.15 2 < u 2 < ...:<~n_-__;_:l):..;;.s_ 2 xk<n - 1) (9-53) I) The errors of a given scale are modeled by a normal av " with zero mean and variance u 2• We wish to find the .95 confidence interval of u 2• To do so. we weigh a standard object 11 times and obtain the following readings in grams: 98.92 98.68 98.85 97.07 98.98 99.36 98.70 99.33 99.31 98.84 99.20 These numbers are the samples of the av x = w • " where w is the true weight of the given object. (a) Suppose, first, that w = 99 g. In this case, the mean w of x is known; hence, we can use (9-51) with () = I II TI ,_, L (X; - 99)2 = 0.626 Inserting the percentiles x~o:sOI) = 3.82 and i97~(11) = 21.92 of x2Cll), into (9-51) we obtain the interval ~- 0 30 2 11021.92 - · < u < 3.82 - 1.8 Note that the corresponding interval for the standard deviation u [see (93)) is vr:s vU30 =.55< u < = 1.34 (b) We now assume that w is unknown. 
From the given data, we obtain i I II L = x1 = 99.02 lli~l s2 = - I II L (x; - 10;-1 99.02)2 = 0.622 SEC. 9-3 VARIASCE AND CORRELATION 295 In this case, x~o2sOO) =3.25 and i97~(10) = 20.48. Inserting into (9-53), we obtain the estimate 0.303 < u 2 < 1.91 • Covariance and Correlation A basic problem in many scientific investigations is the determination of the existence of a causal relationship between observable variables: smoking and cancer, poverty and crime, blood pressure and salt. The methods used to establish whether two quantities X and Y are causally related are often statistical. We model X and Y by two Rvs x andy and draw inferences about the causal dependence of X and Yfrom the statistical dependence ofx andy. Such methods are useful, however. they might lead to wrong conclusions if they are interpreted literally: Roosters crow every morning and then the sun rises; but the sun does not rise because roosters crow every morning. To establish the independence of two Rvs, we must show that their joint distribution equals the product of their marginal distributions. This involves the estimate of a function of two variables. A simpler problem is the estimation of the covariance P.11 of the Rvs x andy. If p. 11 = 0, x andy are uncorrelated but not necessarily independent. However, if p. 11 :/: 0, x andy are not independent. The size of the covariance of two Rvs is a measure of their linear dependence, but this measure is scale-dependent (is 2 a high degree of correlation or 200?). A normalized measure is the correlation coefficient r = p. 11 /o:tuy ofx andy. As we know, lri ~ I. and if lrl = I, the Rvs x and y are linearly dependent-that is. y = ax + b; if r = 0, they are uncorrelated. Since P.11 = E{<x- 'I.• )(y -1)y)}, the empirical estimate of p. 11 is the sum (1/n)}:(x; - i)(y; - y). This estimate is used if the means 11x and 11r are known. If they are unknown, we use as estimate of p. 11 the sample covariance • #Lll 1 L,. ~ = ---=--t n i-1 (X; - - (y; X) - .V-) (9-54) The resulting estimator 1'11 is unbiased (see Problem 9-26) and consistent. We should stress that since r is a parameter of the joint density of the avs x andy, the observations x; andY; used for its estimate must be paired samples, as in (9-39). They cannot be obtained sequentially, as in (9-41). We estimate next the correlation coefficient r using as its point estimate the ratio ; = JLJJ = I(x; - .f)(y; - y) (9-55) VI(x; - x)2I(y; - y)2 This ratio is called the sample correlation coefficient of the RVs x andy. To find an interval estimate of r, we must determine the density Jf(r) of the RV r. The functionf,(r) vanishes outside the interval (-1. I) (Fig. 9.13a) because lrl ~ I (Problem 9-27). However, its exact form cannot be found easily. We give next a large sample approximation. SxSy 296 CHAP. 9 ESTIMATION J~(Z) z X (a) (b) Figure 9.13 Fisher's AuxiUary Variable z= We introduce the transformations 1 1+f _ t 2 In I t e2z-1 I = e2z + (9-56) It can be shown that for large n, the distribution of the RV z so constructed is approximately normal (Fig. 9.13b) with mean and variance 1+r 1 I Tit"" 2ln 1 _ r (9-57) (7'2"" - - n- 3 l The proof of this theorem will not be given. From (9-57) and (8-7) it follows that P {Tit - ~ < z < Tit + ~} = 'Y = 2u n-3 n-3 1 In this expression, Zu is not the u-percentile of the RV z in (9-56); it is the normal percentile. To continue the analysis, we shall replace in (9-57) the unknown r by its empirical estimate (9-55). 
This yields A Tit - 1 1 +; 2 In 1 _ (9-58) 1 From (9-58) and the monotonicity of the transformations (9-56) it follows [see also (9-49)] that where r _ e2t• - 1 t - e2t• + 1 e'Ztz - 1 r2 = e2t + 2 1 z2 lt =. + Zu Tit - .y;;-::-j (9-59) Thus to find the 'Y confidence interval (r1 , r2 }ofthe correlation coefficient r, we compute f from (9-55), 'J}t from (9-58), and r1 and r2 from (9-59). SEC. Example 9.16 9-4 297 PERCENTILES AND DISTRIBUTIONS We wish to estimate the correlation coefficient rofSAT scores, modeled by the RV x, and freshman rankings, modeled by the RV y. For this purpose, we examine the relevant records of 52 students and obtain the 52 paired samples (x;, y;). We then compute the fraction in (9-55) and find f = .6. This is the empirical point estimate of r. We shall find its .95 confidence interval. Inserting the number f = .6 into (9-58), we obtain .,;l == .693, and with Z11 = 2, (9-59) yields zt=.41 zz=.98 rt=.39 rz=.75 Hence, .39 < r < .75 with confidence coefficient .95. • 9-4 Percentiles and Distributions Consider an RV x with distribution F(x). The u-percentile of x is the inverse of F(x); that is, it is a number X 11 such that F(x11 ) = u (Fig. 9.14a). In this section, we estimate the functions X 11 and F(x) in terms of the n observations x; of the RV x. In both cases, we assume that nothing is known about F(x). In this sense, the estimates are distribution-free. Percentiles We write the observations x1 in ascending order and denote by y1 the ith number so obtained. The resulting Rvs y1are the ordered statistics Yt s Y2 s · · · s Yn introduced in (7-44). In particular, y1 equals the minimum and Yn the maximum of x;. The determination of the interval estimate of X 11 is based on the following theorem. • Theonm. For any k and r, P{yk <XII< Yk+r} = kfl (~) um(l - u)n•"' (9-60) X.s X lll&k l'igure 9.14 Y1r 0~~----~~~~ (a) Y40 (b) 298 CHAP. 9 ESTIMATION • Proof. From the definition of the order statistics it follows that Yt < xu itT at least k of the samples X; are less than x,; similarly, y 4 ., > x, itT at least k + r of the samples x; are greater than x,. Therefore, }'4 < x, < Ylc+r itT at least k and at most k + r- I of the samples x; are less than x,. In other words, in the sample space~~~. the event {y" < x, < y• .,} occurs itT the number of successes of the event {x s x,} is at least k and at most k + r - I. This yields (960) because P{x s x,} = u. This theorem leads to the following 'Y confidence interval of Xu: We select k and r such as to minimize the length Ylc+r - Ylc of the interval (yk, Yk.,) subject to the condition that the sum in (9-60) equals 'Y· The solution is obtained by trial and error involving the determination of 'Y for various values of k and r. Example 9.17 We have a sample of size 4 and use as estimate of x, the interval (y 1 • y 4 ). In this case. 'Y equals the sum in (9-60) for m from I to 3. Hence. 'Y For u = P{y, < x, < Y•} = 4u(l - u)3 + 6u 2(1 - u)2 + 4u 3(1 = .5 • .4•. 3, we obtain 'Y = .875, .845, .752. • - u) If n is large, we can use the approximation (3-35) in the evaluation of 'Y· This yields 'Y - 0.5 - nu) - G ( kV+ 0.5 - nu) (9-61) nu(l - u) nu(l - u) For a given 'Y· the length of the interval (y,, y,.,) is minimum if its center is Y11• where nu is the integer closest to nu. 
Setting k + 0.5 == n, - m and k + r - 0.5 == nu + min (9-61), we obtain = P{yk <Xu < 'Y Y~.:+,} == G ( +Vr = P{ytt.-m <X, < This yields the 'Y = Ytt.+m} == 2G ( V m ) tru(l - 11) (9-62) I - 8 confidence interval Ytt.-m < Example 9.18 k X, < Ytt.+m m == Z1-61~ V nu(l - u) (9-63) We have 64 samples ofx, and we wish to find the .95 confidence interval of its median x. 5 • In this problem, n = 64 u = .5 n, = nu = 32 Z1-a'2 ""' 2 m =8 and (9-63) yields the estimate y2• < x.5 < y40 • We can thus claim with probability .95 that the unknown median is between the 24th and the 40th ordered observation of x (Fig. 9.14b). • Distribution The point estimate of F(x) is its empirical estimate k F(x) = 2 n (9-64) SEC. 9-4 Pf.RCf.NTII.F.S ASD DISTRIBUTIONS 299 where kx is the number of samples x; that do not exceed x [see (4-25)]. The function F(x) has a staircase form with discontinuities at the points x;. lfx is of continuous type, almost certainly the samples x; are distinct and the jump of i"(x) at x; equals 1/n. It is convenient, however, if II; samples are close, to bunch them together into a single sample x; of multiplicity n;. The corresponding jump at x; is then n;ln. We shall find two interval estimates of F(x). The first will hold for a specific x. the second for any x. Variable-length estimate For a specific x. the function F(x) is the probability p = P{x::: x} of the event s4. = {x::: x}. We can therefore apply the earlier results involving estimates of probabilities. For large n. we shall use the approximation (9-20) with p replaced by F(:c) and.{ by i"(x). This yields they confidence interval F(x):!: a a= ~ VF(x)(l - F(x)) (9-65) for the unknown F(x). In this estimate. the length 2a of the confidence interval depends on x. KolmogorotT estimate The empirical estimate of F<x> is a function F(x) of x depending on the samples x; of the RV x. It specifics therefore a random family of functions F(x), one for each set of samples x;. We wish to find a number c, independent of x, such that P{IF<x> - F<x>l ::: c} ::: y for every x. The constant y is the confidence coefficient of the desired estimate F(x) :!: c of F(x). The difference IF(x) - F(x)l is a function of x and its maximum (or least upper bound) is a number w (Fig. 9 .15) depending on the samples x;. It specifies therefore the RV w = max IF(x) - F(x)l (9-66) -:~.<.t<z Fipre 9.15 II )()() CHAP. 9 ESTIMATION From the definition of wit follows that w < c iff IF(x) - F(x)l < c for every x, hence, P{maxiF(x) - F(x)l :S c} = P{w :S c} = F.,.(c) = 'Y (9.67) It where F.,(w) is the distribution ofthe av w. To find c, it suffices therefore to find F,..(w). The function F...(w) docs not depend on the distribution F(x) ofx. This unexpected property of the maximum distance w between the curves F(x) and F(x) can be explained as follows: The difference F(x) - F(x) does not change if the x axis is subjected to a nonlinear transformation or, equivalently, if the av xis replaced by any other RV y = g(x). To determine F.,.(w), we can assume therefore without loss of generality that F(x) = x for 0 s x s I [see (4-68)]. Even with this simplification, however. the exact form of F.,.(w) is complicated. 
For most purposes, it is sufficient to use the following approximation due to KolmogorotT: 1 F,..(w) =r I - 2e- 2""' for w > llYn (9-68) This yields 'Y = I - 8 = F.,.(c) ""' I - 2e- 21".1 Hence, the 'Y confidence interval of F(x) is (9-69) F(x) ± c Thus we can state with confidence coefficient 'Y that the unknown distribution is a function F(x) located in the zone bounded by the curves F(x) + c and F(x)- c. Example 9.19 The IQ scores of 40 students rounded off to multiples of 5 are as follows: Xi l'(x) = .025 15 80 85 90 95 100 lOS 110 115 120 2 3 5 6 8 6 4 2 2 125 Find the .95 confidence interval of the distribution F(x) of the scores. In this problem, we have II samples xi with multiplicity n;, and .075 .150 .275 .425 .625 .775 .875 .925 .915 1.000 for xis x <xi+ 1, i = I, . . . , 11 where x12 = =. With 8 = .05, equation (9-69) yields SEC. 9-5 301 MOMENTS AND MAXIMliM I.IKE.LIHOOD the interval F(x) ±. c: (' = I I .05 i V- 80 n T = .217 For example, for 90 :s x < 95. the interval .425 ±. .217 results. • Kolmogoroff's test leads to reasonable estimates only if n is large. As the last example suggests, n should be at least of the order of 100 (see also Problem 9-31 ). If this is true, the approximation error in (9-68) is negligible. 9-5 Moments and Maximum Likelihood The general parameter estimation problem can be phrased as follows: The density of an RV x is a function f(x, 61 , • • • , 6,} depending on r ~ I parameters 6; taking values in a region 9 called the parameter space. Find a point in that space that is close in some sense to the unknown parameter vector (6 1 , • • • , 6,). In the earlier sections of this chapter, we developed special techniques involving the commonly used parameters. The results were based on the following form of (4-89): If a parameter 6 equals the mean of some function q(x) of x, then we use as its estimate fJ the empirical estimate of the mean of q(x): • I~ 6 = E{q(x)} (9-70) 6 = Ll q(x;) n In this section, we develop two general methods for estimating arbitrary parameters. The first is based on (9-70). Method of Moments The moment m• of an RV x is the mean of xk. It can therefore be estimated from (9-70) with q(x} = x". Thus • mk n "-"' X;k = I Ll (9- 71) The parameters 6; of the distribution of x are functions of m• because m" = r. x"f(x, 6,, . . . , 6,)dx (9-72) To find these functions, we assign to the index k the values I to r or some other set of r integers. This yields a system of r equations relating the unknown parameters 6; to the moments m". The solution of this system expresses 6; as functions 6; = y;(m 1 , • • • , m,) of m". Replacing m~c by their estimates "'", we obtain iJ; = y,{m, •...• m,) (9-73) These are the estimates of 6; obtained with the method of moments. 302 CHAP. 9 ESTIMATION Note that if 6 is the estimate of a parameter 6 so obtained, the estimate of a function 1'(8) of 8 is T{B). Example 9.20 Suppose that x has the Rayleigh density j(x. To estimate 8, we set k 8) = ~ e_.,:. 2,:U(xl = I in (9-72). This yields m, = ~2 J: x e 2 ,,:!26: dx = 8 ~~ Hence, • Example 9.21 The av xis normal with mean 2. We shall estimate its standard deviation u. Since m2 = .,2 .,. u 2 = 4 + u 2 , we conclude that • Example 9.22 We wish to estimate the mean., and the variance u~ of a normal av. In this problem, m1 = Tl and u 2 = m2- mi: hence, -;; = ,;,, "' ; u2 = ,;,2 - (i)2. 
• Method of Maximum Likelihood We shall introduce the method of maximum likelihood (ML) starting with the following prediction problem: Suppose that the RV x has a known density f(x, 8). We wish to predict its value x at the next trial; that is, we wish to find a number x close in some sense to x. The probability that xis the interval (x, x + dx) equals/(x, (J)dx. If we decide to select x such as to maximize this probability, we must set x = Xmax where Xmax is the mode of x. In this problem, 8 is specified and Xmax is the value of x for which the density j(x, 8), plotted as a function of x, is maximum (Fig. 9.16a). In the estimation problem, 8 is unknown. We observe the value x ofx, and we wish to find a number 6close in some sense to the unknown 8. To do so, we plot the density f(x, 8) as a function of 8 where xis the observed value of x. The curve so obtained will be called the likelihood function of 8. The value 8 = 6max of 8 for which this curve is maximum is the ML estimate of 8. Again, the probability that xis in the interval (x, x + dx) equalsf(x, 8)dx. For a given x, this is maximum if 8 = 8ma,.. We repeat: In the prediction problem, 8 is known and Xmu is the value of x for which the density f(x, 8) is maximum. In the estimation problem, xis SEC. 9-5 MOMENTS A~D MAXIMUM LIKELIHOOD 303 Likelihood Density ., "' .\' 8=- 0 .\' Ia I lbl Figure 9.16 known and 8max is the value of 8 for which the likelihood functionf(x, 8) is maximum. Note Suppose that 6 = 8(T) and/!x, 8!T)) is maximum forT= f. It then follows that.f{x, 8) is maximum for 8 = 8(f). From this we conclude that iff is the ML estimate of a parameter T, the ML estimate of a function 8 = 8(f) ofT equals 8(f). t:xample 9.23 The time to failure of type A hulhs is an RV x with density f(x. IJ) = O~.H' "'U!x) In this example. the density of x !Fig. 9.16al is maximum for.\· - l/8, and its likelihood function !Fig. 9.16h) is maximum for 0 "' 2/x. This leads to the following estimates: Prediction: The MI. estimate .i of the life length x of u particular bulb equals 110. Estimation: The MI. estimate {J of the parameter (J in terms of the observed life length x of a particular bulb equals 2/x. • We shall now estimate IJ in terms of the 11 observations X;. The joint density of the corresponding samples x; is the product ,((X. H) ...:. /{x,. H)· · • f<x,. H) X= lx, .... x,] Fora given 8./(X. (})is a function of then variablesx;. For a given X.f(X, 8) is a function of the single variable fJ. If X is the vector of the observations x;. then.f<X. fJ) is called the likelihood function of H. Its logarithm = In f<X. 8) == I In .f(x;. 9) (9-74) is the log-/ikelilwod function of 8. The M L estimate II of 0 is the value of 8 for which the likelihood function is maximum. If the maximum of f<X. 8) is in the interior of the domain (o-) of 8. then {J is a root of the equation L(X. fJ) nf<X. 8) ,., 0 atJ <9_75 > 304 CHAP. 9 ESTIMATION In most cases, it is also a root of the equation aL(X, 6) = ~ 1 aj(x~o 6) = (9-76) 0 a6 f(x;, 6) a6 Thus to find (J, we solve either (9-75) or (9-76). The resulting solution depends on the observations x;. Example 9.24 The RV x has an exponential density 8e-~U(x). We shall find the ML estimate (J of 8. In this example, f(X, 8) = 8"e-11Jit+···•JI• 1 = 8"e_,,.; L(X, 8) = n In 8 - 8nX Hence, aL = !! - ,x = o fJ = ~ atJ 8 x Thus the ML estimator of 8 equals 1/i. This estimator is biased because E{i} 118 and £{1/i} :1: 11.,)1. 
• = .,)1 = In Examples 9.23 and 9.24, the likelihood function was differentiable at 6 = fJ, and its maximum was determined from (9-76) by differentiation. In the next example, the maximum is a comer point and (J is determined by in- spection. Example 9.25 The RV xis uniform in the interval (0, 8), as in Fig. 9.17a. We shall find the ML estimate (J of lJ. The joint density of the sample x; equals 1 /(X, 8) = ,. for 0 < x 1• • • • • x,. < 8 8 and it is 0 otherwise. The corresponding likelihood function is shown in Fig. 9.17b, where z is the maximum of x;. As we see from the figure,f(X, 8) is maximum at the comer point z =max X; of the sample space; hence, the ML estimate of 8 equals the maximum of the observations x;. This estimate is biased (see Example 7.4) because E{x.n..} = (n + 1)8/n. However, the estimate is consistent. • Figure 9.17 I<X,8) /(x,8) ·~----... 8 8" 0 0 (a) 8 (b) SEC. 9-5 MOMENTS AND MAXIMUM LIKELIHOOD 305 p 0 Figure 9.18 Example 9.26 We shall find the ML estimate p of the probability p of an event stl. For this purpose, we form the zero-one av x associated with the event .IIi. The joint density of the corresponding samples x; equals p 4q 11 -t where k is the number of successes of stl. This yields iJL k n- k ------=0 L(X. p) = kIn p + (n - k)ln q ilp p q Solving for p, we obtain k(l - p) - (n - k)p =0 • k n p.::- This holds if p is an interior point of the parameter space 0 s p s l. If p = 0 or I, the maximum is an endpoint (Fig. 9.18). However, even in thi$ case, p =kin. • The determination of the ML estimates 6; of several parameters 8; proceeds similarly. We form their likelihood function f(X, 81 , • • • , 8, ), and we determine its maxima or, equivalently, the maxima ofthe log-likelihood L(X, 8., . .. , 9,) = ln/(X, 9, • . . . • 9,) Example 9.27 We shall find the ML estimates ij and 0 of the mean 11 and the variance u = u 2 of a normal av. In this case, f(X, .,, u) I { = (Yl11'u)" exp - l ,... •} lu £... (x; - TJ)- n In (211'u) =- 2 u £... (x; - 11>2 2 L(X, .,, u) '\."' (X·..,) = 0 a., =!u £... ' ., aL Solving the system. we obtain ij = .t - I~ + J., ~ (x· au = - .!!.. 2u 2u- £... ' aL .. ,... -· u =- £... (x; - x)- n (9-77) 71) ., 2 =0 306 CHAP. 9 ESTIMATION The estimate;; is unbiased, but the estimate 0 is biased because E{v} = (n - l)u 21n. However, both estimates are consistent. • Asymptotic Properties or ML Estimators The ML method can be used to estimate any parameter. For moderate values of n, the estimate is not particularly good: It is generally biased, and its variance might be large. Furthermore, the determination of the distribution of iJ is not simple. As n increases, the estimate improves, and for large n, iJ is nearly normal with mean 8 and minimum variance. This is based on the following important theorem. • Theonm. For large n, the distribution of the ML estimator iJ of a parameter 8 approaches a normal curve with mean 8 and variance 1/nl where 1=E {1:8 L<x. 8>n = lnf(x, 8) L(x, 8) (9-78) In Section 9-6, we show that the variance of any estimator of 8 cannot be smaller than Ifni. From this and the theorem it follows that the ML estimator of a parameter 8 is asymptotically the best estimator of 8. The number I in (9-78) is called the information about 8 contained in x. This concept is important in information theory. Using integration by parts, we can show (Problem 9-36) that I =-E ca; L(x, 8)} (9-79) 2 In many cases, it is simpler to evaluate I from (9-79). The theorem is not always true. 
For its validity, the likelihood function must be differentiable. This condition is not too restrictive, however. The proof of the theorem is based on the central limit theorem but it is rather difficult and will be omitted. We shall merely demonstrate its validity with an example. Example 9.28 Given an RV x with known mean.,, we wish to find the ML estimate 0 of its variance v = u 2• As in (9-77), !n 2 0= The RV Tl)2 (XI - n¥/u 2 has a x2(n) distribution; hence, E{t} =u2 u~u = -2un 4 Furthermore, for large li, tis normal because it is the sum of the independent avs asymptotically, the ML estimate t of u 2 is normal with mean u 2 and variance 2u 4 /n. We shall show that this agrees with the theorem. In this problem. (xt - 11>2• Thus, -I aL(x, v> av = 2v (x - + 11>2 2V2 a2L(x, v> aV2 and (9-79) yields • {-I (x - 11>2} 2v2 + v3 in agreement with the theorem. • I =E I = 2u 4 I == 2v2 - (x - 11>2 v3 SEC. 9-6 BEST ESTIMATORS AND THE RAO-CRAMf.R BOUND 307 9-6 Best Estimators and the Rao-Cramer Bound All estimators considered earlier were more or less empirical. In this section, we examine the problem of determining best estimators. The best estimator of a parameter 6 is a statistic (J = g(X) minimizing the MS error e = E{(B - 8)2} = t (g(X) - 8) /(X. 8) 2 dX (9-80) In this equation, 6 is the parameter to be estimated, f(X, 6) is the joint density of the samples x;. and g(X) is the function to be determined. To simplify the problem somewhat, we shall impose the condition that the estimator (J be unbiased: (9-81) E{iJ} = 6 This condition is only mildly restrictive. As we shall see, in many cases, the best estimators are also unbiased. For unbiased estimators, the LMS errore equals the variance of fl. Hence, best unbiased estimators are also minimumvariance unbiased estimators. The problem of determining an unbiased best estimator is difficult because not only the function g(X) but also the parameter 8 is unknown. In fact, it is even difficult to establish whether a solution exists. We show next that if a solution does exist. it is unique. • Theorem. If fJ1 and {h. are two unbiased minimum-variance estimators of a parameter 6. then fJ1 = {h. • Proof. The variances of fJ1 and {h. must be equal because otherwise the one or the other would not be best. Denoting by u 2 their common variance, we conclude that the statistic A I . . "= 2 (8, + ~) is an unbiased estimator of 6 and its variance equals 2 I 2 I 2 2 2 CT; = 4 (u + CT + 2ru ) = 2u (1 + r) where r is the correlation coefficient of (J, and ~. This shows that if r < I, then u; < u, which is impossible because (J, is best; hence, r = I. And since the avs (J, and {h. have the same mean and variance. we conclude as in (5-46) that (J, = ~. We continue with the search for best estimators. We establish a lower bound for the MS error of all estimators and develop the class of distributions for which this bound is reached. This material is primarily of theoretical interest. Regularity The density of an av x satisfies the area condition r. f(x, 6)dx =I 308 CHAP. 9 ESTIMATION The limits of integration may be finite, but in most cases they do not depend on the parameterS. We can then deduce, differentiating with respect to S, that aj(x, 6) dx = (9-82) 0 _,. as I" We shall say that the density f(x, 6) is regular if it satisfies (9-82). Thus f(x, 6) is regular if it is differentiable with respect to 6 and the limits of integration in (9-82) do not depend on 6. Most densities of interest are regular; there are, however, exceptions. 
If, for example, xis uniform in the interval (0, 6), its density is not regular because 6 is a boundary point of the range 0 s x s S of x. In the following, we consider only regular densities. lnjol'lllation The log-likelihood of the RV x is by definition the function L(x, 6) = lnf(x, 6). From (9-82) it follows that _ I- aj(x, 6) f(x, 6) dx = 0 6) as I_..,"' aL(x,a6 6) f(x, S) dx = I"_., f(x, This shows that the mean of the ance by I, we conclude that RV aL(x, S)/ a6 equals 0. Denoting its vari- (9-83) The number I is the information about 6 contained in x [see also (9-78)]. Consider, next, the likelihood function L(X, 6) = In f(X, 6) of the sample X = [x., . . . , x,.]. Its derivative equals aL(X, S) = 1 af(X, S) = L aL(x;, S) as f<X. s> as as The avs aL(x;, 6)1a6 have zero mean and variance I. Furthermore, they are independent because they are functions of the independent RVS X;. This leads to the conclusion that (9-84) The number nl is the information about 6 contained in X. We tum now to our main objective, the determination of the greatest lower bound of the variance of any estimator II = g(X) of 6. We assume first that II is an unbiased estimator. From this assumption it follows that E{ II} = t g(X)f(X' 6) dX = s Differentiating with respect to S, we obtain ( (X) aj(X, 6) dX = I JR 8 as This yields the identity E {g(X) aL~, 8)} = I (9-85) SEC. 9-6 BEST ESTIMATORS AND THE RAO-CRAMER BOUND 309 The relationships just established will be used in the proof of the following fundamental theorem. The proof is based on Schwarz's inequality [see (5-42)]: For any z and w, E 2{zw} s E{z2}E{w2} (9-86) Equality holds iff z = cw. THE RAO-CRAMER BOUND The variance of an unbiased estimator iJ = g(X) of a parameter 6 cannot be smaller than the inverse I I nl of the information n/ contained in the sample X: u~ = E{[g(X) - I 6]2} ~ nl (9-87) Equality holds iff iJL(X, (}) ao = nl [g(X) - 6] (9-88) • Proof. Multiplying the first equation in (9-84) by 6 and subtracting from (9-85), we obtain I = E {lg(X)- 9) iJL(:e, (})} We square both sides and apply (9-86) to the Rvs g(X) - 6 and iJL(X, 6)/a8. This yields [see (9-84)] I s E{lg(X) - 6]2}£ { liJL(!, OW} = u~nl and (9-87) results. To prove (9-88), we observe that (9-87) is an equality iff g(X) - 6 equals ciJL(X, 8)/iJ8. This yields I = E { cliJL(!, O)n = en/ Hence, c = 1/n/, and (9-88) results. Suppose now that iJ is a biased estimator of the parameter 8 with mean E{ IJ} = 7(6). If we interpret iJ as the estimator of 'T(O), our estimator is unbiased. We can therefore apply (9-87) subject to the following modifications: We replace the function aL(X, 8)/iJ8 by the function iJL[X, 6(7)] = iJL(X, 6) _I_ iJ7 iJ(J 7'(8) and the information n/ about 8 contained in X by the information 1 E { liJL[X, 6(7)]1 iJ7 ' 2 } = _1_ E {laL(X,BW} = _!!!_ [7'(6)] 2 iJ(J [7'(6)] 2 310 CHAP. 9 ESTIMATION about 1'(8) contained in X. This yields the following generalization of the Rao-Cramer bound. • Corollary. If iJ = g(X) is a biased estimator of a parameter 8 and E{ iJ} = 1'(8), then u~ = E{[g(X) - T(8)]2} ::: (T'!~)J2 (9-89) (Jj (9-90) Equality holds iff iJL(X, 8) a8 = ...!!!.._ I (X) _ T'(8) g EmCIENT ESTIMATORS AND DESSffiES OF EXPONENTIAL TYPE We shall say that iJ is the most efficient estimator of a parameter 8 if it is unbiased and its variance equals the bound 1/n/ in (9-87). If Sis biased with mean 1'(8) and its variance equals the bound in (6-89), it is the most efficient estimator of the parameter T((J). The Rao-Cramer bound applies only to regular densities. 
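To make the bound concrete: for an N(η, σ) RV with σ known, L(x, η) = −(x − η)²/2σ² + const, so the information per sample is I = 1/σ², and no unbiased estimator of η can have variance below σ²/n; the sample mean attains it. The sketch below (illustrative; η = 5 and σ = 2 are assumed values) checks this by simulation.

```python
import numpy as np

# Rao-Cramer bound check for the mean of N(eta, sigma) with sigma known.
# Information per sample: I = 1/sigma^2, so the bound is sigma^2 / n.
rng = np.random.default_rng(2)
eta, sigma = 5.0, 2.0
n, trials = 50, 20000

sample_means = rng.normal(eta, sigma, (trials, n)).mean(axis=1)

print("variance of the sample mean :", sample_means.var())
print("Rao-Cramer bound 1/(nI)     :", sigma ** 2 / n)
# The two numbers should be close: the sample mean is the most efficient estimator.
```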
If j(x, 8) is regular, the most efficient estimator of 8 is also the best estimator. The class of distributions that lead to most efficient estimators satisfy the equality condition (9-88) or (9-90). This condition leads to the following class of densities. • Definition. We shall say that a density j(x, 9) is of the exponential type with respect to the parameter 8 if it is of the form j(x, (J) = h(x) exp{a((J)q(x) -- b(9)} (9-91) where the functions a((J) and b((J) depend only on (J and the functions h(x) and q(x) depend only on x. We shall show that the class of exponential type distributions generates the class of most efficient estimators. • Theorem. If the density /(:c. 8) is of the form 19-91). the statistic iJ = g(X) = !n L q(X;) (9-92) is the most efficient estimator of the panlmeter T((J) = b'(8) a'(8) The corresponding Rao-Cramer bound equals U~ _ (T'((J))2 _ 7'(8) ' nl - na'((J) • Proof. From (9-91) it follows that ln/(x, 8) =In h(x) + a(8)q(x)- b((J) and (9-74) yields iJL(X, 8) ~ a = a'(8) ~ q(x;) - nb'(8) = na'(8)[g(X) - 1'(8)] 8 (9-93) ( _ ) 9 94 SEC. 9-6 BEST ESTIMATORS AND THE RAO-CRAMER BOUND 311 This function satisfies (9-90) with T~~) = na'(8) Hence, I = a'(8)T'(8). Inserting into (9-89), we obtain (9-94). Note that the converse is also true: If L(X, 8) satisfies (9-90), thcnf<x. 8) is of the exponential type. We give next several illustrations of exponential type distributions. Normal (a) The normal density is of the exponential type with respect to its mean because f (x. 71) I = '\I21TU cxp { - Iv (x-, - 2x71 - 71-) '} 2 This density is of the form (9-91) with 0('17) = !1. b('l1) v = '17~ q(x) 2v =x In this case. '11 == g(X) = .!n L X; = .f Hence, the sample mean i is the most efficient estimator of '17· (b) The normal density is also of the exponential type with respect to its variance because j(x. v) = ~ cxp {-In vV- dv (x- '17>~} This is also of the form (9-91) with a(v) = - - I 2v b(v) = In vV q(x) = (x - 71)~ In this case. T{v) = v; hence. the statistic v = I(x; - 71)~/n is the best estimator of v. The variance of the estimation [sec (9-94)] equals , T'(v) 2u 4 (7'~ v = - - =na'(v) n Exponential The exponential density j(:c, 8) = 8e" 9~ = exp{-8x + In 8} is of the exponential type with a(8) = -8, b(8) = -In 8. and q(x) = x. Hence. the statistic "i.x;ln = .'f is the most efficient estimator of the parameter T(8) = b'(8) a'(8) = ! 8 The Rao-Cramer bound also holds for discrete type Rvs. Next we give two illustrations of most efficient estimators and point densities of the exponential type. Poisson The Poisson density 8~ t j(x, 8) = e ·II x! = x! exp {x In 8 - 8} X= 0, I •... 312 CHAP. 9 ESTIMATION is of the exponential type with a(8) =In 8 b(8) =8 q(x) 1'(8) = b'(8) = 8 =x a'(8) Hence, the statistic~ q(x;)ln =xis the most efficient estimator of 8, and its variance equals l/na'(8) = 8/n. Note that the corresponding measure of information about 8 contained in the sample X equals nl = na'(8}r'(8) = n/8. Probability If p = P(.s4) is the probability of an event .s4 and x is the zero-one RV associated with .s4, then x is an RV with point density f(x, p) = p'"(l - p) 1- .. = exp {x In p + (I - x) In (I - p)} x = 0, I This is an exponential density with a(p) =In _P_ b(p) =In (I - p) 1-p Hence, the ratio~ q(x) =x T(p) =p x;ln = kin is the most efficient estimator of p. Sufficient Statistics and Completeness We tum now to the problem of determining the best estimator il of an arbitrary parameter 8. 
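Before taking that up, a quick numerical check of the Poisson case above: x̄ is the most efficient estimator of θ, so its variance should equal the bound θ/n = 1/(nI). The sketch below is illustrative; θ = 3 is an assumed value.

```python
import numpy as np

# Poisson(theta): the sample mean is the most efficient estimator of theta,
# with variance equal to the Rao-Cramer bound theta / n.
rng = np.random.default_rng(3)
theta, n, trials = 3.0, 40, 20000

x_bar = rng.poisson(theta, (trials, n)).mean(axis=1)

print("mean of x_bar      :", x_bar.mean())   # close to theta (unbiased)
print("variance of x_bar  :", x_bar.var())    # close to theta / n
print("bound theta / n    :", theta / n)      # 0.075 for these values
```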
If the density /(x, 8) is ofthe exponential type, then il exists, and it equals the most efficient estimator g(X) in (9-92). Otherwise, il might not exist, and if it does, there is no general method for determining it. We show next that for a certain class of densities, the search for the best estimator is simplified. • Dejinition. If the joint density /(X, 8) of the n samples X; is of the form /(X, 8) = H(X)J[y(X), 8] (9-95) where the functions H(X) and y(X) do not depend on 8, the function y(X) is called a sufficient statistic of the parameter 8. From the definition, it follows that if y(X) is a sufficient statistic, then ky(X) or, in fact, any function of y(X) is a sufficient statistic. If f(x, 8) is a density of exponential type as in (9-91), then /(X, 8) = n h(x;) exp {a(8) L q{.t;) - nb(8)} is of the form (9-95); hence, the sum y(X) = ~ q(x;) is a sufficient statistic of8. The importance of sufficient statistics in parameter estimation is based on the following theorem. • S'4fliciency Tlaeortm. If z is an arbitrary statistic and y is sufficient statistic of the parameter 8, then the regression line lll(y) = E{zly} = f. ~C<:Iy)dz is also a statistic; that is, it does not depend on 8. SEC. 9-6 BEST ESTIMATORS AND THE RAO-CRAMh BOUND 3J3 • Proof. To prove this theorem, it suffices to show that the functioni<zjy) does not depend on 8. As we know, I )d _ h·z<Y· z)dydz -_ P{(y. z) ED,.:} (9-96) ' z y zj..(y)dy P{y E DJ . . where D.vz is a differential region and D_,. a vertical strip in the .vz plane (Fig. 9.19a). The trctnsformation y = y(X), z = z(X) maps the region D_,.l of the yz plane on the region A,.: of the sample space and the region D_,. on the region A_,. (Fig. 9.19b). The numerator and the denominator of (9-96) equal the integral of the density f<X. 8) in the regions A_,.; and A,. respectively. In these regions. y(X) = y is constant; hence the term J(y.O) can be taken outside the integral. This yields rsee (9-95)]. P{(y, z) E D,-:} = J(y, 8) 1/(X)dX ~"( J: L,, P{y ED..} = J(y. 8) t. H(X)dX Inserting into (9-96) and canceling J(y, 8), we conclude that the function .f:<zly) does not depend on 8. • Corollary 1. From the Rao-Blackwell theorem (6-58) it follows that if z is an unbiased estimator of 8, then the statistic lJ = E{zly} is also unbiased and its variance is smaller than u~. Thus if we know an unbiased estimator z of 8, using the sufficiency theorem, we can construct another unbiased estimator E{zly} with smaller variance. • Corollary 2. The best estimator of 8 is a function 1/l(y) of its sufficient statistic. • Proof. Suppose that z is the best estimator of 8. If z does not equali/J(y), then the variance of the statistic lJ = E{zly} is smaller than u~. This, however, is impossible; hence. z = 1/l(y). •·igure 9.19 z X spal.'C y (a) tb) 3J4 CHAP. 9 ESTIMATION It follows that to find the best estimator of 8, it suffices to consider only functions of its sufficient statistic y. This simplifies the problem. but even with this simplification, there is no general solution. However, if the density /y(y, 8) of the sufficient statistic y satisfies certain conditions related to the uniqueness problem in transform theory, finding the best estimator simply entails finding an unbiased estimator, as we show next. COMPLETENESS Consider the integral = Q(8) f. q(y)k(y, 8)dy (9-97) where k(y, 8) is a given function of y and 8 is a parameter taking values in a region 8. This integral assigns to any funciton q(y) for which it converges a function Q(8) defined for every 8 in e. 
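The practical content of Corollary 1 is that conditioning an unbiased estimator on a sufficient statistic can only reduce its variance. A minimal illustration, assuming exponential samples with mean μ: z = x1 is unbiased for μ with variance μ², while E{z | x1 + · · · + xn} = x̄ (by symmetry) is unbiased with the much smaller variance μ²/n.

```python
import numpy as np

# Rao-Blackwell illustration for exponential samples with mean mu (assumed value 2).
# Crude unbiased estimator: z = x_1.  Conditioning on the sufficient statistic
# y = x_1 + ... + x_n gives E{z | y} = y / n = x_bar, also unbiased.
rng = np.random.default_rng(4)
mu, n, trials = 2.0, 25, 20000

samples = rng.exponential(mu, (trials, n))
z = samples[:, 0]                       # variance close to mu^2 = 4
rao_blackwell = samples.mean(axis=1)    # variance close to mu^2 / n = 0.16

print("variance of z = x_1         :", z.var())
print("variance of E{z|y} = x_bar  :", rao_blackwell.var())
```

Return now to the function Q(θ) defined by (9-97).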
This function is called the transform of q(y) generated by the kernel k(y, 8). A familiar example is the Laplace transform generated by the kernel e-". We shall say that the kernel k(y, 8) is complete if the transform Q(8) has a unique inverse transform. By this we mean that a specific Q(8) is the transform of one and only one function q(y). We show next that the notion of completeness leads to a simple determination of the best estimator of 8. • Definition. A sufficient statistic y is called complete if its density /y(y, 8) is a complete kernel. • Theorem. If iJ is a function mean equals 8: E{iJ} = then iJ = q(y) of the complete statistic y and its f. q(y)J;.(y, 8)dy =8 (9-98) iJ is the best unbiased estimator of 8. • Proof. Suppose that z is the best unbiased estimator of 8. As we have seen, z = t/l{y); hence, E{z} = f. t/l(y)/y(y, 8)dy =8 (9-99) The last two equations show that 8 is the transform of the functions q(y) and t/J(y) generated by the kernel/y(y, 8). This kernel is complete by assumption; hence, 8 has a unique inverse. From this it follows that q(y) = t/l(y); therefore, iJ = z. These results lead to the following conclusion: If y is a sufficient and complete statistic of 8, then to find the best unbiased estimator of 8, it suffices to find merely an unbiased estimator w. Indeed, starting from w, we form the RV z = E{wly}. This RV is an unbiased function ofy; hence, according to the sufficiency theorem, it is the best estimator of 8. These conclusions are based on the completeness of y. The problem of establishing completeness is not simple. For exponential type distributions, completeness follows from the uniqueness of the inversion of the Laplace SF.C. 9-6 315 BEST ESTIMATORS ANI> TilE RAO-CRAMER BOUND transform. In other cases, special techniques must be used. Here is an illustration. Example 9.20 We are given an RV x with uniform distribution in the interval((). 8) IFig. 9.20a), and we wish to find the best estimator iJ of 8. In this case. 0< f(X. tl) = ;, X~o • • • • Xn < fJ and zero otherwise. This density can be expressed in terms of the maximum z = Xma• and the minimum w = Xmin of X;. Indeed./( X. 8) '- 0 itT w < 0 or z > 8; hence. I /(X. 8> = , U(w)U(8 - <:) (9-100) 8 where U is the unit-step function. This density is of the form 19-95) with y( X) = z; hence. the function y = Xma• = z is a sufficient statistic of 8. and its density (sec Example 7.6) equals j;.(y. 8) = ; y'' 1 0 < y < II (9-101) as in Fig. 9.20b. Next we show that y is complete. It suffices to show that if Q(8) = .!!._ 9" (H q(y)\·11·1 J\• Jo · · then Q<8> has a unique inverse q(y). For this purpose. we multiply both sides by 8" and differentiate with respect to 8. This yields 118 11 - I Q{ 8) + 8"Q' (8) -"- llq( (;I)(J 11 1 Hence. q(8) .._ Q(8) + (81n)Q'C8l is the unique inverse of QlfJ). To complete the determination of iJ. we must find an unbiased estimator of 8. From (9-101) it follows that E{y} ,_ n8/(11- 1). This leads to the conclusion that the statistic • n-1 11+l 6=. n - y = -11- x rna• is an unbiased estimator of 8. And since it is a function of the complete statistic y, it is the best estimator. • Figure 9.20 J;.ly. 0) f,<x.OI y = "'"·~ I 0 t---------. 0 0 (a) X 0 0 I hi 316 CHAP. 9 ESTIMATION Let us tum, finally, to the use of completeness in establishing the independence between the sufficient statistic y and certain other statistics. • Theorem. 
If z is a statistic such that its density is a functionf.(z) that does not depend on 6 and y is a complete statistic of 6, then the RVS z and y are independent. • Proof. It suffices to show that I.<ziy> = f.(::.). The area of the density J;.(y, 6) of y equals I. From this and the total probability theorem (6-31) it follows that fz<z> = f . i:<zly)J;.(y, 6)dy = J". I:<z>J;<y. 6)dy (9-102) The density fz(z) does not depend on 6 by assumption, and the conditional density fz<zly) does note depend on 6 because y is a sufficient statistic of 6 (sufficiency theorem). And since the kemelfv(y, 6) is complete, we conclude · from (9-102) thatfz<zly> = f:(z). Using the properties of quadnttic forms, we have shown in Section 7-4 that the sample mean i and the sample variance s2 of a normal RV are independent. This result follows directly from the theorem just proved. Indeed, i is a complete statistic of the mean of x because its density is normal and a normal kernel is complete. Furthermore. the density of s2 does not depend on 11· Hence, i and s2 are independent. Problems 9-1 9-2 9-3 The diameter of cylindrical rods coming out of a production line is a normal av x with cr = 0.1 mm. We measure n = 9 units and find that the average of the measurements is i = 91 mm. (a) Find c such that with a .95 confidence coefficient, the mean.,., ofx is in the interval x ±c. (b) We claim that.,., is in the interval (90.95, 91.05). Find the confidence coefficient of our claim. The length of a product is an av x with cr = I mm and unknown mean. We measure four units and find that x = 203 mm. (a) Assuming that x is a normal av, find the .95 confidence interval of,.,. (b) The distribution ofx is unknown. Using Tchebycheff's inequality. find c such that with confidence coefficient .95, .,., is in the interval 203 ± c. We know from past records that the life length of type A dres is an av x with cr = 5,000 miles. We test 64 samples and find that their average life length is = 25,000 miles. Find the .9 confidence interval of the mean of x. We wish to determine the length a of an object. We use as an estimate of a the average of n measurements. The measurement error is approximately normal with zero mean and standard deviation 0.1 mm. Find n such that with 95% confidence, xis within ±0.2 mm of a. An object of length a is measured by two persons using the same instrument. The instrument error is an N(O, cr) av where cr = I mm. The first person x 9-4 9-5 x PROBLEMS 9-6 9-7 9-8 9·9 9-10 9·11 9·12 9-13 9-14 9·15 317 measures the object 25 times, and the average of the measurements is i = 12 mm. The second person measures the object 36 times. and the average of the measurements is y = 12.8. We use as point estimate of a the weighed average c = ai - by. (a) Find a and b such that c is the minimum-variance unbiased estimate of cas in (7-37). (b) Find the 0.95 confidence interval of c. In a statewide math test, 11 students obtained the following scores: 49 57 64 72 75 77 78 79 81 81 82 84 85 87 89 93 96 Assuming that the scores are approximately normal, find the .95 confidence interval of their mean (a) using (9-10); (b) using (9-12). A grocer weighs 10 boxes of cereal, and the results yield i = 420 g and s = 12 g for the sample mean and sample standard deviation respectively. He then claims with 95% confidence that the mean weight of all boxes exceeds c g. Assuming normality, find c. The RV xis uniformly distributed in the intervaltJ - 2 < x < 8 + 2. We observe 100 samples x 1 and find that their average equals i = 30. 
Find the .95 confidence interval of 8. Consider an av x with density /(x) = xe-xU(x). Predict with 95% confidence that the next value of x will be in the interval (a, b). Show that the length b -a of this interval is minimum if a and h are such that /(a) =/(b) P{a < x < h} = .95 Find a and b. (Estimation-prediction). The time to failure of electric bulbs of brand A is a normal av with u = 10 hours and unknown mean. We have used 20 such bulbs and have observed that the average i of their time to failure is 80 hours. We buy a new bulb of the same brand and wish to predict with 95% confidence that its time to failure will be in the interval 80 ::!: c. Find c. The time to failure ofan electric motor is an RV x with density ~e-IJxU(x). (a) Show that if i is the sample mean of n samples of x, then the av 2ni/~ has a x2(2n) distribution. (b) We test n = 10 motors and find that i = 300 hours. Find the left .95 confidence interval of~· (c) The probability p that a motor will be good after 400 hours equals p = P{x > 400} = t>- 400#. Find the .95 confidence interval p >Po of p. Suppose that the time between arrivals of patients in a dentist's office constitutes samples of an RV x with density 8e" 8'U(x). The 40th patient arrived 4 hours after the first. Find the .95 confidence interval of the mean arrival time 11 = 118. The number of particles emitted from a radioactive substance in I second is a Poisson-distributed av with mean A. It was observed that in 200 seconds, 2,550 particles were emitted. Assuming that the numbers of particles in nonoverlapping intervals are independent, find the .95 confidence interval of A. Among 4,000 newboms,,2,080 are male. Find the .99 confidence interval of the probability p = P{male}. In an exit poll, of 900 voters questioned, 360 responded that they favor a particular proposition. On this basis, it was reported that 40% of the voters favor the proposition. (a) Find the margin of error if the confidence coefficient of the results is .95. (b) Find the confidence coefficient if the margin of error is ::!:2%. 318 CHAP. 9 ESTIMATION 9-16 In a market survey, it was reported that 29Ck of respondents favor product A. The poll was conducted with confidence coefficient .95, and the margin of error was ±4%. Find the number of respondents. 9-17 We plan a poll for the purpose of estimating the probability p of Republicans in a community. We wish our estimate to be within ..t.02 of p. How large should our sample be if the confidence coefficient of the estimate is .95? 9-18 A coin is tossed once, and heads shows. Assuming that the probability p of heads is the value of an av p uniformly distributed in the interval (0.4. 0.6) find its Bayesian estimate (9-37). 9-19 The time to failure of a system is an avx with densityf(x, 8) = 8e- 11'U(x). We wish to find the Bayesian estimate 8 of 8 in terms of the sample mean i of the n samples x1 of x. We assume that 8 is the value of an av 8 with prior density /,(8) = ce-•'BU(8). Show that • n I 8=----(' + ni ,....,. x 9-20 The av x has a Poisson distribution with mean 8. We wish to find the Bayesian estimate 8 of 8 under the assumption that 8 is the value of an av 8 with prior density /,(8) - 8be-< 1U(8). Show that . ni+b+l n-c 8=---- 9-21 Suppose that x is the yearly starting income of teachers with a bachelor's degree andy is the corresponding income of teachers with a master's degree. We wish to estimate the difference 11.• - 11~· of their mean incomes. 
We question n = 100 teachers with a bachelor's degree and m = 50 teachers with a master's degree and find the following averages: 9-22 9-23 9-24 9-25 9-26 i = 20K j = 24K s, = 3.1 K .f~ = 4K Assuming that the avs x and y are normal with the same variance, find the .95 confidence interval of .,.,, - 11...Suppose that the IQ scores of children in a certain grade are the samples of an N(71, u) av x. We test 10 children and obtain the following averages: i = 90, s = 5. Find the .95 confidence interval of 11 and of u. The avs X; are i.i.d. and N(O, u). We observe that + • • · + .do = 4. Find the .95 confidence interval of u. The readings of a voltmeter introduces an error "with mean 0. We wish to estimate its standard deviation u. We measure a calibrated source V = 3 volts four times and obtain the values 2.90, 3.15, 3.05, and 2.96. Assuming that "is normal, find the .95 confidence interval of u. We wish to estimate the correlation between freshman grades and senior grades. We examine the records of 100 students and we find that their sample correlation coefficient is;= .45. Using Fisher's auxiliary·variable, find the 0.9 confidence interval of r. Given then paired samples (x;, y1) of the avs x andy, we form their sample means i and y and the sample covariance xt • = n _1~ -) -) 1£11 ~ (x; - x Cy; - y 1 Show that E{liu} = #£11. PROBLEMS 3)9 9-27 (a) (Cauchy-Schwarz inequality). Show that 11 ( LI a;h; i· (b) Show that if .. r2 ,. = 2 )~ n "'" ~ af i (X;- .t)(y;- (x;- I n LI hi i- _v;r _,L (y; ·· yl· _, ,;; s I then xl· 9-28 Given the 16 samples 93 75 40 73 61 42 68 64 78 54 87 H4 71 49 72 58 of the RV ¥.find the probability that its median is between 68 and 75 (a) exactly from (9-60): (b) approximately from (9-61). 9-29 The avs y 1• • • • • y~ are the order statistics of the five samples ¥ 1, • • • • x~ ofx. Find the probability P{y 1 < x,. < y~} that the u-percentile x, oh is between Y1 and y~ for u = .5 • .4. and .3. 9-30 We use as estimate of the median x~ of an RV ¥the interval <y4 • Y4. 1) between the order statistics closest to the center of the range <y 1 • Ynl: k :.,: n/2 s k ... I. t.;sing (9-611. show that X;= ... 2 P{y4 1<-~.~<y,.l},.. 9-31 \.1r11 We use as the estimate of the distribution Fix 1 of an RV x the empirical distribution f"(x) in (9-67) obtained with 11 samples X;. We wish to claim with 90% confidence that the unknown F<x) is within .02 frum F(x). Find n. The RV x has the F.rlang density j'(.r 1 - c·4.\' 3c• ..., VI x ). We observe the samples x; = 3.1, 3.4. 3.3. Find the ML estimate(' of c-. The RV x has the truncated exponential density f(xl =- c·e-•c.•-·'"1U(x - x0 ). Find the ML estimate c of c- in terms of the 11 samples .\'; of x. The time to failure of a bulb is an av x with density ce · '• U< x). We test 80 bulbs and find that 200 hours later. 62 of them are still good. Find the ML estimate of c:. The av x has a Poisson distribution with mean 8. Show that the ML estimate of 8 equals .'t. Show that if L(x, 8) = ln.f<x. 8) is the likelihood function of an RV "·then = 9-32 9-33 9-34 9-35 9-36 E{la'-~:· 8~} = -Er'~~;~~- o} 9-37 The time to failure of a pump is an RV x with density lie• 8·'U(x). (a) Find the information I about 8 contained in x. (b) We have shown in Example 9-24 that the ML estimator iJ of 8 equals 1/i. Show that for, > 2 E{ iJ} = _!!!!_ u~ - ~~~~ 9 n - I - <n - I )2(n - 2) 9-38 Show that if y is the best estimator of a parctmcter 8 0 and z is an arbitrary statistic with zero mean, then y and z arc uncorrelated. 
9-39 The RV" has the gamma density 82xe- 1"U(:c). Find the best estimator of the parameter 118. * 320 CHAP. 9 ESTIMATION 9-40 The av x has Weibull distribution x•-le-•''1U(81. Show that the most efficient estimate of 8 is the sum IJ = _!._ ~ x' n8~ ' 9-41 Show that if the function/(x. 8) is ofthe exponential type as in (9-91), the ML estimator of the parameter 1' = b'(8)/a'(8) is the best estimator of 7'. 9-42 Show that ifx;are the samples of an rv x with density e-••-'1U(x). the RV w = min X; is a sufficient and complete statistic of 8. 9-43 Show that if the density f(x, 8) has a sufficient statistic y (see (9-95)) and the density /y(y, 8) ofy is known for 8 = 80 , then it is known for every 8. 9-44 If/(x) = 8e- 11 U(x) and n = 2, then the sum y = x 1 + x2 is a sufficient statistic of 8. Show that if z = x 1, then E{zly} = y/2, in agreement with the sufficiency theorem. 9-45 Suppose that Xi are the samples of anN(.,, 5) av x. (a) Show that the sum y = x1 + · · · + X11 is a sufficient statistic of.,. (b) Show that if a; are n constants such that a 1 + · · · + a11 = 0 and z = a 1x1 .... • • • + a11x11 , then the avs y and z are independent. 10 _ _ __ Hypothesis Testing Hypothesis testing is part of decision theory. It is based on statistical considerations and on other factors. often subjective, that are outside the scope of statistics. In this chapter. we deal only with the statistical part of the theory. In the first two sections. we develop the basic concepts using commonly used parameters, including tests of means, variances. probabilities, and distributions. In the next three sections, we present a variety of applications, including quality control, goodness-of-fit tests, and analysis of variance. The last section deals with optimality criteria. sequential testing. and likelihood ratios. 10-1 General Concepts A hypothesis is an assumption. A statistical hypothesis is an assumption about the values of one or more parameters of a statistical model. Hypothesis testing is a process of establishing the validity of a hypothesis. In hypothesis testing, we arc given an RV x modeling a physical quantity. The distribution ofx is a function F(x, 8) depending on a parameter 8. We wish to test the 321 322 CHAP. 10 HYPOTHESIS TESTING hypothesis that 8 equals a given number 80 • This problem is fundamental in many areas of applied statistics. Here arc several illustrcltions. I. We know from past experience that under certain experimental conditions, the parameter 8 equals 80 • We modify various factors of the experiment. and we wish to establish whether these modifications have any effect on the value of 8. The modifications might be intentional (we try a new fertilizer), or they might be beyond our control (undesirable changes in a production process). 2. The hypothesis that 8 = 80 might be the result of a theory to be verified. 3. The hypothesis might be a standard that we have established (expected productivity of a worker) or a desirable objective. Terminology The assumption that 8 = 8o will be denoted by H 0 and will be called the null hypothesis. The assumption that 8 :/= 80 will be denoted by H 1 and will be called the alternative hypothesis. The set of values that 8 might take under the alternative hypothesis will be denoted by 9 1 • If 9 1 consists of a single point 8 = 81 , the hypothesis H 1 is called simple: otherwise, it is called composite. Typically. 9 1 is one of the following three sets: 8 :/= 80 , 8 > 80 • or 8 < 80 • The null hypothesis is in most cases simple. 
THE TEST  The purpose of hypothesis testing is to establish whether experimental evidence supports rejecting the null hypothesis. The available evidence consists of the n samples X = [x1, . . . , xn] of the RV x. Suppose that under the null hypothesis the joint density f(X, θ0) of the samples xi is negligible in a certain region Dc of the sample space, taking significant values only in the complement Da of Dc. It is reasonable then to reject H0 if X is in Dc and to accept it if X is in Da. The set Dc is called the critical region of the test, and the set Da is the region of acceptance of H0. The terms "accept" and "reject" will be interpreted as follows: If X ∈ Dc, the evidence supports the rejection of H0; if X ∉ Dc, the evidence does not support the rejection of H0.

The test is thus specified in terms of the set Dc. The choice of this set depends on the nature of the decision errors. There are two types of errors, depending on the location of the observation vector X. We shall explain the nature of these errors and their role in the selection of the critical region of the test.

Suppose, first, that H0 is true. If X ∈ Dc, we reject H0 even though it is true (a Type I error). The Type I error probability is denoted by α and is called the significance level of the test. Thus

α = P{X ∈ Dc | H0}    (10-1)

The difference 1 − α equals the probability that we accept H0 when it is true.

Suppose, next, that H0 is false. If X ∉ Dc, we accept H0 even though it is false (a Type II error). The Type II error probability depends on the value of θ. It is thus a function β(θ) of θ called the operating characteristic (OC) of the test:

β(θ) = P{X ∉ Dc | H1}    (10-2)

The difference

P(θ) = 1 − β(θ) = P{X ∈ Dc | H1}    (10-3)

equals the probability that we reject H0 when it is false. The function P(θ) is called the power of the test. For brevity, we shall often identify the two types of errors by the expressions α error and β(θ) error, respectively.

To design an optimum test, we assign a value to α and select the critical region Dc so as to minimize the resulting β. If we succeed, the test is called most powerful. The critical region of such a test usually depends on θ. If it happens that a most powerful test is obtained with the same Dc for every θ ∈ Θ1, the test is called uniformly most powerful.

Note  Hypothesis testing belongs to decision theory. Statistical considerations lead merely to the following conclusions:

If H0 is true, then P{X ∈ Dc} = α; if H0 is false, then P{X ∉ Dc} = β(θ)    (10-4)

Guided by this, we reach a decision: Reject H0 iff X ∈ Dc    (10-5)

This decision is not based only on (10-4). It takes into account our prior knowledge concerning the validity of H0, the consequences of a wrong decision, and possibly other, often subjective, factors.

Test Statistic  The critical region is a set Dc in the sample space. If it is properly chosen, the test is most powerful; however, its determination involves a search in the n-dimensional sample space. We shall use a simpler approach. Prior to any experimentation, we select a function g(X) and form the RV q = g(X). This RV will be called the test statistic. The function g(X) may depend on θ0 and on other known parameters, but it must be independent of θ; only then is g(X) a known number for a specific X. The test of a hypothesis involving a test statistic is simpler: The decision whether to reject H0 is based not on the value of the vector X but on the value of the scalar q = g(X).
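To make the definitions of α and the power 1 − β(θ) concrete, here is an illustrative computation (not from the text) for testing H0: η = η0 against H1: η > η0 when x is N(η, σ) with σ known and the critical region is x̄ > c. The values η0 = 0, σ = 1, n = 25, and α = .05 are assumptions chosen only for the example.

```python
import numpy as np
from scipy.stats import norm

# Type I error and power for H0: eta = eta0 vs H1: eta > eta0,
# x ~ N(eta, sigma) with sigma known, critical region x_bar > c.
eta0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
se = sigma / np.sqrt(n)

c = eta0 + norm.ppf(1 - alpha) * se          # P{x_bar > c | H0} = alpha
print("critical value c:", c)

for eta in (0.0, 0.2, 0.4, 0.6):
    power = 1 - norm.cdf((c - eta) / se)     # P{x_bar > c | eta} = 1 - beta(eta)
    print(f"eta = {eta:.1f}   power = {power:.3f}")

# At eta = eta0 the rejection probability equals alpha; it rises toward 1
# as eta moves away from eta0.
```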
To test the hypothesis H0 using a test statistic, we find a region Rc on the real line, and we reject H0 iff q is in Rc. The resulting error probabilities are

α = P{q ∈ Rc | H0} = ∫_{Rc} fq(q, θ0) dq
β(θ) = P{q ∉ Rc | H1} = ∫_{Ra} fq(q, θ) dq

where Ra is the region of acceptance of H0. The density fq(q, θ) of the test statistic can be expressed in terms of the function g(X) and the joint density f(X, θ) of the samples xi [see (7-6)]. The critical region Rc is determined as before: We select α and search for a region Rc minimizing the resulting OC function β(θ).

In the next section, we design tests based on empirically chosen test statistics. In general, such tests are not most powerful no matter how Rc is chosen. In Section 10-6, we develop the conditions that a test statistic must satisfy in order that the resulting test be most powerful (Neyman-Pearson criterion) and show that many of the empirically chosen test statistics meet these conditions.

10-2 Basic Applications

In this section, we develop tests involving the commonly used parameters. The tests are based on the value of a test statistic q = g(X). The choice of the function g(X) is more or less empirical; optimality criteria are developed in Section 10-6. In the applications of this section, the density fq(q, θ) of the test statistic has a single maximum at q = qmax. To be concrete, we assume that fq(q, θ) is concentrated on the right of qmax if θ > θ0 and on its left if θ < θ0, as in Fig. 10.1.

(Figure 10.1 shows the density fq(q, θ0) and the critical region for each alternative: (a) θ ≠ θ0, (b) θ > θ0, (c) θ < θ0.)

Our problem can be phrased as follows: We have an RV x with distribution F(x, θ). We wish to test the hypothesis θ = θ0 against one of the alternative hypotheses θ ≠ θ0, θ > θ0, or θ < θ0. In all three cases, we shall use the same test statistic. To carry out the test, we select a function g(X). We form the RV q = g(X) and determine its density fq(q, θ). We next choose a value for the α error and determine the critical region Rc. We compute the resulting OC function β(θ). If β(θ) is unacceptably large, we increase the
IO.Ib, and the corresponding a error equals the area of the right tail of /q(q, 80 ): a = f' fq(q, 8o)dq c = q, a (10-7b) The OC function {3(8) equals the area of/q(q, 8) in the region q <c. 3. Suppose that H 1 is the hypothesis 8 < 80 • The critical region is the half-line q < c of Fig. IO.lc, and the a error equals a= J:.. .t;,(q, 8o)dq c = qa (10-7c) The OC function is the area of /q(q, 8) in the region q > c. Summary To test the hypothesis 8 = 80 against one of the alternative hypotheses 6 :/: 80 , 8 > 80 , or 8 < 80 • we proceed as follows: 1. Select the statistic q = g(X) and determine its density fq(q, 8). 2. Assign a value to the significance level a and determine the critical region R,. for each case. 3. Observe the sample X and compute the function q = g(X). 326 CHAP. 10 HYPOTHESIS TESTING 4. Reject H 0 itT q is in the region R,.. 5. Compute the OC function fj((J) for each case. Here is the decision process and the resulting OC function fj(6) for each case. The numbers qu are the u-percentiles of the test statistic q under hypothesis Ho. Accept H 0 itT c; s q s c2 H, : 6 =I= 6o C2 = ql-o/2 fj((J) H 1 : 6 > 60 .. (10-Sa) (}) dq Accept H 0 iff q s c fj((J) H 1 : 6 < 60 = f.C1 /q(q, = f.. = f = q 1 ,. (10-Sb) /q(q. (J) dq Accept H0 itT q fj((J) c ~ c· c = qa (10-8c) /q(q. (J)dq Notes I. The test of the simple hypothesis 8 = 80 against the alternative simple hypothesis 8 = 81 > 80 is a special case of ( I0-8b); the test against 8 = 81 < 80 is a special case of (10-8c). 2. The test of the compositt> null hypothesis 8 s 80 against the alternative 8 > 80 is identical to (10-8b); similarly, the test of the composite null hypothesis 8 =:: 80 against 8 < 80 is identical to (10-8c). For all three tests, the constant a is the Type I error probability only if 8 = 80 when H 0 is true. 3. For the determination of the critical region of the test, knowledge of the density f 9(q, 8) for 8 = 80 is necessary. For the determination of the OC function {J(8), knowledge of/9 (q, 8) for every 8 E 0 1 is necessary. 4. If the statistic q generates a certain test, the same test can be generated by any function of q. 5. We show in Section 10-6 that under the stated conditions. all tests in (10-8) are most powerful. Furthermore, the test!i against 8 > 80 or 8 < 80 are uniformly most powerful. This is not true for the test against 8 :1: 80 • For a specific 8 > 80 , for example. the critical region q > c yields a Type II error smaller than the integral in (10-8a). 6. The OC function {J{8) equals the Type II error probability. If it is too large, we increase a to its largest tolerable value. If {J(8) is still too large, we increase the number n of samples. 7. Tests based on (10-8) require knowledge of the perc·c•ntiles q, of the test statistic q for various values of u. However, all tests can be carried out in terms of the distribution Fq(q. 8) of q. Indeed. suppose that we wish to test the hypothesis 8 = 80 against 8 :/: 80 • From the monotonicity of distributions it follows that SF.C. 10-2 BASIC APPI.ICATIONS 327 Hence, the test (2-2a) is equivalent to the following test: Determine the function F.,(q, 80 ) where q is the observed value of the test statistic Accept //11 itT~ < F.,lq. flul < I - ~ This approach is used in tests based on computer simulation (see Section 8-3). Me• em x with mean .,, and we wish to test the hypothesis 1/o: 71 = Tlo against H,: 71 =/:. Till• 71 > .,,, or 71 < Tlo We have an RV Assuming that the variance of x is known. 
we use as the test statistic the Till u!V'n ll - q = RV (10-9) where i is the sample mean of x. With the familiar assumptions, the RV i is N(TJ, u!Vn ). From this it follows that under hypothesis //0 , the test statistic q is N(O, 1); hence, its percentile q, equals the u-percentile z, ,lfthe standard normal distribution. Setting q, = z, in (I 0-8), we obtain the critical regions of Fig. 10.2. To find the corresponding OC functions. we must determine the density of q under hypothesis 1/1 • In this case. i is N(TJ. u!Vn ). and q is N(.,.,. 1) where ., - Till (10-10) ., = - - Since z1_, " -z,. u!Yn = (10-8) yields H,: 71 =/:. Tlo Accept 1/o iff.:..,~< q.,:;: -za•2 {J(TJ) = G(-z..,2- .,.,) - G(z.. -~- .,.,, H,: 71 > Tlo Accept Ho iff q {J(TJ) = G<-z.. - .,.,> !:i ~•-·• (10-Jia) (10-llb) t'ipre 10.1 I- a 17<1'1u l'lo 1'1 l'lo 328 CHAP. 10 HYPOTHESIS TESTING Accept Ho iff q ~ Za J3(71) = I - G(z.. - .,,,) The OC functions J3(71) are shown in Fig. 10.2 for each case. H,: 71 <'rio Example 10.1 (10-llc) We receive a gold bar of nominal weight 8 oz .• and we wish to test the hypothesis that its actual weight is indeed 8 oz. against the hypothesis that it is less than 8 07.. To do so. we measure the bar 10 times. The results of the measurements are the values X;= 7.86 7.90 7.93 7.95 7.96 7.97 7.98 8.01 8.02 8.04 of the RV x = 71 + "where 71 is the actual weight of the bar and " is the measurement error, which we assume normal with zero mean and cr = 0.1. The test will be performed with confidence level a= .05. In this problem. Ho: 71 = 71o = 8 H 1 : 71 < 8 CT x- 71o .'i = 7.96 Vn = 0.032 q = - - = ··1.25 Zo = -1.645 cr/Vn Since -1.25 is not in the critical region q < -1.645. we accept the null hypothesis. The resulting OC function equals ~(71) = I - G ( - 1.65 - ~.C~3 :) If the variance of x is unknown, we use as test statistic the q • RV =i -'riO (10-12) s/Yn where s2 is the sample variance of x. Under hypothesis H 0 • this RV has a t(n - 1) distribution. We can therefore use (10-8), provided that we set q,. equal to the t,.(n - 1) percentile. To find J3(71), we must determine the distribution of q for ., =I= 'rio. This is a noncentral Student t distribution introduced in Chapter 7 [see (7A-9)]. Example 10.2 The mean 71 of past SAT scores in a school district equals 560. A class of 25 students is taught by a new instructor. The students take the test, and their scores x1 yield a sample mean i = 569 and a sample standard deviation s = 30. Assuming normality, test the hypothesis 71 = 560 against the hypothesis 71 :/: 560 with significance level a = .05. In this problem. Ho: 71o = 560 H,: 71 :/: 71o 569 560 q= = I5 lt-o/2(24) = 1.97~.(24) = 2.06 = -l.cm 30/v'B . Thus q is in the interval (-2.06, 2.06) of aceeptance of H0 : hence. we accept the hypothesis that the average scores did not change. SEC. 10-2 BASIC APPLICATIONS 329 EQt.;AUTY o•· TWO M•:ANS We have two normal Rvs x andy, and we wish to test the hypothesis that their means are equal: Ho: TJx = TJ~· (10-13) As in the problem of estimating the difference of two means, this test has two aspects. Paired Samples Suppose, first, that both avs can be observed at each trial. We then have n pairs of samples (x;, Y; ). as in (9-39). In this case. the RV w=x-y is also observable, and its samples equal w; = x;- y;. Under hypothesis H 11 , the mean.,.,. = Tl.t - TJ~· of w equals 0. Hence, (10-13) is equivalent to the hypothesis H 0 : ., .. = 0. We thus have a special case of the test of a mean considered earlier. 
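Since the paired-sample test reduces to the single-mean test, the computation is the same as in (10-9) and (10-12). As a concrete instance, the sketch below (illustrative, using SciPy only for the t percentile) reproduces the arithmetic of Example 10.2.

```python
import numpy as np
from scipy.stats import t

# Example 10.2: H0: eta = 560 vs H1: eta != 560, variance unknown.
n, x_bar, s, eta0, alpha = 25, 569.0, 30.0, 560.0, 0.05

q = (x_bar - eta0) / (s / np.sqrt(n))      # test statistic (10-12)
c = t.ppf(1 - alpha / 2, df=n - 1)         # t percentile t_{1-alpha/2}(n-1)

print(f"q = {q:.2f}, acceptance interval = ({-c:.2f}, {c:.2f})")
print("accept H0" if abs(q) <= c else "reject H0")
# q = 1.50 lies inside (-2.06, 2.06), so H0 is accepted, as in the text.
```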
Proceeding as in ( 10-9), we form the sample mean w = i - y and use as test statistic the ratio w u.,.tvn = i-y u...tvir where IT;.. = cr~ + cr~ - 2p...- (10-14) Under hypotheses Ho, this RV is N(O, I); hence. the test of //0 against one of the alternative hypothesis ., ... =I= 0 • ., ... > 0 . ., ... < 0 is specified by ( 10-11) provided that we replace q by the ratio Vntr - .v>hr. . and TJq by Vn(TJ, - TJ~· )/ u .... If u.., is unknown, we use as test statistic the ratio w (10-15) s.,../Vn where s~ is the sample variance of the RV w, and proceed as in (10-12). Example 10.3 The RVS x andy are defined on the same experiment 9'. We know that ax = 4, u 1 = 6, and #Lxy = 13.5, and we wish to test the hypothesis.,, = .,_.against the hypothesis TJx :/: TJr Our decision will be based on the values of the 100 paired samples (x~o Y;). We average these samples and find i = 90.7, y = 89. In this problem, n = 100, I - 01/2 = .915, u~ Since 3.4 > z.97, = 36 + 16 - 27 = 25 w=l.l 1\' --=3.4 u .. = 2, we reject the hypothesis that .,. = TJ~·. • tvn Sequential Samples We assume now that the Rvs x andy cannot be sampled in pairs. This case arises in applications involving Rvs that are observed under different experimental conditions. In this case, the available observations are n + m independent samples, as in (9-41). Using these samples, we form their sample· means x, y and the RV w= i - y (10-16) 330 CHAP. 10 HYPOTHESIS TESTING Known Variances Suppose, first, that the parameters CTx and known. To test the hypothesis Tl.t = Tlv. we form the test statistic - where q =wCTw If Ho is true, then Tlw = Tl.t - Tl.v therefore use ( I 0-11 ) with = 0; ., .+ ~ CT( CT...,=" n hence, the - CT,. , CT; ~ (10-17) m RV are · q is N(O, I). We can .,,. Tlq=~ CT;;: Example 10.4 A radio transmitter transmits a signal offrequency '1· We wish to establish whether the values"" and.,, of., in two consecutive days are different. To do so, we measure '1 20 times the first day and 10 times the second day. The average of the first day's readings x1 = "" + v1 equals 98.3 MHz, and the average of the second day's readings y1 = .,, + v1 equals 98.32 MHz. The measurement errors v1 and v1 are the samples of a normal RV "with.,, = 0 and u, = 0.04 MHz. Based on these measurements, can we reject with significance level .OS the hypothesis that "" = .,,. ? In this problem, n = 20, m = 10, z. 97~ = 2, x = 98.3, y = 98.32. uw = CT, J; + ~ = 0.015 w= 0.02 w -=1.3 uw Since 1.3 is not in the critical region (- 2. 2), we accept the null hypothesis. • Unknown Y IU'iances Suppose that the parameters CT., and a,. are unknown. We shall carry out the test, under the assumption that CT, = U'\· = CT. As we have noted in Section 9-2, only then is the distribution of the test statistic independent of CT. Guided by (9-47), we compute the sample variances s~ and s~ of x and y, respectively, and form the test statistic = (n-l)s~+(m-l)s;(l -n + -mI) (IO-I 8) n + m- 2 is the estimate of the unknown variance CT~ of the RV w= i - y [see (9-46)]. q w = -. u-. h .~ w ere CT~ Under the null hypothesis, this statistic is identical to the estimator in (9-47) because if Tlx = Tlv, then Tlw = 0. Therefore, its density is t(n + m - 2). This leads to the following test: Form the n + m samples x; and y1 : compute i, s~. s~; compute a-,.. from (10-18); use the test (10-11) with .v. X- Y q = -.CT-. 'rlq Replace the normal percentile Zu by t,.(n Example 10.5 = Tlx. 'rlv CT;;: +m - 2>. The avs x andy model the IQ scores of boys and girls in a school district. 
We wish to compare their means. To do so, we examine 40 boys and 30 girls and find the following averages: Boys: x = 95, s" = 6 Girls: y = 93, s,. = 8 SEC. 10-2 BASIC APPLICATIONS 331 Assuming that the avs x and y are normal with equal variances, we shall test the hypothesis H0 : TJ., =.,,.against H1 : TJ., :1= .,.,_.with a = .05. In this problem. ·n = 40. m = 30, t Y7~(6M) = :.111< ...,. :!. ~t v ... = 39s~ + 29s~ ( ..!._ ~ .! ) = 1 LA~ 68 40 30. ·'" q = 95 93 = 1 , 2 ·- 1.64 Since 1.22 < 2. we accept the hypothesis that the mean IQ scores of boys and girls are equal. • Mean-dependent Variances The preceding results of this section cannot be used if the distributions of x andy depend on a single parameter. In such problems, we must develop special techniques for each case. We shall consider next exponential distributions and Poisson distributions. Exponential Distributions Suppose that I H, r.( \-') = .... (' \' Ho(j( \') J• • . In this case, 1)., = 60 and.,,.= 8 1 : hence, our problem is to test the hypothesis Hn: H1 = 60 against /1 1 : 8.1 + 60 • Clearly. the RVS xlfJ11 and y/8 1 have a x 2(2) distribution. From this it follows that the sum ni/fJo of the" samples x;/00 has a x~(2n) distribution and the sum my/8 1 of them samples y;IH 1 has a x 2(2m) distribution (see (7-87)). Therefore, the mtio ni12n80 81i my/2m8, = Boy has a Snedecor F(n. m) distribution. We shall use as test statistic the ratio q = i/y. Under hypothesis H 0 , q has an F{n, m) distribution. This leads to the following test. I. 2. 3. Example 10.6 Form the sample means i and f and compute their ratio. Select a and find the a/2 and I - a/2 percentiles of the F(n, m) distribution. Accept Ho iff Fo12(n, m) < ily < Ft-odn. m). We wish to examine whether two shipments of transistors have the same mean time to failure. We select 9 units from the first shipment and 16 units from the second and find that ·"• + · · · + x 11 ::. 324 hours _v 1 + · · · 1 -"'" = 380 hours Assuming exponential distributions. test the hypothesis 8 1 = 611 against fl 1 :/: flo with u=.l. In this problem, F. 11~(9, 16) =- 2.54 F. 11 ~(9, 16) = 0.334 .f = 36 y ""' 23.75 Since .'il.v = 1.52 is between 0.334 and 2.54. we accept the hypothesis that fl 1 = fl0 • • 332 CHAP. 10 HYPOTHESIS TESTING Probabilities Given an event ~ with probability p = P(~). we shall test the hypothesis p = Po against p :/= Po, p > Po, or p < Po, using as test statistic the number k of successes of~ inn trials. We start with the first case: Ho: p =Po against H,: p ?=Po (10-19) As we know, k is a binomially distributed RV with mean np and variance npq. This leads to the following test: Select the consistency level a. Compute the largest integer k 1 and the smallest integer k2 such that ~ (~) PAqs-• < I Determine the number k of successes .~, (~) pAqs-' < I of~ <t0-2o> inn trials. Accept Ho iff k, s k s kl. The resulting OC function equals p(p) = P{k 1 s k s k2IHd = t (~) ·-·· p'q"-' For large n, we can use the normal approximation. This yields kt = npo +Zan.~ k2 = npo- Zan.Vnpoqo p(p) = G(k2 - np) _ G(k' - np) vnpq Vnpq One-sided Tests We shall now test the hypothesis Ho: p = Po against H,: p >Po The case against p < p 0 is treated similarly. We determine the smallest integer k2 such that (10-2 l) (10-22) i (~) pAq8 ·' < a 4 -4! and we accept Ho itT k < k2 • For large n, k2 = npo + Zt-o Vnpoqo p(p) = G (k~) npq. 
(10-23) Note that (10-22) is equivalent to the test of the composite hypothesis H(,: p s Po against Hj: p >Po· Example 10.7 We toss a given coin 100 times and observe that heads shows 64 times. Does this evidence support rejection of the fair-coin hypothesis with consistency level .OS? In this problem, np0 =SO, Vnpoqo = S, z.o25 = -2; hence, the region of acceptance of H0 is the interval npo ± 2Vnpoqo ==SO± 10 Since 64 is outside this interval. we reject the fair-coin hypothesis. The resulting OC SEC. 10-2 BASIC APPI.ICATIONS 333 function equals ~(p) = (; ( ~ .~ lOOp) _ G ( 40 -· ~I') IOVpq lliVpq • Rare l<:vents Suppose now that p11 << I. If n is so large that 11p11 >> I, we can use (10-21). We cannot do so if np0 is of the order of I. In this case, we use the Poisson approximation 4qt· . = (n) k P 4 >.4 (,-A _ A -· liP k! ( 10-24) Applying (10-24) to (10-19) and (10-20). we obtain the following test. Set Ao = npo and compute the largest integer k 1 and the smallest integer k2 such that ( 10-25) Accept Ho iff k 1 s: k !5 k2 • The resulting OC function equals ~(A) = (' 4) L A' (10-26) k• The one-sided alternatives p > Po and p < p11 lead to similar results. Here is an example. A --:-j 4 4o Example 10.8 A factory has been making radios using process A. Of the completed units, 0.6% are defective. A new process is introduced. and of the first 1,000 units, 5 are defective. Using a = 0.1 as the significance level. test the hypothesis that the new process is better than the old one. In this problem. n = 1,000, Ao = 6, and the objective is to test the hypothesis 110 : p = p0 = .006 against H 1 : p < p0 • To do so, we determine the largest integer k1 such that (10-27) and we accept H0 iff k > k1 • In our case. k = 5. a hence. we accept the null hypothesis. • = .I, and (10-27) yields k1 = 2 < 5: EQUALITY OF TWO PROBABII.ITIES In the foregoing discussion, we assumed that the value p0 of p under the null hypothesis was known. We now assume that p 0 is unknown. This leads to the following problem: Given two events .s4o and sf 1 with probabilities p 0 = P(.s40 ) and p, = P(.slft), respectively, we wish to test the hypothesis Ho: p, =Po against H,: p, =I= Po· To do so, we perform the experiment no + n, times, and we denote by ko the number of successes of .s4o in the first no trials and by k1 the number of successes of .s41 in the following n 1 trials. The observations are sequential; hence, the Rvs ko 334 CHAP. 10 HYPOTHESIS TESTING and k 1 are independent. Sequential sampling is essential if :A0 models a physical event under certain experimental conditions (a defective component in a manufacturing process. for example), and .~ 1 models the same physical event under modified conditions. We shall use as our test statistic the RV ko kl q =- - (10-28) no n1 and, to simplify the analysis, we shall assume that the samples are large. With this assumption, the RV q is normal with "''Jq CT2 =Po- P1 = poqo + P1q1 no q n1 (10-29) Under the null hypothesis, Po =PI; hence, "''Jq =0 , = poqo ( -1 + -I ) no n1 (10-30) CTq This shows that we cannot use q to determine the critical region of the test because the numbers Po and qo are unknown. To avoid this difficulty, we replace in (10-30) the unknown parameters p 0 and q0 by their empirical estimates Po and q0 obtained under the null hypothesis. To find Po and qo, we observe that if Po= p 1 • then the events .s40 and .s4 1 have the same probability. 
We can therefore interpret the sum k0 + k 1 as the number of successes of the same event .s40 or .s4 1 in n0 + n 1 trials. This yields the empirical estimates A - ko + kl - I - . Po - no + nl - qo A2 CT q = Poqo A A ( 1 I) - +no n1 (10-31) where d'q is the corresponding estimate of CTq· Thus under the null hypothesis, the RV q is N(O, d'q); hence, its u-percentile equals Applying (10-8a) to our case, we obtain the following test: z,aq. Determine the number k0 and k 1 of successes of the events .s40 and .s4 1 , respectively. 2. Compute the sample q = k01n0 - k 1/n 1 of q. 3. Compute d'q from (10-31). 4. Accept Ho itT ZoncTq s q s -Zond-q. 1. Under hypothesis H 1 , the RV q is normal with "''Jq and CTq• as in (10-29). Assigning specific values to Po and P1 , we determine the OC function of the test from (10-Sa). Example 10.9 In a national election, we conducted an exit poll and found that of 200 men. 99 voted Republican and of 120 women, 45 voted Republican. Based on these results, we wish to test the hypothesis that the probability p 1 of female voters equals the probability Po SEC. of male voters with consistency level a 99 45 tf"'" ~(Kl • (r ~ '1 -'-' • 4'i • · 120 -· .I:! X 'i'i ( ·-- = .05. 10-2 RASIC APPLICATIONS In this problem. 99 1- 4'i /;u = 2(KI +·120 -·' .4 5 ··~··· + _!_. ) 120 2(KI :.u~i · - :.•m "'. 335 2. ciu = 55 tr,, - o.o57 Since 0.12 > :. ..11 ~tr,, == 0.114. the hypothesis that p = l'n is rejected. hut barely. • Poisson Distributions The RV xis Poisson-distributed with mean>... We wish to test the hypothesis Hn: A -= >..o against H 1 : A =I= A0 • To do so, we form the 11 samples X; and usc as test statistic their sum: q = X1 I • • ·+X, This sum is a Poisson-distributed RV with mean nA. We next determine the largest integer k 1 and the smallest integer k~ such that _ , ~ (tr.\n)1 a (' 11"''L.J--<H k! 2 _,,. ~ (11.\n)' a ('"''L.J--<- ( 10-32) k! 2 The left sides of these inequalities equal P{q <; kdHn} and P{q ~ k:!IHn} respectively. This leads to the following test: Find the sum q .:. . .t 1 + · · · + x, of the observations X;; accept H 0 iff k 1 c;: q s; q~. The resulting OC function equals 4 ': ( 10-33) For large n. we can use the normal approximation. Under hypothesis H 0 • TJ,, = n>..o: u~ = n.\o. Hence lsee (10-11)1. ( 10-34) Example 10.10 The weekly highway accidents in a certain region are the values of a Poisson-distributed RV with mean A = Ao = 5. We wish to examine whether a change in the speed limit from the present 55 mph to 65 mph will have any effect on A. To do so, we monitor the number x; of weekly accidents over a tO-week period and obtain X; = 5 2 9 6 3 7 9 6 8 Choosing a = .05, we shall test the hypothesis A -"' 5 against A :/: 5. In this problem. ;:,.12 = -2. nA11 = 50. L ~ 2 viii;; """ 50 ± 15 q -= X; .:..: 56 Since 56 is between 35 and 65. we accept the A ..... 5 hypothesis. • llAo 336 CHAP. 10 HYPOTHESIS TESTING EQUALITY OF 1WO POISSON DISTRIBUTIONS We have two Poissondistributed avs x and y with means Ao and A1 , respectively, and we wish to test the hypothesis Ho: A1 = Ao against H, ; At =/: Ao or A, > Ao or At < Ao where Ao and A1 are two unknown constants. Our test will be based on the no+ n, samples x; and y1, obtained sequentially. The exact solution of this problem is rather involved because the variances of x andy depend on their mean. To simplify the analysis, we shall consider only large samples. 
This leads to two simplifications: We can assume that the avs x andy are normal, and we can replace the variances u~ = Ao and = At by their empirical estimates. Under hypothesis Ho. A1 = A0 ; hence, the avs x andy have the same distribution. We can therefore interpret the n0 + n 1 observations x; andy; as the samples of the same av x or y. The resulting empirical estimate of Ao is u; Ao = no ! n, (t + I Y;) X; ;-t 1-t Proceeding as in (10-17), we form the sample means i and yofthe avs x and y and use as test statistic the ratio i-v where u;.~ = Ao• ·-I + -I ) (10-35) q=~ 1·no nt u ... is the empirical estimate of the variance of the difference w = i - y. From this it follows that if H 0 is true. then q is approximately N(O, I). We can therefore use the test (10-11). In particular, (10-lla) yields the following test of the hypothesis Ho: A1 = Ao against H,: At =/: Ac,. Compute x..Y. and u,.. Accept Ho iff l.f Example 10.11 - .VI!u,.. < - Zal1· The number of absent boys and girls per week in a certain school district are the values x; and Y• of two Poisson-distributed avs x and y with means Ao and At, respectively. We monitored the boys for 10 weeks and the girls for 8 weeks and obtained the following data: 27 24 14 10 10 16 23 19 13 8 Yl = 13 We shall test the hypothesis At = Ao against At :F Ao with consistency level a = .05. In this problem, -loa .,. 2. X;= J5 12 i.o = 21~2 = - y = 180 - 17 16.2 19 12 23 a-~= 16.2 ( 4 17 1~ + ~-i _!g = 4 q = 1.91 = 2.09 > 2 10 8 Hence, we reject the null hypothesis, but barely. • i a;:= 1.91 SEC. 10-2 BASIC APPLICATIONS 337 Variance and Correlation Given anN(.,, u) RV x, we wish to test the hypothesis Ho: u = cro against II,: cr ':/:: uo. cr > cru. or u < O'o Suppose. first. that the mean., of x is known. In this case, we use as test statistic the RV q-= ~ (X; - T'/)~ £.J - - , - ( 10-36) Uii ; I Under hypothesis H 0 , the RV q is x2(n). We can therefore use the tests (10-8) where q,. is now the x~(n) percentile. To find the corresponding OC function {3(u), we must determine the distribution of q under hypothesis H 1 • As we show in Problem 10-10, this is a scaled version ofthe x2(n) density. Example 10.12 We wish to compare the accuracies of two measuring instruments. The error 11o ofthe first instrument is an N(O, u 0 ) RV where uo = 0.1 mm, and the error.,, of the second is an N(O. u) RV with unknown u. Our objective is to test the hypothesis H 0 : u = u 0 = 0.1 mm against H 1 : u :/:: u 0 with a = .05. To do so. we measure a standard object of length 11 = 8 mm 10 times and record the samples X;= 8.15 7.93 8.22 8.04 7.H5 7.95 8.06 IU2 7.86 7.92 of the RV x = 11 + ,, . Inserting the results into ( 10-361. we obtain q =- (X;- 8)~ LI~ · - ·= 0.01 i I 14.64 From Table 2 we find the percentiles Since 14.64 is in the interval (3.25. 20.4H) of acceptance of 1111 • we accept the hypothesis that u = cr0 • • If ., is unknown, we use as test statistic the sum i i i = .!. x, (10-37) uo n ; ., Under hypothesis H 0 , the RV q is x2(n - 1). We can therefore again use (10-8) where q,. is now the x~(n - I) percentile. q = (X; -:-2 i)2 1-1 Example 10.13 Suppose now that the length 11 of the measured object in Example I0.12 is unknown. Inserting the 10 measurements x1 into (10-37), we obtain x = 8.01 q= L10 ( i-1 ::\' x,~ = 14.54 0.01 X; - 338 CHAP. 10 HYPOTHESIS TESTING From Table 2, we find that X~o2~(9) = 2.70 Since 14.54 is in the interval (2.70. 19.02). we accept the hypothesis that u-= u 11 • • Equality of Two Variances The RV xis N(-q,,. 
u .• ), and the RV y is N(-q,. "·' ). We shall test the hypothesis = u,. against H 1 : u .• :F u.,.. u, > u,.• or u .. < a,. Suppose, first, that the two means Yl.< and -q~ are known. In this case. we use as test statistic the ratio I " Ho: CTx q = -n L (X; ;~• I 'f/x)2 (10-38) m -m L (y; - -q,>2 ;~• where X; are the n samples of x and y; are the m samples of y obtained sequentially. From (7-106) it follows that under hypothesis H 0 , the Rv q has a Snedecor F(n, m) distribution. We can therefore use the tests (10-8) where q,. is the F,.(n, m) percentile. If the means 'fix and 'f/~· are unknown. we use as test statistic the ratio ~ s; s;. (10-39) q = -; where si is the sample variance of x obtained with the n samples x; and s; is the sample variance of y obtained with them samples y;. From (7-106) it follows that if CTx = u,, then the RV q is F(n - I, m - 1). We can therefore apply the tests (10-8) where q,. is the F,.(n - I, m - I) percentile. Example 10.14 We wish to compare the accuracies of two voltmeters. To do so, we measure an unknown voltage 10 times with the first instrument and 17 times with the second instrument. We compute the samples variances and find that Sx = 8 p.V Sy = 6.4 p.V Using these data, we shall test the hypothesis Ho: Ux = Uv against H 1 : Ux ::1: u,. with · · consistency level a = .I. In this problem. ~ s; s; q = .. = 1.56 n = 10 = 17 m From Table 4 we find F.~(9, 16) = 2.54 F. 0 ~(9, 16) I (l 6 9) = 0.334 .9~ • Since 1.56 is in the interval (0.334, 2.54). we accept the hypothesis that cr. = u,. • = F Correlation We wish to investigate whether two Rvs x andy are uncorrelated. With r their correlation coefficient, our problem is to test the hypothesis H 0 : r = 0 against H 1 : r :F 0. To solve this problem, we perform the underly- SEC. 10-2 BASIC APPLICATIONS 339 ing experiment n times and obtain then paired samples (x;, y1). With these samples, we form the estimate Pof r as in (9-55) and the corresponding value z of Fisher's auxiliary variable z as in (9-56): • ~(X; - .t)(y; - v) r = v'~Cx; - i)2~(y; I ·..v)2 I- f z "' 2 In I + ( I0-40) f We shall use as test statistic the RV q = zv'ii- 3 (10-41) Under hypothesis H 11 • the RV z is N(O. llvli ·-=--3) Isec (9-57)). Hence, the test statistic q is N(O, 1). We can therefore use the test (10-ll) directly. Example 10.15 We wish to examine whether the freshman grades x and the senior grades yare correlated. For this purpose, we examine the grades x; and Y; of n = 67 students and compute f. z. and q from (10-40) and (10-37). The results are f =- 0.462 z - 0.5 q - 4 To test the hypothesis r = 0 against r t. 0, we apply ( 10-Kal. Since q we conclude that freshman and senior gl"'cldes are correlated. • = 4 > lz..12 l -= 2, Distributions We wish to examine whether the distribution function F(x) of an RV x equals a given function Fo(x). Later we present a method based on the x2 test. The following method is based on the empirical distribution (9-64). Our purpose is to test the hypothesis H 0 : F(x) = Fo(x) against 1/1: F(x) :/: F 0(x) (10-42) For this purpose. we form the empirical estimate PCx) of FCx) as in (9-64) and usc as test statistic the maximum distance between F(x) and fj,(x ): q = maxl.'(x) - Fo(..\)1 (10-43) KOLMOGOROFF-SMIRNOV T•:sT t If Ho is true. £{F<x)} = F0(x); hence q is small. It is therefore. reasonable to reject H11 itT the observed value q of q is larger than some constant £'. To complete the test. it thus suffices to find c such that P{q > c:ll/0 } = a. 
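Before completing the Kolmogorov–Smirnov construction, here is a short sketch of the variance-ratio test (10-39) applied to the voltmeter data of Example 10.14 above (sample standard deviations 8 and 6.4 microvolts from 10 and 17 readings). The helper name is mine; only the F percentiles come from scipy.

```python
from scipy.stats import f

def variance_ratio_test(s_x, n, s_y, m, alpha=0.10):
    """Test H0: sigma_x = sigma_y with q = s_x^2 / s_y^2, which is F(n-1, m-1) under H0."""
    q = (s_x / s_y) ** 2
    lo = f.ppf(alpha / 2, n - 1, m - 1)         # F_{alpha/2}(n-1, m-1)
    hi = f.ppf(1 - alpha / 2, n - 1, m - 1)     # F_{1-alpha/2}(n-1, m-1)
    return q, lo, hi, lo <= q <= hi

# Example 10.14: s_x = 8 uV (10 readings), s_y = 6.4 uV (17 readings), alpha = 0.1
q, lo, hi, accept = variance_ratio_test(8.0, 10, 6.4, 17)
print(f"q = {q:.2f}, acceptance interval = ({lo:.3f}, {hi:.3f})")   # about 1.56 in (0.334, 2.54)
print("accept H0" if accept else "reject H0")
```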
Under hypothesis H 0 • the statistic q equals the RV win (9-66). Hence [see (9-68)), a = P{q > ciHo} ==- 2e 2n..• This yields the following test: Using the samples X; of x. form the empirical estimate hx> of F(x); plot the difference i-'(x) - F0(x) (Fig. 10.3), and evaluate q from ( 10-43): Accept H 0 itT q < ,~ 2~ In·~ = c ( 10-44) 340 CHAP. 10 HYPOTHESIS TESTING 125 75 X Figure 10.3 Example 10.16 We shall test the hypothesis that the tabulated IQ scores of Example 9.19 are the samples of a normal av x with 1J = 100 and u = 10. In the following table, we list the bunched samples X;, the corresponding values of the empirical distribution !(x), the normal distribution Fo(x), and their distance L\(x) = IF(x) - Fo(x)l (see also Fig. 10.3). 7S 80 8S 90 9S 100 lOS 110 liS 120 12S F(x1) .02S 0.7S I. SO .21S .425 .62S .71S .81S .92S .97S .100 Fo(x1) .006 .023 .067 .IS9 .308 .soo .691 .841 .933 .917 .994 A(x1) .019 .OS2 .083 .116 .117 .12S .084 .034 .008 .002 .006 Xj As we see from this table. L\(x) is maximum for x = 100. Thus c I = V- I I 80 n 2.OS = .217 q = L\( 100) = .125 Since .12S < .217, we accept the hypothesis that the RV xis N(IOO. 10). • EQUALITY OF TWO DISTRIBUTIONS We are given two independent RVS x and y with continuous distributions, and we wish to test the hypothesis H 0 : F.r(w) = F,.(w) against H 1 : f"'.c(w) :1= F,.(w) (10-45) We shall give a simple test based on the special case p 0 = .5 of the test (10-19) of the equality of two probabilities. Sip Test From the independence of the RVs x and y it follows that if H 0 is true, then /x(x)/y(y) = fx(y)/y(x) SEC. 10-2 BASIC APPLICATIONS 341 This shows that the joint density ft.,.<x. y) is symmetrical with respect to the line x = )'; hence. the probability masses above and below this line are equal. In other words. P{x > y} = P{x < y} = .5 (10-46) Thus under hypothesis H11 , the probability p = P(dl) of the event .54 equals .S. To test (10-45). we shall test first the hypothesis Ho: p = .5 Hi: p against :1= .5 = {x > y} (10-47) To do so. we proceed as in ( 10-20) with p11 = .5: Select a consistancy level a', and compute the largest integer k 1 and the smallest integer k~ such that (10-48) In this case. k2 = n- k, because ( ~) = (n ~ k) . ..-or large n, we can use the normal approximation (10-21). With p 0 = .5. this yields vii n v'li k. = - - ... , (10-49) - 2 ""' - 2 Form the n paired samples (x 1• y1) ignoring all trials such that X; = y 1• Determine the number k of trials such that X; > y 1• This number equals the number k of times the event sll occurs. or k > k~ (10-50) Reject H0 iff k < k 1 If H 0 is false. then H0 is also false. This leads to the following test of /111 • Compute k. k,. and k2 as earlier. or k > k~ (10-51) Reject H11 if k < II., n 2 k, =- + ~ ·~- 41 "~ 2 Note The tests of H 0 and H 0are not equivalent because H 0might be true even when H 0 is false. In fact, the corresponding error probabilities a,~ and a',~· are different. Indeed, Type I error occurs if H 0is rejected when true. In this case. H0 might or might not be true: hence. a< a'. Type II error occurs if H 0is not rejected when false. In this case. 110 is false; however. it might be false even if H 0is true; hence, ~ > ~·. Example 10.17 The Rvs x andy model the salaries of equally qualified male and female employees in a certain industry. Using (10-50), we shall test the hypothesis that their distributions are equal. The consistency level a of the test is not to exceed the value a' = .OS. 
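Before completing the salary comparison of Example 10.17, here is a minimal sketch of the Kolmogorov–Smirnov decision rule (10-44). The data below are synthetic stand-ins (the IQ table of Example 10.16 is not reproduced), so only the structure of the test is illustrated; as in the text, the empirical distribution is compared with F0 at the sample points.

```python
import numpy as np
from scipy.stats import norm

def ks_test_against(samples, F0, alpha=0.05):
    """Accept H0: F = F0 iff max |F_hat - F0| < sqrt(ln(2/alpha) / (2n)), as in (10-44)."""
    x = np.sort(np.asarray(samples, float))
    n = len(x)
    F_hat = np.arange(1, n + 1) / n             # empirical distribution at the sorted samples
    q = np.max(np.abs(F_hat - F0(x)))           # statistic (10-43), evaluated at the sample points
    c = np.sqrt(np.log(2 / alpha) / (2 * n))
    return q, c, q < c

# Hypothetical IQ-like scores; H0: x is N(100, 10)
rng = np.random.default_rng(0)
scores = rng.normal(100, 10, size=40)
q, c, accept = ks_test_against(scores, lambda x: norm.cdf(x, loc=100, scale=10))
print(f"q = {q:.3f}, c = {c:.3f} ->", "accept H0" if accept else "reject H0")
```

With n = 40 and α = .05 the bound c is about 0.215, which is the value 0.217 used in Example 10.16 up to rounding.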
To do so, we interrogate 84 pairs of employees and find that x_i > y_i for 54 pairs, x_i < y_i for 26 pairs, and x_i = y_i for 4 pairs. Ignoring the last four cases, we have n = 80, z_{α'/2} ≈ −2, and
n/2 ± z_{1−α'/2} (√n)/2 = 40 ± 8.94
Hence, k1 = 31 and k2 = 49. Since k = 54 > 49, we reject the null hypothesis. •
We have assumed that the RVs x_i, y_i are i.i.d.; this is not, however, necessary. The sign test can be used even if the distributions of x_i and y_i depend on i. Here is an illustration.
Example 10.18  We wish to compare the effectiveness of two fertilizers. To do so, we select 50 farm plots, divide each plot into two parcels, and fertilize one half with the first fertilizer and the other half with the second. The resulting yields are the values (x_i, y_i) of the RVs (x_i, y_i). The distributions of these RVs vary from plot to plot. We wish to test the hypothesis that for each plot, x_i and y_i have the same distribution. Our decision is based on the following data: x_i > y_i for 12 plots and x_i < y_i for 38 plots. We proceed as in (10-49) with α' = .05. This yields the interval
n/2 ± z_{1−α'/2} (√n)/2 = 25 ± 7.07
hence k1 = 18. Since k = 12 < 18 = k1, we reject the null hypothesis. •
10-3 Quality Control
In a manufacturing process, the quality of the plant output usually depends on the value of a measurable characteristic: the diameter of a shaft, the inductance of a coil, the life length of a system component. Due to a variety of "random" causes, the characteristic of interest varies from unit to unit. Its values are thus the samples of an RV x. Under normal operating conditions, this RV satisfies certain specifications; for example, its mean equals a given number η0. If these specifications are met during the production process, we say that the plant is in control. If they are not, the plant is out of control. A plant might go out of control for a variety of reasons: machine failure, faulty material, operator errors. The purpose of quality control is to detect whether, for whatever reason, the plant goes out of control and, if it does, to find the errors and eliminate their causes. In many cases, this involves interruption of the production process.
Quality control is a fundamental discipline in industrial engineering that involves a variety of control methods depending on the nature of the product, the cost of inspection, the consequences of a wrong decision, the complexity of the analysis, and many other factors. We shall introduce the first principles of statistical quality control. Our analysis is a direct application of hypothesis testing.
The problem of statistical quality control can be phrased as follows: The distribution of the RV x is a known function F(x, θ) depending on a parameter θ. This parameter could, for example, be the mean of x or its variance. When the plant is in control, θ equals a specified number θ0. The null hypothesis H0 is the assumption that the plant is in control, that is, that θ = θ0. The alternative hypothesis H1 is the assumption that the plant is out of control, that is, that θ ≠ θ0. In quality control, we test the null hypothesis periodically until it is rejected. We then conclude with a certain significance level that the plant is out of control, stop the production process, and remove the cause of the error.
The test proceeds as follows. We form a sequence x_i of independent samples of x. These samples are the measurements of production units selected in a variety of ways. Usually they form groups consisting of n units each.
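Before describing how the control samples are grouped, here is a short sketch of the sign test used in Examples 10.17 and 10.18 above. The function name is mine; ties are assumed to have been removed from the count already, and the text's rounding of z.975 to 2 is not applied.

```python
import numpy as np
from scipy.stats import norm

def sign_test(k, n, alpha=0.05):
    """Reject H0: P{x > y} = 1/2 iff k < k1 or k > k2, with k1,2 = n/2 -/+ z * sqrt(n)/2 as in (10-49)."""
    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(n) / 2
    k1, k2 = n / 2 - half_width, n / 2 + half_width
    return k1, k2, (k < k1 or k > k2)

# Example 10.17: 84 pairs with 4 ties dropped -> n = 80, and x_i > y_i for k = 54 pairs
k1, k2, reject = sign_test(k=54, n=80)
print(f"k1 = {k1:.1f}, k2 = {k2:.1f} ->", "reject H0" if reject else "accept H0")

# Example 10.18: 50 plots, x_i > y_i for k = 12 plots
k1, k2, reject = sign_test(k=12, n=50)
print(f"k1 = {k1:.1f}, k2 = {k2:.1f} ->", "reject H0" if reject else "accept H0")
```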
The units of each group are picked either sequentially or at random forming the samples X1, • • • • •l'11 • • • • , r,,.,, .... .Tm ·I • • • • • . Using the 11 samples of each group, we form a test of the hypothesis H 11 • We thus have a sequence of tests designed as in Section 10-2. The testing stops when the hypothesis H 0 is rejected. Control Test Statistic The decision whether to reject the hypothesis that the plant is in control is based on the values of a properly selected test statistic. Proceeding as be• x,) of fore, we choose a significance level a and a function q = g(x 1 , • then samples x;. We determine an interval (c 1 , c~) such that P{c, < q < C2;Hn} = 1 - a ( 10-52) We form the samples q, = g(x,. 1 • • • • , x,.,) of q. If q, is between c, and c~. we conclude that the plant is in control. If q, < c, or q, > c2. we reject the null hypothesis: that is. we conclude that the plant is out of control. The constant c 1 is called the upper control limit (UCL); the constant c2 is called the lower control limit (LCL). In Fig. 10.4 we show t.he values of q, as the test progresses. The process terminates Ti' w;-------------~""~---UCL 99';; 95'.1 JL ~------------------------------- I _1_ Warning c,~------------------------------------------LCL 2 3 4 S 6 7 8 9 10 II m 344 CHAP. 10 HYPOTHESIS TESTING when qm is outside the control interval (c 1 , c2 ). The graph so formed is called the control chart. This test forms the basis of quality control. To apply it to a particular problem, it suffices to select the function g(x 1 , • • • , x,.) appropriately. This we did in Section 10-2 for most parameters of interest. Special cases involving means, variances, and probabilities are summarized at the conclusion of this section. Warning Limits If the time between groups is long, it is recommended that the testing be speeded up when qm approaches the control limits. In such cases, we select a test interval (w,, w2) with a' >a: P{w, < qm < wziHo} = I - a' and we speed up the process if qm is in the shaded area of Fig. 10.4. Control Error In quality control, two types of errors occur. The first is a false alarm: The plant is in control, but qm crosses the control limits. The second is a faulty control: The plant goes out of control at m = m0 , but qm remains between the control limits, crossing the limits at a later time. Faist Alarm The control limits might be crossed even when the plant is in control. Suppose that they are crossed at the mth step (Fig. 10.5a). Clearly, m is a random number. It therefore defines an av z taking the values J, 2, . . . We maintain that (10-53) P{z = miHo} = (I - a)m-la Indeed, z = m iff the statistic q is in the interval (c 1 , c2 ) in the first m - J tests and it crosses to the critical region at the mth step. Hence, (10-53) follows from (10-52). Thus z has a geometric distribution, and its mean [see (4-109)] equals J .,_ (10-54) .,, = -a This is the mean number of tests until a false alarm is given, and it suggests that if qm crosses the control limits for m > 11:, the plant is probabily still in control. Such rare crossings might be interpreted merely as warnings. Figure 10.5 UCL UCL of l- Out control II z =II w=3 (a) (b) SEC. 10-3 QUALITY CONTROL 345 Faulty Control Suppose that the plant goes out of control at the mth test but q,. crosses the control limits not immediately but k tests later (Fig. 10.5b). The plant is then in operation fork test intervals while out of control. Again, k is a random number specifying the RV w. This RV takes the values 1, 2. . . . 
with probability P{w = k!H.} = ~'- (9)( I 1 - ~((l)j (10-55) where 8 is the new value of the control parameter and ~(0) = P{c, < q < <"l!H,} is the OC function of the test. Thus w has a geometric distribution with p I ·- ~(8) = P(8). and its mean equals I = (10-56) .,., ... = P(6) where P(8) is the power of the test. If 8 is distinctly different from 80 , .,.,.,. is small. In fact, if ~(8) < .5. then.,.,.,.< 2. As the difference 18 - 80 1decreases, .,., ... increases; for 8 close to 80 , .,.,.,. approaches the mean false alarm length TJ;. Control of the Mean We shall construct a chart for controlling the mean .,., of x. Suppose, first, that its standard deviation u is known. We can then use as the test statistic the ratio q in (I 0-9): X - TJo q = u!Vn Xm - TJo qm = u!Vn - I " x, = ;; ~ x,..; .,.,= When the process is in control • 170 ; hence, q is N(O, 1), and the control limits are +Zaf2 • If we use as the test statistic the sample mean x,. of the samples x,. .. ; of the mth group. we obtain the chart of Fig. 10.6a with control Figure 10.6 x,. q,. flo+ 3a/vn t UCL L to/2(11 - 1) ~ UCL -;- ~ ~flo ~ 0 ...... _L 8 LCL flo- 3a/vn t m -r.12 (11- 1) x,.- flo s,./yn q =-m m 0 (a) (b) LCL 346 CHAP. J0 HYPOTHESIS TESTING limits (10-57) These limits depend on a. In the applications. it is more common to specify not the significance level a but the control interval directly. A common choice is the 3u interval TJo ± 3u/Vn. Since P {11o- 3 :n < i < TJo + 3 :n = .997 IHo} the resulting consistency level a equals .003. The mean false alarm length of this test equals 1/a - 333. Example 10.19 A factory manufactures cylindrical shafts. The diameter of the shafts is a normal RV with u = 0.2 mm. When the manufacturing process is in control, ., = 1Jo = 10 mm. Design a chart for controlling., such that the mean false alarm length 11: equals 100. Use 25 samples for each test. In this problem, a= 11.,, = .Ol,z.995 = 2.576, and n = 25.1nserting into (10.57), we obtain the interval 2 10 z 2.576 ... 10: 0.1 °; If u is unknown, we use as the test statistic the _ Xm - TJo qm - :r Sm1 vn RV 2 _ _1_ ~ Sm - n- I ~ ;~r ( q in (10-12). Thus Xm+i _ - )2 Xm The corresponding chart is shown in Fig. 10.6b. Since the test statistic q has a t(n - I) distribution, the control limits are LCL = ladn- I) UCL = -1,,,2(n- I) Example 10.20 Design a test for controlling ., when u is unknown. The design requirements are CL = :t3 False alarm length 11: = 100 In this problem. a = II1J: = .01: hence. the only unknown is the number n of samples. From Table 3 we see that t.99~(n - I) = 3 if n - I = 13; hence. n = 14. In quality control, the tests are often one-sided. The corresponding charts have then one control limit, upper or lower. Let us look at two illustrations. Control of the Standard Deviation In a number of applications, it is desirable to keep the variability of the output product below a certain level. A measure of variability is the variance u 2 of the control variable x. The plant is in control if u is below a specified level u 0 ; it is out of control if u > u 0 • This leads to the one-sided test H 0 : u = SEC. 10-3 QUAI.ITY CONTROL ..;q;, qm x!<n- I) UCL X1 • 0 (n- I) ll': 'i -.....g 347 q, = II .l: ~3 '=I <x,.t - 0 UCL x,)2 0 m m (a) (b) l'igure 10.7 uuagainst /1 1: u > u 0 or. equivalently, H 0: u~ = ufiagainst u; u~ > ufi. To carry out this test, we shall use as the test statistic the sum in ( 10-37): II ( X;- -X· )' )' _ ~ _ , II.. 
( X 111 .; -X,.,q - ~ ' q,- ~ , i-1 U(i ; 1 Ujj Under hypothesis H 0• the RV q has a x~(, - I) distribution. From this it follows that if P{q > c:l H 0}= a, then<. = xi _.,(n - I): hence, the UCL equals xi-,.(11 - 1), and the chart of Fig. 10.7a results. Note Suppose that q is a test statistic for the parameter 8 and the c.:>rresponding control limits are c 1 and c~. If T = r(8) is a monotonic increasing function of 6. then r(q) is a test statistic of the parameter T, and the corresponding control limits are r(c- 1) and r(c2 ). This is a consequence of the fact that the events {c 1 < q < c-2} and {rktl < r(q) < r(C'2)} are equal. Applying this to the parctmeter 6 = o- 2 with r(fJ) :.:: VB. we obtain the chan of Fig. 10.7h for the control of the standard deviation u of x. In a number of applications, the purpose of quality control is to keep the proportion p of defective units coming out of a production line below a specified level Po· A defective unit can be identified in a variety of ways, depending on the nature of the product. It might be a unit that docs not perform an assigned function: a fuse, a switch, a broken item. It might also be a measurable characteristic. the value of which is not between specified tolerance limits; for example. a resistor with tolerance limits 1,000 :t 5% ohms is defective if R < 950 orR > 1050. In such problems, p = {defective} is the probability that a unit is defective, and the objective is to test the hypothesis Ho: p s Po against H,: p >Po where Po is the probability of defective parts when the plant is in control. This test is identical to the test Hi.: p = Po against Jlj: p > Pn· Proceeding as in (10-22), we use as the test statistic the number x of successes of the event :A.= {defective} inn trials: q = x and q, = x,. Thus x, is the number of defective units tested as the mth inspection. As we know, the RV x has a binomial distribution. This leads to the conclusion that the CONTROL OF DEFECTIVE UNITS 348 CHAP. 10 HYPOTHESIS TESTING t "Po + Zt- .. .../npoqo k21--_..._................;......__ t:CL ::0:. ~ "Po I _jL_ o~~~~~~~~~~ Figure 10.8 UCL is the smallest integer k2 such that P{x ~ k2IHo} = i ·~·l (~)P3ti3-t < a For large n [see (10-23)], UCL = k2 = npo + Zt o v;;p;;;j;, and the chart of Fig. 10.8 results. Thus if at the mth inspection, the number Xm of defective parts exceeds k2 , we decide with significance level a that the plant is out of control. 10-4 Goodness-of-Fit Testing The objective of goodness-of-fit testing is twofold. I. It seeks to establish whether data obtained from a real experiment fit a known theoretical model: Is a die fair? Is Mendel's theory of heredity valid? Does the life length of a transistor have a Weibull distribution? 2. It seeks to establish whether two or more sets of data obtained from the same experiment, performed under different experimental conditions, fit the same theoretical model: Is the proportion of Republicans among all voters the same as among male voters? Is the distribution of accidents with a 55-mph speed limit the same as with a 65-mph speed limit? Is the proportion of daily faulty units in a production line the same for all machines? Goodness of fit is part of hypothesis testing based on the following seminal problem: We have a partition A = [Jilt • • • • .simJ SEC. 10-4 GOODNESS-OF-FIT TESTING 349 A •·igure IO.CJ consisting of m events s4; (Fig. 
10.9), and we wish to test the hypothesis that their probabilities p_i = P(A_i) have m given values p_0i:
H0: p_i = p_0i, all i    against    H1: p_i ≠ p_0i, some i        (10-58)
This is a generalization of the test (10-19) involving a single event. To carry out the test, we repeat the experiment n times and denote by k_i the number of successes of A_i.
Pearson's Test Statistic  We shall use as the test statistic the exponent of the normal approximation (7-71) of the generalized binomial distribution:
q = Σ_{i=1}^{m} (k_i − np_0i)² / (np_0i)        (10-59)
This choice is based on the following considerations: The RVs k_i have a binomial distribution with
E{k_i} = np_i        σ²_{k_i} = np_i q_i        (10-60)
Hence, the ratio k_i/n → p_i as n → ∞. This leads to the conclusion that under hypothesis H0 the difference |k_i − np_0i| is small, and it increases as |p_i − p_0i| increases. The test proceeds as follows: Observe the numbers k_i and compute the sum in (10-59), select a significance level α, and find the percentile q_{1−α} of q.
Accept H0 iff q = Σ_{i=1}^{m} (k_i − np_0i)²/(np_0i) < q_{1−α}        (10-61)
Note that the computations in (10-59) are simplified if we expand the square and use the identities Σ p_0i = 1, Σ k_i = n. This yields
q = Σ_{i=1}^{m} k_i²/(np_0i) − n        (10-62)
CHI-SQUARE TEST  To carry out the test, we must determine the percentile q_u of Pearson's test statistic. This is, in general, a difficult problem involving the determination of a function of u depending on many parameters. In Sec. 8-3, we solve this problem numerically using computer simulation. In this section, we use an approximation based on the assumption that n is large.
• Theorem. If n is large, then under hypothesis H0, Pearson's test statistic q is χ²(m − 1)        (10-63)
This is based on the following facts. For large n, the RVs k_i are normal. Hence [see (10-60)], if p_i = p_0i, the RVs (k_i − np_0i)²/(np_0i q_0i) are χ²(1). However, they are not independent because Σ (k_i − np_0i) = 0. Using these facts, we can show, proceeding as in the proof of (7-97), that q can be written as the sum of the squares of m − 1 independent N(0, 1) RVs; the details, however, are rather involved [see (7A-6)]. We shall verify the theorem for m = 2. In this case [see (7-72)],
k_1 + k_2 = n        p_01 = 1 − p_02        |k_1 − np_01| = |k_2 − np_02|
Hence,
q = (k_1 − np_01)²/(np_01) + (k_2 − np_02)²/(np_02) = (k_1 − np_01)²/(np_01 p_02)        (10-64)
This shows that for m = 2, q is χ²(1), in agreement with (10-63). From (10-63) it follows that if n is large, Pearson's test takes the form
Accept H0 iff q < χ²_{1−α}(m − 1)        (10-65)
A decision based on (10-65) is called a chi-square test. The Type II error probability depends on p_i and can be expressed in terms of the noncentral χ² distribution (see the Appendix to Chapter 7).
Example 10.21  We wish to test the hypothesis that a die is fair. To do so, we roll it 450 times and observe that the ith face shows k_i times where
k_i = 66  60  84  72  81  87
In this problem, n = 450, p_0i = 1/6, np_0i = 75, and
q = Σ_{i=1}^{6} (k_i − 75)²/75 = 7.68
Since χ²_{.95}(5) = 11.07 > 7.68, we accept the fair-die hypothesis with α = .05. •
The next example illustrates the use of the chi-square test in establishing the validity of a scientific theory, Mendel's theory of heredity.
Example 10.22  Peas are divided into the following four classes depending on their color and their shape: round-yellow, round-green, angular-yellow, angular-green. According to the theory of heredity, the probabilities of the events of each class are
p_0i = 9/16  3/16  3/16  1/16
To test the validity of the theory,
we examine 556 peas and find that the number of peas in each class equals k; = 315 101 32 108 Using these observations. we -shall test the hypothesis that P; = po; with cr = .05. J0-4 SEC. i.,~m In this problem. - 6.25. and 11 ""'" liP«~ = 556 and =2 4 q GOOONESS·OI'·FfT TESTING 35 J 312.75. 104.25. 1114.25. 34.75. m = 4. (k·; - llpu;)·, = 0.470 npo; i•t Since 6.25 is much larger than 0.470, the evidence !;tronl,tly supports the null hypothesis. • Refined Data In general, the variations of a random experiment are due to causes that remain the same from trial to trial. In certain cases, we can isolate certain factors that might affect the data. This is done to refine the data and to improve the test. The next example illustrates this. Example 10.23 In Mendel's theory. the ratio of yellow to green pea!; is 3: I. We wish to test the hypothesis that the theory is correct. We pick 478 pea!; from 10 plants and observe that 355 of them are yellow. What is our decision'! The problem is to test the hypothesis p = p 0 = .75 where pis the probability that a pea is yellow. To solve this problem. we use the chi-square test with m = 2 or, equivalently, the test (10-19). In our case. k = 355. 11 = 478. np0 q 0 = 89.6, and (10-14) yields (k - 11Pulfu)2 q .._ - - - · .... IIPut/o = O.IJ6 < , x~.,.111- 3.84 Hence. we accept the 3: I hypothesis with a ,... .05. Refinement There might be statistical differences among plants. This possibility leads to a refined chi-square test: We group the data according to plant and observe that the jth plant has II; peas where ni = 36 39 19 97 37 :!6 45 53 64 62 of which k, - 25 32 14 70 arc green. Using the data k1 and II; 24 20 32 44 50 44 of the jth plant. we J'brm the 10 statistics q; _ ~kL.-, ~'iPu.'ful.~ lljPot/o (/; and we obtain = 0.59 1.03 0.02 0.34 2.03 0.05 0.36 1.82 o.:n 0.54 To usc the differences among plants. we shall use as the test statistic the sum q = z~ ~ t/j 1 t z~ = ~ ;-t <k, - n;puqu)~ ..... .... n;puqu :: 7.19 The corresponding RVs q 1 arc independent. and each is x2<I l: hence. their sum q is x2(10). Since x 2~(10J = 18.3 > 7.19. we accept the 3: I hypothesis. • 352 CHAP. 10 HYPOTHESIS TESTING Incomplete Null Hypothesis We have assumed that under the null hypothesis, them probabilities p; = po; are known. In a number of applications, they are specified only partially. The following cases are typical. Cast 1 Only m - r of the m numbers p 01 are known. Cast 2 The m numbers p 01 satisfy m - r equations: f/>j(Po~o • • • • Pom) = 0 j = I, . . . , m - r (10-66) These equations might, for example. be the results of a scientific theory. Case 3 Them numbers p 01 are functions of r unknown parameters 8i: po; = t/1;(8,, . . . • 8,) i = I, . . . • m (10-67) In all cases, we have r < m unknown pantmeters. MODIFIED PEARSON'S STATISTIC To carry out the test, we find the maxi- mum likelihood (ML) estimates of the r unknown parameters and use the given constraints to determine the ML estimates p01 of all m probabilities p 01 • In case I, we find the ML estimates of the r unknown probabilities po;. In case 2, we find the ML estimates of r of the m numbers po; and use the m - r equations in (10-66) to find the ML estimates of the remaining m - r probabilities. In case 3, we find the ML estimates (Ji of the r parameters 8j and use them equations in (10-66) to find the ML estimates p01 of them numbers po;. Inserting the estimates so obtained into (10-59), we obtain the modified Pearson's statistic, (10-68) where p01 = po; for the values of i for which p 01 is known. 
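Before completing the modified test, here is a sketch of the plain chi-square test (10-65) on the Mendel data of Example 10.22. The counts as printed appear out of order; pairing them with the stated class probabilities so that the expected counts 312.75, 104.25, 104.25, 34.75 and q ≈ 0.47 are reproduced gives 315, 108, 101, 32, which is the order used below.

```python
import numpy as np
from scipy.stats import chi2

def pearson_test(k, p0, alpha=0.05):
    """Chi-square test: accept H0 iff q = sum (k_i - n p0_i)^2 / (n p0_i) < chi2_{1-alpha}(m-1)."""
    k, p0 = np.asarray(k, float), np.asarray(p0, float)
    n, m = k.sum(), len(k)
    q = np.sum((k - n * p0) ** 2 / (n * p0))
    return q, chi2.ppf(1 - alpha, m - 1)

# Example 10.22 (Mendel peas): round-yellow, round-green, angular-yellow, angular-green
k = [315, 108, 101, 32]
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
q, crit = pearson_test(k, p0)
print(f"q = {q:.3f}, chi2_0.95(3) = {crit:.2f} ->", "accept H0" if q < crit else "reject H0")
```

The observed q of about 0.47 is far below the percentile, so the evidence strongly supports the 9:3:3:1 hypothesis, as the text concludes.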
To complete the test, we need the distribution of q. • Tlttonm. If n is large. then under hypothesis H 0 , the statistic q is x2Cm - r - I) (10-69) The proof is based on the fact that the m parameter. po; satisfy r constntints and their sum equals I. This theorem leads to the following test: Find the ML estimates po; using the techniques of Section 9-5. Compute the sum q in ( 10-68): Accept Ho itT q < xf-a<m - r - I) (10-70) A decision based on (10-70) is called a modified chi-square test. The foregoing is an approximation. An exact determination of the distribution of q can be obtained by computer simulation. Example 10.24 The probabilities p 1 of the six faces {Ji} of a die are unknown. We participate in a game betting on {even}. To examine our chances, we count the number k1 oftimesfi 10-4 SEC. GOODNESS-OF-FIT nSTING 353 shows in 120 rolls. The results are fori = I. . . . . 6. k; = IK 16 15 24 22 25 On the basis of these data. we shall test the hypothesi!> H 11 : P{even} = .5 against /11: P{cven} -" .5 This is case 2 of a modified Pearson's tclit with the null hypothesis. we have the constrctints 11 .,.. 120. m -= 6, r = I. Under h P{even} = Po~ + P04 + P06 LPo; = 0.5 (10-71) =I i I We use as free parameters the probabilities Po•· Po2. Po)• and Pn4· To find their ML estimates. we determine the maxima of the density f<PoJ• · · • • Ptlfol "- ')'(Pol )A' subject to (1(). 7 J). With /. = In f. (9-761 yields • • • (JI•w• )A,. ill. = ~ __ k~ = O ilL_ _ h _ ~'!.. ., 0 ilpm Pol Po• ilpu~ l'u~ l'•w· (10-72) ilL _ ~} _ !_~ -= O .f!l._ = k~ _ k!!. _ 0 ilpo) P11.1 Po~ i1Pt14 P04 Ptw. The solutions of the six equations in (1()-71) and ( 10-721 yield the ML estimates of p 111 : Po~ = .123 Po) = .200 Pt14 = .IKS fiu• - .200 fJ,"' = .192 Inserting into ( 10-68). we obtain q = O.H34. Since]( ~.,~(41 ...; 9.49 > 0.834. we conclude that the evidence strongly supports the null hypothesis with (~ "" .05. One might argue that we could have applied (10-19) to the event {even} with k = k2 - k4 + kt,. We could have; however, the resulting Type II error probability would be larger. P111 = .164 Test of Independence and Continpency Tahles We arc given two events;~ and~; h =- with probabilities P(~) c .:...: /'('(!.) and we wish to test the hypothesis that they are independent: H 0 : P(f'A n '~·;) "" be against H,: P<:13 n '(.;) ::/: he (10-73) To do so, we form the four events~ n ~~. ;~ n 'f.. ;in 'L and~ n <t shown in Fig. IO.IOa. These events form a partition of~f; hence. we can use the chisquare test. properly interpreted. Figure 10.10 flu ~II .w!l .'Jl !! .(II .l.u 1'21 1'22 J; ~· :;l 22 't (a) 1'12 lbl k!2 354 CHAP. 10 HYPOTHESIS TESTING As preparation for the generalization of this important problem. we introduce the following notations. We identify the four events by :A;1 where i = I, 2 andj = I. 2. their probabilities by pq. and their number of successes by k;1 • For example. .9f,~ = \~ n 't P•~ = P<.7ld The number k 1 ~ equals the number of successes of the event .?t 1 ~. that is. the number of times:" occurs but'~ does not occur. Figure IO.IOb is a redrawing of fig. IO.IOa using these notations. This diagram is called a nmtingetrQ' tclhle. We know that P(~) = I - band P('t) = I - (', Furthermore. if the events ~~ and '-€ are independent. then the events :~ and 't. :1l and <f.. and 00 and <i arc also independent. From this it follows that under hypothesis /111 • Pu = be P1~ = b(l - c) P~1 = (I - b)£' P!~ Applying ( I0-70) to the four-clement partition A = ).s.f11. sf,~. 
sf21· we obtain the following test of ( 10-73): = (I ( 10-74) .,(3) (10-75) :A~~) ., , k , Accept Ho itT! ! ( ·v - npv>: < ; • ,., np;, Example 10.25 - b)(l - c) xi It is known that in a certain district, 52% of all voters are male and 400-i are Republican. We wish to test whether the proportion of Republicans among all voters equals the proportion of Republicans among males. that is, whether the events ~ = {Republicans} and '€ = {male} are independent. For this purpose, we poll200 voters and find that k11 =3.S k 12=?l k2t=43 k22=Sl In this problem, n = 200, b = .52, c = .4. Hence. Pu = .208 P12 = .312 P21 = .192 Pn = .288 Inserting into the sum in (10-75). we obtain q ~ 3.54. Since accept the hypothesis of independence with a = .OS. • x: ,(3)- 7.81 > 3.54, we 9 Suppose now that the probabilities b = P(~) and c = P('€) are unknown. This is case 3 of the incomplete null hypothesis with 6 1 =band 6~ = c. In our case, the constraints (10-67) are the four equations (10-74) expressing the probabilities pq in terms of the r = 2 unknown parameters b and c:. The ML estimates of band c: are their empirical estimates n'Ain and n'f./n, respectively, where n~ = k 11 + k12 is the number of successes of the event ~ = .!lt 11 U .!lt 12 and n·• = ku + k21 is the number ofsuccesses of the event <€ = .!lt21 U .!lt22. Thus ,~ -- kll + k21 6 -- kll +n kl2 n = k11 + k12 + k21 + k22 n Replacing in (10-74) the probabilities band c: by their ML estimates 6 and c, we obtain the ML estimates Pii of Pii· To complete the test, we form the SEC. J0-4 355 GOODNESS-OF-FIT TESTING modified Pearson's sum qas in (10-68). In our case. m- r- 1 = I, and the following test results: Accept Ho iff q = Example 10.26 2 2 L L i=J np (k·· _ I) i~t • npii )2 xi-..<0 v < (10-76) Engineers might have a bachelor's, master's, or doctor's degree. We wish to test whether the proportion of bachelor's degrees among all engineers is the same as the proportion among all civil engineers, that is. to test whether the events 00 = {bachelor} and <€ = {civil} are independent. For this purpose, we interrogate 200 engineers and obtain the contingency table of Fig. IO.IIa. In this problem. n = 400. n 11 = 54 + 186 '- 240. "' - 54 • 26 = KO. h = .6 ,~ = .2 ''" = .12 ,;,2 = .48 ti!t = .OK P!! = .32 and (10-67) yields q = 2.34. Since x~9 ~ (I)= 3.84 > 2.34. we accept the assumption of independence with a = .05. GENERAL CONTING.:NCY T ARI.ES Refining the objective of Example 10-26. we shall test the hypothesis that each of the three events-engineers with bachelor's (00 1). master's (~Ah>. and doctor's t:1l~) degrees-is independent of each of the four events-civil (<€, ). electrical ('~ 2 ). mechanical (<!6l), and chemical (<!54) engineer. For this purpose. we interrogate the same 400 engineers and list the data in the table of Fig. 10.11 b. This refinement is a special case of the following problem. We are given two partitions B = r~ •..... :1.\,J c= l't, ..... '{?,.J consisting of the u events~; and the v events ~;1 with respective probabilities b; = P(~;) 1 sis u '1 = P(<€i) 1s j s v We wish to test the hypothesis Ho that each of the events ~; is independent of each of the events <f,i· For this purpose. we form a partition A consisting l'igure 10.11 't ·t d'. 54 186 240 ::A 26 134 160 80 320 (a) - 400 .:II, 54 101 so 35 240 :1\2 20 44 33 23 120 "1\3 6 IS 12 7 40 80 160 95 65 400 (b) 356 CHAP. 10 HYPOTHESIS TESTING of the m = uv events ~ij = ~~ n <€1 (Fig. 10.12) and denote by PIJ their probabilities. 
Under the null hypothesis, PIJ = b1CJ; hence, our problem is to test the hypothesis H 0 : Pu = b1c.). all i,j against H 1: PiJ :/: b1tj, some i,j (10-77) We perform the experiment n times and observe that the event ~il occurs kiJ times. This is the number of times both events ~1 and '€i occur. For example, in the table of Fig. 10.11, k23 is the number of electrical engineers with a doctor's degree. Suppose, first, that the probabilities b1 and CJ are known. In this case, the null hypothesis is completely specified because then Pu = b1c1 • Applying the chi-square test (10-63) to the partition A, we obtain the following test: ~ ~ (k·· - nb·c·1 ) =~ ~ v b· . ' 1-1 1-1 n ,ti 2 Accept Ho itT q ~ < Xi-a(uv - I) (10-78) If the probabilities b1 and t) are unknown. we determine their ML estimates 6; and i) and apply the modified chi-square test (10-70). The number n~. of occurrences of the event ~; is the sum of the entries kiJ of the ith row of the contingency table of Fig. 10.12, and the number n·t, of occurrences of the event <€1 equals the number of entries ku of the jth column. Hence, n~ 1u 6; = -= = - L n n J=l k/j Cj = n·tn = -n1~ ~ k/j _ -...! (10-79) 1 1 In (10-79), there are actually r = u + v - 2 unknown parameters because l: b; = 1 and l: c1 = 1. Hence, m - r - 1 = uv - (u + v - 2) - 1 = (u - l)(v - 1) Inserting into (10-70), we obtain the following test: , u k t • q = L L iJ - n_D;Cj < xi-al<u - l)(v - l)J (10-80) if Accept Ho 1• 1 1• 1 n6;c1 Fipre 10.12 ~ '{:1 .111 ~li .. .. . li ( . .!Iii• ... .111 dl II sA if .14,1 ::1,. SI'C. Example 10.27 Goodness J0-4 GOODNJ:::SS-01"-FIT TESTING 357 We shall apply ( 10-80) to the data in Fig. 10.11 b. In this problem. n - 400. and ( 10-791 yields h; = .6 .3 .I c~ = .2 .4 .2375 .1625 The resulting estimates Pii = hl; of Pii equal p;1 = .12 .24 .1425 .0975 P~i = .06 .12 .0712 .(l4HH PJJ ''" .02 .04 .0238 .0162 Hence. q "" 5.94. Since u = 3. v = 4. (u - l)(v - II = 6. and x:01~(6) = 12.59. we conclude with consistency level a "' .05 that the proportion of engineers with bachelor's. master's. or doctor's degrees among the four groups considered is the same. • l~f' Fit (~r Distributions In Section 10-2. we presented a method for testing the hypothesis that the distribution F(x) equals a given function .1-j,(x) for all x: Ho: .1-'(x) = F 0(x). all x against H,: F(x) ::/:. .1-j,(x). some x (10-81) In this section. wt: havt: a more modest objective. We wish to test the hypothesis that f'(x) = F 0(x) only at a set of m - I points a; (Fig. 10.13): H(,: f'(a;) = Fo(a;), all i against Hi: F(a;) ::/:. .1-j,(a;), some i Figure 10.13 X (10-82) 358 CHAP. 10 HYPOTHESIS TESTING If x is of the discrete type, taking the values a;, then the two tests are equivalent. If x is of the continuous type and H~ is rejected, then Ho is rejected; however, if H 0 is accepted, our conclusion is that the evidence does not support the rejection of H 0 • The simplified hypothesis H;, is equivalent t~ ( 10-58): hence, we can apply the chi-square test. To do so. we form them events (10-83) .s41 = {a;. 1 < x s a;} Is is m The probabilities p; = P(.s41) of these events equal ( 10-84) P(.si;) = F(a;) - F(a; ,) tlo = -x Clm = +x and under hypothesis H 0• p1 = po; = fQ(cl;) - Fo(tl; t). Hence, (10-82) is equivalent to (10-58). This equivalence leads to the following test of //~: Compute them probabilities Pcli = Fu(a;)- F 0(cl;. 1I. Determine then samples x 1 of x, and count the number k1 of samples that arc in the interval (tl; , < x < a;). Form Pearson's statistic (10-59). 
(10-85) Reject H~ iff q > Xl-o(m - I) Example 10.28 We wish to test the hypothesis that the time to failure x of a system has an exponential distribution Fo(x) = I - e-x'2• x > 0. Proceeding as in (10-83), we select the points 3. 6. 9 and form the events {x s 3} {3 < x s 6} {6 < x ~ CJ} {x > 9} In this problem. Put = fo(3) = .221 Pu2 Pol = Fo(9) PU4 - ffj.6) .:. .134 = fj,(61 -· Fu(3) = .173 = I - fi~9) = .472 We next determine the times to failure :c; of n .._ 200 systems. We observe that k; are in the ith interval, where k; = 53 42 35 70 Inserting into ( 10-59). we obtain ± 2 (k; - npo;) = 12.22 > x:.,(3) = 11.34 ;-c npo; Hence. we reject H 0, and therefore also Ho. with a= .I. • q= Note The significance level a used in (10-85) is the Type I error probability of the test (10-82). The corresponding error a' of the test (10-81) is smaller. An increase of the number m of intervals brings a • closer to a decreasing, thus the resulting Type II error. This, however. results in a decrease of the number of samples k1 in each interval (a1_,, a;), thereby weakening the validity of the large n approximation (10-63). In most cases, the condition k; > 3 is adequate. If this condition is not met in a particular test, we combine two or more intervals into one until it does. We have assumed that the function Fo(x) is known. In a number of cases, the distribution of x is a function F0(x, 81 , • • • , 8,) of known form depending on r unknown parameters 8i. In this INCOMPLETE NULL HYPOTHESIS SEC. J0-4 GOODNESS-OF-FIT TESTING 359 case, the probabilities po; satisfy them equations f'u; = 1/1; as in ( 10-67) where 1/1;(8,, . . . • 8,) = Fo(a;. 81• • • • • 8,) - J-j,(a; ,. 8,. . . . , 8,) This is case 3 of the incomplete null hypothesis. To complete the test, we find the ML estimates {Ji of 8i and insert into (10-84). This yields the ML estimates of p11;. Using these estimates, we form the modified Pearson's statistic (10-68). and we apply (10-70) in the following form: (10-86) Reject if llu q > xi- a<m r - I) Example I 0.29 We wish to test the hypothesis that the number of particles emitted from a radioactive substance in to seconds is a Poisson-distributed RV with parameter 8 = Ato: 84 P{x = k} = e· 9 k! k = 0, I. . . . (10-87) We shall carry out the analysis under the assumption that the numbers of particles in nonoverlapping intervals are independent. In Rutherford's experiment, 2,612 intervals were considered, and it was found that in n4 of these intervals, there were k particles where k= 0 n4 =51 2 203 383 3 525 4 5 532 408 6 273 7 8 139 49 9 27 10 20 II 4 12 >12 0 Thus in 57 intervals, there were no particles, in 203 intervals there was only one particle, and in no one interval there were more than 12 particles (Fig. 10.14). We next form the 12 events {x = 0}. {x = 1}•...• {x = to}. {x 2:: II} and determine the ML estimates po; of their probabilities p,~. To do so, we find the ML estimate {J of8. As we know. {J equals the empirical estimate i ofx (see Problem Figure 10.14 t t soo r nk I 100 r t 0 2 3 4 s 7 ,• • 8 9 10 •I I •12 k 36() CHAP. 10 HYPOTHESIS TESTING 9-35). The RV x takes the values k = 0, I, 2, . . . with multiplicity nk; hence. . _ I 12 12 9 = x = - ~ knt = 3.9 n = ~ n4 = 2,612 n k=O k=O Replacing 9 by 8 in (10-87). we obtain •A • = e -i !_ POi k! • . = k. + I = I • . . . • II · = ,. ' ( POi I 8 = 3.9. this yields Po;= .020 .079 .154 .012 .200 .195 811 81~ m + 1:!! ) i = 12 and with .152 .099 .055 .021 .012 .005 .003 To find Pearson's test statistic q. 
we observe that the numbers k, in ( 10-68) are for this problem the numbers n;. 1• The resulting sum equals "' q=~ i•l ( . )' -?POi-= 10.67 n;.,po; n;-t We have m = 12 events and r = I unknown (namely. the parameter 9): hence. cj is x2(10). Since x:95(10) = 18.31 > q. we conclude that the evidence does not support the rejection of the Poisson distribution. • 10-5 Analysis of Variance The values of an RV x are usually random. In a number of cases, they also depend on certain observable causes. Such causes are called factors. Each factor is partitioned into several characteristics called groups. Analysis of variance (ANOVA) is the study of the possible effects of the groups of each factor on the mean Tlx of x. ANOVA is a topic in hypothesis testing. The null hypothesis is the assumption that the groups have no effect on the mean of x; the alternative hypothesis is the assumption that they do. The test (10-13) of the equality of two means is a special case. In ANOV A, we study various tests involving means: the topic, however, is called analysis of variance. As we shall presently explain, the reason is that the tests are based on ratios of variances. We shall illustrate with an example. Suppose that the RV x represents the income of engineers. We wish to study the possible effects of education, sex, and marital status on the mean of x. In this problem, we have three factors. Factor 1 consists of the groups bachelor's degree, master's degree, and doctor's degree. Factor 2 consists of the groups male and female. Factor 3 consists of the groups married and unmarried. In an ANOVA investigation, we have one-factor tests, two-factor tests, or many-factor tests. In a one-factor test, we study the effects of the groups of a single factor on the mean of x. W'e refine the underlying experiment, forming a different RV for each group. We thus have m RVS x1 where m is the number of groups in the factor under consideration. In a two-factor SEC. 10-5 ANALYSIS OF 361 VARIANCE test, we form mr RVS xi" where m is the number of groups of factor I and r is the number of groups of factor 2. In a one-factor problem where x is the income of engineers and the factor is education, the RVS x1 • x2 , x3 represent the income of engineers with a bachelor's, a master's, and a doctor's degree, respectively. If we wish to include the possible effect of sex, we have a two-factor tests. In this case, we have six avs lj/c where j = I, 2, 3 and k = I, 2. for example, xu represents the income of male engineers with a doctor's degree. In the course of this investigation, we shall use samples of the avs xi and Xjk. To avoid mixed subscripts, we shall identify the samples with superscripts, retaining the subscripts for the various groups. Thus xj will mean the ith sample of the RV xi; similarly, xj11 will mean the ith sample of the RV xik· The corresponding sample means will be identified by overbars. Thus -I~;- Xj = -ni L,- 'Xjk 'Xj 1 = -nikI~; 2.. 'Xjk ,_ • (10-88) where ni and nik are the number of samples of the Rvs xi and xi4 , respectively. TH•: ANOVA PRINCIPLE We have shown in (7-95) that if the RVS uncorrelated with the same mean and variance and Qo = L"' (Zj - i)2 E{Qo} = (m - l)u 2 then Z; are (10-89) j•l Furthermore, if they are normal, then the ratio Q0 /u 2 has a x2(m - I) distribution. The following result is a simple generalization. 
Given m uncorrelated Rvs wi with the same variance u 2 and means Tli• we form the sum Q = L"' (wj- w>2 - I "' w=-Lwi m j-1 j~l (10-90) • Theonm E{Q} = (m - l)u 2 + e (10-91) where ~ (Tli-TI -)2 I ~ ,_.{-} (10-92) m i-1 Furthermore, if the Rvs wi are normal and e = 0, then the ratio Q/u 2 has a x2(m - I) distribution. e=~ - .,=-~Tii=c.w i"l • Proof. This result was established in (7A-14); however, because of its importance, we shall prove it directly. The Rvs zi = wi - Tli are uncorrelated, with zero mean and variance u 2 ; furthermore, wi- w = zi - i + Tli- :rj. Inserting into (10-90), we obtain Q = L"' j•l [(Zj - i) + (Tii - Tj)] 2 362 CHAP. 10 HYPOTHESIS TESTING We next square both sides and take expected values. This yields E{Q} = E {i (zi- i)2} i + (7Ji- 1j)2 /=1 J~l because E{(zj - i)(7Ji - Tj)} = 0, and (10-91) follows from (10-89). If e = 0, then Q = Q0 ; hence, Q/u 2 is x2(m - 1). If e + 0, then the distribution of Q/u 2 is noncentral x2(m - 1, e). This theorem is the essence of all ANOV A tests; based on the following consequence of (10-91). Suppose that u2 is an unbiased consistent estimator of u 2• If n is sufficiently large, then Q is concentrated near its mean (m - l)u 2 + e, and u 2 is concentrated near its mean u 2• Hence, if e = 0that is, if the RVS wi have the same mean-then the ratio Q/(m - l)u2 is close to I, and it increases as e increases. One-Factor Tests Them RVS llt • • • • , Xj• • • • , X,. represent them groups of a factor. We assume that they are normal with the same variance. We shall test the hypothesis that their means Tli = E{xi} are equal: Ho: 111 = · · · = .,,. against H,: 7J; + 71/• some i, j (10-93) For this purpose, we sample the jth RV ni times and obtain the samples xJ (Fig. 10.15). The total number of samples equals N = n1 + · · · + n,. We next form the sum Q, m =L "' x>2 = L <w,- w)2 nJ<xj- j-1 J~l Fipre 10.15 Samples x! X~ x~' x, X~ X~ x:z x2 x, xi I x'"' I xn"' "' x,. (10-94) SEC. 10-5 ANALYSIS OF VARIANCE. 363 where ii is the average of the samples xj of x1 as in ( 10-88), w1 .:.. Vnji1, and I m -x=-~x ~- I - 1 ,. m w=-~w1 m J-1 mi I The RVs i 1 are independent, with mean TIJ and variance u 2/ n1• From this it follows that the RVs w1 are independent, with mean TIJ Vnj and variance u 2 • Hence lsee (10-91)], E{Q,} = (m- l)u 2 + e e '" = ,.... ~ i ( - ~ n, TIJ - Tl )- (10-95) I The constant e equals the value of Q 1 when the RVs i 1 and i are replaced by their means Tli and Tj. To apply the ANOV A principle, we must find an unbiased estimator of u 2• For this purpose, we form the sum (10-96) For a specificj, the Rvs z; = xj are i.i.d., with variance u 2 and sample mean i = i 1 . Applying (10-89). we obtain E{ i (xj - i1) 2 } r•l = (n1 - l)u 2 Hence, E{Q2} = L'" (ni - l)u 2 = (N - m)u 2 (10-97) j-1 Reasoning as in the ANOV A principle. we conclude that the ratio Q 1/(m - I )(r 2 q = (10-98) Q2/(N- m)(T 2 is close to I if e = 0, and it increases as (' increases. It is reasonable, therefore, to use q as our test statistic. To complete the test, we must find the distribution of q. • Theorem. Under hypothesis H0 • the RV (N- m)Q, q-= (m- I)Q:! is F<m- I. N m) (10-99) • Proof. If H 0 is true, the Rvs w1 = i 1Vnj are i.i.d., with variance u 2• Hence, as in (10-90), the ratio Q 1/u 2 is x2(m - 1). Similarly, the ratio Yi = ....·~(; 2L Xj i-1 -)2. 'Xj IS u From this and (7-87) it follows that the sum (10-100) 364 CHAP. 10 HYPOTHESIS TESTING degrees offreedom. 
Furthermore, Yi is independent ofiJ [see (7-97)], and Q1 is a function of ii; hence, the RVS Q1 and Q2 are independent. Applying (7-106) to the Rvs Q 1/u 2 and Q 2/u 2, we obtain IIQ-99). A one-factor ANOV A test thus proceeds as follows: Observe the ni samples xj of each RV xi. Compute ii, i, and the sums Q1 and Q2 • . (N- m)Q 1 Accept H 0 afT (m _ UQ < F 1 -a(m - I. N - m) (IQ-101) 2 where Fu(m - I, N - m) is the u-percentile of the Snedecor distribution. OC Function The Type II error probability is a function {3(e) depending only on e. To find it, we must determine the distribution of q under the alternative hypothesis. The distribution of Q 2/u 2 is >c 2(N - m) even if H 0 is not true because the value of Q 2 does not change if we replace the Rvs xj by xj- Tli· If e =I= 0, the distribution of Q 1/u 2 is a noncentral >c 2(m - I, e) with eccentricity e (see the Appendix to Chapter 7). Hence, the ratio q has a noncentral F(m - I, N - m, e) distribution. Example 10.30 A factory acquires four machines producing electric motors. We wish to test the hypothesis that the mean times to failure of the motors are equal. We monitor nJ = 7, S, 6, 4 motors from each machine and find the following times to failure xj, in weeks. Machine 1: Machine 2: Machine 3: Machine 4: 6.3 6.3 6.9 8.2 7.2 7.9 9.2 7.7 s.s 9.0 8.S 7.S In this problem, m = 4, N Q, = 4.09 Since F.,(3, 18) Example 10.31 GS: HS: C: = 3.16 < 9.4 6.9 4.9 6.8 = 22. i Q2 4.7 4.2 8.1 S.8 8.S 7.0 i, = 7.0 i2 = 6.8 x1 = 6.9 = 8.0 6.9 i. = 7.17S, = S.68 q 18 = 3 ~' = 4.32 4.32, we reject the null hypothesis with a = .OS. • We wish to test the hypothesis that parental education has no effect on the IQ of children. We have here three groups: grammar school (GS), high school (HS), and college (C). We select 10 children from each group and observe the following scores. 77 88 102 92 99 90 102 80 94 lOS 118 9S liS 92 76 83 112 108 90 92 80 120 108 122 7S 79 86 In this problem, m = 3, n1 = n2 = n3 = 10, N = 30, i Q, = 134.9 Q2 = S,9S6 q= 2 ~~~· 96 123 110 x, = 94.4 = 97.8 x3 = 99.S X2 = 97.23, = 3.06 Since F.,(2, 27) = 3.3S > 3.06, we accept the null hypothesis with a = .OS. • 10-5 SEC. ANALYSIS OF VARIANCE 365 Two-Factor Tests In a two-factor test. we have mr RVs xi, j = I. . . . . m k = I. . . . , r The subscript j identifies the m groups of factor I. and the subscript k identifies the r groups of factor 2. The purpose of the test is to determine the possible effects of each factor on the mean TIJk = E{x1k} of Xjk. We introduce the averages I , I "~ I "' ; x·~· =-r k-1 x·, X.4 = Xjk x.. = mr ~ ~ XjA (10-102) 1 L m L ;-1 Thus Xj. is the average of all RVS in the jth row of Fig. 10.16, and x.k is the average of all Rvs in the kth column. The average of all RVS (global average) is x... We shall study the effects of the groups on the means (10-103) TJ.;. = E{x1.} 71.4 = E{x.d 71 .. = f;{x .. } of the Rvs so formed. For this purpose. we introduce the constants (Fig. 10.17) a; = 7li. - 71.. {34 = 71.4 - 71.. "'/iA - TJ.;4 - 71 •• - CXj -- {3, (10-104) Thus ai is the deviation of the row mean 71;. from the global mean 71 ••• f3 4 is the deviation of the column mean 7l.k from the global mean 71 ••• and "Yik is the deviation of 7liA from the sum 71 •• + ai + {34 • Each of the sums of the m numbers ai, the r numbers {34 , and the mr numbers -y14 is 0. In a two-factor test, the objective is to study the effects of the groups on the parameters ai, {3'" and Yik· This problem has many aspects. We shall briefly discuss two cases related to the following concept. 
Figure 10.16 Groups of factor :! .. 0 xu XJ2 XI, ·"I· X21 x22 x2, ·"2· '!$ ....!! .Y;. Xpc 0 a ::1 e x,., ~ x., .\',2 .\' .. x,, x,.. ., .\',. .\' 366 CHAP. 10 HYPOTHESIS TESTING 'ljlc ..,,. .,,. (Jk "' 1/ .. / / ... 011 J F~pn 10.17 ADDITIVITY We shall say that two factors are additive if the corresponding parameter 'Yik is 0 or, equivalently, if T/jt = TJ .. + Clj + 134 (10-105) In the engineer e:llample. the condition 'Yi' = 0 means that the variations in average income due to education are the same for male and female engineers and the variations due to sex are the same for engineers with bachelor's, master's, and doctor's degree. As in the one-factor tests, the analysis of a two-factor test is based on the ANOV A principle. We start with the following problem: We assume that factprs 1 and 2 of a two-factor problem are additive, and we wish to test the hypothesis that ai = 0, that is, that the mean TJi4 ofx14 does not depend onj: H0 : 'Yik = 0, ai = 0, allj against H 1 : 'Yik = 0. ai :1= 0. somej (10-106) One might argue that ( 10-106) could be treated as a one-factor problem: Ignore factor 2 and test the hypothesis that the groups of factor 1 have no effect on the mean of Xjk. We could do so. The reason we choose a twofactor test is that as in Example 10.23, the possible variations within the groups of factor 2 lead to a more powerful test. Proceeding as in (10-94), we form the sum QJ = r L"' J-1 (Xj.- x.Y = L"' (w1 J•l w.)2 W·J =X·'J· Vr (10-107) This sum is essentially identical to Q 1 if we replace sample means by column averages and ni by r. This yields [see (10-95)] e m =rL j-J (TJj. - TJ.Y "' =rL J•l a} (10-108) SEC. 10-5 ANALYSIS OF VARIANCE 367 We next form the sum ,., Q4 = r LL (x_~, - XJ. - x.k • x.. )~ (10-109) i' I 4 ·J Under the additivity assumption. 'Y.i4 = 0: hence. 7JJ4 = T}j. + TJ.4 - TJ .. for both hypotheses. From this it follows that if we subtract the mean from all RVs in (10-109). the sum Q. remains the same: hence. it has the same distribution under both hypotheses. Reasoning as in (7-97). we can show that the ratio Q! is x~[(mu· l)(r- I)) (10-110) The details of the proof are rather involved and will be omitted. Since the mean of an RV with x2<n> distribution equals n. we conclude that I::{Q4 } ..::. (m - I )(r -· l),r~ ( 10-111) Proceeding as in ( 10-98). we form the ratio Q~/(m - I )IT~ q = Q 4 /(m - I ><r _:. ' I)~~~ (10-112) If //11 is true. then a;= O: hence. the eccentricity t• in ( 10-108) is 0. From this it follows that under hypothesis llu. Q,. , -:;IS x·(m -· IT" I (10-113) ) Combining with ( 10-110), we conclude that if H11 is true. then the ratio q in (10-112) is Flm - I, (m - l)(r- 1)]. This yields the following test of (10106): . (r- I)Q, Accept H 0 tff Q. · < F 1 ,.[m - I. (m - l)(r - I)] (10-114) Example 10.32 We shall examine the effects of sex and education on the yearly income x;k offactory workers. In this problem, we have two factors. Factor I. sex, consists of the two factors M and F. Factor 2. education. consists of the three factors GS. HS, and C. We observe one value of each RV and obtain the following list of incomes in thousands. "i' M F xu X24 X.J,; GS GC c Xjl Xj'! Xj.l Xj, 15 14 14.5 18 16 17 27 24 25.5 20 18 X .. = 19 Assuming that there is no interaction between the groups (additivity), we shall test the hypothesis that 011 = 0 and the hypothesis that ~4 - 0. 368 CHAP. 10 HYPOTHESIS TESTING = 0. With m = 2 and r = 3, we obtain 2 QJ = 6 Q. = I q = Q3 = 12 Q. 
Since F.95 (1, 2) = 18.6 > q, we conclude with a = .OS that the evidence (a) H0 : OlJ does not support the rejection of the hypothesis that sex has no effect on income. (b) H0 : /34 = 0. We proceed similarly, interchangingj and k. We have the same Q4 , but m = 3 and r = 2. This yields q = QJ = 133 Q4 Since F.95(2, 2) = 19 < q, we reject the hypothesis that education has no effect on income. Note that in both cases, the evidence is limited, and the resulting Type II error probability is large. • In (10-106), we assumed that the two factors are additive, and we tested the hypothesis that the first factor has no effect on "'lk. Now we test the hypothesis that the two factors are additive: Ho: 'Yik = 0, all j, k against Ha : 'Yilt :/= 0, some j, k (I 0-115) Unlike (10-106), the test of (10-115) cannot be carried out in terms of a single sample. We shall use n samples for each RV x1". This yields the mrn samples (Fig. 10.18) TEST OF ADDmVITY i =I, . . . , n j =I, . . . , m k =I, . . . . , r Flpre 10.18 •tlJ/c ~~-'" ~~~ Groups of factor ::! k xl, .t\t x]k x),l i x),, SEC. 10-6 NF.YMAN-PEARSON. SEQUENTIAL, AND I.IKELIHOOD RATIO TF.STS 369 The test is based again in the ANOVA principle. We form the sum "' , Q, = n j•l L 2 (iiA - k:l iJ. - i.A + i.Y (10-116) where overbars indicate sample averages and dots row and column averages as in (10-102). The sum Q 5 is identical to the sum Q 4 in (10-109) provided that we replace the RVs Xjk byiik Vii. From this and (10-110) it follows that if H 0 is true, then (10-117) If H 0 is not true, that is, if Yit :/= 0. then the ratio Q~/u 2 has a noncentral x2 distribution, and E{Q 5 } = (m - l)(r - 1)u~ + e (10-118) where again, e is the value of Q 5 if all RVs are replaced by their means: Ill e = n j•l 2 I 2 <11ik - A-·1 M 1Jj. - 11.• + 11.Y r =2 2 YJ• j· I k-1 We next form the sum "' Q6 , If = ~~"' ~ ~ ~ i (Xjt - -, (10-119) Xjt )- j-1 k-1 j·l For specificj and k, the RVS xJ, are i.i.d .. with sample mean iJk; hence [see (7-97)), 12 ~ j ~ (Xj4 (T - - • Xj4 )'• IS )(-'( n - 1) i· I The quadratic form Q6 /u~ is the sum of mr such terms: therefore [see (7-87)), Q~ is )( 2[mr(ll (T· -· 1)) (10-120) Combining with (10-117). we conclude that the ratio q = Q~/(m - l)(r - Q.lmr(n _ 1) 1) . • IS f[(m - l)(r- 1), mr(n - 1)] This leads to the following test of (10-115): Accept Hu iff q < F 1-,.[(m - l)(r - 1), mr(n - 1)] (10-121) (10-122) 10-6 Neyman-Pearson, Sequential, and Likelihood Ratio Tests In the design of a hypothesis test, we assign a value a to the Type I error probability and search for a region De of the sample space minimizing the resulting Type II error probability. Such a test is called most powerful. A 370 CHAP. 10 HYPOTHESIS TESTING simpler approach is the determination of a test based on a test statistic: We select a function q = g(X) of the sample vector X and search for a region R,. of the real line minimizing {3. We shall say that the statistic q is most powerful if a test so designed is most powerful. In our earlier discussion, we selected the function g(X) empirically. In general, such a choice of g(X) does not lead to a most powerful test. In the following, we determine the function g(X) such that a test based on the test statistic g(X) is most powerful, and we determine the properties of the corresponding critical region. The analysis will involve the simple hypothesis 8 = 80 against the simple hypothesis 8 = 8 1 • We denote by f<X. 8) = /(x 1 • 8) · · · f<xn. 8) the joint density of then samples x; of x. and we form the ratio r=r(X)=/(X. 
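Here is a sketch of the two-factor computation of Example 10.32 above (one observation per cell, additivity assumed), following (10-107) through (10-114). The income table is the one printed in the example; the function name is mine.

```python
import numpy as np
from scipy.stats import f

def two_factor_no_interaction(x, alpha=0.05):
    """Tests alpha_j = 0 (rows) and beta_k = 0 (columns), one sample per cell, as in (10-112)-(10-114)."""
    x = np.asarray(x, float)
    m, r = x.shape
    row_means, col_means, grand = x.mean(axis=1), x.mean(axis=0), x.mean()
    Q4 = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    Q3_rows = r * np.sum((row_means - grand) ** 2)
    Q3_cols = m * np.sum((col_means - grand) ** 2)
    df_err = (m - 1) * (r - 1)
    q_rows = (Q3_rows / (m - 1)) / (Q4 / df_err)
    q_cols = (Q3_cols / (r - 1)) / (Q4 / df_err)
    return (q_rows, f.ppf(1 - alpha, m - 1, df_err)), (q_cols, f.ppf(1 - alpha, r - 1, df_err))

# Example 10.32: rows = sex (M, F), columns = education (GS, HS, C), incomes in thousands
x = [[15, 18, 27],
     [14, 16, 24]]
rows, cols = two_factor_no_interaction(x)
print(f"sex:       q = {rows[0]:.0f}, F_0.95(1, 2) = {rows[1]:.1f}")   # 12 < 18.5 -> accept
print(f"education: q = {cols[0]:.0f}, F_0.95(2, 2) = {cols[1]:.1f}")   # 133 > 19 -> reject
```

The two printed ratios reproduce the values q = 12 and q = 133 obtained in the example.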
8o) (10-123) /(X. 81 ) We shall show that r is a most powerful test statistic. NEYMAN-PEARSON CRITERION Suppose that D, is the critical region of the test of the hypothesis Ho: 8 = 80 against H,: 8 = 8 1 (10-124) We maintain that the test is most powerful itT the region D,. is such that r(X) ~ c for XED,. and r(X) > c for XeD, (10-125) Thus r = c: on the boundary of D,.• r < c in the interior of D,.• and r > c outside De. The constant c: is specified in terms of the Type I error probability a = P{X E D,.IHo} = P{r ~ clllo} (10-126) The resulting Type II error probability equals {3 = P{X e D,.IH,} = P{r > cjH,} (10-127) • Proof. Suppose that D;. is the critical region of another test and the corresponding error probabilities equal a' and {3'. It suffices to show that if a' = a then {3' > {3 In Fig. 10.19, we show the sets Dr = A U Band D; = A U B' where A = D,. n D;. Under hypothesis H 0 , the probability masses in the regions Band B' are equal; hence, we can assign to each differential element~ V; of B centered at X a corresponding element ~Vi of B' centered at X', with the same mass /(X. 8o)4V; =/(X', 8o)~V; Under hypothesis H 1 , the masses in the regions ~ V; and ~ v; equal /(X, 8 1 )~ V; and/(X'. 8 1 )~Vi. respectively. Hence, as we replace~ V; by~ Vi. the resulting increase in {3 equals ~{3 =/(X. 8, )~ V; -/(X'. 8, )d Vi Sf.C. 10-6 NEYMAN-PEARSON, SEQUENTIAl., AND LIKELIHOOD RATIO TESTS 371 r=C' Figure 10.19 But f(X, Bo) < cf(X, Btl /(X'. Bo) > cf(X', B.> Therefore, 4a; f-1 > !c [/(X, Bo)4 V - f(x', Bo)] = 0 And since {3' == {3 + I.4{3;, we conclude that {3' > {3. The ratio r will be called the NP test statistic. It is a statistic because the constants Bo and B1 are known. From the Neyman-Pearson criterion it follows that r is a most powerful test statistic. The corresponding critical region is the interval r < c where c = r1-a is the I - a percentile of r [see ( 10126)]. The most powerful test of(I0-124) thus proceeds as follows: Observe X and compute the nttio r. f(X, 8o) (10-128) Reject iff Ho /(X, B,) S TJ-a Note There is a basic difference between the NP test statistic and the test statistics considered earlier. The NP statistic r is specified in terms of the density of x. and it is in all cases most powerful. In general this is not true for the empirically chosen test statistics. However. as the following illustrations suggest, for most cases of interest empirical test statistics generate most powerful tests. To carry out the test in (10-128) we must determine the distribution of r. This we can do, in principle, with the techniques of Section 7-1; however, the computations are not always simple. The problem is simplified if r is of the form r = 1/J(q) (10-129) where q is a statistic with known distribution. We can then replace the test (10-128) with a test based on q. Such a test is most powerful, and its critical region is determined as follows: Suppose, first, that the function 1/J(q) is monotonically increasing as in Fig. 10.20a. In this case, r s c iff q s C4 ; 372 CHAP. 10 HYPOTHESIS TESTING , , Figure 10.20 hence, a = P{r ~ clHo} = P{q ~ cuiHo} (10-130a) Denoting by qu the u-percentile of q, we conclude that Ca = qQ. Thus Ho is rejected itT q < qQ. Suppose, next, that l{l(q) is monotonically decreasing (Fig. 10.20b). We now have a = P{r ~ c!Ho} = P{q Ho is rejected itT q ~ ~ c,lHo} c, = q,_Q (10-130b) q,_Q. The general case is not as simple because the critical region R, might consist of several segments. For the curve of Fig. 10.20c, R,. 
consists of the half-line q ~ c 1 and the interval c2 ~ q ~ c3 ; hence, a= P{q s c,IHo} + P{c2 ~ q ~ cliHo} To find R,., we assign values to c from 0 to I, determine the corresponding points c;, and compute the sum until its value equals a. Example 10.33 The RV xis N(6, u) where u is a known constant. In this case, the NP ratio r equals r = exp { 2 2 [2 (x;- 6,)2- 2 (X;- 6o)2 ]} = exp { 2; 2 ((6f- 6i>- 2(6,- 6o).iJ} (10-131) ! This shows that the NP test statistic is a function r = 1/J(i) of the sample mean i of x. From this it follows that i is a most powerful test statistic of the mean 6 of x. To find the corresponding critical region, we use (10-130): If6 1 < fJo, then X(il is a monotonically increasing function; hence, we reject H 0 iff i < q0 where q,. is the u-percentile of i under hypothesis H0 • If 6, > 60 , then t/l(.i) is monotonically decreasing; hence, we reject H0 iff .i > q 1-o. Note that the test against the simple hypothesis 6 = 6 1 < 60 is the same as the test against the composite hypothesis 6 < 60 because the critical region < q 0 does not depend on 61 • And since the test is most powerful for every 61 < 60 , we conclude that .i < q0 is the uniformly most powerful test of the hypothesis 6 = 60 against 6 < 6o. Similarly, .i > qt-o is the uniformly most powerful test of the hypothesis 6 = 60 against 6 > 6o. • x Sf.C. J0-6 NEYMAN-PEARSON, SEQUENTIAL, AND LIKELIHOOD RATIO TESTS 373 Exponential Type l)istributions The results of Example 10.33 can be readily generalized. Suppose that the RV x has an exponential type distribution f(x, 8) = h(x) exp {a(8)cb(x) - b(9)} as in (9-91). In this case. the NP ratio equals r = exp {[a(60 ) - a(8 1 )JG(X) - [b(Hu) - b(8 1 )]} ( 10-132) where G(X) = ~cb(x;). Thus r is a function til(q) of the test statistic q = G(X); hence, q is a most powerful test statistic of the hypothesis 8 = 80 against 8 = 9,. We shall now examine tests against the composite hypothesis 8 < 80 or 8 > 8o. Suppose that a(8) is a monotonically increasing function of 8. If 81 < flo [see (10-132)], the function 1/l(q) is also monotonically increasing. From this it follows that the test statistic q generates the uniformly most powerful test of the hypothesis fl = 80 against fl < fl 1 , and its critical region is G(X) s qa where q" is the u-percentile of G(X) under hypothesis H0 • If 81 > 80 , then tiJ(q) is monotonically decreasing, and the test against 0 > 00 is uniformly most powerful with critical region G(X) =:!:: q 1• ... If a(8) is monotonically decreasing, G(X) is again the uniformly most powerful test statistic against the hypothesis 8 < 80 or fl > 8o, and the corresponding critical regions are G(X) =:!:: q,_a and G(X) :S qa. respectively. Example 10.34 The RV xis N(TJ. 8) where., is a known constant. In this case. f(x. 8) :.:: ~ exp {- 2 ~ 2 {.t - TJ) 2 · In 8} This is an exponential density with =- I cb(x) = (x - TJ)~ q = L(X; - TJ): 282 Thus a(8) is a monotonically increasing function of 8; hence, the statistic q generates the uniformly most powerful tests of the hypothesis 8 = 811 against 8 < 80 or 8 > 81 • The corresponding critical regions are q s qo and q ?. q 1• a• respectively. To complete the test, we must find the distribution of q under hypothesis H 0 • As we know. if 8 = 8o. then the RV a(8) ~ = ~~ L (x; - TJ) 2 is x2<n) Hence. q11 = 8~)(~(n). This leads to the following tests: H,: 8 < 80 Reject Ho iff ~(x; - 71) 2 < 8Ax~(n) H,: 8 > 8o Reject Ho iff~(x;- TJ)1 > 8ijxi-a(n) We have thus reestablished earlier results )see (10-36)). 
However, using the NP criterion, we have shown that the tests are uniformly most powerful. • Suftident Statistics Suppose that the parameter 8 has a sufficient statistic y(X); that is, the functionf(X, 8) is of the form f(X, 8) = H(X)J[y(X). 8) 374 CHAP. 10 HYPOTHESIS TESTING as in (9-95). ln this case, the NP r.ttio equals J[y(X). 8o)J f(X, 8o) 8,) ' = f(X, = J(y(X). 8, IJ Thus r is a function of y(X); hence, the sufficient statistic q = y(X) of 8 is a most powerful test statistic of the hypothesis 9 -= 90 against (J = 9, . Example 10.35 We have shown in Example 9.29 that if x is uniform in the interval (0, 8), then the maximum q = Xma• of its samples X; is a sufficient statistic of 8 and its distribution [see (9-100)] equals I Fq(q, 8) = ,. q"U(8- ql 8 From this it follows that q is the most powerful test statistic of 8, and u Hence. q, = 88Vu. = P{q s q,iHo} = Fq(q,, 80 ) = I 88 q: Furthermore [see 110-142)]. , = (~ r U(8 - q) Since r is a function of q, we shall determine the critical region R, of the test directly in terms of q. If 81 < 80 • then R, is the interval q < c where a= P{q s c!Ho} = The corresponding p error equals p = P{q ::e c!H,} = 1 - (~r (f.r c = 8o~ = 1 - a (~ r If 81 > 80 , then R.- is the interval q > c, and a= P{q > c!Ho} = 1- p = P{q s ciH,} = (~r (;,r = c (1- a) = 8o~ (:~r • Sequential Hypothesis Testing In all tests considered so far, the number n of samples used was specified in advance, either directly in terms of the complexity of the test or indirectly in terms of the error probabilities a and p. Now we consider a different approach. We continue the test until the data indicate whether we should accept one or the other of the two hypotheses. Suppose that we wish to iest whether a coin is fair. If at the 20th toss, heads has shown 9 times, we decide that the coin is fair; if heads has shown 16 times, we decide that it is loaded; if heads has shown 13 times, we consider the evidence inconclusive and continue the test. In this approach, the length of the test-that is, the number n of trials required until a decision is reached-is random. It might be SEC. J0-6 NEYMAN-PEARSON, SEQUENTIAl.. ASD LIKELIHOOD RATIO TESTS 375 arbitrarily large. However, for the same error probabilities, it is on the average smaller than the number required for a fixed-length test. Such a test is called sequential. We shall describe a sequential test of two simple hypotheses: Ho: 8 = 8o against 11,: 8 = 8, The method is based on the NP statistic (10-123). We form the ratio r "' 80 ) /(X,, 81) =/(X,, wheref(X,, 8) = .f(.t,. 0) · · · f(x,, 8) is the joint density of the first m samples x1 , • • • • x,., of x. We select two constants c0 and c 1 such that c0 < c: 1 • At the mth trial, we reject H0 if r, < c0 : we accept 110 if r, > c 1 : we continue the test if c0 < r, < c 1 • Thus the test is completed at the nth step (Fig. 10.21) iff cn < r, < c, for every m < 11 and r, < ''u orr, > c 1 00-133) The ratio r,., can be determined recursively: r, /(Xm• 8u) 0,) = r,., /(X,. To carry out the test. we must express the constants co and c 1 in terms of the error probabilities a and p. This is a difficult task. We shall give only an approximation. • Theorem. In a sequential test carried out with the constants c0 and c 1 • the Type 1 and Type II error probabilities a and fJ satisfy the inequalities a ~ !5 C'o _L < _ a 1 _!_ c, (10-134) Figure 10.21 , , f<X.,, 0 1 = . ··- 0 [lX,,. 8 1 ) f Rcjco:t /10 c, ~--------------\ cnl-----------~-~ Accept //11 0 \ \ 11 m 376 CHAP. 
10 HYPOTHESIS TESTING {J Figure 10.22 The proof is outlined in Problem 10-30. These inequalities show that the point (a, {J) is in the shaded region of Fig. 10.22 bounded by the lines a + {3,.0 = c0 and a + {Jc 1 = I. Guided by this, we shall use for c0 and c 1 the constants a I -- a (10-135) co=-C'J = - 1-{J {3 In the construction of Fig. 10.22, (a, {J) is the intersection of the two lines. From (10-134) it follows that with c0 and c 1 so chosen, the resulting error probabilities are in the shaded region; hence, in most cases. the choice is conservative. Example 10.36 The av xis N(TJ, 2). We shall test sequentially the hypothesis ho: ., = 20 against H 1: ., = 24 with a= .05, ~=.I. Denoting by i,. the average of the first m samples X; of x, we conclude from (10-131) with 80 = 20, 81 = 24, and u = 2 that i ((24 r,. = exp { 2 - 202) - 2(24 - 20)i,.)} = exp {- m(i,. - 22)} From (10-135) it follows that 05 co= · 95 ,., = · lnc0 = -2.89 Inc·, = 2.25 .9 .I And since lnr,. = -m(i,. - 22), we conclude that c·0 < r,. < c 1 itT -2.89 < SEC. 377 10-6 NEYMAN-PEARSON, SEQUENTIAL, AND LIKELIHOOD RATIO TESTS .45 • s.~ 24 ~' 22 + 2.89 m ~ RcjectH0 im.~ 22 ~ ~ 22- 2.25 m 234567m 12 13 14 IS 16 17 (a) m lbl Figure 10.23 · m(.t',, 22) < 2.25. that is, iff 22 . ~·m25 < i, < 22 + 2 8 · ~ m x, Thus the test terminates when crosses the boundary lines 22 • 2.25/ m and 22 2.89/m of the uncertainty interval (Fig. 10.23a). We accept 1/11 if.i,., crosses the lower line; we reject H 11 if it crosses the upper tine. • Example 10.37 Suppose that :tl is an event with probability p. We shall test the hypothesis against with a= .05. /3 = .I. We form the 1.ero-one RV x associated with the event :A. This RV is of the discrete type with point density p•(l - p) 1 ·' where x = 0 or I. Hence • .f<km• p) p4.,(1 - p)m 4•• and rm = (Po)4.( I PI - Pn)"' I -PI 4. = (~)4~( ~ )"' 4m 4 6 where k,, is the number of successes of :A in the first m trials. In this case. r,., is a monotonically increasing function of the sample mean .i, = k,.,lm; hence. -2.89 <In r,., < 2.25 iff 045 5.56 045 . ·- 7.14 --,;;-<x,.,< . + m This establishes the boundaries of the uncertainty region of the test as in Fig. 10.23b. • 378 CHAP. 10 HYPOTHESIS TESTING Likelihood Ratio Tests So far, we have considered tests of a simple null hypothesis H 0 against an alternative hypothesis H 1, simple or composite. and we have assumed that the parameter 0 was scalar. Now we shall develop a general method for testing any hypothesis H 0 , simple or composite, involving the vector 0 = [o,, ... , o.J. Consider an RV x with density f(x, 0). We shall test the hypothesis H 0 that 0 is in a region 8 0 of the k-dimensional parameter space against the hypothesis H 1 that 0 is in a region 8 1 of that space: Ho: 0 E 8o against H,: 0 E 8, (10-136) The union 8 = So U 8 1 of the sets So and 8 1 is the parameter space. For a given sample vector X, the joint density f(X, 0) is a function of the parameter 0. The maximum of this function as 0 ranges over the parameter space 8 will be denoted by Om. In the language of Section 9-5,/(X, 0) is the likelihood function and Om the maximium likelihood estimate of 0. The maximum off( X, 0) as 0 ranges over the set 8 0 will be denoted by Om0. If H0 is the simple hypothesis 0 = 00 , then the region 8 0 consists of the single point Oo and Om0 = 8o. With Bm and 8,.., so defined, we form the ratio A =/(X, OmO) (10-137) /(X, Bm> A test of (10-136) based on the corresponding statistic A is called the likelihood ratio (LR) test. 
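Before the critical region is determined in the next paragraph, a small numerical sketch may help make the definition (10-137) concrete. The grid-search helper below is purely illustrative (my own construction, not the book's): it maximizes the log-likelihood over Θ0 and over the full parameter space Θ and forms the ratio Λ, here for n samples of a normal RV with known σ.

```python
import numpy as np

def likelihood_ratio(loglik, theta0_grid, theta_grid):
    """Crude grid-search version of (10-137): Λ = max over Θ0 divided by max over Θ."""
    l0 = max(loglik(t) for t in theta0_grid)   # log f(X, θ_m0)
    l1 = max(loglik(t) for t in theta_grid)    # log f(X, θ_m)
    return np.exp(l0 - l1)                     # Λ always lies between 0 and 1

# Illustration: n samples of N(θ, σ) with σ known; Θ0 = {θ0}, Θ = all θ.
rng = np.random.default_rng(0)
sigma, theta_true, theta0 = 2.0, 8.5, 8.0
x = rng.normal(theta_true, sigma, size=25)

def loglik(theta):
    # Constant terms are omitted; they cancel in the ratio.
    return -np.sum((x - theta) ** 2) / (2 * sigma ** 2)

theta_grid = np.linspace(5.0, 12.0, 2001)
lam = likelihood_ratio(loglik, [theta0], theta_grid)
print(lam)   # for this normal case, close to exp(-n(x̄ - θ0)²/(2σ²)), as in Example 10.39 below
```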
We shall determine its critical region. Clearly. /(X. 8m0) s /(X, Om): hence. for every X, 0 s As I (10-138) From this it follows that the density of A is 0 outside the interval (0. I). We maintain that it is concentrated near 1 if 0 E 8 0 and to the left of 1 if 0 E 8 1, as in Fig. 10.24. The data X are the observed samples of an RV x with distribution/(x, B) where ii is a specific unknown number. The point Om is the ML estimate of ii; therefore (see Section 9-5), it is close to ii if n is large. Figure 10.24 SEC. 10-6 NEYMAN-PEARSO~. SEQUENTIAL, A~D I.IKELIHOOD RATIO TF.STS 379 Under hypothesis H0 • 0is in the set (·)0 ; hence. with probability close to I. 8, equals 8,o and A = 1. If 0 is in e,, then 8, is different from Bmo and A is less than I. These observations form the basis of the LR test. The test proceeds as follows: Observe X and find the extremes 6m and 8m0 of the likelihood functionf(X, 6). Reject Ho (10-139) iff To complete the test. we must find c:. If Ho is the simple hypothesis 6 = Bo. then c equals the corresponding percentile A,. of .\. If, however. H0 is composite, the Type I error probability a(6) = P{.\ s c:IHo} is no longer a constant. In this case, we select a constant a0 as before and determine c such that a(9) does not exceed au. Thus c is the smallest number such that P{.\ s d s a 11 for every 6 E (-)11 (10-140) In many cases. P{.\ s dH11 } is maximum for 0 -= ()n. In such cases. c is determined as in the simple hypothesis 8 = 00 : that is. the constant cis such that P{.\ s ci8 = 9o} = ao. Example 10.38 The av 8has an exponential density /(x. 91.::: 9e- 8·'U<xl. We shall test the hypothe- sis fJ ~ Hn : against 611 using the LR test. In this case (fig. 10.25al. j'( X. 0) = H"(' H.tH (I > II As we sec. I 6,, = -::. 6 11 "' X 1/.t = {Oo .t011 >I .tfJo < I itln > I .tHn < I Figure 10.25 !IX, 8) :> I 0- .~ ,., I I I 0 "' ,I I I ./ \ (J -< \' X , I O o '. ' II ' 0 .i: (a) (b) 380 CHAP. 10 HYPOTHESIS TESTING Thus Ais an increasing function of x (Fig. 10.25b). Hence, in this case. the LR test is equivalent to a test with test statistic q = i. Since A. s c· itT s c 1, the corresponding critical region is x s c 1 where c 1 is such that P{i s c 1l90 } = ao. • x ASYMPTOTIC PROPERTIES OF LIKELIHOOD RATIOS In an LR test, we must find the distribution of the LR A; in general, this is not a simple prob- lem. The problem is simplified if we can find a function q = 1/I(A) with known distribution. If 1/I(A) is monotonically increasing or decreasing, then the LR test>. s cis equivalent to the test q s c 1 or q ~ c 1• Example 10-38 illustrated this. Of particular interest in the transformation w = -2 In A. As the next theorem shows, for large n. the RV w has a x2 distribution regardless of the form of f(x, 8). We shall need the following concept. Free Pammeters Suppose that the distribution of x depends on k parameters 81 • • • • , 8k. We shall say that 81 is a free parameter if its possible values in the parameter space e are noncountable. The number of free parameters of 8 will be denoted by N. We define similarly the number N 0 of free parameters of 9 0 • Suppose that x is a normal RV with mean ,., and variance u 2 where ,., is any number and u > 0. In this case, e is the halfplane -:x <,., < :x, u > 0; hence, N = 2. If under the null hypothesis,,., = 'lo and u > 0, then only u is a free parameter, and No = I. If '1 ~ 'lo and u > 0, then both parameters arc free, and No = 2. Finally, if,., = 'lo and u = uo. then No= 0. • Theorem. 
If N > N0 and n is large, then under hypothesis H0 , the statistic w = -2 In A is x~<N- N0 ) (10-141) The proof of this rather difficult theorem will not be given. We shall demonstrate its validity with two examples. Example 10.39 The RV xis normal with mean 11 and known variance u. We shall test the hypothesis /111 : 11 =-= 1/1: 11 ~ lJo against lJo (10-142) u:~ing the LR test. and we shall show that the result agrees with (10-141). In this problem. we have one unknown parameter and f(X. 111- exp {- In this problem. lJmO observe that 2~ L (x1 - 1112 } = lJo because Ho is the simple hypothesis 11 "' lJn· To find 11m• we '\.' , '\.' _, (10-143) L. (x;- 11~ = L. (X; - x)- + n(x - lJ)- . Thus/(X, lJ) is maximum if the term (X - 1112 is minimum. that is. if 11 = .,, From this it follows that exp {A= { 2~ L (x; 1 ... exp - 2u L 1Jo)2 } _ •} Cx1 - x )· = exp {- ;u (.r- 2 1Jo) } = x. (10-144) SEC. 10-6 NEYMAN-PI:.ARSOr.;, SEQUENTIAL, AND LIKELIHOOD RATIO TESTS 381 Mi) 0 flo- cl :~ flo l'igure 10.26 In Fig. 10.26, we plot A. as a function of i. As we see. A. s c iff !.i- 11ol ~ c 1 • This shows that the LR test is equivalent to the test based on the test statistic q of (10-9). To verify (10-141) we observe that, w = -2 In A = { i - 1Jo)~ vln Under hypothesis H 0 , the RV i is normal with mean 1Jo and variance vln; hence, w is x2(1). This agrees with (10-141) because N = I and No -= 0. • Example 10.40 We shall again test {10-142), but now we as!lume that both parameters 11 are v are unknown. Reasoning as in Example (9-27), we find "'ntO = "'l VntO = I~ , ~ {X; - TJo)• n 1Im vm =X = -11t L -, <x1 - x)· (10-145) The resulting LR equals A~ u;."" exp j- f.;~ ex, - ..rj v,,,2 exp {- 2!, L {X; - = (;.:,)"' .~)2} This yields [see (10-143>1 , (X - 1JI)! .v- = v,., ' Thus A. is a decreasing function of lyl; hence. A s c iff IYI ~ <·,. Ltlrge 11 In this example. N = 2 and N0 = I, to verify {10-141) we must show that the RV - 2 In A is x2{1). Suppose that H 0 is true. We form the avs v "' = -I n L -, (x· - x)• ' , {i ·· 1Jn)2 y- " " - - - v,., 382 CHAP. 10 HYPOTHESIS TESTING As n increases, the variance of v, approaches 0 [see 17-99)]. Hence, for large n " --I v == v v, == E{v,} = n Furthermore, i - 'l'lo- 0 as n - ~. From this it follows that 2 « I 2 = n In (I + y 2) == ny: == ( i - '17o) vln with probability close to I. This agrees with (10-141) because the RV i- 'l'lo is normal with 0 mean and variance vln. • Y2 == (i - '17o) v -2 In A Problems 10-1 10-2 10-3 10-4 10-5 10-6 We are given an RV x with mean., and standard deviation u = 2. and we wish to test the hypothesis '11 = 8 against '11 = 8. 7 with a = .0 I using as test statistic the sample mean i of n samples. (a) Find the critical region Rc of the test and the resulting fj if n = 64. (b) Find n and Rr if fj = .OS. A new car is introduced with the claim that its average mileage in highway driving is at least 28 miles per gallon. Seventeen cars are tested, and the following mileage is obtained: 19 20 24 25 26 26.8 27.2 27.5 28 28.2 28.4 29 30 31 32 33.3 35 Can we conclude with significance level at most .05 that the claim is true'? The weights of cereal boxes are the values of an RV x with mean '11· We measure 64 boxes and find that .t = 7. 7 oz. and s = 1.5 oz. Test the hypothesis H 0 : '11 = 8 oz. against H 1 : '11 ::F 8 oz. with a =- .I and a = .01. We wish to examine the effects of a new drug on blood pre!tsure. We select 250 patients with about the same blood pressure and separate them into two groups. 
The first group of 150 patients receives the drug in tablets, and the second group of 100 patients is given identical tablets without the drug. At the end of the test period, the blood pressure of each group, modeled by the avs x andy, respectively, is measured, and it is found that i = 130, sA = 5 andy= 135, Sy = 8. Assuming normality, test the hypothesis 'I'IA = '11~ against 'l'lx "' 'l'lr with a= .05. Brand A batteries cost more than brand 8 batteries. Their life lengths are two avs x andy. We test 16 batteries of brand A and 26 batteries of brand 8 and find these values, in hours: i = 4.6 Sx = 1.1 y = 4.2 s~ = 0.9 Test the hypothesis 'I'IA = 'l'l.v against '17• > 'l'ly with a = .05. Given r avs xk with the same variance u 2 and with means E{x4} = 'l'lk· we wish to test the hypothesis , H0 : ~ k~t , ,.,.,, = 0 against H 1: L ck'l'lk "'0 4~1 where c4 are r given constants. To do so, we observe n4 samples x4, of the kth PROBLEMS 383 RV x4 and form their respective sample means .'f4 • Carry out the test using as the test statistic the sum y = I c·,i,. 10·7 A coin is tossed 64 times. and heads shows 22 times. Test the hypothesis that the coin is fair with significance level .05. 10-8 We toss a coin 16 times, and heads shows f.; times. If J.; is such that k 1 s k s k~. we accept the hypothesis that the coin is fair with significance level a = .05. Find k 1 and k~ and the resulting {3 error: (a) using (10-20); (b) using the normal approximation (I 0-21 ). 10-9 In a production process. the number of defective units per hour is a Poissondistributed RV x with parameter>..= 5. A new process is introduced. and it is observed that the hourly defectives in a 22-hour period are X; = 3 0 5 4 2 6 4 I 5 3 7 4 0 8 3 2 4 3 6 5 6 9 Test the hypothesis >.. = 5 against >.. < 5 with a = .05. 10-10 Given anN(.,, u) RV x with known.,, we test the hypothesis u = u 0 against u > u 0 using as the test statistic the sum q = -:h ~ uo the resulting OC function {3(u) equals the area of the xi -..(n)u~/u 2 (Fig. PIO.IO). , 0 O(j c ' .,. Xi-o Oj (X; - T/) 2• Show that x2(n) density from 0 to ' =X!-c. Figure PIO.IO 10-11 The Rvs x and y model the gntdes of students and their parental income in thousands of dollars. We observe the gmdes x; of 17 students and the income .V; of their parents and obtain X; Y; 50 65 55 17 59 70 63 20 66 45 68 15 69 55 70 30 70 25 72 28 72 42 75 28 79 18 84 28 89 32 93 75 96 32 Compute the empirical correlation coefficient; of x andy (see (10-40)1. and test the hypothesis r = 0 against r :/: 0. 384 CHAP. 10 HYPOTHESIS TESTING 10.12 It is claimed that the time to failure of a unit is an av x with density 3x 2e- 3x'U(x) (Weibull). To test the claim, we determine the failure times x1 of 10.13 18-14 10.15 10.16 10.17 10.18 10.19 80 units and form the empirical distribution F'lx) as in (10-43), and we find that the maximum of the distance between Fix) and the Weibull distribution Fo(x) equals 0.1. Using the KolmogorofT-Smimov test, decide whether the claim is justified with o = .I. We wish to compare the accuracies of two measuring instruments. We measure an object 80 times with each instrument and obtain the samples (x1, y 1) of the avs x = c + .,,. and y = c + ~'b· We then compare x1 and y 1 and find that x1 > y 1 36 times, x1 < y 1 42 times, and x1 = y 1 2 times. Test the hypothesis that the distributions F,.(v) and Fb(v) of the errors are equal with o = .I. The length of a product is an RV x with cr = 0.2. When the plant is in control, 11 = 30. 
We select the values o = .05, p (30.1) = .I for the two error probabilities and use as the test statistic the sample mean i of n samples of x. (a) Find nand design the control chart. (b) Find the probability that when the plant goes out of control and 11 = 30.1 the chart will cross the control limits at the next test. A factory produces resistors. Their resistance is an RV x with standard deviation cr. When the plant is in control. cr > 200; when it is out of control. cr > 20!1. (a) Design a control chart using 10 samples at each test and o = .01. Use as test statistic the RV y = VI(x1 - i )2• (b) Find the probability p that when the plant is in control, the chart will not cross the control limits before the 25th test. A die is tossed 102 times, and the ith face shows k1 = 18, 15, 19, 17. 13, and 20 times. Test the hypothesis that the die is fair with o = .05 using the chi-square test. A utility proposes the construction of an atomic plant. The county objects claiming that 55% of the residents oppose the plant, 35% approve, and 10% express no opinion. To test this claim, the utility questions 400 residents and finds that 205 oppose the plant, 150 approve, and 45 express no opinion. Do the results support the county's claim at the 5% significance level? A computer prints out 1,000 numbers consisting of the 10 integers j = 0, I, . . . , 9. The number ni oftimesj appears equals ni = 85 110 118 91 78 105 122 94 101 96 Test the hypothesis that the numbers j are uniformly distributed between 0 and 9, with o = .05. The avs x andy take the values 0 and I with P{x = 0} = Pu P{y = 0} = p,. A computer generates the following paired samples (x1, y 1) of these avs: X; Y1 0 0 I 0 I I 0 0 0 I I 0 0 I I I 0 0 I 0 I 0 I 0 I 0 I 0 0 I I I 0 0 I I 0 0 0 I 0 0 0 0 0 0 I I 0 0 0 0 0 I I 0 I PROBLEMS 385 Using a = .05. test the following hypotheses: (a) p, ::: .5 against Px :/: .5; (b) .5 against p~. :/: .5: (c) the RVS x and y are independent. 10-20 Suppose that under the null hypothesis. the probabilities P; satisfy the m equations (10-67). Show that for large n the sum P~· '-' • • • '"' <k; - "P;l 1s '-~ -k; -ap; q• = £... m1mmum 1'f £... ; I liP; , I p; a8, 2 =0 This shows that the ML estimates of the par.tmeters H; minimize the modified Pearson's sum q. 10-21 The duration of telephone calls is an RV x with distribution F(x). We monitor 100 calls x; and observe that x; < 7 minutes for every i and the number n4 of calls of duration between k - I and k equals 24. 20. 16. 15, II, 8, 6 fork =- I. 2. . . .• 7, respectively. Test the hypothesis that FL'C) = (I - e "x)UI.t) with a = .1. (a) Assume that 8 = 0.25. (b) Assume that 8 is an unknown parctmeter. as in (10-67). 10-22 The Rvs xj are i.i.d. and N(.,, u). Show that if m Q = ~ ,, L (xJ - i) 2 then i- I j- I where Q 1 and Q2 are the sums in (10-94) and (10-96). 10-23 Three sections of a math class, taught by different teachers. take the test and their grades are as follows: Section 1: Section 2: Section 3: 38 42 41 45 51 53 56 46 50 54 57 58 61 60 65 62 66 67 65 68 70 66 72 73 69 74 75 71 77 80 73 80 81 74 82 84 76 87 92 79 91 83 96 86 90 96 (a) Assuming normality and equal variance, test the hypothesis that there is no significant difference in the average grades of the three sections with a = .05 [see (10-101)]. (b) Using (10-97) with N = 46 and m -=- 3, estimate the common variance of the section grades. 10-24 We devide the pupils in a certain gntde into 12 age groups and 3 weight groups. 
and we denote by x1" the grades of pupils in the jth weight group and the kth age group. Selecting one pupil from each of the 36 sets so formed. we observe the following gntdes. k 60 58 58 2 3 4 66 70 65 66 69 69 74 73 72 5 77 75 75 6 80 77 77 7 80 78 79 8 82 80 79 9 85 83 81 10 89 87 84 1I 92 91 88 12 96 93 92 Assuming normality and additivity, test the hypothesis that neither weight nor age has any effect on grades. 386 CHAP. 10 HYPOTHESIS TESTING 10-25 An RV x has the Erlang distribution x"' ~ xulx) m! Using the NP criterion, test the hypothesis m = 0 against m = I (Fig. PI0.25) with a = .25 in terms of a single sample x of x. Find the resulting Type II error f<x) = - /3. 0 Figure PlO.lS 18-26 Suppose thatf,(r, 8) is the density of the NP ratio r =/(X. 80 )//(X, 81). (a) Show thatf,(r, 80 ) = rf,(r, 81). (b) Findf,(r, 8) for n = I if/(x. 8) = 8e-•xu(x) and verify (a). 18-27 Given an RV x with/(x, 8) = 8 2xe-•xu(x), we wish to test the hypothesis 8 = 80 against 8 < 80 using as test statistic the sums q = x1 + · · · + x,.. (a) Show that the test Reject Ho itT q < c = x;~on) is most powerful. (b) Find the resulting OC function /3(8). 18-28 Given an RV x with density 8x 9 - 1U(x- I) and samples X~t we wish to test the hypothesis 8 = 2 against 8 = I. (a) Show that the NP criterion leads to the critical region x 1 • • • x,. <c. (b) Assuming that n is large. express the constant c and the Type II error probability 13 in terms of a. 18-29 We wish to test the hypothesis H0 that the distribution of an av xis uniform in the interval (0, I) against the hypothesis H 1 that it is N(l.25. 0.25), as in Fig. PI0.29, using as our data a single observation x of x. Assuming a = .I, determine the critical region of the test satisfying the NP criterion, and find /3. 18-30 We denote by A,, R,. and U, the regions of acceptance, rejection, and uncertainty, respectively, of H0 at the mth step in sequential testing. (a) Show that the sets A,. and R,, m = l, 2, . . . are disjoint. (b) Show that a= i m•l J.R.. /(X,, 8o)dX, s I -a= if m•l A., co /(X,, 8o)dX, i ,.., JR.. /(X,, 81)dX, =coO i ~ c1 m=l fA., /(X,, 81)dX,. = -/3) CJ/3 PROBLEMS 0 387 X ~------Rc-------- Figure P10.29 10-31 Given an event .Slf with p ..::. P(.Slf). we wish to test the hypothesis p = p 0 against p -:1: p0 in terms of number k of successes of~ in n trials, using the LR test. Show that _ , pfttl - Po)"··A A - n kk(n - k)"-A 10-32 The number x of particles emitted from a radioactive substance in I second is a Poisson RV with mean 8. In 50 seconds. 1.058 particles are emitted. Test the hypothese 80 = 20 against 8 20 with a = .OS U!iing the asymptotic approximation (10-141). 10-33 Using (10-146) and Example 10-40, show that the ANOVA test (10-101) is a special case of the LR test (10-139). * II The Method of Least Squares A common problem in many applications is the determination of a function fb(x) fitting in some sense a set of n points (x;, y;). The function fb(x) depends on m < n parameters";, and the problem is to find ";or, equivalently, to solve the overdetermined system y; = fb(x;), i = l, . . . , n. This problem can be given three interpretations. The first is deterministic: The coordinates x; and y; of the given points are n known numbers. The second is statistical: The abscissa x; is a known number (controlled variable), but the ordinate y; is the value of an RV y; with mean fb(x;). 
The third is a prediction: The numbers x_i and y_i are the samples of two RVs x and y, and the objective is to find the best predictor ŷ = φ(x) of y in terms of x. We investigate all three interpretations and show that the results are closely related.

11-1 Introduction

We are given n points (x_1, y_1), . . . , (x_n, y_n) on a plane (Fig. 11.1) with coordinates the 2n arbitrary numbers x_i and y_i. These numbers need not be different. We might have several points on a horizontal or vertical line; in fact, some of the points might be identical. Our first objective is to find a straight line y = a + bx that fits "y on x" in the sense that the differences (errors)

y_i − (a + bx_i) = v_i   (11-1)

are as small as possible. The unknowns are the two constants a and b; their determination depends on the error criterion. This is a special case of the problem of fitting a set of points with a general curve φ(x) depending on a number of unknown parameters. This problem is fundamental in the theory of measurements, and it arises in all areas of applied sciences.

The curve φ(x) could be the statement of a physical law or an empirical function used for interpolation or extrapolation. Suppose that y is the temperature inside the earth at a distance x from the surface and that theoretical considerations lead to the conclusion that y = a + bx, where a and b are unknown parameters. To determine a and b, we measure the temperature at n depths x_i. The numbers x_i are controlled variables in the sense that their values are known exactly. The temperature measurements y_i, however, involve errors, as in (11-1). We shall develop techniques for estimating the parameters a and b. These estimates can then be used to determine the temperature y_0 = a + bx_0 at a new depth x_0. An example of empirical curve fitting is the stock market. We denote by y_i the price of a stock at time x_i, and we fit a straight line y = a + bx, or a higher-order curve, through the n observations (x_i, y_i). The line is then used to predict the price y_0 of the stock at some future time x_0.

THREE INTERPRETATIONS  The curve-fitting problem can be given the following interpretations, depending on our assumptions about the points (x_i, y_i).

Deterministic  In the first interpretation, x_i and y_i are viewed as pairs of known numbers. These numbers might be the results of measurements involving random errors; however, this is not used in the curve-fitting process, and the goodness of fit is not interpreted in a statistical sense.

Statistical  In the second interpretation, the abscissas x_i are known numbers (controlled variables), but the ordinates y_i are the observed values of n RVs y_i with expected values a + bx_i.

Prediction  In the third interpretation, x_i and y_i are the samples of two RVs x and y. The sum a + bx is the linear predictor of y in terms of x, and the function φ(x) is its nonlinear predictor. The constants a and b and the function φ(x) are determined in terms of the statistical properties of the RVs x and y.

We shall develop all three interpretations. The results are closely related. The deterministic interpretation is not, strictly, a topic in statistics. It is covered here, however, because in most cases of interest, the data x_i and y_i are the results of observations involving random errors.

REGRESSION LINE  This term is used to characterize the straight line a + bx, or the curve φ(x), that fits y on x in the sense intended.
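As a small numerical preview of the formulas derived in Section 11-2 [see (11-12)], the sketch below fits a regression line y = a + bx to a handful of points (Python; the data are invented for illustration and are not from the text).

```python
import numpy as np

# Hypothetical data: depths x_i and noisy temperature readings y_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([15.2, 17.9, 21.1, 23.8, 27.2, 29.9])

# Least squares line "y on x", equations (11-12):
#   b = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)²,   a = ȳ - b x̄
xb, yb = x.mean(), y.mean()
b = np.sum((x - xb) * (y - yb)) / np.sum((x - xb) ** 2)
a = yb - b * xb

# Errors v_i of (11-1) and the LS error Q of (11-2).
v = y - (a + b * x)
Q = np.sum(v ** 2)
print(a, b, Q)
```

The same two expressions reappear in Section 11-3 as the maximum likelihood estimates (11-45) when the measurement errors are normal.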
The underlying analysis is called regression theory. Note The errors v1 are the deviations of y 1 from a + bx1 (Fig. 11.2a). In this case, y = a + bx is a line fitting yon x. We could similarly search for a line x = a + ~Y that fits x on y. In this case, the errors p.1 = x1 - (a - ~y1 ) are the deviations of x1 from a + ~y1 (Fig. 11.2b). The errors can, of course, be defined in other ways. For example, we can consider as errors the distances d1 of the points (x;. y 1) from the line Ax + By + C = 0 (Fig. 11.2c). We shall discuss only the first case. Overdetermined Systems The curve-fitting problem is a problem of solving somehow a system of n equations involving m < n unknowns. For m = 2, this can be phrased as follows: Consider the system a + bx, = Y; i = I, . . . , n where x1 and y 1 are given numbers and a and b are two unknowns. Clearly, if the points x1 and y 1 are not on a straight line, this system does not have a Figure ll.l y X (a) 0 X (b) II:) SEC. ll-2 DETERMINISTIC INTERPRETATION 391 solution. To find a and b, we form the system Y; - (a + bx;) = v; i = l. . . . , n and we determine a and b so as to minimize in some sense the "errors" v1 • 11-2 Deterministic Interpretation We are given n points (x1 , y;), and we wish to find a straight line y =a + bx fitting these points in the sense of minimizing the least square (LS) error Q = Iv1 (11-2)* where v1 = y; - (a + bx1). This error criterion is used primarily because of its computational simplicity. We start with two special cases. Horizontal Line Find a constant a0 such that the line (Fig. ll.3a) y = a0 is the LS fit of the n points (x;, y;) in the sense of minimizing the sum Qo = I(y;- ao) 2 This case arises in problems in which y1 are the measured values of a constant ao and v; = y; - ao are the measurement errors. The abscissas x1 might be the times of measurement; their values, however, are not relevant in the determination of ao. Clearly, Q0 is minimum if aQo aao = -2I (y·' ao) =0 This yields Do= n·~ Y; = Y- (11-3) ~ Thus a0 is the average of the n numbers y 1• II • The notation ~ will mean ~ . i•l Figure 11.3 y 0 X X (a) (b) 0 X {c) 392 CHAP. 11 THE METHOD OF LEAST SQUARES Homogeneous Line Find a constant b 1 such that the line y 11.3b) is the best LS fit of then points (x;, y;). In this case, Q 1 = I(y; - b 1x;)2 aQ, ab; = -2! (y;- b,x;)X; = 0 = b 1x (Fig. (11-4) (11-5) Hence, Q 1 is minimum if b _ Ix;y; ,--Ix1 (11-6) From (11-5) it follows that the LS error equals Q, = I(y; - b;x;)Y; = Iv;y; Geometric Interpretation The foregoing can be given a simple geometric interpretation. We introduce the vectors X = [Xt, • • . , Xn] Y = [y,, . . . , Yn] N = [v,, . . . , Vn] and we denote by (X, Y) the inner product of X and Y and by lXI the magnitude of X. Thus (X, Y) = Ix;y; (X, X) = Ixf = IXI2 (11-7) Clearly, where N = Y - b 1X is the error vector (Fig. I I .4). The length INI of N depends on b 1 , and it is minimum if N is orthogonal to X, that is, if (N, X) = I(y; - b1x;)X; = 0 (11-8) in agreement with (11-5). The linear LS fit can thus be phrased as follows: Q 1 is minimum if the error vector N is orthogonal to the data vector X (orthogonality principle) or, equivalently, if btX is the projection of Yon X (projection theorem). GENERAL LINE Find the LS fit of the points (X;, Y;) by the line (Fig. 11.3c) y = 11 + bx In this case, the square error (11-9) Q = I[y; - (a + bx1)]2 Ftpre 11.4 X SEC. 
11-2 DETERMINISTIC INTERPRETATION 393 is a function of the two constants a and b, and it is minimum if aQ aa = -2~(y,- ~~ = -2~[y1 - (a + bx;)] =0 (a + bx;)]x, (11-lOa) =0 (11-lOb) This yields the system na + bl:x; = l:y1 (Il-l Ia) al:x; + bl:xr = Ixm (11-llb) Denoting by and j the averages of x1 andy,, respectively. we conclude from (1 1-1 I) that a= j - hi (11- 12a) b = n~x;y; - ~x,~y, = I(x; - i)(y; - f) = l:(x1 - i)y1 nl:x1- (l:x,) 2 l:(x;- i)2 l:xr - n(x)2 (ll-l 2b) This yields y- y = b(x- i) (11-13) Hence, the regression line y = a + bx passes through the point (i, y), and its slope b equals the slope b 1 of the homogeneous line fitting the centered data (y, - j, x, - i). Note [see (11-10)] that Ily1 - (a + bx1)1(a + bx1) = 0 From this it follows that Q = l:[y;- (a + bx;))y; (11-14) x The LS error Q does not depend on the location of the origin because (see (I 1-12)] Q = ~[(y,- y) - b(x; - i)]l = l:(y1 - y) 2 - b2~(x1 - X)2 (11-15) The ratio Q ~(y,- j)2 =l-r2 2 - [l:(x, - x)(y; - j)]l r - ~(x;- x)2I(y;- y)2 is a normalized measure of the deviation of the points (x1, y1) from a straight line. Clearly, (see Problem 9-27) 0 s lrl s I (I 1-16) and lrl = 1 iff Q = 0, that is, iff y1 = a + bx1 for every i. Example 11.1 We wish to fit a line to the points X; 2.3 6 Yt 56 73 3 7.2 52 70 3.4 8 51 82 4.2 9 57 89 4.2 9.8 5.1 6 It 61 67 99 12 73 86 lOS 394 CHAP. 11 THE METHOD OF LEAST SQUARES y X Flpre 11.5 Case 1: y = b.x From the data we obtain l:.x;y; = 7,345 l:.xl = 717 b, = 10.24 l:y~ = 78,973 Hence (fig. 11.5), Q, = '}:y1- bt'l:.X;Y; = 3,760 = Case2: y 11 + 6r The data yie1di = 6.514, j = 73.357.1nserting into (11-12) and (11-14), we obtain a = 38.67 b = 5.32 Q = ISO • Multiple Linear Regression The straight line fit is a special case of the problem of multiple linear regression involving higher-order polynomials or other controlled variables. In this problem, the objective is to find a linear function (11-17) y = C1W1 + • • ' + CmWm of the m controlled variables w4 fitting the points Wu, • .. , wmi, Y; i = I, . . . , n This is the homogeneous linear regression. The nonhomogeneous linear regression is the sum C0 + CJWJ ' ' ' + CmWm This, however, is equivalent to (11-17) if we replace the term c0 by the product C0 W 0 where W 0 = 1 [see also (11-72)]. As the following special cases show, the variables w4 might be functionally related: If WA: = x•, we have the problem of fitting the points (x1, Y;) SEC. 11-2 DETERMINISTIC INTERPRETATION 395 by a polynomial. If w4 = cos w4x, a fit by a trigonometric sum results. We discuss first these cases. Parabola We wish to fit the n points (x;. y;) by the parabola (11-18) y =A+ Bx + Cx 2 This is a special case of (11-17) with m = 3 w1 = I w~ = x ,~., = x 2 Our objective is to find the constants A = c 1 • B = c2 • C = c3 so as to minimize the sum (11-19) Q = l:[y; - (A + Bx; + Cxl>F To do so, we set :~ = -2~lY;- <A aQ aB = -2l:[v·(A + Br· + .1 • 1 !~ = - 2l:[y; - + Bx; + cxr>J = Cx~)]x· I o = I (A ... Bx; + cxT>Jxr = 0 (11-20) o This yields the system nA + B~x; + Cl:.xr = ~)'; Al:x; + Buf + Cl:xl = l:.t;.V; Al:x7 + Bl:xl • Cl:x1 = };.xry; (11-21) Solving, we obtain A, B, and C. From (11-20) it follows that the LS error equals Q = l:ly;- Example 11.2 <A + Bx; + cxr>b·; (11-22) Fit a parabola to the points (Fig. 
11.6) X; Y; 0.1 1.31 0.2 1.15 0.3 0.98 0.5 0.4 1.27 0.6 1.41 1.40 0.8 1.92 0.7 1.60 0.9 1.75 I 2.04 In this case, ~X;= ~Y; = 55 14.83 and (11-21) yields A = 1.196 ~xf = 3.85 ~XtYi = 8.98 B = 0.314 ~xl = 3.02 }'.x~y; C = 1.193 Given then points (wu • . . . . the m constants c4 such that the sum GEN•:RAL CASE }'.xf = 2.53 = 6.68 }'.yr = 23.03 Q = 0.14 Wm;. y;), • we wish to find (11-23) 396 CHAP. 11 THE METHOD OF LEAST SQUARES y • • 0 Figure 11.6 is minimum. This is the problem of fitting the given points by the function in (11-17). In the context of linear equations, the problem can be phrased as follows: We wish to solve the system CtWu + ... + CmWm; = y; I s ;s n of n equations involving m < n unknowns c,, ... , em. To do so, we introduce the n errors Y; - (c,wu + · · · + CmWm;) = II; I s iS n (11-24) and we determine c11 such as to minimize the sum in (11-23). Clearly, Q is minimum if aQ ack = -2~(y;- (ctWli + ' ' ' + CmWm;))Wk; =0 1S k S m (IJ-25) This yields the system c,~wi; + · · · + Cm~Wm;Wu = ~WuY; (11-26) Ct~WJiWm; + ' ' ' + Cm~w!,; = ~Wm;Y; Solving, we obtain c11 • The resulting LS error equals Q = ~~~~ = ~II;Y; = ~y~ - Ct~WuY; - ' ' ' - Cm~Wm;Y; (11-27) Note that (11-25) is equivalent to the system ~II;Wk; = 0 k = 1, . . . , m (IJ-28) This is the orthogonality principle for the general linear regression. Nonlinear Regression We wish to find a function (11-29) y = cf>(x) fitting then points (x,, y;). Such a function can be used to describe economically a set of empirical measurement, ·to extrapolate beyond the observed data, or to determine a number of parameters of a physical law in terms of SEC. IJ-2 DETERMINISTIC INTF.RPRF.TATJON 397 noisy observations. The problem has meaning only if we impose certain restrictions on the function cl>(x). Otherwise. we can always find a curve fitting the given points exactly. We might require. for example, that cl>(x) be smooth in some sense or that it depend on a small number of unknown parameters. Suppose, first, that the unknown function cb(x) is of the form (11-30) where q4(x) are m known functions. Clearly. ct><x> is a nonlinear function of x. However, it is linear in the unknown parameters ck. This is thus a linear regression problem and can be reduced to ( 11-17) with the transformation w4 = q4(x). A special case is the fit of the points (:r;. Y;) by the polynomial )' = Ct + c~x + · · · + c,,x"' 1 This is an extension of the parabolic fit considered earlier. Let us look at another case. Trigonometric Sums We wish to fit then points (x;, )';)by the sum )' = C:t COS WtX + ' ' ' + C:m COS Wm:C where the frequencies wk Wk (11-31) are known. This is a special case of (11-17) with = COS W4X 11'4i = COS W4X; Inserting into (11-26), we conclude that the coefficients c4 of the LS fit are the solutions of the system C:t~ cos 2 w,x; + · · · + c:,~ cos w,x; cos w,x; = ~Y; cos WtX; (11-32) c,~ cos WmX; cos w,x; + · · · + C:m~ cos2 w,x; = ~)';cos WmX; Sums involving sines and cosines lead to similar results. Example 11.3 We wish to fit the curve y = c 1 X; 0.1 Y; 9 0.2 10.1 • c 2 cos 2.5x to the points (Fig. II. 7) 0.3 0.5 0.5 0.6 0.7 9.1 8.0 7.4 6.9 6 0.8 s.o 0.9 4.6 1.0 4.2 This is a special case of (11-31) with w1 = 0 and w2 = 2.5. Inserting into (11-32), we obtain the system nc 1 - c 2l: cos 2.5x; = Iy, c 1l: cos 2.5x; t c2l: cos2 2.5x; = Iy; cos 2.5x; This yields 10c1 + 1.48c2 = 70.3 c, = 6.562 1.48c 1 ... 3.88c2 = 22.0 c2 = 3.159 • Next we examine two examples of nonlinear problems that can be linearized with a log transformation. 398 CHAP. 
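Both examples follow the same recipe: pass to a log scale, fit a straight line with (11-12), and map the line's coefficients back. Here is a minimal sketch of that recipe (Python; the decaying data are invented, not the book's) for a curve of the form y = γe^(−λx), the case taken up in Example 11.4.

```python
import numpy as np

# Hypothetical decaying data to be fitted by y = γ·exp(-λx).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([80.0, 54.0, 40.0, 30.0, 21.0, 14.0, 11.0, 8.0, 6.5, 4.0])

# On a log scale the model is ln y = ln γ - λx: a straight line z = a + b·x.
z = np.log(y)
xb, zb = x.mean(), z.mean()
b = np.sum((x - xb) * (z - zb)) / np.sum((x - xb) ** 2)   # slope     = -λ
a = zb - b * xb                                           # intercept = ln γ

lam, gamma = -b, np.exp(a)
print(lam, gamma)   # estimates of λ and γ

# As the text points out for Example 11.4, these values minimize
# Σ[ln y_i - (ln γ - λ x_i)]², not the original-scale sum (11-33).
```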
II THE METHOD OF LEAST SQUARES y Y: c1 + c2 cos 2Sx 0 X Figure 11.7 Example 11.4 We wish to fit the curve y = ye·u to the points (fig. 11.8) I 2 3 4 5 6 7 81 55 42 29 20 IS II 8 7 9 5 10 3 so as to minimize the sum (11-33) Figure 11.8 y 0 10 SEC. 11-2 399 DETERMINISTIC I!'IITF.RPRETATION On a log scale. the curve is a straight line In y =- In y- A.t. This line is of the form u - h.t where z = In y c1 = In y h = -A Hence. the LS fit is given by !I 1-12). This yields ,. - I I A=- -(.t;- x)_n,.v. In 'Y =--,~In,.... A.{ ~(X; - x)n .£... ·' Inserting the given data into {11-~4). we obtain A= 0.355 In y = 4.7R z= (11-34) y = tl9e·IIW.• Note that the constants y and A so obtained do not minimize 0 1-33); they minimize the sum ~[z;- (tl • h.\';)f -" ~[In Y;- lin y - A.t;lF • Example 11.5 We wish to fit the curve y 2.0 2 3 3.0 3.8 = yx# to the points (Fig. 4 5.0 5 6.5 11.9) 6 7 7.0 7.6 8 lU 9 9.1 10 9.R This is again a nonlinear problem; however. on a log-log scale, it is a straight line of the form In y = In y - {3 In x bw where : = In y u = In y h=/3 z =u - ''"-In x •·igure 11.9 10 • 0 10 400 CHAP. II THE METHOD OF LEAST SQUARES With In x; the average ofln x;. (1)-12) yields p = l:(ln X; -~)In)'; = 0.715 ~(In x; - In x; )2 In 'Y = !n~ ~ In v - p In x .I The constants 'Y and ~ I = 0.64 y = t.9xo.m so obtained minimize the sum • }:[In}'; - (In 'Y + pIn x;)] 2• PERTURBATIONS In the general curve-fitting problem, the regression curve is a nonlinear function y = f/>(x, .\, IJ., • • • ) of known form dependinf· on a number of unknown parameters. In most cases, this problem has no closed-form solution. We shall give an approximate solution based on the assumption that the unknown parameters.\, IJ., ••• are close to the known constants .\o, 1J.o , • • • Suppose, first, that we have a single unknown parameter.\. In this case, y = f/>(x, .\)and our problem is to find.\ such as to minimize the sum l:(y;- f/>(X;, ,\)) 2 (11-35) If the unknown .\ is close to a known constant .\o, we can linearize the problem (Fig. 11.10) using the approximation (truncated Taylor series) f/>(x, .\) =- f/>(x, .\o) + (.\ - .\o)f/>"(x, .\o) «/>" = :t (11-36) Indeed, with z =y - f/>(x, .\o) the nonlinear equation y = fP(X, .\)is equivalent to the homogeneous linear equation z = (.\- .\o)w. Our problem, therefore, is to find the slope.\- .\o of Figure 11.10 y y=t{l(x,Al Yt 0 X SEC. 11-2 DETERMINISTIC INTERPRETATION 401 this equation so as to fit the points Z; = Y;- c/>(x;, Ao) in the sense of minimizing the LS error I[y;- c/>(x;, Ac,) - (A - Ao)c/>A(x;, A.o>J2 = I[z; - (A - A0 )wt1 2 Reasoning as in (11-6), we conclude with b1 = A - Ao that A - Ao = Iz;~; = I(y;- ~(X~, Ao)]c/>A(x;, .\o) ~c/>A(x;, Iw; 01-37) Ao) Suppose next that the regression curve is a function c/>(x, A, fJ.) depending on two parameters. The problem now is to find the LMS solution of the overdetermined nonlinear system cb(x;, A, fJ.) = Y; i = I, . . . . n (11-38) that is, to find A and fJ. such as to minimize the sum I[y;- c/>(x;, A. tJ.)J2 (11-39) We assume again that the optimum values of A and fJ. are ncar the known constants Ao and fJ.o. This assumption leads to the Taylor approximation cf>(x, A, fJ.) = cb(x, Ao, fJ.o) + (A - Ao)c/>A(x, .\o, fJ.o) + (fJ. - tJ.o)cf>,.(x, Ao, fJ.o) (11-40) where lf'A = alf'laA and "'"' = aiPiafJ.. Inserting into (11-38) and using the tr.msformations z = y - c/>(x, Ao, fJ.o) we obtain the overdetermined system Z; = C;Wu + C2W~; CJ = A - Ao C~ = fJ. 
- fJ.o (11-41) This system is of the form (11-17); hence, the LS values of c 1 and c 2 are determined from (11-26). The solution can be improved by iteration. Example 11.6 (a) We wish to fit the curve y = 5 sin Ax to the 13 points X; y; 0 4 0.5 6 I 18 1.5 30 2 46 2.5 42 3 52 3.5 40 4 44 4.5 34 5 20 5.5 36 6 10 of Fig. I 1.11. As we see from the figure, a sine wave with period 12 is a rea'ionable fit. We can therefore use as our initial guess for the unknown A the value Ao =. 27TI12. In this problem, f/>(x, A) = 5 sin Ax, f/>A(x, A) = 5x cos Ax, z = y - 5 sin Aox w = 5x cos A.0 x 402 CHAP. II THE METHOD OF LEAST SQUARES • s 6 0 )C Fipre 11.11 Hence [see (11-37)), A = Ao + 5 ~x1 cos Aox;(Y; - 5 sin Aox;) = .491 0 25 Ixf cos2 Aox1 and the curve y = 5 sin 0.491x results. (b) We now wish to fit the curve y = p. sin Ax to the same points. using as initial guesses for the unknown parameters A and p. the values 7r Ao=6 p.o=5 With f/l(.x, A, p.) = (11-40) yields y = p. sin Ax tb,. = sin A.x p. sin Ax = J.&o sin Aox + (p. - p.o) sin A0 x + (A. - Ao)J.&oX cos Aox and the system y = c 1w 1 + c 2w2 w 1 = sin Aox w 2 = p.ox cos Aox results where c 1 = p., c2 = A - Ao. Inserting into (11-26), we obtain Isin2 Aox1 T- c2 ~J.&oX; sin Aox 1 cos Aox; = Iy; sin Aox; c 1 Ip.ox1 sin A0x1 cos A0x1 + c2 Ip.ijxf cos2 Aox; = Iy;p.ox; cos A.ox; Hence, c1 = 4.592, c 2 = -0.0368, p. = 4.592 A. = 0.487 y = 4.592 sin 0.487x c1 • 11-3 Statistical Interpretation In Section I 1-2, we interpreted the data (x;, Yi) as deterministic numbers. In many applications, however, Xi and Yi are the values of two variables x and y related by a physical law y = f/>(x ). It is often the case that the values Xi of the SEC. 11-3 STATISTICAL INTERPRETATION 403 controlled variable x are known exactly but the values y; of y involve random errors. For example, x; could be the precisely measured water temperature and y; the imperfect measurement of the solubility TJ; of a chemical substance. In this section. we use the random character of the errors to improve the estimate of the regression line f/>(x). For simplicity, we shall assume that fb(x) is the straight line a + bx involving only the parameters a and b. The case of multiple linear regression can be treated similarly. We shall use the following probabilistic model. We are given n independent RVs y., . . . • y, with the same variance u 2 and with mean E{y;} = TJ; = a + bx; (11-42) where x; are known constants (controlled variables) and a and b arc two unknown parameters. We can write y; as a sum E{v;} = 0 (11-43) 2 where 11; are n independent Rvs with variance u • In the context of measurements. TJ; is a quantity to be measured and Jl; is the measurement error. Our objective is to estimate the parameters a and b in terms of the observed values Y; of the Rvs y;. This is thus a parameter estimation problem differing from earlier treatments in that E{y;} is not a constant but depends on i. We shall estimate a and b using the maximum likelihood (ML) method for normal RVS and the minimum variance method for arbitrary RVs and shall show that the estimates are identical. Furthermore, they agree with the LS solution ( 11-12) of the deterministic curve-fitting problem. y; = a + bx; + II; MAXIMUM LIKELIHOOD Suppose that the RVS y; are normal. In this case, their joint density equals 1 (cr'\121T)" exp {- ~L 2u [y;- (a + bx;)J2} (11-44) This density is a function of the parameters a and b. We shall find the ML estimators i and 6 of these panlmeters. 
The right side of (11-44) is maximum if the sum Q = ~[}'; - (a + bx;>P is minimum. Clearly, Q equals the LS error in ( 11-9) hence, it is minimum if the parameters a and hare the solutions ofthe system (11-11). Replacing in (11-12) the average j of the numbers y; by the average y of then Rvs y; we obtain the estimators a=y-bx - x)y; 6 = ~<x; ~(X;- X)2 As we shall see, these estimators are unbiased. (11-45) 404 CHAP. 11 THE METHOD OF LEAST SQUARES We assume again that the RVS y1 are independent with the same variance u 2, but we impose no restrictions on their distribution. We shall determine the unbiased linear minimum variance estimators (best estimators) (11-46) a= Ia;y; of the regression coefficients a and b. Our objective is thus to find the 2n constants a1 and /31 satisfying the following requirements: The expected values of a and 6 equal a and b respectively: E{i} = ~a;1J; = a E{b} = ~/3;711 = b (11-47) and their variances (11-48) are minimum. MINIMUM VARIANCE • Gauss-Mtll'koff Theo~m. The best estimators of a and b are given by (11-45) or, equivalently, by (11-46), where a·=.!.n a.X I /JI X;- /3; = ~(x; - X (11-49) x>2 + bx1, (11-47) yields + bx;) =a ~/3;(a + bx;) = b • Proof. Since E{y1} = a ~a;(a Rearranging terms, we obtain (~a; - l)a + (Ia1x1)b = 0 (~/31 )a + (~/31x; - l)b = 0 This must be true for any a and b; hence, ~a;= 1 ~a1 x1 = 0 (11-50a) ~/31 = 0 ~/3;X; = 1 (11-50b) Thus our problem is to minimize the sums ~a1 and ~131 subject to the constraints (11-SOa) and (ll-50b), respectively. The first two constraints contain only a 1, and the last two only /31• We can therefore minimize each sum separately. Proceeding as in (7-37) we obtain (see Problem 11-8) nx;x- ~xl a;= n(nF- ~xf) X;- = ~xf - xf _ 1 /3; X n(i)2 and (11-49) follows. Note that ~ ~ _ /3;2 - L (x; [L (x; - 2- X)2] L(X; - x)-2 L a;/3; = ~ L /3;- x L 131 = L(X~= x) L a1 = L (.!.n - /3;x) 2 (11-51) 2 2 = .!.n + L(X;(i) X)2 11-3 SEC. STATISTICAl. INTERPRETATION 405 Variance and Interval Estimates From (I 1-51) and (I 1-48) it follows that , u- ul = l:(X;- -r (I 1-52) X- • • Cov (a, b) 2 "' = cr, ~a;{3; = ~( -u iX_), (11-53) ~X;- Furthermore, the sum TJk = a + bxk is an unbiased estimator of the sum a + bx, , and its variance equals , u2 u 2(x, - i)~ u~, = T l:(x; - .tV (11-54) n We shall determine the confidence intervals ofthe parameters a, b. and a, b, and iJ are sums of the independent avs y;, we conclude that they are nearly normal; hence, the y = 2u - 1 confidence intervals of the parameters a, b, and m. equal Tlk under the assumption that n is large. Since the avs ti ~ ZPa b~ ZuUh '11t =z,u.;, 2 (11-55) 2 respectively. These estimates can be used if u is known. If u is unknown, we replace it by its estimate 6' 2 • To find 6' 2 , we replace the parameters 11; in the sum l:(y; - Tl;) 2 by their estimates. This yields the sum l:£1 £; = Y; - '11; = }'; - (ti ..._ bX;) (11-56) Reasoning as in the proof of ( 11-49) we conclude that the sum ilx = a + bx is an unbiased estimator of Tlx and its variance equals (see Problem 11-12) 1 o- 2 = -n-- 2~ ~ <Y·' - .yj..,, >2 (It-57) is an unbiased estimate of u 2 • Replacing the unknown u 2 in (11-52) and (11-54) by its estimate 6' 2, we obtain the large sample estimates of a 6 , uj,, and u.,,. The corresponding confidence intervals are given by (11-55). Note, finally, that tests of various hypotheses about a, b, and Tlk are obtained as in Section 10-2 (see Problem 11-9). 
Regression Line Estimate Consider the sums Tlx = a + bx and Yx = a + bx + 'llx where a and bare the constants in (11-42), xis an arbitrary constant, and "·• is an RV with zero mean (Fig. 11.12). Thus Yx = Tlx + llx Tlx = a + bx (11-58) We shall estimate the ordinate Tlx of the straight line a + bx for x :/= x; in terms of the n data points (x;, y;). This problem has two interpretations depending on the nature of the RV Yx. First Interpretation The RV Yx is the result of the measurement of the sum Tlx =a+ bx, and 'llx is the measurement error. This sum might represent a physical quantity, for example, the temperature of the earth at depth x. In this case, the quantity of interest is the estimate '11x of Tlx • We shall use for '11x the sum (11-59) '11x = ti + bx 406 CHAP. II THE METHOD OF LEAST SQUARES 0 X Figure 11.11 where a and 6 are the estimates of a and b given by (11-46). Reasoning as before, we conclude that il~ is an unbiased estimator of 71x, and its variance equals cr2 CT~ = - "' n + cr2(x - ..f)2 (11-60) l:(x1 - x)2 Second Interprettltlon The RV Yx represents a physical quantity of interest, the value of a stock at timex, for example, and 71x is its mean. The sum 11~ + llx relates Yx to the controlled variable x, and the quantity of interest is the value y~ ofy~. We thus have a prediction problem: We wish to predict the value Yx ofy~ in terms of the values y 1 of then Rvs y1 in (11-43). As the predictor of the RV Yx = 11~ + "·• we shall use the sum Y~ = itt = A + 6x (11-61) Assuming that "xis independent of then RVS v1 in (11-43), we conclude from (11-60) that the MS value of the prediction error y, - y~ = "~ - (itt - 71x) equals • 2} E{-2} 2 E {(y~- Yx> = Jl".i + CT,;, 2 -)2 21 cr cr-,x- x = cr,,2 +n + l:(X;- X )2 (11-62) The two interpretations of the regression line estimate a + 6x can thus be phrased as follows: This line is the estimate of the line 71x = a + bx; the variance of this estimation is given by (11-60). It is also the predicted value of the RV Yx = a + bx + vx; the MS prediction error is given by (11-62). Example 11.7 We are given the following II points: X; I 2 Y; 1 5 3 II 4 10 5 15 6 12 1 16 8 19 9 22 10 20 II 25 i =6 y= 14.73 11-4 Sf:.C. 0 5 .\' PREDICTION 407 10 t'igure 11.13 Find the point estimates of a, b, and Using the formulas X·- X {3; we find h = 1.836, d <i- 2 I = I<;; - x)2 <r~. h = ~{3;Y1 ti =y - h.t =- 3.714, II = 9 L <Y; 1-1 3.714 - t.836x;>~ = 3.42 u= In Fig. I I. 13 we show the points (X;, Y;) and the estimate .,:;. a+ bx. • 1.85 = 3.714 - 1.836x of 11-4 Prediction In the third interpretation of regression, the points (X;. y 1) are the samples of two RVs x and y, and the problem is to estimate y in terms of x. Multiple regression is the estimation of y in terms of severed Rvs w,. This topic is an extension of the nonlinear and linear regression introduced in Sections 5-2 and 6-3. 408 CHAP. II THE METHOD OF LEAST SQUARES Linear Prediction We start with a review of the results of Section 5-2. We wish to find two constants a and b such that the sum y = a + bx is the least mean square (LMS) predictor of the RV y in the sense of minimizing the MS value Q = E:{(y - y)2 } - f . f . [y - (a + bx)J:f(x. y)dxdy = y - y. 
Clearly, Q is minimum if aQ = - 2E{y - (a + bx)} = 0 (11-63) of the prediction error " aa aQ ab = -2E{[y- This yields the system a+ b.,x = .,.,, Solving, we obtain (a a71x + bx))x} =0 + bE{x2 } = E{xy} (11-65) a= T'/)•- h.,x b = E{xy} E{x2} - T/xTI>· 71~ (11-64) (ll-66a) = #Lll = r ~ 0'~ (11-66b) O'x Note that " =y - (a + bx) = (y - T/y) - b(x - T/x) E{ll(a + bx)} = 0 Hence, the LMS error equals Q = E{v} = E{vy} = E{[(y = u~ - bu; = u~(l - r2) .,.,).) - b(x - T'/x>P} Thus the ratio Qlu~ is a normalized measure of the LMS prediction error Q, and it equals 0 iff lrl = 1. If the avs x and y are uncorrelated, then r =0 Q = u~ b =0 y = a = .,.,, In this case, the predicted value of y equals its mean; that is, the observed x does not improve the prediction. The solution of the prediction problem is thus based on knowledge of the parameters T/x, .,.,, , O'x, u,, and r. If these parameters are unknown, they can be estimated in terms of the data (x;, y; ). The resulting estimates are Tlx = :x,.,,=y, If these approximations are inserted into the two equations in (11-66), the two equations in (11-12) result. Tbis shows the connection between the prediction problem and the deterministic curve-fitting problem considered in Section 11-2. SEC. 11-4 PREDICTION 409 We are given m + 1 RVS w1 , • • • ,w,,y (11-67) We perform the underlying experiment once and observe the values wk of the m Rvs wk. Using this information, we wish to predict the value y ofthe RV y. In the linear prediction problem, the predicted value of y is the sum y = c,w, + ... + c,.w,,. (11-68) GF.N.:RAUZATIO~ and our problem is to find the m constants c:k so as to minimize the MS value Q = E{ly - (t·,w 1 + · · · + c,.w,.)l2} (11-69) of the prediction error v = y - (c 1w1 + · · · + c,.w,). Clearly, Q is minimum if a~k = -2E{[y- (c,w, + · · · + c,.w,.)}wd = 0 (11-70) This yields the system c 1 £{~} + · · · + c,£{w,w 1} = £{w 1y} (11-71) Solving, we obtain ck. The nonhomogeneous linear predictor of y is the sum 'Yo + 'YtWt + · · · + 'YmWm This can be considered as a special case of ( 11-68) if we replace the constant 'Yo by the product 'YoWo where w0 = I. Proceeding as in (11-71), we conclude that the constants 'Yk are the solutions of the system 'Yo+ 'YtE{w,} + · · · + 'YmE{w,} = E{y} 'YoE{w,} + 'YtE{wH + · · · + 'YmE{w,w,} = E{w,y} k =1: 'YoE{w,} + 'YtE{w,w,} + · · · + 'YmE{~} = E{w,y} From this it follows that if E{w•} = E{y} = 0. then 'Yo = 0 and 'Y• 0. ( 11-72) = ck for Orthogonality Principle Two RVS x andy are called orthogonal if E{xy} = 0. From (11-70) it follows that E{vwk} = 0 k = I, •.. , m (11-73) Multiplying the kth equation by an arbitrary constant dk and adding, we obtain (11-74) E{v(d1w 1 + · · · + d,w,)} = 0 Thus in the linear pr~diction problem, the prediction error v = y - y is orthogonal to the "data" wk and to any linear function of the data (orthogonality principle). This result can be used to obtain the system in (11-71) directly, thereby avoiding the need for minimizing Q. 410 CHAP. II THE METHOD OF LEAST SQUARES From (1 I-74) it follows that E{,Y} = E{(y - y)y} =0 Hence, the LMS prediction error Q equals Q = E{(y - y)2 } = E{(y - y)y} = E{y2} - E{Y2} (11-75) As we see from (11-71), the determination of the linear predictor yofy is based on knowledge ofthejoint moments of them+ 1 avs in (II-67). If these moments are unknown, they can be approximated by their empirical estimates where w11 andy; are the samples of the avs w1 and y, respectively. Inserting the approximations into (11-71), we obtain the system (11-26). 
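The following sketch illustrates the normal equations (11-71) and the orthogonality principle (11-73) with empirical moments used in place of the unknown expected values, as described above. The simulated coefficients, noise level, and sample size are arbitrary choices, not part of the text.

```python
# A sketch of the linear predictor of y in terms of w_1, ..., w_m using empirical moments.
# The simulated data below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3
W = rng.normal(size=(n, m))                  # samples w_ik of the m RVs w_k
y = 2.0 * W[:, 0] - 1.0 * W[:, 1] + 0.5 * W[:, 2] + rng.normal(scale=0.3, size=n)

# Empirical versions of E{w_k w_j} and E{w_k y}
R = W.T @ W / n                              # m x m matrix of second moments
r = W.T @ y / n                              # m-vector of cross moments

c = np.linalg.solve(R, r)                    # system (11-71): R c = r
y_hat = W @ c
nu = y - y_hat                               # prediction error

# Orthogonality principle (11-73): the error is (nearly) orthogonal to each w_k
print("c =", np.round(c, 3))
print("E{nu w_k} ~", np.round(W.T @ nu / n, 4))
print("Q = E{nu^2} ~", round(float(np.mean(nu ** 2)), 4))
```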
Nonlinear Prediction The nonlinear predictor of an RV y in terms of the m avs x 1 , • function «/>(x., ... , Xm) = «/>(X) X= [x,, . . . , Xn] minimizing the MS prediction error e = E{[y - «/>(X)J2} • , Xm is a (11-76) As we have shown in Section 6-3, if m = 1, then «/>(x) = E{ylx}. For an arbitrary m, the function «/>(X) is given by the conditional mean of y assuming X: «/>(X) = E{yiX} = r. yf(y!X> dy (11-77) To prove this, we shaJI use the identity [see (7-21)] (11-78) E{z} = E{E{ziX}} 2 With z = [y - «/>(X) ] , this yields (11-79) e = E{E{[y - «/>(X)J2IX}} The right side is a multiple integral involving only positive quantities; hence, it is minimum if the conditional mean E{(y- «/>(X)J21X} = r. [y- «/>(X)Jr<y!X)dy (11-80) is minimum. In this integral, the function fP(X) is a constant (independent of the variable of integration y). Reasoning as in (6-52), we conclude that the integral is minimum if «/>(X) equals the conditional mean of y. PROBLEMS 4I I Note For normal RVs, nonlinear and linear predictors are identical because the conditional mean E{y!X} is a linear function of the components X; of X (see Problem 11-16). This is a simple extension of (6-41 ). OrthogonaUty Principle We have shown that in linear prediction, the error is orthogonal to the data and to any linear function of the data. We show next that in nonlinear prediction, the error y - «!><X) is orthogonal to any function q(X). linear or nonlinear, of the data X. Indeed, from (11-77) it follows, as in (6-49), that E{ly - «/>(X))q(X)} = E{q(X)E{y - «<><X>IX}} From (11-77) and the linearity of expected values it follows that E{y - «/>(X)IX} = Ey!X} - E{«J><X>IX} = E{yiX} - «/>(X) Hence [see (11-78)). (11-81) E{ly - «/>(X)]q(X)} = 0 for any q(X). This shows that the error y - <b(X> is orthogonal to q(X). Problems 11-1 11-l Fit the lines y = b1x. y ..::: u .,. bx. andy X; 0 .Vi I 2 3 4 3 5 5 5 8 6 7 8 7 8 9 9 II 10 13 II 15 and sketch the results. Find and compare the corresponding errors ~(.V; - b1x;)2 }:ly; - (tl + b.t1>F }:(y, - yxf)2 Fit the parabola y = A .,. Bx - Cx 2 to the following points .'fj Y; 11-3 I 3 = y:cu to the points 0 0 2 5 4 17 3 10 5 25 6 40 7 50 8 65 9 80 10 98 and sketch the results. Here are the average grades x;. y 1, and z.; of 15 students in their freshman, sophomore. and senior year, respectively: Xi Y; li 2.8 1.5 2.6 2.5 3.1 2.6 2.2 2.7 2.9 2.6 3.4 3.1 3.0 3.9 3.4 3.6 3.8 2.9 3.5 4.0 2.5 3.7 2.4 3.9 3.6 3.4 3.4 3.3 4.0 3.8 3.4 2.6 2.9 3.2 3.4 '!t.7 1.9 2.2 2.5 2.7 2.9 2.4 3.4 3.6 3.8 412 CHAP. 11 THE METHOD OF LEAST SQUARES (a) Find the LS fit of the plane~= c 1 • czx T CJY to the points (x1, Y~o l 1). (b) Use (a) to predict the senior grade~ of student if his grades at the freshman and sophmore year are x = 3.9 andy= 3.8. a 11-4 11-5 Fit the curve y = a - b sin ~ x to the following points Xt 0 Y; 4 X; 11-7 3 12.2 4 13.5 5 14 6 13.6 7 12 8 9.9 9 7 10 3.9 7 -9 8 -10 9 -6 10 -1 II 2 12 and sketch the results. Fit the curve y = a sin wx to the points y1 11-6 2 10 I 7.5 0 2 I 7 2 II 3 12 4 6 5 I 6 -5 4 using perturbation with initial guess a0 = 10, fiJo = .,/5. Sketch the results. We measure the angles a,~. 'Y ofa triangle and obtain then triplets (x1, y 1, ~;) where x1 - a, y1 - ~. ~~ - 'Y are the measurement errors. (a) Find the LS estimates ci, /J, j of a,~. 'Y deterministically. (b) Assuming that the errors are the samples of three independent NCO, u, ), NCO, uv ). N(O, 0':) RVS, find the ML estimates of a, ~. 'Y· · The avs y1 are N(a + bx1, u). 
Show that if = ~(Xt - x)~; ~(Xt- .i)· bi then E{b} = b, E{i} = a y 11-8 (a) Find n numbers a 1 such that the sum I = u1 is minimum subject to the constraints u 1 = I, ~1x1 = 0, where X; are n given constants. (b) Find ~. such that the sum J = I~f is minimum subject to the constraints ~~~ = 0, I~;X; =I. 11-9 The n avs y1 are independent and N(a • bx1, u). Test the hypothesis b = b0 against b :/< b0 in terms of the data y 1• Use as the test statistic: the sum 6 = ~~1y 1 in (11-46). (a) Assume that u is known. (b) Assume that u is unknown. 11-10 The avs y1 , • • • , y, are independent with the same variance and with mean E{y1} =A • Bx; + Cxf where x 1 are n known constants. (a) Find the best linear estimates b i = - - A = u;y; !J = I~;Y; C = Iy;y; of the parameters A. B. and C. (b) Show that if the avs y1 are normal, then the ML estimates of A, B. and C satisfy the system (11-21). 11-11 Suppose that the temperature of the earth distance x from the surface equals e. = a + bx. We measure e. at 10 points x1 and obtain the values Yt = a T bx1 + 111 where 111 are the measurement errors, which we assume i.i.d. and N(O, u). The results in meters for x and degrees C for e. are X; Yt 10 26.2 52 27.1 110 28.6 153 29.9 200 31.4 245 32.6 310 34.1 350 35.1 450 37.5 600 40.2 PROBI.EMS 413 (a) Find the best unbiased estimates of a and b and test the hypothesis a = 0 against a :1: 0 if cr = 1. (b) If cr is unknown, estimate it and find the 0.95 confidence interval of a and b. (c) Estimate 8, at x =- 800 m and find its 0.95 confidence interval if cr = I . 11-U (a) Show that the avs yand bin (11-45) are independent. (b) Show that the RV -!, I(l); CT" TJ;l 2 is x 2(2). (c) WithE; show that the RVs E; and iJ1 = i + I cr2 = y; - iJ; .: v; - (iJ; - TJ;), as in (11-56), bx; arc uncorrelatcd. (d) Show that the RV ~ E;2 IS • X"'( n - 2J . 11·13 (Weighted least squares) The avs y1 are independent and normal, with mean a ... bx; and variance cr~ = cr 21w;. Show that the ML estimates a and &of a and b are the solutions of the system aiw1 + biw;x; = Iw;y; aiw;x; - hiw;xl = Iw 1x 1y1 11-14 The avs x andy are such that TJ1 = 3 'lh = 4 CT1 = 2 CT~ = 8 r1 y = 0.5 (a) Find the homogeneous predictor y = ax of y and the MS prediction error Q = E{(y- ax) 2 }. (b) Find the nonhomogeneous predictory0 ='Yo+ y 1x and the error Q = E{[y - (yo + y,x)]2}. 11·15 The avs x andy are jointly normal with zero mean. Suppose that y =ax is the LMS predictor ofy in terms ofx and Q = E{(y - ax)2 } is the MS error. (a) Show that the avs y - ax and x are independent. (b) Show that the conditional density f<ylx) is a normal curve with mean ax and variance Q. 11-16 The avs y, x 1 , • • • , x,. are jointly normal with zero mean. Show that ify = a 1x 1 ... • • • + a,.x,. is the linear MS predictor ofy in terms ofx1 , then E{ylx~o . . . , x,.} = a,x 1 + · · · T a,.x,. This shows that for normal avs, nonlinear and linear predictors are identical. The proof is based on the fact that for normal avs with zero mean, uncorrelatedness is equivalent to independence. 12_ _ __ Entropy Entropy is rarely treated in books on statistics. It is viewed as an arcane subject related somehow to uncertainty and information and associated with thermodynamics, statistical mechanics, or coding. In this chapter, we argue that entropy is a basic concept precisely defined within a probabilistic model and that all its properties follow from the axioms of probability. 
We show that like probability, the empirical interpretation of entropy is based on the properties of long sequences generated by the repetition of a random experiment. This leads to the notion of typical sequences and offers, in our view, the best justification of the principle of maximum entropy and of the use of entropy in statistics. 12-1 Entropy of Partitions and Random Variables Entropy, as a scientific concept. was introduced first in thermodynamics (Clausius, 1850). Several years later, it was given a probabilistic interpretation in the context of statistical mechanics (Boltzmann, 1877). In 1948. Shannon established the connection between entropy and typical sequences. This led to the solution of a number of basic problems in coding and data trans- 414 SEC. 12-J l:.NTROPY OF PARTITIOSS ASD RASDOM VARIABLES Pl...-1;1 .~·•x ::1! A 415 = f/, v /II AI- ... ~. P; ~n P; I - • Fipre 12.1 mission. Jaynes (1957) used the method of maximum entropy to solve a number of problems in physics and, more recently, in a variety of other areas involving the solution of ill-posed problems. In this chapter. we examine the relationship between entropy and statistics. and we use the principle of maximum entropy to estimate unknown distributions in terms of known parameters (see also Section 8-2). • Definition. Given a probabilistic model~ and a partition A = [Slf 1• • Slf,,·l of~ consisting of the N events :4; (Fig. 12.1 ). we form the sum s H(A) = - ~ p; In p; p; = PC sl;) ..' (12-1) i-1 This sum is called the etrtropy of the partition A. Example 12.1 Consider a coin with P{h} = p and P{t} _,_ q. The events ai 1 =- {II} and .92 2 = {t} form a partition A = [ .s4 1 , .14 2] with entropy H<AI = -(pIn p ... q In ql If p = q = .5, then H(Al 0.562 . • Example 12.2 = In 2 -= 0.693: if p "" .25 and q = .75. then H(A) = In the fair-die experiment, the elementary events{/;} form a partition A with entropy H(AI = - (lin 6 l,. 6 · · · + !6 In !) 6 = In 6 = I .79 In the same experiment, the events tA 1 tion B with entropy = {even} and ~ 2 = {odd} form a parti- I I I I) HCB> =- - ( 2 In 2 ~· 2 In 2 - In 2 = 0.693 • In an arbitrary experiment ~. an event .5lf and its complement .5lf form a partition A = l ~. Slfl consisting of the two events d, = :A. and Slf2 = St. The 416 CHAP. 12 ENTROPY 2n2 -- 0.5 I e 0 I .s p e Figure 12.2 entropy of this partition equals q = P(sf) H(A) = -(p In p + q In q) where p = P(sf) In Fig. 12.2, we plot the functions -p In p, -q In q, and their sum fl>{p) = -(p lnp + q In q) q =1- p (12-2) The function f!>(p) tends to 0 as p approaches 0 or I because p In p ~ 0 for p- 0 and for p- I Furthermore, fi>(p) is symmetrical about the point p = .5, and it is maximum for p = .5. Thus the entropy of the partition [sf, sf] is maximum if the events sf and sf have the same probability. We show next that this is true in general. The proof will be based on the following. A Bale Inequality If c1 is a set of N numbers such that Ct + . . . + CN =I c, <?! 0 then N N i•l i•l -l: pdnp, s -l: pdn c, (12-3) • Proof. From the convexity of the function In z it follows (Fig. 12.3) that In z s z - I (12-4) SEC. 12-1 ENTROPY OF PARTITIONS AND RANDOM VARIABLES 417 2nz!Sz· z 0 Figure 12.3 With z = c;lp;, this yields N ('; /10 (('; }: p; In --: ~ }: p; - - I i-1 p, ;-t p; ) .\' = }: c·; ,\' - }: p; i·l i-1 =0 Hence, /\' (' 0 ~ }: p; In ....!. i-1 p; ,\ .\' = }: p; In c; - ;-t }: p; In p; i-1 and (12-3) results. • Theorem. 
The entropy of a partition consisting of N events is maximum if all events have the same probabilities, that is, if p; = I IN: N .\' I I H(A) = -}: p; In p; ~ -}:-In-= InN (12-5) ;·IN i-1 • Proof. Setting c·; = liN in (12-3), we obtain :'!, ·" I - ~ P; In p; ~ - }: P; In i-1 ;·.t N From the theorem it follows th~t 0 Furthermore, H(A) N = In N ~ H(A) ~InN = In N iff P1 = · · · = PN and I i =k H(A) = 0 iff p; = { 0 i :/= k (12-6) (12-7) The following two properties of entropy arc consequences of the convexity of the function -pIn p of Fig. 12.2. 418 CHAP. 12 ENTROPY B A I JI(AJ$.1/(8) \ Fipre 12.4 Property 1 The partitions A and B of Fig. 12.4 consist of Nand N + I events, respectively. The N - I events .s42, • • • , .s4.v are the same in both partitions. Furthermore, the events ~a and ~bare disjoint and .s4 1 = ~a u OOb; hence, = P(sfl) = P(~a) + Pt P(!ib) = Pa + Pb We shall show that H(A) < H(B) (12-8) • Proof. From the convexity of the function w(p) = -p In p it follows that W(Pa + Pb) < w(pa) + w(pb) (12-9) Hence, N -p, In P1 - L Pi In Pi s i-2 N -(pa In Pa + Pb In Pb) - L Pi In Pi i•2 and (12-8) results because the left side equals H(A) and the right side equals H(B). Example 12.3 The partition A consists of three events with probabilities P1 = .55 pz = .30 Pl = .IS and its entropy equals H(A) = 0.915. Replacing the event .!4 1 by the events :Ia and !jib, we obtain the partitionS wherepa = P<lla) = .38. Pb = P(98b) = .17. The entropy of the partition so formed equals H(B) = 1.3148 > H(A) in agreement with ( 12-8). • Property 2 The partitions A and C of Fig. 12.5 consist of N events each. The N - 2 events .s42 , • • • , .s4N are the same in both partitions. Furthermore, .s41 U .s42 = <€a U C€b and P1 = P( .s4t) P2 = P( .s42) Pa = P(Cf;a) Pb = P(C€b) We shall show that if P1 < Pa < Pb < P2, then H(A) < H(C) (12-10) SEC. 12-1 ENTROPY 01' PARTITIONS AND RANI>OM VARIABLES 419 A IliA) :S 1/(C) Figure 12.5 = Pu +Ph: hence, Pu = P1 + the convexity of w(p) it follows that if 8 > 0, then • Proof. Clearly, P1 + P2 w(pl) + w(p2) < w(pl + 8) + 8 < P2 -8 =Ph. From w(p2 - e) Hence, N -pi In P1 - P2 In P2 - L p; In p; s ;-) N -pu In p, -Ph In Ph - ~ p; In p; i-3 and (12-10) results because the left side equals H(A) and the right side equals H(B). Example 12.4 Suppose that A is the partition in Example 12-3 and Cis such that Pu = P(C(f,,.) = .52 In this case, H(C) Ph = P(<fl,) = .33 = 0.990 > H(A) = 0.915. in agreement with 02-10). • We can use property 2 to give another proof of the theorem (12-5). Indeed, if the events s4; of A do not have the same probability, we can construct another partition Cas in (12-10), with larger entropy. From this it follows that if H{A) is maximum, all the events of A must have the same probability. Note The concept of entropy is usually introduced as a measure of uncertainty about the occurrence of the events sf; ofa partition A, and the sum (12-1), used to quantify this measure, is derived from a number of postulates that are based on the heuristic notion of the properties of uncertainty. We follow a different approach. We view (12-1) as the definition of entropy, and we derive the relationship between H(A) and the number of typical sequences. As we show in Section 12-3, this relationship forms the conceptual justification of the method of maximum entropy, and it shows the connection between entropy and relative frequency. 420 CHAP. 12 ENTROPY Random Variables Consider a discrete type RV x taking the values x, with probability p1 = P{x = x1}. 
The events {x = x1} are mutually exclusive. and their union equals ~. Hence, they form the partition A ... = [.!If,, . . . , .!liN] .!lit= {x = Xt} The entropy of this partition is by definition the entropy H(x) of the RV x: N H(x) = H(AJl) = -~Pi In p; Pi= P{x = x;} (12-11) i•l Thus H(x) does not depend on the values x; of x; it depends only on the probabilities Pt. Conversely, given a partition A, we can construct an RV x,. such that x,.(,) = x; for .!lit (12-12) where x1 are distinct numbers but otherwise arbitrary. Clearly, {x,. = x1} = .lli1; hence, H(x,.) = H(A). 'e Entropy as Expected Value We denote by f(:c) the point density of the discrete type RV x. Thusf(x) is different from 0 at the points x; and/(x;) = P{x = x1} =PI· Using the functionf(x), we construct the RV lnf(x). This Rv is a function of the RV x, and its mean [see (4-94)] equals N E{ln f(x)} = L P1 In /(x;) i•l Comparing with (12-11), we conclude that the entropy of the RV x equals H(x) = -E{ln/(x)} (12-13) Example U.S In the die experiment, the elementary events {jj} form a partition A. We construct the Rvx,. such that x(jj) = i as in (12-12). Clearly,f(x) is different from Oat the points x = 1, . . . , 6, and /(xi) = p;: hence, H(x,.) = -E{lnf(xA)} = 6 - ~ p; In p; juJ = H(A) • CONTINUOUS TYPE avs The entropy H(x) of a continuous type RV cannot be defined directly as the entropy of a partition because the events {x = x1} are noncountable, and P{x = x1} = 0 for every x;. To avoid this difficulty, we shall define H(x) as expected value, extending (12-13) to continuous type Rvs. Note that unlike (12-11), the resulting expression (12-14) is not consistent with the interpretation of entropy as a measure of uncertainty about the values ofx; however, as we show in Section 12-2, it leads to useful applications. SEC. 12-1 ENTROPY OF PARTITIONS AND RANDOM VARIABLES 421 • Definition. Denoting by /(x) the density of x, we form the RV -In f(x). The expected value of this RV is by definition the entropy H(x) of x: H(x) = -E{ln/(x)} = - f ..J<x> lnf(x) dx (12-14) ~ote that/(x) ln/(x)- 0 asf(.·c)- 0; hence. the integral is 0 in any region of the x axis where f(x) = 0. Example 12.6 The RV x has an exponential distribution: /(x} = oce-QXU(.t) In this case. E{x} = 1/oc and lnf(x} =In oc - ocx for x > 0; hence, + I = In-ocI! ln/(x) = -In cr V27T - 1/(x) =- -E{In a - ocx} Example 12.7 If /(x) is N(TJ, u), = -In oc • then /(X)=~ u ('·IX·7J)2121T: V21J' (x - T'J)2 ., , ~u- Hence. H(x) And since E{(x - = -E{Inf(x)} = E{ln CT V27T} T'J)2} {(x ;~,.:') 2 } = u 2 • we obtain H(x) = In u Example 12.8 + E V27T - 2I : In u • v'21Te The RV x is uniform in the interval (0, c). Thus /(x) = lie for 0 < x < c and 0 elsewhere; hence, H(x) = - E {In -I} = - -I Lc In -I dx = In c c c 0 c • Joint Entropy Extending (12-13), we define the joint entropy H(x, y) oftwo RVS x andy as the expected value of the RV -lnf(x, y) wheref(x, y) is the joint density ofx andy. Suppose, first, that the RVs x andy are of the discrete type, taking the values x; and y1 with probabilities P{x = X;, y = YJ} = p;1 i = I, . . . , M j = I, . . . , N In this case, the function/(x, y) is different from 0 at the points (x~o )j}, and f(x;, YJ) = Pu· Hence, H(x. y) = -E{In/(x, y)} = - Lf(x; • .v1 >lnf(x;. YJ) (12-15) iJ If the Rvs x and y are of the continuous type. then H(x, y) = -E{ln/(x, y)} = - t . (,.f(x, y) ln/(x, y) dx dy (12-16) 422 CHAP. 
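As a quick check of the sum (12-1) and of Theorem (12-5), the short fragment below evaluates the entropy of a few partitions; the numbers in the comments are those quoted in Examples 12.1 and 12.2, and the last partition is an arbitrary unequal one used only to illustrate that it falls below ln N.

```python
# Discrete entropy, eq. (12-1)/(12-11), with natural logarithms throughout.
import math

def entropy(p):
    """H = -sum p_i ln p_i, with the convention 0 ln 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))           # fair coin, Example 12.1: ln 2 ~ 0.693
print(entropy([0.25, 0.75]))         # biased coin, Example 12.1: ~ 0.562
print(entropy([1.0 / 6] * 6))        # fair die, Example 12.2: ln 6 ~ 1.79

# Theorem (12-5): the uniform assignment maximizes H; any other assignment is smaller.
print(entropy([0.6, 0.3, 0.1]), "<", math.log(3))
```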
12 Example 12.9 ENTROPY If the avs x andy are jointly normal as in (5-100) (see Problem 12-15), their joint density equals H(x, y) = In (211'E'uru2 '\l'f=?) • Note that if the avs x andy are independent, their joint entropy equals the sum of their "marginal entropies" H(x) and H(y): H(x, y) = H(x) + H(y) (12-17) Indeed, in this case, f(x, y) = fx(x)/,(y) ln/(x, y) = Jn.fx(x) + ln/,(y) and (12-17) results. 12-2 Maximum Entropy and Statistics We shall use the concept of entropy to determine the distribution F(x) of an RV x or of other unknown quantities of a probabilistic model. The known information, if it exists, is in the form of parameters providing only partial specification of the model. It is assumed that no observations are available. Suppose that x is a continuous type RV with unknown density f(x ). We know the second moment E{x2} = 8 of x, and we wish to determine /(.t ). Thus our problem is to find a positive function j( x) of unit area such that r. x 2/(x) dx =8 (12-18) Clearly, this problem does not have a unique solution because there are many densities satisfying (12-18). Nevertheless, invoking the following principle, we shall find a solution. Principle or Maximum Entropy maximize the entropy H(x) =- The unknown density must be such as to r. f(x) ln/(x) dx (12-19) of the Rv x. As we shall see, this condition leads to the unique function f(x) = "/t'-x t21 2 Thus the principle of maximum entropy (ME) leads to the conclusion that the RV x must be normal. This is a remarkable conclusion! However, it is based on a principle the validity of which we have not established. The justification of the ME principle is usually based on the relationship between entropy and uncertainty. Entropy is a measure of uncertainty; iff(x) is unknown, then our uncertainty about the values of x is maximum; hence, f(x) must be such as to maximize H(x). This reasoning is heuristic because uncertainty is not a precise concept. We shall give another justifica- SEC. 12-2 MAXIMUM ENTROPY ASD STATISTICS 423 tion based on the relationship between entropy and typical sequences. This justification is also heuristic; however, it shows the connection between entropy and relative frequency, a concept that is central in the applications of statistics. The ME principle has been used in a variety of physical problems, and in many cases, the results are in close agreement with the observations. In the last analysis, this is the best justification of the principle. Method of Maximum Entropy All ill-posed problems dealing with the specification of a probabilistic model can be solved with the method of maximum entropy. However, the usefulness of the solution varies greatly from problem to problem and is greatest in applications dealing with averages of very large samples (statistical mechanics, for example). We shall consider the problem of determining the distribution of one or more RVs under the assumption that certain statistical averages are known. As we shall see, this problem has a simple analytic solution. In other applications the ME method might involve very complex computations. Our development is based on the following version of (12-3). A Ba.ttic Inequality If c(x) is a function such that r. =I c(x) dx and f(x) is the density of an RV x, then - r2./(x) ln/(x) dx Equality holds iff/(x) ~- c(x) ==: 0 r. f<x> In c(x) dx = c(x). • Proof. From the inequality In z ~ z- I it follows with z -r. f(x) Hence, 0 ==: r. 
( 12-20) In~~;~ dx ~ r2./(x) [;~;~- = c(x)lf(x) that I] dx = 0 f<x> In c(x) dx- f.J<x> ln/(x) dx and (12-20) results. We shall use this inequality to determine the ME solution of various illposed problems. The density of x so obtained will be denoted by fo(x) and will be called the ME density. Thus fo(x) maximizes the integral H(x) =- r. f<x> ln/(x) dx The corresponding value of H(x) will be denoted by H0(x). • Fundamental Theorem. (a) If the mean E{g(x)} = 8 of a function g(x) of the RV xis known, its ME density fo(x) is an exponential (12-21) fo(x) = ye-A.e<.tl 424 CHAP. 12 ENTROPY where 'Y and A are two constants such that 'Y f. e-Af(xl dx = 1 'Y f . g(x)e-AR~xl dx = 6 (12-22) • Proof. lf/o(x) is given by (12-21), then lnfo(x) =In 'Y- Ag(x) The corresponding entropy Ho(x) equals -t . .Jo<x> ln.fo(x) dx = - J"..Jo<x)[ln 'Y - Ag(x)) dx Hence, (12-23) Ho(x) = -In 'Y + M To show thatfo(x) is given by (12-21), it therefore suffices to show that if f(x) is any other function such that f ..JCx) dx = 1 f . g(x)f(x) dx = 6 the corresponding entropy is less than -In 'Y + M. To do so, we set c(x) = fo(x) in (12-20). This yields - t..f<x> lnf(x) dx s - f . f(x) ln/o(x) dx = - f ..J<x)[ln 'Y - Ag(x)) dx Thus H(x) s In 'Y + M, and the proof is complete. (b) Suppose now that in addition to the information that E{g(x)} = 6, we require thatf(x) = 0 for x ~ R where R is a specified region ofthe x-axis. Replacing in (a) the entire axis by the region R, we conclude that all results hold. Thus fo(x) = {~e-Af<xl ~ ~ ~ (12-24) and Ho(x) = -In 'Y + M. The constant A and 'Yare again determined from (12-22) provided that the region of integration is the set R. (c) If no information aboutf(x) is known, the ME density is a constant. This follows from (12-24) with g(x) = 0. In this case, the problem has a solution only if the region R in whichf(x) -:/= 0 has finite length. Thus if no information aboutf(x) is given,fo(x) does not exist; that is, we can find an f(x) with arbitrary large entropy. Suppose, however, that we require that f(x) =F 0 only in a region R consisting of a number of intervals of total length c. In this case, (12-24) yields fo(x) = {o1/c x e R xER (12-25) Note that the theorem does not establish the existence of a ME density. It states only that if/o(x) exists, it is the function in (12-21). Discrete Type Rvs Suppose, finally, tbat the RV x takes the values x; with probability p;. We shall determine the ME values po; of p; under the assumption that the mean E{g(x)} = 6 of the function g(x) is known. Thus our SEC. 12-2 MAXIMUM ENTROPY ASD STATISTICS 425 problem is to find p; such as to maximize the entropy H(x) = subject to the constraints L p; In P; - E{ln /(x)} = - L p;g(x;) = LP; = 1 (12-26) (12-27) 0 where g(x1) are known numbers. We maintain that po; = 'YE'·A,III.t,l (12-28) where 'Y and A arc two constants such that 'YLE'·AI/Ix,l = I 'Y L e·ANC.t,l,(.'(X;) = (12-29) 0 This follows from (12-24) if we use for R the set of points x1• However, we shall give a direct proof based on (12-3). • Pmoj. If po; is given by ( 12-28), then In Po; corresponding entropy H 0(x) equals = In 'Y - Ag(x;). Hence, the - L Po; In po; = -In 'Y L po; + ALpo;g(x;) = -In 'Y + AO It therefore suffices to show that if p 1 is another set of probabilities satisfying (12-27), the corresponding entropy is less than Ho(x). To do so, we set c1 = po; in (12-3). This yields - L p,ln p; s - L p; In Po; and the proof is complete. =- L p;[ln 'Y - Suppose. first that E{x2 } Ag(x;)) = Ho(x) = 0. In this case. 
(12-21) yields fo(x) = 'Ye·Ax: (12-30) Thus if the second moment 0 of an RV x is known. x is N(O, V7h. Hence, ILLUSTRATIONS I = -I Ho(x) = In v'21frlj 20 If the variance u 2 of x is specified, then x is N(TJ. u) where .,., is an arbitrary constant. This follows from (12-21) with g(x) = (x- TJ)2• 'Y = - - v21T9 Example 12.10 A Consider a collection of particles moving randomly in a certain region. If they are in statistical equilibrium, the x component Vx of their velocity can be considered as the sample of an RV Vx with distribution f(vx ). We maintain that if the average kinetic energy Nx = EHmv:} of the particles is specified, the ME density of Vx equals fo<vx) = ~ exp {- :~:} 426 CHAP. 12 ENTROPY Indeed, this follows from (12-30) with 8 = E{•!} = 2N,Im. Thus "• is N(O, u) where u 2 = 2N.,Im equals the average kinetic energy per unit mass. The same holds for Vv and vl. • · Suppose that xis an RV with mean E{x} = 8 and such that/(x) = 0 for x < 0. In this case,fo(x) is given by (12-24) where R is the region x > 0 and g(x) = x. This yields I (12-32) 8 with specified mean is exponential. y=A=- Thus the ME density of a positive Example 12.11 RV Using the ME method, we shall determine the atmospheric pressure P(l) as a function of the distance z from the ground knowing only the ratio N I m of the energy N over the mass m of a column of air. Assuming statistical equilibrium, we can interpret N as the energy and m as the mass of all particles in a vertical cylinder C of unit cross section (fig. 12.6). The location z of each particle can be considered as the sample of an RV z with density /(d. We shall show that fo(z) = mg e·mrz'NU(z) (12-33) N where g is the acceleration of gravity. The probability that a particle is in a cylindrical segment a between z and dz equals/(ddz; hence, the average mass in a equals m/C;.)dz. Since the energy of a unit mass, distance z from the ground, equals gz. we conclude that the energy of the mass in the region a equals gzmf(z)dz., and the total energy N equals N = J: mgzf(z) dz = E{mgz) With 8 = E{z} = Nlmg, (12-33) follows from (12-32). The atmospheric pressure P(:) equals the weight of the air in the cylinder C above z: P(z) = J: mgfo(z) dz = mge·mtt:JN • Flglll'e 12.6 ,_--..... X y SF.C. 12-2 427 MAXIMUM F.NTROPY AND STATISTICS fo<P> 0 p 0 p 0 (b) (a) Figure 12.7 Example 12.12 In a coin experiment. the probability p = P{h} that heads will show is the value of an p with unknown density f(p). We wish to find its ME formfo(p). (a) Clearly ,f(p) = 0 outside the interval (0, I); hence. if nothing else is known, then [see (12-25)]/o(p) =- I. as in Fig. 12.7a. (b) We assume that E{p} =- 8 :..: 0.236. In this case. ( 12-24) yields 0<p < I (12-34) .fo(p) = ye Af' where y and A are such that RV y Jof' e· Ap dp = I y (I pe·Ap dp Ju = 0.236 Solving for y and A. we find y = 1.1. A = 1.2 (Fig. 12.7b). • Example 12.13 (Brandeis Die+). In the die experiment, we are given the information that the average number of faces up equals 4.5. Using this information. we shall determine the ME values p 01 of the probabilities p; = P{/;}. To do so. we introduce the RV x such that x(/;) = i. Clearly. 6 E{x} = L ip1 = 4.5 i-1 With g(x) = x and x1 = i, it follows from (12-28) that Po; = ye·Ai i = I. . . . . 6 ( 12-35) where the constants y and A are such that I> fl YLeAI=l ;~I y Lie AI= 4.5 i'l To solve this system, we plot the ratio Tj(W) = w· 1 + 2w 2 + · · · + 6w " w·l + w·2 ...... + w-6 and we find the value ofw such that 1J(W) = 4.5. This yields w = 0.68 (see Fig. 12.8). 
Hence, y = 0.036, p 01 = .054 Po2 = .079 Po3 = .114 P04 = .165 Po~ = .240 P06 = .348 • + E. T. Jaymes, Brandeis lectures. 1962. 428 CHAP. 12 ENTROPY o.s 0 w Figure 12.8 GENERALIZATION Now let us consider the problem of determining the density of an 811 RV x under the assumption that the expected values = E{g11(x)} = f.. g"(x )f(x) dx k = I, . . . ,n ( 12-36) of n functions g 11(x) of x are known. Reasoning as in (12-21), we can show that the ME density of x is the function (12-37) The n + 1 constants y and A; are determined from the area condition 'Y J: exp{- ~ "-•g (x>} dx = I (12-38) 11 and then equations [see (12-36)] 'Y J-·. . g"(x) exp{- .R. ±A~tg11(x>} dx = 811 (12-39) The proof is identical to the proof of (12-21) if we replace the terms A.g(x) and U by the sums entropy equals II II ·-· ·~· L A.11g.(x) and L A11 8~r,, respectively. The resulting II Ho(x) =- In y + L "-•8• ·-· (12-40) 12-2 SEC. Example 12.14 MAXIMUM ESTROPY AND STATISTICS 429 Given the first two moments = E{x} 8, of the RV x, we wish to find /oCt). In this problem, g 1(x) = x. g 2(x) 82 = E{x2} = x 2, and 02-37) yields /ulx) = ye-Ao< A).,: This shows that Jo(.t) is a normal density with mean 11 =- t1 1 and variance u 2 8~ - 8i. • """ Partition Function The partition function is by definition the integral r. Z(},,, . . . , A,.) = = 11-y. As we see from (12-38). Z :~ = r. - exp{- ~ A4Rk(x)} dx (12-41) Furthermore, g;(X) exp{- ~ Atgk(X)} dt (12-42) t, . . . • n (12-43) Comparing with (12-39), we conclude that az - z1 ax. = fh k = This is a system of n equations equivalent to (12-39); however, it involves only then parameters Ak. Consider, for example. the coin experiment where E{p} = I~ pf(p) dp = (J is a given number. In this case [see (12-34)], Z = -I = "Y and with n J.' e-Ap dp = -I --e-A A o = I, (12-43) yields I az - z ax = I - e-'A - Ae-A xo - e-'A) =8 To findfo(p) for a given 8, it suffices to solve this equation for A. Discrete Type RVs Consider, finally. a discrete type RV x taking the values with probability p;. We shall determine the ME values po; of p; under the assumption that we know the n constants X; e. = E{g~c(x)} = L; p;g~c(X;) Reasoning as in (12-28), we obtain I Po; = exp {- A,g,(x;) - · z k = I. ... 'n (12-44) · - X,.g,.(x;)} where Z = ~ i exp {-A,g,(x;)- · · · - A,.g,.(x;)} (12-45) 430 CHAP. 12 ENTROPY is the discrete form of the partition function. From this and (12-44), it follows that Z satisfies the n equations - zI iJZaA. = Bk k = I •..• n (12-46) Thus to determine po;, it suffices to form the sum in (12-45) and solve the system (12-46) for the n parameters A.k. In Example 12.13, we assumed that E{x} = .,. In this case, n = I, 6 - z1 azax= ~ ie-Ai i•l 6 ~ e·Ai ; ... =., To determine the probabilities po;, it suffices to solve the last equation for A.. 12-3 Typical Sequences and Relative Frequency We commented earlier on the subjective interpretation of entropy as a measure of uncertainty about the occurrence of the events sf; of a partition A at a single trial. Next we give a different interpretation based on the relationship between the entropy H(A) of A and the number n, of typical sequences, that is, sequences that are most likely to occur in a large number of trials. This interpretation shows the equivalence between entropy and relative frequency, and it establishes the connection between the model concept H(A) and the real world. Typical sequences were introduced in Section 8-2 in the context of a partition consisting of an event sf and its complement $1.. 
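The graphical solution of Example 12.13 above can also be carried out numerically. The following sketch solves the constraint equation (12-29) for the multiplier by bisection; the bracketing interval is an arbitrary choice, and the resulting probabilities should come out close to the values quoted in the example.

```python
# Maximum-entropy die of Example 12.13: p_i proportional to exp(-lambda*i), mean fixed at 4.5.
# The bisection bracket (-5, 5) is an arbitrary assumption, not from the text.
import math

faces = range(1, 7)
target_mean = 4.5

def mean_for(lam):
    z = sum(math.exp(-lam * i) for i in faces)          # partition function, eq. (12-45)
    return sum(i * math.exp(-lam * i) for i in faces) / z

# The mean decreases as lambda increases, so bisect on lambda.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) > target_mean:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

z = sum(math.exp(-lam * i) for i in faces)
p = [math.exp(-lam * i) / z for i in faces]
print("lambda ~", round(lam, 4))
print("p =", [round(pi, 3) for pi in p])                # should be close to Example 12.13
print("mean =", round(sum(i * pi for i, pi in zip(faces, p)), 3))
```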
Here we generalize to arbitrary partitions. As preparation, Jet us review the analysis of the twoevent partition A = [sf, sf]. In the experiment ~11 of repeated trials, the sequence Sj = sfJ4J4 . j . . J4 = I, . . . • 211 (12-47) is an event with probability P(si) = pkq" ·• p = P(sf) = I - q (12-48) where k is the number of successes of J4. The number ns of such sequences equals 211 • We know from the empirical interpretation of probability that if n is large, we expect with near certainty that k == np. We can thus divide the 211 sequences of the form (12-47) into two groups. The first group consists of all sequences such that k""' np. These sequences will be called typical and will be identified by the Jetter t. The second group, called atypical. consists of all sequences that are not typical. SEC. 12-3 TYPICAl. SEQl,;ENCES AND RELATIVE FREQUENCY 43) St;MBER OF TYPICAL SEQUENCES We show next that the number n, of typical sequences can be expressed in terms of the entropy H(A) = -(p In p + q In q) of the partition A. This number is empirical because it is based on the empirical formula k = np. As we shall show. it can be given a precise interpretation based on the law of large numbers. To determine we shall first find the probability of a typical sequence ti. Clearly. P(ti) is given by (12-48) where now k == np n - k = n - np = nq Inserting into (12-48). we obtain P(lj) == p"Pq"" = enplnp-nqlnq = (! tiHIAl (12-49) n,. The union '!! of all typical sequences is an event in the space ~". This event will almost certainly occur at a single trial of ;-ln because it consists of all sequences with k = np. Hence, P(?i) == I. And since all typical sequences have the same probability. we conclude that I = /'(3) = n,P(t1 ). This yields ,, = enlll.-\1 (12-50) We shall now compare 11, to the number 2" of all sequences of the form (12-47). If p = .5, then H(A) = In 2: hence. n, = e" 1" ~ = 2". If p :1= .5, then H(A) < In 2. and for large n. n, ~ e"IIIAI ~ e"'" ~ = 2" (12-51) This shows (Fig. 12.9) that if p :1= .5, then the number n, of typical sequences is much smaller than the number 2" of all sequences even though P(9") = I. Thus if the experiment ~" is repeated a large number of times. most sequences that will occur are typical (Fig. 12.10). Figure 12.9 ,II 0 p 0 fl 432 CHAP. 12 ENTROPY ...... ••••••} :J" :::::: •••••• •••••• sI 111 "" e'llfiAI 'J • 1 • • • • typical k ""np ::::: ••• I • •:} •••••• :::::: a, lis •••••• •••••• = 2" 1 1 • • • • atypical k :Fnp I I I I I I I I I I I I a •••••••••••••••••••••••••••••••••••••••••• Flpre 12.10 Note Since P(~) "" I, most-but not all-atypical sequences have a small probability of occurrence. If p < ..5, then plq < I. hence. the sequence p 4q"_, is decreasing ask increases. This shows that all atypical sequences with k < np are more likely than the typical sequences. However. the number of such sequences is small compared ton, (see also Fig. 12.11). Typical Sequences and Bernoulli Trials We shall reexamine the preceding concepts in the context of Bernoulli trials using the following definitions. We shall say that a sequence of the form (12-47) is typical if the number k of successes of JA. is in the interval (Fig. 12.11) k0 < k < kb ko = np - 3'\11ii)q kb = np + 3'\11ii)q ( 12-52) We shall use the De Moivre-Laplace approximation to determine the probability of the set ~ consisting of all typical sequences so defined. Since k is in the ± 3'\11ii)q interval centered at np, it follows from (3-30) that .I: P(~) = ... (~) p"q"-k ... 2G(3) Flpre 12.11 n E!! 
(·-) k n ., 2" I = .997 (12-53) fin- exp {2( -trn ")2)( k- -:; n . .. k SEC. 12-3 TYPICAL SEQUENCES AND RF.LATIVF. FREQUENCY 433 This shows that if the experiment fin is repeated a large number of times, in 99.7% of the cases, we will observe only typical sequences. The number of such sequences is n, = To find this sum, we set p =q = L•· (n) k (12-54) k-k. 112 in ( 12-53). This yields the approximation (n) 1 k k = .997 n Vn k1 =n- - 3 X Vn k.~ =- -4- 3 X - (12-55) 2 k=k, 2 2 2 2 and it shows that .997 x 2n of the 2n sequences of the form (12-47) are in the interval (k 1, k2 ) of Fig. 12.11. If p =I= .5, then the interval Cku, kb) in (12-52) is outside the interval (k 1, k2); hence n, is less than .003 x 2n. The preceding analysis can be used to give a precise interpretation of (12-52) in the form of a limit. With ku and kh as in ( 12-52), it can be shown that the ratio 2 -;j ~ n, _! ~ (n) n - n •-•. k tends to H(A) as n-+ oc. The proof, however, is not simple. (12-56) ARBITRARY PARTITIONS Consider an arbitrary partition A = [.s4 1, • • • , .s4Nl consisting of the N events .Sii;. In the experiment ~In of repeated trials, we observe sequences of the form Sj=l!Ja,· • ·?Ak· • ·~n j= 1, . . . ,Nn (12-57) where ~k is any one of the events .<A.; of A. The sequence si is an event in the space ~n• and its probability equals (12-58) p; = P(.Sif;) where k; is the number of successes of .Sii;. If k; = np;. for every i then si is a typical sequence ti, and its probability equals P(tj) = pfP• • •• PNnP.v = enp ..np,~···•np.\ lnp~ = e·niiiAI (12-59) where H(A) = - (p 1 In p, + · · · + PN In PN) is the entropy of the partition A. The union '!! of all typical sequences is an event in the space ~n, and for large n, its probability equals almost I because almost certainly k; = np;. From this and (12-59) it follows that the number n, of typical sequences equals n, = enHIAI (12-60) as in (12-50).1fthe events .Sii;are not equally likely. H(A) <InN, and (12-60) yields n, = enHIAI ~ en InN= Nn (12-61) This shows that the number of typical sequences is much smaller than the number Nn of all sequences of the form (12-57). 434 CHAP. 12 ENTROPY Maximum Entropy and Typical Sequentes We shall reexamine the concept of maximum entropy in the context of typical sequences. limiting the discussion to the determination of the probabilities Pi of a partition. As we see from (12-60), the entropy H(A) of the partition A is maximum iff the number of typical sequences generated by the events of A is maximum. Thus the ME principle can be stated as the principle of maximizing the number of typical sequences. Since typical sequences are observable quantities. this equivalence gives a physical interpretation to the concept of entropy. We comment finally on the relationship between the ME principle and the principle of insufficient reason. Suppose that nothing is known about the probabilities Pi· In this case. H(A) is maximum iff [see (12-5)) I N PI= . • . = PN = - as in (1-8). The resulting number of typical sequences is Nn. If the unknown numbers p; satisfy various constraints of the form I.p;g~r,(X;) = 81r. as in (12-44), no solution can be obtained with the classical method. The ME principle leads to a solution that muimizes the number of typical sequences subject to the given constraints. Concluding Remarks In the beginning of our development. we stated that the principal applications of probability involve averages of mass phenomena. This is based on the empirical interpretation p =kin of the theoretical concept p = P(s4.). 
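The limit (12-56) can be examined numerically. The sketch below counts, for a biased coin, the binomial sequences with k in the np plus or minus 3 sqrt(npq) window of (12-52) and compares (1/n) ln n_t with H(A) and ln 2; the choice p = 0.3 and the values of n are arbitrary, and the convergence is slow, as the text notes.

```python
# Counting typical sequences for the two-event partition [A, A'] of a biased coin.
import math

p, q = 0.3, 0.7
H = -(p * math.log(p) + q * math.log(q))        # entropy of the partition, eq. (12-2)

for n in (20, 200, 2000):
    s = 3 * math.sqrt(n * p * q)
    lo, hi = max(0, math.ceil(n * p - s)), min(n, math.floor(n * p + s))
    # Number of typical sequences, i.e. sequences with k successes in the window (12-52)
    n_t = sum(math.comb(n, k) for k in range(lo, hi + 1))
    # Their total probability, computed through logarithms to avoid underflow
    log_pk = [math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(q)
              for k in range(lo, hi + 1)]
    prob = sum(math.exp(v) for v in log_pk)     # should be close to 2G(3) - 1 ~ 0.997
    print(f"n={n:5d}  (1/n) ln n_t = {math.log(n_t) / n:.4f}"
          f"  H(A) = {H:.4f}  ln 2 = {math.log(2):.4f}  P(typical) = {prob:.3f}")
```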
We added, almost in passing, that this interpretation leads to useful results only if the following condition is satisfied: The ratio kin must approach a constant as n increases, and this constant must be the same for any subsequence of trials. The notion of typical sequences shows that this apparently mild condition imposes severe restrictions on the class of phenomena for which it holds. It shows that of the Nn sequences. that we can form with the N elements of a partition A, only 2nH!A> are likely to occur; most of the remaining sequences are nearly impossible. Four Interpretations of Entropy We conclude with a summary of the similarities between the various interpretations of probability and entropy. Probllbility In Chapter I, we introduced the following interpretations of probability. Axiomatic: P(s4.) is a number p assigned to an event s4. of an experiment fl. Empirical: Inn repetitions of the experiment~. k p =(12-62) n Subjective: P(s4.) is a measure of our uncertainty about the occurrence of s4. in a single performance of ~. Principle ofinsufficient reason: If~ is the union of N events s4.; of a partition A and nothing is known about the probabilities p; = P(s4.i). then p; = 1/N. PROBLEMS 435 Entropy The results of this chapter lead to the following interpretations of H(A ): Axiomatic: /1( A) is a number H(A) --' - ~P; In p; assigned to a partition A of~. Empirical: This interpretation involves the repeated performance not of the experiment~ but of the experiment ~n· In this experiment, a specific typical sequence ti is an event with probability e nHIAI. Applying (12-62) to this event, we conclude that if in m repetitions of~~n the event t1 occurs m1 times and m is large, then P(tj) = e-nHIAl = m/m; hence. H(A) m· = - -nI In ~ m (12-63) This approximation relates the model concept H(A) to the observation mi and can be used in principle to determine H(A) experimentally. It is, however, imprctctical. Subjective: The number H(A) equals our uncertainty about the occurrence of the events J4; of A in a single performance of ff. Principle of maximum entropy: The unknown probabilities p; must be such as to maximize H(A). or equivalently, to maximize the number n, of typical sequences. This yields p; = liN and n, = Nn if nothing is known about the probabilities p;. Problems ll·l In the die experiment. P{even} = .4. Find the ME values of the probabilities P; = P{Ji}. ll-2 12-3 ll-4 12-S ll-6 In the die experiment, the average number of faces up equals 2.21. Find the ME values of the probabilities p; = P{Ji}. Find the ME density of an RV x if x = 0 for lx > I and E{x} = 0.31. It is observed that the duration of the telephone calls is a number x between I and 5 minutes and its mean is 3 min 37 sec. Find its ME density. It is known that the range of an RV x is the interval (8, 10). Find its ME density if "'.r = 9 and u. = I. The density /(x) of an RV xis such that J",f(x)dx 12-7 ll-8 =I J",f(x) cosxd:c = 0.5 Find the ME form of/(x). The number x of daily car accidents in a city does not exceed 30, and its mean equals 3. Find the ME values of the probabilities P{x = lc} = P•. We are given a die with P{even} = .5 and are told that the mean of the number x of faces up equals 4. Find the ME values of p; = P{x = i}. 436 CHAP. 12 ENTROPY Suppose that x is an av with entropy H(x) and y = 3x. Express the entropy H(y) of yin terms of H(x). (a) If x is of discrete type, (b) if x is of continuous type. 
12-10 Show that if c(x, y) is a positive function of unit volume and x, yare two avs with joint density /(x, y), then E{ln /(s, y)} s - E{ln c(x, y)} 12-9 12-11 Show that if the expected values 8tt = E{gtt(x, y)} of them functions gtt(x, y) of the avs x and y are known, then their ME density equals f(x, y) = 'Y exp{-A 1g 1(x, y) - • • • - A,.g,.(x, y)} 12-12 Find the ME density of the avs x andy if E{x2} = 4, E{y2} = 4, and E{xy} = 3. 12-13 Show that if the avs z and ware jointly normal as in (5-100), then H(z, w) =In (211'eup,..~). 12-14 (a) The avs x andy are N(O, 2) and N(O, 3), respectively. Find the maximum of their joint entropy H(x, y). (b) The joint entropy of the avs x and y is maximum subject to the constraints E{x2} = 4 and E{y2} = 9. Show that these avs are normal and independent. 12-15 Suppose that x1 = x + y, y 1 = x- y. Show that if the avs x andy are ofthe discrete type, then H(x 1, y,) = H(x, y), and if they are of the continuous type, then H(x1, y 1) = H(s, y) + In 2. 12-16 The joint entropy of n avs X; is by definition H(x., . ·· .• x,.) = -E{ln /(x., . . . • x,.)}. Show that if the avs X; are the samples of x, then H(x1 , . . . , x,.) = nH(x). 12-17 In the experiment of two fair dice, A is a partition consisting of the events .!1 1 = {seven}, .!12 = {eleven}, and .!13 = .!1 1 U .!12• (a) Find its entropy. (b) The dice were rolled 100 times. Find the number of typical and atypical sequences formed with the events .54., .542 , and .543• 12-18 (Coding and entropy). We wish to transmit pictures consisting of rectangular arrays of two-level spots through a binary channel. If we identify the black spots with 0 and the white spots with I, the required time is T seconds. We are told that 83% of the spots are black and 17% are white. Show that by proper coding of the spots, the time of transmission can be reduced to 0.65 T seconds. Tables In the following tables. we list the standard normal distribution G(z) and the u-percentiles z, XZ(n) t,(n) Fu<m. n) of the standard normal, the chi-square. the Student t, and the Snedecor F distributions. The u-percentile x,. of a distribution F(x) is the value x, of x such that (Fig. T .I) u = F(x,.) = f:J<x)dx Thusx, is the inverse ofthe function u = F(x).lf/(x) is even, F(-x) = 1F(x) and x 1-u = -x,. It suffices, therefore, to list F(x) for x;;:: 0 only and x, for u ::: .5 only. In Table Ia, we list the normal distribution G(z) for 0 s z s 3. For z > 3, we can use the approximation G(z) = 1 - ~ e·;:~ ZV21T In Table lb, we list the z,-percentiles of the N(O, 1) distribution G(z). The x,-percentile of the N(TJ, u) distribution G(x ~ 71 ) is x, = 71 + z,u. 437 438 TABLES u = F(x,) x, Xt-u = -x, f(x) F1pre T.l In Table 2, we list the x~(n) percentiles. This is a number depending on u and on the parameter n. For large n, the x 2(n) distribution Fx(x) ap- proaches a normal distribution with mean nand variance 2n. Hence, u = Flr(X~) = X~(n) = o(X&fin) n + z, V2fi The following is a better approximation x~(n) = 2I (z, + V2n- 1)2 In Table 3, we list the t,(n)-percentiles. For large n, the t(n) distribution F,(x) approaches a normal distribution with mean zero and variance nl (n - 2). Hence, u = F,(r,> = o(V 1 " nl(n - 2) ) t,(n) = z,.Jn ~2 The F,(m, n) percentiles depend on the two parameters m and nand are determined in terms of their values for u ~ .S because F,(m, n) = 11 F 1 _,(n, m). They are listed in Table 4 for u = .9S and u = .99. Note that 2 1 ~ F2u-tO, n) = t,(n) and F,(m, n) = - x;;(m) form~ I m TABI.ES Table la G(x) G(x) 439 I- J·'_,. 
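If a numerical package is available, the tabulated percentiles and the approximations above can be reproduced directly. The sketch below assumes the scipy library is installed (its ppf functions invert the listed distributions); the particular values u = .95, n = 16, and m = 5 are arbitrary choices for illustration.

```python
# Reproducing the tabulated percentiles numerically (assumes scipy is available).
from scipy.stats import norm, chi2, t, f
import math

u, n, m = 0.95, 16, 5

z_u = norm.ppf(u)                         # z_.95 ~ 1.645, Table 1b
print("z_u       =", round(z_u, 3))
print("chi2_u(n) =", round(chi2.ppf(u, n), 2),
      " approx 0.5*(z_u + sqrt(2n-1))^2 =", round(0.5 * (z_u + math.sqrt(2 * n - 1)) ** 2, 2))
print("t_u(n)    =", round(t.ppf(u, n), 3),
      " approx z_u*sqrt(n/(n-2)) =", round(z_u * math.sqrt(n / (n - 2)), 3))
print("F_u(m, n) =", round(f.ppf(u, m, n), 2),
      " check 1/F_{1-u}(n, m) =", round(1.0 / f.ppf(1 - u, n, m), 2))
```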
e-·····-dv ·~ . =\1'2; - X G(x) X G(x) X G(x) X G(x) 0.05 0.10 0.15 .51944 .53983 .55962 .93943 .94520 .95053 .57926 .59871 .61791 .63683 .65542 .67364 .69146 .70884 .72575 .74215 .75804 .77337 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95 3.00 .98928 .99061 .99180 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 .78814 .80234 .81594 .82894 .84134 .85314 .86433 .87493 .88493 .89435 .90320 .91149 .91924 .92647 .93319 1.55 1.60 1.65 0.20 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.70 .95543 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20 2.25 .95994 .96407 .96784 .97128 .97441 .97726 .97982 .98214 .98422 .98610 .9877R .99286 .99379 .99461 .99534 .99597 .99653 .99702 .99744 .99781 .99813 .99841 .99865 Figure T.Z I X v"fi I ·· ,· e· _,.2 1• -tly ..;5.,, 0.3 0.2 0.1 0 0.5 1.0 1.5 Table lb z, Lu I z, .90 t.282 2.0 :!.5 u I I .925 t.440 .95 1.645 .975 t.967 3.0 X I:. . , I = V27T .,. e __,... _dy I .99 I .995 I .999 I .9995 I I 2.326 I 2.576 I 3.090 I 3.291 I 440 TABLES Table 2 l'• n 1 1( ) !,;\ 1 2 3 4 ( 95) .1'4. .005 .01 .025 .05 .1 0.00 0.10 0.35 0.71 1.15 0.02 0.21 0.58 1.06 1.61 0.00 0.02 0.11 0.30 5 0.00 0.01 0.07 0.21 0.41 0.55 0.00 0.05 0.22 0.48 0.83 6 7 8 9 10 0.68 0.99 1.34 1.73 2.16 0.87 1.24 1.65 2.09 2.56 1.24 1.69 2.18 2.70 3.25 1.64 2.17 2.73 3.33 3.94 2.20 2.83 3.49 4.17 4.87 11 12 13 14 3.05 3.57 4.11 4.66 5.23 3.82 4.40 5.01 5.63 6.26 4.57 5.23 5.89 6.57 7.26 5.58 6.30 7.04 7.79 15 2.60 3.07 3.57 4.07 4.60 8.55 16 17 18 19 20 5.14 5.70 6.26 6.84 7.43 5.81 6.41 7.01 7.63 8.26 6.91 7.56 8.23 8.91 7.96 8.67 9.39 10.12 10.85 22 8.6 24 9.9 26 11.2 28 12.5 30 13.8 40 20.7 50 28.0 9.5 9.59 .9 .95 .995 6.63 9.21 11.34 13.28 15.09 7.88 10.60 12.84 14.86 16.75 10.64 12.02 13.36 14.68 15.99 12.59 14.45 14.07 16.01 15.51 17.53 16.92 19.02 18.31 20.48 16.81 18.48 20.09 21.67 23.21 18.55 20.28 21.96 23.59 25.19 17.28 18.55 19.81 21.06 22.31 19.68 21.03 22.36 23.68 25.00 21.92 23.34 24.74 26.12 27.49 24.73 26.22 27.69 29.14 30.58 26.76 28.30 29.82 31.32 32.80 9.31 10.09 10.86 11.65 12.44 23.54 26.30 28.85 24.77 27.59 30.19 25.99 28.87 31.53 27.20 30.14 32.85 28.41 31.41 34.17 32.00 33.41 34.81 36.19 37.57 34.27 35.72 37.16 38.58 40.00 47.0 40.3 43.0 45.6 48.3 50.9 42.8 45.6 48.3 51.0 53.7 59.3 71.4 63.7 76.2 19.5 11.0 12.4 13.8 15.3 16.8 12.3 13.8 15.4 16.9 18.5 14.0 15.7 17.3 18.9 20.6 30.8 33.2 35.6 37.9 40.3 22.2 29.7 24.4 32.4 26.5 29.1 37.7 51.8 55.8 34.8 63.2 67.5 ~50: .99 2.71 3.84 5.02 4.61 5.99 7.38 6.25 7.81 9.35 7.78 9.49 11.14 9.24 11.07 12.83 10.9 12.2 13.6 15.0 For n .915 =9.49 33.9 36.4 38.9 41.3 43.8 36.8 39.4 41.9 44.5 66.8 TABLES Table 3 ~ =2.72 t,99(11) t.(n) .9 .95 .975 .99 .995 2 3 4 5 1.89 1.64 1.53 1.48 6.31 2.92 2.35 2.13 2.02 12.7 4.30 3.18 2.78 2.57 31.8 6.97 4.54 3.75 3.37 63.7 9.93 5.84 4.60 4.03 6 7 8 9 10 1.44 1.42 1.40 1.38 1.37 1.94 1.90 1.86 1.83 1.81 2.45 2.37 2.31 2.26 2.23 3.14 3.00 2.90 2.82 2.76 3.71 3.50 3.36 3.25 3.17 II 12 13 14 15 1.36 1.36 1.35 1.35 1.34 1.80 1.78 1.77 1.76 1.75 2.20 2.18 2.16 2.15 2.13 2.72 2.68 2.65 2.62 2.60 3.11 3.06 3.01 2.98 2.95 16 17 18 19 20 1.34 1.33 1.33 1.33 1.33 1.75 1.74 1.73 1.73 1.73 2.12 2.11 2.10 2.09 2.09 2.58 2.57 2.55 2.54 2.53 2.92 2.90 2.88 2.86 2.85 22 24 26 28 30 1.32 1.32 1.32 1.31 1.31 1.72 1.71 1.71 1.70 1.70 2.07 2.06 2.06 2.05 2.05 2.51 2.49 2.48 2.47 2.46 2.82 2.80 2.78 2.76 2.75 I 3.08 For n ~ 30: t,.(n) rn - 2 = z,. \j n 441 442 TABLES Table 4a F 95(5, 8) = 3.69 F95(m, n) ,. 
m 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 30 40 50 60 70 I s 6 7 8 9 10 12 14 16 18 20 30 40 so 60 70 3 4 5 6 8 10 20 30 40 200 161 225 230 242 216 234 239 248 251 250 29.0 18.5 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.5 19.5 9.28 9.12 8.94 8.85 10.1 9.55 9.01 8.79 8.66 8.62 8.59 7.71 6.94 6.39 6.26 6.59 6.16 6.04 5.96 5.80 5.15 5.12 6.61 5.19 5.41 5.19 5.05 4.82 4.74 4.95 4.56 4.50 4.46 5.99 5.14 4.28 4.15 4.76 4.53 4.39 4.06 3.87 3.81 3.77 4.74 3.87 3.73 5.59 4.35 4.12 3.97 3.64 3.44 3.38 3.34 5.32 4.46 3.58 3.44 3.35 3.15 4.07 3.84 3.69 3.08 3.04 5.12 4.26 3.63 3.48 3.14 3.86 3.37 3.23 2.94 2.86 2.83 4.96 4.10 3.07 2.98 2.77 3.71 3.48 3.33 3.22 2.70 2.66 4.15 3.89 3.00 2.85 2.15 2.54 3.49 3.26 3.11 2.47 2.43 3.74 4.60 2.85 3.11 2.96 2.70 3.34 2.60 2.39 2.31 2.27 4.49 3.63 2.74 3.01 2.85 2.59 2.49 3.24 2.28 2.19 2.15 3.55 2.93 2.77 4.41 2.41 3.16 2.66 2.51 2.19 2.11 2.06 4.35 3.49 2.60 3.10 2.87 2.71 2.45 2.35 2.12 2.04 1.99 4.17 3.32 2.42 2.27 2.16 2.92 2.69 2.53 1.93 1.84 1.79 2.18 4.08 2.45 2.61 3.23 2.34 2.84 2.08 1.84 1.74 1.69 4.03 3.18 2.79 2.56 2.40 2.03 2.29 2.13 1.78 1.69 1.63 4.00 2.53 2.37 3.15 2.76 2.25 2.10 1.99 1.75 1.65 1.59 3.13 2.23 3.98 2.50 2.35 2.07 1.97 2.74 1.72 1.62 1.57 Table 4b F,(m, n) I 2 3 4 2 F.,(5, 8) = 6.63 4,052 4.999 5,403 5,625 5.764 5,859 5,982 6.056 6.209 6.261 6.287 99.0 99.3 98.5 99.2 99.3 99.3 99.4 99.4 99.4 99.5 99.5 30.8 34.1 28.7 28.2 27.9 27.5 27.2 29.5 26.S 26.4 26.7 18.0 21.2 16.0 IS.S 15.2 14.8 14.5 14.0 16.7 13.8 18.7 16.3 13.3 11.4 11.0 12.1 10.7 10.3 10.1 9.55 9.38 9.29 13.7 10.9 9.78 9.15 8.75 7.87 8.47 8.10 7.40 7.23 7.14 7.46 12.2 9.55 8.45 7.85 7.19 6.84 6.62 6.16 5.91 5.99 8.65 7.01 6.63 7.59 6.37 11.3 6.03 5.81 5.36 5.20 5.12 10.6 8.02 6.99 6.42 6.06 5.80 5.47 4.81 4.57 5.26 4.65 10.0 7.56 5.99 5.64 5.06 6.55 5.39 4.85 4.41 4.25 4.17 9.33 5.41 5.06 4.50 6.93 5.95 3.86 3.70 3.62 4.30 4.82 8.86 5.04 4.70 4.46 4.14 6.51 3.94 S.S6 3.51 3.35 3.27 4.77 8.53 4.44 6.23 3.02 4.20 3.89 3.69 3.26 5.29 3.10 8.29 4.58 4.25 6.01 5.09 3.71 3.51 3.08 4.01 2.92 2.84 8.10 5.85 4.10 4.94 4.43 3.87 3.56 3.37 2.69 2.94 2.78 5.39 4.02 3.70 3.47 7.56 4.51 3.17 2.98 2.55 2.30 2.39 7.31 5.18 2.11 4.31 3.83 3.51 3.29 2.99 2.80 2.37 2.20 7.17 5.06 3.72 3.41 3.19 2.89 2.70 2.27 2.01 4.20 2.10 7.08 4.98 4.13 3.65 3.34 3.12 2.82 2.63 2.20 2.03 1.94 7.01 4.92 4.08 3.60 3.29 2.78 2.59 2.15 3.07 1.98 1.89 Answers and Hints for Selected Problems CHAPTER3 CHAPTER2 2-2 {4 S I S {I < 10}, {7 S < I} I S 8}. {1 < 4}, 4} U {8 2-3 (1°) =- 120 3 2-S (a) :4 U ~ u 't: (f) (s4 n 3-5 .6, .8, .4, .3 2-, (~~)1(~~) 2-11 (9018) (10)/('oo· 2 20) == ·318 3-6 3·7 3-10 3-12 3-13 3-14 3-15 3-16 =- .0036 2-12 P(f!/l) = .4 P1 = 20/30; p~ = 811 I P(:4) "" I == 2.994 X 10 ·1 ; 2" - (n ... I) p = .607 PI 3-2 7 p -= 12M n '{ 2-6 2-16 2-18 2-20 2-22 2-24 PI 3-3 00) p~ =2 X 10 = .031, 3-1 6 3-l,t' 3-19 p~ - .016 .5, p~ - .322 46. 656. 7,776: PC.7lJ-= .167. PCM =- .842 PI == .HIS. p~ -= .086. p~ == .86 PI =- .201 p ... . 164 n = 9604 .924 (a) .919: (b) .931 p = .06 (a) PI = .29757, P2 -= .26932: (b) PI -= .29754. p~ = .26938 (a) p -= .011869; (b) p :. .011857 p = .744 PI -'- 443 444 ANSWERS AND HINTS FOR SELECTED PROBLEMS 711 CHAPTER4 4-2 4-3 4-5 4-8 4-9 .003, .838,.683, .683 54.3% x.9~ = 5.624 (a) 6. 7%; (b) 2.2 months Differentiate the identity,. f~ e-~· dy = 1 e-•·" with respect to c. 
4-10 (a) .04; (b) I.S x 10" kw 4-11 (a) 4.6%; (b) 20% 4-11 (a) geometric; (b) L" o.s X .95* = .9Sl0-k k-10 (b) .186 4-15 (a) .OS; 4-17 Uniform in the interval -9 < 4-19 (a) 4-23 4-24 4-26 4-19 3 x<3 ~/,(V,) ~ [.t;({ry) + .t;<- {ry)) (b) 4 (~) 1/.(y) .,. /,( -y)]U(y); (d) /.(y)U(y); P{y = 0} = f,(O) Uniform in the interval (3, 5) a = ::0.5, b = + 2.5 Elx} = 6 E{P} = 12.1 x 1.004 watts 4-30 (a) .t;.(y) = { I 3~ 729 < y < 1.331 0 otherwise 71.• = 1,000, <T~ = 1/3; (~) £{y} = 1,010 Use the approximation g(x)"" R(71) + (x - 71)g'C71). 5·9 u~ = u 2; = 2u4 !n L (X;+ Y;) =! L Xt +! ~ Yt; • n n~ ~ L X;Y; (~ ~ Xt)(~ L )';) Show that a 1 = ru,lu", a2 = ru,fuy. 11r = o, u). = 8 vTI; a = o, b = 12 :/: 5·13 5-14 u,.. = 11.7 watts 5-16 71 •. = 220 watts; 2/(c: 2 - s 2), <l»(s) = c 71" = 0, 5-18 u, = V1.1c 5-19 Apply (5-73) to the moment function (5-103). 5-ll /(x): isosceles triangle; base 0 ~ z ~ 2: /,(s): isosceles triangle: base -I ~ w ~ 1: /,(s): right triangle; base 0 ~ s ~ I. S-23 (a) isosceles triangle with base 1.800 ~ R ~ 2.200; (b) p=.l25 S-14 (a) isosceles triangle with base -20 ~ z ~ 20: (b) Pz - P{- 5 s z s 4} 5-28 (a) Use (5-92): (b) /.(z) = c:2 5·19 J.<z> J,: !.<z - y)f,(y) dy J~ e·.-oertz-al da Z >0 = { ;. c2 fo ('·co~t:· ol da z<0 S-30 Set z = x/y: show that m2 - m4 =- F:CO): use (5-107). (b) 4-31 CHAPTERS 5-1 5-4 NCO. 112), N(O, 112): (~) .163 'Y=41Tr; Use the inequalities F"_,.(x, .v> (a) (b) F,y(x. 5-5 (a) y) ~ F_.<.v>. Use the identities y) = /,Cx ).t;.(y) = f(y. x); Note that £{yl} = £{xlfz3}. Show that E{(x - y)2} -= 0. /Cx. (b) S-6 ~ f~,Cx). CHAPTER6 6-1 6-1 6-3 6-4 6-5 6-6 6-7 6-9 6-10 (a) PI = .55: (b) P2 = .82 (a) 32.6%; (b) 32.8 years. /.(ZIW = 5): /\'(2.5, 3 V2l: £{zlw = 5} = 2.5 PI = .2863, P2 = .lOSS cp(x) = x - x2f! (a) ip(.r) = ..t·; (b) E{ly - cp(x)x} = 0 Show that E{xy} = E{u}. It follows from 16-21) and (6-54): 'Y = 513 (a) P{x > t} =- .2St> .,., - .1St> "': (b) .355 6-11 R(x) = ( I - c·.t)ll•t>·' ANSWERS AND HINTS FOR SELECTED PROBLEMS 445 7-26 Use (7-89) and Problem 7-25. 7-28 Note that the RV 2x is F(4, 6). 7-30 (a) Use (7-38) and (7-97); (b) Show that the RV wis normal with Tlw = Tlx - 'l'h. = u 2(1/n + 1/m) and the !lum n-1 m-1 --::r ~ + -::r ~ is )(2(n + m - 2) ui. CHAPTER7 7-1 CT Note that E{(c 1x 1 • • • • - c,x.)2} ~ 0 for = A, then [see (7A-3)] 1J2 = TAT*TAT* = TA 2T" = TAT* = D; hence, 7-31 If A2 any c;. 7-5 Note that y 7-7 _ ( =X;- X= X; >..1 1 ". I - - +- ~ x•. 1) n CT = A.; 7-32 Show that n ;••·• E{e"ln} = E{es<a,·····a·1} = «<>;(s) E{e"} = E{E{en:n}} = E{CI>:(s)} X = 7-8 (a) (b) 2 P•«<>~(s) k•l CHAPTER9 Use (5-63) and Problem 7-7 with «<>t(s) = cl(s + c); E{z} = E{E{z,n}} = E{nlc} = Nplc = 200 7-10 (a) Use Problem 5-8: (b) Show that E{(x; 7-13 Note that y 4 < 7-14 7-15 7-16 7-17 7-19 7-20 7-22 7-23 X;- 1) 2} = 2u 2• x.~ iff x; s x~ for i ~ k, that is, iff the event {x :$ x.~} occurs at least k times. With n = 5 and k = 3, (7-47) yieldsf,(t) = 3 X 10-3(1 - 55)~(65 - t)2 p = 2G(VJ) - I = .916 Note that E{xn = u 2 • E{x1} = 3u4 • (a) Note that x, = me iff heads shows k times; (b) Use {3-27); (c) P{xso > 6c} == G(6!7) == .82 Use the CLT and Problem 4-32. Apply the CLT to the RVs In X; and use Problem 4-32. Note that J;.(y) = 2yx2<n. y2), Use (7-97) and Problem 7-22. = 0.115 em; (b) n = 16 202 < ., < 204: (b) 2.235 em 9-3 21.400 < ., < 28,600 9-4 n=l 9-S (a) a = 25161, b = 36/61; (b) 12.47 ± 0.26 9-7 c = 413 g 9-8 29.77 < 8 < 30.23 9-9 a = 0.8. b = 4 9-10 If w = x - i. then u~ = (I .,. 
CHAPTER 9
9-1  (a) 202 < η < 204; (b) 2.235 cm
9-2  (a) c = 0.115 cm; (b) n = 16
9-3  21,400 < η < 28,600
9-5  (a) a = 25/61, b = 36/61; (b) 12.47 ± 0.26
9-8  29.77 < θ < 30.23
9-9  a = 0.8, b = 4
9-10  If w = x − x̄, then σ_w² = (1 + 1/20)σ²; c = 20.5
9-12  0.076 < η < 0.146
9-13  12.255 < λ < 13.245
9-14  .50 < p < .54
9-15  (a) 3.2%; (b) γ = .78
9-17  n = 2,500
9-18  p = .567
9-21  −5.17 < η₁ − η₂ < −2.83
9-22  87.7 < η < 92.3; 3.44 < σ < 9.13
9-25  .308 < r < .512
9-27  (a) Note that Σ(aᵢ − zbᵢ)² ≥ 0 for all z; (b) set xᵢ − x̄ = aᵢ, yᵢ − ȳ = bᵢ.
9-31  δ = .1, c = .02; n = 3,745
9-32  c = 1.29
9-33  c = 1/(x̄ − x₀)
9-37  (a) I = 1/θ²; (b) use Problem 7-24.
9-38  Note that the statistic w = y − az is unbiased, with σ_w² = σ_y² − 2aσ_yz + a²σ_z².
9-44  Show that f(y, z) = 2c²e^(−c(y+z)) for 0 < z < y.
9-45  Use Problem 9-38.

CHAPTER 10
10-1  (a) UCL = 93; (b) p = .79
10-10  Under hypothesis H₁, the RV q·σ₀²/σ₁² is χ²(n); hence P{q ≤ c | H₁} = P{(σ₀²/σ₁²)q ≤ (σ₀²/σ₁²)c | H₁}.
10-12  c = .136 > q = .1; accept H₀
10-13  k₁ = 31 < k = 36 < k₂ = 41; accept H₀
10-14  (a) Accept H₀ iff 30 − 0.4/√n < x̄ < 30 + 0.4/√n; β(30.1) = G(2 − 0.5√n) − G(−2 − 0.5√n); (b) 1 − β(30.1) = 0.9; find n.
10-15  q = 9.76 < χ².95(5) = 11.1; accept H₀
10-16  q = 2.36 < χ².95(2) = 5.99; yes
10-17  q = 17.76 > χ².95(9) = 16.92; reject H₀
10-25  Reject H₀ if x̄ > 1.384; β = .40
10-26  (a) f_r(r, θ₀) dr = P{r < r ≤ r + dr | H₀} = f(X, θ₀) dV; f_r(r, θ₁) dr = P{r < r ≤ r + dr | H₁} = f(X, θ₁) dV
10-27  Note that the RVs 2θx and 2θq are χ²(4) and χ²(4n), respectively.
10-33  Show that the RVs x_ik are i.i.d. with joint density f(X) ~ exp{−Q/2σ²}, where Q = Q₁ + Q₂ as in Problem 10-22.
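Several of the Chapter 10 answers above come down to comparing a chi-square statistic q with a percentile, for example q = 2.36 against χ².95(2) = 5.99. The sketch below is not from the book: it assumes Python with SciPy, and the function name chi_square_decision is illustrative only. It shows that comparison step, with scipy.stats.chi2.ppf supplying the percentile.

```python
# Sketch only (not from the text): the accept/reject comparison used in the
# chi-square answers above; SciPy supplies the chi-square percentile.
from scipy.stats import chi2

def chi_square_decision(q, k, alpha=0.05):
    """Compare the statistic q with the (1 - alpha) percentile of chi-square with k degrees of freedom."""
    c = chi2.ppf(1.0 - alpha, k)   # e.g. chi2.ppf(0.95, 2) ~ 5.99, chi2.ppf(0.95, 9) ~ 16.92
    return ("reject H0" if q > c else "accept H0"), round(c, 2)

print(chi_square_decision(2.36, 2))    # ('accept H0', 5.99)
print(chi_square_decision(17.76, 9))   # ('reject H0', 16.92)
```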
CHAPTER 11
11-3  Use (11-26).
11-6  (a) Maximize the sum Σ(xᵢ − α)² + Σ(yᵢ − β)² + Σ(zᵢ − γ)² subject to the constraint α + β + γ = π, and show that α̂ − x̄ = β̂ − ȳ = γ̂ − z̄ = [π − (x̄ + ȳ + z̄)]/3. (b) Maximize the sum (1/σ₁²)Σ(xᵢ − α)² + (1/σ₂²)Σ(yᵢ − β)² + (1/σ₃²)Σ(zᵢ − γ)² subject to the same constraint, and show that (α̂ − x̄)/σ₁² = (β̂ − ȳ)/σ₂² = (γ̂ − z̄)/σ₃² = [π − (x̄ + ȳ + z̄)]/(σ₁² + σ₂² + σ₃²).
11-8  (a) Use Lagrange multipliers: solve the n + 2 equations 2aᵢ = λ + μxᵢ, Σaᵢ = 1, Σaᵢxᵢ = 0 for the n + 2 unknowns aᵢ, λ, and μ; (b) proceeding similarly, solve the system 2βᵢ = λ + μxᵢ, Σβᵢ = 0, Σβᵢxᵢ = 1.
11-9  (a) Accept H₀ iff [see (11-52)] |b̂| < σ̂_b t₁₋α/₂(n − 2), where σ̂_b² = [1/(n − 2)] Σ[yᵢ − (â + b̂xᵢ)]² / Σ(xᵢ − x̄)²; (b) maximize the sum Σ wᵢ[yᵢ − (a + bxᵢ)]².
11-16  Note that the RVs y − ŷ and xᵢ are orthogonal; hence they are independent, and E{y − ŷ | x₁, …, xₙ} = E{y − ŷ} = 0.

CHAPTER 12
12-1  p₁ = p₃ = p₅ = .2, p₂ = p₄ = p₆ = .4/3
12-2  p₁ = .42, p₂ = .252, p₃ = .151, p₄ = .09, p₅ = .054, p₆ = .034
12-3  f(x) = 0.425e^(−x) for −1 < x < 1
12-5  f(x) = Ae^(−λ|x|) for |x| < π, and f(x) = 0 for |x| > π
12-6  pₖ = .25 × .75ᵏ, 0 ≤ k ≤ 30
12-7  p₁ = p₂ = .064, p₃ = p₄ = .138, p₅ = p₆ = .298
12-8  (a) H(y) = H(x); (b) H(y) = H(x) + ln 3
12-12  f(x, y) = (1/π√3) exp{−(2/3)(x² − xy + y²)}
12-13  Show that E{z²/σ_z² − 2r·zw/(σ_z σ_w) + w²/σ_w²} = 2(1 − r²).
12-14  H(x, y) = 3.734
12-17  (a) H(A) = 0.655; (b) 2.79 × 10²⁸ and 5.36 × 10⁴⁷

Index

A
Alternative hypothesis, 243
Analysis of variance, 360-69: ANOVA principle, 361; one-factor tests, 362; two-factor tests, 365; additivity, 366
Asymptotic theorems: central limit, 214-17; lognormal distribution, 231; DeMoivre-Laplace, 70, 76, 216; entropy, 433; law of large numbers, 74, 219; Poisson, 78
Auxiliary variables, 158, 198
Axioms, 10, 32: empirical interpretation, 10; infinite additivity, 33

B
Bayes' formulas, 50, 171, 174: empirical interpretation, 175
Bayesian theory, 246: controversy, 247; estimation, 171, 287-90; law of succession, 173
Bernoulli trials, 64, 70: DeMoivre-Laplace, 70; law of large numbers, 74, 219; rare events, 77
Bertrand paradox, 16
Best estimators, 274, 307: Rao-Cramér bound, 309
Beta distribution, 173
Bias, 2
Binomial distribution, 108, 212: large n, 108; mean, 154, 213; moment function, 154
Boole's inequality, 57
Buffon's needle, 141

C
Cartesian product, 24, 60, 64
Cauchy density, 107, 164, 167
Cauchy-Schwarz inequality, 319
Centered random variables, 146
Central limit theorem, 214-27: lattice type, 216; products, 231; lognormal distribution, 231; proof, 231; sufficient conditions, 215
Certain event, 7
Chain rule: densities, 201; probabilities, 48
Chapman-Kolmogoroff, 201
Characteristic functions, 154 (See also Moment generating functions)
Chi distribution, 232
Chi-square distribution, 106, 219-23: degree of freedom, 219; fundamental property, 221; moment function, 220; noncentral, 227; eccentricity, 227; quadratic forms, 221, 227
Chi-square tests, 349: contingency tables, 354; distributions, 357; incomplete null hypothesis, 352; independent events, 159; modified, 352
Circular symmetry, 142
Combinations, 26
Complete statistics, 314
Conditional distribution, 168-77, 200: chain rule, 201; empirical interpretation, 175; mean, 178, 201
Conditional failure rate, 188
Conditional probability, 45: chain rule, 48; empirical interpretation, 45; fundamental property, 48
Confidence: coefficient, 241, 274; interval, 241, 274; level, 274; limits, 274
Consistent estimators, 274
Contingency tables, 354
Convergence, 218
Convolution, 160, 211: theorem, 161, 211
Correlation coefficient, 145: empirical interpretation, 148; sample, 295
Countable, 22
Covariance, 145: empirical interpretation, 149; matrix, 199; nonnegative, 229; sample, 295, 318
Critical region, 244, 322
Cumulant, 166: generating function, 166
Cumulative distribution (See Distributions)
Curve fitting (See Least squares)

D
DeMoivre-Laplace theorem, 70, 72, 76, 216: correction, 74
DeMorgan law, 57
Density, 98, 136, 198: circular symmetry, 142; conditional, 169, 177, 201; empirical interpretation, 100; histogram, 100; marginal, 137, 201; point, 99; mass, 101; transformations, 112-17, 117-21, 155, 156, 198; auxiliary variable, 158 (See also Distribution)
Dispersion (See Variance)
Distribution, 88, 136, 198: computer simulation, 269; conditional, 168; Bayes' formulas, 174; empirical interpretation, 96, 175; marginal, 137; model formation, 101; fundamental note, 102; properties, 92
Distributions: beta, 173; binomial, 108; Cauchy, 107, 164, 167; chi, 232; Erlang, 106; exponential, 106; gamma, 105; geometric, 111; hypergeometric, 111; Laplace, 166; lognormal, 133, 231; Maxwell, 232; multinomial, 217; normal, 103, 163, 200; Pascal, 132; Poisson, 109; Rayleigh, 156, 167; Snedecor F, 224; Student t, 223; uniform, 105; Weibull, 190; zero-one, 94

E
Eccentricity, 227
Efficient estimator, 310
Elementary event, 7, 30
Elements, 19
Empirical interpretation, 18: axioms, 10; conditional probability, 45; density, 100; distribution, 96; events, 31; failure rate, 189; mean, 122; percentiles, 97
Empty set, 22, 29
Entropy, 248, 414-35: as expected value, 420; four interpretations, 434; maximum, 248, 423-30; properties, 418; of random variables, 420-21; in statistics, 422-30
Equally likely condition, 7, 14
Equally likely events, 38
Erlang distribution, 106
Estimation, 239, 273: Bayesian, 287; correlation coefficient, 295; covariance, 295; difference of means, 290; distribution, 298; Kolmogoroff estimate, 299; maximum likelihood, 302-6; mean, 275; moments, method of, 301; percentiles, 297; probabilities, 283-90; variance, 293
Estimation-prediction, 317
Estimators: best, 274; consistent, 274; most efficient, 310
Events, 7, 30: certain, 29; elementary, 30; equally likely, 38; impossible, 29; independent, 52, 56; mutually exclusive, 30
Expected value, 122 (See also Mean): linearity, 125, 145
Exponential: distribution, 106; mean, 127; type, 310

F
Failure rate, 188: expected, 189; empirical interpretation, 189
Fisher's statistic, 296
Fourier transform, 154
Fractile, 95
Franklin, J. M., 252

G
Galton's law, 183
Gamma distribution, 105: mean, 153; moment function, 153; moments, 153; variance, 153
Gamma function, 105: mean, 153
Gap test, 257
Gauss-Markoff theorem, 404
Gaussian (See Normal)
Geometric distribution, 111: in gap test, 257; mean, 127
198 trials, 60 Independent identically distributed (i.i.d.), 202 Infant mortality, 178 Information, 306, 308 Insufficient reason, principle of, 17 1 Jacobian, 158, 198 451 K Kolmogoroff. II Kolmogoroff-Smirnov test. 339 L Laplace distribution, 166 Laplace transform, 154 Lattice, type. 216 centr.tllimit theorem. 216 Law of large numbers, 74, 219 Law of succession, 173 Least squares. 388-411 curve fitting. 391-402 linear, 391 nonlinear. 396 perturbation. 400 prediction, 407-11 linear, 408 nonlinear. 410 orthogonality principle. 409, 411 statistical. 402-7 Gauss-markoff theorem, 404 maximum likelihood, 403 minimum variance. 404 regression line estimate. 405 weighted, 413 Lehmer, D. H .• 252, 253, 254n Likelihood function, 247, 302 Jog-likelihood, 303 Likelihood ratio test, 378-82 asymptotic form. 380 Line masses, 138 Linear regression (See Regression) Linearity, 125, 145 Lognormal distribution, 133 centred limit theorem, 231 Loss function, 186 M Marginal, density, 137, 201 Marginal. distribution, 137 Markoff's inequality, 131 452 INDEX Masses, density, 101 normal RVs, 167 point, 101 probability, 33, 138 Maximum entropy, method of, 248, 423-30 known mean, 423, 428 illustrations, 42S-28 atmospheric pressure, 426 Brandeis Die, 427 partition function, 429 Maximum likelihood, 302-6 asymptotic properties, 306 information, 306 Pearson's test statistic, 3S2 Maxwell distribution, 232 Mean, 122 approximate evaluation, 129, lSI conditional, 178 empirical interpretation, 122 linearity, 12S, 14S transformations, 124, 144 sample, 203, 222, 238 Measurements, minimum variance, 204 Median. 96, 186 Memoryless systems, 191 Mendel's theory. 350 Minimum variance estimates, 30716 complete statistics, 314 measurements, 204 Rao-Cram~r bound, 309 sufficiency, 312 Model, S formation, 6 specification, 36 from distributions, 101 Moment generating function, 1S2, IS4, 199 convolution theorem, 161, 211 independent RVs, ISS, 199 moment theorem, IS3, ISS Moments, lSI, 1S4 method of, 301 Monte Carlo method, 2S1, 267 ButTon's needle, 142 distributions, 269 multinomial, 272 Pearson's test statistic, 271 Most powerful tests, 323 Neyman-Pearson criterion, 370 Multinomial distribution. 217 computer generated, 272 Mutually exclusive, 30 N Neyman-Pearson, criterion. 370 sufficient statistics. 373 test statistic, 371 exponential type distributions. 373 Noncentral distribution, 227-29 eccentricity, 227 Normal curves, 70 area, 81 Normal distribution, 103, 163, 200, 439 (tub/e) conditional, 179 moment function, 1S2, 200 moments, lSI, 16S quadrant masses, 167 regression line, 179 Null hypothesis, 243 Null set, 22 0 Operating characteristic (OC) function, 322 Order statistics, 207-11 extremes, 209 range. 209 Orthogonality, 146 Orthogonality principle, 409 least square, 392 nonlinear, 18S, 411 Rao-Blackwell theorem, 185 Outcomes, 7 empirical interpretation, 31 equally likely, 36 p Paired samples, 290, 329 Parameter estimation (See Estimation) INDEX Partition, 24 Partition function, 429 Pascal, distribution, 132 Pearson's test statistic, 349 incomplete null hypothesis, 352 Percentile curve, 94 empirical interpretation, 97 Percentiles, 238, 439-442 (tables) Permutations, 25 Point density, 99 Point estimate. 274 Poisson distribution, 109 mean. 128 moment function. 152 Poisson. points, 79, 110 theorem, 78 Posterior. density. 173. 246. 287 probability, 51 Power of a test, 323 Prediction, 149, 181-86, 407-11 Primitive root, 254 Principle of maximum entropy, 248, 422 Prior. density. 173. 246. 288 probability. 51 Probability. 
the four interpretations. 9-17 Q_ ________ Quadrcltic forms, chi-square, 106, 219-23 Quality control, 342-48 Quantile, 95 Quetelet curve, 97 R Random interval, 274 Random numbers, 251-67 computer generation, 258-67 Random points, 79 Random process, 217 Random sums, 206 Random variables (Rvs): definition, 93 functions of, 112, 144, 198 453 Random walk. 231 Randomness, 9 tests of. 255 Range. 209 Rao-Blackwell theorem, 185 Rao-Cramcr bound, 309 Rare events, 77 Rutherford experiment, 359 Rayleigh distribution, 156, 167 Regression curve. 179, 202 (See also Prediction) Galton's law. 183 Regularity. 9 Rejection method, 261 Relative frequency. 4 (See also Empirical interpretation) Reliability. 186-94 conditional failure rclte, 188 state variable, 193 structure functions. 193 Repeated trials. 59-64 (See also Bernoulli trials) dual meaning, 59 Risk, 186 s________ Sample, r.mdom variable correlation coefficient, 295 mean, 203. 222, 238 observed, 239 variance, 222 Sampling, 202 paired, 290, 329 Schwarz's inequality, 147. 319 Sequences of r.mdom variables. 217 Sequential hypothesis testing. 37478 Sign test, 340 Significance level, 322 Snedecor F distribution, 224 noncentral, 229 percentiles, 225, 442 (tables) Spectral test, 257 Standard deviation, 126 Statistic. 274 complete. 314 sufficient, 312 test, 323 454 INDEX Step function U(t), 102 Structure function, 193 Student t distribution, 223 noncentral, 229 percentiles, 225, 441 (tab/e) Sufficient statistic, 312 System reliability, 186-94 T Tables, 437-42 TchebychetT's inequality, 130 in estimation, 203, 278 Markoff's inequality, 131 Test statistic, 323 Time-to-failure, 187 Total probability, 49, 170 Transformations of random variables, 112, 144, 198 Tree, probability, 63 Trials, 7 repeated, 59-64 Typical sequences, 249, 430-34 u Uncorrelatedness, 146 Uniform distribution, 105 variance, 127 v Variance, 125 approximate evaluation, 165 empirical interpretation, 149 sample, 222 Venn diagram, 20 Von Mises, 12, 253 w Weibul1 distribution, 190 z Zero-one random variable, 94 mean, 127