STOCHASTIC PROCESSES:
Theory for Applications

Draft

R. G. Gallager
February 2, 2012
Preface
This text has evolved over some 20 years, starting as lecture notes for two first-year graduate
subjects at M.I.T., namely, Discrete Stochastic Processes (6.262) and Random Processes,
Detection, and Estimation (6.432). The two sets of notes are closely related and have been
integrated into one text. Instructors and students can pick and choose the topics that meet
their needs, and a table of prerequisite topics is included to help in this choice.
These subjects originally had an application emphasis, the first on queueing and congestion
in data networks and the second on modulation and detection of signals in the presence of
noise. As the notes have evolved, it has become increasingly clear that the mathematical
development (with minor enhancements) is applicable to a much broader set of applications
in engineering, operations research, physics, biology, economics, finance, statistics, etc.
The field of stochastic processes is essentially a branch of probability theory, treating
probabilistic models that evolve in time. It is best viewed as a branch of mathematics, starting
with the axioms of probability and containing a rich and fascinating set of results following
from those axioms. Although the results are applicable to many applications, they are best
understood in terms of their mathematical structure and interrelationships.
Applying axiomatic probability results to a real-world area requires creating a probability
model for the given area. Mathematically precise results can then be derived within
the model and translated back to the real world. If the model fits the area sufficiently
well, real problems can be solved by analysis within the model. Models are almost always
simplified approximations of reality, however, so precise results within the model become
approximations in the real world.
Choosing an appropriate probability model is an essential part of this process. Sometimes
an application area will have customary choices of models, or at least structured ways of
selecting them. For example, there is a well developed taxonomy of queueing models. A
sound knowledge of the application area, combined with a sound knowledge of the behavior
of these queueing models, often lets one choose a suitable model for a given issue within
the application area. In other cases, one can start with a particularly simple model and
use the behavior of that model to gain insight about the application issue, and use this to
iteratively guide the selection of more general models.
An important aspect of choosing a probability model for a real-world area is that a
prospective choice depends heavily on prior understanding, at both an intuitive and mathematical
level, of results from the range of mathematical models that might be involved. This partly
explains the title of the text, Theory for Applications. The aim is to guide the reader
in both the mathematical and intuitive understanding necessary in developing and using
stochastic process models in studying application areas.
Application-oriented students often ask why it is important to understand axioms, theorems,
and proofs in mathematical models when the applied results are approximate anyway. One
answer is that we cannot reason with approximations if we don’t know how to reason with
the quantities being approximated.
Given the need for precision in the theory, however, why is an axiomatic approach needed?
I tried to avoid axiomatics in these notes for many years, simply stating and explaining the
major results. Engineering and science students learn to use calculus, linear algebra, and
undergraduate probability effectively without axioms or rigor. Why doesn't this work for
more advanced probability and stochastic processes?
Probability theory has more than its share of apparent paradoxes, and these show up in
very elementary arguments. Undergraduates are content with this, since they can postpone
these questions to later study. Graduate students, however, usually want a foundation that
provides understanding without paradoxes, and the axioms achieve this.
I have tried to avoid the concise and formal proofs of pure mathematics, and instead use
explanations that are longer but more intuitive while still being precise. This is partly to
help students with limited exposure to pure math, and partly because intuition is vital when
going back and forth between a mathematical model and a real-world problem. In doing
research, we grope toward results, and successful groping requires both a strong intuition
and precise reasoning.
The text neither uses nor develops measure theory. Measure theory is undoubtedly
important in understanding probability at a deep level, but most of the topics useful in many
applications can be understood without measure theory. I believe that the level of precision
here provides a good background for a later study of measure theory.
The text does require some background in probability at an undergraduate level. Chapter
1 presents this background material as review, but it is too concentrated and deep for
most students without prior background. Some exposure to linear algebra and analysis
(especially concrete topics like vectors, matrices, and limits) is helpful, but the text develops
the necessary results. The most important prerequisite is the mathematical maturity and
patience to couple precise reasoning with intuition.
The organization of the text, after the review in Chapter 1, is as follows: Chapters 2, 3,
and 4 treat three of the simplest and most important classes of stochastic processes, first
Poisson processes, next Gaussian processes, and finally finite-state Markov chains. These
are beautiful processes where almost everything is known, and they contribute insights,
examples, and initial approaches for almost all other processes. Chapter 5 then treats
renewal processes, which generalize Poisson processes and provide the foundation for the
rest of the text.
Chapters 6 and 7 use renewal theory to generalize Markov chains to countable state spaces
and continuous time. Chapters 8 and 10 then study decision making and estimation, which
in a sense gets us out of the world of theory and back to using the theory. Finally Chapter
9 treats random walks, large deviations, and martingales and illustrates many of their
applications.
Most results here are quite old and well established, so I have not made any effort to
attribute results to investigators. My treatment of the material is indebted to Bertsekas and
Tsitsiklis's book, Elementary Probability, [2], to Sheldon Ross's book, Stochastic Processes,
[18], and to William Feller's classic books, Probability Theory and its Applications, [7] and
[8].
Contents
1 INTRODUCTION AND REVIEW OF PROBABILITY 1
1.1 Probability models . . . . . . 1
1.1.1 The sample space of a probability model . . . . . . 3
1.1.2 Assigning probabilities for finite sample spaces . . . . . . 4
1.2 The axioms of probability theory . . . . . . 5
1.2.1 Axioms for events . . . . . . 6
1.2.2 Axioms of probability . . . . . . 7
1.3 Probability review . . . . . . 9
1.3.1 Conditional probabilities and statistical independence . . . . . . 9
1.3.2 Repeated idealized experiments . . . . . . 10
1.3.3 Random variables . . . . . . 11
1.3.4 Multiple random variables and conditional probabilities . . . . . . 13
1.3.5 Stochastic processes and the Bernoulli process . . . . . . 16
1.3.6 Expectation . . . . . . 20
1.3.7 Random variables as functions of other random variables . . . . . . 23
1.3.8 Conditional expectations . . . . . . 25
1.3.9 Typical values of rv's; mean and median . . . . . . 28
1.3.10 Indicator random variables . . . . . . 29
1.3.11 Moment generating functions and other transforms . . . . . . 30
1.4 Basic inequalities . . . . . . 32
1.4.1 The Markov inequality . . . . . . 32
1.4.2 The Chebyshev inequality . . . . . . 33
1.4.3 Chernoff bounds . . . . . . 33
1.5 The laws of large numbers . . . . . . 37
1.5.1 Weak law of large numbers with a finite variance . . . . . . 37
1.5.2 Relative frequency . . . . . . 40
1.5.3 The central limit theorem . . . . . . 40
1.5.4 Weak law with an infinite variance . . . . . . 47
1.5.5 Convergence of random variables . . . . . . 48
1.5.6 Convergence with probability 1 . . . . . . 51
1.6 Relation of probability models to the real world . . . . . . 54
1.6.1 Relative frequencies in a probability model . . . . . . 54
1.6.2 Relative frequencies in the real world . . . . . . 55
1.6.3 Statistical independence of real-world experiments . . . . . . 57
1.6.4 Limitations of relative frequencies . . . . . . 58
1.6.5 Subjective probability . . . . . . 59
1.7 Summary . . . . . . 59
1.8 Exercises . . . . . . 61
2 POISSON PROCESSES 75
2.1 Introduction . . . . . . 75
2.1.1 Arrival processes . . . . . . 75
2.2 Definition and properties of a Poisson process . . . . . . 77
2.2.1 Memoryless property . . . . . . 78
2.2.2 Probability density of S_n and joint density of S_1, ..., S_n . . . . . . 81
2.2.3 The PMF for N(t) . . . . . . 82
2.2.4 Alternate definitions of Poisson processes . . . . . . 84
2.2.5 The Poisson process as a limit of shrinking Bernoulli processes . . . . . . 85
2.3 Combining and splitting Poisson processes . . . . . . 88
2.3.1 Subdividing a Poisson process . . . . . . 89
2.3.2 Examples using independent Poisson processes . . . . . . 91
2.4 Non-homogeneous Poisson processes . . . . . . 92
2.5 Conditional arrival densities and order statistics . . . . . . 95
2.6 Summary . . . . . . 100
2.7 Exercises . . . . . . 100
3 GAUSSIAN RANDOM VECTORS AND PROCESSES 110
3.1 Introduction . . . . . . 110
3.2 Gaussian Random Variables . . . . . . 110
3.3 Gaussian Random Vectors . . . . . . 112
3.3.1 Generating functions of Gaussian random vectors . . . . . . 113
3.3.2 IID normalized Gaussian random vectors . . . . . . 114
3.3.3 Jointly Gaussian random vectors . . . . . . 115
3.3.4 Joint Probability Density for Gaussian n-rv's (special case) . . . . . . 117
3.4 Properties of Covariance Matrices . . . . . . 119
3.4.1 Symmetric matrices . . . . . . 120
3.4.2 Positive definite matrices and covariance matrices . . . . . . 121
3.4.3 Joint Probability Density for Gaussian n-rv's (general case) . . . . . . 123
3.4.4 Geometry and Principal Axes for Gaussian Densities . . . . . . 124
3.5 Conditional Probabilities . . . . . . 126
3.6 Gaussian processes . . . . . . 130
3.6.1 Stationarity and related concepts . . . . . . 132
3.6.2 Orthonormal expansions . . . . . . 133
3.6.3 Continuous-time Gaussian processes . . . . . . 136
3.6.4 The sinc function expansion . . . . . . 137
3.6.5 Filtered continuous-time stochastic processes . . . . . . 140
3.6.6 Properties of covariance functions . . . . . . 143
3.6.7 White Gaussian noise . . . . . . 144
3.6.8 The Wiener process / Brownian motion . . . . . . 146
3.7 Circularly-symmetric complex random vectors . . . . . . 149
3.7.1 Circular symmetry and complex Gaussian rv's . . . . . . 149
3.7.2 Pseudo-covariance and two examples . . . . . . 150
3.7.3 Covariance matrices of complex n-rv . . . . . . 152
3.7.4 Linear transformations of W ∼ CN(0, [I_ℓ]) . . . . . . 153
3.7.5 Linear transformations of Z ∼ CN(0, [K]) . . . . . . 154
3.7.6 The density of circularly-symmetric Gaussian n-rv's . . . . . . 155
3.7.7 Circularly-symmetric Gaussian processes . . . . . . 157
3.8 Summary . . . . . . 159
3.9 Exercises . . . . . . 160
4 FINITE-STATE MARKOV CHAINS 164
4.1 Introduction . . . . . . 164
4.2 Classification of states . . . . . . 166
4.3 The matrix representation . . . . . . 171
4.3.1 Steady state and [P^n] for large n . . . . . . 172
4.3.2 Steady state assuming [P] > 0 . . . . . . 174
4.3.3 Ergodic Markov chains . . . . . . 175
4.3.4 Ergodic Unichains . . . . . . 176
4.3.5 Arbitrary finite-state Markov chains . . . . . . 178
4.4 The eigenvalues and eigenvectors of stochastic matrices . . . . . . 179
4.4.1 Eigenvalues and eigenvectors for M = 2 states . . . . . . 180
4.4.2 Eigenvalues and eigenvectors for M > 2 states . . . . . . 181
4.5 Markov chains with rewards . . . . . . 183
4.5.1 Examples of Markov chains with rewards . . . . . . 184
4.5.2 The expected aggregate reward over multiple transitions . . . . . . 186
4.5.3 The expected aggregate reward with an additional final reward . . . . . . 189
4.6 Markov decision theory and dynamic programming . . . . . . 190
4.6.1 Dynamic programming algorithm . . . . . . 192
4.6.2 Optimal stationary policies . . . . . . 196
4.6.3 Policy improvement and the search for optimal stationary policies . . . . . . 198
4.7 Summary . . . . . . 202
4.8 Exercises . . . . . . 204
5 RENEWAL PROCESSES 218
5.1 Introduction . . . . . . 218
5.2 The strong law of large numbers and convergence WP1 . . . . . . 221
5.2.1 Convergence with probability 1 (WP1) . . . . . . 221
5.2.2 Strong law of large numbers (SLLN) . . . . . . 223
5.3 Strong law for renewal processes . . . . . . 224
5.4 Renewal-reward processes; time-averages . . . . . . 229
5.4.1 General renewal-reward processes . . . . . . 232
5.5 Random stopping trials . . . . . . 236
5.5.1 Wald's equality . . . . . . 238
5.5.2 Applying Wald's equality to m(t) = E[N(t)] . . . . . . 241
5.5.3 Stopping trials, embedded renewals, and G/G/1 queues . . . . . . 242
5.5.4 Little's theorem . . . . . . 245
5.5.5 Expected queueing time for an M/G/1 queue . . . . . . 248
5.6 Expected number of renewals . . . . . . 250
5.6.1 Laplace transform approach . . . . . . 252
5.6.2 The elementary renewal theorem . . . . . . 253
5.7 Renewal-reward processes; ensemble-averages . . . . . . 255
5.7.1 Age and duration for arithmetic processes . . . . . . 256
5.7.2 Joint age and duration: non-arithmetic case . . . . . . 260
5.7.3 Age Z(t) for finite t: non-arithmetic case . . . . . . 261
5.7.4 Age Z(t) as t → ∞: non-arithmetic case . . . . . . 264
5.7.5 Arbitrary renewal-reward functions: non-arithmetic case . . . . . . 266
5.8 Delayed renewal processes . . . . . . 268
5.8.1 Delayed renewal-reward processes . . . . . . 271
5.8.2 Transient behavior of delayed renewal processes . . . . . . 271
5.8.3 The equilibrium process . . . . . . 272
5.9 Summary . . . . . . 273
5.10 Exercises . . . . . . 274
6 COUNTABLE-STATE MARKOV CHAINS 290
6.1 Introduction and classification of states . . . . . . 290
6.1.1 Using renewal theory to classify and analyze Markov chains . . . . . . 293
6.2 Birth-death Markov chains . . . . . . 302
6.3 Reversible Markov chains . . . . . . 303
6.4 The M/M/1 sample-time Markov chain . . . . . . 307
6.5 Branching processes . . . . . . 310
6.6 Round-robin and processor sharing . . . . . . 312
6.7 Summary . . . . . . 318
6.8 Exercises . . . . . . 319
7 MARKOV PROCESSES WITH COUNTABLE STATE SPACES 324
7.1 Introduction . . . . . . 324
7.1.1 The sampled-time approximation to a Markov process . . . . . . 328
7.2 Steady-state behavior of irreducible Markov processes . . . . . . 329
7.2.1 Renewals on successive entries to a given state . . . . . . 331
7.2.2 The limiting fraction of time in each state . . . . . . 331
7.2.3 Finding {p_j(i); j ≥ 0} in terms of {π_j; j ≥ 0} . . . . . . 333
7.2.4 Solving for the steady-state process probabilities directly . . . . . . 335
7.2.5 The sampled-time approximation again . . . . . . 336
7.2.6 Pathological cases . . . . . . 336
7.3 The Kolmogorov differential equations . . . . . . 337
7.4 Uniformization . . . . . . 341
7.5 Birth-death processes . . . . . . 342
7.6 Reversibility for Markov processes . . . . . . 344
7.7 Jackson networks . . . . . . 350
7.7.1 Closed Jackson networks . . . . . . 356
7.8 Semi-Markov processes . . . . . . 357
7.8.1 Example: the M/G/1 queue . . . . . . 360
7.9 Summary . . . . . . 361
7.10 Exercises . . . . . . 364
8 Detection, decisions, and hypothesis testing 377
8.1 Decision criteria and the MAP criterion . . . . . . 378
8.2 Binary MAP detection . . . . . . 381
8.2.1 Sufficient statistics I . . . . . . 383
8.2.2 Binary detection with a one-dimensional observation . . . . . . 384
8.2.3 Binary MAP detection with vector observations . . . . . . 388
8.2.4 Sufficient statistics II . . . . . . 393
8.3 Binary detection with a minimum cost criterion . . . . . . 398
8.4 The error curve and the Neyman-Pearson rule . . . . . . 399
8.4.1 The Neyman-Pearson detection rule . . . . . . 404
8.4.2 The min-max detection rule . . . . . . 406
8.5 Finitely many hypotheses . . . . . . 406
8.5.1 Sufficient statistics with m ≥ 2 hypotheses . . . . . . 409
8.5.2 More general min-cost tests . . . . . . 411
8.6 Summary . . . . . . 412
8.7 Exercises . . . . . . 414
9 RANDOM WALKS, LARGE DEVIATIONS, AND MARTINGALES 421
9.1 Introduction . . . . . . 421
9.1.1 Simple random walks . . . . . . 422
9.1.2 Integer-valued random walks . . . . . . 423
9.1.3 Renewal processes as special cases of random walks . . . . . . 423
9.2 The queueing delay in a G/G/1 queue . . . . . . 423
9.3 Threshold crossing probabilities in random walks . . . . . . 427
9.3.1 The Chernoff bound . . . . . . 427
9.3.2 Tilted probabilities . . . . . . 428
9.3.3 Back to threshold crossings . . . . . . 430
9.4 Thresholds, stopping rules, and Wald's identity . . . . . . 432
9.4.1 Wald's identity for two thresholds . . . . . . 433
9.4.2 The relationship of Wald's identity to Wald's equality . . . . . . 434
9.4.3 Zero-mean simple random walks . . . . . . 435
9.4.4 Exponential bounds on the probability of threshold crossing . . . . . . 435
9.4.5 Binary hypothesis testing with IID observations . . . . . . 437
9.4.6 Sequential decisions for binary hypotheses . . . . . . 439
9.4.7 Joint distribution of crossing time and barrier . . . . . . 440
9.5 Martingales . . . . . . 441
9.5.1 Simple examples of martingales . . . . . . 442
9.5.2 Scaled branching processes . . . . . . 444
9.5.3 Partial isolation of past and future in martingales . . . . . . 444
9.6 Submartingales and supermartingales . . . . . . 445
9.7 Stopped processes and stopping trials . . . . . . 447
9.8 The Kolmogorov inequalities . . . . . . 450
9.8.1 The strong law of large numbers (SLLN) . . . . . . 452
9.8.2 The martingale convergence theorem . . . . . . 454
9.9 Markov modulated random walks . . . . . . 455
9.9.1 Generating functions for Markov random walks . . . . . . 457
9.9.2 Stopping trials for martingales relative to a process . . . . . . 458
9.9.3 Markov modulated random walks with thresholds . . . . . . 458
9.10 Summary . . . . . . 460
9.11 Exercises . . . . . . 462
10 Estimation 470
10.1 Introduction . . . . . . 470
10.1.1 The squared cost function . . . . . . 471
10.1.2 Other cost functions . . . . . . 472
10.2 MMSE Estimation for Gaussian Random vectors . . . . . . 473
10.2.1 Scalar recursive estimation . . . . . . 476
10.2.2 Scalar Kalman filter . . . . . . 478
10.3 Linear least squares error estimation . . . . . . 480
10.4 Filtered vector signal plus noise . . . . . . 482
10.4.1 Estimate of a single rv in IID vector noise . . . . . . 483
10.4.2 Estimate of a single rv in arbitrary vector noise . . . . . . 484
10.4.3 Vector recursive estimation . . . . . . 485
10.4.4 Vector Kalman filter . . . . . . 486
10.5 The vector space of rv's and the orthogonality principle . . . . . . 487
10.5.1 Complex rv's and inner products . . . . . . 492
10.6 MAP estimation and sufficient statistics . . . . . . 495
10.7 Exercises . . . . . . 498
Chapter 1
INTRODUCTION AND REVIEW
OF PROBABILITY
1.1 Probability models
Probability theory is a central field of mathematics, widely applicable to scientific,
technological, and human situations involving uncertainty. The most obvious applications are to
situations, such as games of chance, in which repeated trials of essentially the same procedure
lead to differing outcomes. For example, when we flip a coin, roll a die, pick a card
from a shuffled deck, or spin a ball onto a roulette wheel, the procedure is the same from
one trial to the next, but the outcome (heads (H) or tails (T) in the case of a coin, one to
six in the case of a die, etc.) varies from one trial to another in a seemingly random fashion.
For the case of flipping a coin, the outcome of the flip could be predicted from the initial
position, velocity, and angular momentum of the coin and from the nature of the surface
on which it lands. Thus, in one sense, a coin flip is deterministic rather than random
and the same can be said for the other examples above. When these initial conditions are
unspecified, however, as when playing these games, the outcome can again be viewed as
random in some intuitive sense.
Many scientific experiments are similar to games of chance in the sense that multiple trials
of apparently the same procedure lead to results that vary from one trial to another. In
some cases, this variation is due to slight variations in the experimental procedure, in some
it is due to noise, and in some, such as in quantum mechanics, the randomness is generally
believed to be fundamental. Similar situations occur in many types of systems, especially
those in which noise and random delays are important. Some of these systems, rather than
being repetitions of a common basic procedure, are systems that evolve over time while still
containing a sequence of underlying similar random occurrences.
This intuitive notion of randomness, as described above, is a very special kind of uncertainty.
Rather than involving a lack of understanding, it involves a type of uncertainty that can
lead to probabilistic models with precise results. As in any scientific field, the models might
or might not correspond to reality very well, but when they do correspond to reality, there
is the sense that the situation is completely understood, while still being random.
For example, we all feel that we understand flipping a coin or rolling a die, but still accept
randomness in each outcome. The theory of probability was initially developed particularly
to give precise and quantitative understanding to these types of situations. The remainder
of this section introduces this relationship between the precise view of probability theory
and the intuitive view as used in applications and everyday language.
After this introduction, the following sections of this chapter review probability theory as a
mathematical discipline, with a special emphasis on the laws of large numbers. In the final
section, we use the theory and the laws of large numbers to obtain a fuller understanding
of the relationship between theory and the real world.1
Probability theory, as a mathematical discipline, started to evolve in the 17th century
and was initially focused on games of chance. The importance of the theory grew rapidly,
particularly in the 20th century, and it now plays a central role in risk assessment, statistics,
data networks, operations research, information theory, control theory, theoretical computer
science, quantum theory, game theory, neurophysiology, and many other fields.
The core concept in probability theory is that of a probability model. Given the extent of
the theory, both in mathematics and in applications, the simplicity of probability models
is surprising. The first component of a probability model is a sample space, which is a set
whose elements are called sample points or outcomes. Probability models are particularly
simple in the special case where the sample space is finite,2 and we consider only this case
in the remainder of this section. The second component of a probability model is a class
of events, which can be considered for now simply as the class of all subsets of the sample
space. The third component is a probability measure, which can be regarded for now as
the assignment of a nonnegative number to each outcome, with the restriction that these
numbers must sum to one over the sample space. The probability of an event is the sum of
the probabilities of the outcomes comprising that event.
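For a finite sample space, these three components can be written out concretely. The following sketch (an illustration added here, not part of the text) builds the die model discussed below: the probability measure assigns a nonnegative number to each outcome, the numbers sum to one, and an event's probability is the sum over its outcomes.

```python
from fractions import Fraction

# Sample space: the six faces of a die.
sample_space = {1, 2, 3, 4, 5, 6}

# Probability measure: a nonnegative number for each outcome, summing to one.
p = {outcome: Fraction(1, 6) for outcome in sample_space}
assert sum(p.values()) == 1

def prob(event):
    """The probability of an event is the sum of the probabilities of its outcomes."""
    assert event <= sample_space   # an event is a subset of the sample space
    return sum(p[outcome] for outcome in event)

print(prob({1, 3, 5}))   # the event "odd result"; prints 1/2
```

Exact rational arithmetic (`Fraction`) is used so that the axiom "probabilities sum to one" can be checked exactly rather than to floating-point tolerance.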
These probability models play a dual role. In the first, the many known results about various
classes of models, and the many known relationships between models, constitute the essence
of probability theory. Thus one often studies a model not because of any relationship to the
real world, but simply because the model provides a building block or example useful for
the theory and thus ultimately useful for other models. In the other role, when probability
theory is applied to some game, experiment, or some other situation involving randomness,
a probability model is used to represent the experiment (in what follows, we refer to all of
these random situations as experiments).
For example, the standard probability model for rolling a die uses {1, 2, 3, 4, 5, 6} as the
sample space, with each possible outcome having probability 1/6. An odd result, i.e., the
subset {1, 3, 5}, is an example of an event in this sample space, and this event has probability
1/2. The correspondence between model and actual experiment seems straightforward here.
Both have the same set of outcomes and, given the symmetry between faces of the die, the
choice of equal probabilities seems natural. Closer inspection, however, reveals an important
difference between the model and the actual rolling of a die.

1 It would be appealing to show how probability theory evolved from real-world random situations, but
probability theory, like most mathematical theories, has evolved from complex interactions between
theoretical developments and initially over-simplified models of real situations. The successes and flaws of
such models lead to refinements of the models and the theory, which in turn suggest applications to totally
different fields.
2 A number of mathematical issues arise with infinite sample spaces, as discussed in the following section.
The model above corresponds to a single roll of a die, with a probability defined for each possible outcome. In a real-world experiment where a single die is rolled, one of the six faces, say face k, comes up, but there is no observable probability for k.

Our intuitive notion of rolling dice, however, involves an experiment with repeated rolls of a die (or single rolls of n different dice). With a total of n rolls, there are 6^n possible outcomes, one for each possible n-tuple of individual die outcomes. As reviewed in subsequent sections, the standard probability model for this repeated-roll experiment is to assign probability 6^{-n} to each possible n-tuple, which leads to a probability (n choose m)(1/6)^m (5/6)^{n-m} that the face k comes up on m of the n rolls, i.e., that the relative frequency of face k is m/n. The distribution of these relative frequencies is increasingly clustered around 1/6 as n is increased. Thus if a real-world experiment for tossing n dice is reasonably modeled by this probability model, the relative frequency of k should be³ close to 1/6. This relationship through relative frequencies in a repeated experiment helps overcome the non-observable nature of probabilities in the real world.
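The clustering of relative frequencies around 1/6 can be seen in a quick simulation (a sketch of my own, not from the text; the seed and sample sizes are arbitrary choices):

```python
import random

# Relative frequency of a given face in n simulated rolls of a fair die.
def relative_frequency(n, face=1, seed=0):
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) == face for _ in range(n)) / n

# The distribution of relative frequencies concentrates around
# 1/6 ≈ 0.1667 as n grows.
for n in (60, 6000, 600000):
    print(n, relative_frequency(n))
```

With a fixed seed the run is reproducible; the deviation from 1/6 shrinks roughly as 1/√n, in line with the binomial concentration described above.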
1.1.1 The sample space of a probability model
An outcome or sample point in a probability model corresponds to a complete result (with all detail specified) of the experiment being modeled. For example, a game of cards is often appropriately modeled by the arrangement of cards within a shuffled 52-card deck, thus giving rise to a set of 52! outcomes (incredibly detailed, but trivially simple in structure), even though the entire deck might not be played in one trial of the game. A poker hand with 4 aces is an event rather than an outcome in this model, since many arrangements of the cards can give rise to 4 aces in a given hand. The possible outcomes in a probability model (and in the experiment being modeled) are mutually exclusive and collectively constitute the entire sample space (space of possible outcomes). An outcome is often called a finest-grain result of the model in the sense that an outcome ω contains no subsets other than the empty set φ and the singleton subset {ω}. Thus events typically give only partial information about the result of the experiment, whereas an outcome fully specifies the result.
In choosing the sample space for a probability model of an experiment, we often omit details that appear irrelevant for the purpose at hand. Thus in modeling the set of outcomes for a coin toss as {H, T}, we ignore the type of coin, the initial velocity and angular momentum of the toss, etc. We also omit the rare possibility that the coin comes to rest on its edge. Sometimes, conversely, the sample space is enlarged beyond what is relevant in the interest of structural simplicity. An example is the above use of a shuffled deck of 52 cards.
The choice of the sample space in a probability model is similar to the choice of a mathematical model in any branch of science. That is, one simplifies the physical situation by eliminating detail of little apparent relevance. One often does this in an iterative way, using a very simple model to acquire initial understanding, and then successively choosing more detailed models based on the understanding from earlier models.

³ This connection is not as simple as it might sound, and is further developed in a number of places later.
The mathematical theory of probability views the sample space simply as an abstract set of elements, and from a strictly mathematical point of view, the idea of doing an experiment and getting an outcome is a distraction. For visualizing the correspondence between the theory and applications, however, it is better to view the abstract set of elements as the set of possible outcomes of an idealized experiment in which, when the idealized experiment is performed, one and only one of those outcomes occurs. The two views are mathematically identical, but it will be helpful to refer to the first view as a probability model and the second as an idealized experiment. In applied probability texts and technical articles, these idealized experiments, rather than real-world situations, are often the primary topic of discussion.⁴
1.1.2 Assigning probabilities for finite sample spaces
The word probability is widely used in everyday language, and most of us attach various intuitive meanings⁵ to the word. For example, everyone would agree that something virtually impossible should be assigned a probability close to 0 and something virtually certain should be assigned a probability close to 1. For these special cases, this provides a good rationale for choosing probabilities. The meanings of virtually and close to are slightly unclear at the moment, but if there is some implied limiting process, we would all agree that, in the limit, certainty and impossibility correspond to probabilities 1 and 0 respectively.

Between virtual impossibility and certainty, if one outcome appears to be closer to certainty than another, its probability should be correspondingly greater. This intuitive notion is imprecise and highly subjective; it provides little rationale for choosing numerical probabilities for different outcomes, and, even worse, little rationale justifying that probability models bear any precise relation to real-world situations.
Symmetry can often provide a better rationale for choosing probabilities. For example, the symmetry between H and T for a coin, or the symmetry between the six faces of a die, motivates assigning equal probabilities, 1/2 each for H and T and 1/6 each for the six faces of a die. This is reasonable and extremely useful, but there is no completely convincing reason for choosing probabilities based on symmetry.
Another approach is to perform the experiment many times and choose the probability of each outcome as the relative frequency of that outcome (i.e., the number of occurrences of that outcome divided by the total number of trials). Experience shows that the relative frequency of an outcome often approaches a limiting value with an increasing number of trials. Associating the probability of an outcome with that limiting relative frequency is certainly close to our intuition and also appears to provide a testable criterion between model and real world. This criterion is discussed in Sections 1.6.1 and 1.6.2 and provides a very concrete way to use probabilities, since it suggests that the randomness in a single trial tends to disappear in the aggregate of many trials. Other approaches to choosing probability models will be discussed later.

⁴ This is not intended as criticism, since we will see that there are good reasons to concentrate initially on such idealized experiments. However, readers should always be aware that modeling errors are the major cause of misleading results in applications of probability, and thus modeling must be seriously considered before using the results.

⁵ It is popular to try to define probability by likelihood, but this is unhelpful since the words are essentially synonyms.
1.2 The axioms of probability theory
As the applications of probability theory became increasingly varied and complex during the 20th century, the need arose to put the theory on a firm mathematical footing. This was accomplished by an axiomatization of the theory, successfully carried out by the great Russian mathematician A. N. Kolmogorov [15] in 1932. Before stating and explaining these axioms of probability theory, the following two examples explain why the simple approach of the last section, assigning a probability to each sample point, often fails with infinite sample spaces.
Example 1.2.1. Suppose we want to model the phase of a sine wave, where the phase is viewed as being "uniformly distributed" between 0 and 2π. If this phase is the only quantity of interest, it is reasonable to choose a sample space consisting of the set of real numbers between 0 and 2π. There are uncountably⁶ many possible phases between 0 and 2π, and with any reasonable interpretation of uniform distribution, one must conclude that each sample point has probability zero. Thus, the simple approach of the last section leads us to conclude that any event in this space with a finite or countably infinite set of sample points should have probability zero. That simple approach does not help in finding the probability, say, of the interval (0, π).

For this example, the appropriate view is the one taken in all elementary probability texts, namely to assign a probability density 1/(2π) to the phase. The probability of an event can then usually be found by integrating the density over that event. Useful as densities are,⁷ however, they do not lead to a general approach over arbitrary sample spaces.
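"Integrating the density over the event" can be illustrated numerically; the sketch below (my own, not from the text) uses a midpoint Riemann sum to recover the probability 1/2 for the interval (0, π) under the uniform phase density 1/(2π):

```python
from math import pi

# Approximate the probability of an interval (a, b) under a given density
# by a midpoint Riemann sum of the density over that interval.
def event_prob(density, a, b, steps=10000):
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

uniform_phase = lambda x: 1 / (2 * pi)    # uniform density on [0, 2*pi)
print(event_prob(uniform_phase, 0, pi))   # ≈ 0.5
```

The same `event_prob` helper works for any (Riemann-integrable) density, which is the point of the density viewpoint.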
Example 1.2.2. Consider an infinite sequence of coin tosses. The usual probability model is to assign probability 2^{-n} to each possible initial n-tuple of individual outcomes. Then in the limit n → ∞, the probability of any given sequence is 0. Again, expressing the probability of an event involving infinitely many tosses as a sum of individual sample-point probabilities does not work. The obvious approach (which we often adopt for this and similar situations) is to evaluate the probability of any given event as an appropriate limit, as n → ∞, of the outcome from the first n tosses.
⁶ A set is uncountably infinite if it is infinite and its members cannot be put into one-to-one correspondence with the positive integers. For example, the set of real numbers over some interval such as (0, 2π) is uncountably infinite. The Wikipedia article on countable sets provides a friendly introduction to the concepts of countability and uncountability.

⁷ It is possible to avoid the consideration of infinite sample spaces here by quantizing the possible phases. This is analogous to avoiding calculus by working only with discrete functions. Both usually result in both artificiality and added complexity.
We will later find a number of situations, even for this almost trivial example, where working with a finite number of elementary experiments and then going to the limit is very awkward. One example, to be discussed in detail later, is the strong law of large numbers (SLLN). This law looks directly at events consisting of infinite length sequences and is best considered in the context of the axioms to follow.
Although appropriate probability models can be generated for simple examples such as those above, there is a need for a consistent and general approach. In such an approach, rather than assigning probabilities to sample points, which are then used to assign probabilities to events, probabilities must be associated directly with events. The axioms to follow establish consistency requirements between the probabilities of different events. The axioms, and the corollaries derived from them, are consistent with one's intuition, and, for finite sample spaces, are consistent with our earlier approach. Dealing with the countable unions of events in the axioms will be unfamiliar to some students, but will soon become both familiar and consistent with intuition.
The strange part of the axioms comes from the fact that defining the class of events as the set of all subsets of the sample space is usually inappropriate when the sample space is uncountably infinite. What is needed is a class of events that is large enough that we can almost forget that some very strange subsets are excluded. This is accomplished by having two simple sets of axioms, one defining the class of events,⁸ and the other defining the relations between the probabilities assigned to these events. In this theory, all events have probabilities, but those truly weird subsets that are not events do not have probabilities. This will be discussed more after giving the axioms for events.
The axioms for events use the standard notation of set theory. Let Ω be the sample space, i.e., the set of all sample points for a given experiment. It is assumed throughout that Ω is nonempty. The events are subsets of the sample space. The union of n subsets (events) A_1, A_2, ..., A_n is denoted by either ∪_{i=1}^n A_i or A_1 ∪ ··· ∪ A_n, and consists of all points in at least one of A_1, A_2, ..., A_n. Similarly, the intersection of these subsets is denoted by either ∩_{i=1}^n A_i or⁹ A_1A_2 ··· A_n and consists of all points in all of A_1, A_2, ..., A_n.

A sequence of events is a collection of events in one-to-one correspondence with the positive integers, i.e., A_1, A_2, ..., ad infinitum. A countable union, ∪_{i=1}^∞ A_i, is the set of points in one or more of A_1, A_2, .... Similarly, a countable intersection ∩_{i=1}^∞ A_i is the set of points in all of A_1, A_2, .... Finally, the complement A^c of a subset (event) A is the set of points in Ω but not in A.
1.2.1 Axioms for events

Given a sample space Ω, the class of subsets of Ω that constitute the set of events satisfies the following axioms:

1. Ω is an event.

2. For every sequence of events A_1, A_2, ..., the union ∪_{n=1}^∞ A_n is an event.

3. For every event A, the complement A^c is an event.

⁸ A class of elements satisfying these axioms is called a σ-algebra or, less commonly, a σ-field.

⁹ Intersection is also sometimes denoted as A_1 ∩ ··· ∩ A_n, but is usually abbreviated as A_1A_2 ··· A_n.
There are a number of important corollaries of these axioms. First, the empty set φ is an event. This follows from Axioms 1 and 3, since φ = Ω^c. The empty set does not correspond to our intuition about events, but the theory would be extremely awkward if it were omitted. Second, every finite union of events is an event. This follows by expressing A_1 ∪ ··· ∪ A_n as ∪_{i=1}^∞ A_i where A_i = φ for all i > n. Third, every finite or countable intersection of events is an event. This follows from deMorgan's law,

    [∪_n A_n]^c = ∩_n A_n^c.

Although we will not make a big fuss about these axioms in the rest of the text, we will be careful to use only complements and countable unions and intersections in our analysis. Thus subsets that are not events will not arise.
Note that the axioms do not say that all subsets of Ω are events. In fact, there are many rather silly ways to define classes of events that obey the axioms. For example, the axioms are satisfied by choosing only the universal set Ω and the empty set φ to be events. We shall avoid such trivialities by assuming that for each sample point ω, the singleton subset {ω} is an event. For finite sample spaces, this assumption, plus the axioms above, implies that all subsets are events.
For uncountably infinite sample spaces, such as the sinusoidal phase above, this assumption, plus the axioms above, still leaves considerable freedom in choosing a class of events. As an example, the class of all subsets of Ω satisfies the axioms but surprisingly does not allow the probability axioms to be satisfied in any sensible way. How to choose an appropriate class of events requires an understanding of measure theory which would take us too far afield for our purposes.¹⁰ Thus we neither assume nor develop measure theory here.

From a pragmatic standpoint, we start with the class of events of interest, such as those required to define the random variables needed in the problem. That class is then extended so as to be closed under complementation and countable unions. Measure theory shows that this extension is possible.
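On a finite sample space this extension can be carried out directly. The sketch below (my own illustration, not from the text; the function name is hypothetical) closes a seed class of subsets under complementation and finite union, yielding a class satisfying the axioms for events:

```python
from itertools import combinations

# Close a seed collection of subsets of a finite sample space under
# complementation and (finite) union; the result satisfies the event axioms.
def extend_to_event_class(omega, seeds):
    omega = frozenset(omega)
    events = {omega} | {frozenset(s) for s in seeds}
    changed = True
    while changed:
        new = {omega - a for a in events}                    # complements
        new |= {a | b for a, b in combinations(events, 2)}   # pairwise unions
        changed = not new <= events
        events |= new
    return events

# Starting from the single event of interest {1} on Ω = {1, 2, 3, 4},
# the closure is the four-event class {φ, {1}, {2,3,4}, Ω}.
events = extend_to_event_class({1, 2, 3, 4}, [{1}])
print(sorted(tuple(sorted(e)) for e in events))
```

Since intersections are expressible through complements and unions (deMorgan), closing under these two operations suffices on a finite space.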
1.2.2 Axioms of probability
Given any sample space Ω and any class of events E satisfying the axioms of events, a probability rule is a function Pr{·} mapping each A ∈ E to a (finite¹¹) real number in such a way that the following three probability axioms¹² hold:

1. Pr{Ω} = 1.

2. For every event A, Pr{A} ≥ 0.

3. The probability of the union of any sequence A_1, A_2, ... of disjoint¹³ events is given by

    Pr{∪_{n=1}^∞ A_n} = Σ_{n=1}^∞ Pr{A_n},    (1.1)

where Σ_{n=1}^∞ Pr{A_n} is shorthand for lim_{m→∞} Σ_{n=1}^m Pr{A_n}.

¹⁰ There is no doubt that measure theory is useful in probability theory, and serious students of probability should certainly learn measure theory at some point. For application-oriented people, however, it seems advisable to acquire more insight and understanding of probability, at a graduate level, before concentrating on the abstractions and subtleties of measure theory.

¹¹ The word finite is redundant here, since the set of real numbers, by definition, does not include ±∞. The set of real numbers with ±∞ appended is called the extended set of real numbers.

¹² Sometimes finite additivity, (1.3), is added as an additional axiom. This addition is quite intuitive and avoids the technical and somewhat peculiar proofs given for (1.2) and (1.3).
The axioms imply the following useful corollaries:

    Pr{φ} = 0    (1.2)

    Pr{∪_{n=1}^m A_n} = Σ_{n=1}^m Pr{A_n}    for A_1, ..., A_m disjoint    (1.3)

    Pr{A^c} = 1 − Pr{A}    for all A    (1.4)

    Pr{A} ≤ Pr{B}    for all A ⊆ B    (1.5)

    Pr{A} ≤ 1    for all A    (1.6)

    Σ_n Pr{A_n} ≤ 1    for A_1, A_2, ... disjoint    (1.7)

    Pr{∪_{n=1}^∞ A_n} = lim_{m→∞} Pr{∪_{n=1}^m A_n}    (1.8)

    Pr{∪_{n=1}^∞ A_n} = lim_{n→∞} Pr{A_n}    for A_1 ⊆ A_2 ⊆ ···    (1.9)

    Pr{∩_{n=1}^∞ A_n} = lim_{n→∞} Pr{A_n}    for A_1 ⊇ A_2 ⊇ ··· .    (1.10)
To verify (1.2), consider a sequence of events A_1, A_2, ... for which A_n = φ for each n. These events are disjoint since φ contains no outcomes, and thus has no outcomes in common with itself or any other event. Also, ∪_n A_n = φ since this union contains no outcomes. Axiom 3 then says that

    Pr{φ} = lim_{m→∞} Σ_{n=1}^m Pr{A_n} = lim_{m→∞} m Pr{φ}.

Since Pr{φ} is a real number, this implies that Pr{φ} = 0.
To verify (1.3), apply Axiom 3 to the disjoint sequence A_1, ..., A_m, φ, φ, ... .

To verify (1.4), note that Ω = A ∪ A^c. Then apply (1.3) to the disjoint sets A and A^c.

To verify (1.5), note that if A ⊆ B, then B = A ∪ (B − A), where B − A is an alternate way to write B ∩ A^c. We see then that A and B − A are disjoint, so from (1.3),

    Pr{B} = Pr{A ∪ (B − A)} = Pr{A} + Pr{B − A} ≥ Pr{A},

where we have used Axiom 2 in the last step.

¹³ Two sets or events A_1, A_2 are disjoint if they contain no common outcomes, i.e., if A_1A_2 = φ. A collection of sets or events is disjoint if all pairs are disjoint.
To verify (1.6) and (1.7), first substitute Ω for B in (1.5) and then substitute ∪_n A_n for A. Finally, (1.8) is established in Exercise 1.3, part (e), and (1.9) and (1.10) are simple consequences of (1.8).
The axioms specify the probability of any disjoint union of events in terms of the individual event probabilities, but what about a finite or countable union of arbitrary events? Exercise 1.3 (c) shows that in this case, (1.3) can be generalized to

    Pr{∪_{n=1}^m A_n} = Σ_{n=1}^m Pr{B_n},    (1.11)

where B_1 = A_1 and, for each n > 1, B_n = A_n − ∪_{i=1}^{n−1} A_i is the set of points in A_n but not in any of the sets A_1, ..., A_{n−1}. That is, the sets B_n are disjoint. The probability of a countable union of disjoint sets is then given by (1.8). In order to use this, one must know not only the event probabilities for A_1, A_2, ..., but also the probabilities of their intersections. The union bound, which is derived in Exercise 1.3 (d), depends only on the individual event probabilities, and gives the following frequently useful upper bound on the union probability:

    Pr{∪_n A_n} ≤ Σ_n Pr{A_n}    (Union bound).    (1.12)
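The union bound can be checked exactly on a small finite model (my own example, not from the text): the bound counts overlapping outcomes once per event, so it exceeds the exact union probability whenever the events intersect.

```python
from fractions import Fraction
from itertools import product

# Two-dice model: 36 equiprobable outcome pairs.
omega = list(product(range(1, 7), repeat=2))
pr = lambda event: Fraction(len(event), len(omega))

a = {w for w in omega if w[0] == 6}    # first die shows 6
b = {w for w in omega if w[1] == 6}    # second die shows 6

# Exact union probability vs. the union bound: the bound counts the
# common outcome (6, 6) twice, so it is slightly larger.
print(pr(a | b))          # → 11/36
print(pr(a) + pr(b))      # → 1/3
```

For disjoint events the bound holds with equality, which is just (1.3).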
1.3 Probability review

1.3.1 Conditional probabilities and statistical independence
Definition 1.3.1. For any two events A and B with Pr{B} > 0, the conditional probability of A, conditional on B, is defined by

    Pr{A|B} = Pr{AB}/Pr{B}.    (1.13)

One visualizes an experiment that has been partly carried out, with B as the result. Then, assuming Pr{B} > 0, Pr{A|B} can be viewed as the probability of A normalized to a sample space restricted to event B. Within this restricted sample space, we can view B as the sample space (i.e., as the set of outcomes that remain possible upon the occurrence of B) and AB as an event within this sample space. For a fixed event B, we can visualize mapping each event A in the original space to event AB in the restricted space. It is easy to see that the event axioms are still satisfied in this restricted space. Assigning probability Pr{A|B} to each event AB in the restricted space, it is easy to see that the axioms of probability are satisfied when B is regarded as the entire sample space. In other words, everything we know about probability can also be applied to such a restricted probability space.
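The restriction-and-renormalization view of (1.13) can be sketched on the two-dice model (my own example, not from the text):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two fair dice
pr = lambda ev: Fraction(len(ev), len(omega))

# Pr{A|B} = Pr{AB}/Pr{B}, defined only when Pr{B} > 0: restrict the
# sample space to B and renormalize by Pr{B}.
def cond_pr(a, b):
    assert pr(b) > 0
    return pr(a & b) / pr(b)

a = {w for w in omega if w[0] + w[1] >= 10}    # sum is at least 10
b = {w for w in omega if w[0] == 5}            # first die shows 5
print(cond_pr(a, b))      # → 1/3
```

Within the restricted space B, only the outcomes (5, 5) and (5, 6) satisfy A, giving 2 of the 6 remaining equiprobable outcomes.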
Definition 1.3.2. Two events, A and B, are statistically independent (or, more briefly, independent) if

    Pr{AB} = Pr{A} Pr{B}.

For Pr{B} > 0, this is equivalent to Pr{A|B} = Pr{A}. This latter form corresponds to our intuitive view of independence, since it says that the observation of B does not change the probability of A. Such intuitive statements about "observation" and "occurrence" are helpful in reasoning probabilistically, but sometimes cause confusion. For example, Bayes law, in the form Pr{A|B} Pr{B} = Pr{B|A} Pr{A}, is an immediate consequence of the definition of conditional probability in (1.13). However, if we can only interpret Pr{A|B} when B is 'observed' or occurs 'before' A, then we cannot interpret Pr{B|A} and Pr{A|B} together. This caused immense confusion in probabilistic arguments before the axiomatic theory was developed.
The notion of independence is of vital importance in defining, and reasoning about, probability models. We will see many examples where very complex systems become very simple, both in terms of intuition and analysis, when appropriate quantities are modeled as statistically independent. An example will be given in the next subsection where repeated independent experiments are used to understand arguments about relative frequencies.

Often, when the assumption of independence turns out to be oversimplified, it is reasonable to assume conditional independence, where A and B are said to be conditionally independent given C if Pr{AB|C} = Pr{A|C} Pr{B|C}. Most of the stochastic processes to be studied here are characterized by particular forms of independence or conditional independence.
For more than two events, the definition of statistical independence is a little more complicated.

Definition 1.3.3. The events A_1, ..., A_n, n > 2, are statistically independent if for each collection S of two or more of the integers 1 to n,

    Pr{∩_{i∈S} A_i} = Π_{i∈S} Pr{A_i}.    (1.14)

This includes the entire collection {1, ..., n}, so one necessary condition for independence is that

    Pr{∩_{i=1}^n A_i} = Π_{i=1}^n Pr{A_i}.    (1.15)

It might be surprising that (1.15) does not imply (1.14), but the example in Exercise 1.5 will help clarify this. This definition will become clearer (and simpler) when we see how to view independence of events as a special case of independence of random variables.
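A small enumeration makes the gap between (1.15) and (1.14) concrete. The events below are a hypothetical example of my own (not necessarily the one in Exercise 1.5): on a uniform 8-point space, the full triple product holds even though one pair fails, so the events are not statistically independent.

```python
from fractions import Fraction

omega = set(range(1, 9))                 # 8 equiprobable sample points
pr = lambda ev: Fraction(len(ev), len(omega))

a1 = {1, 2, 3, 4}
a2 = {2, 3, 4, 5}
a3 = {4, 5, 6, 7}

# The full product (1.15) holds: Pr{A1 A2 A3} = 1/8 = (1/2)^3 ...
print(pr(a1 & a2 & a3) == pr(a1) * pr(a2) * pr(a3))   # → True
# ... yet the pair A1, A2 violates (1.14): Pr{A1 A2} = 3/8, not 1/4,
# so the three events are NOT statistically independent.
print(pr(a1 & a2) == pr(a1) * pr(a2))                  # → False
```

Checking every collection S of two or more indices, as (1.14) requires, is what rules out such examples.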
1.3.2 Repeated idealized experiments
Much of our intuitive understanding of probability comes from the notion of repeating the same idealized experiment many times (i.e., performing multiple trials of the same experiment). However, the axioms of probability contain no explicit recognition of such repetitions. The appropriate way to handle n repetitions of an idealized experiment is through an extended experiment whose sample points are n-tuples of sample points from the original experiment. Such an extended experiment is viewed as n trials of the original experiment. The notion of multiple trials of a given experiment is so common that one sometimes fails to distinguish between the original experiment and an extended experiment with multiple trials of the original experiment.
To be more specific, given an original sample space Ω, the sample space of an n-repetition model is the Cartesian product

    Ω^n = {(ω_1, ω_2, ..., ω_n) : ω_i ∈ Ω for each i, 1 ≤ i ≤ n},    (1.16)

i.e., the set of all n-tuples for which each of the n components of the n-tuple is an element of the original sample space Ω. Since each sample point in the n-repetition model is an n-tuple of points from the original Ω, it follows that an event in the n-repetition model is a subset of Ω^n, i.e., a collection of n-tuples (ω_1, ..., ω_n), where each ω_i is a sample point from Ω. This class of events in Ω^n should include each event of the form {(A_1A_2 ··· A_n)}, where {(A_1A_2 ··· A_n)} denotes the collection of n-tuples (ω_1, ..., ω_n) where ω_i ∈ A_i for 1 ≤ i ≤ n. The set of events (for n-repetitions) must also be extended to be closed under complementation and countable unions and intersections.
The simplest and most natural way of creating a probability model for this extended sample space and class of events is through the assumption that the n trials are statistically independent. More precisely, we assume that for each extended event {(A_1A_2 ··· A_n)} contained in Ω^n, we have

    Pr{(A_1A_2 ··· A_n)} = Π_{i=1}^n Pr{A_i},    (1.17)

where Pr{A_i} is the probability of event A_i in the original model. Note that since Ω can be substituted for any collection of A_i in this formula, the subset condition of (1.14) is automatically satisfied. In other words, for any probability model, there is an extended independent n-repetition model for which the events in each trial are independent of those in the other trials. In what follows, we refer to this as the probability model for n independent identically distributed (IID) trials of a given experiment.

The niceties of how to create this model for n IID arbitrary experiments depend on measure theory, but we simply rely on the existence of such a model and the independence of events in different repetitions. What we have done here is very important conceptually. A probability model for an experiment does not say anything directly about repeated experiments. However, questions about independent repeated experiments can be handled directly within this extended model of n IID repetitions. This can also be extended to a countable number of IID trials.
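The n-IID-repetition construction above can be sketched for a finite single-trial model (a hypothetical biased coin of my own choosing, not from the text): each n-tuple receives the product of its per-trial outcome probabilities, and the resulting assignment is itself a valid probability model.

```python
from fractions import Fraction
from itertools import product

# Hypothetical single-trial model: a biased coin.
single = {'H': Fraction(2, 3), 'T': Fraction(1, 3)}
n = 3

# Extended model: each n-tuple gets the product of its per-trial outcome
# probabilities, i.e., (1.17) applied to singleton events.
extended = {}
for tup in product(single, repeat=n):
    p = Fraction(1)
    for outcome in tup:
        p *= single[outcome]
    extended[tup] = p

print(sum(extended.values()))        # → 1 (a valid probability model)
print(extended[('H', 'H', 'T')])     # → 4/27
```

Summing the extended probabilities over all 2^n tuples returns 1, confirming that the product assignment satisfies the probability axioms on Ω^n.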
1.3.3 Random variables
The outcome of a probabilistic experiment often specifies a collection of numerical values such as temperatures, voltages, numbers of arrivals or departures in various time intervals, etc. Each such numerical value varies, depending on the particular outcome of the experiment, and thus can be viewed as a mapping from the set Ω of sample points to the set R of real numbers (note that R does not include ±∞). These mappings from sample points to real numbers are called random variables.
Definition 1.3.4. A random variable (rv) is essentially a function X from the sample space Ω of a probability model to the set of real numbers R. Three modifications are needed to make this precise. First, X might be undefined or infinite for a subset of Ω that has 0 probability.¹⁴ Second, the mapping X(ω) must have the property that {ω ∈ Ω : X(ω) ≤ x} is an event¹⁵ for each x ∈ R. Third, every finite set of rv's X_1, ..., X_n has the property that for each x_1 ∈ R, ..., x_n ∈ R, the set {ω : X_1(ω) ≤ x_1, ..., X_n(ω) ≤ x_n} is an event.

As with any function, there is often confusion between the function itself, which is called X in the definition above, and the value X(ω) taken on for a sample point ω. This is particularly prevalent with random variables (rv's) since we intuitively associate a rv with its sample value when an experiment is performed. We try to control that confusion here by using X, X(ω), and x, respectively, to refer to the rv, the sample value taken for a given sample point ω, and a generic sample value.
Definition 1.3.5. The distribution function¹⁶ F_X(x) of a random variable (rv) X is a function, R → R, defined by F_X(x) = Pr{ω ∈ Ω : X(ω) ≤ x}. The argument ω is usually omitted for brevity, so F_X(x) = Pr{X ≤ x}.

Note that x is the argument of F_X(x) and the subscript X denotes the particular rv under consideration. As illustrated in Figure 1.1, the distribution function F_X(x) is nondecreasing with x and must satisfy the limits lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1. Exercise 1.6 proves that F_X(x) is continuous from the right (i.e., that for every x ∈ R and ε > 0, lim_{ε→0} F_X(x + ε) = F_X(x)).
Figure 1.1: Example of a distribution function for a rv that is neither continuous nor discrete. If F_X(x) has a discontinuity at some x_o, it means that there is a discrete probability at x_o equal to the magnitude of the discontinuity. In this case F_X(x_o) is given by the height of the upper point at the discontinuity.
Because of the definition of a rv, the set {X ≤ x} for any rv X and any real number x must be an event, and thus Pr{X ≤ x} must be defined for all real x.

¹⁴ For example, consider a probability model in which Ω is the closed interval [0, 1] and the probability distribution is uniform over Ω. If X(ω) = 1/ω, then the sample point 0 maps to ∞, but X is still regarded as a rv. These subsets of 0 probability are usually ignored, both by engineers and mathematicians. Thus, for example, the set {ω ∈ Ω : X(ω) ≤ x} means the set for which X(ω) is both defined and satisfies X(ω) ≤ x.

¹⁵ These last two modifications are technical limitations connected with measure theory. They can usually be ignored, since they are satisfied in all but the most bizarre conditions. However, just as it is important to know that not all subsets in a probability space are events, one should know that not all functions from Ω to R are rv's.

¹⁶ The distribution function is sometimes referred to as the cumulative distribution function.
The concept of a rv is often extended to complex random variables (rv's) and vector rv's. A complex random variable is a mapping from the sample space to the set of finite complex numbers, and a vector random variable (rv) is a mapping from the sample space to the finite vectors in some finite-dimensional vector space. Another extension is that of defective rv's. A defective rv X is a mapping from the sample space to the extended real numbers, which satisfies the conditions of a rv except that the set of sample points mapped into ±∞ has positive probability.

When rv's are referred to (without any modifier such as complex, vector, or defective), the original definition, i.e., a function from Ω to R, is intended.
If X has only a finite or countable number of possible sample values, say x_1, x_2, ..., the probability Pr{X = x_i} of each sample value x_i is called the probability mass function (PMF) at x_i and denoted by p_X(x_i); such a random variable is called discrete. The distribution function of a discrete rv is a 'staircase function,' staying constant between the possible sample values and having a jump of magnitude p_X(x_i) at each sample value x_i. Thus the PMF and the distribution function each specify the other for discrete rv's.
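This mutual specification can be sketched directly (a hypothetical three-valued rv of my own, not from the text): the distribution function is built by summing the PMF, and the PMF is recovered from the jumps of the staircase.

```python
from fractions import Fraction

# Hypothetical discrete rv with three sample values.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Distribution function: a staircase, summing the PMF over values <= x.
def F(x):
    return sum(p for xi, p in pmf.items() if xi <= x)

print(F(-1), F(0), F(0.5), F(2))     # → 0 1/4 1/4 1

# Recover the PMF from the jumps of F (the sample values here are at
# least 1 apart, so evaluating 1/2 below each value isolates its jump).
jumps = {xi: F(xi) - F(xi - Fraction(1, 2)) for xi in pmf}
print(jumps == pmf)                  # → True
```

The constant segments between sample values and the jump magnitudes together reproduce exactly the staircase behavior described above.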
If the distribution function F_X(x) of a rv X has a (finite) derivative at x, the derivative is called the probability density (or the density) of X at x and denoted by f_X(x); for sufficiently small δ, δ · f_X(x) then approximates the probability that X is mapped to a value between x and x + δ. If the density exists for all x, the rv is said to be continuous. More generally, if there is a function f_X(x) such that, for each x ∈ R, the distribution function satisfies F_X(x) = ∫_{−∞}^x f_X(y) dy, then the rv is said to be continuous and f_X(x) is the probability density. This generalization allows the density to be discontinuous. In other words, being a continuous rv is less restrictive than having a continuous density, but slightly more restrictive than having a continuous distribution function.
Elementary probability courses work primarily with the PMF and the density, since they are convenient for computational exercises. We will often work with the distribution function here. This is partly because it is always defined, partly to avoid saying everything thrice, for discrete, continuous, and other rv's, and partly because the distribution function is often most important in limiting arguments such as steady-state time-average arguments. For distribution functions, density functions, and PMF's, the subscript denoting the rv is often omitted if the rv is clear from the context. The same convention is used for complex or vector rv's.
The following tables list some widely used rv's. If the density or PMF is given only in a limited region, it is zero outside of that region. The moment generating function, MGF, of a rv X is E[e^{rX}] and will be discussed in Section 1.3.11.
1.3.4 Multiple random variables and conditional probabilities
Often we must deal with multiple random variables (rv's) in a single probability experiment. If X_1, X_2, ..., X_n are rv's or the components of a vector rv, their joint distribution function

CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
    Name          Density f_X(x)                             Mean   Variance   MGF g_X(r)
    Exponential:  λ exp(-λx); x ≥ 0                          1/λ    1/λ²       λ/(λ-r); for r < λ
    Erlang:       λ^n x^{n-1} exp(-λx)/(n-1)!; x ≥ 0         n/λ    n/λ²       [λ/(λ-r)]^n; for r < λ
    Gaussian:     (1/(σ√(2π))) exp(-(x-a)²/(2σ²))            a      σ²         exp(ra + r²σ²/2)
    Uniform:      1/a; 0 ≤ x ≤ a                             a/2    a²/12      [exp(ra) - 1]/(ra)

Table 1.1: The density, mean, variance and MGF for some common continuous rv's.
    Name        PMF p_M(m)                          Mean   Variance   MGF g_M(r)
    Binary:     p_M(1) = p; p_M(0) = 1-p            p      p(1-p)     1 - p + p e^r
    Binomial:   C(n,m) p^m (1-p)^{n-m}; 0 ≤ m ≤ n   np     np(1-p)    [1 - p + p e^r]^n
    Geometric:  p(1-p)^{m-1}; m ≥ 1                 1/p    (1-p)/p²   p e^r/[1 - (1-p)e^r]; for r < ln[1/(1-p)]
    Poisson:    λ^n exp(-λ)/n!; n ≥ 0               λ      λ          exp[λ(e^r - 1)]

Table 1.2: The PMF, mean, variance and MGF for some common discrete rv's.
is defined by

    F_{X_1···X_n}(x_1,...,x_n) = Pr{ω ∈ Ω : X_1(ω) ≤ x_1, X_2(ω) ≤ x_2, ..., X_n(ω) ≤ x_n}.    (1.18)
This definition goes a long way toward explaining why we need the notion of a sample space Ω when all we want to talk about is a set of rv's. The distribution function of a rv fully describes the individual behavior of that rv, but Ω and the above mappings are needed to describe how the rv's interact.
For a vector rv X with components X_1,...,X_n, or a complex rv X with real and imaginary parts X_1, X_2, the distribution function is also defined by (1.18). Note that {X_1 ≤ x_1, X_2 ≤ x_2, ..., X_n ≤ x_n} is an event and the corresponding probability is nondecreasing in each argument x_i. Also the distribution function of any subset of random variables is obtained by setting the other arguments to +∞. For example, the distribution of a single rv (called a marginal distribution) is given by

    F_{X_i}(x_i) = F_{X_1···X_{i-1}X_iX_{i+1}···X_n}(∞, ..., ∞, x_i, ∞, ..., ∞).
1.3. PROBABILITY REVIEW

If the rv's are all discrete, there is a joint PMF which specifies and is specified by the joint distribution function. It is given by

    p_{X_1...X_n}(x_1,...,x_n) = Pr{X_1 = x_1, ..., X_n = x_n}.
Similarly, if the joint distribution function is differentiable everywhere, it specifies and is specified by the joint probability density,

    f_{X_1...X_n}(x_1,...,x_n) = ∂^n F(x_1,...,x_n) / (∂x_1 ∂x_2 ··· ∂x_n).
Two rv's, say X and Y, are statistically independent (or, more briefly, independent) if

    F_{XY}(x,y) = F_X(x) F_Y(y)    for each x ∈ R, y ∈ R.    (1.19)
If X and Y are discrete rv's then the definition of independence in (1.19) is equivalent to the corresponding statement for PMF's,

    p_{XY}(x_i, y_j) = p_X(x_i) p_Y(y_j)    for each value x_i of X and y_j of Y.
Since {X = x_i} and {Y = y_j} are events, the conditional probability of {X = x_i} conditional on {Y = y_j} (assuming p_Y(y_j) > 0) is given by (1.13) to be

    p_{X|Y}(x_i | y_j) = p_{XY}(x_i, y_j) / p_Y(y_j).
If p_{X|Y}(x_i | y_j) = p_X(x_i) for all i, j, then it is seen that X and Y are independent. This captures the intuitive notion of independence better than (1.19) for discrete rv's, since it can be viewed as saying that the PMF of X is not affected by the sample value of Y.
If X and Y have a joint density, then (1.19) is equivalent to

    f_{XY}(x,y) = f_X(x) f_Y(y)    for each x ∈ R, y ∈ R.

If f_Y(y) > 0, the conditional density can be defined as

    f_{X|Y}(x | y) = f_{XY}(x,y) / f_Y(y).    (1.20)

Then statistical independence can be expressed as

    f_{X|Y}(x | y) = f_X(x)    where f_Y(y) > 0.
This captures the intuitive notion of statistical independence for continuous rv's better than (1.19), but it does not quite say that the density of X, conditional on Y = y, is the same as the marginal density of X. The event {Y = y} has zero probability for a continuous rv, and we cannot condition on events of zero probability. If we look at the derivatives defining these densities, the conditional density looks at the probability that {x ≤ X ≤ x + δ} given that {y ≤ Y ≤ y + ε} in the limit δ, ε → 0. At some level, this is a very technical point and the intuition of conditioning on {Y = y} works very well. Furthermore, problems are often directly modeled in terms of conditional probability densities, and in these cases, viewing a conditional density as a limit is unnecessary.
More generally, the probability of an arbitrary event A, conditional on a given value of a continuous rv Y, is given by

    Pr{A | Y = y} = lim_{δ→0} Pr{A, Y ∈ [y, y+δ]} / Pr{Y ∈ [y, y+δ]}.
We next generalize the above results about two rv's to the case of n rv's X = X_1,...,X_n. Statistical independence is then defined by the equation

    F_X(x_1,...,x_n) = ∏_{i=1}^{n} Pr{X_i ≤ x_i} = ∏_{i=1}^{n} F_{X_i}(x_i)    for all x_1,...,x_n ∈ R.    (1.21)
In other words, X_1,...,X_n are independent if the events {X_i ≤ x_i} for 1 ≤ i ≤ n are independent for all choices of x_1,...,x_n. If the density or PMF exists, (1.21) is equivalent to a product form for the density or mass function. A set of rv's is said to be pairwise independent if each pair of rv's in the set is independent. As shown in Exercise 1.22, pairwise independence does not imply that the entire set is independent.
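As a concrete numerical illustration (a standard example in the spirit of Exercise 1.22, not taken from the text): let X_1, X_2 be IID fair bits and X_3 = X_1 XOR X_2. Each pair of these rv's is independent, yet the three together are not, so the product form (1.21) fails for the triple.

```python
from itertools import product

# X1, X2 IID fair bits; X3 = X1 XOR X2.  Four equally likely sample points.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
prob = 1 / 4

def p(constraints):
    """Probability that the indicated components take the indicated values."""
    return sum(prob for o in outcomes
               if all(o[i] == v for i, v in constraints.items()))

# Pairwise independence: Pr{Xi=a, Xj=b} = Pr{Xi=a} Pr{Xj=b} for every pair.
pairwise = all(
    abs(p({i: a, j: b}) - p({i: a}) * p({j: b})) < 1e-12
    for i, j in [(0, 1), (0, 2), (1, 2)] for a in [0, 1] for b in [0, 1])

# Joint independence fails: Pr{X1=1, X2=1, X3=1} = 0, not 1/8.
joint = abs(p({0: 1, 1: 1, 2: 1}) - p({0: 1}) * p({1: 1}) * p({2: 1})) < 1e-12

print(pairwise, joint)
```

The check confirms that all three pairs factor, while the full joint probability does not.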
Independent rv's are very often also identically distributed, i.e., they all have the same distribution function. These cases arise so often that we abbreviate independent identically distributed by IID. For the IID case (1.21) becomes

    F_X(x_1,...,x_n) = ∏_{i=1}^{n} F_X(x_i).    (1.22)

1.3.5 Stochastic processes and the Bernoulli process
A stochastic process (or random process^17) is an infinite collection of rv's, usually indexed by an integer or a real number often interpreted as time.^18 Thus each sample point of the probability model maps to an infinite collection of sample values of rv's. If the index is regarded as time, then each sample point maps to a function of time called a sample path or sample function. These sample paths might vary continuously with time or might vary only at discrete times, and if they vary at discrete times, those times might be deterministic or random.
In many cases, this collection of rv’s comprising the stochastic process is the only thing of
interest. In this case, the sample points of the probability model can be taken to be the
samplepaths of the process. Conceptually, then, each event is a collectionof samplepaths.
Often the most important of these events can be defined in terms of a finite set of rv’s.
As an example of sample paths that change at only discrete times, we might be concerned
with the times at which customers arrive at some facility. These ‘customers’ might be
customers entering a store, incoming jobs for a computer system, arriving packets to a
communication system, or orders for a merchandising warehouse.
The Bernoulli process is an example of how such customers could be modeled and is perhaps the simplest non-trivial stochastic process. We define this process here and develop a few of its many properties. We will frequently return to it, both to use it as an example and to develop additional properties.
^17 Stochastic and random are synonyms, but random has become more popular for random variables and stochastic for stochastic processes. The reason for the author's choice is that the common-sense intuition associated with randomness appears more important than mathematical precision in reasoning about rv's, whereas for stochastic processes, common-sense intuition causes confusion much more frequently than with rv's. The less familiar word stochastic warns the reader to be more careful.

^18 This definition is deliberately vague, and the choice of whether to call a sequence of rv's a process or a sequence is a matter of custom and choice.
Example 1.3.1. A Bernoulli process is a sequence, Z_1, Z_2, ..., of IID binary random variables.^19 Let p = Pr{Z_i = 1} and 1 - p = Pr{Z_i = 0}. We often visualize a Bernoulli process as evolving in discrete time with the event {Z_i = 1} representing an arriving customer at time i and {Z_i = 0} representing no arrival. Thus at most one arrival occurs at each integer time. We visualize the process as starting at time 0, with the first opportunity for an arrival at time 1.
When viewed as arrivals in time, it is interesting to understand something about the intervals between successive arrivals and about the aggregate number of arrivals up to any given time (see Figure 1.2). These interarrival times and aggregate numbers of arrivals are rv's that are functions of the underlying sequence Z_1, Z_2, .... The topic of rv's that are defined as functions of other rv's (i.e., whose sample values are functions of the sample values of the other rv's) is taken up in more generality in Section 1.3.7, but the interarrival times and aggregate arrivals for Bernoulli processes are so specialized and simple that it is better to treat them from first principles.
First, consider the first interarrival time, X_1, which is defined as the time of the first arrival. If Z_1 = 1, then (and only then) X_1 = 1. Thus p_{X_1}(1) = p. Next, X_1 = 2 if and only if Z_1 = 0 and Z_2 = 1, so p_{X_1}(2) = p(1-p). Continuing, we see that X_1 has the geometric PMF,

    p_{X_1}(j) = p(1-p)^{j-1}    where j ≥ 1.
[Figure 1.2: Illustration of a sample path for a Bernoulli process. The sample values of the binary rv's Z_i are shown below the time instants; the sample value of the aggregate number of arrivals, S_n = ∑_{i=1}^{n} Z_i, is the illustrated step function, and the interarrival intervals X_1, X_2, X_3, ... are the intervals between steps. For the illustrated sample path:

    i  :  1  2  3  4  5  6  7  8
    Z_i:  0  1  1  0  0  1  0  0
    S_i:  0  1  2  2  2  3  3  3  ]
Each subsequent interarrival time X_k can be found in this same way.^20 It has the same geometric PMF and is statistically independent of X_1,...,X_{k-1}. Thus the sequence of interarrival times is an IID sequence of geometric rv's.

It can be seen from Figure 1.2 that a sample path of interarrival times also determines a sample path of the binary arrival rv's, {Z_i; i ≥ 1}. Thus the Bernoulli process can also be characterized in terms of a sequence of IID geometric rv's.
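A quick simulation sketch (not part of the text; the parameter p = 0.3 and sample size are chosen arbitrarily) makes this characterization concrete: generate a Bernoulli sequence, extract the interarrival times, and compare their empirical PMF with the geometric PMF p(1-p)^{j-1} derived above.

```python
import random

random.seed(1)
p = 0.3
n = 200_000
z = [1 if random.random() < p else 0 for _ in range(n)]

# Arrival epochs (first opportunity at time 1), then gaps between successive arrivals.
arrivals = [i + 1 for i, zi in enumerate(z) if zi == 1]
interarrivals = [arrivals[0]] + [t2 - t1 for t1, t2 in zip(arrivals, arrivals[1:])]

def empirical_pmf(j):
    return sum(1 for x in interarrivals if x == j) / len(interarrivals)

for j in range(1, 5):
    geometric = p * (1 - p) ** (j - 1)
    print(j, round(empirical_pmf(j), 3), round(geometric, 3))
```

The empirical relative frequencies settle close to the geometric PMF, as the first-principles argument predicts.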
^19 We say that a sequence Z_1, Z_2, ..., of rv's are IID if for each integer n, the rv's Z_1,...,Z_n are IID. There are some subtleties in going to the limit n → ∞, but we can avoid most such subtleties by working with finite n-tuples and going to the limit at the end.

^20 This is one of those maddening arguments that, while intuitively obvious, requires some careful reasoning to be completely convincing. We go through several similar arguments with great care in Chapter 2, and suggest that skeptical readers wait until then to prove this rigorously.
For our present purposes, the most important rv's in a Bernoulli process are the partial sums S_n = ∑_{i=1}^{n} Z_i. Each rv S_n is the number of arrivals up to and including time n, i.e., S_n is simply the sum of n binary IID rv's and thus has the binomial distribution. That is, p_{S_n}(k) is the probability that k out of n of the Z_i's have the value 1. There are C(n,k) = n!/(k!(n-k)!) arrangements of a binary n-tuple with k 1's, and each has probability p^k q^{n-k}. Thus

    p_{S_n}(k) = C(n,k) p^k q^{n-k}.    (1.23)
We will use the binomial PMF extensively as an example in explaining the laws of large numbers later in this chapter, and will often use it in later chapters as an example of a sum of IID rv's. For these examples, we need to know how p_{S_n}(k) behaves asymptotically as n → ∞ and k → ∞. The relative frequency k/n will be denoted as p̃. We make a short digression here to state and develop an approximation to the binomial PMF that makes this asymptotic behavior clear.
Lemma 1.3.1. Let p_{S_n}(p̃n) be the PMF of the binomial distribution for an underlying binary PMF p_Z(1) = p > 0, p_Z(0) = q > 0. Then for each integer p̃n, 1 ≤ p̃n ≤ n-1,

    p_{S_n}(p̃n) < sqrt(1/(2πn p̃(1-p̃))) exp[n φ(p, p̃)],    (1.24)

where

    φ(p, p̃) = p̃ ln(p/p̃) + (1-p̃) ln((1-p)/(1-p̃)) ≤ 0.    (1.25)

Also, φ(p, p̃) < 0 for all p̃ ≠ p. Finally, for any ε, 0 < ε ≤ 1/2, there is an n(ε) such that for n > n(ε),

    p_{S_n}(p̃n) > (1 - 1/√n) sqrt(1/(2πn p̃(1-p̃))) exp[n φ(p, p̃)]    for ε ≤ p̃ ≤ 1-ε.    (1.26)
Discussion: The parameter p̃ = k/n is the relative frequency of 1's in the n-tuple Z_1,...,Z_n. For each n, p̃ on the left of (1.24) is restricted so that p̃n is an integer. The lemma then says that p_{S_n}(p̃n) is upper bounded by an exponentially decreasing function of n for each p̃ ≠ p. If p̃ is bounded away from 0 and 1, the ratio of the upper and lower bounds on p_{S_n}(p̃n) approaches 1 as n → ∞. A bound that is asymptotically tight in this way is denoted as

    p_{S_n}(p̃n) ~ sqrt(1/(2πn p̃(1-p̃))) exp[n φ(p, p̃)]    for ε < p̃ < 1-ε,    (1.27)

where the symbol ~ means that the ratio of the left to the right side approaches 1 as n → ∞.

Proof*:^21
The factorial of any positive integer n is bounded by the Stirling bounds,^22

    √(2πn) (n/e)^n < n! < √(2πn) (n/e)^n e^{1/12n}.    (1.28)

^21 Proofs with an asterisk can be omitted without an essential loss of continuity.

^22 See Feller [7] for a derivation of these results about the Stirling bounds. Feller also shows that an improved lower bound to n! is given by √(2πn)(n/e)^n exp[1/(12n) - 1/(360n³)].
The ratio √(2πn)(n/e)^n / n! is monotonically increasing with n toward the limit 1, and the ratio √(2πn)(n/e)^n exp(1/12n) / n! is monotonically decreasing toward 1. The upper bound is more accurate, but the lower bound is simpler and known as the Stirling approximation. Since √(2πn)(n/e)^n / n! is increasing in n, we see that n!/k! < √(n/k) n^n k^{-k} e^{-n+k} for k < n. Combining this with (1.28) applied to n-k,

    C(n,k) < sqrt(n/(2πk(n-k))) · n^n / (k^k (n-k)^{n-k}).    (1.29)
Using (1.29) in (1.23) to upper bound p_{S_n}(k),

    p_{S_n}(k) < sqrt(n/(2πk(n-k))) · p^k q^{n-k} n^n / (k^k (n-k)^{n-k}).
Replacing k by p̃n, we get (1.24), where φ(p, p̃) is given by (1.25). Applying the same argument to the right hand inequality in (1.28),

    C(n,k) > sqrt(n/(2πk(n-k))) · (n^n / (k^k (n-k)^{n-k})) exp[-1/(12k) - 1/(12(n-k))]

           > sqrt(n/(2πk(n-k))) · (n^n / (k^k (n-k)^{n-k})) [1 - 1/(12n p̃(1-p̃))].    (1.30)

For ε < p̃ < 1-ε, the term in brackets in (1.30) is lower bounded by 1 - 1/(12nε(1-ε)), which is further lower bounded by 1 - 1/√n for all sufficiently large n, establishing (1.26).
Finally, to show that φ(p, p̃) ≤ 0, with strict inequality for p̃ ≠ p, we take the first two derivatives of φ(p, p̃) with respect to p̃:

    ∂φ(p, p̃)/∂p̃ = ln( p(1-p̃) / (p̃(1-p)) )

    ∂²φ(p, p̃)/∂p̃² = -1 / (p̃(1-p̃)).

Since the second derivative is negative for 0 < p̃ < 1, the maximum of φ(p, p̃) with respect to p̃ is 0, achieved at p̃ = p. Thus φ(p, p̃) < 0 for p̃ ≠ p. Furthermore, φ(p, p̃) decreases as p̃ moves in either direction away from p.
Various aspects of this lemma will be discussed later with respect to each of the laws of
large numbers.
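The asymptotic tightness in (1.27) can be checked numerically (this sketch is not part of the text; p = 0.25 and p̃ = 0.40 are arbitrary choices with p̃ ≠ p). The exact PMF must be computed in log space, since p^k underflows for large n.

```python
import math

def phi(p, pt):
    # phi(p, p~) from (1.25)
    return pt * math.log(p / pt) + (1 - pt) * math.log((1 - p) / (1 - pt))

def log_binomial_pmf(n, k, p):
    # log of C(n,k) p^k (1-p)^(n-k), in log space to avoid underflow
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

p, pt = 0.25, 0.40           # relative frequency p~ held fixed away from p
ratios = []
for n in [10, 100, 1000, 10000]:
    k = int(pt * n)          # p~ n is an integer for these n
    log_approx = n * phi(p, pt) - 0.5 * math.log(2 * math.pi * n * pt * (1 - pt))
    ratios.append(math.exp(log_binomial_pmf(n, k, p) - log_approx))
print(ratios)
```

The ratios stay below 1, as the strict upper bound (1.24) requires, and climb toward 1 as n grows, illustrating (1.27).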
We saw earlier that the Bernoulli process can also be characterized as a sequence of IID geometric interarrival intervals. An interesting generalization of this arises by allowing the interarrival intervals to be arbitrary discrete or continuous nonnegative IID rv's rather than geometric rv's. These processes are known as renewal processes and are the topic of Chapter 5. Poisson processes are special cases of renewal processes in which the interarrival intervals have an exponential PDF. These are treated in Chapter 2 and have many connections to Bernoulli processes.

Renewal processes are examples of discrete stochastic processes. The distinguishing characteristic of such processes is that interesting things (arrivals, departures, changes of state)
occur at discrete instants of time separated by deterministic or random intervals. Discrete stochastic processes are to be distinguished from noise-like stochastic processes in which changes are continuously occurring and the sample paths are continuously varying functions of time. The description of discrete stochastic processes above is not intended to be precise, but Chapters 2, 4, and 5 are restricted to discrete stochastic processes in this sense, whereas Chapter 3 is restricted to continuous processes.
1.3.6 Expectation
The expected value E[X] of a random variable X is also called the expectation or the mean and is frequently denoted as X̄. Before giving a general definition, we discuss several special cases. First consider nonnegative discrete rv's. The expected value E[X] is then given by

    E[X] = ∑_x x p_X(x).    (1.31)
If X has a finite number of possible sample values, the above sum must be finite since each sample value must be finite. On the other hand, if X has a countable number of nonnegative sample values, the sum in (1.31) might be either finite or infinite. Example 1.3.2 illustrates a case in which the sum is infinite. The expectation is said to exist only if the sum is finite (i.e., if the sum converges to a real number), and in this case E[X] is given by (1.31). If the sum is infinite, we say that E[X] does not exist, but also say^23 that E[X] = ∞. In other words, (1.31) can be used in both cases, but E[X] is said to exist only if the sum is finite.
Example 1.3.2. This example will be useful frequently in illustrating rv's that have an infinite expectation. Let N be a positive integer-valued rv with the distribution function F_N(n) = n/(n+1) for each integer n ≥ 1. Then N is clearly a positive rv since F_N(0) = 0 and lim_{n→∞} F_N(n) = 1. For each n ≥ 1, the PMF is given by

    p_N(n) = F_N(n) - F_N(n-1) = n/(n+1) - (n-1)/n = 1/(n(n+1)).    (1.32)

Since p_N(n) is a PMF, we see that ∑_{n=1}^{∞} 1/[n(n+1)] = 1, which is a frequently useful fact.
The following equation, however, shows that E[N] does not exist and has infinite value:

    E[N] = ∑_{n=1}^{∞} n p_N(n) = ∑_{n=1}^{∞} 1/(n+1) = ∞,

where we have used the fact that the harmonic series diverges.
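A small numeric sketch (not part of the text) makes the divergence tangible: the partial sums of E[N] track the harmonic series and grow roughly like ln(m), without bound.

```python
# Partial sums of E[N] = sum over n of n * p_N(n) = sum of 1/(n+1).
def partial_expectation(m):
    return sum(n * (1 / (n * (n + 1))) for n in range(1, m + 1))

for m in [10, 100, 1000, 10**5]:
    print(m, partial_expectation(m))  # grows roughly like ln(m), unbounded
```

Doubling the truncation point keeps adding roughly ln(2) ≈ 0.69 to the partial sum, so no finite value is ever approached.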
We next derive an alternative expression for the expected value of a nonnegative discrete rv. This new expression is given directly in terms of the distribution function. We then use this new expression as a general definition of expectation which applies to all nonnegative rv's, whether discrete, continuous, or arbitrary. It contains none of the convergence questions that could cause confusion for arbitrary rv's or for continuous rv's with wild densities.

^23 It seems metaphysical to say that something doesn't exist but has the value infinity. However, the word exist here is shorthand for exist as a real number, which makes it quite reasonable to also consider the value in the extended real number system, which includes ±∞.
[Figure 1.3: The figure shows the complementary distribution function F^c_X of a nonnegative discrete rv X. For this example, X takes on five possible values, 0 = a_0 < a_1 < a_2 < a_3 < a_4. Thus F^c_X(x) = Pr{X > x} = 1 - p_X(a_0) for x < a_1. For a_1 ≤ x < a_2, Pr{X > x} = 1 - p_X(a_0) - p_X(a_1), and Pr{X > x} has similar drops as x reaches a_2, a_3, and a_4. E[X], from (1.31), is ∑_i a_i p_X(a_i), which is the sum of the rectangles in the figure. This is also the area under the curve F^c_X(x), i.e., ∫_0^∞ F^c_X(x) dx. It can be seen that this argument applies to any nonnegative rv, thus verifying (1.33).]
For a nonnegative discrete rv X, Figure 1.3 illustrates that (1.31) is simply the integral of the complementary distribution function, where the complementary distribution function F^c of a rv is defined as F^c_X(x) = Pr{X > x} = 1 - F_X(x):

    E[X] = ∫_0^∞ F^c_X(x) dx = ∫_0^∞ Pr{X > x} dx.    (1.33)
Although Figure 1.3 only illustrates the equality of (1.31) and (1.33) for one special case, one easily sees that the argument applies to any nonnegative discrete rv, including those with countably many values, by equating the sum of the indicated rectangles with the integral.
For a nonnegative integer-valued rv X, (1.33) reduces to a simpler form that is often convenient when X has a countable set of sample values:

    E[X] = ∑_{n=0}^{∞} Pr{X > n} = ∑_{n=1}^{∞} Pr{X ≥ n}.    (1.34)
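As a quick check of (1.34) on a concrete rv (a geometric rv with p = 0.3, chosen arbitrarily; its mean is 1/p from Table 1.2), the tail sum ∑ Pr{X ≥ n} reproduces the mean:

```python
# Geometric rv with p = 0.3: Pr{X >= n} = (1-p)^(n-1) for n >= 1.
p = 0.3

def tail(n):
    return (1 - p) ** (n - 1)

# Truncating at 200 terms is enough: the tail decays geometrically.
mean_from_tails = sum(tail(n) for n in range(1, 200))
print(mean_from_tails)  # essentially 1/p = 3.333...
```

The tail sum is a geometric series summing to 1/p, matching the mean column of Table 1.2 without ever forming the PMF-weighted sum (1.31).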
For a continuous nonnegative rv X, the conventional definition of expectation is given by

    E[X] = lim_{b→∞} ∫_0^b x f_X(x) dx.    (1.35)
Suppose the integral is viewed as a limit of Riemann sums. Each Riemann sum can be viewed as the expectation of a discrete approximation to the continuous rv. The corresponding expectation of the approximation is given by (1.33) using the approximate F_X. Thus (1.33), using the true F_X, yields the expected value of X. This can also be seen using integration by parts. There are no mathematical subtleties in integrating an arbitrary nonnegative non-increasing function, and this integral must have either a finite or infinite limit. This leads us to the following fundamental definition of expectation for nonnegative rv's:
Definition 1.3.6. The expectation E[X] of a nonnegative rv X is defined by (1.33). The expectation is said to exist if and only if the integral is finite. Otherwise the expectation is said to not exist and is also said to be infinite.
Next consider rv's with both positive and negative sample values. If X has a finite number of positive and negative sample values, say a_1, a_2, ..., a_n, the expectation E[X] is given by

    E[X] = ∑_i a_i p_X(a_i) = ∑_{a_i ≤ 0} a_i p_X(a_i) + ∑_{a_i > 0} a_i p_X(a_i).    (1.36)
If X has a countably infinite set of sample values, then (1.36) can still be used if each of the sums in (1.36) converges to a finite value, and otherwise the expectation does not exist (as a real number). It can be seen that each sum in (1.36) converges to a finite value if and only if E[|X|] exists (i.e., converges to a finite value) for the nonnegative rv |X|.

If E[X] does not exist (as a real number), it still might have the value ∞ if the first sum converges and the second does not, or the value -∞ if the second sum converges and the first does not. If both sums diverge, then E[X] is undefined, even as ±∞. In this latter case, the partial sums can be arbitrarily small or large depending on the order in which the terms of (1.36) are summed (see Exercise 1.8).
As illustrated for a finite number of sample values in Figure 1.4, the expression in (1.36) can also be expressed directly in terms of the distribution function and complementary distribution function as

    E[X] = -∫_{-∞}^{0} F_X(x) dx + ∫_{0}^{∞} F^c_X(x) dx.    (1.37)
Since F^c_X(x) = 1 - F_X(x), this can also be expressed as

    E[X] = ∫_{-∞}^{∞} [u(x) - F_X(x)] dx,

where u(x) is the unit step, u(x) = 1 for x ≥ 0 and u(x) = 0 otherwise.
[Figure 1.4: For this example, X takes on four possible sample values, a_1 < a_2 < 0 < a_3 < a_4. The figure plots F_X(x) for x ≤ 0 and F^c_X(x) for x > 0. As in Figure 1.3, ∫_0^∞ F^c_X(x) dx = a_3 p_X(a_3) + a_4 p_X(a_4). Similarly, ∫_{-∞}^0 F_X(x) dx = -a_1 p_X(a_1) - a_2 p_X(a_2).]
The first integral in (1.37) corresponds to the negative sample values and the second to the positive sample values, and E[X] exists if and only if both integrals are finite (i.e., if E[|X|] is finite).
For continuous-valued rv's with positive and negative sample values, the conventional definition of expectation (assuming that E[|X|] exists) is given by

    E[X] = ∫_{-∞}^{∞} x f_X(x) dx.    (1.38)
This is equal to (1.37) by the same argument as with nonnegative rv’s. Also, as with nonnegative rv’s, (1.37) also applies to arbitrary rv’s. We thus have the following fundamental
definition of expectation:
Definition 1.3.7. The expectation E[X] of a rv X exists, with the value given in (1.37), if each of the two terms in (1.37) is finite. The expectation does not exist, but has value ∞ (-∞), if the first term is finite (infinite) and the second infinite (finite). The expectation does not exist and is undefined if both terms are infinite.
We should not view the general expression in (1.37) for expectation as replacing the need
for the conventional expressions in (1.38) and (1.36). We will use all of these expressions
frequently, using whichever is most convenient. The main advantages of (1.37) are that
it applies equally to all rv’s, it poses no questions about convergence, and it is frequently
useful, especially in limiting arguments.
Example 1.3.3. The Cauchy rv X is the classic example of a rv whose expectation does not exist and is undefined. The probability density is f_X(x) = 1/(π(1+x²)). Thus x f_X(x) is proportional to 1/x both as x → ∞ and as x → -∞. It follows that ∫_0^∞ x f_X(x) dx and ∫_{-∞}^0 (-x) f_X(x) dx are both infinite. On the other hand, we see from symmetry that the Cauchy principal value of the integral in (1.38) is given by

    lim_{A→∞} ∫_{-A}^{A} x/(π(1+x²)) dx = 0.
Thereisusuallylittlemotivationforconsideringtheupperandlowerlimitsoftheintegration
to have the same magnitude, and the Cauchy principal value usually has little significance
for expectations.
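A numeric sketch (not part of the text; simple midpoint Riemann sums with arbitrary step counts) shows both sides of this example at once: the symmetric integral vanishes, while the one-sided integral over [0, A] keeps growing, like ln(1+A²)/(2π).

```python
import math

def f(x):
    # Cauchy density
    return 1 / (math.pi * (1 + x * x))

def riemann(a, b, steps=100_000):
    # Midpoint Riemann sum of x f(x) over [a, b]
    dx = (b - a) / steps
    return sum((a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx) * dx
               for i in range(steps))

for A in [10, 100, 1000]:
    print(A, riemann(-A, A), riemann(0, A))
```

The symmetric sums sit at (numerical) zero for every A, while the one-sided sums increase without bound, which is exactly why the principal value of 0 carries no weight as an expectation.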
1.3.7 Random variables as functions of other random variables
Random variables (rv's) are often defined in terms of each other. For example, if h is a function from R to R and X is a rv, then Y = h(X) is the random variable that maps each sample point ω to the composite function h(X(ω)). The distribution function of Y can be found from this, and the expected value of Y can then be evaluated by (1.37).
It is often more convenient to find E[Y] directly using the distribution function of X. Exercise 1.18 indicates that E[Y] is given by ∫ h(x) f_X(x) dx for continuous rv's and by ∑_x h(x) p_X(x) for discrete rv's. In order to avoid continuing to use separate expressions for continuous and discrete rv's, we express both of these relations by

    E[Y] = ∫_{-∞}^{∞} h(x) dF_X(x).    (1.39)
This is known as a Stieltjes integral, which can be used as a generalization of both the continuous and discrete cases. For most purposes, we use Stieltjes integrals^24 as a notational shorthand for either ∫ h(x) f_X(x) dx or ∑_x h(x) p_X(x).
The existence of E[X] does not guarantee the existence of E[Y], but we will treat the question of existence as it arises rather than attempting to establish any general rules. Particularly important examples of such expected values are the moments E[X^n] of a rv X and the central moments E[(X - X̄)^n] of X, where X̄ is the mean E[X]. The second central moment is called the variance, denoted by σ²_X or VAR[X]. It is given by

    σ²_X = E[(X - X̄)²] = E[X²] - X̄².    (1.40)
The standard deviation σ_X of X is the square root of the variance and provides a measure of dispersion of the rv around the mean. Thus the mean is often viewed as a 'typical value' for the outcome of the rv (see Section 1.3.9) and σ_X is similarly viewed as a typical difference between X and X̄. An important connection between the mean and standard deviation is that E[(X - x)²] is minimized over x by choosing x to be E[X] (see Exercise 1.23).
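The two forms of (1.40) can be checked on any discrete rv; the PMF below is invented purely for illustration.

```python
# A hypothetical discrete rv: sample values mapped to probabilities.
pmf = {0: 0.2, 1: 0.5, 3: 0.3}

mean = sum(x * p for x, p in pmf.items())
# Central-moment form: E[(X - mean)^2]
var_central = sum((x - mean) ** 2 * p for x, p in pmf.items())
# Moment form: E[X^2] - mean^2
var_moments = sum(x * x * p for x, p in pmf.items()) - mean ** 2

print(mean, var_central, var_moments)
```

Both computations give the same variance, as (1.40) asserts; for this PMF the mean is 1.4 and the variance is 1.24.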
Next suppose X and Y are rv's and consider the rv^25 Z = X + Y. If we assume that X and Y are independent, then the distribution function of Z = X + Y is given by^26

    F_Z(z) = ∫_{-∞}^{∞} F_X(z-y) dF_Y(y) = ∫_{-∞}^{∞} F_Y(z-x) dF_X(x).    (1.41)
If X and Y both have densities, this can be rewritten as

    f_Z(z) = ∫_{-∞}^{∞} f_X(z-y) f_Y(y) dy = ∫_{-∞}^{∞} f_Y(z-x) f_X(x) dx.    (1.42)
Eq. (1.42) is the familiar convolution equation from linear systems, and we similarly refer to (1.41) as the convolution of distribution functions (although it has a different functional form from (1.42)). If X and Y are nonnegative random variables, then the integrands in (1.41) and (1.42) are non-zero only between 0 and z, so we often use 0 and z as the limits in (1.41) and (1.42).
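A short sketch (not part of the text; the rate λ = 2 is arbitrary) applies (1.42) to two IID exponential densities and compares the result with the Erlang density of order 2, λ²z e^{-λz}, from Table 1.1. Since both rv's are nonnegative, the integration runs from 0 to z.

```python
import math

lam = 2.0

def f_exp(x):
    # Exponential density from Table 1.1
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def f_conv(z, steps=20_000):
    # (1.42) with limits 0 to z: f_Z(z) = integral of f_X(z-y) f_Y(y) dy
    dy = z / steps
    return sum(f_exp(z - (i + 0.5) * dy) * f_exp((i + 0.5) * dy) * dy
               for i in range(steps))

for z in [0.5, 1.0, 2.0]:
    print(z, f_conv(z), lam**2 * z * math.exp(-lam * z))
```

The numerical convolution matches the Erlang density at each z, consistent with the Erlang rv being a sum of IID exponential rv's.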
If X_1, X_2, ..., X_n are independent rv's, then the distribution of the rv S_n = X_1 + X_2 + ··· + X_n can be found by first convolving the distributions of X_1 and X_2 to get the distribution
^24 More specifically, the Riemann-Stieltjes integral, abbreviated here as the Stieltjes integral, is denoted as ∫_a^b h(x) dF_X(x). This integral is defined as the limit of a generalized Riemann sum, lim_{δ→0} ∑_n h(x_n)[F(y_n) - F(y_{n-1})], where {y_n; n ≥ 1} is a sequence of increasing numbers from a to b satisfying y_n - y_{n-1} ≤ δ and y_{n-1} < x_n ≤ y_n for all n. The Stieltjes integral is defined to exist over finite limits if the limit exists and is independent of the choices of {y_n} and {x_n} as δ → 0. It exists over infinite limits if it exists over finite lengths and a limit over the integration limits can be taken. See Rudin [20] for an excellent elementary treatment of Stieltjes integration, and see Exercise 1.14 for some examples.

^25 The question whether a real-valued function of a rv is itself a rv is usually addressed by the use of measure theory, and since we neither use nor develop measure theory in this text, we usually simply assume (within the limits of common sense) that any such function is itself a rv. However, the sum X + Y of rv's is so important that Exercise 1.12 provides a guided derivation of this result for X + Y. In the same way, the sum S_n = X_1 + ··· + X_n of any finite collection of rv's is also a rv.

^26 See Exercise 1.14 for some peculiarities about this definition.
of S_2 and then, for each n ≥ 2, convolving the distribution of S_n and X_{n+1} to get the distribution of S_{n+1}. The distributions can be convolved in any order to get the same resulting distribution.
Whether or not X_1, X_2, ..., X_n are independent, the expected value of S_n = X_1 + X_2 + ··· + X_n satisfies

    E[S_n] = E[X_1 + X_2 + ··· + X_n] = E[X_1] + E[X_2] + ··· + E[X_n].    (1.43)
This says that the expected value of a sum is equal to the sum of the expected values,
whether or not the rv’s are independent (see exercise 1.13). The following example shows
how this can be a valuable problem solving aid with an appropriate choice of rv’s.
Example 1.3.4. Considera switch with n input nodes and n outputnodes. Suppose each
input is randomly connected to a single output in such a way that each output is also
connected to a single input. That is, each output is connected to input 1 with probability
1/n . Given this connection, each of the remaining outputs are connected to input 2 with
probability 1 / ( n − 1), and so forth.
An input node is said to be matched if it is connected to the output of the same number.
We want to show that the expected number of matches (for any given n ) is 1. Note that
the first node is matched with probability 1 /n , and therefore the expectation of a match
for node 1 is 1/n . Whether or not the second input node is matched depends on the choice
of output for the first input node, but it can be seen from symmetry that the marginal distribution for the output node connected to input 2 is 1/n for each output. Thus the
expectation of a match for node 2 is also 1/n . In the same way, the expectation of a match
for each input node is 1/n . From (1.43), the expected total number of matches is the sum
over the expected number for each input, and is thus equal to 1. This exercise would be
much more difficult without the use of (1.43).
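The conclusion of Example 1.3.4 is easy to check by simulation (this sketch is not part of the text; the random connection is modeled as a uniform random permutation, and the trial counts are arbitrary).

```python
import random

random.seed(7)

def avg_matches(n, trials=10_000):
    """Average number of matched inputs over many random connections."""
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)  # uniform random matching of inputs to outputs
        total += sum(1 for i, out in enumerate(perm) if i == out)
    return total / trials

print(avg_matches(5), avg_matches(50))
```

For both n = 5 and n = 50 the average hovers near 1, even though the individual match events are dependent; (1.43) needs no independence.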
If the rv's X_1,...,X_n are independent, then, as shown in exercises 1.13 and 1.20, the variance of S_n = X_1 + ··· + X_n is given by

    σ²_{S_n} = ∑_{i=1}^{n} σ²_{X_i}.    (1.44)
If X_1, ..., X_n are also identically distributed (i.e., X_1, ..., X_n are IID) with variance σ_X²,
then σ_{S_n}² = nσ_X². Thus the standard deviation of S_n is σ_{S_n} = √n σ_X. Sums of IID rv's
appear everywhere in probability theory and play an especially central role in the laws of
large numbers. It is important to remember that the mean of S_n is linear in n but the
standard deviation increases only with the square root of n. Figure 1.5 illustrates this
behavior.
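The √n growth of σ_{S_n} can be checked directly for the binary rv used in Figure 1.5 (p_X(1) = 1/4, so σ_X² = 3/16). A small sketch using (1.44):

```python
import math

# The binary rv of Figure 1.5: p_X(1) = 1/4, so σ_X² = (1/4)(3/4) = 3/16.
p = 0.25
var_X = p * (1 - p)

for n in (4, 20, 50):
    mean_Sn = n * p                      # E[S_n] grows linearly in n
    std_Sn = math.sqrt(n * var_X)        # σ_{S_n} = √n σ_X, by (1.44)
    print(n, mean_Sn, round(std_Sn, 2))  # 4 1.0 0.87 / 20 5.0 1.94 / 50 12.5 3.06
```

The printed values match the means and standard deviations quoted in Figure 1.5.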
1.3.8 Conditional expectations
Just as the conditional distribution of one rv conditioned on a sample value of another rv
is important, the conditional expectation of one rv based on the sample value of another is
equally important. Initially let X be a positive discrete rv and let y be a sample value of
CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
another discrete rv Y such that p_Y(y) > 0. Then the conditional expectation of X given
Y = y is defined to be

    E[X | Y = y] = Σ_x x p_{X|Y}(x | y).        (1.45)

Figure 1.5: The distribution function F_{S_n}(s) of S_n = X_1 + ··· + X_n, where X_1, ..., X_n
are typical IID rv's and n takes the values 4, 20, and 50. The particular rv X in the figure
is binary with p_X(1) = 1/4, p_X(0) = 3/4; the corresponding means and standard deviations
are S̄_4 = 1, σ_{S_4} = .87; S̄_20 = 5, σ_{S_20} = 1.94; S̄_50 = 12.5, σ_{S_50} = 3.06. Note that
the mean of S_n is proportional to n and the standard deviation to √n.
This is simply the ordinary expected value of X using the conditional probabilities in the
reduced sample space corresponding to Y = y. This value can be finite or infinite as before.
More generally, if X can take on positive or negative values, then there is the possibility that
the conditional expectation is undefined. In other words, for discrete rv's, the conditional
expectation is exactly the same as the ordinary expectation, except that it is taken using
conditional probabilities over the reduced sample space.
More generally yet, let X be an arbitrary rv and let y be a sample value of a discrete rv Y
with p_Y(y) > 0. The conditional distribution function of X conditional on Y = y is defined
as

    F_{X|Y}(x | y) = Pr{X ≤ x, Y = y} / Pr{Y = y}.
Since this is an ordinary distribution function in the reduced sample space where Y = y,
(1.37) expresses the expectation of X conditional on Y = y as

    E[X | Y = y] = −∫_{−∞}^0 F_{X|Y}(x | y) dx + ∫_0^∞ F^c_{X|Y}(x | y) dx.        (1.46)
The forms of conditional expectation in (1.45) and (1.46) are given for individual sample
values of Y for which pY ( y) > 0.
1.3. PROBABILITY REVIEW
We next show that the conditional expectation of X conditional on a discrete rv Y can also
be viewed as a rv. With the possible exception of a set of zero probability, each ω ∈ Ω maps
to {Y = y} for some y with p_Y(y) > 0, and E[X | Y = y] is defined for that y. Thus we can
define E[X | Y] as^27 a rv that is a function of Y, mapping ω to a sample value, say y, of Y,
and mapping that y to E[X | Y = y]. Regarding a conditional expectation as a rv that is a
function of the conditioning rv is a powerful tool both in problem solving and in advanced
work. For now, we use this to express the unconditional mean of X as

    E[X] = E[E[X | Y]],        (1.47)

where the inner expectation is over X for each value of Y and the outer expectation is over
the rv E[X | Y], which is a function of Y.
Example 1.3.5. Consider rolling two dice, say a red die and a black die. Let X_1 be the
number on the top face of the red die, and X_2 that for the black die. Let S = X_1 + X_2.
Thus X_1 and X_2 are IID integer rv's, each uniformly distributed from 1 to 6. Conditional
on S = j, X_1 is uniformly distributed between 1 and j − 1 for j ≤ 7 and between j − 6
and 6 for j ≥ 7. For each j ≤ 7, it follows that E[X_1 | S = j] = j/2. Similarly, for j ≥ 7,
E[X_1 | S = j] = j/2. This can also be seen by the symmetry between X_1 and X_2.

The rv E[X_1 | S] is thus a discrete rv taking on values from 1 to 6 in steps of 1/2 as the
sample value of S goes from 2 to 12. The PMF of E[X_1 | S] is given by p_{E[X_1|S]}(j/2) = p_S(j).
Using (1.47), we can then calculate E[X_1] as

    E[X_1] = E[E[X_1 | S]] = Σ_{j=2}^{12} (j/2) p_S(j) = E[S]/2 = 7/2.
This example is not intended to show the value of (1.47) in calculating expectation, since
E [X 1] = 7 / 2 is initially obvious from the uniform integer distribution of X 1. The purpose
is simply to illustrate what the rv E [X 1 | S] means.
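As a sketch of what the rv E[X_1 | S] means, the dice example can be worked out exactly by brute force over the 36 equiprobable outcomes:

```python
from fractions import Fraction
from itertools import product

# All 36 equiprobable outcomes of the red and black dice.
outcomes = list(product(range(1, 7), repeat=2))

def cond_exp_X1_given_S(j):
    """E[X1 | S = j], computed over the reduced sample space {X1 + X2 = j}."""
    matching = [x1 for (x1, x2) in outcomes if x1 + x2 == j]
    return Fraction(sum(matching), len(matching))

# Iterated expectation (1.47): average E[X1 | S = j] over the PMF of S.
E_X1 = sum(Fraction(sum(1 for o in outcomes if sum(o) == j), 36)
           * cond_exp_X1_given_S(j) for j in range(2, 13))
print(cond_exp_X1_given_S(7), E_X1)     # 7/2 7/2
```

Both the conditional expectation at j = 7 and the iterated expectation come out to 7/2, as in the text.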
To illustrate (1.47) in a more general way, while still assuming X to be discrete, we can
write out this expectation by using (1.45) for E[X | Y = y]:

    E[X] = E[E[X | Y]] = Σ_y p_Y(y) E[X | Y = y]
         = Σ_y p_Y(y) Σ_x x p_{X|Y}(x | y).        (1.48)
Operationally, there is nothing very fancy in the example or in (1.47). Combining the
sums, (1.48) simply says that E[X] = Σ_{y,x} x p_{YX}(y, x). As a concept, however, viewing the
conditional expectation E[X | Y] as a rv based on the conditioning rv Y is often a useful
theoretical tool. This approach is equally useful as a tool in problem solving, since there
are many problems where it is easy to find conditional expectations, and then to find the
total expectation by averaging over the conditioning variable. For this reason, this result is
sometimes called either the total expectation theorem or the iterated expectation theorem.
Exercise 1.19 illustrates the advantages of this approach, particularly where it is initially
clear that the expectation is finite. The following cautionary example, however, shows that
this approach can sometimes hide convergence questions and give the wrong answer.

27 This assumes that E[X | Y = y] is finite for each y, which is one of the reasons that expectations are
said to exist only if they are finite.
Example 1.3.6. Let Y be a geometric rv with the PMF p_Y(y) = 2^{−y} for integer y ≥ 1.
Let X be an integer rv that, conditional on Y = y, is binary with equiprobable values ±2^y.
We then see that E[X | Y = y] = 0 for all y, and thus (1.48) indicates that
E[X] = 0. On the other hand, it is easy to see that p_X(2^k) = p_X(−2^k) = 2^{−k−1} for each
integer k ≥ 1. Thus the expectation over positive values of X is ∞ and that over negative
values is −∞. In other words, the expected value of X is undefined and (1.48) is incorrect.
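The divergence in Example 1.3.6 is easy to exhibit numerically: each positive value 2^k contributes 2^k · 2^{−k−1} = 1/2 to the sum over positive values, so the partial sums grow without bound. A minimal sketch:

```python
# In Example 1.3.6, p_X(2^k) = 2^(-k-1), so each positive value contributes
# 2^k * 2^(-k-1) = 1/2 to the expectation over positive values of X.
def positive_part_partial_sum(K):
    return sum((2 ** k) * 2 ** (-k - 1) for k in range(1, K + 1))

print(positive_part_partial_sum(10))    # 5.0
print(positive_part_partial_sum(100))   # 50.0 -- grows without bound
```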
The difficulty in the above example cannot occur if X is a nonnegative rv. Then (1.48) is
simply a sum of a countable number of nonnegative terms, and thus it either converges to
a finite sum independent of the order of summation, or it diverges to ∞, again independent
of the order of summation.
If X has both positive and negative components, we can separate it into X = X^+ + X^−, where
X^+ = max(0, X) and X^− = min(X, 0). Then (1.48) applies to X^+ and −X^− separately.
If at most one is infinite, then (1.48) applies to X, and otherwise E[X] is undefined. This is
summarized in the following theorem:
summarized in the following theorem:
Theorem 1.3.1 (Total expectation). Let X and Y be discrete rv's. If X is nonnegative,
then E[X] = E[E[X | Y]] = Σ_y p_Y(y) E[X | Y = y]. If X has both positive and negative
values, and if at most one of E[X^+] and E[−X^−] is infinite, then E[X] = E[E[X | Y]] =
Σ_y p_Y(y) E[X | Y = y].
We have seen above that if Y is a discrete rv, then the conditional expectation E[X | Y = y]
is little more complicated than the unconditional expectation, and this is true whether X
is discrete, continuous, or arbitrary. If X and Y are continuous, we can essentially extend
these results to probability densities. In particular, defining E[X | Y = y] as

    E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x | y) dx,        (1.49)
we have

    E[X] = ∫_{−∞}^{∞} f_Y(y) E[X | Y = y] dy = ∫_{−∞}^{∞} f_Y(y) ∫_{−∞}^{∞} x f_{X|Y}(x | y) dx dy.        (1.50)
We do not state this as a theorem because the details about the integration do not seem
necessary for the places where it is useful.
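As an illustration of (1.50), here is a numerical sketch with hypothetical densities chosen for convenience (not from the text): Y exponential with rate 1, and X conditionally uniform on (0, Y), so that E[X | Y = y] = y/2 and the iterated expectation gives E[X] = E[Y]/2 = 1/2. The integration is a simple midpoint rule; the truncation point and step count are arbitrary.

```python
import math

# Hypothetical example: Y ~ exponential(1); conditional on Y = y,
# X ~ uniform(0, y), so E[X | Y = y] = y/2.  Then (1.50) gives
# E[X] = integral of f_Y(y) * (y/2) dy = E[Y]/2 = 1/2.
def E_X_via_iterated(upper=40.0, steps=400_000):
    """Midpoint-rule evaluation of the outer integral in (1.50)."""
    dy = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * dy
        total += math.exp(-y) * (y / 2) * dy
    return total

print(round(E_X_via_iterated(), 4))   # ≈ 0.5
```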
1.3.9 Typical values of rv's; mean and median
The distribution function of a rv often has more detail than we are interested in, and the
mean is often taken as a ‘typical value.’ Similarly, in statistics, the average of a set of
numerical data values is often taken to be representative of the entire set. For example,
students always want to know the average of the scores on an exam, and investors always
want to know the Dow-Jones average. Economists are also interested, for example, in such
averages as the average annual household income over various geographical regions. These
averages often take on an importance and a life of their own, particularly in terms of how
they vary in time.
The median of a rv (or set of data values) is often an alternate choice of a single number
to serve as a typical value. We say that α is the median of X if Pr{X ≤ α} ≥ 1/2 and
Pr{X ≥ α} ≥ 1/2. Exercise 1.9 illustrates what this definition means. In addition, Exercise
1.10 shows that if the mean exists, then the median α satisfies E[|X − α|] = min_x E[|X − x|].
It is possible for the median to be non-unique, with all values in an interval satisfying the
definition, and in this case the minimizing x in min_x E[|X − x|] is similarly non-unique.
Another interesting property of the median, suggested in Exercise 1.34, is that in essence
the median of a large number of IID sample values of a rv is close to the median of the
distribution with high probability. Another property, relating the median α to the mean X̄
of a rv with standard deviation σ, is (see Exercise 1.33)

    |X̄ − α| ≤ σ.        (1.51)
The question now arises whether the mean or the median is preferable as a single number
describing a rv. The question is too vague to be answered in any generality, but the answer
depends heavily on what the single number is to be used for. To illustrate this, consider
a rv whose sample values are the yearly household incomes of a large society (or, almost
equivalently, consider a large data set consisting of these yearly household incomes).
For the mean, the probability of each sample value is weighted by the household income,
so that a household income of $10^9 is weighted the same as 100,000 household incomes of
$10^4 each. For the median, this weighting disappears, and if our billionaire has a truly
awful year with only $10^6 income, the median is unchanged. If one is interested in the total
purchasing power of the society, then the mean might be the more appropriate value. On
the other hand, if one is interested in the well-being of the society, the median is the more
appropriate value.^28
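The household-income contrast can be sketched with hypothetical data (100,000 households at $10^4 plus one billionaire); the numbers are illustrative, not from the text:

```python
import statistics

# Hypothetical data: 100,000 households at $10^4 and one billionaire at $10^9.
incomes = [10_000] * 100_000 + [1_000_000_000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
print(round(mean), median)              # 20000 10000

# A truly awful year for the billionaire moves the mean, not the median.
incomes[-1] = 1_000_000
print(round(statistics.mean(incomes)), statistics.median(incomes))
```

One outlier doubles the mean while leaving the median untouched, which is the point of the passage above.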
1.3.10 Indicator random variables
For any event A, the indicator random variable of A, denoted I_A, is a binary rv that has
the value 1 for all ω ∈ A and the value 0 otherwise. It then has the PMF p_{I_A}(1) = Pr{A}
and p_{I_A}(0) = 1 − Pr{A}. The corresponding distribution function F_{I_A} is then illustrated in
Figure 1.6. It is easily seen that E[I_A] = Pr{A}.
Indicator rv's are useful because they allow us to apply the many known results about rv's
and particularly binary rv's to events. For example, the laws of large numbers are expressed
in terms of sums of rv's, and those results all translate into results about relative frequencies
through the use of indicator functions.
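A minimal sketch of E[I_A] = Pr{A}, using the hypothetical event A = {face ≥ 5} for one roll of a fair die:

```python
from fractions import Fraction

# Hypothetical event: A = {face >= 5} for one roll of a fair die.
sample_space = range(1, 7)
p = Fraction(1, 6)                      # each face is equiprobable

def I_A(omega):
    return 1 if omega >= 5 else 0       # indicator of A

E_I_A = sum(p * I_A(w) for w in sample_space)
Pr_A = sum(p for w in sample_space if w >= 5)
print(E_I_A, Pr_A)                      # 1/3 1/3
```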
28 Unfortunately, the choice between median and mean (and many similar choices) is often made for
commercial or political expediency rather than scientific or common-sense appropriateness.
Figure 1.6: The distribution function F_{I_A} of an indicator random variable I_A: a step
function equal to 0 for x < 0, to 1 − Pr{A} for 0 ≤ x < 1, and to 1 for x ≥ 1.

1.3.11 Moment generating functions and other transforms
The moment generating function (MGF) for a rv X is given by

    g_X(r) = E[e^{rX}] = ∫_{−∞}^{∞} e^{rx} dF_X(x),        (1.52)
where r is a real variable. The integrand is nonnegative, and we can study where the integral
exists (i.e., where it is finite) by separating it as follows:

    g_X(r) = ∫_0^{∞} e^{rx} dF_X(x) + ∫_{−∞}^0 e^{rx} dF_X(x).        (1.53)
Both integrals exist for r = 0, since the first is Pr { X > 0} and the second is Pr{ X ≤ 0} .
The first integral is increasing in r, and thus if it exists for one value of r, it also exists
for all smaller values. For example, if X is a nonnegative exponential rv with the density
f_X(x) = e^{−x}, then the first integral exists if and only if r < 1, and it then has the value
1/(1 − r). As another example, if X satisfies Pr{X > A} = 0 for some finite A, then the first
integral is at most e^{rA}, which is finite for all real r.
Let r + ( X ) be the supremum of values of r for which the first integral exists. Then 0 ≤
r_+(X) ≤ ∞ and the first integral exists for all r < r_+(X). In the same way, let r_−(X) be
the infimum of values of r for which the second integral exists. Then 0 ≥ r_−(X) ≥ −∞
and the second integral exists for all r > r_−(X).
Combining the two integrals, the region of r over which gX ( r ) exists is an interval I ( X )
from r − ( X ) ≤ 0 to r + ( X ) ≥ 0. Either or both of the end points, r − ( X ) and r + ( X ), might
be included in I ( X ), and either or both might be either 0 or infinite. We denote these
quantities as I, r_−, and r_+ when the rv X is clear from the context. Tables 1.3.3 and 1.3.3
give the interval I for a number of standard rv's, and Exercise 1.24 illustrates I(X) further.
If g_X(r) exists in an open region of r around 0 (i.e., if r_− < 0 < r_+), then derivatives^29 of
all orders exist in that region. They are given by

    d^k g_X(r)/dr^k = ∫_{−∞}^{∞} x^k e^{rx} dF_X(x);        d^k g_X(r)/dr^k |_{r=0} = E[X^k].        (1.54)
29 This result depends on interchanging the order of differentiation (with respect to r) and integration
(with respect to x). This can be shown to be permissible because g_X(r) exists for r both greater and smaller
than 0, which in turn implies, first, that 1 − F_X(x) must approach 0 at least exponentially as x → ∞ and,
second, that F_X(x) must approach 0 at least exponentially as x → −∞.
This shows that finding the moment generating function often provides a convenient way
to calculate the moments of a random variable (see Exercise 3.2 for an example). If any
moment of a rv fails to exist, however, then the MGF must also fail to exist over any open
interval containing 0 (see Exercise 1.38).
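As a sketch of (1.54): for the exponential density f_X(x) = e^{−x}, the MGF is g_X(r) = 1/(1 − r) for r < 1, and the moments are E[X^k] = k!. The derivatives at 0 can be approximated by finite differences (the step size h is an arbitrary choice):

```python
import math

# For f_X(x) = e^{-x} (x >= 0), g_X(r) = 1/(1 - r) for r < 1, and (1.54)
# gives E[X^k] = k!.  Approximate the derivatives at r = 0 numerically.
def g(r):
    return 1.0 / (1.0 - r)

h = 1e-5
first = (g(h) - g(-h)) / (2 * h)               # ≈ g'(0) = E[X] = 1
second = (g(h) - 2 * g(0.0) + g(-h)) / h**2    # ≈ g''(0) = E[X²] = 2
print(round(first, 3), round(second, 3))
```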
Another important feature of moment generating functions is their usefulness in treating
sums of independent rv's. For example, let S_n = X_1 + X_2 + ··· + X_n. Then

    g_{S_n}(r) = E[e^{rS_n}] = E[exp(Σ_{i=1}^n rX_i)] = E[Π_{i=1}^n exp(rX_i)] = Π_{i=1}^n g_{X_i}(r).        (1.55)
In the last step, we have used a result of Exercise 1.13, which shows that for independent
rv's, the mean of the product is equal to the product of the means. If X_1, ..., X_n are also
IID, then

    g_{S_n}(r) = [g_X(r)]^n.        (1.56)

We will use this property frequently in treating sums of IID rv's. Note that this also implies
that the regions over which the MGF's of S_n and X exist are the same, i.e., I(S_n) = I(X).
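Equation (1.56) can be verified directly for a binary rv, where S_n is binomial and E[e^{rS_n}] can be summed over the binomial PMF; the parameter values below are arbitrary:

```python
import math

# Check (1.56) for a binary rv: S_n is binomial(n, p), so E[e^{r S_n}]
# summed over the binomial PMF must equal (q + p e^r)^n.
p, q, n, r = 0.25, 0.75, 10, 0.3

g_X = q + p * math.exp(r)
lhs = sum(math.comb(n, k) * p**k * q**(n - k) * math.exp(r * k)
          for k in range(n + 1))
print(abs(lhs - g_X**n) < 1e-9)   # True
```

The agreement is exact up to floating-point rounding, by the binomial theorem.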
The real variable r in the MGF can also be viewed as a complex variable, giving rise to a
number of other transforms. A particularly important case is to view r as a pure imaginary
variable, say iθ where i = √−1 and θ is real. Then^30 g_X(iθ) = E[e^{iθX}] is called the
characteristic function of X. Since |e^{iθx}| is 1 for all x, g_X(iθ) exists for all rv's X and all
real θ, and its magnitude is at most one.
A minor but important variation on the characteristic function of X is the Fourier transform
of the probability density of X. If X has a density f_X(x), then the Fourier transform of
f_X(x) is given by

    g_X(−i2πθ) = ∫_{−∞}^{∞} f_X(x) exp(−i2πθx) dx.        (1.57)
The major advantage of the Fourier transform (aside from its familiarity) is that f_X(x) can
usually be found from g_X(−i2πθ) as the inverse Fourier transform,^31

    f_X(x) = ∫_{−∞}^{∞} g_X(−i2πθ) exp(i2πθx) dθ.        (1.58)
The Z-transform is the result of replacing e^r with z in g_X(r). This is useful primarily
for integer valued rv's, but if one transform can be evaluated, the other can be found
immediately. Finally, if we use −s, viewed as a complex variable, in place of r, we get the
two-sided Laplace transform of the density of the random variable. Note that for all of
these transforms, multiplication in the transform domain corresponds to convolution of the
distribution functions or densities, as arises in summing independent rv's. The simplicity of
taking products of transforms is a major reason that transforms are so useful in probability
theory.
30 The notation here can be slightly dangerous, since one cannot necessarily take an expression for g_X(r),
valid for real r, and replace r by iθ with real θ to get the characteristic function.

31 This integral does not necessarily converge, particularly if X does not have a PDF. However, it can
be shown (see [22], Chap. 2.12, or [8], Chap. 15) that the characteristic function/Fourier transform of an
arbitrary rv does uniquely specify the distribution function.
1.4 Basic inequalities
Inequalities play a particularly fundamental role in probability, partly because many
important models are too complex to find exact answers, and partly because many of the
most useful theorems establish limiting rather than exact results. In this section, we study
three related inequalities, the Markov, Chebyshev, and Chernoff bounds. These are used
repeatedly both in the next section and in the remainder of the text.
1.4.1 The Markov inequality
This is the simplest and most basic of these inequalities. It states that if a nonnegative
random variable Y has a mean E[Y], then, for every y > 0, Pr{Y ≥ y} satisfies^32

    Pr{Y ≥ y} ≤ E[Y]/y        for every y > 0 (Markov Inequality).        (1.59)

Figure 1.7 derives this result using the fact (see Figure 1.3) that the mean of a nonnegative
rv is the integral of its complementary distribution function, i.e., of the area under the curve
Pr{Y > z}. Exercise 1.30 gives another simple proof using an indicator random variable.
Figure 1.7: Demonstration that y Pr{Y ≥ y} ≤ E[Y]: the rectangle of area y Pr{Y ≥ y}
lies under the curve Pr{Y > z}, and the total area under that curve is E[Y]. By letting
y → ∞, it can also be seen that the shaded area becomes a negligible portion of the area
E[Y], so that lim_{y→∞} y Pr{Y > y} = 0 if E[Y] < ∞.
As an example of this inequality, assume that the average height of a population of people
is 1.6 meters. Then the Markov inequality states that at most half of the population have a
height exceeding 3.2 meters. We see from this example that the Markov inequality is often
very weak. However, for any y > 0, we can consider a rv that takes on the value y with
probability ε and the value 0 with probability 1 − ε; this rv satisfies the Markov inequality
at the point y with equality. Figure 1.7 (as elaborated in Exercise 1.46) also shows that,
for any nonnegative rv Y with a finite mean,

    lim_{y→∞} y Pr{Y ≥ y} = 0.        (1.60)
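For a concrete check of (1.59), take Y exponential with E[Y] = 1, so that Pr{Y ≥ y} = e^{−y} exactly. The bound 1/y holds for every y > 0 but, as the height example above suggests, it is loose:

```python
import math

# For Y exponential with E[Y] = 1, Pr{Y >= y} = e^{-y} exactly, while the
# Markov bound is E[Y]/y = 1/y.  The bound always holds but is weak.
for y in (0.5, 1.0, 2.0, 5.0, 10.0):
    exact = math.exp(-y)
    bound = 1.0 / y
    assert exact <= bound
    print(y, round(exact, 4), round(bound, 4))
```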
32 The distribution function of any given rv Y is known (at least in principle), and thus one might question
why an upper bound is ever preferable to the exact value. One answer is that Y might be given as a function
of many other rv's and that the parameters (such as the mean) used in a bound are often much easier to find
than the distribution function. Another answer is that such inequalities are often used in theorems which
state results in terms of simple statistics such as the mean rather than the entire distribution function. This
will be evident as we use these bounds.
This will be useful shortly in the proof of Theorem 1.5.4.
1.4.2 The Chebyshev inequality
We now use the Markov inequality to establish the well-known Chebyshev inequality. Let
Z be an arbitrary rv with finite mean E[Z] and finite variance σ_Z², and define Y as the
nonnegative rv Y = (Z − E[Z])². Thus E[Y] = σ_Z². Applying (1.59),

    Pr{(Z − E[Z])² ≥ y} ≤ σ_Z²/y        for every y > 0.

Replacing y with ε² and noting that the event {(Z − E[Z])² ≥ ε²} is the same as
{|Z − E[Z]| ≥ ε}, this becomes

    Pr{|Z − E[Z]| ≥ ε} ≤ σ_Z²/ε²        for every ε > 0 (Chebyshev inequality).        (1.61)

Note that the Markov inequality bounds just the upper tail of the distribution function and
applies only to nonnegative rv's, whereas the Chebyshev inequality bounds both tails of the
distribution function. The more important differences, however, are that the Chebyshev
bound requires a variance and goes to zero inversely with the square of the distance from
the mean, whereas the Markov bound does not require a variance and goes to zero inversely
with the distance from 0 (and thus asymptotically with distance from the mean).

The Chebyshev inequality is particularly useful when Z is the sample average, (X_1 + X_2 +
··· + X_n)/n, of a set of IID rv's. This will be used shortly in proving the weak law of large
numbers.
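A quick check of (1.61) for Z uniform on [0, 1] (E[Z] = 1/2, σ_Z² = 1/12), where the exact two-sided tail Pr{|Z − 1/2| ≥ ε} = max(0, 1 − 2ε) is available in closed form:

```python
# Chebyshev check for Z uniform on [0, 1]: E[Z] = 1/2, var = 1/12, and the
# exact two-sided tail is Pr{|Z - 1/2| >= eps} = max(0, 1 - 2*eps).
var_Z = 1.0 / 12.0
for eps in (0.1, 0.2, 0.3, 0.4, 0.5):
    exact = max(0.0, 1.0 - 2.0 * eps)
    bound = var_Z / eps**2
    assert exact <= bound
    print(eps, round(exact, 3), round(bound, 3))
```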
1.4.3 Chernoff bounds
Chernoff (or exponential) bounds are another variation of the Markov inequality in which
the bound on each tail of the distribution function goes to 0 exponentially with distance from
the mean. For any given rv Z, let I(Z) be the interval over which the MGF g_Z(r) = E[e^{rZ}]
exists. Letting Y = e^{rZ} for any r ∈ I(Z), the Markov inequality (1.59) applied to Y is

    Pr{exp(rZ) ≥ y} ≤ g_Z(r)/y        for every y > 0 (Chernoff bound).
This takes on a more meaningful form if y is replaced by e^{rb}. Note that exp(rZ) ≥ exp(rb)
is equivalent to Z ≥ b for r > 0 and to Z ≤ b for r < 0. Thus, for any real b, we get the
following two bounds, one for r > 0 and the other for r < 0:

    Pr{Z ≥ b} ≤ g_Z(r) exp(−rb);        (Chernoff bound for r > 0, r ∈ I(Z))        (1.62)
    Pr{Z ≤ b} ≤ g_Z(r) exp(−rb).        (Chernoff bound for r < 0, r ∈ I(Z))        (1.63)
This provides us with a family of upper bounds on the tails of the distribution function,
using values of r > 0 for the upper tail and r < 0 for the lower tail. For fixed 0 < r ∈ I(Z),
this bound on Pr{Z ≥ b} decreases exponentially^33 in b at rate r. Similarly, for each
0 > r ∈ I(Z), the bound on Pr{Z ≤ b} decreases exponentially at rate |r| as b → −∞.
We will see shortly that (1.62) is useful only when b > E[Z] and (1.63) is useful only when
b < E[Z].
The most important application of Chernoff bounds is to sums of IID rv's. Let
S_n = X_1 + ··· + X_n, where X_1, ..., X_n are IID with the MGF g_X(r). Then
g_{S_n}(r) = [g_X(r)]^n, so (1.62) and (1.63) (with b replaced by na) become

    Pr{S_n ≥ na} ≤ [g_X(r)]^n exp(−rna);        (for 0 < r ∈ I(X))        (1.64)
    Pr{S_n ≤ na} ≤ [g_X(r)]^n exp(−rna).        (for 0 > r ∈ I(X))        (1.65)
These equations are easier to understand if we define the semi-invariant MGF, γ_X(r), as

    γ_X(r) = ln g_X(r).        (1.66)

The semi-invariant MGF for a typical rv X is sketched in Figure 1.8. The major features
to observe are, first, that γ_X′(0) = E[X] and, second, that γ_X″(r) ≥ 0 for r in the interior of
I(X).
Figure 1.8: Semi-invariant moment generating function γ(r) for a typical rv X, assuming
r_− < 0 < r_+. The key values are γ(0) = 0, γ′(0) = E[X], and γ″(0) = σ_X². Since
γ(r) = ln g(r), we see that dγ(r)/dr = (1/g(r)) dg(r)/dr, so that γ′(0) = E[X]. Also, for r
in the interior of I(X), Exercise 1.26 shows that γ″(r) ≥ 0, and in fact γ″(r) is strictly
positive except in the uninteresting case where X is deterministic (takes on a single value
with probability 1). As indicated in the figure, the straight line of slope E[X] through the
origin is tangent to γ(r).
In terms of γ_X(r), (1.64) and (1.65) become

    Pr{S_n ≥ na} ≤ exp(n[γ_X(r) − ra]);        (for 0 < r ∈ I(X))        (1.67)
    Pr{S_n ≤ na} ≤ exp(n[γ_X(r) − ra]).        (for 0 > r ∈ I(X))        (1.68)
These bounds are geometric in n for fixed a and r, so we should ask what value of r provides
the tightest bound for any given a. Since γ_X″(r) > 0, the tightest bound arises either at
that r for which γ′(r) = a or at one of the end points, r_− or r_+, of I(X). This minimum
value is denoted by^34

    µ_X(a) = inf_r [γ_X(r) − ra].
33 This seems paradoxical, since Z seems to be almost arbitrary. However, since r ∈ I(Z), we have
∫ e^{rb} dF_Z(b) < ∞.

34 The infimum, denoted inf, of a set of numbers is the largest number less than or equal to all numbers in
the set. For example, inf{(0, 1)} = 0, whereas min{(0, 1)} does not exist.
Note that (γ_X(r) − ra)|_{r=0} = 0 and (d/dr)(γ_X(r) − ra)|_{r=0} = E[X] − a. Thus if a > E[X],
then γ_X(r) − ra must be negative for sufficiently small positive r. Similarly, if a < E[X],
then γ_X(r) − ra is negative for negative r sufficiently close^35 to 0. In other words,

    Pr{S_n ≥ na} ≤ exp(nµ_X(a)),        where µ_X(a) < 0 for a > E[X];        (1.69)
    Pr{S_n ≤ na} ≤ exp(nµ_X(a)),        where µ_X(a) < 0 for a < E[X].        (1.70)
This is summarized in the following lemma:
Lemma 1.4.1. Assume that 0 is in the interior of I(X) and let S_n be the sum of n IID
rv's each with the distribution of X. Then µ_X(a) = inf_r [γ_X(r) − ra] < 0 for all a ≠ E[X].
Also, Pr{S_n ≥ na} ≤ e^{nµ_X(a)} for a > E[X] and Pr{S_n ≤ na} ≤ e^{nµ_X(a)} for a < E[X].
Figure 1.9 illustrates the lemma and gives a graphical construction to find
inf_r [γ_X(r) − ra].

Figure 1.9: Graphical minimization of γ(r) − ar: For any r ∈ I(X), γ(r) − ar is the
vertical axis intercept of a line of slope a through the point (r, γ(r)). The minimum
occurs at the r_0 for which the line of slope a is tangent to the curve, i.e., where
γ′(r_0) = a; the intercept there is µ_X(a) = γ(r_0) − r_0 a. The two examples show one
case where E[X] < 0 and another where E[X] > 0.
These Chernoff bounds will be used in the next section to help understand several laws of
large numbers. They will also be used extensively in Chapter 9 and are useful for detection,
random walks, and information theory.

The following example evaluates these bounds for the case where the IID rv's are binary.
We will see that in this case the bounds are exponentially tight in a sense to be described.
Example 1.4.1. Let X be binary with p_X(1) = p and p_X(0) = q = 1 − p. Then g_X(r) =
q + pe^r for −∞ < r < ∞. Also, γ_X(r) = ln(q + pe^r). To be consistent with the expression
for the binomial PMF in (1.24), we will find bounds to Pr{S_n ≥ np̃} and Pr{S_n ≤ np̃} for
p̃ > p and p̃ < p respectively. Thus, according to Lemma 1.4.1, we first evaluate

    µ_X(p̃) = inf_r [γ_X(r) − p̃r].
35 In fact, for r sufficiently small, γ(r) can be approximated by a second order power series,
γ(r) ≈ γ(0) + rγ′(0) + (r²/2)γ″(0) = rX̄ + (r²/2)σ_X². It follows that µ_X(a) ≈ −(a − X̄)²/2σ_X² for very small r.

36 As a special case, the infimum might occur at the edge of the interval of convergence, i.e., at r_− or r_+.
As shown in Exercise 1.25, the infimum can be at r_+ (r_−) only if g_X(r_+) (g_X(r_−)) exists, and in this case,
the graphical technique in Figure 1.9 still works.
The minimum occurs at that r for which γ_X′(r) = p̃, i.e., at

    pe^r / (q + pe^r) = p̃.

Rearranging terms,

    e^r = p̃q / (pq̃),        where q̃ = 1 − p̃.        (1.71)

Substituting this minimizing value of r into ln(q + pe^r) − rp̃ and rearranging terms,

    µ_X(p̃) = p̃ ln(p/p̃) + q̃ ln(q/q̃).        (1.72)
Substituting this into (1.69) and (1.70), we get the following Chernoff bounds for binary
IID rv's. As shown above, they are exponentially decreasing in n:

    Pr{S_n ≥ np̃} ≤ exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]};        for p̃ > p        (1.73)
    Pr{S_n ≤ np̃} ≤ exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]};        for p̃ < p.        (1.74)
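The exponent (1.72) and bound (1.73) are easy to evaluate against the exact binomial tail. The sketch below uses arbitrary parameters n = 100, p = 0.25, p̃ = 0.40:

```python
import math

# Chernoff exponent (1.72) and bound (1.73) for binary IID rv's, compared
# with the exact binomial upper tail.  Parameter values are arbitrary.
def mu(p, pt):
    q, qt = 1.0 - p, 1.0 - pt
    return pt * math.log(p / pt) + qt * math.log(q / qt)

def exact_upper_tail(n, p, pt):
    k0 = math.ceil(n * pt)
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(k0, n + 1))

n, p, pt = 100, 0.25, 0.40
bound = math.exp(n * mu(p, pt))
exact = exact_upper_tail(n, p, pt)
print(exact <= bound)              # True: the bound always holds
print(round(exact, 6), round(bound, 6))
```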
So far, it seems that we have simply developed another upper bound on the tails of the
distribution function for the binomial. It will then perhaps be surprising to compare this
bound with the asymptotically correct value (repeated below) for the binomial PMF in
(1.27),

    p_{S_n}(k) ∼ √(1/(2πnp̃q̃)) exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]},        for p̃ = k/n.        (1.75)
For any integer value of np̃ with p̃ > p, we can lower bound Pr{S_n ≥ np̃} by the single term
p_{S_n}(np̃). Thus Pr{S_n ≥ np̃} is both upper and lower bounded by quantities that decrease
exponentially with n at the same rate. The difference between the upper bound and the
asymptotic lower bound is essentially negligible for large n. We can express this analytically
by considering the log of the upper bound in (1.73) and the lower bound in (1.75):
    lim_{n→∞} (1/n) ln Pr{S_n ≥ np̃} = p̃ ln(p/p̃) + q̃ ln(q/q̃),        where p̃ > p.        (1.76)

In the same way, for p̃ < p,

    lim_{n→∞} (1/n) ln Pr{S_n ≤ np̃} = p̃ ln(p/p̃) + q̃ ln(q/q̃),        where p̃ < p.        (1.77)
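The convergence in (1.76) can be observed numerically: the normalized log of the exact binomial tail approaches µ_X(p̃) from below as n grows. A sketch (parameters arbitrary):

```python
import math

# Normalized log of the exact binomial tail versus the Chernoff exponent
# mu_X(p~) of (1.72); the difference shrinks as n grows.
def mu(p, pt):
    q, qt = 1.0 - p, 1.0 - pt
    return pt * math.log(p / pt) + qt * math.log(q / qt)

def tail_exponent(n, p, pt):
    """(1/n) ln Pr{S_n >= n p~} for S_n binomial(n, p)."""
    k0 = math.ceil(n * pt)
    tail = sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(k0, n + 1))
    return math.log(tail) / n

p, pt = 0.25, 0.40
for n in (50, 200, 800):
    print(n, round(tail_exponent(n, p, pt), 4), round(mu(p, pt), 4))
```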
In other words, these Chernoff bounds are not only upper bounds, but are also exponentially
correct in the sense of (1.76) and (1.77). In Chapter 9 we will show that this property is
typical for sums of IID rv's. Thus we see that the Chernoff bounds are not 'just bounds,'
but rather are bounds that when optimized provide the correct asymptotic exponent for the
tails of the distribution of sums of IID rv's. In this sense these bounds are quite different
from the Markov and Chebyshev bounds.
1.5 The laws of large numbers
The laws of large numbers are a collection of results in probability theory that describe
the behavior of the arithmetic average of n rv's for large n. For any n rv's X_1, ..., X_n,
the arithmetic average is the rv (1/n) Σ_{i=1}^n X_i. Since in any outcome of the experiment,
the sample value of this rv is the arithmetic average of the sample values of X_1, ..., X_n,
this random variable is usually called the sample average. If X_1, ..., X_n are viewed as
successive variables in time, this sample average is called the time-average. Under fairly
general assumptions, the standard deviation of the sample average goes to 0 with increasing
n, and, in various ways depending on the assumptions, the sample average approaches the
mean.
These results are central to the study of stochastic processes because they allow us to relate
time-averages (i.e., the average over time of individual sample paths) to ensemble-averages
(i.e., the mean of the value of the process at a given time). In this section, we develop
and discuss one of these results, the weak law of large numbers for IID rv's. We also
briefly discuss another of these results, the strong law of large numbers. The strong law
requires considerable patience to understand, and its derivation and fuller discussion are
postponed to Chapter 5, where it is first needed. We also discuss the central limit theorem,
partly because it enhances our understanding of the weak law, and partly because of its
importance in its own right.
1.5.1 Weak law of large numbers with a finite variance
Let X_1, X_2, ..., X_n be IID rv's with a finite mean X̄ and finite variance σ_X². Let S_n =
X_1 + ··· + X_n, and consider the sample average S_n/n. We saw in (1.44) that σ_{S_n}² = nσ_X².
Thus the variance of S_n/n is

    VAR[S_n/n] = E[((S_n − nX̄)/n)²] = (1/n²) E[(S_n − nX̄)²] = σ_X²/n.        (1.78)
This says that the standard deviation of the sample average S_n/n is σ/√n, which approaches
0 as n increases. Figure 1.10 illustrates this decrease in the standard deviation of S_n/n with
increasing n. In contrast, recall that Figure 1.5 illustrated how the standard deviation of
S_n increases with n. From (1.78), we see that

    lim_{n→∞} E[(S_n/n − X̄)²] = 0.        (1.79)
As a result, we say that S_n/n converges in mean square to X̄.

This convergence in mean square says that the sample average, S_n/n, differs from the mean,
X̄, by a random variable whose standard deviation approaches 0 with increasing n. This
convergence in mean square is one sense in which S_n/n approaches X̄, but the idea of a
sequence of rv's (i.e., a sequence of functions) approaching a constant is clearly much more
involved than a sequence of numbers approaching a constant. The laws of large numbers
Figure 1.10: The same distribution as Figure 1.5, scaled differently to give the distribution
function F_{Z_n}(z) of the sample average Z_n = S_n/n for n = 4, 20, and 50. It can be
visualized that as n increases, the distribution function of Z_n becomes increasingly close
to a unit step at the mean, 0.25, of the variables X being summed.
bring out this central idea in a more fundamental, and usually more useful, way. We start
the development by applying the Chebyshev inequality (1.61) to the sample average,

    Pr{|S_n/n − X̄| > ε} ≤ σ²/(nε²).        (1.80)
This is an upper bound on the probability that S_n/n differs by more than ε from its mean, X̄.
This is illustrated in Figure 1.10, which shows the distribution function of S_n/n for various
n. The figure suggests that lim_{n→∞} F_{S_n/n}(z) = 0 for all z < X̄ and lim_{n→∞} F_{S_n/n}(z) = 1
for all z > X̄. This is stated more cleanly in the following weak law of large numbers,
abbreviated WLLN.
Theorem 1.5.1 (WLLN with finite variance). For each integer $n \ge 1$, let $S_n = X_1 + \cdots + X_n$ be the sum of $n$ IID rv's with a finite variance. Then

$$\lim_{n\to\infty} \Pr\left\{\left|\frac{S_n}{n} - \overline{X}\right| > \epsilon\right\} = 0 \qquad \text{for every } \epsilon > 0. \qquad (1.81)$$

Proof: For every $\epsilon > 0$, $\Pr\{|S_n/n - \overline{X}| > \epsilon\}$ is bounded between 0 and $\sigma^2/n\epsilon^2$. Since the upper bound goes to 0 with increasing $n$, the theorem is proved.
Discussion: The algebraic proof above is both simple and rigorous. However, the graphical description in Figure 1.11 probably provides more intuition about how the limit takes place. It is important to understand both.
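To see how conservative the Chebyshev bound in (1.80) typically is, one can compare it with an empirical estimate of the tail probability. The sketch below is a hypothetical illustration using Bernoulli(1/4) rv's; the parameter choices are mine, not the text's:

```python
import random

def tail_prob(n, eps, trials=5000, p=0.25, seed=2):
    """Empirical Pr{|S_n/n - p| > eps} for IID Bernoulli(p) rv's."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < p else 0 for _ in range(n))
        if abs(s / n - p) > eps:
            count += 1
    return count / trials

def chebyshev_bound(n, eps, p=0.25):
    """Right side of (1.80): sigma^2/(n eps^2) with sigma^2 = pq."""
    return p * (1 - p) / (n * eps * eps)

for n in (25, 100, 400):
    print(n, tail_prob(n, 0.1), min(1.0, chebyshev_bound(n, 0.1)))
```

In this example the true tail probability sits far below the $1/n$ bound, which anticipates the exponential bounds discussed after (1.84).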
We refer to (1.81) as saying that $S_n/n$ converges to $\overline{X}$ in probability. To make sense out of this, we should view $\overline{X}$ as a deterministic random variable, i.e., a rv that takes the value $\overline{X}$
1.5. THE LAWS OF LARGE NUMBERS
Figure 1.11: Approximation of the distribution function $F_{S_n/n}$ of a sample average by a step function at the mean: From (1.80), the probability $\delta$ that $S_n/n$ differs from $\overline{X}$ by more than $\epsilon$ (i.e., $\Pr\{|S_n/n - \overline{X}| \ge \epsilon\}$) is at most $\sigma^2/n\epsilon^2$. The complementary event, where $|S_n/n - \overline{X}| < \epsilon$, has probability $1 - \delta \ge 1 - \sigma^2/n\epsilon^2$. This means that we can construct a rectangle of width $2\epsilon$ centered on $\overline{X}$ and of height $1 - \delta$ such that $F_{S_n/n}$ enters the rectangle at the lower left (say at $(\overline{X} - \epsilon,\ \delta_1)$) and exits at the upper right (say at $(\overline{X} + \epsilon,\ 1 - \delta_2)$), where $\delta_1 + \delta_2 = \delta \le \sigma^2/n\epsilon^2$. Now visualize increasing $n$ while holding $\epsilon$ fixed. In the limit, $1 - \delta \to 1$, so $\Pr\{|S_n/n - \overline{X}| \ge \epsilon\} \to 0$. Since this is true for every $\epsilon > 0$ (usually with slower convergence as $\epsilon$ gets smaller), $F_{S_n/n}(z)$ approaches 0 for every $z < \overline{X}$ and approaches 1 for every $z > \overline{X}$, i.e., $F_{S_n/n}$ approaches a unit step at $\overline{X}$. Note that there are two 'fudge factors' here, $\epsilon$ and $\delta$, and, since we are approximating an entire distribution function, neither can be omitted, except by directly going to a limit as $n \to \infty$.
for each sample point of the space. Then (1.81) says that the probability that the absolute difference, $|S_n/n - \overline{X}|$, exceeds any given $\epsilon > 0$ goes to 0 as $n \to \infty$.^{37}
One should ask at this point what (1.81) adds to the more specific bound in (1.80). In particular, (1.80) provides an upper bound on the rate of convergence for the limit in (1.81). The answer is that (1.81) remains valid when the theorem is generalized. For variables that are not IID or have an infinite variance, (1.80) is no longer necessarily valid. In some situations, as we see later, it is valuable to know that (1.81) holds, even if the rate of convergence is extremely slow or unknown.
One difficulty with the bound in (1.80) is that it is extremely loose in most cases. If $S_n/n$ actually approached $\overline{X}$ this slowly, the weak law of large numbers would often be more a mathematical curiosity than a highly useful result. If we assume that the MGF of $X$ exists in an open interval around 0, then (1.80) can be strengthened considerably. Recall from (1.69) and (1.70) that for any $\epsilon > 0$,

$$\Pr\{S_n/n - \overline{X} \ge \epsilon\} \le \exp\left(n\mu_X(\overline{X} + \epsilon)\right) \qquad (1.82)$$
$$\Pr\{S_n/n - \overline{X} \le -\epsilon\} \le \exp\left(n\mu_X(\overline{X} - \epsilon)\right), \qquad (1.83)$$

where, from Lemma 1.4.1, $\mu_X(a) = \inf_r \{\gamma_X(r) - ra\} < 0$ for $a \ne \overline{X}$. Thus, for any $\epsilon > 0$,

$$\Pr\{|S_n/n - \overline{X}| \ge \epsilon\} \le \exp[n\mu_X(\overline{X} + \epsilon)] + \exp[n\mu_X(\overline{X} - \epsilon)]. \qquad (1.84)$$
^{37}Saying this in words gives one added respect for mathematical notation, and perhaps in this case, it is preferable to simply understand the mathematical statement (1.81).
The bound here, for any fixed $\epsilon > 0$, decreases geometrically in $n$ rather than harmonically. In terms of Figure 1.11, the height of the rectangle must approach 1 at least geometrically in $n$.
1.5.2 Relative frequency
We next show that (1.81) (and similarly (1.84)) can be applied to the relative frequency of an event as well as to the sample average of a random variable. Suppose that $A$ is some event in a single experiment, and that the experiment is independently repeated $n$ times. Then, in the probability model for the $n$ repetitions, let $A_i$ be the event that $A$ occurs at the $i$th trial, $1 \le i \le n$. The events $A_1, A_2, \ldots, A_n$ are then IID.

If we let $\mathbb{I}_{A_i}$ be the indicator rv for $A$ on the $i$th trial, then the rv $S_n = \mathbb{I}_{A_1} + \mathbb{I}_{A_2} + \cdots + \mathbb{I}_{A_n}$ is the number of occurrences of $A$ over the $n$ trials. It follows that

$$\text{relative frequency of } A = \frac{S_n}{n} = \frac{\sum_{i=1}^{n} \mathbb{I}_{A_i}}{n}. \qquad (1.85)$$

Thus the relative frequency of $A$ is the sample average of the binary rv's $\mathbb{I}_{A_i}$, and everything we know about the sum of IID rv's applies equally to the relative frequency of an event. In fact, everything we know about sums of IID binary rv's applies to relative frequency.
1.5.3 The central limit theorem
The weak law of large numbers says that with high probability, $S_n/n$ is close to $\overline{X}$ for large $n$, but it establishes this via an upper bound on the tail probabilities rather than an estimate of what $F_{S_n/n}$ looks like. If we look at the shape of $F_{S_n/n}$ for various values of $n$ in the example of Figure 1.10, we see that the function $F_{S_n/n}$ becomes increasingly compressed around $\overline{X}$ as $n$ increases (in fact, this is the essence of what the weak law is saying). If we normalize the random variable $S_n/n$ to 0 mean and unit variance, we get a normalized rv, $Z_n = (S_n/n - \overline{X})\sqrt{n}/\sigma$. The distribution function of $Z_n$ is illustrated in Figure 1.12 for the same underlying $X$ as used for $S_n/n$ in Figure 1.10. The curves in the two figures are the same except that each curve has been horizontally scaled by $\sqrt{n}$ in Figure 1.12.
Inspection of Figure 1.12 shows that the normalized distribution functions there seem to be approaching a limiting distribution. The critically important central limit theorem states that there is indeed such a limit, and it is the normalized Gaussian distribution function.

Theorem 1.5.2 (Central limit theorem (CLT)). Let $X_1, X_2, \ldots$ be IID rv's with finite mean $\overline{X}$ and finite variance $\sigma^2$. Then for every real number $z$,

$$\lim_{n\to\infty} \Pr\left\{\frac{S_n - n\overline{X}}{\sigma\sqrt{n}} \le z\right\} = \Phi(z), \qquad (1.86)$$

where $\Phi(z)$ is the normal distribution function, i.e., the Gaussian distribution with mean 0 and variance 1,

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) dy.$$
Figure 1.12: The same distribution functions as Figure 1.5, normalized to 0 mean and unit standard deviation, i.e., the distribution functions of $Z_n = (S_n/n - \overline{X})\sqrt{n}/\sigma_X$ for $n = 4$, 20, 50. Note that as $n$ increases, the distribution function of $Z_n$ slowly starts to resemble the normal distribution function.
Discussion: The rv $Z_n = (S_n - n\overline{X})/(\sigma\sqrt{n})$, for each $n \ge 1$ on the left side of (1.86), has mean 0 and variance 1. The central limit theorem (CLT), as expressed in (1.86), says that the sequence of distribution functions, $F_{Z_1}(z), F_{Z_2}(z), \ldots$, converges at each value of $z$ to $\Phi(z)$ as $n \to \infty$. In other words, $\lim_{n\to\infty} F_{Z_n}(z) = \Phi(z)$ for each $z \in \mathbb{R}$. This is called convergence in distribution, since it is the sequence of distribution functions, rather than the sequence of rv's, that is converging. The theorem is illustrated by Figure 1.12.
The CLT tells us quite a bit about how $F_{S_n/n}$ converges to a step function at $\overline{X}$. To see this, rewrite (1.86) in the form

$$\lim_{n\to\infty} \Pr\left\{\frac{S_n}{n} - \overline{X} \le \frac{\sigma z}{\sqrt{n}}\right\} = \Phi(z). \qquad (1.87)$$

This is illustrated in Figure 1.13, where we have used $\Phi(z)$ as an approximation for the probability on the left.
The reason why the word central appears in the CLT can also be seen from (1.87). Asymptotically, we are looking at a limit (as $n \to \infty$) of the probability that the sample average differs from the mean by at most a quantity going to 0 as $1/\sqrt{n}$. This should be contrasted with the corresponding optimized Chernoff bound in (1.84), which looks at the limit of the probability that the sample average differs from the mean by at most a constant amount. Those latter results are exponentially decreasing in $n$ and are known as large deviation results.
Theorem 1.5.2 says nothing about the rate of convergence to the normal distribution. The
Berry-Esseen theorem (see, for example, Feller, [8]) provides some guidance about this for
Figure 1.13: Approximation of the distribution function $F_{S_n/n}$ of a sample average by a Gaussian distribution of the same mean and variance. Whenever $n$ is increased by a factor of 4, the curve is horizontally scaled inward toward $\overline{X}$ by a factor of 2. The CLT says both that these curves are scaled horizontally as $1/\sqrt{n}$ and also that they are better approximated by the Gaussian of the given mean and variance as $n$ increases.
cases in which the third central moment $E\left[|X - \overline{X}|^3\right]$ exists. This theorem states that

$$\left|\Pr\left\{\frac{S_n - n\overline{X}}{\sigma\sqrt{n}} \le z\right\} - \Phi(z)\right| \le \frac{C\, E\left[|X - \overline{X}|^3\right]}{\sigma^3 \sqrt{n}}, \qquad (1.88)$$

where $C$ can be upper bounded by 0.766 (later improved to 0.4784). We will come back shortly to discuss convergence in greater detail.
The CLT helps explain why Gaussian rv’s play such a central role in probability theory.
In fact, many of the cookbook formulas of elementary statistics are based on the tacit
assumption that the underlying variables are Gaussian, and the CLT helps explain why
these formulas often give reasonable results.
One should be careful to avoid reading more into the CLT than it says. For example, the normalized sum, $(S_n - n\overline{X})/(\sigma\sqrt{n})$, need not have a density that is approximately Gaussian. In fact, if the underlying variables are discrete, the normalized sum is discrete and has no density. The PMF of the normalized sum might have very detailed and wild fine structure; this does not disappear as $n$ increases, but becomes "integrated out" in the distribution function.
A proof of the CLT requires mathematical tools that will not be needed subsequently.^{38} Thus we give a proof only for the binomial case. Before doing this, however, we will show that the PMF for $S_n$ in the binomial case approaches a sampled form of the Gaussian density. This detailed form of the PMF does not follow from the CLT and is often valuable in its own right.
Theorem 1.5.3. Let $\{X_i;\, i \ge 1\}$ be a sequence of IID binary rv's with $p = p_X(1) > 0$ and $q = 1 - p = p_X(0) > 0$. Let $S_n = X_1 + \cdots + X_n$ for each $n \ge 1$ and let $\alpha$ be a fixed constant satisfying $1/2 < \alpha < 2/3$. Then constants $C$ and $n_o$ exist such that for all integers $k$ such that $|k - np| \le n^{\alpha}$,

$$p_{S_n}(k) = \frac{1}{\sqrt{2\pi npq}} \exp\left(-\frac{(k - np)^2}{2npq}\right)\left(1 \pm Cn^{3\alpha - 2}\right) \qquad \text{for } n \ge n_o, \qquad (1.89)$$

where this 'equation' is to be interpreted as an upper bound when the $\pm$ sign is replaced with $+$ and a lower bound with $-$.

^{38}Many elementary texts provide 'simple proofs,' using transform techniques, but, among other issues, these techniques often indicate that the normalized sum has a density that approaches the Gaussian density; this is incorrect for all discrete rv's. The simplest correct proof known by the author is given by Feller ([7] and [8]).
Conventional proofs for this type of theorem, outlined in Exercise 1.35, use the Stirling approximation and might be simpler computationally than the one here. The proof below has the merits of being self-contained and also of showing exactly why the quadratic term in the exponent arises.
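Before reading the proof, it may help to check (1.89) numerically: near $k = np$, the exact binomial PMF and the sampled Gaussian density agree to a relative error that shrinks with $n$. This is a hypothetical sketch of my own, not part of the text:

```python
import math

def binom_pmf(n, k, p):
    """Exact binomial PMF, computed via log-factorials for stability."""
    q = 1.0 - p
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + k * math.log(p) + (n - k) * math.log(q))

def gauss_approx(n, k, p):
    """The sampled Gaussian density on the right side of (1.89)."""
    q = 1.0 - p
    return (math.exp(-(k - n * p) ** 2 / (2 * n * p * q))
            / math.sqrt(2 * math.pi * n * p * q))

# Relative error at the central term shrinks as n grows.
p = 0.25
for n in (100, 1000, 10000):
    k = round(n * p)
    exact, approx = binom_pmf(n, k, p), gauss_approx(n, k, p)
    print(n, exact, approx, abs(exact / approx - 1))
```

The observed shrinkage of the relative error is consistent with the $1 \pm Cn^{3\alpha-2}$ factor in (1.89), though this numerical check is of course no substitute for the proof.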
Proof:^{39} Recall from (1.23) that

$$p_{S_n}(k) = \binom{n}{k} p^k q^{n-k}.$$

To understand how this varies with $k$, consider the ratio of adjacent terms,

$$\frac{p_{S_n}(k+1)}{p_{S_n}(k)} = \frac{\frac{n!}{(k+1)!(n-k-1)!}\, p^{k+1} q^{n-k-1}}{\frac{n!}{k!(n-k)!}\, p^k q^{n-k}} = \frac{n-k}{k+1}\cdot\frac{p}{q}. \qquad (1.90)$$
This is strictly decreasing in $k$. Note that if $k \ge pn$, then also $n - k \le qn$, so the ratio in (1.90) is less than 1. Similarly, if $k + 1 \le pn$, then the ratio is greater than 1. Thus

$$\frac{p_{S_n}(k+1)}{p_{S_n}(k)} \;\begin{cases} < 1 & \text{for } k \ge pn \\ > 1 & \text{for } k + 1 \le pn. \end{cases} \qquad (1.91)$$

This implies that $p_{S_n}(k)$ is maximized over $k$ either at $k = \lfloor pn \rfloor$ or at $k = \lceil pn \rceil$.
We now analyze (1.90) for large $n$ where $|k - np| \le n^{\alpha}$. To simplify the algebra, let $\delta = pn - \lfloor pn \rfloor$ and replace $k$ with $\lfloor pn \rfloor + i = pn - \delta + i$. Thus (1.90) becomes

$$\frac{p_{S_n}(\lfloor pn\rfloor + i + 1)}{p_{S_n}(\lfloor pn\rfloor + i)} = \frac{n - pn + \delta - i}{pn - \delta + i + 1}\cdot\frac{p}{q} = \frac{nq + \delta - i}{pn - \delta + i + 1}\cdot\frac{p}{q} = \frac{1 - \frac{i-\delta}{nq}}{1 + \frac{i-\delta+1}{np}}.$$

Taking the logarithm,

$$\ln \frac{p_{S_n}(\lfloor pn\rfloor + i + 1)}{p_{S_n}(\lfloor pn\rfloor + i)} = \ln\left(1 - \frac{i-\delta}{nq}\right) - \ln\left(1 + \frac{i-\delta+1}{np}\right).$$

Recall that $\ln(1+x) = x - x^2/2 + \cdots$ for $|x| < 1$, so

$$\ln \frac{p_{S_n}(\lfloor pn\rfloor + i + 1)}{p_{S_n}(\lfloor pn\rfloor + i)} = -\frac{i-\delta}{nq} - \frac{i-\delta+1}{np} + \cdots = -\frac{i - \delta + q}{npq} + \cdots,$$
^{39}This proof can be omitted (or read without a concern for details) without loss of continuity. However, the idea of the proof is important.
Figure 1.14: The PMF $p_{S_n}(k)$ of the binomial distribution is increasing with $k$ for $k < pn$ and decreasing with $k$ for $k > pn$. Also, as shown later, the ratio of adjacent terms is approximately proportional to $-k + pn$.
where we have used $1/p + 1/q = 1/pq$. The neglected terms are of order $i^2/n^2$ and go to zero as $n^{2\alpha - 2}$ for $|k - np| \le n^{\alpha}$.
This says that the log of the adjacent-term ratios is essentially linear in $i$. We now have to combine these ratios to evaluate an arbitrary term $p_{S_n}(k)$ of the PMF in terms of a central term $p_{S_n}(\lfloor pn\rfloor)$. Assuming $j > 0$, we have

$$\ln \frac{p_{S_n}(\lfloor pn\rfloor + j)}{p_{S_n}(\lfloor pn\rfloor)} = \sum_{i=0}^{j-1} \ln \frac{p_{S_n}(\lfloor pn\rfloor + i + 1)}{p_{S_n}(\lfloor pn\rfloor + i)} = \sum_{i=0}^{j-1} \frac{-i + \delta - q}{npq} + \cdots = -\frac{j(j-1)}{2npq} + \frac{j(\delta - q)}{npq} + \cdots, \qquad (1.92)$$
where we have used the fact^{40} that $1 + 2 + \cdots + (j-1) = j(j-1)/2$. The term $j^2$ here is the important term. It arises because the log-ratio of adjacent terms is linear in $i$, so the log-ratio in (1.92) is quadratic in $j$. When this is exponentiated, the PMF will contain the characteristic term $\exp(-j^2/2npq)$ of the discrete Gaussian distribution. We now turn to the somewhat tedious convergence details to complete the proof.

Each term ignored in (1.92) is a sum of $j$ numbers, each of order $j^2/n^2$, and thus the total ignored quantity has order $j^3/n^2$. For $|k - np| \le n^{\alpha}$ and $\alpha < 2/3$, the magnitude of the ignored terms can be upper bounded, for all large $n \ge n_1$, by $C_1 n^{3\alpha - 2}$, where $C_1$ depends on $p$ and $\alpha$ and $n_1$ depends on $p$, $\alpha$, and the particular choice of $C_1$. The parameters $C_1$ and $n_1$ are of no particular interest other than being finite and independent of $j$.
By rearranging the terms in (1.92) and including the bound on neglected terms,

$$\ln \frac{p_{S_n}(\lfloor pn\rfloor + j)}{p_{S_n}(\lfloor pn\rfloor)} = -\frac{(j-\delta)^2}{2npq} + \frac{j(1-2q) + \delta^2}{2npq} \pm C_1 n^{3\alpha - 2} \qquad \text{for } n \ge n_1. \qquad (1.93)$$
The second term has order $j/n$, and its magnitude can be bounded by $C_2 n^{\alpha - 1}$ for some constant $C_2$ over the range where $|k - np| \le n^{\alpha}$. Since $\alpha - 1 < 3\alpha - 2$ for $\alpha > 1/2$, the term
^{40}To see this, write $(j-1) + (j-2) + \cdots + 1$ underneath $1 + 2 + \cdots + (j-1)$. Then add all these terms pairwise and divide by 2.
$C_2 n^{\alpha - 1}$ can be absorbed into $C_1 n^{3\alpha - 2}$ by increasing $C_1$. Thus (1.93) can be modified to

$$\ln \frac{p_{S_n}(\lfloor pn\rfloor + j)}{p_{S_n}(\lfloor pn\rfloor)} = -\frac{(j-\delta)^2}{2npq} \pm C_1 n^{3\alpha - 2} \qquad \text{for } n \ge n_1. \qquad (1.94)$$

The term $C_1 n^{3\alpha - 2}$ used for the upper and lower bound might be larger than the quantity of interest for small values of $j$, but we see later that this causes no problem. This same argument can be repeated for $j < 0$, with the result that (1.94) applies to all $j$ such that $|k - pn| \le n^{\alpha}$.
Now recalling that $j = k - \lfloor pn\rfloor$, we see that $j - \delta = k - pn$, so (1.94) becomes

$$\ln \frac{p_{S_n}(k)}{p_{S_n}(\lfloor pn\rfloor)} = -\frac{(k - pn)^2}{2npq} \pm C_1 n^{3\alpha - 2} \qquad \text{for } n \ge n_1. \qquad (1.95)$$
We now exponentiate both sides of (1.95). Using the same convention for upper and lower bounds,

$$p_{S_n}(k) = p_{S_n}(\lfloor pn\rfloor) \exp\left(-\frac{(k - pn)^2}{2npq}\right) \exp\left(\pm C_1 n^{3\alpha - 2}\right) \qquad \text{for } n \ge n_1. \qquad (1.96)$$

Note that for $0 \le x \le 1$, we have $e^x < 1 + 2x$ and $e^{-x} \ge 1 - 2x$. Now choose $n_2 \ge n_1$ so that $C_1 n_2^{3\alpha - 2} \le 1$. Then for $n \ge n_2$,

$$\exp\left(C_1 n^{3\alpha - 2}\right) \le 1 + 2C_1 n^{3\alpha - 2}; \qquad \exp\left(-C_1 n^{3\alpha - 2}\right) \ge 1 - 2C_1 n^{3\alpha - 2},$$

so

$$p_{S_n}(k) = p_{S_n}(\lfloor pn\rfloor) \exp\left(-\frac{(k - pn)^2}{2npq}\right)\left(1 \pm 2C_1 n^{3\alpha - 2}\right) \qquad \text{for } |k - pn| \le n^{\alpha}. \qquad (1.97)$$
To complete the proof, we must show that $p_{S_n}(\lfloor pn\rfloor) \sim \frac{1}{\sqrt{2\pi npq}}$. In order to do this, first note that $\sum_k p_{S_n}(k) = 1$ (since $p_{S_n}$ is a PMF). First we show that the tails of this distribution are negligible. The terms with $k - np < -n^{\alpha}$ are increasing in $k$ and upper bounded by the first term with $k - np \ge -n^{\alpha}$. Thus, for $n \ge n_2$,

$$\sum_{k - pn < -n^{\alpha}} p_{S_n}(k) \le n\, p_{S_n}(\lfloor pn\rfloor) \exp\left(\frac{-n^{2\alpha}}{2npq}\right)\left(1 + 2C_1 n^{3\alpha - 2}\right). \qquad (1.98)$$
Since $p_{S_n}(\lfloor pn\rfloor) \le 1$, this sum goes to zero with $n$ as $\exp(-n^{2\alpha - 1})$ (where $2\alpha - 1 > 0$). The sum of terms with $k - np > n^{\alpha}$ goes to 0 in the same way. Representing the sum of these tail terms by $\epsilon$, using (1.97) for all other terms, and assuming $n \ge n_2$, we have

$$p_{S_n}(\lfloor pn\rfloor) \left(\sum_{|k - pn| \le n^{\alpha}} \exp\left(\frac{-(k - pn)^2}{2npq}\right)\right)\left(1 \pm 2C_1 n^{3\alpha - 2}\right) = 1 - \epsilon. \qquad (1.99)$$
Since $\epsilon$ goes to 0 with increasing $n$ faster than $C_1 n^{3\alpha - 2}$, we can omit it (perhaps by increasing $C_1$ and $n_2$). Next let $\phi(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)$ be the density of a normalized Gaussian rv. Consider the following Riemann sum approximation to $\int_{-\infty}^{\infty} \phi(x)\, dx = 1$:

$$\sum_{k=-\infty}^{\infty} \frac{1}{\sqrt{2\pi npq}} \exp\left(-\frac{(k - pn)^2}{2npq}\right) \sim 1. \qquad (1.100)$$
The spacing between samples is $1/\sqrt{npq}$. The difference between the two sides of (1.100) goes to 0 with $n$ as $n^{-1/2}$. Also, in the same way as in (1.99), the portion of this sum over $|k - pn| > n^{\alpha}$ goes to zero with $n$ as $\exp(-n^{2\alpha - 1})$. Thus, for sufficiently large $n_3$ and $C_3$,

$$\sum_{|k - pn| \le n^{\alpha}} \frac{1}{\sqrt{2\pi npq}} \exp\left(-\frac{(k - pn)^2}{2npq}\right)\left(1 \pm C_3 n^{-1/2}\right) = 1 \qquad \text{for } n \ge n_3. \qquad (1.101)$$
Comparing (1.99) and (1.101), we see that if $n \ge n_2$ and $n \ge n_3$, then

$$p_{S_n}(\lfloor pn\rfloor)\left(1 \pm 2C_1 n^{3\alpha - 2}\right) = \frac{1}{\sqrt{2\pi npq}}\left(1 \pm C_3 n^{-1/2}\right).$$

Since $n^{-1/2}$ converges to 0 faster than $n^{3\alpha - 2}$, we see that there is a $C_4$ and $n_4$ such that

$$p_{S_n}(\lfloor pn\rfloor) = \frac{1}{\sqrt{2\pi npq}}\left(1 \pm C_4 n^{3\alpha - 2}\right) \qquad \text{for } n \ge n_4.$$

Using this in (1.97) and combining all these error terms, for a sufficiently large $C$ and $n_o$, we have (1.89).
Proof of Theorem 1.5.2 (binomial case): The central limit theorem (for the binary case) in the form of Theorem 1.5.2 simply converts the PMF of (1.89) into a distribution function. To do this carefully, we first choose arbitrary real numbers $z_0 < z$ and show that

$$\lim_{n\to\infty} \Pr\left\{z_0 \le \frac{S_n - n\overline{X}}{\sigma\sqrt{n}} \le z\right\} = \int_{z_0}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) dy. \qquad (1.102)$$
To do this, choose $\alpha \in \left(\frac{1}{2}, \frac{2}{3}\right)$ and, for that $\alpha$, choose $n_o$ and $C$ to satisfy (1.89). Then choose $m$ to satisfy $m \ge n_o$ and $-m^{\alpha} \le z_0\sqrt{mpq} < z\sqrt{mpq} \le m^{\alpha}$. Since $\alpha > 1/2$, this guarantees that

$$-n^{\alpha} \le z_0\sqrt{npq} < z\sqrt{npq} \le n^{\alpha} \qquad \text{for } n \ge m. \qquad (1.103)$$
For $n \ge m$, we then have

$$\Pr\left\{z_0 \le \frac{S_n - np}{\sqrt{npq}} \le z\right\} = \sum_{k=\lceil np + z_0\sqrt{npq}\rceil}^{\lfloor np + z\sqrt{npq}\rfloor} p_{S_n}(k) = \left(1 \pm Cn^{3\alpha - 2}\right) \sum_{k=\lceil np + z_0\sqrt{npq}\rceil}^{\lfloor np + z\sqrt{npq}\rfloor} \frac{1}{\sqrt{2\pi npq}} \exp\left(-\frac{(k - np)^2}{2npq}\right). \qquad (1.104)$$
As seen from (1.103), each term in (1.104) satisfies $|k - np| \le n^{\alpha}$, which justifies the bounds in the final sum. That sum can be viewed as a Riemann sum for the integral in (1.86). Thus the sum approaches the integral as $n^{-1/2}$. Taking the limit as $n \to \infty$ in (1.104), the term $Cn^{3\alpha - 2}$ approaches 0, justifying (1.102). The theorem follows by taking the limit $z_0 \to -\infty$.
Since the CLT provides such explicit information about the convergence of $S_n/n$ to $\overline{X}$, it is reasonable to ask why the weak law of large numbers (WLLN) is so important. The first reason is that the WLLN is so simple that it can be used to give clear insights into situations where the CLT could confuse the issue. A second reason is that the CLT requires a variance, where, as we see next, the WLLN does not. A third reason is that the WLLN can be extended to many situations in which the variables are not independent and/or not identically distributed.^{41} A final reason is that the WLLN provides an upper bound on the tails of $F_{S_n/n}$, whereas the CLT provides only an approximation.
1.5.4 Weak law with an infinite variance
We now establish the WLLN without assuming a finite variance.

Theorem 1.5.4 (WLLN). For each integer $n \ge 1$, let $S_n = X_1 + \cdots + X_n$, where $X_1, X_2, \ldots$ are IID rv's satisfying $E[|X|] < \infty$. Then for any $\epsilon > 0$,

$$\lim_{n\to\infty} \Pr\left\{\left|\frac{S_n}{n} - E[X]\right| > \epsilon\right\} = 0. \qquad (1.105)$$
Proof:^{42} We use a truncation argument; such arguments are used frequently in dealing with rv's that have infinite variance. The underlying idea in these arguments is important, but some less important details are treated in Exercise 1.42. Let $b$ be a positive number (which we later take to be increasing with $n$), and for each variable $X_i$, define a new rv $\breve{X}_i$ (see Figure 1.15) by

$$\breve{X}_i = \begin{cases} X_i & \text{for } E[X] - b \le X_i \le E[X] + b \\ E[X] + b & \text{for } X_i > E[X] + b \\ E[X] - b & \text{for } X_i < E[X] - b. \end{cases} \qquad (1.106)$$
Figure 1.15: The truncated rv $\breve{X}$ for a given rv $X$ has a distribution function which is truncated at $\overline{X} \pm b$.
The truncated variables $\breve{X}_i$ are IID and, because of the truncation, must have a finite second moment. Thus the WLLN applies to the sample average of $\breve{S}_n = \breve{X}_1 + \cdots + \breve{X}_n$. More particularly, using the Chebyshev inequality in the form of (1.80) on $\breve{S}_n/n$, we get

$$\Pr\left\{\left|\frac{\breve{S}_n}{n} - E[\breve{X}]\right| > \frac{\epsilon}{2}\right\} \le \frac{4\sigma_{\breve{X}}^2}{n\epsilon^2} \le \frac{8bE[|X|]}{n\epsilon^2},$$
^{41}Central limit theorems also hold in many of these more general situations, but they do not hold as widely as the WLLN.

^{42}The details of this proof can be omitted without loss of continuity. However, truncation arguments are important in many places and should be understood at some point.
where Exercise 1.42 demonstrates the final inequality. Exercise 1.42 also shows that $E[\breve{X}]$ approaches $E[X]$ as $b \to \infty$, and thus that

$$\Pr\left\{\left|\frac{\breve{S}_n}{n} - E[X]\right| > \epsilon\right\} \le \frac{8bE[|X|]}{n\epsilon^2}, \qquad (1.107)$$

for all sufficiently large $b$. This bound also applies to $S_n/n$ in the case where $S_n = \breve{S}_n$, so we have the following bound (see Exercise 1.42 for further details):
$$\Pr\left\{\left|\frac{S_n}{n} - E[X]\right| > \epsilon\right\} \le \Pr\left\{\left|\frac{\breve{S}_n}{n} - E[X]\right| > \epsilon\right\} + \Pr\left\{S_n \ne \breve{S}_n\right\}. \qquad (1.108)$$
The original sum $S_n$ is the same as $\breve{S}_n$ unless one of the $X_i$ has an outage, i.e., $|X_i - \overline{X}| > b$. Thus, using the union bound, $\Pr\{S_n \ne \breve{S}_n\} \le n \Pr\{|X_i - \overline{X}| > b\}$. Substituting this and (1.107) into (1.108),

$$\Pr\left\{\left|\frac{S_n}{n} - E[X]\right| > \epsilon\right\} \le \frac{8bE[|X|]}{n\epsilon^2} + \frac{n}{b}\left[b \Pr\{|X - E[X]| > b\}\right]. \qquad (1.109)$$
We now show that for any $\epsilon > 0$ and $\delta > 0$, $\Pr\{|S_n/n - \overline{X}| \ge \epsilon\} \le \delta$ for all sufficiently large $n$. We do this, for given $\epsilon, \delta$, by choosing $b(n)$ for each $n$ so that the first term in (1.109) is equal to $\delta/2$. Thus $b(n) = n\epsilon^2\delta/(16E[|X|])$. This means that $n/b(n)$ in the second term is independent of $n$. Now from (1.60), $\lim_{b\to\infty} b\Pr\{|X - \overline{X}| > b\} = 0$, so by choosing $b(n)$ sufficiently large (and thus $n$ sufficiently large), the second term in (1.109) is also at most $\delta/2$.
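The point of Theorem 1.5.4 can be seen numerically with a rv that has a finite mean but infinite variance. The hypothetical sketch below uses Pareto rv's with tail exponent 1.5 (so the mean is 3 and the variance is infinite); the distribution, parameters, and trial counts are all my choices, not the text's:

```python
import random

def pareto_sample(rng, alpha=1.5):
    """Pareto rv with x_m = 1 via inverse-CDF sampling:
    finite mean alpha/(alpha - 1) = 3, infinite variance since alpha < 2."""
    return rng.random() ** (-1.0 / alpha)

def deviation_prob(n, eps, trials=1000, seed=5):
    """Empirical Pr{|S_n/n - mean| > eps} for the Pareto rv's above."""
    rng = random.Random(seed)
    mean = 3.0
    count = 0
    for _ in range(trials):
        s = sum(pareto_sample(rng) for _ in range(n))
        if abs(s / n - mean) > eps:
            count += 1
    return count / trials

for n in (10, 100, 1000):
    print(n, deviation_prob(n, 1.0))
```

The deviation probability still shrinks with $n$, as (1.105) guarantees, but noticeably more slowly than in the finite-variance examples earlier; (1.80) no longer applies here.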
1.5.5 Convergence of random variables
This section has developed a number of results about how the sequence of sample averages, $\{S_n/n;\, n \ge 1\}$, for a sequence of IID rv's $\{X_i;\, i \ge 1\}$, approaches the mean $\overline{X}$. In the case of the CLT, the limiting distribution around the mean is also specified to be Gaussian. At the outermost intuitive level, i.e., at the level most useful when first looking at some very complicated set of issues, viewing the limit of the sample averages as being essentially equal to the mean is highly appropriate.

At the next intuitive level down, the meaning of the word essentially becomes important and thus involves the details of the above laws. All of the results involve how the rv's $S_n/n$ change with $n$ and become better and better approximated by $\overline{X}$. When we talk about a sequence of rv's (namely a sequence of functions on the sample space) being approximated by a rv or numerical constant, we are talking about some kind of convergence, but it clearly is not as simple as a sequence of real numbers (such as $1/n$, for example) converging to some given number (0, for example).
The purpose of this section is to give names and definitions to these various forms of convergence. This will give us increased understanding of the laws of large numbers already
developed, but, equally important, it will allow us to develop another law of large numbers called the strong law of large numbers (SLLN). Finally, it will put us in a position to use these convergence results later for sequences of rv's other than the sample averages of IID rv's.

We discuss four types of convergence in what follows: convergence in distribution, in probability, in mean square, and with probability 1. For the first three, we first recall the type of large-number result with that type of convergence and then give the general definition. For convergence with probability 1 (WP1), we will define this type of convergence and then provide some understanding of what it means. This will then be used in Chapter 5 to state and prove the SLLN.
We start with the central limit theorem, which, from (1.86), says

$$\lim_{n\to\infty} \Pr\left\{\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma} \le z\right\} = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \qquad \text{for every } z \in \mathbb{R}.$$

This is illustrated in Figure 1.12 and says that the sequence (in $n$) of distribution functions $\Pr\left\{\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma} \le z\right\}$ converges at every $z$ to the normal distribution function at $z$. This is an example of convergence in distribution.
Definition 1.5.1. A sequence of random variables, $Z_1, Z_2, \ldots,$ converges in distribution to a random variable $Z$ if $\lim_{n\to\infty} F_{Z_n}(z) = F_Z(z)$ at each $z$ for which $F_Z(z)$ is continuous.

For the CLT example, the rv's that converge in distribution are $\left\{\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma};\, n \ge 1\right\}$, and they converge in distribution to the normal Gaussian rv.
Convergence in distribution does not say that the rv's themselves converge in any reasonable sense, but only that their distribution functions converge. For example, let $Y_1, Y_2, \ldots$ be IID rv's with the distribution function $F_Y$. For each $n \ge 1$, if we let $Z_n = Y_n + 1/n$, then it is easy to see that $\{Z_n;\, n \ge 1\}$ converges in distribution to $Y$. However (assuming $Y$ has variance $\sigma_Y^2$ and is independent of each $Z_n$), we see that $Z_n - Y$ has variance $2\sigma_Y^2$. Thus $Z_n$ does not get close to $Y$ as $n \to \infty$ in any reasonable sense, and $Z_n - Z_m$ does not get small as $n$ and $m$ both get large.^{43} As an even more trivial example, the sequence $\{Y_n;\, n \ge 1\}$ converges in distribution to $Y$.
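The $Z_n = Y_n + 1/n$ example above can be checked by simulation: the distribution of $Z_n$ matches that of $Y$ ever more closely, yet $Z_n - Y$ keeps variance $2\sigma_Y^2$ however large $n$ is. In this hypothetical sketch the $Y_i$ are uniform on $[-1, 1]$, so $\sigma_Y^2 = 1/3$; the choice of distribution is mine:

```python
import random

def variance_of_difference(n, trials=20000, seed=6):
    """For Z_n = Y_n + 1/n with Y_n and Y independent and uniform on
    [-1, 1] (variance 1/3), estimate Var[Z_n - Y] and return it
    together with the predicted value 2*sigma_Y^2 = 2/3."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        y_n = rng.uniform(-1, 1)        # Y_n, used to form Z_n
        y = rng.uniform(-1, 1)          # independent Y with the same distribution
        diffs.append((y_n + 1.0 / n) - y)
    m = sum(diffs) / trials
    var = sum((d - m) ** 2 for d in diffs) / trials
    return var, 2.0 / 3.0

print(variance_of_difference(1000))
```

No matter how large $n$ gets, the estimated variance of $Z_n - Y$ stays near $2/3$: the distribution functions converge, but the rv's do not.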
For the CLT, it is the rv's $\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma}$ that converge in distribution to the normal. As shown in Exercise 1.45, however, the rv $\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma} - \frac{S_{2n} - 2n\overline{X}}{\sqrt{2n}\,\sigma}$ is not close to 0 in any reasonable sense, even though the two terms have distribution functions that are very close for large $n$.
For the next type of convergence of rv's, the WLLN, in the form of (1.105), says that

$$\lim_{n\to\infty} \Pr\left\{\left|\frac{S_n}{n} - \overline{X}\right| > \epsilon\right\} = 0 \qquad \text{for every } \epsilon > 0.$$

This is an example of convergence in probability, as defined below:
^{43}In fact, saying that a sequence of rv's converges in distribution is unfortunate but standard terminology. It would be just as concise, and far less confusing, to say that a sequence of distribution functions converges rather than saying that a sequence of rv's converges in distribution.
Definition 1.5.2. A sequence of random variables $Z_1, Z_2, \ldots,$ converges in probability to a rv $Z$ if $\lim_{n\to\infty} \Pr\{|Z_n - Z| > \epsilon\} = 0$ for every $\epsilon > 0$.
For the WLLN example, $Z_n$ in the definition is the sample average $S_n/n$ and $Z$ is the constant rv $\overline{X}$. It is probably simpler and more intuitive in thinking about convergence of rv's to think of the sequence of rv's $\{Y_n = Z_n - Z;\, n \ge 1\}$ as converging to 0 in some sense.^{44} As illustrated in Figure 1.10, convergence in probability means that $\{Y_n;\, n \ge 1\}$ converges in distribution to a unit step function at 0.
An equivalent statement, as illustrated in Figure 1.11, is that $\{Y_n;\, n \ge 1\}$ converges in probability to 0 if $\lim_{n\to\infty} F_{Y_n}(y) = 0$ for all $y < 0$ and $\lim_{n\to\infty} F_{Y_n}(y) = 1$ for all $y > 0$. This shows that convergence in probability is a special case of convergence in distribution, since with convergence in probability, the sequence $F_{Y_n}$ of distribution functions converges to a unit step at 0. Note that $\lim_{n\to\infty} F_{Y_n}(y)$ is not specified at $y = 0$. However, the step function is not continuous at 0, so the limit there need not be specified for convergence in distribution.
Convergence in probability says quite a bit more than convergence in distribution. As an important example of this, consider the difference $Y_n - Y_m$ for $n$ and $m$ both large. If $\{Y_n;\, n \ge 1\}$ converges in probability to 0, then $Y_n$ and $Y_m$ are both close to 0 with high probability for large $n$ and $m$, and thus close to each other. More precisely, $\lim_{m\to\infty,\, n\to\infty} \Pr\{|Y_n - Y_m| > \epsilon\} = 0$ for every $\epsilon > 0$. If the sequence $\{Y_n;\, n \ge 1\}$ merely converges in distribution to some arbitrary distribution, then, as we saw, $Y_n - Y_m$ can be large with high probability, even when $n$ and $m$ are large. Another example of this is given in Exercise 1.45.
It appears paradoxical that the CLT is more explicit about the convergence of $S_n/n$ to $\overline{X}$ than the weak law, but it corresponds to a weaker type of convergence. The resolution of this paradox is that the sequence of rv's in the CLT is $\left\{\frac{S_n - n\overline{X}}{\sqrt{n}\,\sigma};\, n \ge 1\right\}$. The presence of $\sqrt{n}$ in the denominator of this sequence provides much more detailed information about how $S_n/n$ approaches $\overline{X}$ with increasing $n$ than the limiting unit step of $F_{S_n/n}$ itself. For example, it is easy to see from the CLT that $\lim_{n\to\infty} F_{S_n/n}(\overline{X}) = 1/2$, which can't be derived directly from the weak law.
Yet another kind of convergence is convergence in mean square (MS). An example of this, for the sample average $S_n/n$ of IID rv's with a variance, is given in (1.79), repeated below:

$$\lim_{n\to\infty} E\left[\left(\frac{S_n}{n} - \overline{X}\right)^2\right] = 0.$$
The general definition is as follows:

Definition 1.5.3. A sequence of rv's $Z_1, Z_2, \ldots,$ converges in mean square (MS) to a rv $Z$ if $\lim_{n\to\infty} E\left[(Z_n - Z)^2\right] = 0$.
Our derivation of the weak law of large numbers (Theorem 1.5.1) was essentially based on the MS convergence of (1.79). Using the same approach, Exercise 1.44 shows in general that

^{44}Definition 1.5.2 gives the impression that convergence to a rv $Z$ is more general than convergence to a constant or convergence to 0, but converting the rv's to $Y_n = Z_n - Z$ makes it clear that this added generality is quite superficial.
convergence in MS implies convergence in probability. Convergence in probability does not imply MS convergence, since as shown in Theorem 1.5.4, the weak law of large numbers holds without the need for a variance.

Figure 1.16 illustrates the relationship between these forms of convergence, i.e., mean square convergence implies convergence in probability, which in turn implies convergence in distribution. The figure also shows convergence with probability 1 (WP1), which is the next form of convergence to be discussed.
Figure 1.16: Relationship between different kinds of convergence: Convergence in distribution is the most general and is implied by all the others. Convergence in probability is the next most general and is implied both by convergence with probability 1 (WP1) and by mean square (MS) convergence, neither of which implies the other.
1.5.6 Convergence with probability 1
Convergence with probability 1, abbreviated as convergence WP1, is often referred to as
convergence a.s. (almost surely) and convergence a.e. (almost everywhere). The strong
law of large numbers, which is discussed briefly in this section and further discussed and
proven in various forms in Chapters 5 and 9, provides an extremely important example of
convergence WP1. The general definition is as follows:
Definition 1.5.4. Let $Z_1, Z_2, \ldots,$ be a sequence of rv's in a sample space $\Omega$ and let $Z$ be another rv in $\Omega$. Then $\{Z_n;\, n \ge 1\}$ is defined to converge to $Z$ with probability 1 (WP1) if

$$\Pr\left\{\omega \in \Omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\right\} = 1. \qquad (1.110)$$

The condition $\Pr\{\omega \in \Omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\} = 1$ is often stated more compactly as $\Pr\{\lim_n Z_n = Z\} = 1$, and even more compactly as $\lim_n Z_n = Z$ WP1, but the form here is the simplest for initial understanding. As discussed in Chapter 5, the SLLN says that if $X_1, X_2, \ldots$ are IID with $E[|X|] < \infty$, then the sequence of sample averages, $\{S_n/n;\, n \ge 1\}$, converges WP1 to $\overline{X}$.
In trying to understand (1.110), note that each sample point ω of the underlying sample
space $ maps to a sample value Z n ( ω) of each rv Z n , and thus maps to a sample path
{ Z n ( ω); n ≥ 1} . Foranygiven ω, suchasamplepathissimplyasequenceofrealnumbers.
That sequence of real numbers might converge to Z ( ω) (which is a real number for the
52
CHAPTER1. INTRODUCTIONAND REVIEWOF PROBABILITY
given ω), it might converge to something else, or it might not converge at all. Thus a set of ω exists for which the corresponding sample path {Zn(ω); n ≥ 1} converges to Z(ω), and a second set for which the sample path converges to something else or does not converge at all. Convergence WP1 of the sequence of rv's is thus defined to occur when the first set of sample paths above is an event that has probability 1.
For each ω, the sequence {Zn(ω); n ≥ 1} is simply a sequence of real numbers, so we briefly review what the limit of such a sequence is. A sequence of real numbers b1, b2, ... is said to have a limit b if, for every ε > 0, there is an integer m_ε such that |bn − b| ≤ ε for all n ≥ m_ε. An equivalent statement is that b1, b2, ... has a limit b if, for every integer k ≥ 1, there is an integer m(k) such that |bn − b| ≤ 1/k for all n ≥ m(k).
Figure 1.17 illustrates this definition for those, like the author, whose eyes blur on the second or third 'there exists', 'such that', etc. in a statement. As illustrated, an important aspect of convergence of a sequence {bn; n ≥ 1} of real numbers is that bn becomes close to b for large n and stays close for all sufficiently large values of n.
Figure 1.17: Illustration of a sequence of real numbers b1, b2, ... that converge to a number b. The figure illustrates an integer m(1) such that for all n ≥ m(1), bn lies in the interval b ± 1. Similarly, for each k ≥ 1, there is an integer m(k) such that bn lies in b ± 1/k for all n ≥ m(k). Thus lim_{n→∞} bn = b means that for a sequence of ever tighter constraints, the kth constraint can be met for all sufficiently large n (i.e., all n ≥ m(k)). Intuitively, convergence means that the elements b1, b2, ... get close to b and stay close. The sequence of positive integers m(1), m(2), ... is nondecreasing, but otherwise arbitrary, depending only on the sequence {bn; n ≥ 1}. For sequences that converge very slowly, the integers m(1), m(2), ... are simply correspondingly larger.
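The m(k) construction in the caption can be made concrete with a short sketch. The particular sequence bn = b + (−1)^n / n below is our own hypothetical example, not one from the text; m(k) is found by locating the last violation of the kth constraint within the available prefix of the sequence.

```python
def m_of_k(seq, b, k):
    """Smallest index m such that |seq[n] - b| <= 1/k for all n >= m,
    found by locating the last violation within the available prefix."""
    last_bad = -1
    for n, bn in enumerate(seq):
        if abs(bn - b) > 1.0 / k:
            last_bad = n
    return last_bad + 1

# Hypothetical convergent sequence b_n = b + (-1)^n / n for n = 1, ..., 1000.
b = 2.0
seq = [b + (-1) ** n / n for n in range(1, 1001)]

ms = [m_of_k(seq, b, k) for k in range(1, 6)]
print(ms)  # prints [0, 1, 2, 3, 4]: nondecreasing, as the caption asserts
```

For this sequence |bn − b| = 1/n, so the kth constraint is violated exactly for n < k, and the computed m(k) grows linearly in k.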
Figure 1.18 gives an example of a sequence of real numbers that does not converge. Intuitively, this sequence is close to 0 (and in fact identically equal to 0) for most large n, but it doesn't stay close, because of ever more rare outages.
The following example illustrates how a sequence of rv’s can converge in probability but
not converge WP1. The example also provides some clues as to why convergence WP1 is
important.
Example 1.5.1. Consider a sequence { Yn ; n ≥ 1} of rv’s for which the sample paths
constitute the following slight variation of the sequence of real numbers in Figure 1.18.
Figure 1.18: Illustration of a non-convergent sequence of real numbers b1, b2, ... . The sequence is defined by bn = 3/4 for n = 1, 5, 25, ..., 5^j, ... for all integer j ≥ 0. For all other n, bn = 0. The terms for which bn ≠ 0 become increasingly rare as n → ∞. Note that bn ∈ [−1, 1] for all n, but there is no m(2) such that bn ∈ [−1/2, 1/2] for all n ≥ m(2). Thus the sequence does not converge.
In particular, as illustrated in Figure 1.19, the non-zero term at n = 5^j in Figure 1.18 is replaced by a non-zero term at a randomly chosen n in the interval [5^j, 5^{j+1}).^45
Figure 1.19: Illustration of a sample path of a sequence of rv's {Yn; n ≥ 0} where, for each j ≥ 0, Yn = 1 for an equiprobable choice of n ∈ [5^j, 5^{j+1}) and Yn = 0 otherwise.
Since each sample path contains a single one in each segment [5^j, 5^{j+1}), and contains zeros elsewhere, none of the sample paths converge. In other words, Pr{ω : lim Yn(ω) = 0} = 0. On the other hand, Pr{Yn = 0} = 1 − 5^{−j} for 5^j ≤ n < 5^{j+1}, so lim_{n→∞} Pr{Yn = 0} = 1. Thus this sequence of rv's converges to 0 in probability, but does not converge to 0 WP1.
This sequence also converges in mean square and (since it converges in probability) in distribution. Thus we have shown (by example) that convergence WP1 is not implied by any of the other types of convergence we have discussed. We will show in Section 5.2 that convergence WP1 does imply convergence in probability and in distribution but not in mean square (as illustrated in Figure 1.16).
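A quick simulation makes the two claims of this example concrete. The sketch below follows the same construction (one equiprobably placed non-zero term per segment [5^j, 5^{j+1})); the segment count and sample size are arbitrary choices for illustration.

```python
import random

def sample_path(num_segments, rng):
    """One sample path of {Y_n}: in each segment [5^j, 5^(j+1)) exactly one
    equiprobably chosen index carries a one; Y_n = 0 elsewhere.
    Returned sparsely as a dict from index n to the non-zero value."""
    return {rng.randrange(5 ** j, 5 ** (j + 1)): 1 for j in range(num_segments)}

rng = random.Random(0)
paths = [sample_path(6, rng) for _ in range(10_000)]

# Every sample path has one non-zero term per segment, so no path is
# eventually 0: Pr{lim Y_n = 0} = 0.
assert all(len(p) == 6 for p in paths)

# Yet for any fixed large n the probability of a one is tiny: n = 100 lies
# in segment j = 2, of width 4 * 5^2 = 100, so Pr{Y_100 != 0} = 1/100.
frac_nonzero_at_100 = sum(100 in p for p in paths) / len(paths)
print(frac_nonzero_at_100)  # close to 0.01
```

The two assertions mirror the text exactly: no individual path converges to 0, while Pr{Yn = 0} → 1, so the convergence is in probability only.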
The interesting point in this example is that this sequence of rv's is not bizarre (although it is somewhat specialized to make the analysis simple). Another important point is that this definition of convergence has a long history of being accepted as the 'useful,' 'natural,' and 'correct' way to define convergence for a sequence of real numbers. Thus it is not surprising that convergence WP1 will turn out to be similarly useful for sequences of rv's.
There is a price to be paid in using the concept of convergence WP1. We must then look at the entire sequence of rv's and can no longer analyze finite n-tuples and then go to the limit as n → ∞. This requires a significant additional layer of abstraction, which involves additional mathematical precision and initial loss of intuition. For this reason we put off further discussion of convergence WP1 and the SLLN until Chapter 5 where it is needed.
^45 There is no special significance to the number 5 here other than making the figure easy to visualize. We could replace 5 by 2 or 3, etc.
1.6 Relation of probability models to the real world
Whenever experienced and competent engineers or scientists construct a probability model to represent aspects of some system that either exists or is being designed for some application, they must acquire a deep knowledge of the system and its surrounding circumstances, and concurrently consider various types of probability models used in probabilistic analyses of the same or similar systems. Usually very simple probability models help in understanding the real-world system, and knowledge about the real-world system helps in understanding what aspects of the system are well-modeled by a given probability model. For a text such as this, there is insufficient space to understand the real-world aspects of each system that might be of interest. We must use the language of various canonical real-world systems for motivation and insight when studying probability models for various classes of systems, but such models must necessarily be chosen more for their tutorial than practical value.
There is a danger, then, that readers will come away with the impression that analysis is more challenging and important than modeling. To the contrary, for work on real-world systems, modeling is almost always more difficult, more challenging, and more important than analysis. The objective here is to provide the necessary knowledge and insight about probabilistic models so that the reader can later combine this with a deep understanding of particular real application areas. This will result in a useful interactive use of models, analysis, and experimentation.
In this section, our purpose is not to learn how to model real-world problems, since, as said above, this requires deep and specialized knowledge of whatever application area is of interest. Rather it is to understand the following conceptual problem that was posed in Section 1.1. Suppose we have a probability model of some real-world experiment involving randomness in the sense expressed there. When the real-world experiment being modeled is performed, there is an outcome, which presumably is one of the outcomes of the probability model, but there is no observable probability.
It appears to be intuitively natural, for experiments that can be carried out repeatedly
under essentially the same conditions, to associate the probability of a given event with
the relative frequency of that event over many repetitions. We now have the background
to understand this approach. We first look at relative frequencies within the probability
model, and then within the real world.
1.6.1 Relative frequencies in a probability model
We have seen that for any probability model, an extended probability model exists for n IID idealized experiments of the original model. For any event A in the original model, the indicator function I_A is a random variable, and the relative frequency of A over n IID experiments is the sample average of n IID rv's each with the distribution of I_A. From the weak law of large numbers, this relative frequency converges in probability to E[I_A] = Pr{A}. By taking the limit n → ∞, the strong law of large numbers says that the relative frequency of A converges with probability 1 to Pr{A}.
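The convergence of relative frequency to Pr{A} is easy to watch numerically. The sketch below simulates IID indicator rv's for an event with an arbitrarily chosen probability (0.3 is our own illustrative choice, not from the text).

```python
import random

def relative_frequency(p, n, rng):
    """Relative frequency of an event A with Pr{A} = p over n IID trials,
    i.e., the sample average of n IID indicator rv's I_A."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(42)
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(0.3, n, rng))
# The printed relative frequencies cluster ever more tightly around 0.3
# as n grows, as the weak law of large numbers predicts.
```

Note that any single run only illustrates the weak law; convergence WP1 is a statement about entire sample paths and cannot be observed in a finite simulation.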
In plain English, this says that for large n, the relative frequency of an event (in the n-repetition IID model) is essentially the same as the probability of that event. The word essentially is carrying a great deal of hidden baggage. For the weak law, for any ε, δ > 0, the relative frequency is within some ε of Pr{A} with a confidence level 1 − δ whenever n is sufficiently large. For the strong law, the ε and δ are avoided, but only by looking directly at the limit n → ∞. Despite the hidden baggage, though, relative frequency and probability are related as indicated.
1.6.2 Relative frequencies in the real world
In trying to sort out if and when the laws of large numbers have much to do with real-world experiments, we should ignore the mathematical details for the moment and agree that for large n, the relative frequency of an event A over n IID trials of an idealized experiment is essentially Pr{A}. We can certainly visualize a real-world experiment that has the same set of possible outcomes as the idealized experiment and we can visualize evaluating the relative frequency of A over n repetitions with large n. If that real-world relative frequency is essentially equal to Pr{A}, and this is true for the various events A of greatest interest, then it is reasonable to hypothesize that the idealized experiment is a reasonable model for the real-world experiment, at least so far as those given events of interest are concerned.
One problem with this comparison of relative frequencies is that we have carefully specified a model for n IID repetitions of the idealized experiment, but have said nothing about how the real-world experiments are repeated. The IID idealized experiments specify that the conditional probability of A at one trial is the same no matter what the results of the other trials are. Intuitively, we would then try to isolate the n real-world trials so they don't affect each other, but this is a little vague. The following examples help explain this problem and several others in comparing idealized and real-world relative frequencies.
Example 1.6.1. Coin tossing: Tossing coins is widely used as a way to choose the first player in various games, and is also sometimes used as a primitive form of gambling. Its importance, however, and the reason for its frequent use, is its simplicity. When tossing
importance, however, and the reason for its frequent use, is its simplicity. When tossing
a coin, we would argue from the symmetry between the two sides of the coin that each
should be equally probable (since any procedure for evaluating the probability of one side
shouldapplyequallyto the other). Thus since H and T are the only outcomes(the remote
possibility of the coin balancingon its edge is omitted from the model), the reasonableand
universally accepted model for coin tossing is that H and T each have probability 1/2.
On the other hand, the two sides of a coin are embossed in different ways, so that the mass is not uniformly distributed. Also the two sides do not behave in quite the same way when bouncing off a surface. Each denomination of each currency behaves slightly differently in this respect. Thus, not only do coins violate symmetry in small ways, but different coins violate it in different ways.
How do we test whether this effect is significant? If we assume for the moment that successive tosses of the coin are well-modeled by the idealized experiment of n IID trials, we can essentially find the probability of H for a particular coin as the relative frequency of H in a sufficiently large number of independent tosses of that coin. This gives us slightly different relative frequencies for different coins, and thus slightly different probability models for different coins.
The assumption of independent tosses is also questionable. Consider building a carefully engineered machine for tossing coins and using it in a vibration-free environment. A standard coin is inserted into the machine in the same way for each toss and we count the number of heads and tails. Since the machine has essentially eliminated the randomness, we would expect all the coins, or almost all the coins, to come up the same way; the more precise the machine, the less independent the results. By inserting the original coin in a random way, a single trial might have equiprobable results, but successive tosses are certainly not independent. The successive trials would be closer to independent if the tosses were done by a slightly inebriated individual who tossed the coins high in the air.
The point of this example is that there are many different coins and many ways of tossing them, and the idea that one model fits all is reasonable under some conditions and not under others. Rather than retreating into the comfortable world of theory, however, note that we can now find the relative frequency of heads for any given coin and essentially for any given way of tossing that coin.^46
Example 1.6.2. Binary data: Consider the binary data transmitted over a communication link or stored in a data facility. The data is often a mixture of encoded voice, video, graphics, text, etc., with relatively long runs of each, interspersed with various protocols for retrieving the original non-binary data.
The simplest (and most common) model for this is to assume that each binary digit is 0 or 1 with equal probability and that successive digits are statistically independent. This is the same as the model for coin tossing after the trivial modification of converting {H, T} into {0, 1}. This is also a rather appropriate model for designing a communication or storage facility, since all n-tuples are then equiprobable (in the model) for each n, and thus the facilities need not rely on any special characteristics of the data. On the other hand, if one wants to compress the data, reducing the required number of transmitted or stored bits per incoming bit, then a more elaborate model is needed.
Developing such an improved model would require finding out more about where the data is coming from; a naive application of calculating relative frequencies of n-tuples would probably not be the best choice. On the other hand, there are well-known data compression schemes that in essence track dependencies in the data and use them for compression in a coordinated way. These schemes are called universal data-compression schemes since they don't rely on a probability model. At the same time, they are best analyzed by looking at how they perform for various idealized probability models.
The point of this example is that choosing probability models often depends heavily on how the model is to be used. Models more complex than IID binary digits are usually based on what is known about the input processes. Measuring relative frequencies and associating them with probabilities is the basic underlying conceptual connection between real-world and models, but in practice this is essentially the relationship of last resort. For most of the applications we will consider, there is a long history of modeling to build on, with experiments as needed.

^46 We are not suggesting that distinguishing different coins for the sake of coin tossing is an important problem. Rather, we are illustrating that even in such a simple situation, the assumption of identically prepared experiments is questionable and the assumption of independent experiments is questionable. The extension to n repetitions of IID experiments is not necessarily a good model for coin tossing. In other words, one has to question not only the original model but also the n-repetition model.
Example 1.6.3. Fable: In the year 2008, the financial structure of the USA failed and the world economy was brought to its knees. Much has been written about the role of greed on Wall Street and incompetence in Washington. Another aspect of the collapse, however, was a widespread faith in stochastic models for limiting risk. These models encouraged people to engage in investments that turned out to be far riskier than the models predicted. These models were created by some of the brightest PhD's from the best universities, but they failed miserably because they modeled everyday events very well, but modeled the rare events and the interconnection of events poorly. They failed badly by not understanding their application, and in particular, by trying to extrapolate typical behavior when their primary goal was to protect against highly atypical situations. The moral of the fable is that brilliant analysis is not helpful when the modeling is poor; as computer engineers say, "garbage in, garbage out."
The examples above show that the problems of modeling a real-world experiment are often connected with the question of creating a model for a set of experiments that are not exactly the same and do not necessarily correspond to the notion of independent repetitions within the model. In other words, the question is not only whether the probability model is reasonable for a single experiment, but also whether the IID repetition model is appropriate for multiple copies of the real-world experiment.
At least we have seen, however, that if a real-world experiment can be performed many times with a physical isolation between performances that is well modeled by the IID repetition model, then the relative frequencies of events in the real-world experiment correspond to relative frequencies in the idealized IID repetition model, which correspond to probabilities in the original model. In other words, under appropriate circumstances, the probabilities in a model become essentially observable over many repetitions.
We will see later that our emphasis on IID repetitions was done for simplicity. There are other models for repetitions of a basic model, such as Markov models, that we study later. These will also lead to relative frequencies approaching probabilities within the repetition model. Thus, for repeated real-world experiments that are well modeled by these repetition models, the real-world relative frequencies approximate the probabilities in the model.
1.6.3 Statistical independence of real-world experiments
We have been discussing the use of relative frequencies of an event A in a repeated real-world experiment to test Pr{A} in a probability model of that experiment. This can be done essentially successfully if the repeated trials correspond to IID trials in the idealized experiment. However, the statement about IID trials in the idealized experiment is a statement about probabilities in the extended n-trial model. Thus, just as we tested Pr{A} by repeated real-world trials of a single experiment, we should be able to test Pr{A1, ..., An} in the n-repetition model by a much larger number of real-world repetitions of n-tuples rather than single trials.
To be more specific, choose two large integers, m and n, and perform the underlying real-world experiment mn times. Partition the mn trials into m runs of n trials each. For any given n-tuple A1, ..., An of successive events, find the relative frequency (over m trials of n-tuples) of the n-tuple event A1, ..., An. This can then be used essentially to test the probability Pr{A1, ..., An} in the model for n IID trials. The individual event probabilities can also be tested, so the condition for independence can be tested.
The observant reader will note that there is a tacit assumption above that successive n-tuples can be modeled as independent, so it seems that we are simply replacing a big problem with a bigger problem. This is not quite true, since if the trials are dependent with some given probability model for dependent trials, then this test for independence will essentially reject the independence hypothesis for large enough n. In other words, we cannot completely verify the correctness of an independence hypothesis for the n-trial model, although in principle we could eventually falsify it if it is false.
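The m-runs-of-n-trials procedure above can be sketched in a few lines. Binary outcomes and the particular choice p = 1/2 are illustrative assumptions of ours, not part of the text; under the IID model each n-tuple should have probability p^k (1−p)^(n−k), where k is the number of ones.

```python
import random
from itertools import product

def tuple_freqs(n, m, p, rng):
    """Relative frequency of each n-tuple of binary outcomes over m runs
    of n trials each (mn underlying trials in all)."""
    counts = {}
    for _ in range(m):
        run = tuple(int(rng.random() < p) for _ in range(n))
        counts[run] = counts.get(run, 0) + 1
    return {t: counts.get(t, 0) / m for t in product((0, 1), repeat=n)}

rng = random.Random(0)
freqs = tuple_freqs(2, 200_000, 0.5, rng)

# Under the IID model every 2-tuple has probability 1/4; large deviations
# of the measured frequencies from 1/4 would argue against independence
# of successive trials.
for t, f in sorted(freqs.items()):
    print(t, round(f, 4))
```

As the text warns, agreement here can only fail to falsify independence; it cannot fully verify it, since the m runs are themselves tacitly assumed independent.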
Choosing models for real-world experiments is primarily a subject for statistics, and we will not pursue it further except for brief discussions when treating particular application areas. The purpose here has been to treat a fundamental issue in probability theory. As stated before, probabilities are non-observables; they exist in the theory but are not directly measurable in real-world experiments. We have shown that probabilities essentially become observable in the real world via relative frequencies over repeated trials.
1.6.4 Limitations of relative frequencies
Most real-world applications that are modeled by probability models have such a large sample space that it is impractical to conduct enough trials to choose probabilities from relative frequencies. Even a shuffled deck of 52 cards would require many more than 52! ≈ 8 × 10^67 trials for most of the outcomes to appear even once. Thus relative frequencies can be used to test the probability of given individual events of importance, but are usually impractical for choosing the entire model and even more impractical for choosing a model for repeated trials.
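The 52! figure quoted above is easy to verify directly:

```python
import math

# Number of orderings of a 52-card deck.
orderings = math.factorial(52)
print(f"{orderings:.2e}")  # about 8.07e67, matching the 8 x 10^67 in the text
```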
Since relative frequencies give us a concrete interpretation of what probability means, however, we can now rely on other approaches, such as symmetry, for modeling. From symmetry, for example, it is clear that all 52! possible arrangements of a card deck should be equiprobable after shuffling. This leads, for example, to the ability to calculate probabilities of different poker hands, etc., which are such popular exercises in elementary probability classes.
Another valuable modeling procedure is that of constructing a probability model where the possible outcomes are independently chosen n-tuples of outcomes in a simpler model. More generally, most of the random processes to be studied in this text are defined as various ways of combining simpler idealized experiments.
What is really happening as we look at modeling increasingly sophisticated systems and
studying increasingly sophisticated models is that we are developing mathematical results
for simple idealized models and relating those results to real-world results (such as relating
idealized statistically independent trials to real-world independent trials). The association
of relative frequencies to probabilities forms the basis for this, but is usually exercised only in the simplest cases.
The way one selects probability models of real-world experiments in practice is to use scientific knowledge and experience, plus simple experiments, to choose a reasonable model.
The results from the model (such as the law of large numbers) are then used both to
hypothesize results about the real-world experiment and to provisionally reject the model
when further experiments show it to be highly questionable. Although the results about
the model are mathematically precise, the corresponding results about the real-world are
at best insightful hypotheses whose most important aspects must be validated in practice.
1.6.5 Subjective probability
There are many useful applications of probability theory to situations other than repeated trials of a given experiment. When designing a new system in which randomness (of the type used in probability models) is hypothesized, one would like to analyze the system before
actually building it. In such cases, the real-world system does not exist, so indirect means
must be used to construct a probability model. Often some sources of randomness, such as
noise, can be modeled in the absence of the system. Often similar systems or simulation can be used to help understand the system and help in formulating appropriate probability models. However, the choice of probabilities is to a certain extent subjective.
Another type of situation (such as risk analysis for nuclear reactors) deals with a large
number of very unlikely outcomes, each catastrophic in nature. Experimentation clearly
cannot be used to establish probabilities, and it is not clear that probabilities have any
real meaning here. It can be helpful, however, to choose a probability model on the basis
of subjective beliefs which can be used as a basis for reasoning about the problem. When
handled well, this can at least make the subjective biases clear, leading to a more rational
approach. When handled poorly (as for example in some risk analyses of large financial
systems) it can hide both the real risks and the arbitrary nature of possibly poor decisions.
We will not discuss the various, often ingenious, methods for choosing subjective probabilities. The reason is that subjective beliefs should be based on intensive and long-term exposure to the particular problem involved; discussing these problems in abstract probability terms weakens this link. We will focus instead on the analysis of idealized models.
These can be used to provide insights for subjective models, and more refined and precise
results for objective models.
1.7 Summary
This chapter started with an introduction into the correspondence between probability theory and real-world experiments involving randomness. While almost all work in probability theory works with established probability models, it is important to think through what these probabilities mean in the real world, and elementary subjects rarely address these questions seriously.
The next section discussed the axioms of probability theory, along with some insights about why these particular axioms were chosen. This was followed by a review of conditional probabilities, statistical independence, random variables, stochastic processes, and expectations. The emphasis was on understanding the underlying structure of the field rather than reviewing details and problem solving techniques.
This was followed by discussing and developing the laws of large numbers at a somewhat deeper level than most elementary courses. This involved a fair amount of abstraction, combined with mathematical analysis. The central idea is that the sample average of n IID rv's approaches the mean with increasing n. As a special case, the relative frequency of an event A approaches Pr{A}. What the word approaches means here is both tricky and vital in understanding probability theory. The strong law of large numbers and convergence WP1 require mathematical maturity, and are postponed to Chapter 5 where they are first used.
The final section came back to the fundamental problem of understanding the relation between probability theory and randomness in the real world. It was shown, via the laws of large numbers, that probabilities become essentially observable via relative frequencies calculated over repeated experiments.
There are too many texts on elementary probability to mention here, and most of them
serve to give added understanding and background to the material in this chapter. We
recommend Bertsekas and Tsitsiklis [2], both for a careful statement of the fundamentals
and for a wealth of well-chosen and carefully explained examples.
Texts that cover similar material to that here are [18] and [12]. Kolmogorov [15] is readable for the mathematically mature and is also of historical interest as the translation of the 1933 book that first put probability on a firm mathematical basis. Feller [7] is the classic extended and elegant treatment of elementary material from a mature point of view. Rudin [19] is an excellent text on measure theory for those with advanced mathematical preparation.
1.8 Exercises
Exercise 1.1. Consider a sequence A1, A2, ... of events each of which have probability zero.

a) Find Pr{⋃_{n=1}^m An} and find lim_{m→∞} Pr{⋃_{n=1}^m An}. What you have done is to show that the sum of a countably infinite set of numbers each equal to 0 is perfectly well defined as 0.

b) For a sequence of possible phases, a1, a2, ... between 0 and 2π, and a sequence of singleton events, An = {an}, find Pr{⋃_n An} assuming that the phase is uniformly distributed.

c) Now let each An be the empty event φ. Use (1.1) and part a) to show that Pr{φ} = 0.
Exercise 1.2. Let A1 and A2 be arbitrary events and show that Pr{A1 ∪ A2} + Pr{A1A2} = Pr{A1} + Pr{A2}. Explain which parts of the sample space are being double counted on both sides of this equation and which parts are being counted once.
Exercise 1.3. This exercise derives the probability of an arbitrary (non-disjoint) union of events, derives the union bound, and derives some useful limit expressions.

a) For 2 arbitrary events A1 and A2, show that

A1 ∪ A2 = A1 ∪ (A2 − A1),    where A2 − A1 = A2A1^c.

Show that A1 and A2 − A1 are disjoint. Hint: This is what Venn diagrams were invented for.

b) For an arbitrary sequence of events, {An; n ≥ 1}, let B1 = A1 and for each n ≥ 2 define Bn = An − ⋃_{m=1}^{n−1} Am. Show that B1, B2, ... are disjoint events and show that for each n ≥ 2, ⋃_{m=1}^n Am = ⋃_{m=1}^n Bm. Hint: Use induction.

c) Show that

Pr{⋃_{n=1}^∞ An} = Pr{⋃_{n=1}^∞ Bn} = ∑_{n=1}^∞ Pr{Bn}.

Hint: Use the axioms of probability for the second equality.

d) Show that for each n, Pr{Bn} ≤ Pr{An}. Use this to show that

Pr{⋃_{n=1}^∞ An} ≤ ∑_{n=1}^∞ Pr{An}.

e) Show that Pr{⋃_{n=1}^∞ An} = lim_{m→∞} Pr{⋃_{n=1}^m An}. Hint: Combine parts c) and b). Note that this says that the probability of a limit of unions is equal to the limit of the probabilities. This might well appear to be obvious without a proof, but you will see situations later where similar appearing interchanges cannot be made.

f) Show that Pr{⋂_{n=1}^∞ An} = lim_{n→∞} Pr{⋂_{i=1}^n Ai}. Hint: Remember de Morgan's equalities.
Exercise 1.4. Find the probability that a five card poker hand, chosen randomly from a 52 card deck, contains 4 aces. That is, if all 52! arrangements of a deck of cards are equally likely, what is the probability that all 4 aces are in the first 5 cards of the deck?
Exercise 1.5. Consider a sample space of 8 equiprobable sample points and let A1, A2, A3 be three events each of probability 1/2 such that Pr{A1A2A3} = Pr{A1} Pr{A2} Pr{A3}.

a) Create an example where Pr{A1A2} = Pr{A1A3} = 1/4 but Pr{A2A3} = 1/8. Hint: Make a table with a row for each sample point and a column for each event and try different ways of assigning sample points to events (the answer is not unique).

b) Show that, for your example, A2 and A3 are not independent. Note that the definition of statistical independence would be very strange if it allowed A1, A2, A3 to be independent while A2 and A3 are dependent. This illustrates why the definition of independence requires (1.14) rather than just (1.15).
Exercise 1.6. This exercise shows that for all rv's X, F_X(x) is continuous from the right.

a) For any given rv X, any real number x, and each integer n ≥ 1, let An = {ω : X > x + 1/n}, and show that A1 ⊆ A2 ⊆ ···. Use this and the corollaries to the axioms of probability to show that Pr{⋃_{n≥1} An} = lim_{n→∞} Pr{An}.

b) Show that Pr{⋃_{n≥1} An} = Pr{X > x} and show that Pr{X > x} = lim_{n→∞} Pr{X > x + 1/n}.

c) Show that for ε > 0, lim_{ε→0} Pr{X ≤ x + ε} = Pr{X ≤ x}.

d) Define F̃_X(x) = Pr{X < x}. Show that F̃_X(x) is continuous from the left. In other words, the continuity from the right for the distribution function arises from the almost arbitrary (but universally accepted) choice in defining the distribution function as Pr{X ≤ x} rather than Pr{X < x}.
Exercise 1.7. Suppose X and Y are discrete rv's with the PMF p_XY(xi, yj). Show (a picture will help) that this is related to the joint distribution function by

p_XY(xi, yj) = lim_{δ→0, δ>0} [F(xi, yj) − F(xi − δ, yj) − F(xi, yj − δ) + F(xi − δ, yj − δ)].
Exercise 1.8. A variation of Example 1.3.2 is to let M be a random variable that takes on both positive and negative values with the PMF

p_M(m) = 1 / [2|m|(|m| + 1)].

In other words, M is symmetric around 0 and |M| has the same PMF as the nonnegative rv N of Example 1.3.2.

a) Show that Σ_{m≥0} m p_M(m) = ∞ and Σ_{m<0} m p_M(m) = −∞. (Thus show that the expectation of M not only does not exist but is undefined even in the extended real number system.)
1.8. EXERCISES
b) Suppose that the terms in Σ_{m=−∞}^{∞} m p_M(m) are summed in the order of 2 positive terms for each negative term (i.e., in the order 1, 2, −1, 3, 4, −2, 5, ···). Find the limiting value of the partial sums in this series. Hint: You may find it helpful to know that

lim_{n→∞} [ Σ_{i=1}^{n} 1/i − ∫_1^n (1/x) dx ] = γ,

where γ is the Euler-Mascheroni constant, γ = 0.57721···.

c) Repeat part b) where, for any given integer k > 0, the order of summation is k positive terms for each negative term.
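As a numerical illustration of parts b) and c) (a sketch of ours, not part of the text), the partial sums under two different orderings can be compared directly; the positive terms here are m p_M(m) = 1/(2(m + 1)):

```python
def term(m):
    # m p_M(m) = m / (2 |m| (|m|+1)) = 1 / (2 (|m|+1)), carrying the sign of m
    return (1.0 if m > 0 else -1.0) / (2 * (abs(m) + 1))

def partial_sum(pos_per_neg, n_neg):
    """Sum the series taking pos_per_neg positive terms per negative term."""
    s, p, q = 0.0, 0, 0          # p: positive terms used, q: negative terms used
    for _ in range(n_neg):
        for _ in range(pos_per_neg):
            p += 1
            s += term(p)
        q += 1
        s += term(-q)
    return s

s11 = partial_sum(1, 10**5)      # alternating order 1, -1, 2, -2, ...
s21 = partial_sum(2, 10**5)      # order 1, 2, -1, 3, 4, -2, ... from part b)
print(s11, s21)                  # the two orders approach different limits
```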
Exercise 1.9. Let X be a ternary rv taking on the 3 values 0, 1, 2 with probabilities p_0, p_1, p_2 respectively. Find the median of X for each of the cases below.

a) p_0 = 0.2, p_1 = 0.4, p_2 = 0.4.

b) p_0 = 0.2, p_1 = 0.2, p_2 = 0.6.

c) p_0 = 0.2, p_1 = 0.3, p_2 = 0.5.

Note 1: The median is not unique in part c). Find the interval of values that are medians.
Note 2: Some people force the median to be distinct by defining it as the midpoint of the interval satisfying the definition given here.

d) Now suppose that X is non-negative and continuous with the density f_X(x) = 1 for 0 ≤ x ≤ 0.5 and f_X(x) = 0 for 0.5 < x ≤ 1. We know that f_X(x) is positive for all x > 1, but it is otherwise unknown. Find the median or interval of medians.

The median is sometimes (incorrectly) defined as that α for which Pr{X > α} = Pr{X < α}. Show that it is possible for no such α to exist. Hint: Look at the examples above.
Exercise 1.10. a) For any given rv Y, express E[|Y|] in terms of ∫_{y<0} F_Y(y) dy and ∫_{y≥0} F^c_Y(y) dy. Hint: Review the argument in Figure 1.4.

b) For some given rv X with E[|X|] < ∞, let Y = X − α. Using part a), show that

E[|X − α|] = ∫_{−∞}^{α} F_X(x) dx + ∫_{α}^{∞} F^c_X(x) dx.

c) Show that E[|X − α|] is minimized over α by choosing α to be a median of X. Hint: The easiest and most instructive way to do this is to use a graphical argument involving shifting Figure 1.4. Be careful to show that when the median is an interval, all points in this interval achieve the minimum.
Exercise 1.11. Let X be a rv with distribution function F_X(x). Find the distribution function of the following rv's.

a) The maximum of n IID rv's, each with distribution function F_X(x).

b) The minimum of n IID rv's, each with distribution function F_X(x).

c) The difference of the rv's defined in a) and b); assume X has a density f_X(x).
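Once you have candidate answers, a quick simulation (ours; uniform rv's and the test point x = 0.9 are arbitrary choices) lets you check them:

```python
import random

random.seed(1)

n, trials = 5, 100_000
x = 0.9   # test point; X is uniform on (0,1), so F_X(x) = x there
count_max = sum(1 for _ in range(trials)
                if max(random.random() for _ in range(n)) <= x)
est = count_max / trials
print(est)   # compare with your answer to part a) evaluated at x = 0.9
```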
CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
Exercise 1.12. Let X and Y be rv's in some sample space Ω and let Z = X + Y, i.e., for each ω ∈ Ω, Z(ω) = X(ω) + Y(ω).

a) Show that the set of ω for which Z(ω) = ±∞ has probability 0.

b) To show that Z = X + Y is a rv, we must show that for each real number α, the set {ω ∈ Ω : X(ω) + Y(ω) ≤ α} is an event. We proceed indirectly. For an arbitrary positive integer n and an arbitrary integer k, let B(n,k) = {ω : X(ω) ≤ kα/n} ∩ {ω : Y(ω) ≤ (n + 1 − k)α/n}. Let D(n) = ⋃_k B(n,k) and show that D(n) is an event.

c) On a 2-dimensional sketch for a given α, show the values of X(ω) and Y(ω) for which ω ∈ D(n). Hint: This set of values should be bounded by a staircase function.

d) Show that

{ω : X(ω) + Y(ω) ≤ α} = ⋂_n D(n).

Explain why this shows that Z = X + Y is a rv.
Exercise 1.13. a) Let X_1, X_2, ..., X_n be rv's with expected values X̄_1, ..., X̄_n. Show that E[X_1 + ··· + X_n] = X̄_1 + ··· + X̄_n. You may assume that the rv's have a joint density function, but do not assume that the rv's are independent.
b) Now assume that X 1,... ,X n are statistically independent and show that the expected
value of the product is equal to the product of the expected values.
c) Again assuming that X 1,... ,X n are statistically independent, show that the variance
of the sum is equal to the sum of the variances.
Exercise 1.14. (Stieltjes integration) a) Let h(x) = u(x) and F_X(x) = u(x) where u(x) is the unit step, i.e., u(x) = 0 for −∞ < x < 0 and u(x) = 1 for x ≥ 0. Using the definition of the Stieltjes integral in Footnote 24, show that ∫_{−1}^{1} h(x) dF_X(x) does not exist. Hint: Look at the term in the Riemann sum including x = 0 and look at the range of choices for h(x) in that interval. Intuitively, it might help initially to view dF_X(x) as a unit impulse at x = 0.

b) Let h(x) = u(x − a) and F_X(x) = u(x − b) where a and b are in (−1, +1). Show that ∫_{−1}^{1} h(x) dF_X(x) exists if and only if a ≠ b. Show that the integral has the value 1 for a < b and the value 0 for a > b. Argue that this result is still valid in the limit of integration over (−∞, ∞).

c) Let X and Y be independent discrete rv's, each with a finite set of possible values. Show that ∫_{−∞}^{∞} F_X(z − y) dF_Y(y), defined as a Stieltjes integral, is equal to the distribution of Z = X + Y at each z other than the possible sample values of Z, and is undefined at each sample value of Z. Hint: Express F_X and F_Y as sums of unit steps. Note: This failure of Stieltjes integration is not a serious problem; F_Z(z) is a step function, and the integral is undefined at its points of discontinuity. We automatically define F_Z(z) at those step values so that F_Z is a distribution function (i.e., is continuous from the right). This problem does not arise if either X or Y is continuous.
Exercise 1.15. Let X_1, X_2, ..., X_n, ... be a sequence of IID continuous rv's with the common probability density function f_X(x); note that Pr{X = α} = 0 for all α and that Pr{X_i = X_j} = 0 for all i ≠ j. For n ≥ 2, define X_n as a record-to-date of the sequence if X_n > X_i for all i < n.

a) Find the probability that X_2 is a record-to-date. Use symmetry to obtain a numerical answer without computation. A one or two line explanation should be adequate.

b) Find the probability that X_n is a record-to-date, as a function of n ≥ 1. Again use symmetry.

c) Find a simple expression for the expected number of records-to-date that occur over the first m trials for any given integer m. Hint: Use indicator functions. Show that this expected number is infinite in the limit m → ∞.
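After working parts a)–c), a simulation sketch (ours) for m = 10 gives estimates to compare with your formulas:

```python
import random

random.seed(2)

m, trials = 10, 200_000
rec2 = 0            # number of trials in which X_2 is a record-to-date
tot_records = 0     # total records-to-date among X_2, ..., X_m
for _ in range(trials):
    xs = [random.random() for _ in range(m)]
    best = xs[0]
    for n in range(1, m):
        if xs[n] > best:
            tot_records += 1
            if n == 1:
                rec2 += 1
            best = xs[n]
# Compare these estimates with your answers to parts a) and c).
print(rec2 / trials, tot_records / trials)
```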
Exercise 1.16. (Continuation of Exercise 1.15)

a) Let N_1 be the index of the first record-to-date in the sequence. Find Pr{N_1 > n} for each n ≥ 2. Hint: There is a far simpler way to do this than working from part b) in Exercise 1.15.

b) Show that N_1 is a rv.

c) Show that E[N_1] = ∞.

d) Let N_2 be the index of the second record-to-date in the sequence. Show that N_2 is a rv. Hint: You need not find the distribution function of N_2 here.

e) Contrast your result in part c) to the result from part c) of Exercise 1.15 saying that the expected number of records-to-date is infinite over an infinite number of trials. Note: this should be a shock to your intuition; there is an infinite expected wait for the first of an infinite sequence of occurrences, each of which must eventually occur.
Exercise 1.17. (Another direction from Exercise 1.15) a) For any given n ≥ 2, find the probability that X_n and X_{n+1} are both records-to-date. Hint: The idea in part b) of Exercise 1.15 is helpful here, but the result is not.

b) Is the event that X_n is a record-to-date statistically independent of the event that X_{n+1} is a record-to-date?

c) Find the expected number of adjacent pairs of records-to-date over the sequence X_1, X_2, .... Hint: A helpful fact here is that 1/[n(n+1)] = 1/n − 1/(n+1).
Exercise 1.18. a) Assume that X is a nonnegative discrete rv taking on values a_1, a_2, ..., and let Y = h(X) for some nonnegative function h. Let b_i = h(a_i), i ≥ 1, be the ith value taken on by Y. Show that E[Y] = Σ_i b_i p_Y(b_i) = Σ_i h(a_i) p_X(a_i). Find an example where E[X] exists but E[Y] = ∞.

b) Let X be a nonnegative continuous rv with density f_X(x) and let h(x) be differentiable, nonnegative, and strictly increasing in x. Let A(δ) = Σ_n h(nδ)[F(nδ) − F(nδ − δ)], i.e., A(δ) is a δth order approximation to the Stieltjes integral ∫ h(x) dF(x). Show that if A(1) < ∞, then A(2^{−k}) ≤ A(2^{−(k−1)}) < ∞. Show from this that ∫ h(x) dF(x) converges to a finite value. Note: this is a very special case, but it can be extended to many cases of interest. It seems better to consider these convergence questions as required rather than consider them in general.
Exercise 1.19. a) Consider a positive, integer-valued rv whose distribution function is given at integer values by

F_Y(y) = 1 − 2/[(y + 1)(y + 2)] for integer y ≥ 0.

Use (1.33) to show that E[Y] = 2. Hint: Note the PMF given in (1.32).

b) Find the PMF of Y and use it to check the value of E[Y].

c) Let X be another positive, integer-valued rv. Assume its conditional PMF is given by

p_{X|Y}(x|y) = 1/y for 1 ≤ x ≤ y.

Find E[X | Y = y] and show that E[X] = 3/2. Explore finding p_X(x) until you are convinced that using the conditional expectation to calculate E[X] is considerably easier than using p_X(x).

d) Let Z be another integer-valued rv with the conditional PMF

p_{Z|Y}(z|y) = 1/y² for 1 ≤ z ≤ y².

Find E[Z | Y = y] for each integer y ≥ 1 and find E[Z].
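The claimed values E[Y] = 2 and E[X] = 3/2 can be corroborated with truncated sums (a sketch of ours; the truncation point is arbitrary, and the conditional mean of a uniform PMF on 1..y is used for E[X]):

```python
N = 10**5   # truncation point for the infinite sums

def F(t):
    return 1 - 2 / ((t + 1) * (t + 2))

def pY(y):
    # PMF of Y by differencing F_Y at integer points
    return F(y) - F(y - 1)

EY_tail = sum(2 / ((y + 1) * (y + 2)) for y in range(N))   # sum of Pr{Y > y}
EY_pmf = sum(y * pY(y) for y in range(1, N))
EX = sum(pY(y) * (y + 1) / 2 for y in range(1, N))         # E[X|Y=y] = (y+1)/2
print(EY_tail, EY_pmf, EX)
```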
Exercise 1.20. a) Show that, for uncorrelated rv's, the expected value of the product is equal to the product of the expected values (by definition, X and Y are uncorrelated if E[(X − X̄)(Y − Ȳ)] = 0).

b) Show that if X and Y are uncorrelated, then the variance of X + Y is equal to the variance of X plus the variance of Y.

c) Show that if X_1, ..., X_n are uncorrelated, then the variance of the sum is equal to the sum of the variances.

d) Show that independent rv's are uncorrelated.

e) Let X, Y be identically distributed ternary valued random variables with the PMF p_X(−1) = p_X(1) = 1/4; p_X(0) = 1/2. Find a simple joint probability assignment such that X and Y are uncorrelated but dependent.

f) You have seen that the moment generating function of a sum of independent rv's is equal to the product of the individual moment generating functions. Give an example where this is false if the variables are uncorrelated but dependent.
Exercise 1.21. Suppose X has the Poisson PMF, p_X(n) = λ^n exp(−λ)/n! for n ≥ 0, and Y has the Poisson PMF, p_Y(m) = µ^m exp(−µ)/m! for m ≥ 0. Assume that X and Y are independent. Find the distribution of Z = X + Y and find the conditional distribution of Y conditional on Z = n.
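A numerical convolution (our sketch; λ = 2 and µ = 3 are arbitrary choices) gives the PMF of Z to compare with your answer:

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, mu = 2.0, 3.0
K = 30
# PMF of Z = X + Y by discrete convolution of the two Poisson PMFs
pZ = [sum(poisson_pmf(lam, i) * poisson_pmf(mu, k - i) for i in range(k + 1))
      for k in range(K)]
print(pZ[:5])   # compare with your answer for the distribution of Z
```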
Exercise 1.22. a) Suppose X, Y and Z are binary rv's, each taking on the value 0 with probability 1/2 and the value 1 with probability 1/2. Find a simple example in which
X , Y , Z are statistically dependentbut are pairwise statistically independent (i.e., X , Y
are statistically independent, X , Z are statistically independent, and Y , Z are statistically
independent). Give pXYZ ( x,y,z ) for your example. Hint: In the simplest example, there
are four joint values for x,y,z that have probability 1/4 each.
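One standard construction consistent with the hint (our suggestion; the exercise may admit others) takes X and Y as independent fair bits and Z as their modulo-2 sum; the enumeration below checks it:

```python
from fractions import Fraction
from itertools import product

# The four equiprobable (x, y, z) triples with z = x XOR y.
triples = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
p = Fraction(1, 4)

def pr(pred):
    return sum(p for t in triples if pred(t))

# Pairwise independence: every pair of coordinates factors.
for i, j in ((0, 1), (0, 2), (1, 2)):
    for a, b in product((0, 1), repeat=2):
        assert pr(lambda t: t[i] == a and t[j] == b) == \
               pr(lambda t: t[i] == a) * pr(lambda t: t[j] == b)

# But not mutually independent: Z is determined by X and Y.
assert pr(lambda t: t == (0, 0, 1)) == 0 != Fraction(1, 8)
print("pairwise independent, jointly dependent")
```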
b) Is pairwise statistical independence enough to ensure that

E[∏_{i=1}^{n} X_i] = ∏_{i=1}^{n} E[X_i]

for a set of rv's X_1, ..., X_n?

Exercise 1.23. Show that E[X] is the value of α that minimizes E[(X − α)²].
Exercise 1.24. For each of the following random variables, find the interval (r−, r+) over which the moment generating function g(r) exists. Determine in each case whether g_X(r) exists at the end points r− and r+. For parts a) and b) you should also find and sketch g(r). For part c), g(r) has no closed form.

a) Let λ, θ be positive numbers and let X have the density

f_X(x) = (1/2)λ exp(−λx) for x ≥ 0;  f_X(x) = (1/2)θ exp(θx) for x < 0.

b) Let Y be a Gaussian random variable with mean m and variance σ².

c) Let Z be a nonnegative random variable with density

f_Z(z) = k(1 + z)^{−2} exp(−λz) for z ≥ 0,

where λ > 0 and k = [∫_{z≥0} (1 + z)^{−2} exp(−λz) dz]^{−1}. Hint: Do not try to evaluate g_Z(r). Instead, investigate values of r for which the integral is finite and infinite.
Exercise 1.25. Recall that the MGF of the nonnegative exponential rv with density e^{−x} is (1 − r)^{−1} for r < r+ = 1. In other words, g(r+) does not exist and lim_{r→r+} g(r) = ∞, where the limit is over r < r+. In this exercise, you are to assume that X is an arbitrary rv for which g(r+) does not exist and show that lim_{r→r+} g(r) = ∞ where the limit is over r < r+.

a) Explain why

lim_{A→∞} ∫_0^A e^{r+ x} dF(x) = ∞.

b) Show that for any ε > 0 and any A > 0,

g(r+ − ε) ≥ e^{−εA} ∫_0^A e^{r+ x} dF(x).

c) Choose A = 1/ε and show that

lim_{ε→0} g(r+ − ε) = ∞.
Exercise 1.26. a) Assume that the MGF of the random variable X exists (i.e., is finite) in the interval (r−, r+), r− < 0 < r+, and assume r− < r < r+ throughout. For any finite constant c, express the moment generating function of X − c, i.e., g_{(X−c)}(r), in terms of g_X(r) and show that g_{(X−c)}(r) exists for all r in (r−, r+). Explain why g''_{(X−c)}(r) ≥ 0.

b) Show that g''_{(X−c)}(r) = [g''_X(r) − 2c g'_X(r) + c² g_X(r)] e^{−rc}.

c) Use a) and b) to show that g''_X(r) g_X(r) − [g'_X(r)]² ≥ 0. Let γ_X(r) = ln g_X(r) and show that γ''_X(r) ≥ 0. Hint: Choose c = g'_X(r)/g_X(r).

d) Assume that X is non-deterministic, i.e., that there is no value of α such that Pr{X = α} = 1. Show that the inequality sign "≥" may be replaced by ">" everywhere in a), b), and c).
Exercise 1.27. A computer system has n users, each with a unique name and password. Due to a software error, the n passwords are randomly permuted internally (i.e., each of the n! possible permutations is equally likely). Only those users lucky enough to have had their passwords unchanged in the permutation are able to continue using the system.
a) What is the probability that a particular user, say user 1, is able to continue using the
system?
b) What is the expected number of users able to continue using the system? Hint: Let X i
be a rv with the value 1 if user i can use the system and 0 otherwise.
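The expectation in part b) can be computed exactly for small n by enumerating all permutations (a check of ours):

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    # Average number of fixed points over all n! permutations, exactly.
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(p) if i == v) for p in perms)
    return Fraction(total, len(perms))

print([expected_fixed_points(n) for n in range(1, 7)])   # the same for every n
```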
Exercise 1.28. Suppose the rv X is continuous and has the distribution function FX ( x ).
Consider another rv Y = FX ( X ). That is, for each sample point ω such that X ( ω) = x , we
have Y ( ω) = FX ( x ). Show that Y is uniformly distributed in the interval 0 to 1.
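A simulation sketch (ours; the exponential distribution is an arbitrary choice) shows the claim empirically before you prove it:

```python
import math
import random

random.seed(3)

trials = 100_000
# X exponential with rate 1, so F_X(x) = 1 - e^{-x}; Y = F_X(X).
ys = [1 - math.exp(-random.expovariate(1.0)) for _ in range(trials)]

mean_y = sum(ys) / trials
frac_below_quarter = sum(1 for y in ys if y <= 0.25) / trials
print(mean_y, frac_below_quarter)   # both consistent with Y uniform on (0,1)
```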
Exercise 1.29. Let Z be an integer-valued rv with the PMF p_Z(n) = 1/k for 0 ≤ n ≤ k − 1.
Find the mean, variance, and moment generating function of Z . Hint: An elegant way to
do this is to let U be a uniformly distributed continuous rv over (0 , 1] that is independent
of Z . Then U + Z is uniform over (0,k ]. Use the known results about U and U + Z to find
the mean, variance, and MGF for Z .
Exercise 1.30. (Alternate approach 1 to the Markov inequality) a) Let Y be a nonnegative rv and y > 0 be some fixed number. Let A be the event that Y ≥ y. Show that y I_A ≤ Y (i.e., that this inequality is satisfied for every ω ∈ Ω).
b) Use your result in part a) to prove the Markov inequality.
Exercise 1.31. (Alternate approach 2 to the Markov inequality) a) Minimize E[Y] over all nonnegative rv's such that Pr{Y ≥ b} ≥ β for some given b > 0 and 0 < β < 1. Hint: Use a graphical argument similar to that in Figure 1.7. What is the rv that achieves the minimum? Hint: It is binary.
b) Use part a) to prove the Markov inequality and also point out the distribution that
meets the inequality with equality.
Exercise 1.32. (The one-sided Chebyshev inequality) This inequality states that if a zero-mean rv X has a variance σ², then it satisfies the inequality

Pr{X ≥ b} ≤ σ²/(σ² + b²) for every b > 0,   (1.111)

with equality for some b only if X is binary and Pr{X = b} = σ²/(σ² + b²). We prove this here using the same approach as in Exercise 1.31. Let X be a zero-mean rv that satisfies Pr{X ≥ b} = β for some b > 0 and 0 < β < 1. The variance σ² of X can be expressed as

σ² = ∫_{−∞}^{b−} x² f_X(x) dx + ∫_{b}^{∞} x² f_X(x) dx.   (1.112)

We will first minimize σ² over all zero-mean X satisfying Pr{X ≥ b} = β.

a) Show that the second integral in (1.112) satisfies ∫_b^∞ x² f_X(x) dx ≥ b²β.

b) Show that the first integral in (1.112) is constrained by

∫_{−∞}^{b−} f_X(x) dx = 1 − β  and  ∫_{−∞}^{b−} x f_X(x) dx ≤ −bβ.

c) Minimize the first integral in (1.112) subject to the constraints in part b). Hint: If you scale f_X(x) up by 1/(1 − β), it integrates to 1 over (−∞, b) and the second constraint becomes an expectation. You can then minimize the first integral in (1.112) by inspection.

d) Combine the results in a) and c) to show that σ² ≥ b²β/(1 − β). Find the minimizing distribution. Hint: It is binary.

e) Use part d) to establish (1.111). Also show (trivially) that if Y has a mean Ȳ and variance σ², then Pr{Y − Ȳ ≥ b} ≤ σ²/(σ² + b²).
Exercise 1.33. (Proof of (1.51)) Here we show that if X is a zero-mean rv with a variance σ², then the median α satisfies |α| ≤ σ.

a) First show that |α| ≤ σ for the special case where X is binary with equiprobable values at ±σ.

b) For all zero-mean rv's X with variance σ² other than the special case in a), show that

Pr{X ≥ σ} < 0.5.

Hint: Use the one-sided Chebyshev inequality of Exercise 1.32.

c) Show that Pr{X ≥ α} ≥ 0.5. Other than the special case in a), show that this implies that α < σ.

d) Other than the special case in a), show that |α| < σ. Hint: repeat b) and c) for the rv −X. You have then shown that |α| ≤ σ with equality only for the binary case with values ±σ. For rv's Y with a non-zero mean, this shows that |α − Ȳ| ≤ σ.
Exercise 1.34. We stressed the importance of the mean of a rv X in terms of its association with the sample average via the WLLN. Here we show that in essence the WLLN allows us to evaluate the entire distribution function, say F_X(x), of X via sufficiently many independent sample values of X.

a) For any given y, let I_j(y) be the indicator function of the event {X_j ≤ y} where X_1, X_2, ..., X_j, ... are IID rv's with the distribution function F_X(x). State the WLLN for the IID rv's {I_1(y), I_2(y), ...}.

b) Does the answer to part a) require X to have a mean or variance?

c) Suggest a procedure for evaluating the median of X from the sample values of X_1, X_2, .... Assume that X is a continuous rv. You need not be precise, but try to think the issue through carefully.

What you have seen here, without stating it precisely or proving it, is that the median has a law of large numbers associated with it, saying that the sample median of n IID samples of a rv is close to the true median with high probability.
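A sketch of the part c) procedure (ours), using exponential samples, whose true median is ln 2:

```python
import math
import random

random.seed(4)

n = 100_001   # odd, so the sample median is the middle order statistic
samples = sorted(random.expovariate(1.0) for _ in range(n))
sample_median = samples[n // 2]
print(sample_median, math.log(2))   # close with high probability, per the WLLN idea
```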
Exercise 1.35. (Alternative proof of Theorem 1.5.3) a) Let p̃ = k/n, q̃ = 1 − p̃, and ε = p̃ − p. Start with (1.24) and use the expansion ln(1 + x) = x − x²/2 + ··· to express p_{S_n}(k) as

p_{S_n}(k) = [1/√(2πn p̃ q̃)] exp(−nε²/(2 p̃ q̃) + ···),

where the omitted terms are of order nε³.

b) For α < 2/3, show that

p_{S_n}(k) ∼ [1/√(2πn p̃ q̃)] exp(−(k − np)²/(2npq)) for |k − np| ≤ n^α,

where a(n) ∼ b(n) means that lim_{n→∞} a(n)/b(n) = 1.

c) Show that for |k − np| ≤ n^α,

p_{S_n}(k) ∼ [1/√(2πnpq)] exp(−(k − np)²/(2npq)).   (1.113)
Exercise 1.36. a) Show that for any 0 < k < n,

(n choose k+1) ≤ ((n − k)/k) (n choose k).

b) Extend part a) to show that, for all ℓ ≤ n − k,

(n choose k+ℓ) ≤ ((n − k)/k)^ℓ (n choose k).

c) Let p̃ = k/n and q̃ = 1 − p̃. Let S_n be the sum of n binary IID rv's with p_X(0) = q and p_X(1) = p. Show that for all ℓ ≤ n − k,

p_{S_n}(k + ℓ) ≤ (q̃p/(p̃q))^ℓ p_{S_n}(k).

d) For k/n > p, show that Pr{S_n ≥ k} ≤ (p̃q/(p̃ − p)) p_{S_n}(k).

e) Now let ℓ be fixed and k = ⌈np̃⌉ for fixed p̃ such that 1 > p̃ > p. Argue that as n → ∞,

p_{S_n}(k + ℓ) ∼ (q̃p/(p̃q))^ℓ p_{S_n}(k)  and  Pr{S_n ≥ k} ∼ (p̃q/(p̃ − p)) p_{S_n}(k),

where a(n) ∼ b(n) means that lim_{n→∞} a(n)/b(n) = 1.
Exercise 1.37. A sequence {a_n; n ≥ 1} of real numbers has the limit 0 if for all ε > 0, there is an m(ε) such that |a_n| ≤ ε for all n ≥ m(ε). Show that the sequences in parts a) and b) below satisfy lim_{n→∞} a_n = 0 but the sequence in part c) does not have a limit.

a) a_n = 1/ln(ln(n + 1)).

b) a_n = n^{10} exp(−n).

c) a_n = 1 for n = 10^ℓ for each positive integer ℓ and a_n = 0 otherwise.

d) Show that the definition can be changed (with no change in meaning) by replacing ε with either 1/k or 2^{−k} for every positive integer k.
Exercise 1.38. Consider the moment generating function of a rv X as consisting of the following two integrals:

g_X(r) = ∫_{−∞}^{0} e^{rx} dF(x) + ∫_{0}^{∞} e^{rx} dF(x).

In each of the following parts, you are welcome to restrict X to be either discrete or continuous.

a) Show that the first integral always exists (i.e., is finite) for r ≥ 0 and that the second integral always exists for r ≤ 0.

b) Show that if the second integral exists for a given r_1 > 0, then it also exists for all r in the range 0 ≤ r ≤ r_1.

c) Show that if the first integral exists for a given r_2 < 0, then it also exists for all r in the range r_2 ≤ r ≤ 0.

d) Show that the range of r over which g_X(r) exists is an interval from some r_2 ≤ 0 to some r_1 ≥ 0 (the interval might or might not include each endpoint, and either or both endpoints might be 0 or ∞).

e) Find an example where r_1 = 1 and the MGF does not exist for r = 1. Find another example where r_1 = 1 and the MGF does exist for r = 1. Hint: Consider f_X(x) = e^{−x} for x ≥ 0 and figure out how to modify it to f_Y(y) so that ∫_0^∞ e^y f_Y(y) dy < ∞ but ∫_0^∞ e^{y+εy} f_Y(y) dy = ∞ for all ε > 0.
Exercise 1.39. Let {X_n; n ≥ 1} be a sequence of independent but not identically distributed rv's. We say that the weak law of large numbers (WLLN) holds for this sequence if for all ε > 0

lim_{n→∞} Pr{|S_n/n − E[S_n]/n| ≥ ε} = 0, where S_n = X_1 + X_2 + ··· + X_n.   (WL)

a) Show that the WLLN holds if there is some constant A such that σ²_{X_n} ≤ A for all n.

b) Suppose that σ²_{X_n} ≤ A n^{1−α} for some α < 1 and for all n. Show that the WLLN holds in this case.
Exercise 1.40. Let {X_i; i ≥ 1} be IID binary rv's. Let Pr{X_i = 1} = δ, Pr{X_i = 0} = 1 − δ. Let S_n = X_1 + ··· + X_n. Let m be an arbitrary but fixed positive integer. Think! then evaluate the following and explain your answers:

a) lim_{n→∞} Σ_{i : nδ−m ≤ i ≤ nδ+m} Pr{S_n = i}

b) lim_{n→∞} Σ_{i : 0 ≤ i ≤ nδ+m} Pr{S_n = i}

c) lim_{n→∞} Σ_{i : n(δ−1/m) ≤ i ≤ n(δ+1/m)} Pr{S_n = i}.
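Before taking limits, it may help to evaluate the three sums at one large but finite n (our choices: n = 10^4, δ = 0.3, m = 5), with the binomial PMF computed through log-gamma to avoid overflow:

```python
from math import lgamma, log, exp, floor, ceil

n, delta, m = 10_000, 0.3, 5

def pmf(i):
    # Binomial(n, delta) PMF via log-gamma, numerically stable for large n
    return exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
               + i * log(delta) + (n - i) * log(1 - delta))

def window(lo, hi):
    lo, hi = max(0, ceil(lo)), min(n, floor(hi))
    return sum(pmf(i) for i in range(lo, hi + 1))

a = window(n * delta - m, n * delta + m)
b = window(0, n * delta + m)
c = window(n * (delta - 1 / m), n * (delta + 1 / m))
print(a, b, c)   # compare with your three limits
```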
Exercise 1.41. Use the Berry-Esseen result, (1.88), to prove the WLLN under the restriction that E[|X|³] exists. Note: This is not intended as a reasonable way to prove the WLLN. Rather, it is to better understand what the convergence result of (1.88) implies. It appears that the CLT, without something extra about convergence, does not establish the WLLN.
Exercise 1.42. (Details in the proof of Theorem 1.5.4)

a) Show that if X_1, X_2, ... are IID, then the truncated versions X̆_1, X̆_2, ... are also IID.

b) Show that each X̆_i has a finite mean E[X̆] and finite variance σ²_X̆. Show that the variance is upper bounded by the second moment around the original mean X̄, i.e., show that σ²_X̆ ≤ E[|X̆ − E[X]|²].
c) Assume that X̆_i is X_i truncated to X̄ ± b. Show that |X̆ − X̄| ≤ b and that |X̆ − X̄| ≤ |X − X̄|. Use this to show that σ²_X̆ ≤ b E[|X̆ − X̄|] ≤ 2b E[|X|].

d) Let S̆_n = X̆_1 + ··· + X̆_n and show that for any ε > 0,

Pr{ |S̆_n/n − E[X̆]| ≥ ε/2 } ≤ 8b E[|X|] / (nε²).

e) Sketch the form of F_{X̆−X̄}(x) and use this, along with (1.37), to show that for all sufficiently large b, |E[X̆] − E[X]| ≤ ε/2. Use this to show that

Pr{ |S̆_n/n − E[X]| ≥ ε } ≤ 8b E[|X|] / (nε²) for all large enough b.

f) Use the following equation to justify (1.109).

Pr{ |S_n/n − E[X]| > ε } = Pr{ |S_n/n − E[X]| > ε ∩ S_n = S̆_n } + Pr{ |S_n/n − E[X]| > ε ∩ S_n ≠ S̆_n }.
Exercise 1.43. Let {X_i; i ≥ 1} be IID rv's with mean 0 and infinite variance. Assume that E[|X_i|^{1+h}] = β for some given h, 0 < h < 1, and some finite β. Let S_n = X_1 + ··· + X_n.

a) Show that Pr{|X_i| ≥ y} ≤ β y^{−1−h}.

b) Let {X̆_i; i ≥ 1} be truncated variables, with X̆_i = b for X_i ≥ b; X̆_i = X_i for −b ≤ X_i ≤ b; X̆_i = −b for X_i ≤ −b. Show that E[X̆²] ≤ 2βb^{1−h}/(1 − h). Hint: For a nonnegative rv Z, E[Z²] = ∫_0^∞ 2z Pr{Z ≥ z} dz (you can establish this, if you wish, by integration by parts).

c) Let S̆_n = X̆_1 + ··· + X̆_n. Show that Pr{S_n ≠ S̆_n} ≤ nβ b^{−1−h}.

d) Show that

Pr{ |S_n/n| ≥ ε } ≤ 2βb^{1−h} / ((1 − h) n ε²) + nβ b^{−1−h}.

e) Optimize your bound with respect to b. How fast does this optimized bound approach 0 with increasing n?
Exercise 1.44. (MS convergence ⟹ convergence in probability) Assume that {Z_n; n ≥ 1} is a sequence of rv's and α is a number with the property that lim_{n→∞} E[(Z_n − α)²] = 0.

a) Let ε > 0 be arbitrary and show that for each n ≥ 0,

Pr{ |Z_n − α| ≥ ε } ≤ E[(Z_n − α)²] / ε².

b) For the ε above, let δ > 0 be arbitrary. Show that there is an integer m such that E[(Z_n − α)²] ≤ ε²δ for all n ≥ m.

c) Show that this implies convergence in probability.
Exercise 1.45. Let X_1, X_2, ... be a sequence of IID rv's each with mean 0 and variance σ². Let S_n = X_1 + ··· + X_n for all n and consider the random variable S_n/(σ√n) − S_{2n}/(σ√(2n)). Find the limiting distribution function for this sequence of rv's as n → ∞. The point of this exercise is to see clearly that the distribution function of S_n/(σ√n) is converging but that the sequence of rv's is not converging.
Exercise 1.46. Use Figure 1.7 to verify (1.60). Hint: Show that y Pr{Y ≥ y} ≤ ∫_{z≥y} z dF_Y(z) and show that lim_{y→∞} ∫_{z≥y} z dF_Y(z) = 0 if E[Y] is finite.
Exercise 1.47. Show that ∏_{m≥n} (1 − 1/m) = 0. Hint: Show that

1 − 1/m = exp(ln(1 − 1/m)) ≤ exp(−1/m).
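The hint gives a clean proof; as a sanity check (ours), the partial products ∏_{m=n}^{M} (1 − 1/m) can also be computed exactly and watched go to 0:

```python
from fractions import Fraction

def partial_product(n, M):
    prod = Fraction(1)
    for m in range(n, M + 1):
        prod *= 1 - Fraction(1, m)
    return prod

n = 5
for M in (10, 100, 1000, 10000):
    p = partial_product(n, M)
    assert p == Fraction(n - 1, M)   # the partial product telescopes to (n-1)/M
    print(M, float(p))               # tends to 0 as M grows
```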
Exercise 1.48. Consider a discrete rv X with the PMF

p_X(−1) = (1 − 10^{−10})/2,
p_X(1) = (1 − 10^{−10})/2,
p_X(10^{12}) = 10^{−10}.

a) Find the mean and variance of X. Assuming that {X_m; m ≥ 1} is an IID sequence with the distribution of X and that S_n = X_1 + ··· + X_n for each n, find the mean and variance of S_n. (No explanations needed.)

b) Let n = 10^6 and describe the event {S_n ≤ 10^6} in words. Find an exact expression for Pr{S_n ≤ 10^6} = F_{S_n}(10^6).

c) Find a way to use the union bound to get a simple upper bound and approximation of 1 − F_{S_n}(10^6).

d) Sketch the distribution function of S_n for n = 10^6. You can choose the horizontal axis for your sketch to go from −1 to +1 or from −3×10³ to 3×10³ or from −10^6 to 10^6 or from 0 to 10^{12}, whichever you think will best describe this distribution function.

e) Now let n = 10^{10}. Give an exact expression for Pr{S_n ≤ 10^{10}} and show that this can be approximated by e^{−1}. Sketch the distribution function of S_n for n = 10^{10}, using a horizontal axis going from slightly below 0 to slightly more than 2×10^{12}. Hint: First view S_n as conditioned on an appropriate rv.

f) Can you make a qualitative statement about how the distribution function of a rv X affects the required size of n before the WLLN and the CLT provide much of an indication about S_n?