
Measure, Integral, and Probability

Springer Undergraduate Mathematics Series
Advisory Board
M.A.J. Chaplain University of Dundee
K. Erdmann Oxford University
L.C.G. Rogers Cambridge University
E. Süli Oxford University
J.F. Toland University of Bath
Other books in this series
A First Course in Discrete Mathematics I. Anderson
Analytic Methods for Partial Differential Equations G. Evans, J. Blackledge, P. Yardley
Applied Geometry for Computer Graphics and CAD, Second Edition D. Marsh
Basic Linear Algebra, Second Edition T.S. Blyth and E.F. Robertson
Basic Stochastic Processes Z. Brzeźniak and T. Zastawniak
Calculus of One Variable K.E. Hirst
Complex Analysis J.M. Howie
Elementary Differential Geometry A. Pressley
Elementary Number Theory G.A. Jones and J.M. Jones
Elements of Abstract Analysis M. Ó Searcóid
Elements of Logic via Numbers and Sets D.L. Johnson
Essential Mathematical Biology N.F. Britton
Essential Topology M.D. Crossley
Fields, Flows and Waves: An Introduction to Continuum Models D.F. Parker
Further Linear Algebra T.S. Blyth and E.F. Robertson
Geometry R. Fenn
Groups, Rings and Fields D.A.R. Wallace
Hyperbolic Geometry, Second Edition J.W. Anderson
Information and Coding Theory G.A. Jones and J.M. Jones
Introduction to Laplace Transforms and Fourier Series P.P.G. Dyke
Introduction to Ring Theory P.M. Cohn
Introductory Mathematics: Algebra and Analysis G. Smith
Linear Functional Analysis B.P. Rynne and M.A. Youngson
Mathematics for Finance: An Introduction to Financial Engineering M. Capiński and
T. Zastawniak
Matrix Groups: An Introduction to Lie Group Theory A. Baker
Multivariate Calculus and Geometry, Second Edition S. Dineen
Numerical Methods for Partial Differential Equations G. Evans, J. Blackledge, P. Yardley
Probability Models J. Haigh
Real Analysis J.M. Howie
Sets, Logic and Categories P. Cameron
Special Relativity N.M.J. Woodhouse
Symmetries D.L. Johnson
Topics in Group Theory G. Smith and O. Tabachnikova
Vector Calculus P.C. Matthews
Marek Capiński and Ekkehard Kopp
Measure, Integral
and Probability
Second Edition
With 23 Figures
Springer
Marek Capiński, PhD
Institute of Mathematics, Jagiellonian University, Reymonta 4, 30-059 Kraków, Poland
Peter Ekkehard Kopp, DPhil
Department of Mathematics, University of Hull, Cottingham Road, Hull, HU6 7RX, UK
Cover illustration elements reproduced by kind permission of:
Aptech Systems, Inc., Publishers of the GAUSS Mathematical and Statistical System, 23804 S.E. Kent-Kangley Road, Maple Valley, WA
98038, USA. Tel: (206) 432-7855 Fax: (206) 432-7832 email: info@aptech.com URL: www.aptech.com.
American Statistical Association: Chance Vol 8 No 1 1995 article by KS and KW Heiner "Tree Rings of the Northern Shawangunks"
page 32 fig. 2.
Springer-Verlag: Mathematica in Education and Research Vol 4 Issue 3 1995 article by Roman E. Maeder, Beatrice Amrhein and
Oliver Gloor "Illustrated Mathematics: Visualization of Mathematical Objects" page 9 fig II, originally published as a CD ROM
"Illustrated Mathematics" by TELOS: ISBN 0-387-14222-3, German edition by Birkhäuser: ISBN-13:978-1-85233-781-0
Mathematica in Education and Research Vol 4 Issue 3 1995 article by Richard J Gaylord and Kazume Nishidate "Traffic Engineering
with Cellular Automata" page 35 fig. 2. Mathematica in Education and Research Vol 5 Issue 2 1996 article by Michael Trott "The
Implicitization of a Trefoil Knot" page 14.
Mathematica in Education and Research Vol 5 Issue 2 1996 article by Lee de Cola "Coins, Trees, Bars and Bells: Simulation of the
Binomial Process" page 19 fig. 3. Mathematica in Education and Research Vol 5 Issue 2 1996 article by Richard Gaylord and
Kazume Nishidate "Contagious Spreading" page 33 fig 1. Mathematica in Education and Research Vol 5 Issue 2 1996 article by Joe
Buhler and Stan Wagon "Secrets of the Madelung Constant" page 50 fig 1.
British Library Cataloguing in Publication Data
Capiński, Marek, 1951-
Measure, integral and probability. - 2nd ed. - (Springer
undergraduate mathematics series)
1. Lebesgue integral 2. Measure theory 3. Probabilities
I. Title II. Kopp, P. E., 1944-
515.4'2
Library of Congress Cataloging-in-Publication Data
Capiński, Marek, 1951-
Measure, integral and probability / Marek Capiński and Ekkehard Kopp. - 2nd ed.
p. cm. - (Springer undergraduate mathematics series, ISSN 1615-2085)
Includes bibliographical references and index.
ISBN-13: 978-1-85233-781-0
1. Measure theory. 2. Integrals, Generalized. 3. Probabilities. I. Kopp, P. E., 1944- II.
Title. III. Series.
QA312.C36 2004
515'.42-dc22
2004040204
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under
the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in
any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic
reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
Springer Undergraduate Mathematics Series ISSN 1615-2085
ISBN-13:978-1-85233-781-0
e-ISBN-13: 978-1-4471-0645-6
DOI: 10.1007/978-1-4471-0645-6
Springer Science+Business Media
springeronline.com
© Springer-Verlag London Limited 2004
2nd printing 2005
Reprint of the original edition 2005
The use of registered names, trademarks etc. in this publication does not imply, even in the absence of a specific
statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be
made.
Typesetting: Electronic files provided by the authors
12/3830-54321
Printed on acid-free paper
SPIN 11494904
To our children and grandchildren:
Piotr, Maciej, Jan, Anna; Łukasz,
Anna, Emily
Preface to the second edition
After five years and six printings it seems only fair to our readers that we
should respond to their comments and also correct errors and imperfections to
which we have been alerted in addition to those we have discovered ourselves
in reviewing the text. This second edition also introduces additional material
which earlier constraints of time and space had precluded, and which has, in
our view, become more essential as the make-up of our potential readership has
become clearer. We hope that we manage to do this in a spirit which preserves
the essential features of the text, namely providing the material rigorously and
in a form suitable for directed self-study. Thus the focus remains on accessibility, explicitness and emphasis on concrete examples, in a style that seeks to
encourage readers to become directly involved with the material and challenges
them to prove many of the results themselves (knowing that solutions are also
given in the text!).
Apart from further examples and exercises, the new material presented here
is of two contrasting kinds. The new Chapter 7 adds a discussion of the comparison of general measures, with the Radon-Nikodym theorem as its focus.
The proof given here, while not new, is in our view more constructive and
elementary than the usual ones, and we utilise the result consistently to examine the structure of Lebesgue-Stieltjes measures on the line and to deduce
the Hahn-Jordan decomposition of signed measures. The common origin of
the concepts of variation and absolute continuity of functions and measures is
clarified. The main probabilistic application is to conditional expectations, for
which an alternative construction via orthogonal projections is also provided in
Chapter 5. This is applied in turn in Chapter 7 to derive elementary properties
of martingales in discrete time.
The other addition occurs at the end of each chapter (with the exception of
Chapters 1 and 5). Since it is clear that a significant proportion of our current
readership is among students of the burgeoning field of mathematical finance,
each relevant chapter ends with a brief discussion of ideas from that subject. In
these sections we depart from our aim of keeping the book self-contained, since
we can hardly develop this whole discipline afresh. Thus we neither define
nor explain the origin of the finance concepts we address, but instead seek
to locate them mathematically within the conceptual framework of measure
and probability. This leads to conclusions with a mathematical precision that
sometimes eludes authors writing from a finance perspective.
To avoid misunderstanding we repeat that the purpose of this book remains the development of the ideas of measure and integral, especially with a
view to their applications in probability and (briefly) in finance. This is therefore neither a textbook in probability theory nor one in mathematical finance.
Both of these disciplines have a large specialist literature of their own, and our
comments on these areas of application are intended to assist the student in
understanding the mathematical framework which underpins them.
We are grateful to those of our readers and to colleagues who have pointed
out many of the errors, both typographical and conceptual, of the first edition.
The errors that inevitably remain are our sole responsibility. To facilitate their
speedy correction a webpage has been created for the notification of errors,
inaccuracies and queries, at http://www.springeronline.com/1-85233-781-8 and
we encourage our readers to use it mercilessly. Our thanks also go to Stephanie
Harding and Karen Borthwick at Springer-Verlag, London, for their continuing
care and helpfulness in producing this edition.
Kraków, Poland
Hull, UK
October 2003
Marek Capiński
Ekkehard Kopp
Preface to the first edition
The central concepts in this book are Lebesgue measure and the Lebesgue
integral. Their role as standard fare in UK undergraduate mathematics courses
is not wholly secure; yet they provide the principal model for the development of
the abstract measure spaces which underpin modern probability theory, while
the Lebesgue function spaces remain the main source of examples on which
to test the methods of functional analysis and its many applications, such as
Fourier analysis and the theory of partial differential equations.
It follows that not only budding analysts have need of a clear understanding
of the construction and properties of measures and integrals, but also that those
who wish to contribute seriously to the applications of analytical methods in
a wide variety of areas of mathematics, physics, electronics, engineering and,
most recently, finance, need to study the underlying theory with some care.
We have found remarkably few texts in the current literature which aim
explicitly to provide for these needs, at a level accessible to current undergraduates. There are many good books on modern probability theory, and increasingly they recognise the need for a strong grounding in the tools we develop in
this book, but all too often the treatment is either too advanced for an undergraduate audience or else somewhat perfunctory. We hope therefore that the
current text will not be regarded as one which fills a much-needed gap in the
literature!
One fundamental decision in developing a treatment of integration is
whether to begin with measures or integrals, i.e. whether to start with sets or
with functions. Functional analysts have tended to favour the latter approach,
while the former is clearly necessary for the development of probability. We
have decided to side with the probabilists in this argument, and to use the
(reasonably) systematic development of basic concepts and results in probability theory as the principal field of application - the order of topics and the
terminology we use reflect this choice, and each chapter concludes with further
development of the relevant probabilistic concepts. At times this approach may
seem less 'efficient' than the alternative, but we have opted for direct proofs
and explicit constructions, sometimes at the cost of elegance. We hope that it
will increase understanding.
The treatment of measure and integration is as self-contained as we could
make it within the space and time constraints: some sections may seem too
pedestrian for final-year undergraduates, but experience in testing much of the
material over a number of years at Hull University teaches us that familiarity and confidence with basic concepts in analysis can frequently seem somewhat shaky among these audiences. Hence the preliminaries include a review
of Riemann integration, as well as a reminder of some fundamental concepts of
elementary real analysis.
While probability theory is chosen here as the principal area of application
of measure and integral, this is not a text on elementary probability, of which
many can be found in the literature.
Though this is not an advanced text, it is intended to be studied (not
skimmed lightly) and it has been designed to be useful for directed self-study
as well as for a lecture course. Thus a significant proportion of results, labelled
'Proposition', are not proved immediately, but left for the reader to attempt
before proceeding further (often with a hint on how to begin), and there is a
generous helping of exercises. To aid self-study, proofs of the propositions are
given at the end of each chapter, and outline solutions of the exercises are given
at the end of the book. Thus few mysteries should remain for the diligent.
After an introductory chapter, motivating and preparing for the principal
definitions of measure and integral, Chapter 2 provides a detailed construction
of Lebesgue measure and its properties, and proceeds to abstract the axioms appropriate for probability spaces. This sets a pattern for the remaining chapters,
where the concept of independence is pursued in ever more general contexts,
as a distinguishing feature of probability theory.
Chapter 3 develops the integral for non-negative measurable functions, and
introduces random variables and their induced probability distributions, while
Chapter 4 develops the main limit theorems for the Lebesgue integral and compares this with Riemann integration. The applications in probability lead to a
discussion of expectations, with a focus on densities and the role of characteristic functions.
In Chapter 5 the motivation is more functional-analytic: the focus is on the
Lebesgue function spaces, including a discussion of the special role of the space
L2 of square-integrable functions. Chapter 6 sees a return to measure theory,
with the detailed development of product measure and Fubini's theorem, now
leading to the role of joint distributions and conditioning in probability. Finally,
following a discussion of the principal modes of convergence for sequences of
integrable functions, Chapter 7 adopts an unashamedly probabilistic bias, with
a treatment of the principal limit theorems, culminating in the Lindeberg-Feller
version of the central limit theorem.
The treatment is by no means exhaustive, as this is a textbook, not a
treatise. Nonetheless the range of topics is probably slightly too extensive for
a one-semester course at third-year level: the first five chapters might provide
a useful course for such students, with the last two left for self-study or as
part of a reading course for students wishing to continue in probability theory.
Alternatively, students with a stronger preparation in analysis might use the
first two chapters as background material and complete the remainder of the
book in a one-semester course.
Kraków, Poland
Hull, UK
May 1998
Marek Capiński
Ekkehard Kopp
Contents

1. Motivation and preliminaries ............................... 1
1.1 Notation and basic set theory ............................. 2
1.1.1 Sets and functions ..................................... 2
1.1.2 Countable and uncountable sets in ℝ .................... 4
1.1.3 Topological properties of sets in ℝ .................... 5
1.2 The Riemann integral: scope and limitations ............... 7
1.3 Choosing numbers at random ............................... 12

2. Measure ................................................... 15
2.1 Null sets ................................................ 15
2.2 Outer measure ............................................ 20
2.3 Lebesgue-measurable sets and Lebesgue measure ............. 26
2.4 Basic properties of Lebesgue measure ...................... 35
2.5 Borel sets ............................................... 40
2.6 Probability .............................................. 45
2.6.1 Probability space ...................................... 46
2.6.2 Events: conditioning and independence .................. 46
2.6.3 Applications to mathematical finance ................... 49
2.7 Proofs of propositions .................................... 51

3. Measurable functions ...................................... 55
3.1 The extended real line .................................... 55
3.2 Lebesgue-measurable functions ............................. 55
3.3 Examples ................................................. 59
3.4 Properties ............................................... 60
3.5 Probability .............................................. 66
3.5.1 Random variables ....................................... 66
3.5.2 σ-fields generated by random variables ................. 67
3.5.3 Probability distributions .............................. 68
3.5.4 Independence of random variables ....................... 70
3.5.5 Applications to mathematical finance ................... 70
3.6 Proofs of propositions .................................... 73

4. Integral .................................................. 75
4.1 Definition of the integral ................................ 75
4.2 Monotone convergence theorems ............................. 82
4.3 Integrable functions ...................................... 86
4.4 The dominated convergence theorem ......................... 92
4.5 Relation to the Riemann integral .......................... 97
4.6 Approximation of measurable functions .................... 102
4.7 Probability ............................................. 105
4.7.1 Integration with respect to probability distributions .. 105
4.7.2 Absolutely continuous measures: examples of densities .. 107
4.7.3 Expectation of a random variable ...................... 114
4.7.4 Characteristic function ............................... 115
4.7.5 Applications to mathematical finance .................. 117
4.8 Proofs of propositions ................................... 119

5. Spaces of integrable functions ........................... 125
5.1 The space L¹ ............................................ 126
5.2 The Hilbert space L² .................................... 131
5.2.1 Properties of the L²-norm ............................. 132
5.2.2 Inner product spaces .................................. 135
5.2.3 Orthogonality and projections ......................... 137
5.3 The Lᵖ spaces: completeness ............................. 140
5.4 Probability ............................................. 146
5.4.1 Moments ............................................... 146
5.4.2 Independence .......................................... 150
5.4.3 Conditional expectation (first construction) .......... 153
5.5 Proofs of propositions ................................... 155

6. Product measures ......................................... 159
6.1 Multi-dimensional Lebesgue measure ....................... 159
6.2 Product σ-fields ........................................ 160
6.3 Construction of the product measure ...................... 162
6.4 Fubini's theorem ........................................ 169
6.5 Probability ............................................. 173
6.5.1 Joint distributions ................................... 173
6.5.2 Independence again .................................... 175
6.5.3 Conditional probability ............................... 177
6.5.4 Characteristic functions determine distributions ...... 180
6.5.5 Application to mathematical finance ................... 182
6.6 Proofs of propositions ................................... 185

7. The Radon-Nikodym theorem ................................ 187
7.1 Densities and conditioning ............................... 187
7.2 The Radon-Nikodym theorem ............................... 188
7.3 Lebesgue-Stieltjes measures ............................. 199
7.3.1 Construction of Lebesgue-Stieltjes measures ........... 199
7.3.2 Absolute continuity of functions ...................... 204
7.3.3 Functions of bounded variation ........................ 206
7.3.4 Signed measures ....................................... 210
7.3.5 Hahn-Jordan decomposition ............................. 216
7.4 Probability ............................................. 218
7.4.1 Conditional expectation relative to a σ-field ......... 218
7.4.2 Martingales ........................................... 222
7.4.3 Doob decomposition .................................... 226
7.4.4 Applications to mathematical finance .................. 232
7.5 Proofs of propositions ................................... 235

8. Limit theorems ........................................... 241
8.1 Modes of convergence ..................................... 241
8.2 Probability ............................................. 243
8.2.1 Convergence in probability ............................ 245
8.2.2 Weak law of large numbers ............................. 249
8.2.3 The Borel-Cantelli lemmas ............................. 255
8.2.4 Strong law of large numbers ........................... 260
8.2.5 Weak convergence ...................................... 268
8.2.6 Central limit theorem ................................. 273
8.2.7 Applications to mathematical finance .................. 280
8.3 Proofs of propositions ................................... 283

Solutions .................................................... 287
Appendix ..................................................... 301
References ................................................... 305
Index ........................................................ 307
1. Motivation and preliminaries
Life is an uncertain business. We can seldom be sure that our plans will work
out as we intend, and are thus conditioned from an early age to think in terms
of the likelihood that certain events will occur, and which are 'more likely'
than others. Turning this vague description into a probability model amounts
to the construction of a rational framework for thinking about uncertainty.
The framework ought to be a general one, which enables us to handle equally
situations where we have to sift a great deal of prior information, and those
where we have little to go on. Some degree of judgement is needed in all cases;
but we seek an orderly theoretical framework and methodology which enables
us to formulate general laws in quantitative terms.
This leads us to mathematical models for probability, that is to say, idealised
abstractions of empirical practice, which nonetheless have to satisfy the criteria
of wide applicability, accuracy and simplicity. In this book our concern will be
with the construction and use of generally applicable probability models in
which we can also consider infinite sample spaces and infinite sequences of
trials: that such are needed is easily seen when one tries to make sense of
apparently simple concepts such as 'drawing a number at random from the
interval [0,1]' and in trying to understand the limit behaviour of a sequence
of identical trials. Just as elementary probabilities are computed by finding
the comparative sizes of sets of outcomes, we will find that the fundamental
problem to be solved is that of measuring the 'size' of a set with infinitely many
elements. At least for sets on the real line, the ideas of basic real analysis provide
us with a convincing answer, and this contains all the ideas needed for the
abstract axiomatic framework on which to base the theory of probability. For
this reason the development of the concept of measure, and Lebesgue measure
on ℝ in particular, has pride of place in this book.
1.1 Notation and basic set theory
In measure theory we deal typically with families of subsets of some arbitrary
given set and consider functions which assign real numbers to sets belonging to
these families. Thus we need to review some basic set notation and operations
on sets, as well as discussing the distinction between countably and uncountably
infinite sets, with particular reference to subsets of the real line ℝ. We shall also
need notions from analysis such as limits of sequences, series, and open sets.
Readers are assumed to be largely familiar with this material and may thus skip
lightly over this section, which is included to introduce notation and make the
text reasonably self-contained and hence useful for self-study. The discussion
remains quite informal, without reference to foundational issues, and the reader
is referred to basic texts on analysis for most of the proofs. Here we mention
just two introductory textbooks: [8] and [11].
1.1.1 Sets and functions
In our operations with sets we shall always deal with collections of subsets
of some universal set Ω; the nature of this set will be clear from the context
- frequently Ω will be the set ℝ of real numbers or a subset of it. We leave
the concept of 'set' as undefined and given, and concern ourselves only with
set membership and operations. The empty set is denoted by ∅; it has no
members. Sets are generally denoted by capital letters.
Set membership is denoted by ∈, so x ∈ A means that the element x is a
member of the set A. Set inclusion, A ⊂ B, means that every member of A is a
member of B. This includes the case when A and B are equal; if the inclusion is
strict, i.e. A ⊂ B and B contains elements which are not in A (written x ∉ A),
this will be stated separately. The notation {x ∈ A : P(x)} is used to denote
the set of elements of A with property P. The set of all subsets of A (its power
set) is denoted by P(A).
We define the intersection A ∩ B = {x : x ∈ A and x ∈ B} and union
A ∪ B = {x : x ∈ A or x ∈ B}. The complement Aᶜ of A consists of the
elements of Ω which are not members of A; we also write Aᶜ = Ω \ A, and,
more generally, we have the difference B \ A = {x ∈ B : x ∉ A} = B ∩ Aᶜ and
the symmetric difference A△B = (A \ B) ∪ (B \ A). Note that A△B = ∅ if
and only if A = B.
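These operations can be tried out on small finite sets. The following Python sketch (the particular sets are arbitrary illustrative choices, with a finite set standing in for the universal set Ω) uses Python's built-in set type, whose operators mirror the definitions above:

```python
# Illustrative finite sets; 'universe' plays the role of Omega.
A = {1, 2, 3, 4}
B = {3, 4, 5}
universe = set(range(10))

assert A & B == {3, 4}                  # intersection A n B
assert A | B == {1, 2, 3, 4, 5}         # union A u B
complement_A = universe - A             # complement A^c = Omega \ A
assert complement_A == {0, 5, 6, 7, 8, 9}
assert A ^ B == (A - B) | (B - A)       # symmetric difference A triangle B
assert A ^ A == set()                   # A triangle B is empty iff A = B
```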
Intersection (resp. union) gives expression to the logical connective 'and'
(resp. 'or') and, via the logical symbols ∃ (there exists) and ∀ (for all), they
have extensions to arbitrary collections. Indexed by some set Λ these are given
by

⋂_{α∈Λ} A_α = {x : x ∈ A_α for all α ∈ Λ} = {x : ∀α ∈ Λ, x ∈ A_α};

⋃_{α∈Λ} A_α = {x : x ∈ A_α for some α ∈ Λ} = {x : ∃α ∈ Λ, x ∈ A_α}.
These are linked by de Morgan's laws:

(⋃_{α∈Λ} A_α)ᶜ = ⋂_{α∈Λ} A_αᶜ,   (⋂_{α∈Λ} A_α)ᶜ = ⋃_{α∈Λ} A_αᶜ.

If A ∩ B = ∅ then A and B are disjoint. A family of sets (A_α)_{α∈Λ} is pairwise
disjoint if A_α ∩ A_β = ∅ whenever α ≠ β (α, β ∈ Λ).
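De Morgan's laws can be checked mechanically on a small indexed family. In the Python sketch below, the family and the finite universe are arbitrary choices made for illustration:

```python
from functools import reduce

universe = set(range(12))                      # stands in for Omega
family = {0: {1, 2, 3}, 1: {3, 4, 5}, 2: {5, 6}}

union = reduce(set.union, family.values())
intersection = reduce(set.intersection, family.values())

# (U A_a)^c = n (A_a^c)
assert universe - union == reduce(
    set.intersection, (universe - A for A in family.values()))
# (n A_a)^c = U (A_a^c)
assert universe - intersection == reduce(
    set.union, (universe - A for A in family.values()))
```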
The Cartesian product A × B of sets A and B is the set of ordered pairs
A × B = {(a, b) : a ∈ A, b ∈ B}. As already indicated, we use ℕ, ℤ, ℚ, ℝ
for the basic number systems of natural numbers, integers, rationals and reals
respectively. Intervals in ℝ are denoted via each endpoint, with a square bracket
indicating its inclusion, an open bracket exclusion, e.g. [a, b) = {x ∈ ℝ : a ≤
x < b}. We use ∞ and -∞ to describe unbounded intervals, e.g. (-∞, b) =
{x ∈ ℝ : x < b}, [0, ∞) = {x ∈ ℝ : x ≥ 0} = ℝ⁺. The set ℝ² = ℝ × ℝ denotes
the plane; more generally, ℝⁿ is the n-fold Cartesian product of ℝ with itself,
i.e. the set of all n-tuples (x_1, ..., x_n) composed of real numbers. Products of
intervals, called rectangles, are denoted similarly.
Formally, a function f : A → B is a subset of A × B in which each first
coordinate determines the second: if (a, b), (a, c) ∈ f then b = c. Its domain
D_f = {a ∈ A : ∃b ∈ B, (a, b) ∈ f} and range R_f = {b ∈ B : ∃a ∈ A, (a, b) ∈ f}
describe its scope. Informally, f associates elements of B with those of A, such
that each a ∈ A has at most one image b ∈ B. We write this as b = f(a).
The set X ⊂ A has image f(X) = {b ∈ B : b = f(a) for some a ∈ X} and
the inverse image of a set Y ⊂ B is f⁻¹(Y) = {a ∈ A : f(a) ∈ Y}. The
composition f₂ ∘ f₁ of f₁ : A → B and f₂ : B → C is the function h : A → C
defined by h(a) = f₂(f₁(a)). When A = B = C, x ↦ (f₁ ∘ f₂)(x) = f₁(f₂(x))
and x ↦ (f₂ ∘ f₁)(x) = f₂(f₁(x)) both define functions from A to A. In general,
these will not be the same: for example, let f₁(x) = sin x, f₂(x) = x², then
x ↦ sin(x²) and x ↦ (sin x)² are not equal.
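The non-commutativity of composition is easy to demonstrate numerically; a brief Python check using the same two functions as in the text:

```python
import math

f1 = math.sin           # f1(x) = sin x
f2 = lambda x: x ** 2   # f2(x) = x^2

x = 2.0
# (f1 o f2)(x) = sin(x^2), while (f2 o f1)(x) = (sin x)^2
assert abs(f1(f2(x)) - math.sin(4.0)) < 1e-12
assert f1(f2(x)) != f2(f1(x))  # the two compositions disagree at x = 2
```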
The function g extends f if D_f ⊂ D_g and g = f on D_f; alternatively we say
that f restricts g to D_f. These concepts will be used frequently for real-valued
set functions, where the domains are collections of sets and the range is a subset
of ℝ.
The algebra of real functions is defined pointwise, i.e. the sum f + g and
product fg are given by (f + g)(x) = f(x) + g(x), (fg)(x) = f(x)g(x).
The indicator function 1_A of the set A is the function

1_A(x) = 1 for x ∈ A,  1_A(x) = 0 for x ∉ A.

Note that 1_{A∩B} = 1_A · 1_B, 1_{A∪B} = 1_A + 1_B - 1_A1_B, and 1_{Aᶜ} = 1 - 1_A.
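These indicator identities can be verified pointwise on a finite example. In the following Python sketch the sets A and B are arbitrary illustrative choices:

```python
def indicator(A):
    """Return the indicator function 1_A as a Python callable."""
    return lambda x: 1 if x in A else 0

A, B = {1, 2, 3}, {2, 3, 4}
oneA, oneB = indicator(A), indicator(B)
oneAB = indicator(A & B)    # 1_{A n B}
oneAuB = indicator(A | B)   # 1_{A u B}

for x in range(6):
    assert oneAB(x) == oneA(x) * oneB(x)
    assert oneAuB(x) == oneA(x) + oneB(x) - oneA(x) * oneB(x)
```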
We need one more concept from basic set theory, which should be familiar:
for any set E, an equivalence relation on E is a relation (i.e. a subset R of
E × E, where we write x ∼ y to indicate that (x, y) ∈ R) with the following
properties:
1. reflexive: for all x ∈ E, x ∼ x,
2. symmetric: x ∼ y implies y ∼ x,
3. transitive: x ∼ y and y ∼ z implies x ∼ z.
An equivalence relation ∼ on E partitions E into disjoint equivalence
classes: given x ∈ E, write [x] = {z : z ∼ x} for the equivalence class of
x, i.e. the set of all elements of E that are equivalent to x. Thus x ∈ [x], hence
E = ⋃_{x∈E}[x]. This is a disjoint union: if [x] ∩ [y] ≠ ∅, then there is z ∈ E with
x ∼ z and z ∼ y, hence x ∼ y, so that [x] = [y]. We shall denote the set of all
equivalence classes so obtained by E/∼.
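The way an equivalence relation partitions a set can be made concrete. The Python sketch below uses the illustrative relation x ∼ y iff x ≡ y (mod 3) on a small finite E, and checks that the classes cover E and are pairwise disjoint:

```python
# Partition E = {0, ..., 11} into equivalence classes under
# x ~ y iff x = y (mod 3).
E = set(range(12))
classes = {}
for x in E:
    classes.setdefault(x % 3, set()).add(x)

partition = list(classes.values())
assert set().union(*partition) == E        # the classes cover E
for i, P in enumerate(partition):          # and are pairwise disjoint
    for Q in partition[i + 1:]:
        assert P & Q == set()
```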
1.1.2 Countable and uncountable sets in ℝ

We say that a set A is countable if there is a one-one correspondence between
A and a subset of ℕ, i.e. a function f : A → ℕ that takes distinct points to
distinct points. Informally, A is finite if this correspondence can be set up using
only an initial segment {1, 2, ..., N} of ℕ (for some N ∈ ℕ), while we call A
countably infinite or denumerable if all of ℕ is used. It is not difficult to see
that countable unions of countable sets are countable; in particular, the set ℚ
of rationals is countable.
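The countability of ℚ can be made concrete by exhibiting an explicit enumeration. The following Python sketch implements the classical zig-zag (diagonal) enumeration of the positive rationals p/q, listing pairs with p + q = 2, 3, 4, ... and skipping repeats; the function name and cut-off are illustrative choices:

```python
from fractions import Fraction

def positive_rationals(n):
    """Return the first n positive rationals in zig-zag order."""
    seen, out, s = set(), [], 2
    while len(out) < n:
        for p in range(1, s):      # enumerate pairs with p + q = s
            q = s - p
            r = Fraction(p, q)
            if r not in seen:      # skip duplicates such as 2/2 = 1/1
                seen.add(r)
                out.append(r)
                if len(out) == n:
                    break
        s += 1
    return out

first = positive_rationals(6)
assert first == [Fraction(1, 1), Fraction(1, 2), Fraction(2, 1),
                 Fraction(1, 3), Fraction(3, 1), Fraction(1, 4)]
```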
Cantor showed that the set ℝ cannot be placed in one-one correspondence
with (a subset of) ℕ; thus it is an example of an uncountable set. Cantor's proof
assumes that we can write each real number uniquely as a decimal (always
choosing the non-terminating version). We can also restrict ourselves (why?)
to showing that the interval [0, 1] is uncountable.
If this set were countable, then we could write its elements as a sequence (x_n)_{n≥1}, and since each x_n has a unique decimal expansion of the form
x_n = 0.a_{n1} a_{n2} a_{n3} ...
for digits a_{ij} chosen from the set {0, 1, 2, ..., 9}, we could therefore write down the array
x_1 = 0.a_{11} a_{12} a_{13} ...
x_2 = 0.a_{21} a_{22} a_{23} ...
x_3 = 0.a_{31} a_{32} a_{33} ...
...
Now let y = 0.b_1 b_2 b_3 ..., where the digits b_n are chosen to differ from a_{nn}. Such a decimal expansion defines a number y ∈ [0, 1] that differs from each of the x_n (since its expansion differs from that of x_n in the nth place). Hence our sequence does not exhaust [0, 1], and the contradiction shows that [0, 1] cannot be countable.
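The diagonal construction is easy to sketch computationally on a finite list of expansions. The digit rule below (choose 5, or 4 when the diagonal digit is 5) is one convenient convention, not the book's; avoiding 0 and 9 also sidesteps the ambiguity 0.4999... = 0.5000... :

```python
def diagonal(rows):
    """Given rows[n] = the decimal digits of x_{n+1}, return digits of a
    number y whose nth digit differs from the nth digit of x_n."""
    # choose b_n in {4, 5}, never equal to a_nn
    return [5 if row[n] != 5 else 4 for n, row in enumerate(rows)]

rows = [
    [1, 4, 1, 5, 9],  # digits of x_1 = 0.14159...
    [3, 3, 3, 3, 3],  # digits of x_2 = 0.33333...
    [5, 0, 5, 0, 5],  # digits of x_3 = 0.50505...
    [9, 9, 9, 9, 9],  # digits of x_4 = 0.99999...
    [2, 7, 1, 8, 2],  # digits of x_5 = 0.27182...
]
y = diagonal(rows)
assert y == [5, 5, 4, 5, 5]
# y disagrees with each x_n in the nth decimal place
assert all(y[n] != rows[n][n] for n in range(len(rows)))
```

Of course no finite computation proves the theorem; the point is only that the rule for y is completely mechanical.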
Since the union of two countable sets must be countable, and since ℚ is countable, it follows that ℝ \ ℚ is uncountable, i.e. there are far 'more' irrationals than rationals! One way of making this seem more digestible is to consider the problem of choosing numbers at random from an interval in ℝ.
Recall that rational numbers are precisely those real numbers whose decimal expansion recurs (we include 'terminates' under 'recurs'). Now imagine choosing a real number from [0, 1] at random: think of the set ℝ as a pond containing all real numbers, and imagine you are 'fishing' in this pond, pulling out one number at a time.
How likely is it that the first number will be rational, i.e. how likely are we
to find a number whose expansion recurs? It would be like rolling a ten-sided
die infinitely many times and expecting, after a finite number of throws, to say
with certainty that all subsequent throws will give the same digit. This does
not seem at all likely, and we should therefore not be too surprised to find
that countable sets (including Q) will be among those we can 'neglect' when
measuring sets on the real line in the 'unbiased' or uniform way in which we
have used the term 'random' so far. Possibly more surprising, however, will
be the discovery that even some uncountable sets can be 'negligible' from the
point of view adopted here.
1.1.3 Topological properties of sets in ℝ
Recall the definition of an open set O ⊂ ℝ:
Measure, Integral and Probability
Definition 1.1
A subset O of the real line ℝ is open if it is a union of open intervals, i.e. for some family of intervals (I_α)_{α∈A}, where A is an index set (countable or not),
O = ⋃_{α∈A} I_α.
A set is closed if its complement is open. Open sets in ℝ^n (n > 1) can be defined as unions of n-fold products of open intervals.
This definition seems more general than it actually is since, on ℝ, countable unions will always suffice, though the freedom to work with general unions will be convenient later on. If A is an index set and I_α is an open interval for each α ∈ A, then there exists a countable subcollection (I_{α_k})_{k≥1} of these intervals whose union equals ⋃_{α∈A} I_α. What is more, the sequence of intervals can be chosen to be pairwise disjoint.
It is easy to see that a finite intersection of open sets is open; however, a countable intersection of open sets need not be open: let O_n = (-1/n, 1) for n ≥ 1; then E = ⋂_{n=1}^∞ O_n = [0, 1) is not open.
Note that ℝ, unlike ℝ^n or more general spaces, has a linear order, i.e. given x, y ∈ ℝ we can decide whether x ≤ y or y ≤ x. Thus u is an upper bound for a set A ⊂ ℝ if a ≤ u for all a ∈ A, and a lower bound is defined similarly. The supremum (or least upper bound), written sup A, is then the minimum of all upper bounds, and the infimum (or greatest lower bound) inf A is defined as the maximum of all lower bounds. The completeness property of ℝ can be expressed by the statement that every non-empty set which is bounded above has a supremum.
A real function f is said to be continuous if f⁻¹(O) is open for each open set O. Every continuous real function defined on a closed bounded set attains its bounds on such a set, i.e. has a minimum and maximum value there. For example, if f : [a, b] → ℝ is continuous, then M = sup{f(x) : x ∈ [a, b]} = f(x_max) and m = inf{f(x) : x ∈ [a, b]} = f(x_min) for some points x_max, x_min ∈ [a, b].
The Intermediate Value Theorem says that a continuous function takes all intermediate values between the extreme ones, i.e. for each y ∈ [m, M] there is a c ∈ [a, b] such that y = f(c).
Specialising to real sequences (x_n), we can further define the upper limit lim sup_n x_n as
inf{ sup_{m≥n} x_m : n ∈ ℕ }
and the lower limit lim inf_n x_n as
sup{ inf_{m≥n} x_m : n ∈ ℕ }.
1. Motivation and preliminaries
The sequence (x_n) converges if and only if these quantities coincide, and their common value is then its limit. Recall that a sequence (x_n) converges to the real number x, written x = lim_{n→∞} x_n, if for every ε > 0 there is an N ∈ ℕ such that |x_n - x| < ε whenever n ≥ N. A series Σ_{n≥1} a_n converges if the sequence x_m = Σ_{n=1}^m a_n of its partial sums converges, and the limit of the partial sums is then the sum Σ_{n=1}^∞ a_n of the series.
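The definitions of upper and lower limit can be seen numerically. In this illustration (my example, not the book's) x_n = (-1)^n (1 + 1/n): the tail suprema decrease towards lim sup x_n = 1 and the tail infima increase towards lim inf x_n = -1, while the sequence itself has no limit. Finite windows stand in for the genuine sup and inf over m ≥ n:

```python
def x(n):
    return (-1) ** n * (1 + 1 / n)

def tail_sup(n, window=10000):
    # stands in for sup{x_m : m >= n} (finite window, illustration only)
    return max(x(m) for m in range(n, n + window))

def tail_inf(n, window=10000):
    # stands in for inf{x_m : m >= n}
    return min(x(m) for m in range(n, n + window))

# inf over n of the tail sups is 1; sup over n of the tail infs is -1
assert abs(tail_sup(1000) - 1) < 1e-2
assert abs(tail_inf(1000) + 1) < 1e-2
# the tail sups decrease and the tail infs increase, as they must
assert tail_sup(10) >= tail_sup(1000)
assert tail_inf(10) <= tail_inf(1000)
```

Since lim sup (1) and lim inf (-1) differ, the sequence diverges, exactly as the criterion above predicts.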
1.2 The Riemann integral: scope and limitations
In this section we give a brief review of the Riemann integral, which forms part
of the staple diet in introductory analysis courses, and consider some of the
reasons why it does not suffice for more advanced applications.
Let f : [a, b] → ℝ be a bounded real function, where a, b, with a < b, are real numbers. A partition of [a, b] is a finite set P = {a_0, a_1, a_2, ..., a_n} with
a = a_0 < a_1 < a_2 < ... < a_n = b.
The partition P gives rise to the upper and lower Riemann sums
U(P, f) = Σ_{i=1}^n M_i Δa_i,   L(P, f) = Σ_{i=1}^n m_i Δa_i,
where Δa_i = a_i - a_{i-1},
M_i = sup_{a_{i-1} ≤ x ≤ a_i} f(x)   and   m_i = inf_{a_{i-1} ≤ x ≤ a_i} f(x)
for each i ≤ n. (Note that M_i and m_i are well-defined real numbers since f is bounded on each interval [a_{i-1}, a_i].)
In order to define the Riemann integral of f, one first shows that for any given partition P, L(P, f) ≤ U(P, f), and next that for any refinement, i.e. a partition P' ⊃ P, we must have L(P, f) ≤ L(P', f) and U(P', f) ≤ U(P, f). Finally, since for any two partitions P_1 and P_2, their union P_1 ∪ P_2 is a refinement of both, we see that L(P, f) ≤ U(Q, f) for any partitions P, Q. The set {L(P, f) : P is a partition of [a, b]} is thus bounded above in ℝ, and we call its supremum the lower integral of f on [a, b]. Similarly, the infimum of the set of upper sums is the upper integral of f. The function f is now said to be Riemann-integrable on [a, b] if these two numbers coincide, and their common value is the Riemann integral of f, denoted by ∫_a^b f or, more commonly,
∫_a^b f(x) dx.
This definition does not provide a convenient means of checking the integrability of particular functions; however, the following formulation provides a useful criterion for integrability - see [8] for a proof.
Theorem 1.2 (Riemann's criterion)
f : [a, b] → ℝ is Riemann-integrable if and only if for every ε > 0 there exists a partition P_ε such that U(P_ε, f) - L(P_ε, f) < ε.
Example 1.3
We calculate ∫_0^1 f(x) dx when f(x) = √x: our immediate problem is that square roots are hard to find except for perfect squares. Therefore we take partition points which are perfect squares, even though this means that the lengths of the various subintervals do not stay the same (there is nothing to say that they should, even if equal lengths often simplify the calculations). In fact, take the sequence of partitions
P_n = {0, (1/n)^2, (2/n)^2, ..., (i/n)^2, ..., 1}
and consider the upper and lower sums, using the fact that f is increasing. Hence
U(P_n, f) - L(P_n, f) = (1/n^3) Σ_{i=1}^n (2i - 1) = (1/n^3){n(n + 1) - n} = 1/n.
By choosing n large enough, we can make this difference less than any given ε > 0, hence f is integrable. The integral must be 2/3, since both U(P_n, f) and L(P_n, f) converge to this value, as is easily seen.
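The computation in Example 1.3 can be checked exactly in rational arithmetic: with partition points (i/n)^2 the sup of √x on the ith subinterval is i/n and the inf is (i-1)/n, so the difference of the sums collapses to 1/n. A small sketch:

```python
from fractions import Fraction

def riemann_sums(n):
    """Upper and lower sums of f(x) = sqrt(x) on [0, 1] for the partition
    with points (i/n)^2; f is increasing, so sup = i/n and inf = (i-1)/n."""
    U = L = Fraction(0)
    for i in range(1, n + 1):
        da = Fraction(i, n) ** 2 - Fraction(i - 1, n) ** 2  # subinterval length
        U += Fraction(i, n) * da
        L += Fraction(i - 1, n) * da
    return U, L

for n in (10, 100, 1000):
    U, L = riemann_sums(n)
    assert U - L == Fraction(1, n)      # exactly the 1/n computed above
    assert L < Fraction(2, 3) < U       # both sums squeeze the integral 2/3
```

Using Fraction keeps every quantity exact, so the identity U - L = 1/n is verified literally rather than up to rounding error.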
Riemann's criterion still does not give us a precise picture of the class
of Riemann-integrable functions. However, it is easy to show (see [8]) that
any bounded monotone function belongs to this class, and only a little more
difficult to see that any continuous function f : [a, b] → ℝ (which is of course automatically bounded) will be Riemann-integrable.
This provides quite sufficient information for many practical purposes, and
the tedium of calculations such as that given above can be avoided by proving
the following theorem:
Theorem 1.4 (fundamental theorem of calculus)
If f : [a, b] → ℝ is continuous and the function F : [a, b] → ℝ has derivative f (i.e. F' = f on (a, b)), then
F(b) - F(a) = ∫_a^b f(x) dx.
This result therefore links the Riemann integral with differentiation, and displays F as a primitive (also called an 'anti-derivative') of f:
F(x) = ∫_a^x f(t) dt
up to a constant, thus justifying the elementary techniques of integration that form part of any calculus course.
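The fundamental theorem is easy to test numerically. Here f(x) = x^2 on [0, 2] with primitive F(x) = x^3/3 (my example, not the book's), comparing F(b) - F(a) against a midpoint-tagged Riemann sum:

```python
def riemann_sum(f, a, b, n):
    # Riemann sum with n equal subintervals, tagged at the midpoints
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f(x):
    return x * x

def F(x):
    return x ** 3 / 3          # a primitive of f, so F' = f

approx = riemann_sum(f, 0.0, 2.0, 100000)
assert abs(approx - (F(2.0) - F(0.0))) < 1e-8   # both sides equal 8/3
```

The midpoint tags are just one valid choice; for a continuous integrand any tagged Riemann sum converges to the same value as the mesh shrinks.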
We can relax the continuity requirement. A trivial step is to assume f bounded and continuous on [a, b] except at finitely many points. Then f is Riemann-integrable. To see this, split the interval into pieces on which f is continuous; then f is integrable on each, and hence one can derive the integrability of f on the whole interval. As an example, consider a function f equal to zero for all x ∈ [0, 1] except at a_1, ..., a_n, where it equals 1. It is integrable, with integral over [0, 1] equal to 0.
Taking this further, however, will require the power of the Lebesgue theory: in Theorem 4.33 we show that f is Riemann-integrable if and only if it is continuous at 'almost all' points of [a, b]. This result is by no means trivial, as you will discover if you try to prove directly that the following function f, due to Dirichlet, is Riemann-integrable over [0, 1]:
f(x) = 1/q if x = p/q ∈ ℚ (in lowest terms),   f(x) = 0 if x ∉ ℚ.
In fact, it is not difficult (see [8]) to show that f is continuous at each irrational and discontinuous at every rational point, hence (as we will see) is continuous at 'almost all' points of [0, 1].
Since the purpose of this book is to present Lebesgue's theory of integration,
we should discuss why we need a new theory of integration at all: what, if
anything, is wrong with the simple Riemann integral described above?
First, scope: it doesn't deal with all the kinds of functions that we hope to
handle.
The results that are most easily proved rely on continuous functions on bounded intervals. In order to handle integrals over unbounded intervals, e.g.
∫_1^∞ (1/x^2) dx,
or the integral of an unbounded function, such as
∫_0^1 (1/√x) dx,
we have to resort to 'improper' Riemann integrals, defined by a limit process. For example, consider the integrals
∫_1^n (1/x^2) dx   and   ∫_c^1 (1/√x) dx,
and let n → ∞ or c → 0, respectively. This isn't all that serious a flaw.
Second, dependence on intervals: we have no easy way of integrating over more general sets, or of integrating functions whose values are distributed 'awkwardly' over sets that differ greatly from intervals. For example, consider the upper and lower sums for the indicator function 1_ℚ of ℚ over [0, 1]: however we partition [0, 1], each subinterval must contain both rational and irrational points; thus each upper sum is 1 and each lower sum is 0. Hence we cannot calculate the Riemann integral of 1_ℚ over the interval [0, 1]; it is simply 'too discontinuous'. (You may easily convince yourself that 1_ℚ is discontinuous at all points of [0, 1].)
Third, lack of completeness: rather more importantly from the point of view of applications, the Riemann integral doesn't interact well with taking the limit of a sequence of functions. One may expect results of the following form: if a sequence (f_n) of Riemann-integrable functions converges (in some appropriate sense) to f, then
∫_a^b f_n dx → ∫_a^b f dx.
We give two counterexamples showing what difficulties can arise if the functions (f_n) converge to f pointwise, i.e. f_n(x) → f(x) for all x.
1. The limit need not be Riemann-integrable, and so the convergence question does not even make sense. Here we may take f = 1_ℚ and f_n = 1_{A_n}, where A_n = {q_1, ..., q_n} and (q_n)_{n≥1} is an enumeration of the rationals, so that (f_n) is even monotone increasing.
2. The limit is Riemann-integrable, but the convergence of Riemann integrals does not hold. Let f = 0, consider [a, b] = [0, 1], and put
f_n(x) = 4n^2 x if 0 ≤ x < 1/2n,   4n(1 - nx) if 1/2n ≤ x < 1/n,   0 if 1/n ≤ x ≤ 1,
a triangular 'spike' of base 1/n and height 2n.
Figure 1.1 Graph of f_n
This is a continuous function with integral 1. On the other hand, the sequence f_n(x) converges to f = 0, since for all x, f_n(x) = 0 for n sufficiently large (such that 1/n < x). See Figure 1.1.
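The spike of Figure 1.1 can be realised concretely as the triangle below (height 2n over base 1/n; an explicit choice of mine, consistent with 'continuous with integral 1'). Numerically, the integral stays at 1 for every n, while the value at any fixed x > 0 is eventually 0:

```python
def f(n, x):
    # triangular spike: rises on [0, 1/2n), falls on [1/2n, 1/n), then 0
    if 0 <= x < 1 / (2 * n):
        return 4 * n * n * x
    if 1 / (2 * n) <= x < 1 / n:
        return 4 * n * (1 - n * x)
    return 0.0

def integral(n, steps=100000):
    # midpoint approximation of the integral over [0, 1]
    h = 1.0 / steps
    return h * sum(f(n, (i + 0.5) * h) for i in range(steps))

for n in (1, 5, 50):
    assert abs(integral(n) - 1.0) < 1e-3   # every spike has integral 1
assert f(200, 0.01) == 0.0                 # but f_n(x) = 0 once 1/n < x
```

So the integrals converge to 1 while the pointwise limit integrates to 0: integration and pointwise limits do not commute here.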
To avoid problems of this kind, we can introduce the idea of uniform convergence: a sequence (f_n) in C[0, 1] converges uniformly to f if the sequence a_n = sup{|f_n(x) - f(x)| : 0 ≤ x ≤ 1} converges to 0. In this case one can easily prove the convergence of the Riemann integrals:
∫_0^1 f_n(x) dx → ∫_0^1 f(x) dx.
However, the 'distance' sup{|f(x) - g(x)| : 0 ≤ x ≤ 1} has nothing to do with integration as such, and uniform convergence is too restrictive for many applications. A more natural concept of 'distance', given by ∫_0^1 |f(x) - g(x)| dx, leads to another problem. Defining
g_n(x) = 0 if 0 ≤ x ≤ 1/2,   n(x - 1/2) if 1/2 < x < 1/2 + 1/n,   1 otherwise,
it can be shown that ∫_0^1 |g_n(x) - g_m(x)| dx → 0 as m, n → ∞; in Figure 1.2 the shaded area vanishes. (We say that (g_n) is a Cauchy sequence in this distance.)
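Taking g_n to be the ramp below (an explicit form I am assuming, consistent with Figure 1.2 and the pointwise limit 1_{(1/2,1]}), the integral distance between g_n and g_m is the area of the sliver between two ramps, namely |1/m - 1/n|/2, which indeed vanishes:

```python
def g(n, x):
    # 0 up to 1/2, a linear ramp of width 1/n, then 1 (assumed form)
    if x <= 0.5:
        return 0.0
    if x < 0.5 + 1 / n:
        return n * (x - 0.5)
    return 1.0

def dist(n, m, steps=100000):
    # midpoint approximation of the integral of |g_n - g_m| over [0, 1]
    h = 1.0 / steps
    return h * sum(abs(g(n, (i + 0.5) * h) - g(m, (i + 0.5) * h))
                   for i in range(steps))

def exact(n, m):
    return abs(1 / m - 1 / n) / 2   # area between the two ramps

assert abs(dist(10, 20) - exact(10, 20)) < 1e-3
assert dist(100, 200) < dist(10, 20)    # the distances shrink: Cauchy
```

The sequence is Cauchy in this integral distance even though, as the text goes on to explain, its limit leaves the space of continuous functions.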
Yet there is no continuous function f to which this sequence converges in this distance, since the pointwise limit is f(x) = 1 for x > 1/2 and 0 otherwise, so that f = 1_{(1/2, 1]}. So the space C([0, 1]) of all continuous functions f : [0, 1] → ℝ is too small from this point of view.
Figure 1.2 Graphs of g_n, g_m
This is rather similar to the situation which leads one to work with ℝ rather than just with the set of rationals ℚ (there are Cauchy sequences without limits in ℚ, for example a sequence of rational approximations of √2). Recalling the crucial importance of completeness in the case of ℝ, we naturally look for a theory of integration which does not have this shortcoming. In the process we shall find that our new theory, which will include the Riemann integral as a special case, also solves the other problems listed.
1.3 Choosing numbers at random
Before we start to develop the theory of Lebesgue measure to make sense of the 'length' of a general subset of ℝ, let us pause to consider some practical motivation. The simplicity of elementary probability with finite sample spaces vanishes rapidly when we have an infinite number of outcomes, such as when we 'pick a number between 0 and 1 at random'. We face making sense of the 'probability' that a given x ∈ [0, 1] is chosen. A similar, slightly more general, question is the following: what is the probability that the number we pick is rational?
First a prior question: what do we mean by saying that we pick the number x at random? 'Random' plausibly means that in each such trial, each real number is 'equally likely' to be picked, so that we impose the uniform probability distribution on [0, 1]. But the 'number' of possible choices is infinite. Hence the event A_x that a fixed x is chosen ought to have zero probability. On the other
hand, some number between 0 and 1 is chosen, and it is not impossible that it should be our x. Thus a non-empty set A_x can have P(A_x) = 0. Our way of 'measuring' probabilities need not, therefore, be able to distinguish completely between sets; we cannot really expect this in general if we want to handle infinite sets.
We can go slightly further: the probability that any one of a finite set of reals A = {x_1, x_2, ..., x_n} is selected should also be 0, since it seems natural that this probability P(A) should equal Σ_{i=1}^n P({x_i}). We can extend this to claim the finite additivity property of the probability function A ↦ P(A), i.e. that if A_1, A_2, ..., A_n are disjoint sets, then P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(A_i). This claim looks very plausible, and we shall see that it becomes an essential feature of any sensible basis for a calculus of probabilities.
Less obvious is the claim that, under the uniform distribution, any countably infinite set, such as ℚ, must also carry probability 0; yet that is exactly what an analysis of the 'area under the graph' of the function 1_ℚ suggests. We can reinterpret this as a result of a 'continuity property' of the mapping A ↦ P(A) when we let n → ∞ in the above: if the sequence (A_i) of subsets of ℝ is pairwise disjoint, then we would like to have
P(⋃_{i=1}^∞ A_i) = lim_{n→∞} Σ_{i=1}^n P(A_i) = Σ_{i=1}^∞ P(A_i).
We shall see in Chapter 2 that this condition is indeed satisfied by Lebesgue
measure on the real line JR, and it will be used as the defining property of
abstract measures on arbitrary sets.
There is much more to probability than is developed in this book: for example, we do not discuss finite sample spaces and the elegant combinatorial
ideas that characterise good introductions to probability, such as [6] and [10].
Our focus throughout remains on the essential role played by Lebesgue measure in the description of probabilistic phenomena based on infinite sample
spaces. This leads us to leave to one side many of the interesting examples and
applications which can be found in these texts, and provide, instead, a consistent development of the theoretical underpinnings of random variables with
densities.
2
Measure
2.1 Null sets
The idea of a 'negligible' set relates to one of the limitations of the Riemann integral, as we saw in the previous chapter. Since the function f = 1_ℚ takes a non-zero value only on ℚ, and equals 1 there, the 'area under its graph' (if such makes sense) must be very closely linked to the 'length' of the set ℚ. This is why it turns out that we cannot integrate f in the Riemann sense: the sets ℚ and ℝ \ ℚ are so different from intervals that it is not clear how we should measure their 'lengths', while it is clear that the 'integral' of f over [0, 1] should equal the 'length' of the set of rationals in [0, 1]. So how should we define this concept for more general sets?
The obvious way of defining the 'length' of a set is to start with intervals
nonetheless. Suppose that I is a bounded interval of any kind, i.e. I = [a, b],
I = [a, b), I = (a, b] or I = (a, b). We simply define the length of I as l(I) = b-a
in each case.
As a particular case we have l({a}) = l([a, a]) = 0. It is then natural to say that a one-element set is 'null'. Before we extend this idea to more general sets,
first consider the length of a finite set. A finite set is not an interval but since
a single point has length 0, adding finitely many such lengths together should
still give O. The underlying concept here is that if we decompose a set into a
finite number of disjoint intervals, we compute the length of this set by adding
the lengths of the pieces.
As we have seen, in general it may not always be possible to decompose a set
into non-trivial intervals. Therefore, we consider systems of intervals that cover
a given set. We shall generalise the above idea by allowing a countable number
of covering intervals. Thus we arrive at the following more general definition of
sets of 'zero length':
Definition 2.1
A null set A ⊆ ℝ is a set that may be covered by a sequence of intervals of arbitrarily small total length, i.e. given any ε > 0 we can find a sequence {I_n : n ≥ 1} of intervals such that
A ⊆ ⋃_{n=1}^∞ I_n   and   Σ_{n=1}^∞ l(I_n) < ε.
(We also say simply that 'A is null'.)
Exercise 2.1
Show that we get an equivalent notion if in the above definition we
replace the word 'intervals' by any of these: 'open intervals', 'closed intervals', 'the intervals of the form (a, b]', 'the intervals of the form [a, b)'.
Note that the intervals do not need to be disjoint. It follows at once from the definition that the empty set is null.
Next, any one-element set {x} is a null set. For, let ε > 0 and take I_1 = (x - ε/4, x + ε/4), I_n = [0, 0] for n ≥ 2. (Why take I_n = [0, 0] for n ≥ 2? Well, why not! We could equally have taken I_n = (0, 0) = ∅, of course!) Now
Σ_{n=1}^∞ l(I_n) = l(I_1) = ε/2 < ε.
More generally, any countable set A = {x_1, x_2, ...} is null. The simplest way to show this is to take I_n = [x_n, x_n] for all n. However, as a gentle introduction to the next theorem, we will cover A by open intervals. This way it is more fun.
For, let ε > 0 and cover A with the following sequence of intervals:
I_n = (x_n - ε/2^{n+2}, x_n + ε/2^{n+2}),   so that   l(I_n) = (ε/2) · (1/2^n).
Since Σ_{n=1}^∞ 1/2^n = 1, we obtain
Σ_{n=1}^∞ l(I_n) = ε/2 < ε,
as needed.
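The bookkeeping with half-widths ε/2^{n+2} can be checked exactly in rational arithmetic; the listed points below are arbitrary stand-ins for x_1, x_2, ... :

```python
from fractions import Fraction

eps = Fraction(1, 10)
points = [Fraction(k, 7) for k in range(1, 9)]   # any listing x_1, x_2, ...

intervals = []
for n, x in enumerate(points, start=1):
    half = eps / 2 ** (n + 2)                    # half-width eps / 2^(n+2)
    intervals.append((x - half, x + half))

total = sum(b - a for (a, b) in intervals)
# the nth interval has length (eps/2) * 2^(-n), so the total stays below eps/2
assert total == (eps / 2) * (1 - Fraction(1, 2 ** len(points)))
assert total < eps / 2 < eps
assert all(a < x < b for (a, b), x in zip(intervals, points))
```

With infinitely many points the geometric series sums to exactly ε/2, still comfortably below ε.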
Here we have the following situation: A is the union of countably many one-element sets. Each of them is null, and A turns out to be null as well. We can generalise this simple observation:
Theorem 2.2
If (N_n)_{n≥1} is a sequence of null sets, then their union
N = ⋃_{n=1}^∞ N_n
is also null.
Proof
We assume that all N_n, n ≥ 1, are null, and to show that the same is true for N we take any ε > 0. Our goal is to cover the set N by countably many intervals with total length less than ε.
The proof goes in three steps, each being a little bit tricky.
Step 1. We carefully cover each N_n by intervals.
'Carefully' means that the lengths have to be small. 'Small' means that we are going to add them up later to end up with a small number (and 'small' here means less than ε).
Since N_1 is null, there exist intervals I_k^1, k ≥ 1, such that
Σ_{k=1}^∞ l(I_k^1) < ε/2,   N_1 ⊆ ⋃_{k=1}^∞ I_k^1.
For N_2 we find a system of intervals I_k^2, k ≥ 1, with
Σ_{k=1}^∞ l(I_k^2) < ε/4,   N_2 ⊆ ⋃_{k=1}^∞ I_k^2.
You can see a cunning plan of making the total lengths smaller at each step at a geometric rate. In general, we cover N_n with intervals I_k^n, k ≥ 1, whose total length is less than ε/2^n:
Σ_{k=1}^∞ l(I_k^n) < ε/2^n,   N_n ⊆ ⋃_{k=1}^∞ I_k^n.
Step 2. The intervals I_k^n are formed into a sequence.
We arrange the countable family of intervals {I_k^n}_{k≥1,n≥1} into a sequence J_j, j ≥ 1. For instance we put J_1 = I_1^1, J_2 = I_1^2, J_3 = I_2^1, J_4 = I_1^3, etc., so that none of the I_k^n are skipped. The union of the new system of intervals is the same as the union of the old one, and so
N = ⋃_{n=1}^∞ N_n ⊆ ⋃_{n=1}^∞ ⋃_{k=1}^∞ I_k^n = ⋃_{j=1}^∞ J_j.
Step 3. Compute the total length Σ_{j=1}^∞ l(J_j).
This is tricky because we have a series of numbers with two indices:
Σ_{j=1}^∞ l(J_j) = Σ_{n,k=1}^∞ l(I_k^n).
Now we wish to write this as a series of numbers, each being the sum of a series. We can rearrange the double sum because the components are non-negative (a fact from elementary calculus):
Σ_{n,k=1}^∞ l(I_k^n) = Σ_{n=1}^∞ ( Σ_{k=1}^∞ l(I_k^n) ) < Σ_{n=1}^∞ ε/2^n = ε,
which completes the proof. □
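The interleaving in Step 2 amounts to enumerating the index pairs (n, k) along the anti-diagonals n + k = constant; a generator makes the idea concrete (an illustrative sketch, truncated for testing):

```python
import itertools

def diagonal_pairs():
    """Yield every index pair (n, k) with n, k >= 1 exactly once, walking
    the anti-diagonals n + k = 2, 3, 4, ... in turn."""
    s = 2
    while True:
        for n in range(1, s):
            yield (n, s - n)   # k = s - n >= 1
        s += 1

first = list(itertools.islice(diagonal_pairs(), 10))
assert first == [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2),
                 (3, 1), (1, 4), (2, 3), (3, 2), (4, 1)]
assert len(set(first)) == len(first)   # no pair is listed twice
```

Any pair (n, k) appears by the time the diagonal s = n + k has been emitted, so nothing is skipped, which is exactly what Step 2 needs.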
Thus any countable set is null, and null sets appear to be closely related
to countable sets - this is no surprise as any proper interval is uncountable, so
any countable subset is quite 'sparse' when compared with an interval, hence
makes no real contribution to its 'length'. (You may also have noticed the
19
2. Measure
similarity between Step 2 in the above proof and the 'diagonal argument' which is commonly used to show that ℚ is a countable set.)
However, uncountable sets can be null, provided their points are sufficiently 'sparsely distributed', as the following famous example, due to Cantor, shows:
1. Start with the interval [0, 1]; remove the 'middle third', i.e. the interval (1/3, 2/3), obtaining the set C_1, which consists of the two intervals [0, 1/3] and [2/3, 1].
2. Next remove the middle third of each of these two intervals, leaving C_2, consisting of four intervals, each of length 1/9, etc. (See Figure 2.1.)
3. At the nth stage we have a set C_n consisting of 2^n disjoint closed intervals, each of length 1/3^n. Thus the total length of C_n is (2/3)^n.
Figure 2.1 Cantor set construction (C_3)
We call
C = ⋂_{n=1}^∞ C_n
the Cantor set. Now we show that C is null, as promised.
Given any ε > 0, choose n so large that (2/3)^n < ε. Since C ⊆ C_n, and C_n consists of a (finite) sequence of intervals of total length less than ε, we see that C is a null set.
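The construction mirrors neatly in exact arithmetic: each stage keeps the outer thirds of every interval, and the total length (2/3)^n can be checked directly (sketch only):

```python
from fractions import Fraction

def next_stage(intervals):
    """Remove the open middle third of each closed interval."""
    out = []
    for a, b in intervals:
        third = (b - a) / 3
        out.append((a, a + third))      # left third survives
        out.append((b - third, b))      # right third survives
    return out

C_n = [(Fraction(0), Fraction(1))]
for n in range(1, 7):
    C_n = next_stage(C_n)
    assert len(C_n) == 2 ** n                                 # 2^n intervals
    assert sum(b - a for a, b in C_n) == Fraction(2, 3) ** n  # total (2/3)^n
```

Since (2/3)^n → 0, any ε > 0 is eventually undercut, which is the whole content of the nullity argument above.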
All that remains is to check that C is an uncountable set. This is left for
you as
Exercise 2.2
Prove that C is uncountable.
Hint Adapt the proof of the uncountability of ℝ: begin by expressing each x in [0, 1] in ternary form
x = 0.a_1 a_2 a_3 ... = Σ_{k=1}^∞ a_k/3^k
with a_k = 0, 1 or 2. Note that x ∈ C iff all its a_k equal 0 or 2.
Why is the Cantor set null, even though it is uncountable? Clearly it is the distribution of its points, the fact that it is 'spread out' all over [0, 1], which causes the trouble. This makes it the source of many examples which show that intuitively 'obvious' things are not always true! For example, we can use the Cantor set to define a function, due to Lebesgue, with very odd properties:
If x ∈ [0, 1] has ternary expansion (a_n), i.e. x = 0.a_1 a_2 ... with a_n = 0, 1 or 2, define N as the first index n for which a_n = 1, and set N = ∞ if none of the a_n are 1 (i.e. when x ∈ C). Now set b_n = a_n/2 for n < N and b_N = 1, and let F(x) = Σ_{n=1}^N b_n/2^n for each x ∈ [0, 1]. Clearly, this function is monotone increasing and has F(0) = 0, F(1) = 1. Yet it is constant on the middle thirds (i.e. the complement of C), so all its increase occurs on the Cantor set. Since we have shown that C is a null set, F 'grows' from 0 to 1 entirely on a 'negligible' set. The following exercise shows that it has no jumps!
Exercise 2.3
Prove that the Lebesgue function F is continuous and sketch its partial
graph.
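The digit description of F translates directly into code; the version below works to finite precision, special-cases x = 1, and is an illustrative sketch of mine rather than anything from the book:

```python
def F(x, digits=40):
    """Lebesgue's function on [0, 1], computed from ternary digits of x:
    b_n = a_n/2 until the first digit a_N = 1, which contributes b_N = 1."""
    if x >= 1.0:
        return 1.0
    total, weight = 0.0, 0.5
    for _ in range(digits):
        x *= 3
        a = int(x)                   # next ternary digit of the original x
        x -= a
        if a == 1:                   # first digit 1: add 1/2^N and stop
            return total + weight
        total += weight * (a // 2)   # b_n = a_n/2 is 0 or 1
        weight /= 2
    return total

assert F(0.0) == 0.0 and F(1.0) == 1.0
assert F(0.5) == 0.5                  # 1/2 = 0.111..._3, so N = 1
assert F(0.4) == F(0.45) == 0.5       # constant on the middle third (1/3, 2/3)
```

Evaluating F on a grid and plotting it gives the familiar 'devil's staircase' picture: flat on every removed middle third, yet climbing from 0 to 1.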
2.2 Outer measure
The simple concept of null sets provides the key to our idea of length, since it
tells us what we can 'ignore'. A quite general notion of 'length' is now provided
by:
Definition 2.3
The (Lebesgue) outer measure of any set A ⊆ ℝ is given by
m*(A) = inf Z_A
where
Z_A = { Σ_{n=1}^∞ l(I_n) : I_n are intervals, A ⊆ ⋃_{n=1}^∞ I_n }.
We say that the intervals (I_n)_{n≥1} cover the set A. So the outer measure is the infimum of the total lengths of all possible covers of A. (Note again that some of the I_n may be empty; this avoids having to worry whether the sequence (I_n) has finitely or infinitely many different members.)
Clearly m*(A) ≥ 0 for any A ⊆ ℝ. For some sets A the series Σ_{n=1}^∞ l(I_n) may diverge for every covering of A, so m*(A) may be equal to ∞. Since we wish to be able to add the outer measures of various sets, we have to adopt a convention to deal with infinity. An obvious choice is a + ∞ = ∞, ∞ + ∞ = ∞, and a less obvious but quite practical assumption is 0 × ∞ = 0, as we have already seen.
The set Z_A is bounded from below by 0, so the infimum always exists. If r ∈ Z_A, then [r, +∞] ⊆ Z_A (clearly, we may expand the first interval of any cover to increase the total length by any number). This shows that Z_A is either {+∞} or the interval (x, +∞] or [x, +∞] for some real number x. So the infimum of Z_A is just x.
First we show that the concept of null set is consistent with that of outer
measure:
Theorem 2.4
A ⊆ ℝ is a null set if and only if m*(A) = 0.
Proof
Suppose that A is a null set. We wish to show that inf Z_A = 0. To this end we show that for any ε > 0 we can find an element z ∈ Z_A such that z < ε.
By the definition of a null set we can find a sequence (I_n) of intervals covering A with Σ_{n=1}^∞ l(I_n) < ε, and so Σ_{n=1}^∞ l(I_n) is the required element z of Z_A.
Conversely, if A ⊆ ℝ has m*(A) = 0, then by the definition of inf, given any ε > 0 there is z ∈ Z_A with z < ε. But a member of Z_A is the total length of some covering of A. That is, there is a covering (I_n) of A with total length less than ε, so A is null. □
This combines our general outer measure with the special case of 'zero measure'. Note that m*(∅) = 0, m*({x}) = 0 for any x ∈ ℝ, and m*(ℚ) = 0 (and in fact, for any countable X, m*(X) = 0).
Next we observe that m* is monotone: the bigger the set, the greater its
outer measure.
Proposition 2.5
If A ⊂ B, then m*(A) ≤ m*(B).
Hint Show that Z_B ⊂ Z_A and use the definition of inf.
The second step is to relate outer measure to the length of an interval. This innocent result contains the crux of the theory, since it shows that the formal definition of m*, which is applicable to all subsets of ℝ, coincides with the intuitive idea for intervals, where our thought processes began. We must therefore expect the proof to contain some hidden depths, and we have to tackle these in stages: the hard work lies in showing that the length of the interval cannot be greater than its outer measure. For this we need to appeal to the famous Heine-Borel theorem, which states that every closed, bounded subset B of ℝ is compact: given any collection of open sets O_α covering B (i.e. B ⊆ ⋃_α O_α), there is a finite subcollection (O_{α_i})_{i≤n} which still covers B, i.e. B ⊆ ⋃_{i=1}^n O_{α_i} (for a proof see [1]).
Theorem 2.6
The outer measure of an interval equals its length.
Proof
If I is unbounded, then it is clear that it cannot be covered by a system of intervals with finite total length. This shows that m*(I) = ∞, and so m*(I) = l(I) = ∞. So we restrict ourselves to bounded intervals.
Step 1. m*(I) ≤ l(I).
We claim that l(I) ∈ Z_I. Take the following sequence of intervals: I_1 = I, I_n = [0, 0] for n ≥ 2. This sequence covers the set I, and the total length is equal to the length of I; hence l(I) ∈ Z_I. This is sufficient, since the infimum of Z_I cannot exceed any of its elements.
Step 2. l(I) ≤ m*(I).
(i) I = [a, b]. We shall show that for any ε > 0,
l([a, b]) ≤ m*([a, b]) + ε.   (2.1)
This is sufficient, since we may obtain the required inequality by passing to the limit ε → 0. (Note that if x, y ∈ ℝ and y > x, then there is an ε > 0 with y > x + ε, e.g. ε = (y - x)/2.)
So we take an arbitrary ε > 0. By the definition of outer measure we can find a sequence of intervals I_n covering [a, b] such that
Σ_{n=1}^∞ l(I_n) ≤ m*([a, b]) + ε/2.   (2.2)
We shall slightly increase each of the intervals to an open one. Let the endpoints of I_n be a_n, b_n, and take
J_n = (a_n - ε/2^{n+2}, b_n + ε/2^{n+2}).
It is clear that l(J_n) = l(I_n) + ε/2^{n+1}, so that
Σ_{n=1}^∞ l(I_n) = Σ_{n=1}^∞ l(J_n) - ε/2.
We insert this in (2.2) and we have
Σ_{n=1}^∞ l(J_n) ≤ m*([a, b]) + ε.   (2.3)
The new sequence of intervals of course covers [a, b], so by the Heine-Borel theorem we can choose a finite number of the J_n which cover [a, b] (the set [a, b] is compact in ℝ). We can add some intervals to this finite family to form an initial segment of the sequence (J_n), just for simplicity of notation. So for some finite index m we have
[a, b] ⊆ ⋃_{n=1}^m J_n.   (2.4)
Let J_n = (c_n, d_n). Put c = min{c_1, ..., c_m}, d = max{d_1, ..., d_m}. The covering (2.4) means that c < a and b < d, hence l([a, b]) < d - c.
Next, the number d - c is certainly smaller than the total length of the J_n, n = 1, ..., m (some overlapping takes place), and so
l([a, b]) < d - c < Σ_{n=1}^m l(J_n).   (2.5)
Now it is sufficient to put (2.3) and (2.5) together in order to deduce (2.1) (the finite sum is less than or equal to the sum of the series, since all terms are non-negative).
(ii) I = (a, b). As before, it is sufficient to show (2.1). Let us fix any ε > 0. Then
l((a, b)) = l([a + ε/2, b - ε/2]) + ε
≤ m*([a + ε/2, b - ε/2]) + ε   (by (i))
≤ m*((a, b)) + ε   (by Proposition 2.5).
(iii) I = [a, b) or I = (a, b]. Then
l(I) = l((a, b)) ≤ m*((a, b))   (by (ii))
≤ m*(I)   (by Proposition 2.5),
which completes the proof. □
Having shown that outer measure coincides with the natural concept of
length for intervals, we now need to investigate its properties. The next theorem
gives us an important technical tool which will be used in many proofs.
Theorem 2.7
Outer measure is countably subadditive, i.e. for any sequence of sets (E_n),
m*( ⋃_{n=1}^∞ E_n ) ≤ Σ_{n=1}^∞ m*(E_n).
(Note that both sides may be infinite here.)
Proof (a warm-up)
Let us prove first a simpler statement:
m*(E_1 ∪ E_2) ≤ m*(E_1) + m*(E_2).
Take an ε > 0 and we show an even easier inequality:
m*(E_1 ∪ E_2) ≤ m*(E_1) + m*(E_2) + ε.
This is however sufficient, because taking ε = 1/n and letting n → ∞ we get what we need.
So for any ε > 0 we find covering sequences (I_k^1)_{k≥1} of E_1 and (I_k^2)_{k≥1} of E_2 such that
Σ_{k=1}^∞ l(I_k^1) ≤ m*(E_1) + ε/2,   Σ_{k=1}^∞ l(I_k^2) ≤ m*(E_2) + ε/2;
hence, adding up,
Σ_{k=1}^∞ l(I_k^1) + Σ_{k=1}^∞ l(I_k^2) ≤ m*(E_1) + m*(E_2) + ε.
The sequence of intervals (I_1^1, I_1^2, I_2^1, I_2^2, I_3^1, I_3^2, ...) covers E_1 ∪ E_2; hence
m*(E_1 ∪ E_2) ≤ Σ_{k=1}^∞ l(I_k^1) + Σ_{k=1}^∞ l(I_k^2),
which, combined with the previous inequality, gives the result. □
Proof of the theorem
If the right-hand side is infinite, then the inequality is of course true. So, suppose
that L~=l m* (En) < 00. For each given € > 0 and n 2: 1 find a covering
sequence (Ik)k~l of En with
00
L l(Ik) :::; m*(En) + ;n'
k=l
The iterated series converges:

∑_{n=1}^{∞} (∑_{k=1}^{∞} l(Iⁿₖ)) ≤ ∑_{n=1}^{∞} m*(Eₙ) + ε < ∞,

and since all its terms are non-negative,

∑_{n=1}^{∞} (∑_{k=1}^{∞} l(Iⁿₖ)) = ∑_{n,k=1}^{∞} l(Iⁿₖ).
The system of intervals (Iⁿₖ)ₙ,ₖ≥₁ covers ⋃_{n=1}^{∞} Eₙ, hence

m*(⋃_{n=1}^{∞} Eₙ) ≤ ∑_{n,k=1}^{∞} l(Iⁿₖ) ≤ ∑_{n=1}^{∞} m*(Eₙ) + ε.

To complete the proof we let ε → 0. □
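The ε/2ⁿ trick in this proof is easy to see in action numerically. The following Python sketch (the helper names are ours, not from the text) covers a finite enumeration of rationals in [0, 1] by intervals whose n-th member has length ε/2ⁿ, so the total length of the cover never exceeds ε; this is exactly the computation showing that a countable set has outer measure zero.

```python
from fractions import Fraction

def rationals_in_unit_interval(max_den):
    # Enumerate the rationals p/q in [0, 1] with denominator up to max_den.
    return sorted({Fraction(p, q) for q in range(1, max_den + 1)
                   for p in range(0, q + 1)})

def epsilon_cover(points, eps):
    # Cover the n-th point by an open interval of length eps / 2**n,
    # mirroring the eps/2^n trick in the proof above.
    cover = []
    for n, x in enumerate(points, start=1):
        half = Fraction(eps) / 2 ** (n + 1)   # interval length is eps / 2**n
        cover.append((x - half, x + half))
    return cover

points = rationals_in_unit_interval(20)
cover = epsilon_cover(points, Fraction(1, 100))
total_length = sum(b - a for a, b in cover)
# The total length is below eps = 1/100 however many points are covered,
# so the outer measure of this set is at most 1/100.
print(total_length < Fraction(1, 100))  # True
```

Since ε was arbitrary, running the same construction with ever smaller ε shows that any countable set is null.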
A similar result is of course true for a finite family (Eₙ)_{n=1}^{m}:

m*(⋃_{n=1}^{m} Eₙ) ≤ ∑_{n=1}^{m} m*(Eₙ).

It is a corollary to Theorem 2.7 with Eₖ = ∅ for k > m.
Exercise 2.4
Prove that if m*(A) = 0 then for each B, m*(A U B) = m*(B).
Hint Employ both monotonicity and subadditivity of outer measure.
Exercise 2.5
Prove that if m*(A△B) = 0, then m*(A) = m*(B).

Hint Note that A ⊆ B ∪ (A△B).
We conclude this section with a simple and intuitive property of outer measure. Note that the length of an interval does not change if we shift it along the
real line: l([a, b]) = l([a + t, b+ t]) = b - a for example. Since the outer measure
is defined in terms of the lengths of intervals, it is natural to expect it to share
this property. For A ⊂ ℝ and t ∈ ℝ we put A + t = {a + t : a ∈ A}.
Proposition 2.8
Outer measure is translation-invariant, i.e.

m*(A) = m*(A + t)

for each A ⊆ ℝ and t ∈ ℝ.

Hint Combine two facts: the length of an interval does not change when the interval is shifted, and the outer measure is determined by the lengths of the coverings.
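For sets that are finite unions of intervals, both the 'length' and its translation invariance can be checked directly. A minimal sketch, with illustrative helper names of our own:

```python
def measure_of_union(intervals):
    # Total length of a finite union of intervals: sort by left endpoint
    # and add up the part of each interval not already covered.
    total, end = 0.0, float("-inf")
    for a, b in sorted(intervals):
        if b > end:
            total += b - max(a, end)
            end = b
    return total

def translate(intervals, t):
    # The set A + t = {a + t : a in A}, interval by interval.
    return [(a + t, b + t) for a, b in intervals]

A = [(0.0, 1.0), (0.5, 2.0), (3.0, 4.5)]
print(measure_of_union(A))  # 3.5
# Shifting every interval by t leaves the total length unchanged
# (up to floating-point rounding).
print(abs(measure_of_union(translate(A, 7.3)) - measure_of_union(A)) < 1e-9)  # True
```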
2.3 Lebesgue-measurable sets and Lebesgue
measure
With outer measure, subadditivity (as in Theorem 2.7) is as far as we can
get. We wish, however, to ensure that if sets (Eₙ) are pairwise disjoint (i.e. Eᵢ ∩ Eⱼ = ∅ if i ≠ j), then the inequality in Theorem 2.7 becomes an equality.
It turns out that this will not in general be true for outer measure, although
examples where it fails are quite difficult to construct (we give such examples in
the Appendix). But our wish is an entirely reasonable one: any 'length function'
should at least be finitely additive, since decomposing a set into finitely many
disjoint pieces should not alter its length. Moreover, since we constructed our
length function via approximation of complicated sets by 'simpler' sets (i.e. intervals), it seems fair to demand a continuity property: if pairwise disjoint (Eₙ) have union E, then the lengths of the sets Bₙ = E \ ⋃_{k=1}^{n} Eₖ may be expected to decrease to 0 as n → ∞. Combining this with finite additivity leads quite naturally to the demand that 'length' should be countably additive, i.e. that

m*(⋃_{n=1}^{∞} Eₙ) = ∑_{n=1}^{∞} m*(Eₙ)   when Eᵢ ∩ Eⱼ = ∅ for i ≠ j.
We therefore turn to the task of finding the class of sets in ℝ which have this property. It turns out that it is also the key property of the abstract concept of measure, and we will use it to provide mathematical foundations for probability.

In order to define the 'good' sets which have this property, it also seems plausible that such a set should apportion the outer measure of every set in ℝ properly, as we state in Definition 2.9 below. Remarkably, this simple demand will suffice to guarantee that our 'good' sets have all the properties we demand of them!
Definition 2.9
A set E ⊆ ℝ is (Lebesgue-)measurable if for every set A ⊆ ℝ we have

m*(A) = m*(A ∩ E) + m*(A ∩ Eᶜ),   (2.6)

where Eᶜ = ℝ \ E, and we then write E ∈ M.
We obviously have A = (A ∩ E) ∪ (A ∩ Eᶜ), hence by Theorem 2.7 we have

m*(A) ≤ m*(A ∩ E) + m*(A ∩ Eᶜ)

for any A and E. So our future task of verifying (2.6) has simplified: E ∈ M if and only if the following inequality holds:

m*(A) ≥ m*(A ∩ E) + m*(A ∩ Eᶜ)   for all A ⊆ ℝ.   (2.7)

Now we give some examples of measurable sets.

Theorem 2.10

(i) Any null set is measurable.

(ii) Any interval is measurable.
Proof
(i) If N is a null set, then (Proposition 2.4) m*(N) = 0. So for any A ⊆ ℝ we have

m*(A ∩ N) ≤ m*(N) = 0   since A ∩ N ⊆ N,
m*(A ∩ Nᶜ) ≤ m*(A)   since A ∩ Nᶜ ⊆ A,

and adding together we have proved (2.7).
(ii) Let E = I be an interval. Suppose, for example, that I = [a, b]. Take any A ⊆ ℝ and ε > 0. Find a covering (Iₙ) of A with

m*(A) ≤ ∑_{n=1}^{∞} l(Iₙ) ≤ m*(A) + ε.

Clearly the intervals I′ₙ = Iₙ ∩ [a, b] cover A ∩ [a, b], hence

m*(A ∩ [a, b]) ≤ ∑_{n=1}^{∞} l(I′ₙ).

The intervals I″ₙ = Iₙ ∩ (−∞, a), I‴ₙ = Iₙ ∩ (b, +∞) cover A ∩ [a, b]ᶜ, so

m*(A ∩ [a, b]ᶜ) ≤ ∑_{n=1}^{∞} l(I″ₙ) + ∑_{n=1}^{∞} l(I‴ₙ).

Putting the above three inequalities together we obtain (2.7).

If I is unbounded, I = [a, ∞) say, then the proof is even simpler since it is sufficient to consider I′ₙ = Iₙ ∩ [a, ∞) and I″ₙ = Iₙ ∩ (−∞, a). □
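For sets built from finitely many intervals, condition (2.6) can also be verified numerically: splitting A by a test interval E = [lo, hi] does not change the total length. A rough sketch (the helper names are ours, and endpoints, being of measure zero, are ignored):

```python
def length(intervals):
    # Total length of a finite union of intervals (merging overlaps).
    total, end = 0.0, float("-inf")
    for a, b in sorted(intervals):
        if b > end:
            total += b - max(a, end)
            end = b
    return total

def clip(intervals, lo, hi):
    # Intersect a union of intervals with [lo, hi].
    pieces = [(max(a, lo), min(b, hi)) for a, b in intervals]
    return [(a, b) for a, b in pieces if a < b]

A = [(-1.0, 0.5), (1.0, 3.0), (3.5, 6.0)]
lo, hi = 0.0, 4.0                          # the test set E = [0, 4]
inside = clip(A, lo, hi)                   # A ∩ E
outside = clip(A, float("-inf"), lo) + clip(A, hi, float("inf"))  # A ∩ E^c
# Splitting A by E leaves the total length unchanged, as (2.6) predicts.
print(length(inside) + length(outside), length(A))  # 6.0 6.0
```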
The fundamental properties of the class M of all Lebesgue-measurable subsets of ℝ can now be proved. They fall into two categories: first we show that
certain set operations on sets in M again produce sets in M (these are what
we call 'closure properties') and second we prove that for sets in M the outer
measure m* has the property of countable additivity announced above.
Theorem 2.11
(i) ℝ ∈ M,

(ii) if E ∈ M then Eᶜ ∈ M,

(iii) if Eₙ ∈ M for all n = 1, 2, ... then ⋃_{n=1}^{∞} Eₙ ∈ M.

Moreover, if Eₙ ∈ M, n = 1, 2, ..., and Eⱼ ∩ Eₖ = ∅ for j ≠ k, then

m*(⋃_{n=1}^{∞} Eₙ) = ∑_{n=1}^{∞} m*(Eₙ).   (2.8)
Remark 2.12
This result is the most important theorem in this chapter and provides the
basis for all that follows. It also allows us to give names to the quantities under
discussion.
Conditions (i)-(iii) mean that M is a σ-field. In other words, we say that a family of sets is a σ-field if it contains the base set and is closed under complements and countable unions. A [0, ∞]-valued function defined on a σ-field is called a measure if it satisfies (2.8) for pairwise disjoint sets, i.e. it is countably additive.
An alternative, rather more abstract and general, approach to measure theory is to begin with the above properties as axioms, i.e. to call a triple (Ω, F, μ) a measure space if Ω is an abstractly given set, F is a σ-field of subsets of Ω, and μ : F → [0, ∞] is a function satisfying (2.8) (with μ instead of m*). The task of defining Lebesgue measure on ℝ then becomes that of verifying, with M and m = m* on M defined as above, that the triple (ℝ, M, m) satisfies these axioms, i.e. becomes a measure space.
Although the requirements of probability theory will mean that we have to
consider such general measure spaces in due course, we have chosen our more
concrete approach to the fundamental example of Lebesgue measure in order
to demonstrate how this important measure space arises quite naturally from
considerations of the 'lengths' of sets in ℝ, and leads to a theory of integration
which greatly extends that of Riemann. It is also sufficient to allow us to develop
most of the important examples of probability distributions.
Proof of the theorem
(i) Let A ⊆ ℝ. Note that A ∩ ℝ = A and ℝᶜ = ∅, so that A ∩ ℝᶜ = ∅. Now (2.6) reads m*(A) = m*(A) + m*(∅) and is of course true since m*(∅) = 0.

(ii) Suppose E ∈ M and take any A ⊆ ℝ. We have to show (2.6) for Eᶜ, i.e.

m*(A) = m*(A ∩ Eᶜ) + m*(A ∩ (Eᶜ)ᶜ),

but since (Eᶜ)ᶜ = E this reduces to the condition for E, which holds by hypothesis.
We split the proof of (iii) into several steps. But first:
A warm-up Suppose that E₁ ∩ E₂ = ∅, E₁, E₂ ∈ M. We shall show that E₁ ∪ E₂ ∈ M and m*(E₁ ∪ E₂) = m*(E₁) + m*(E₂).
Let A ⊆ ℝ. We have the condition for E₁:

m*(A) = m*(A ∩ E₁) + m*(A ∩ E₁ᶜ).   (2.9)

Now, apply (2.6) for E₂ with A ∩ E₁ᶜ in place of A:

m*(A ∩ E₁ᶜ) = m*((A ∩ E₁ᶜ) ∩ E₂) + m*((A ∩ E₁ᶜ) ∩ E₂ᶜ)
            = m*(A ∩ (E₁ᶜ ∩ E₂)) + m*(A ∩ (E₁ᶜ ∩ E₂ᶜ))

(the situation is depicted in Figure 2.2).

Figure 2.2 The sets A, E₁, E₂

Since E₁ and E₂ are disjoint, E₁ᶜ ∩ E₂ = E₂. By de Morgan's law, E₁ᶜ ∩ E₂ᶜ = (E₁ ∪ E₂)ᶜ. We substitute and we have

m*(A ∩ E₁ᶜ) = m*(A ∩ E₂) + m*(A ∩ (E₁ ∪ E₂)ᶜ).

Inserting this into (2.9) we get

m*(A) = m*(A ∩ E₁) + m*(A ∩ E₂) + m*(A ∩ (E₁ ∪ E₂)ᶜ).   (2.10)

Now by the subadditivity property of m* we have

m*(A ∩ E₁) + m*(A ∩ E₂) ≥ m*((A ∩ E₁) ∪ (A ∩ E₂)) = m*(A ∩ (E₁ ∪ E₂)),

so (2.10) gives

m*(A) ≥ m*(A ∩ (E₁ ∪ E₂)) + m*(A ∩ (E₁ ∪ E₂)ᶜ),

which is sufficient for E₁ ∪ E₂ to belong to M (the inverse inequality is always true, as observed before (2.7)).

Finally, put A = E₁ ∪ E₂ in (2.10) to get m*(E₁ ∪ E₂) = m*(E₁) + m*(E₂), which completes the argument. □
We return to the proof of the theorem.
Step 1. If pairwise disjoint Eₖ, k = 1, 2, ..., are in M, then their union is in M and (2.8) holds.
We begin as in the proof of the warm-up and we have

m*(A) = m*(A ∩ E₁) + m*(A ∩ E₁ᶜ),
m*(A) = m*(A ∩ E₁) + m*(A ∩ E₂) + m*(A ∩ (E₁ ∪ E₂)ᶜ)

(see (2.10)), and after n steps we expect

m*(A) = ∑_{k=1}^{n} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{n} Eₖ)ᶜ).   (2.11)

Let us demonstrate this by induction. The case n = 1 is the first line above. Suppose that

m*(A) = ∑_{k=1}^{n−1} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ).   (2.12)
Since Eₙ ∈ M, we may apply (2.6) with A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ in place of A:

m*(A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ) = m*(A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ ∩ Eₙ) + m*(A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ ∩ Eₙᶜ).   (2.13)
Now we make the same observations as in the warm-up:

(⋃_{k=1}^{n−1} Eₖ)ᶜ ∩ Eₙ = Eₙ   (the Eᵢ are pairwise disjoint),
(⋃_{k=1}^{n−1} Eₖ)ᶜ ∩ Eₙᶜ = (⋃_{k=1}^{n} Eₖ)ᶜ   (by de Morgan's law).

Inserting these into (2.13) we get

m*(A ∩ (⋃_{k=1}^{n−1} Eₖ)ᶜ) = m*(A ∩ Eₙ) + m*(A ∩ (⋃_{k=1}^{n} Eₖ)ᶜ),

and inserting this into the induction hypothesis (2.12) we get

m*(A) = ∑_{k=1}^{n−1} m*(A ∩ Eₖ) + m*(A ∩ Eₙ) + m*(A ∩ (⋃_{k=1}^{n} Eₖ)ᶜ)
      = ∑_{k=1}^{n} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{n} Eₖ)ᶜ),

as required to complete the induction step. Thus (2.11) holds for all n by induction.
As will be seen in the next step, the fact that the Eₖ are pairwise disjoint is not necessary in order to ensure that their union belongs to M. However, with this assumption we have equality in (2.11), which does not hold otherwise. This equality will allow us to prove countable additivity (2.8).
Since

⋃_{k=1}^{n} Eₖ ⊆ ⋃_{k=1}^{∞} Eₖ,

from (2.11) by monotonicity (Proposition 2.5) we get

m*(A) ≥ ∑_{k=1}^{n} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{∞} Eₖ)ᶜ).

The inequality remains true after we pass to the limit n → ∞:

m*(A) ≥ ∑_{k=1}^{∞} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{∞} Eₖ)ᶜ).   (2.14)

By countable subadditivity (Theorem 2.7),

∑_{k=1}^{∞} m*(A ∩ Eₖ) ≥ m*(A ∩ ⋃_{k=1}^{∞} Eₖ),

and so

m*(A) ≥ m*(A ∩ ⋃_{k=1}^{∞} Eₖ) + m*(A ∩ (⋃_{k=1}^{∞} Eₖ)ᶜ)   (2.15)

as required. So we have shown that ⋃_{k=1}^{∞} Eₖ ∈ M, and hence the two sides of (2.15) are equal. The right-hand side of (2.14) is squeezed between the left and right of (2.15), which yields

m*(A) = ∑_{k=1}^{∞} m*(A ∩ Eₖ) + m*(A ∩ (⋃_{k=1}^{∞} Eₖ)ᶜ).   (2.16)

The equality here is a consequence of the assumption that the Eₖ are pairwise disjoint. It holds for any set A, so we may insert A = ⋃_{j=1}^{∞} Eⱼ. The last term on the right is then zero, since it equals m*(∅). Next, (⋃_{j=1}^{∞} Eⱼ) ∩ Eₙ = Eₙ, and so we have (2.8).
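Countable additivity (2.8) can be illustrated with the pairwise disjoint intervals Eₙ = (2^−(n+1), 2^−n]: truncated at N terms, the sum of the measures equals the measure of the union exactly, and the partial sums converge to m((0, 1/2]) = 1/2. A small sketch using exact rational arithmetic (the function names are ours):

```python
from fractions import Fraction

# E_n = (2**-(n+1), 2**-n] are pairwise disjoint, and E_1 ∪ ... ∪ E_N
# is the single interval (2**-(N+1), 1/2].
def sum_of_measures(N):
    # l(E_n) = 2**-n - 2**-(n+1) = 2**-(n+1)
    return sum(Fraction(1, 2 ** (n + 1)) for n in range(1, N + 1))

def measure_of_union(N):
    return Fraction(1, 2) - Fraction(1, 2 ** (N + 1))

# Finite additivity holds exactly at every truncation level ...
for N in (1, 5, 30):
    assert sum_of_measures(N) == measure_of_union(N)
# ... and the partial sums converge to m((0, 1/2]) = 1/2.
print(float(sum_of_measures(30)))  # close to 0.5
```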
Step 2. If E₁, E₂ ∈ M, then E₁ ∪ E₂ ∈ M (not necessarily disjoint).

Again we begin as in the warm-up:

m*(A) = m*(A ∩ E₁) + m*(A ∩ E₁ᶜ).   (2.17)
Next, applying (2.6) to E₂ with A ∩ E₁ᶜ in place of A, we get

m*(A ∩ E₁ᶜ) = m*(A ∩ E₁ᶜ ∩ E₂) + m*(A ∩ E₁ᶜ ∩ E₂ᶜ).

We insert this into (2.17) to get

m*(A) = m*(A ∩ E₁) + m*(A ∩ E₁ᶜ ∩ E₂) + m*(A ∩ E₁ᶜ ∩ E₂ᶜ).   (2.18)

By de Morgan's law, E₁ᶜ ∩ E₂ᶜ = (E₁ ∪ E₂)ᶜ, so (as before)

m*(A ∩ E₁ᶜ ∩ E₂ᶜ) = m*(A ∩ (E₁ ∪ E₂)ᶜ).   (2.19)

By subadditivity of m* we have

m*(A ∩ E₁) + m*(A ∩ E₁ᶜ ∩ E₂) ≥ m*(A ∩ (E₁ ∪ E₂)).   (2.20)

Inserting (2.19) and (2.20) into (2.18) we get

m*(A) ≥ m*(A ∩ (E₁ ∪ E₂)) + m*(A ∩ (E₁ ∪ E₂)ᶜ)

as required.
Step 3. If Eₖ ∈ M, k = 1, ..., n, then E₁ ∪ ... ∪ Eₙ ∈ M (not necessarily disjoint).

We argue by induction. There is nothing to prove for n = 1. Suppose the claim is true for n − 1. Then

E₁ ∪ ... ∪ Eₙ = (E₁ ∪ ... ∪ Eₙ₋₁) ∪ Eₙ,

so that the result follows from Step 2.
Step 4. If E₁, E₂ ∈ M, then E₁ ∩ E₂ ∈ M.

We have E₁ᶜ, E₂ᶜ ∈ M by (ii), E₁ᶜ ∪ E₂ᶜ ∈ M by Step 2, and (E₁ᶜ ∪ E₂ᶜ)ᶜ ∈ M by (ii) again; but by de Morgan's law the last set is equal to E₁ ∩ E₂.
Step 5. The general case: if E₁, E₂, ... are in M, then so is ⋃_{k=1}^{∞} Eₖ.

Let Eₖ ∈ M, k = 1, 2, .... We define an auxiliary sequence of pairwise disjoint sets Fₖ with the same union as the Eₖ:

F₁ = E₁,
F₂ = E₂ \ E₁ = E₂ ∩ E₁ᶜ,
F₃ = E₃ \ (E₁ ∪ E₂) = E₃ ∩ (E₁ ∪ E₂)ᶜ,

and so on
(see Figure 2.3).

Figure 2.3 The sets Fₖ
By Steps 3 and 4 we know that all the Fₖ are in M. By the very construction they are pairwise disjoint, so by Step 1 their union is in M. We shall show that

⋃_{k=1}^{∞} Fₖ = ⋃_{k=1}^{∞} Eₖ.

This will complete the proof since the latter is now in M. The inclusion

⋃_{k=1}^{∞} Fₖ ⊆ ⋃_{k=1}^{∞} Eₖ

is obvious since for each k, Fₖ ⊆ Eₖ by definition. For the inverse, let a ∈ ⋃_{k=1}^{∞} Eₖ. Put S = {n ∈ ℕ : a ∈ Eₙ}, which is non-empty since a belongs to the union. Let n₀ = min S ∈ S. If n₀ = 1, then a ∈ E₁ = F₁. Suppose n₀ > 1. Then a ∈ Eₙ₀ and, by the definition of n₀, a ∉ E₁, ..., a ∉ Eₙ₀₋₁. By the definition of Fₙ₀ this means that a ∈ Fₙ₀, so a is in ⋃_{k=1}^{∞} Fₖ. □
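The disjointification Fₖ = Eₖ \ (E₁ ∪ ⋯ ∪ Eₖ₋₁) used in Step 5 is purely set-theoretic, so it can be demonstrated on finite sets. A short sketch (the function name is ours):

```python
def disjointify(sets):
    # F_k = E_k \ (E_1 ∪ ... ∪ E_{k-1}): pairwise disjoint, same union.
    so_far, result = set(), []
    for e in sets:
        result.append(set(e) - so_far)
        so_far |= set(e)
    return result

E = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]
F = disjointify(E)
print(F)  # [{1, 2, 3}, {4}, {5}, {6}]
print(set().union(*F) == set().union(*E))  # True: the union is unchanged
```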
Using de Morgan's laws you should easily verify an additional property of
M.
Proposition 2.13
If Eₖ ∈ M, k = 1, 2, ..., then

E = ⋂_{k=1}^{∞} Eₖ ∈ M.
We can therefore summarise the properties of the family M of Lebesgue-measurable sets as follows: M is closed under countable unions, countable intersections, and complements. It contains intervals and all null sets.
Definition 2.14
We shall write m(E) instead of m*(E) for any E in M and call m(E) the
Lebesgue measure of the set E.
Thus Theorems 2.11 and 2.6 now read as follows, and describe the construction which we have laboured so hard to establish: Lebesgue measure m : M → [0, ∞] is a countably additive set function defined on the σ-field M of measurable sets. The Lebesgue measure of an interval is equal to its length, and the Lebesgue measure of a null set is zero.
2.4 Basic properties of Lebesgue measure
Since Lebesgue measure is nothing else than the outer measure restricted to
a special class of sets, some properties of the outer measure are automatically
inherited by Lebesgue measure:
Proposition 2.15
Suppose that A, B ∈ M.

(i) If A ⊂ B then m(A) ≤ m(B).

(ii) If A ⊂ B and m(A) is finite, then m(B \ A) = m(B) − m(A).

(iii) m is translation-invariant.

Since ∅ ∈ M, we can take Eᵢ = ∅ for all i > n in (2.8) to conclude that Lebesgue measure is additive: if Eᵢ ∈ M are pairwise disjoint, then

m(⋃_{i=1}^{n} Eᵢ) = ∑_{i=1}^{n} m(Eᵢ).
Exercise 2.6
Find a formula describing m(A U B) and m(A U B U C) in terms of
measures of the individual sets and their intersections (we do not assume
that the sets are pairwise disjoint).
Recalling that the symmetric difference A△B of two sets is defined by A△B = (A \ B) ∪ (B \ A), the following result is also easy to check:

Proposition 2.16

If A ∈ M and m(A△B) = 0, then B ∈ M and m(A) = m(B).
Hint Recall that null sets belong to M and that subsets of null sets are null.
As we noted in Chapter 1, every open set in ℝ can be expressed as the union of a countable number of open intervals. This ensures that open sets in ℝ are
Lebesgue-measurable, since M contains intervals and is closed under countable
unions. We can approximate the Lebesgue measure of any A E M from above
by the measures of a sequence of open sets containing A. This is clear from the
following result:
Theorem 2.17
(i) For any ε > 0 and A ⊂ ℝ we can find an open set O such that

A ⊂ O,   m(O) ≤ m*(A) + ε.

Consequently, for any E ∈ M we can find an open set O containing E such that m(O \ E) < ε.

(ii) For any A ⊂ ℝ we can find a sequence of open sets Oₙ such that

A ⊂ ⋂ₙ Oₙ   and   m(⋂ₙ Oₙ) = m*(A).
Proof
(i) By definition of m*(A) we can find a sequence (Iₙ) of intervals with A ⊂ ⋃ₙ Iₙ and

∑_{n=1}^{∞} l(Iₙ) ≤ m*(A) + ε/2.

Each Iₙ is contained in an open interval whose length is very close to that of Iₙ; if the left and right endpoints of Iₙ are aₙ and bₙ respectively, let Jₙ = (aₙ − ε/2ⁿ⁺², bₙ + ε/2ⁿ⁺²). Set O = ⋃ₙ Jₙ, which is open. Then A ⊂ O and

m(O) ≤ ∑_{n=1}^{∞} l(Jₙ) ≤ ∑_{n=1}^{∞} l(Iₙ) + ε/2 ≤ m*(A) + ε.
When m(E) < ∞ the final statement follows at once from (ii) in Proposition 2.15, since then m(O \ E) = m(O) − m(E) ≤ ε. When m(E) = ∞ we first write ℝ as a countable union of finite intervals: ℝ = ⋃ₙ(−n, n). Now Eₙ = E ∩ (−n, n) has finite measure, so we can find an open Oₙ ⊃ Eₙ with m(Oₙ \ Eₙ) ≤ ε/2ⁿ. The set O = ⋃ₙ Oₙ is open and contains E. Now

O \ E = (⋃ₙ Oₙ) \ E ⊆ ⋃ₙ(Oₙ \ Eₙ),

so that m(O \ E) ≤ ∑ₙ m(Oₙ \ Eₙ) ≤ ε as required.

(ii) In (i) use ε = 1/n and let Oₙ be the open set so obtained. With E = ⋂ₙ Oₙ we obtain a measurable set containing A such that m(E) ≤ m(Oₙ) ≤ m*(A) + 1/n for each n, hence the result follows. □
Remark 2.18
Theorem 2.17 shows how the freedom of movement allowed by the closure properties of the σ-field M can be exploited: for any set A ⊂ ℝ it produces a measurable set O ⊃ A which is obtained from open intervals by two operations (countable unions followed by countable intersections) and whose measure equals the outer measure of A.
Finally we show that monotone sequences of measurable sets behave as one
would expect with respect to m.
Theorem 2.19
Suppose that Aₙ ∈ M for all n ≥ 1. Then we have:

(i) if Aₙ ⊂ Aₙ₊₁ for all n, then

m(⋃ₙ Aₙ) = lim_{n→∞} m(Aₙ);

(ii) if Aₙ ⊃ Aₙ₊₁ for all n and m(A₁) < ∞, then

m(⋂ₙ Aₙ) = lim_{n→∞} m(Aₙ).
Proof
(i) Let B₁ = A₁ and Bᵢ = Aᵢ \ Aᵢ₋₁ for i > 1. Then ⋃_{i=1}^{∞} Bᵢ = ⋃_{i=1}^{∞} Aᵢ and the Bᵢ ∈ M are pairwise disjoint, so that

m(⋃_{i=1}^{∞} Aᵢ) = ∑_{i=1}^{∞} m(Bᵢ)   (by countable additivity)
  = lim_{n→∞} ∑_{i=1}^{n} m(Bᵢ)
  = lim_{n→∞} m(⋃_{i=1}^{n} Bᵢ)   (by additivity)
  = lim_{n→∞} m(Aₙ),

since Aₙ = ⋃_{i=1}^{n} Bᵢ by construction (see Figure 2.4).

Figure 2.4 Sets Aₙ, Bₙ
(ii) A₁ \ A₁ = ∅ ⊂ A₁ \ A₂ ⊂ ... ⊂ A₁ \ Aₙ ⊂ ... for all n, so that by (i)

m(⋃ₙ(A₁ \ Aₙ)) = lim_{n→∞} m(A₁ \ Aₙ),

and since m(A₁) is finite, m(A₁ \ Aₙ) = m(A₁) − m(Aₙ). On the other hand, ⋃ₙ(A₁ \ Aₙ) = A₁ \ ⋂ₙ Aₙ, so that

m(⋃ₙ(A₁ \ Aₙ)) = m(A₁) − m(⋂ₙ Aₙ),

while by the first display it also equals m(A₁) − lim_{n→∞} m(Aₙ). The result follows. □
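Theorem 2.19 can be illustrated with concrete monotone families of intervals, say Aₙ = (0, 1 − 1/n) increasing to (0, 1) and Bₙ = (0, 1/n) decreasing to ∅. A quick numerical sketch (the helper name is ours):

```python
def m(a, b):
    # Lebesgue measure of the interval (a, b), for a <= b.
    return max(0.0, b - a)

# A_n = (0, 1 - 1/n) increases to (0, 1); B_n = (0, 1/n) decreases to ∅.
increasing = [m(0.0, 1.0 - 1.0 / n) for n in range(1, 10001)]
decreasing = [m(0.0, 1.0 / n) for n in range(1, 10001)]
print(increasing[-1])  # approaching m((0, 1)) = 1
print(decreasing[-1])  # approaching m(∅) = 0
```

Note that the finiteness assumption in part (ii) matters: the sets (n, ∞) decrease to ∅ yet each has infinite measure.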
Remark 2.20
The proof of Theorem 2.19 simply relies on the countable additivity of m and on the definition of the sum of a series in [0, ∞], i.e.

∑_{i=1}^{∞} m(Aᵢ) = lim_{n→∞} ∑_{i=1}^{n} m(Aᵢ).

Consequently the result is true, not only for the set function m we have constructed on M, but for any countably additive set function defined on a σ-field. It also leads us to the following claim, which, though we consider it here only for m, actually characterizes countably additive set functions.
Theorem 2.21
The set function m satisfies:

(i) m is finitely additive, i.e. for pairwise disjoint sets (Aᵢ) we have

m(⋃_{i=1}^{n} Aᵢ) = ∑_{i=1}^{n} m(Aᵢ)

for each n;

(ii) m is continuous at ∅, i.e. if (Bₙ) decreases to ∅, then m(Bₙ) decreases to 0.
Proof
To prove this claim, recall that m : M → [0, ∞] is countably additive. This implies (i), as we have already seen. To prove (ii), consider a sequence (Bₙ) in M which decreases to ∅. Then Aₙ = Bₙ \ Bₙ₊₁ defines a disjoint sequence in M, and ⋃ₙ Aₙ = B₁. We may assume that B₁ is bounded, so that m(Bₙ) is finite for all n, and then, by Proposition 2.15 (ii), m(Aₙ) = m(Bₙ) − m(Bₙ₊₁) ≥ 0. Hence we have

m(B₁) = ∑_{n=1}^{∞} m(Aₙ) = lim_{k→∞} ∑_{n=1}^{k} [m(Bₙ) − m(Bₙ₊₁)] = m(B₁) − lim_{n→∞} m(Bₙ),

which shows that m(Bₙ) → 0, as required. □
2.5 Borel sets
The definition of M does not easily lend itself to verification that a particular
set belongs to M; in our proofs we have had to work quite hard to show that
M is closed under various operations. It is therefore useful to add another
construction to our armoury, one which shows more directly how open sets
(and indeed open intervals) and the structure of σ-fields lie at the heart of
many of the concepts we have developed.
We begin with an auxiliary construction enabling us to produce new σ-fields.
Theorem 2.22
The intersection of a family of σ-fields is a σ-field.
Proof
Let F_α be σ-fields for α ∈ A (the index set A can be arbitrary). Put

F = ⋂_{α∈A} F_α.

We verify the conditions of the definition.

(i) ℝ ∈ F_α for all α ∈ A, so ℝ ∈ F.

(ii) If E ∈ F, then E ∈ F_α for all α ∈ A. Since the F_α are σ-fields, Eᶜ ∈ F_α, and so Eᶜ ∈ F.

(iii) If Eₖ ∈ F for k = 1, 2, ..., then Eₖ ∈ F_α for all α and k, hence ⋃_{k=1}^{∞} Eₖ ∈ F_α for all α, and so ⋃_{k=1}^{∞} Eₖ ∈ F. □
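On a finite base set, the σ-field generated by a family can be computed by brute-force closure under complements and unions, and the intersection of two σ-fields can then be checked to be a σ-field again, in line with Theorem 2.22. An illustrative sketch (the function names are ours):

```python
from itertools import combinations

def generate_sigma_field(base, family):
    # Brute-force closure of `family` under complements and unions;
    # on a finite base set this yields the generated sigma-field.
    omega = frozenset(base)
    sets = {frozenset(), omega} | {frozenset(s) for s in family}
    changed = True
    while changed:
        changed = False
        for s in list(sets):
            if omega - s not in sets:
                sets.add(omega - s)
                changed = True
        for s, t in combinations(list(sets), 2):
            if s | t not in sets:
                sets.add(s | t)
                changed = True
    return sets

base = {1, 2, 3, 4}
sigma = generate_sigma_field(base, [{1}, {2}])   # atoms {1}, {2}, {3, 4}
sigma_b = generate_sigma_field(base, [{1}])      # atoms {1}, {2, 3, 4}
common = sigma & sigma_b                         # again a sigma-field
print(len(sigma), len(sigma_b), len(common))     # 8 4 4
```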
Definition 2.23
Put

B = ⋂{F : F is a σ-field containing all intervals}.

We say that B is the σ-field generated by all intervals and we call the elements of B Borel sets (after Émile Borel, 1871-1956). It is obviously the smallest σ-field containing all intervals. In general, we say that G is the σ-field generated by a family of sets A if G = ⋂{F : F is a σ-field such that F ⊇ A}.
Example 2.24
(Borel sets) The following examples illustrate how the closure properties of the σ-field B may be used to verify that most familiar sets in ℝ belong to B.
(i) By construction, all intervals belong to B, and since B is a σ-field, all open sets must belong to B, as any open set is a countable union of (open) intervals.

(ii) Countable sets are Borel sets, since each is a countable union of closed intervals of the form [a, a]; in particular ℕ and ℚ are Borel sets. Hence, as the complement of a Borel set, the set of irrational numbers is also Borel. Similarly, finite and cofinite sets are Borel sets.
The definition of B is also very flexible: as long as we start with all intervals of a particular type, these collections generate the same Borel σ-field:
Theorem 2.25
If instead of the family of all intervals we take all open intervals, all closed intervals, all intervals of the form (a, ∞) (or of the form [a, ∞), (−∞, b), or (−∞, b]), all open sets, or all closed sets, then the σ-field generated by them is the same as B.
Proof
Consider for example the σ-field generated by the family J of open intervals and denote it by C:

C = ⋂{F : F ⊇ J, F is a σ-field}.

We have to show that B = C. Since open intervals are intervals, J ⊂ I (the family of all intervals), so

{F : F ⊇ I} ⊂ {F : F ⊇ J},

i.e. the collection of all σ-fields F which contain I is smaller than the collection of all σ-fields which contain the smaller family J, since it is a more demanding requirement to contain a bigger family, so there are fewer such objects. The inclusion is reversed after we take the intersection on both sides, thus C ⊂ B (the intersection of a smaller family is bigger, as the requirement of belonging to each of its members is a less stringent one).

We shall show that C contains all intervals. This will be sufficient, since B is the intersection of all such σ-fields, so it is contained in each of them, so B ⊂ C.
To this end consider the intervals [a, b), [a, b], and (a, b] (the intervals of the form (a, b) are in C by definition):

[a, b) = ⋂_{n=1}^{∞} (a − 1/n, b),
[a, b] = ⋂_{n=1}^{∞} (a − 1/n, b + 1/n),
(a, b] = ⋂_{n=1}^{∞} (a, b + 1/n).

C, as a σ-field, is closed with respect to countable intersections, so it contains the sets on the right. The argument for unbounded intervals is similar. The proof is complete. □
Exercise 2.7
Show that the family of intervals of the form (a, b] also generates the σ-field of Borel sets. Show that the same is true for the family of all intervals [a, b).
Remark 2.26
Since M is a σ-field containing all intervals, and B is the smallest such σ-field, we have the inclusion B ⊂ M, i.e. every Borel set in ℝ is Lebesgue-measurable. The question therefore arises whether these σ-fields might be the same. In fact the inclusion is proper. It is not altogether straightforward to construct a set in M \ B, and we shall not attempt this here (but see the Appendix). However, by Theorem 2.17 (ii), given any E ∈ M we can find a Borel set B ⊃ E of the form B = ⋂ₙ Oₙ, where the (Oₙ) are open sets, and such that m(E) = m(B). In particular,

m(B△E) = m(B \ E) = 0.

Hence m cannot distinguish between the measurable set E and the Borel set B we have constructed.
Thus, given a Lebesgue-measurable set E we can find a Borel set B such that their symmetric difference E△B is a null set. Now we know that E△B ∈ M, and it is obvious that subsets of null sets are also null, and hence in M. However, we cannot conclude that every null set will be a Borel set (if B did contain all null sets then by Theorem 2.17 (ii) we would have B = M), and this points to an 'incompleteness' in B which explains why, even if we begin by defining m on intervals and then extend the definition to Borel sets, we would also need to extend it further in order to be able to identify precisely which sets are 'negligible' for our purposes. On the other hand, extension of the measure m to the σ-field M will suffice, since M does contain all m-null sets and all subsets of null sets also belong to M.
We show that M is the smallest σ-field on ℝ with this property, and we say that M is the completion of B relative to m, and that (ℝ, M, m) is complete (whereas the measure space (ℝ, B, m) is not complete). More precisely, a measure space (X, F, μ) is complete if for all F ∈ F with μ(F) = 0 and all N ⊂ F we have N ∈ F (and so μ(N) = 0).

The completion of a σ-field G, relative to a given measure μ, is defined as the smallest σ-field F containing G such that, if N ⊂ G ∈ G and μ(G) = 0, then N ∈ F.
Proposition 2.27
The completion of G is of the form {G ∪ N : G ∈ G, N ⊂ F for some F ∈ G with μ(F) = 0}.

This allows us to extend the measure μ uniquely to a measure μ̄ on the completion F by setting μ̄(G ∪ N) = μ(G) for G ∈ G.
Theorem 2.28
M is the completion of B.
Proof
We show first that M contains all subsets of null sets in B: so let N ⊂ B ∈ B, B null, and suppose A ⊂ ℝ. To show that N ∈ M we need to show that

m*(A) ≥ m*(A ∩ N) + m*(A ∩ Nᶜ).

First note that m*(A ∩ N) ≤ m*(N) ≤ m*(B) = 0. So it remains to show that m*(A) ≥ m*(A ∩ Nᶜ), but this follows at once from monotonicity of m*.

Thus we have shown that N ∈ M. Since M is a complete σ-field containing B, this means that M also contains the completion C of B.

Finally, we show that M is the minimal such σ-field, i.e. that M ⊂ C: first consider E ∈ M with m*(E) < ∞, and choose B = ⋂ₙ Oₙ ∈ B as described above such that B ⊃ E, m(B) = m*(E). (We reserve the use of m for sets in B throughout this argument.)

Consider N = B \ E, which is in M and has m*(N) = 0, since m* is additive on M. By Theorem 2.17 (ii) we can find L ⊃ N, L ∈ B with m(L) = 0. In other words, N is a subset of a null set in B, and therefore E = B \ N belongs to the completion C of B. For E ∈ M with m*(E) = ∞, apply the above to Eₙ = E ∩ [−n, n] for each n ∈ ℕ. Each m*(Eₙ) is finite, so the Eₙ all belong to C and hence so does their countable union E. Thus M ⊂ C and so they are equal. □
Despite these technical differences, measurable sets are never far from 'nice'
sets, and, in addition to approximations from above by open sets, as observed
in Theorem 2.17, we can approximate the measure of any E ∈ M from below
by those of closed subsets.
Theorem 2.29
If E ∈ M then for given ε > 0 there exists a closed set F ⊂ E such that m(E \ F) < ε. Hence there exists B ⊂ E of the form B = ⋃ₙ Fₙ, where all the Fₙ are closed sets, and m(E \ B) = 0.

Proof

The complement Eᶜ is measurable and by Theorem 2.17 we can find an open set O containing Eᶜ such that m(O \ Eᶜ) ≤ ε. But O \ Eᶜ = O ∩ E = E \ Oᶜ, and F = Oᶜ is closed and contained in E. Hence this F is what we need. The final part is similar to Theorem 2.17 (ii), and the proof is left to the reader. □
Exercise 2.8
Show that each of the following two statements is equivalent to saying that E ∈ M:

(a) given ε > 0 there is an open set O ⊃ E with m*(O \ E) < ε,

(b) given ε > 0 there is a closed set F ⊂ E with m*(E \ F) < ε.
Remark 2.30
The two statements in the above exercise are the key to a considerable generalization, linking the ideas of measure theory to those of topology:

A non-negative countably additive set function μ defined on B is called a regular Borel measure if for every Borel set B we have:

μ(B) = inf{μ(O) : O open, O ⊃ B},
μ(B) = sup{μ(F) : F closed, F ⊂ B}.

In Theorems 2.17 and 2.29 we have verified these relations for Lebesgue measure. We shall consider other concrete examples of regular Borel measures later.
2.6 Probability
The ideas which led to Lebesgue measure may be adapted to construct measures generally on arbitrary sets: any set Ω carrying an outer measure (i.e. a mapping from P(Ω) to [0, ∞], monotone and countably subadditive) can be equipped with a measure μ defined on an appropriate σ-field F of its subsets. The resulting triple (Ω, F, μ) is then called a measure space, as observed in Remark 2.12. Note that in the construction of Lebesgue measure we only used the properties, not the particular form, of the outer measure.
For the present, however, we shall be content with noting simply how to restrict Lebesgue measure to any Lebesgue-measurable subset B of ℝ with m(B) > 0:

Given Lebesgue measure m on the Lebesgue σ-field M, let

M_B = {A ∩ B : A ∈ M},

and for A ∈ M_B write m_B(A) = m(A).
Proposition 2.31
(B, M_B, m_B) is a complete measure space.
We can finally state precisely what we mean by 'selecting a number from [0, 1] at random': restrict Lebesgue measure m to the interval B = [0, 1] and consider the σ-field M_[0,1] of measurable subsets of [0, 1]. Then m_[0,1] is a measure on M_[0,1] with 'total mass' 1. Since all subintervals of [0, 1] with the same length have equal measure, the 'mass' of m_[0,1] is spread uniformly over [0, 1], so that, for example, the 'probability' of choosing a number from [0, 1/10) is the same as that of choosing a number from [6/10, 7/10), namely 1/10. Thus all numerals are equally likely to appear as first digits of the decimal expansion of the chosen number. On the other hand, with this measure, the probability that the chosen number will be rational is 0, as is the probability of drawing an element of the Cantor set C.
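The uniform spread of the mass of m restricted to [0, 1] can be checked by simulation: under the uniform law, each first decimal digit should occur with relative frequency close to 1/10. A hedged Monte Carlo sketch, with pseudo-random draws standing in for the 'randomly chosen number':

```python
import random

random.seed(42)          # deterministic pseudo-random draws
N = 100_000
draws = [random.random() for _ in range(N)]

# Relative frequency of the event [d/10, (d+1)/10) for each first digit d.
freq = [sum(1 for x in draws if d / 10 <= x < (d + 1) / 10) / N
        for d in range(10)]
print(all(abs(f - 0.1) < 0.01 for f in freq))  # True: each digit near 1/10
```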
We now have the basis for some probability theory, although a general
development still requires the extension of the concept of measure from ℝ to
abstract sets. Nonetheless the building blocks are already evident in the detailed
development of the example of Lebesgue measure. The main idea in providing a
mathematical foundation for probability theory is to use the concept of measure
46
Measure, Integral and Probability
to provide the mathematical model of the intuitive notion of probability. The
distinguishing feature of probability is the concept of independence, which we
introduce below. We begin by defining the general framework.
2.6.1 Probability space
Definition 2.32
A probability space is a triple (Ω, F, P) where Ω is an arbitrary set, F is a σ-field of subsets of Ω, and P is a measure on F such that

P(Ω) = 1,

called a probability measure or, briefly, probability.
Remark 2.33
The original definition, given by Kolmogorov in 1932, is a variant of the above (see Theorem 2.21): (Ω, F, P) is a probability space if (Ω, F) are given as above, and P is a finitely additive set function with P(∅) = 0 and P(Ω) = 1 such that P(Bₙ) ↘ 0 whenever (Bₙ) in F decreases to ∅.
Example 2.34
We see at once that Lebesgue measure restricted to [0, 1] is a probability measure. More generally: suppose we are given an arbitrary Lebesgue-measurable set Ω ⊂ ℝ with m(Ω) > 0. Then P = c·m_Ω, where c = 1/m(Ω) and m_Ω denotes the restriction of Lebesgue measure to measurable subsets of Ω, provides a probability measure on Ω, since P is complete and P(Ω) = 1.

For example, if Ω = [a, b], we obtain c = 1/(b − a), and P becomes the 'uniform distribution' over [a, b]. However, we can also use less familiar sets for our base space; for example, Ω = [a, b] ∩ (ℝ \ ℚ) with c = 1/(b − a) gives the same distribution over the irrationals in [a, b].
2.6.2 Events: conditioning and independence
The word 'event' is used to indicate that something is happening. In probability
a typical event is to draw elements from a set and then the event is concerned
with the outcome belonging to a particular subset. So, as described above, if Ω = [0, 1] we may be interested in the fact that a number drawn at random from [0, 1] belongs to some A ⊂ [0, 1]. We want to estimate the probability of this happening, and in the mathematical setup this is the number P(A), here m_[0,1](A). So it is natural to require that A should belong to M_[0,1], since these are the sets we may measure. By a slight abuse of language, probabilists tend to identify the actual 'event' with the set A which features in the event. The next definition simply confirms this abuse of language.
Definition 2.35
Given a probability space (Ω, F, P) we say that the elements of F are events.
Suppose next that a number has been drawn from [0, 1] but has not been revealed yet. We would like to bet on it being in [0, 1/4], and we get a tip that it certainly belongs to [0, 1/2]. Clearly, given this 'inside information', the probability of success is now 1/2 rather than 1/4. This motivates the following general definition.
Definition 2.36
Suppose that P(B) > 0. Then the number

    P(A|B) = P(A ∩ B) / P(B)

is called the conditional probability of A given B.
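A numeric sketch of this formula, using the sets from the betting example (A = [0, 1/4], B = [0, 1/2]) with Lebesgue measure on [0, 1]; the helper names are our own:

```python
def length(intervals):
    """Total length of a list of disjoint intervals [(a, b), ...]."""
    return sum(b - a for a, b in intervals)

def intersect(i1, i2):
    """Intersection of two single intervals, as a (possibly empty) list."""
    a, b = max(i1[0], i2[0]), min(i1[1], i2[1])
    return [(a, b)] if a < b else []

A, B = (0.0, 0.25), (0.0, 0.5)
P_B = length([B])
P_A_given_B = length(intersect(A, B)) / P_B
print(P_A_given_B)  # 0.5, as claimed in the text
```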
Proposition 2.37
The mapping A ↦ P(A|B) is countably additive on the σ-field F.

Hint Use the fact that A ↦ P(A ∩ B) is countably additive on F.
A classical application of conditional probability is the total probability
formula, which enables the computation of the probability of an event by means
of conditional probabilities given some disjoint hypotheses:
Exercise 2.9
Prove that if H_i are pairwise disjoint events such that ⋃_{i=1}^∞ H_i = Ω and
P(H_i) ≠ 0, then

    P(A) = Σ_{i=1}^∞ P(A|H_i) P(H_i).
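A sketch verifying the total probability formula with Lebesgue measure on [0, 1), taking four equal hypotheses H_i (the partition and event are our own choices):

```python
# Partition [0,1) into n equal hypotheses H_i and take A = [0, 0.3).
# We check the total probability formula with Lebesgue measure (sketch).
n = 4
H = [(i / n, (i + 1) / n) for i in range(n)]
A = (0.0, 0.3)

def overlap(i1, i2):
    """Length of the intersection of two intervals."""
    return max(0.0, min(i1[1], i2[1]) - max(i1[0], i2[0]))

P_A = A[1] - A[0]
total = 0.0
for h in H:
    P_h = h[1] - h[0]
    P_A_given_h = overlap(A, h) / P_h   # conditional probability P(A|H_i)
    total += P_A_given_h * P_h
print(abs(total - P_A) < 1e-9)  # True: the two sides agree
```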
It is natural to say that the event A is independent of B if the fact that B
takes place has no influence on the chances of A, i.e.
    P(A|B) = P(A).
By definition of P(AIB) this immediately implies the relation
    P(A ∩ B) = P(A)P(B),

which is usually taken as the definition of independence. The advantage of this
practice is that we may dispense with the assumption P(B) > 0.
Definition 2.38
The events A, B are independent if
    P(A ∩ B) = P(A) · P(B).
Exercise 2.10
Suppose that A and B are independent events. Show that A^c and B are
also independent.
The exercise indicates that if A and B are independent events, then all
elements of the σ-fields they generate are mutually independent, since these σ-fields are simply the collections F_A = {∅, A, A^c, Ω} and F_B = {∅, B, B^c, Ω}
respectively. This leads us to a natural extension of the definition: two σ-fields
F_1 and F_2 are independent if for any choice of sets A_1 ∈ F_1 and A_2 ∈ F_2 we
have P(A_1 ∩ A_2) = P(A_1)P(A_2).
However, the extension of these definitions to three or more events (or
several σ-fields) needs a little care, as the following simple examples show:
Example 2.39
Let Ω = [0, 1] and A = [0, 1/2] as before; then A is independent of B = [1/4, 3/4] and of
C = [1/4, 1/2] ∪ [3/4, 1]. In addition, B and C are independent. However,

    P(A ∩ B ∩ C) ≠ P(A)P(B)P(C).

Thus, given three events, the pairwise independence of each of the three possible
pairs does not suffice for the extension of 'independence' to all three events.
On the other hand, with A = [0, 1/4] and B = C = [0, 1/16] ∪ [1/4, 11/16] (or alternatively
with C = [0, 1/16] ∪ [9/16, 1]),

    P(A ∩ B ∩ C) = P(A)P(B)P(C)    (2.21)
but none of the pairs make independent events.
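Both claims can be checked mechanically with Lebesgue measure on [0, 1], representing each event as a finite union of intervals; the endpoints below follow our reading of the example:

```python
from itertools import combinations

def measure(ev):
    """Lebesgue measure of a finite union of disjoint intervals."""
    return sum(b - a for a, b in ev)

def inter(ev1, ev2):
    """Intersection of two unions of intervals."""
    out = []
    for a1, b1 in ev1:
        for a2, b2 in ev2:
            lo, hi = max(a1, a2), min(b1, b2)
            if lo < hi:
                out.append((lo, hi))
    return out

# First triple: pairwise independent, but not jointly.
A = [(0, 0.5)]; B = [(0.25, 0.75)]; C = [(0.25, 0.5), (0.75, 1)]
for E, F in combinations([A, B, C], 2):
    assert abs(measure(inter(E, F)) - measure(E) * measure(F)) < 1e-12
triple = measure(inter(inter(A, B), C))
print(triple, measure(A) * measure(B) * measure(C))  # 0.25 vs 0.125

# Second triple: the product rule (2.21) holds, but no pair is independent.
A = [(0, 0.25)]; B = [(0, 1/16), (0.25, 11/16)]; C = B
assert abs(measure(inter(inter(A, B), C))
           - measure(A) * measure(B) * measure(C)) < 1e-12
```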
This confirms further that we need to demand rather more if we wish to
extend the above definition - pairwise independence is not enough, nor is (2.21);
therefore we need to require both conditions to be satisfied together. Extending
this to n events leads to:
Definition 2.40
The events A_1, ..., A_n are independent if for all k ≤ n and for each choice of k
distinct events among them, the probability of their intersection is the product of their probabilities.
Again there is a powerful counterpart for σ-fields (which can be extended
to sequences, and even arbitrary families):
Definition 2.41
The σ-fields F_1, F_2, ..., F_n defined on a given probability space (Ω, F, P) are
independent if, for all choices of distinct indices i_1, i_2, ..., i_k from {1, 2, ..., n}
and all choices of sets F_{i_j} ∈ F_{i_j}, we have

    P(F_{i_1} ∩ F_{i_2} ∩ ... ∩ F_{i_k}) = P(F_{i_1}) · P(F_{i_2}) · ... · P(F_{i_k}).
The issue of independence will be revisited in the subsequent chapters where
we develop some more tools to calculate probabilities.
2.6.3 Applications to mathematical finance
As indicated in the Preface, we will explore briefly how the ideas developed
in each chapter can be applied in the rapidly growing field of mathematical
finance. This is not intended as an introduction to this subject, but hopefully it will demonstrate how a consistent mathematical formulation can help
to clarify ideas central to many disciplines. Readers who are unfamiliar with
mathematical finance should consult texts such as [4], [5], [7] for definitions and
a discussion of the main ideas of the subject.
Probabilistic modelling in finance centres on the analysis of models for the
evolution of the value of traded assets, such as stocks or bonds, and seeks
to identify trends in their future behaviour. Much of the modern theory is
concerned with evaluating derivative securities such as options, whose value is
determined by the (random) future values of some underlying security, such as
a stock.
We illustrate the above probability ideas on a classical model of stock prices,
namely the binomial tree. This model is based on finitely many time instants
at which the prices may change, and the changes are of a very simple nature.
Suppose that the number of steps is N, and denote the price at the k-th step
by S(k), 0 ≤ k ≤ N. At each step the stock price changes in the following way:
the price at a given step is the price at the previous step multiplied by U with
probability p or by D with probability q = 1 − p, where 0 < D < U. Therefore
the final price depends on the sequence ω = (ω_1, ω_2, ..., ω_N), where ω_i = 1
indicates the application of the factor U and ω_i = 0 indicates the application
of the factor D. Such a sequence is called a path, and we take Ω to consist of
all possible paths. In other words,
    S(k) = S(0) ∏_{i=1}^{k} η(i),

where

    η(i) = { U with probability p,
           { D with probability q.
Exercise 2.11
Suppose N = 5, U = 1.2, D = 0.9, and S(O) = 500. Find the number
of all paths. How many paths lead to the price S(5) = 524.88? What is
the probability that S(5) > 900 if the probability of going up in a single
step is 0.5?
In general, the total number of paths is clearly 2^N, and at step k there are
k + 1 possible prices.
We construct a probability space by equipping Ω with the σ-field 2^Ω of all
subsets of Ω, and the probability defined on single-element sets by P({ω}) =
p^k q^{N−k}, where k = Σ_{i=1}^{N} ω_i.
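The path-space construction lends itself to direct enumeration; a sketch using the data of Exercise 2.11 (the helper names are ours):

```python
from itertools import product

# Data from Exercise 2.11 (N steps, up/down factors, initial price):
N, U, D, S0, p = 5, 1.2, 0.9, 500.0, 0.5

paths = list(product([0, 1], repeat=N))        # all 2^N = 32 paths

def final_price(omega):
    k = sum(omega)                              # number of 'up' moves
    return S0 * U ** k * D ** (N - k)

def prob(omega):
    """P({omega}) = p^k q^(N-k) with k = number of 1s in the path."""
    k = sum(omega)
    return p ** k * (1 - p) ** (N - k)

hits = [w for w in paths if abs(final_price(w) - 524.88) < 1e-6]
print(len(hits))                                # 10 paths (k = 2 ups)
p_above = sum(prob(w) for w in paths if final_price(w) > 900)
print(p_above)                                  # 0.1875 = 6/32
```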
As time progresses we gather information about stock prices, or, what
amounts to the same, about paths. This means that having observed some
prices, the range of possible future developments is restricted. Our information
increases with time, and this idea can be captured by the following family of
σ-fields.
Fix m ≤ N and define the σ-field F_m = {A : ω ∈ A, ω'_1 = ω_1, ω'_2 = ω_2, ..., ω'_m = ω_m ⟹ ω' ∈ A}. So membership of a set A in this σ-field is determined by the
first m coordinates alone: A is a union of 'atoms', each consisting of all paths
with a fixed initial segment of length m and arbitrary remaining coordinates.
Note that

    F_0 = {Ω, ∅},
    F_1 = {A_1, A_1^c, Ω, ∅}, where A_1 = {ω : ω_1 = 1}, i.e. S(1) = S(0)U,
    and A_1^c = {ω : ω_1 = 0}, i.e. S(1) = S(0)D.
Exercise 2.12
Prove that F_m has 2^(2^m) elements.
Exercise 2.13
Prove that the sequence (F_m) is increasing.
This sequence is an example of a filtration (the identifying features are that
the σ-fields should be contained in F and form an increasing chain), a concept
which we shall revisit later on.
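Exercises 2.12 and 2.13 can be checked by brute force for a small tree; a sketch with N = 3 (the representation of F_m via atoms is our own):

```python
from itertools import product, combinations

N = 3
paths = list(product([0, 1], repeat=N))

def sigma_field(m):
    """All sets A of paths whose membership depends only on the
    first m coordinates: unions of the 2^m 'atoms' (a sketch)."""
    atoms = {}
    for w in paths:
        atoms.setdefault(w[:m], []).append(w)
    atoms = list(atoms.values())
    field = []
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            field.append(frozenset(w for a in combo for w in a))
    return set(field)

for m in range(N + 1):
    assert len(sigma_field(m)) == 2 ** (2 ** m)   # Exercise 2.12
# The sequence F_m is increasing (Exercise 2.13):
assert sigma_field(1) <= sigma_field(2) <= sigma_field(3)
```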
The consecutive choices of stock prices are closely related to coin-tossing.
Intuition tells us that the latter are independent. This can be formally seen by
introducing another σ-field describing the fact that at a particular step we have
a particular outcome. Suppose ω is such that ω_k = 1. Then we can identify the
set of all paths with this property, A_k = {ω : ω_k = 1}, and extend it to a σ-field:
Q_k = {A_k, A_k^c, Ω, ∅}, where A_k^c = {ω : ω_k = 0}.
Exercise 2.14
Prove that Q_m and Q_k are independent if m ≠ k.
2.7 Proofs of propositions
Proof (of Proposition 2.5)
If the intervals I_n cover B, then they also cover A: A ⊂ B ⊂ ⋃_n I_n, hence
Z_B ⊂ Z_A. The infimum of a larger set cannot be greater than the infimum of a
smaller set (trivial illustration: inf{0, 1, 2} < inf{1, 2}, inf{0, 1, 2} = inf{0, 2}),
hence the result. □
Proof (of Proposition 2.8)
If a system I_n of intervals covers A, then the intervals I_n + t cover A + t.
Conversely, if the I_n cover A + t, then the I_n − t cover A. Moreover, the total length of
a family of intervals does not change when we shift each by a number. So we
have a one-one correspondence between the interval coverings of A and of A + t,
and this correspondence preserves the total length of the covering. This implies
that the sets Z_A and Z_{A+t} are the same, so their infima are equal. □
Proof (of Proposition 2.13)
By de Morgan's law,

    ⋂_{k=1}^∞ E_k = ( ⋃_{k=1}^∞ E_k^c )^c.

By Theorem 2.11 (ii) all the E_k^c are in M, hence by (iii) the same can be said about
the union ⋃_{k=1}^∞ E_k^c. Finally, by (ii) again, the complement of this union is in
M, and so the intersection ⋂_{k=1}^∞ E_k is in M. □
Proof (of Proposition 2.15)
(i) Proposition 2.5 tells us that the outer measure is monotone, but since m is
just the restriction of m* to M, the same is true for m: A ⊂ B implies
m(A) = m*(A) ≤ m*(B) = m(B).
(ii) We write B as a disjoint union B = A ∪ (B \ A), and then by additivity of
m we have m(B) = m(A) + m(B \ A). Subtracting m(A) (here it is important
that m(A) is finite) we get the result.
(iii) Translation invariance of m follows at once from translation invariance of
the outer measure, in the same way as in (i) above. □
Proof (of Proposition 2.16)
The set A△B is null, hence so are its subsets A \ B and B \ A. Thus these
sets are measurable, and so is A ∩ B = A \ (A \ B), and therefore also B =
(A ∩ B) ∪ (B \ A) ∈ M. Now m(B) = m(A ∩ B) + m(B \ A), as the sets on
the right are disjoint. But m(B \ A) = 0 = m(A \ B), so m(B) = m(A ∩ B) =
m(A ∩ B) + m(A \ B) = m((A ∩ B) ∪ (A \ B)) = m(A). □
Proof (of Proposition 2.27)
The family 𝒢 = {G ∪ N : G ∈ F, N ⊂ F ∈ F with μ(F) = 0} contains the set X
since X ∈ F. If G_i ∪ N_i ∈ 𝒢 with N_i ⊂ F_i, μ(F_i) = 0, then ⋃(G_i ∪ N_i) = ⋃G_i ∪ ⋃N_i
is in 𝒢, since the first set on the right is in F and the second is a subset of the null
set ⋃F_i ∈ F. If G ∪ N ∈ 𝒢, N ⊂ F, then (G ∪ N)^c = (G ∪ F)^c ∪ ((F \ N) ∩ G^c),
which is also in 𝒢. Thus 𝒢 is a σ-field. Consider any other σ-field H containing
F and all subsets of null sets. Since H is closed with respect to unions, it
contains 𝒢, and so 𝒢 is the smallest σ-field with this property. □
Proof (of Proposition 2.31)
It follows at once from the definitions and the hint that M_B is a σ-field. To see
that m_B is a measure we check countable additivity: with C_i = A_i ∩ B pairwise
disjoint in M_B, we have

    m_B(⋃_i C_i) = m(⋃_i C_i) = Σ_i m(C_i) = Σ_i m_B(C_i).

Therefore (B, M_B, m_B) is a measure space. It is complete, since subsets of null
sets contained in B are by definition m_B-measurable. □
Proof (of Proposition 2.37)
Assume that the A_n are measurable and pairwise disjoint. By the definition of
conditional probability,

    P(⋃_{n=1}^∞ A_n | B) = P((⋃_{n=1}^∞ A_n) ∩ B) / P(B)
                         = P(⋃_{n=1}^∞ (A_n ∩ B)) / P(B)
                         = (1/P(B)) Σ_{n=1}^∞ P(A_n ∩ B)
                         = Σ_{n=1}^∞ P(A_n | B),

since the sets A_n ∩ B are also pairwise disjoint and P is countably additive. □
3. Measurable functions
3.1 The extended real line
The length of ℝ is unbounded above, i.e. 'infinite'. To deal with this we defined
Lebesgue measure for sets of infinite as well as finite measure. In order to handle
functions between such sets comprehensively, it is convenient to allow functions
which take infinite values: we take their range to be (part of) the 'extended
real line' ℝ̄ = [−∞, ∞], obtained by adding the 'points at infinity' −∞ and
+∞ to ℝ. Arithmetic in this set needs a little care, as already observed in
Section 2.2: we assume that a + ∞ = ∞ for all real a, a × ∞ = ∞ for a > 0,
a × ∞ = −∞ for a < 0, ∞ × ∞ = ∞ and 0 × ∞ = 0, with similar definitions
for −∞. These are all 'obvious' intuitively (except possibly 0 × ∞), and (as
for measures) we avoid ever forming 'sums' of the form ∞ + (−∞). With these
assumptions 'arithmetic works as before'.
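These conventions differ from IEEE floating-point arithmetic, where inf × 0 is nan; a small sketch implementing the convention adopted here (the function name is ours):

```python
import math

def ext_mul(a, b):
    """Multiplication on the extended real line with the
    measure-theoretic convention 0 * inf = 0 (unlike IEEE floats,
    where inf * 0 is nan). A sketch."""
    if a == 0 or b == 0:
        return 0.0
    return a * b

print(math.inf * 0)          # nan under IEEE arithmetic
print(ext_mul(math.inf, 0))  # 0.0 under the convention adopted here
```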
3.2 Lebesgue-measurable functions
The domain of the functions we shall be considering is usually ℝ. Now we
have the freedom of defining f only 'up to null sets': once we have shown two
functions f and g to be equal on ℝ \ E, where E is some null set, then f = g
for all practical purposes. To formalise this, we say that f has a property (P)
almost everywhere (a.e.) if f has this property at all points of its domain, except
possibly on some null set.
For example, the function

    f(x) = { 0  for x ≠ 0,
           { 1  for x = 0,

is almost everywhere continuous, since it is continuous on ℝ \ {0} and the
exceptional set {0} is null. (Note that probabilists tend to say 'almost surely'
(a.s.) instead of 'almost everywhere' (a.e.), and we shall follow their lead in the
sections devoted to probability.)
The next definition will introduce the class of Lebesgue-measurable functions. The condition imposed on f : ℝ → ℝ will be necessary (though not
sufficient) to give meaning to the (Lebesgue) integral ∫ f dm. Let us first give
some motivation.
Integration is always concerned with the process of approximation. In the
Riemann integral we split the interval I = [a, b], over which we integrate, into
small pieces I_n, again intervals. The simplest method of doing this is to divide
the interval into N equal parts. Then we construct approximating sums by
multiplying the lengths of the small intervals by certain numbers c_n (related to
the values of the function in question; for example c_n = inf_{I_n} f, c_n = sup_{I_n} f,
or c_n = f(x) for some x ∈ I_n):

    Σ_{n=1}^{N} c_n · l(I_n).

For large N this sum is close to the Riemann integral ∫_a^b f(x) dx (given some
regularity of f).
The approach to the Lebesgue integral is similar but there is a crucial difference. Instead of splitting the integration domain into small parts, we decompose
the range of the function (see Figure 3.1).
Figure 3.1 Riemann vs. Lebesgue
Again, a simple way is to introduce short intervals I_n of equal length. To
build the approximating sums we first take the inverse images of the I_n under f, i.e.
f⁻¹(I_n). These may be complicated sets, not necessarily intervals. Here the
theory of measure developed previously comes into its own. We are able to
measure sets provided they are measurable, i.e. they are in M. Given that, we
compute

    Σ_{n=1}^{N} c_n · m(f⁻¹(I_n)),

where c_n ∈ I_n or c_n = inf I_n, for example.
The following definition guarantees that this procedure makes sense (though
some extra care may be needed to arrive at a finite number as N → ∞).
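The range-decomposition sum can be sketched numerically; here for f(x) = x² on [0, 1], with the preimage measures m(f⁻¹(I_n)) estimated on a grid (the function, N and grid size are our choices):

```python
# Approximate the Lebesgue sum  sum_n c_n * m(f^{-1}(I_n))  for
# f(x) = x^2 on E = [0, 1], slicing the RANGE [0, 1] into N intervals.
# m(f^{-1}(I_n)) is estimated by counting grid points (a sketch).
N, grid = 20, 100_000
f = lambda x: x * x
xs = [i / grid for i in range(grid)]

total = 0.0
for n in range(N):
    lo, hi = n / N, (n + 1) / N          # the range interval I_n
    preimage_measure = sum(1 for x in xs if lo <= f(x) < hi) / grid
    total += lo * preimage_measure       # c_n = inf I_n
print(total)  # close to (a little below) the integral of x^2, i.e. 1/3
```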
Definition 3.1
Suppose that E is a measurable set. We say that a function f : E → ℝ is
(Lebesgue-)measurable if for any interval I ⊂ ℝ,

    f⁻¹(I) = {x ∈ E : f(x) ∈ I} ∈ M.
In what follows, the term measurable (without qualification) will refer to
Lebesgue-measurable functions.
If all the sets f⁻¹(I) ∈ B, i.e. if they are Borel sets, we call f Borel-measurable, or simply a Borel function.
The underlying philosophy is one which is common to various mathematical notions: the inverse image of a 'nice' set is 'nice'. Recall continuous
functions, for example, where the inverse image of any open set is required
to be open. The actual meaning of the word 'nice' depends on the particular
branch of mathematics. In the above definition, note that since B ⊂ M, every
Borel function is (Lebesgue-)measurable.
Remark 3.2
The terminology is somewhat unfortunate. 'Measurable' objects should be measured (as with measurable sets). However, measurable functions will be integrated. This confusion stems from the fact that the word integrable, which
would probably fit best here, carries a more restricted meaning, as we shall see
later. This terminology is widely accepted and we are not going to try to fight
the whole world here.
We give some equivalent formulations:
Theorem 3.3
The following conditions are equivalent:
(i) f is measurable,
(ii) for all a, f⁻¹((a, ∞)) is measurable,
(iii) for all a, f⁻¹([a, ∞)) is measurable,
(iv) for all a, f⁻¹((−∞, a)) is measurable,
(v) for all a, f⁻¹((−∞, a]) is measurable.
Proof
Of course (i) implies any of the other conditions. We show that (ii) implies (i).
The proofs of the other implications are similar, and are left as exercises (which
you should attempt).
We have to show that for any interval I, f⁻¹(I) ∈ M. By (ii) we have this
for the particular case I = (a, ∞). Suppose I = (−∞, a]. Then

    f⁻¹((−∞, a]) = f⁻¹(ℝ \ (a, ∞)) = E \ f⁻¹((a, ∞)) ∈ M    (3.1)

since both E and f⁻¹((a, ∞)) are in M (we use the closure properties of M
established before). Next,

    f⁻¹((−∞, b)) = f⁻¹( ⋃_{n=1}^∞ (−∞, b − 1/n] ).

By (3.1), f⁻¹((−∞, b − 1/n]) ∈ M, and the same is true for the countable union.
From this we can easily deduce that f⁻¹([a, ∞)) ∈ M as well.
Now let I = (a, b); then

    f⁻¹((a, b)) = f⁻¹((−∞, b) ∩ (a, ∞)) = f⁻¹((−∞, b)) ∩ f⁻¹((a, ∞))

is in M as the intersection of two elements of M. By the same reasoning M
contains

    f⁻¹([a, b]) = f⁻¹((−∞, b] ∩ [a, ∞)) = f⁻¹((−∞, b]) ∩ f⁻¹([a, ∞)),

and half-open intervals are handled similarly. □
3.3 Examples
The following simple results show that most of the functions encountered 'in
practice' are measurable.
1. Constant functions are measurable. Let f(x) ≡ c. Then

    f⁻¹((a, ∞)) = { ℝ  if a < c,
                  { ∅  otherwise,

and in both cases we have measurable sets.
2. Continuous functions are measurable. For we note that (a, ∞) is an open
set and so is f⁻¹((a, ∞)). As we know, all open sets are measurable.
3. Define the indicator function of a set A by

    1_A(x) = { 1  if x ∈ A,
             { 0  otherwise.

Then

    A ∈ M  ⟺  1_A is measurable,

since

    1_A⁻¹((a, ∞)) = { ℝ  if a < 0,
                    { A  if 0 ≤ a < 1,
                    { ∅  if a ≥ 1.
Exercise 3.1
Prove that every monotone function is measurable.
Exercise 3.2
Prove that if f is a measurable function, then the level set {x : f(x) = a}
is measurable for every a ∈ ℝ.
Remark 3.4
In the Appendix, assuming the validity of the axiom of choice, we show that
there are subsets of ℝ which fail to be Lebesgue-measurable, and that there
are Lebesgue-measurable sets which are not Borel sets. Thus, if P(ℝ) denotes
the σ-field of all subsets of ℝ, the following inclusions are strict:

    B ⊂ M ⊂ P(ℝ).
These (rather esoteric) facts can be used, by considering the indicator functions of these sets, to construct examples of non-measurable functions and of
measurable functions which are not Borel functions. While it is important to be
aware of these distinctions in order to understand why these different concepts
are introduced at all, such examples will not feature in the applications of the
theory which we have in mind.
3.4 Properties
The class of measurable functions is very rich, as the following results show.
Theorem 3.5
The set of real-valued measurable functions defined on E ∈ M is a vector space,
closed under multiplication; i.e. if f and g are measurable functions, then
f + g and fg are also measurable (in particular, if g is a constant function
g ≡ c, then cf is measurable for all real c).
Proof
Fix measurable functions f, g : E → ℝ. First consider f + g. Our goal is to
show that for each a ∈ ℝ,

    B = (f + g)⁻¹((−∞, a)) = {t : f(t) + g(t) < a} ∈ M.

Suppose that all the rationals are arranged in a sequence {q_n}. Now

    B = ⋃_{n=1}^∞ {t : f(t) < q_n, g(t) < a − q_n};

we decompose the half-plane below the line x + y = a into a countable union
of unbounded 'boxes' {(x, y) : x < q_n, y < a − q_n} (see Figure 3.2). Clearly

    {t : f(t) < q_n, g(t) < a − q_n} = {t : f(t) < q_n} ∩ {t : g(t) < a − q_n}

is measurable as an intersection of measurable sets. Hence B ∈ M as a countable union of elements of M.
To deal with fg we adopt a slightly indirect approach in order to remain
'one-dimensional': first note that if g is measurable, then so is −g. Hence f − g =
f + (−g) is measurable. Since fg = ¼{(f + g)² − (f − g)²}, it will suffice to prove
that the square of a measurable function is measurable. So take a measurable
Figure 3.2 Boxes
h : E → ℝ and consider {x ∈ E : h²(x) > a}. For a < 0 this set is E ∈ M, and
for a ≥ 0,

    {x : h²(x) > a} = {x : h(x) > √a} ∪ {x : h(x) < −√a}.

Both sets on the right are measurable, hence we have shown that h² is measurable. Apply this with h = f + g and h = f − g respectively, to conclude that fg
is measurable. It follows that cf is measurable for constant c, hence the class
of real-valued measurable functions forms a vector space under addition. □
Remark 3.6
An elegant proof of the theorem is based on the following lemma, which will also
be useful later. Its proof makes use of the simple topological fact that every
open set in ℝ² decomposes into a countable union of rectangles, in precise
analogy with open sets in ℝ and intervals.
Lemma 3.7
Suppose that F : ℝ × ℝ → ℝ is a continuous function. If f and g are measurable,
then h(x) = F(f(x), g(x)) is also measurable.

It now suffices to take F(u, v) = u + v and F(u, v) = uv to obtain a second
proof of Theorem 3.5.
Proof of the lemma
For any real a,

    {x : h(x) > a} = {x : (f(x), g(x)) ∈ G_a},

where G_a = {(u, v) : F(u, v) > a} = F⁻¹((a, ∞)). Suppose for the moment
that we have been lucky and G_a is a rectangle: G_a = (a₁, b₁) × (c₁, d₁).

Figure 3.3 The sets G_a

It is clear from Figure 3.3 that

    {x : h(x) > a} = {x : f(x) ∈ (a₁, b₁) and g(x) ∈ (c₁, d₁)}
                   = {x : f(x) ∈ (a₁, b₁)} ∩ {x : g(x) ∈ (c₁, d₁)}.

In general, we have to decompose the set G_a into a union of rectangles. The set
G_a is an open subset of ℝ × ℝ since F is continuous. Hence it can be written
as

    G_a = ⋃_{n=1}^∞ R_n,

where the R_n are open rectangles R_n = (a_n, b_n) × (c_n, d_n). So

    {x : h(x) > a} = ⋃_{n=1}^∞ ( {x : f(x) ∈ (a_n, b_n)} ∩ {x : g(x) ∈ (c_n, d_n)} )

is measurable due to the stability properties of M. □
A simple application of Theorem 3.5 is to consider the product f·1_A. If f
is a measurable function and A is a measurable set, then f·1_A is measurable.
This function is simply f on A and 0 outside A. Applying this to the set
A = {x ∈ E : f(x) > 0}, we see that the positive part f⁺ of a measurable
function is measurable: we have

    f⁺(x) = { f(x)  if f(x) > 0,
            { 0     if f(x) ≤ 0.

Similarly the negative part f⁻ of f is measurable, since

    f⁻(x) = { 0      if f(x) > 0,
            { −f(x)  if f(x) ≤ 0.
Proposition 3.8
Let E be a measurable subset of ℝ.
(i) f : E → ℝ is measurable if and only if both f⁺ and f⁻ are measurable.
(ii) If f is measurable, then so is |f|; but the converse is false.

Hint Part (ii) requires the existence of non-measurable sets (as proved in the
Appendix), not their particular form.
Exercise 3.3
Show that if f is measurable, then the truncation of f,

    f^a(x) = { a     if f(x) > a,
             { f(x)  if f(x) ≤ a,

is also measurable.
Exercise 3.4
Find a non-measurable f such that f² is measurable.
Passage to the limit does not destroy measurability - all the work needed
was done when we established the stability properties of M!
Theorem 3.9
If {f_n} is a sequence of measurable functions defined on the set E in ℝ, then
the following are also measurable functions:

    max_{n≤k} f_n,  min_{n≤k} f_n,  sup_{n≥k} f_n,  inf_{n≥k} f_n,  lim sup_{n→∞} f_n,  lim inf_{n→∞} f_n.
Proof
It is sufficient to note that the following are measurable sets:

    {x : (max_{n≤k} f_n)(x) > a} = ⋃_{n=1}^{k} {x : f_n(x) > a},
    {x : (min_{n≤k} f_n)(x) > a} = ⋂_{n=1}^{k} {x : f_n(x) > a},
    {x : (sup_{n≥k} f_n)(x) > a} = ⋃_{n=k}^{∞} {x : f_n(x) > a},
    {x : (inf_{n≥k} f_n)(x) ≥ a} = ⋂_{n=k}^{∞} {x : f_n(x) ≥ a}.

For the upper limit, by definition

    lim sup_{n→∞} f_n = inf_{n≥1} { sup_{m≥n} f_m },

and the above relations show that h_n = sup_{m≥n} f_m is measurable, hence
inf_{n≥1} h_n is measurable. The lower limit is handled similarly. □
Corollary 3.10
If a sequence f_n of measurable functions converges (pointwise), then the limit
is a measurable function.
Proof
This is immediate since lim_{n→∞} f_n = lim sup_{n→∞} f_n, which is measurable. □
Remark 3.11
Note that Theorems 3.5 and 3.9 have counterparts for Borel functions, i.e. they
remain valid upon replacing 'measurable' by 'Borel' throughout.
Things are slightly more complicated when we consider the role of null
sets. On the one hand, changing a function on a null set cannot destroy its
measurability, i.e. any measurable function which is altered on a null set remains
measurable. However, as not all null sets are Borel sets, we cannot draw the
same conclusion for Borel functions, and thus the following results have no natural 'Borel'
counterparts.
Theorem 3.12
If f : E → ℝ is measurable, E ∈ M, and g : E → ℝ is such that the set {x : f(x) ≠
g(x)} is null, then g is measurable.
Proof
Consider the difference d(x) = g(x) − f(x). It is zero except on a null set, so

    {x : d(x) > a} = { a null set  if a ≥ 0,
                     { a full set  if a < 0,

where a full set is the complement of a null set. Both null and full sets are
measurable, hence d is a measurable function. Thus g = f + d is measurable. □
Corollary 3.13
If (f_n) is a sequence of measurable functions and f_n(x) → f(x) almost everywhere for x in E, then f is measurable.
Proof
Let A be the null set such that f_n(x) converges for all x ∈ E \ A. Then 1_{A^c} f_n
converges everywhere to g = 1_{A^c} f, which is therefore measurable. But f = g
almost everywhere, so f is also measurable. □
Exercise 3.5
Let f_n be a sequence of measurable functions. Show that the set E =
{x : f_n(x) converges} is measurable.
Since we are able to adjust a function f at will on a null set without altering its measurability properties, the following definition is a useful means of
concentrating on the values of f that 'really matter' for integration theory, by
identifying its bounds 'outside null sets':
Definition 3.14
Suppose f : E → ℝ̄ is measurable. The essential supremum ess sup f is defined
as inf{z : f ≤ z a.e.}, and the essential infimum ess inf f is sup{z : f ≥ z a.e.}.
Note that ess sup f can be +∞. If ess sup f = −∞, then f = −∞ a.e., since
by definition of ess sup, f ≤ −n a.e. for every n ≥ 1. Now if ess sup f is finite
and A = {x : f(x) > ess sup f}, define A_n for n ≥ 1 by

    A_n = {x : f(x) > ess sup f + 1/n}.

These are null sets, hence so is A = ⋃_n A_n, and thus we have verified:

    f ≤ ess sup f  a.e.
The following is now straightforward to prove.
Proposition 3.15
If f, g are measurable functions, then

    ess sup(f + g) ≤ ess sup f + ess sup g.
Exercise 3.6
Show that for measurable f, ess sup f ≤ sup f. Show that these quantities coincide when f is continuous.
3.5 Probability
3.5.1 Random variables
In the special case of probability spaces we use the phrase random variable to
mean a measurable function. That is, if (Ω, F, P) is a probability space, then
X : Ω → ℝ is a random variable if for all a ∈ ℝ the set X⁻¹([a, ∞)) is in F:

    {ω ∈ Ω : X(ω) ≥ a} ∈ F.

In the case where Ω ⊂ ℝ is a measurable set and F = B is the σ-field of Borel
subsets of Ω, random variables are just Borel functions Ω → ℝ.
In applied probability, the set Ω represents the outcomes of a random experiment that can be observed by means of various measurements. These measurements assign numbers to outcomes, and thus we arrive at the notion of random
variable in a natural way. The condition imposed guarantees that questions of
the following sort make sense: What is the probability that the value of the
random variable lies within given limits?
3.5.2 σ-fields generated by random variables
As indicated before, the random variables we encounter will in fact be Borel-measurable functions. The values of the random variable X will not lead us
to non-Borel sets; in fact, they are likely to lead us to discuss much coarser
distinctions between sets than are already available within the complexity of
the Borel σ-field B. We should therefore be ready to consider different σ-fields
contained within F. To be precise, the family of sets

    X⁻¹(B) = {S ⊂ Ω : S = X⁻¹(B) for some B ∈ B}

is a σ-field. If X is a random variable, X⁻¹(B) ⊂ F, but it may be a much
smaller subset, depending on the degree of sophistication of X. We denote this
σ-field by F_X and call it the σ-field generated by X.
The simplest possible case is where X is constant, X ≡ a. Then X⁻¹(B) is
either Ω or ∅, depending on whether a ∈ B or not, and the σ-field generated
is trivial: F_X = {∅, Ω}.
If X takes two values a ≠ b, then F_X contains four elements: F_X =
{∅, Ω, X⁻¹({a}), X⁻¹({b})}. If X takes finitely many values, F_X is finite. If X
takes denumerably many values, F_X is uncountable (it may be identified with
the σ-field of all subsets of a countable set). We can see that the size of F_X
grows together with the level of complication of X.
Exercise 3.7
Show that Fx is the smallest O"-field containing the inverse images
X-I (B) of all Borel sets B.
Exercise 3.8
Is the family of sets {X(A) : A ∈ F} a σ-field?
The notion of F_X has the following interpretation. The values of the measurement X are all we can observe. From these we deduce some information on
the level of complexity of the random experiment, that is, the size of Ω and F_X,
and we can estimate the probabilities of the sets in F_X by statistical methods.
The σ-field generated represents the amount of information produced by the
random variable. For example, suppose that a die is thrown and only 0 and 1
are reported, depending on whether the number shown is odd or even. We will never
distinguish this experiment from coin-tossing. The information provided by the
measurement is insufficient to explore the complexity of the experiment (which
has six possible outcomes, here grouped together into two sets).
3.5.3 Probability distributions
For any random variable X we can introduce a measure on the σ-field of Borel
sets B by setting

    P_X(B) = P(X⁻¹(B)).

We call P_X the probability distribution of the random variable X.
Theorem 3.16
The set function P_X is countably additive.

Proof
Given pairwise disjoint Borel sets B_i, their inverse images X⁻¹(B_i) are pairwise
disjoint and X⁻¹(⋃_i B_i) = ⋃_i X⁻¹(B_i), so

    P_X(⋃_i B_i) = P(X⁻¹(⋃_i B_i)) = P(⋃_i X⁻¹(B_i)) = Σ_i P(X⁻¹(B_i)) = Σ_i P_X(B_i),

as required. □
Thus (ℝ, B, P_X) is a probability space. For this it is sufficient to note that
P_X(ℝ) = P(Ω) = 1.
We consider some simple examples. Suppose that X is constant, i.e. X ≡ a.
Then we call P_X the Dirac measure concentrated at a and denote it by δ_a.
Clearly

    δ_a(B) = { 1  if a ∈ B,
             { 0  if a ∉ B.

In particular, δ_a({a}) = 1.
If X takes two values,

    X(ω) = { a  with probability p,
           { b  with probability 1 − p,

then

    P_X(B) = { 1      if a, b ∈ B,
             { p      if a ∈ B, b ∉ B,
             { 1 − p  if b ∈ B, a ∉ B,
             { 0      otherwise,

and so

    P_X(B) = p δ_a(B) + (1 − p) δ_b(B).
The distribution of a general discrete random variable (i.e. one which takes
only countably many different values, except possibly on some null set) is of the
form: if the values a_i of X are taken with probabilities p_i > 0, i = 1, 2, ...,
with Σ_i p_i = 1, then

    P_X(B) = Σ_{i=1}^∞ p_i δ_{a_i}(B).
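Such a discrete distribution can be sketched as a weighted sum of Dirac measures, with events represented as membership tests (the representation is our own):

```python
def dirac(a):
    """The Dirac measure concentrated at a, as a set function
    acting on events given by membership tests."""
    return lambda B: 1.0 if B(a) else 0.0

def discrete_distribution(values, probs):
    """P_X(B) = sum_i p_i * delta_{a_i}(B) for a discrete X (sketch)."""
    deltas = [dirac(a) for a in values]
    return lambda B: sum(p * d(B) for p, d in zip(probs, deltas))

# X taking values 1, 2, 3 with probabilities 0.2, 0.3, 0.5:
PX = discrete_distribution([1, 2, 3], [0.2, 0.3, 0.5])
print(PX(lambda x: x >= 2))   # P(X >= 2) = 0.8
print(PX(lambda x: True))     # P_X(R) = 1.0
```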
Classical examples are:
(1) the geometric distribution, where p_i = (1 − q)q^i for some q ∈ (0, 1),
(2) the Poisson distribution, where p_i = (λ^i / i!) e^{−λ} for some λ > 0.
We shall not discuss the discrete case further since this is not our primary
goal in this text, and it is covered in many elementary texts on probability
theory (such as [10]).
Now consider the classical probability space with Ω = [0, 1], F = B, P =
m|_{[0,1]}, Lebesgue measure restricted to [0, 1]. We can give examples of random
variables given by explicit formulae.
For instance, let X(ω) = aω + b with a > 0. Then the image of [0, 1] is the interval
[b, a + b] and P_X = (1/a)·m|_{[b,a+b]}, i.e. for Borel B,

    P_X(B) = (1/a) · m(B ∩ [b, a + b]).
Example 3.17
Suppose a car leaves city A at a random time between 12 noon and 1 pm. It travels at 50
mph towards B, which is 25 miles from A. What is the probability distribution
of the distance between the car and B at 1 pm?
Clearly, this distance is 0 with probability 1/2, i.e. if the car departs before
12.30. As a function of the starting time (represented as ω ∈ [0, 1]) the distance
has the form

    X(ω) = { 0         if ω ∈ [0, 1/2],
           { 50ω − 25  if ω ∈ (1/2, 1],

and P_X = ½P₁ + ½P₂, where P₁ = δ₀ and P₂ = (1/25)·m|_{[0,25]}. In this example, therefore,
P_X is a combination of Dirac and Lebesgue measures.
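The mixed (Dirac plus Lebesgue) character of P_X shows up clearly in simulation; a Monte Carlo sketch of the car example:

```python
import random

random.seed(0)
n = 100_000
# X(w) = 0 for w <= 1/2, and 50w - 25 otherwise:
samples = [max(0.0, 50 * random.random() - 25) for _ in range(n)]

atom = sum(1 for x in samples if x == 0.0) / n
print(atom)                          # close to 1/2: the Dirac part at 0
positive = [x for x in samples if x > 0]
print(min(positive), max(positive))  # spread over (0, 25): the Lebesgue part
```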
In later chapters we shall explore more complicated forms of X and the corresponding distributions after developing further machinery needed to handle
the computations.
3.5.4 Independence of random variables
Definition 3.18
X, Y are independent if the σ-fields F_X, F_Y generated by them are independent.
In other words, for any Borel sets B, C in ℝ,

    P(X⁻¹(B) ∩ Y⁻¹(C)) = P(X⁻¹(B)) · P(Y⁻¹(C)).
Example 3.19
Let (Ω, F) = ([0, 1], M) be equipped with Lebesgue measure. Consider X = 1_{[0,1/2]},
Y = 1_{[1/4,3/4]}. Then F_X = {∅, [0, 1], [0, 1/2], (1/2, 1]} and F_Y =
{∅, [0, 1], [1/4, 3/4], [0, 1/4) ∪ (3/4, 1]}, and these are clearly independent.
Example 3.20
Let \Omega be as above and let X(\omega) = \omega, Y(\omega) = 1 - \omega. Then F_X = F_Y = M. A
\sigma-field cannot be independent of itself (unless it is trivial): take A \in M; then
independence requires P(A \cap A) = P(A) \times P(A) (the set A belongs to
'both' \sigma-fields), i.e. P(A) = P(A)^2, which can happen only if either P(A) = 0
or P(A) = 1. So a \sigma-field independent of itself consists of sets of measure
zero or one.
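For the independent pair in Example 3.19, independence of the generated \sigma-fields reduces to the product rule for the generating events. A minimal numerical check (a sketch, using interval lengths for the Lebesgue measure):

```python
# A = {X = 1} = [0, 1/2] and B = {Y = 1} = [1/4, 3/4] as (left, right) pairs.

def overlap(I, J):
    """Length of the intersection of two intervals."""
    return max(0.0, min(I[1], J[1]) - max(I[0], J[0]))

A = (0.0, 0.5)
B = (0.25, 0.75)

pA, pB = A[1] - A[0], B[1] - B[0]
p_AB = overlap(A, B)   # overlap is [1/4, 1/2], of length 1/4 = pA * pB
```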
3.5.5 Applications to mathematical finance
Consider a model of stock prices, discrete in time, i.e. assume that the stock
prices are given by a sequence S(n) of random variables, n = 1,2, ... , N. If
the length of one step is h, then we have the time horizon T = N h and we
shall often write S(T) instead of S(N). An example of such a model is the
binomial tree considered in the previous chapter. Recall that a European call
option is the random variable of the form (S(N) -K)+ (N is the exercise time,
K is the strike price, S is the underlying asset). A natural generalisation of
this is a random variable of the form f(S(N)) for some measurable function
f : \mathbb{R} \to \mathbb{R}. This random variable is of course measurable with respect to the
\sigma-field generated by S(N). This allows us to formulate a general definition:
Definition 3.21
A European derivative security (contingent claim) with the underlying asset
represented by a sequence S(n) and exercise time N is a random variable X
measurable with respect to the \sigma-field F generated by S(N).

Proposition 3.22
A European derivative security X must be of the form X = f(S(N)) for some
measurable real function f.
The above definition is not sufficient for applications. For example, it does
not cover one of the basic derivative instruments, namely futures. Recall that
a holder of the futures contract has the right to receive (or an obligation to
pay in case of negative values) a certain sequence (X(l), ... , X(N)) of cash
payments, depending on the values of the underlying security. To be specific,
if for example the length of one step is one year and r is the risk-free interest
rate for annual compounding, then
    X(n) = S(n)(1 + r)^{N-n} - S(n-1)(1 + r)^{N-n+1}.

In order to introduce a general notion of derivative security which would cover
futures, we first consider a natural generalisation,

    X(n) = f_n(S(0), S(1), ..., S(n)),

and then we push the level of generality even further:
Definition 3.23
A derivative security (contingent claim) with the underlying asset represented
by a sequence (S(n)) and the expiry time N is a sequence (X(1), ..., X(N))
of random variables such that X(n) is measurable with respect to the \sigma-field
F_n generated by (S(0), S(1), ..., S(n)), for each n = 1, ..., N.

Proposition 3.24
A derivative security X must be of the form X = f(S(0), S(1), ..., S(N)) for
some measurable f : \mathbb{R}^{N+1} \to \mathbb{R}.
We could make one more step and dispose of the underlying random variables.
The role of the underlying object would be played by an increasing sequence of
\sigma-fields F_n, and we would say that a contingent claim (avoiding here the other
term) is a sequence of random variables X(n) such that X(n) is F_n-measurable,
but there is little need for such generality in practical applications. The only
case where that formulation would be relevant is the situation where there are
no numerical observations but only some flow of information modelled by events
and \sigma-fields.
Example 3.25
Payoffs of exotic options depend on the whole path of consecutive stock prices.
For example, the payoff of a European lookback option with exercise time N
is determined by

    f(x_0, x_1, ..., x_N) = max{x_0, x_1, ..., x_N} - x_N.
Exercise 3.9
Find the function f for a down-and-out call (which is a European call
except that it ceases to exist if the stock price at any time before the
exercise date goes below the barrier L < S(0)).
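The lookback payoff of Example 3.25 is a function of the whole price path, and it translates directly into code (a minimal sketch; the price paths below are illustrative numbers, not from the text):

```python
def lookback_payoff(path):
    """European lookback payoff from Example 3.25: the maximum of the
    observed prices minus the final price."""
    return max(path) - path[-1]

payoff = lookback_payoff([100.0, 120.0, 90.0, 110.0])  # max is 120, final price 110
```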
Example 3.26
Consider an American put option in a binomial model. We shall see that it fits
the above abstract scheme. Recall that American options can be exercised at
any time before expiry, and the payoff of a put exercised at time n is (K - S(n))^+,
written g(S(n)) for brevity, with g(x) = (K - x)^+. This option offers the holder a
cash flow of the same nature as the stock. The latter is determined by the stock
price, and stock can be sold at any time, though of course only once. The American
option can likewise be sold or exercised only once. The value of this option will
be denoted by P^A(n), and we shall show that it is a derivative security in the
sense of Definition 3.23.
We shall demonstrate that it is possible to write

    P^A(n) = f_n(S(n))

for some functions f_n. Consider an option expiring at N = 2. Clearly

    f_2(x) = g(x).
At time n = 1 the holder of the option can exercise, or wait till n = 2. The value
of waiting is the same as the value of a European put issued at n = 1 with exercise
time N = 2, which (as is well known, and will be seen in Section 7.4.4 in some detail)
can be computed as the expectation with respect to some probability p of the
discounted payoff. The value of the American put is the greater of the two, so

    f_1(x) = max{g(x), (1/(1+r)) [p f_2(xU) + (1-p) f_2(xD)]}.

The same argument gives

    f_0(x) = max{g(x), (1/(1+r)) [p f_1(xU) + (1-p) f_1(xD)]}.

In general, for an American option expiring at time N we have the following
chain of recursive formulae:

    f_N(x) = g(x),
    f_{n-1}(x) = max{g(x), (1/(1+r)) [p f_n(xU) + (1-p) f_n(xD)]}.
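The backward recursion for the American put translates directly into code. A minimal sketch (the parameter values are illustrative choices, not from the text; p is taken to be the usual risk-neutral probability p = (1 + r - D)/(U - D), assuming D < 1 + r < U):

```python
def american_put(S0, K, U, D, r, N):
    """Value f_0(S0) of the American put via the backward recursion
    f_{n-1}(x) = max(g(x), [p f_n(xU) + (1-p) f_n(xD)] / (1+r))."""
    p = (1 + r - D) / (U - D)        # risk-neutral probability (assumes D < 1 + r < U)
    def g(x):                        # put payoff
        return max(K - x, 0.0)
    # option values at expiry; node j holds price S0 * U^j * D^(N-j)
    values = [g(S0 * U**j * D**(N - j)) for j in range(N + 1)]
    for n in range(N, 0, -1):        # step back from time n to time n-1
        values = [
            max(g(S0 * U**j * D**(n - 1 - j)),
                (p * values[j + 1] + (1 - p) * values[j]) / (1 + r))
            for j in range(n)
        ]
    return values[0]

price = american_put(S0=100.0, K=100.0, U=1.2, D=0.9, r=0.05, N=2)
```

With these numbers p = 0.5, the only positive payoff at expiry is 19 at the bottom node, early exercise is optimal at the down node at time 1 (intrinsic 10 beats the continuation value), and f_0(100) = 5/1.05.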
3.6 Proofs of propositions
Proof (of Proposition 3.8)
(i) We have proved that if f is measurable then so are f^+, f^-. Conversely,
note that f(x) = f^+(x) - f^-(x), so Theorem 3.5 gives the result.
(ii) The function u \mapsto |u| is continuous, so Lemma 3.7 with F(u,v) = |u| gives
measurability of |f| (an alternative is to use |f| = f^+ + f^-). To see that the
converse is not true, take a non-measurable set A and let f = 1_A - 1_{A^c}. It is
non-measurable since {x : f(x) > 0} = A is non-measurable. But |f| = 1 is
clearly measurable. \square
Proof (of Proposition 3.15)
Since f \le ess sup f and g \le ess sup g a.e., by adding we have f + g \le ess sup f +
ess sup g a.e. So the number ess sup f + ess sup g belongs to the set {z : f + g \le
z a.e.}, and hence the infimum of this set is no larger than this number. \square
Proof (of Proposition 3.22)
First note that the \sigma-field generated by S(N) is of the form F = {S(N)^{-1}(B) :
B Borel}, since these sets form a \sigma-field and any other \sigma-field with respect to
which S(N) is measurable has to contain all inverse images of Borel sets.
Next we proceed in three steps:
1. Suppose X = 1_A for A \in F. Then A = S(N)^{-1}(B) for a Borel subset B of
\mathbb{R}. Put f = 1_B and clearly X = f \circ S(N).
2. If X is a step function, X = \sum c_i 1_{A_i}, then take f = \sum c_i 1_{B_i} where
A_i = S(N)^{-1}(B_i).
3. In general, a measurable function X can be approximated by step functions
X_n = \sum_{k=0}^{2^{2n}} (k/2^n) 1_{X^{-1}([k/2^n, (k+1)/2^n))} (see Proposition 4.15 for more
details), and we take f = limsup f_n, where f_n corresponds to X_n as in step 2, and
the sequence clearly converges on the range of S(N). \square
Proof (of Proposition 3.24)
We follow the lines of the previous proof.
1. Suppose X = 1_A for A \in F. Then A = (S(0), S(1), ..., S(N))^{-1}(B) for Borel
B \subset \mathbb{R}^{N+1}, and f = 1_B satisfies the claim.
Steps 2 and 3 are the same as in the proof of the previous proposition. \square
4
Integral
The theory developed in this chapter deals with Lebesgue measure for the sake
of simplicity. However, all we need (except for the section where we discuss
Riemann integration) is the property of m being a measure, i.e. a countably
additive (extended-) real-valued function \mu defined on a \sigma-field F of subsets of a
fixed set \Omega. Therefore, the theory developed for the measure space (\mathbb{R}, M, m) in
the following sections can be extended virtually without change to an abstractly
given measure space (\Omega, F, \mu).
We encourage the reader to bear in mind the possibility of such a generalisation.
We will need it in the probability section at the end of the chapter, and in the
following chapters.
4.1 Definition of the integral
We are now able to resolve one of the problems we identified earlier: how to
integrate functions like 1_\mathbb{Q}, which take only finitely many values, but where the
sets on which these values are taken are not at all 'like intervals'.
Definition 4.1
A non-negative function \varphi : \mathbb{R} \to \mathbb{R} which takes only finitely many values, i.e.
the range of \varphi is a finite set of distinct non-negative reals {a_1, a_2, ..., a_n}, is a
simple function if all the sets

    A_i = \varphi^{-1}({a_i}),   i = 1, 2, ..., n,

are measurable. Note that the sets A_i \in M are pairwise disjoint and their
union is \mathbb{R}.
Clearly we can write

    \varphi(x) = \sum_{i=1}^{n} a_i 1_{A_i}(x),

so that (by Theorem 3.5) each simple function is measurable.
Definition 4.2
The (Lebesgue) integral over E \in M of the simple function \varphi is given by

    \int_E \varphi \, dm = \sum_{i=1}^{n} a_i m(A_i \cap E)

(see Figure 4.1). Note that since we shall allow m(A_i \cap E) = +\infty, we use the
convention 0 \times \infty = 0 here.

Figure 4.1 Integral of a simple function
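The definition is mechanical once the measures m(A_i \cap E) are known. A minimal sketch (the sets and values below are illustrative, not from the text):

```python
def simple_integral(terms):
    """Integral of a simple function over E, given as a list of pairs
    (value a_i, measure of A_i intersected with E); applies 0 x infinity = 0."""
    total = 0.0
    for a, mass in terms:
        if a == 0.0:
            continue  # the convention 0 x infinity = 0
        total += a * mass
    return total

# phi = 2 on a set of measure 2, phi = 5 on a set of measure 1, phi = 0 elsewhere:
value = simple_integral([(2.0, 2.0), (5.0, 1.0), (0.0, float("inf"))])
```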
Example 4.3
Consider the simple function 1_\mathbb{Q}, which takes the value 1 on \mathbb{Q} and 0 on \mathbb{R} \setminus \mathbb{Q}.
By the above definition we have

    \int 1_\mathbb{Q} \, dm = 1 \times m(\mathbb{Q}) + 0 \times m(\mathbb{R} \setminus \mathbb{Q}) = 0

since \mathbb{Q} is a null set. Recall that this function is not Riemann-integrable.
Similarly, 1_C has integral 0, where C is the Cantor set.
Exercise 4.1
Find the integral of \varphi over E where
(a) \varphi(x) = Int(x), E = [0,10],
(b) \varphi(x) = Int(x^2), E = [0,2],
(c) \varphi(x) = Int(sin x), E = [0, 2\pi],
and Int denotes the integer part of a real number. (Note that many texts
use the symbol [x] to denote Int(x). We prefer to use Int for increased
clarity.)
In order to extend the integral to more general functions, Henri Lebesgue (in
1902) adopted an apparently obvious, but subtle, device: instead of partitioning
the domain of a bounded function f into many small intervals, he partitioned
its range into a finite number of small intervals of the form A_i = [a_{i-1}, a_i),
and approximated the 'area' under the graph of f by the upper sum

    S(n) = \sum_{i=1}^{n} a_i m(f^{-1}(A_i))

and the lower sum

    s(n) = \sum_{i=1}^{n} a_{i-1} m(f^{-1}(A_i))

respectively; then integrable functions had the property that the infimum of
all upper sums equals the supremum of all lower sums, mirroring Riemann's
construction (see also Figure 3.1).
A century of experience with the Lebesgue integral has led to many equivalent
definitions, some of them technically (if not always conceptually) simpler.
We shall follow a version which, while very similar to Lebesgue's original
construction, allows us to make full use of the measure theory developed already.
First we stay with non-negative functions:
Definition 4.4
For any non-negative measurable function f and E \in M, the integral \int_E f \, dm
is defined as

    \int_E f \, dm = sup Y(E, f)

where

    Y(E, f) = { \int_E \varphi \, dm : 0 \le \varphi \le f, \varphi simple }.

Note that the integral can be +\infty, and is always non-negative. Clearly, the
set Y(E, f) is always of the form [0, x] or [0, x), where the value x = +\infty is
allowed.
If E = [a, b] we write the integral as

    \int_a^b f \, dm   or   \int_a^b f(x) \, dm(x),

or even as \int_a^b f(x) \, dx when no confusion is possible (and we set
\int_a^b f \, dm = -\int_b^a f \, dm if a > b). The notation \int f \, dm means \int_\mathbb{R} f \, dm.
Clearly, if for some A \in M and a non-negative measurable function g we
have g = 0 on A^c, then any non-negative simple function that lies below g must
be zero on A^c. Applying this to g = f \cdot 1_A we obtain the important identity

    \int_A f \, dm = \int f 1_A \, dm.
Exercise 4.2
Suppose that f : [0,1] \to \mathbb{R} is defined by letting f(x) = 0 on the Cantor
set and f(x) = k for all x in each interval of length 3^{-k} which has been
removed from [0,1]. Calculate \int_0^1 f \, dm.
Hint: Recall that \sum_{k=1}^{\infty} k x^{k-1} = \frac{d}{dx} (\sum_{k=0}^{\infty} x^k) = \frac{1}{(1-x)^2} when |x| < 1.
If f is a simple function, we now have two definitions of the integral; thus
for consistency you should check carefully that the above definitions coincide.
Proposition 4.5
For simple functions, Definitions 4.2 and 4.4 are equivalent.
Furthermore, we can prove the following basic properties of integrals of
simple functions:
Theorem 4.6
Let \varphi, \psi be simple functions. Then:
(i) if \varphi \le \psi then \int_E \varphi \, dm \le \int_E \psi \, dm,
(ii) if A, B are disjoint sets in M, then

    \int_{A \cup B} \varphi \, dm = \int_A \varphi \, dm + \int_B \varphi \, dm,

(iii) for all constants a > 0,

    \int_E a\varphi \, dm = a \int_E \varphi \, dm.
Proof
(i) Notice that Y(E, \varphi) \subseteq Y(E, \psi) (we use Definition 4.4).
(ii) Employing the properties of m we have (writing \varphi = \sum c_i 1_{D_i})

    \int_{A \cup B} \varphi \, dm = \sum c_i m(D_i \cap (A \cup B))
        = \sum c_i (m(D_i \cap A) + m(D_i \cap B))
        = \sum c_i m(D_i \cap A) + \sum c_i m(D_i \cap B)
        = \int_A \varphi \, dm + \int_B \varphi \, dm.

(iii) If \varphi = \sum c_i 1_{A_i} then a\varphi = \sum a c_i 1_{A_i} and

    \int_E a\varphi \, dm = \sum a c_i m(E \cap A_i) = a \sum c_i m(E \cap A_i) = a \int_E \varphi \, dm

as required. \square
Next we show that the properties of the integrals of simple functions extend
to the integrals of non-negative measurable functions:

Theorem 4.7
Suppose f and g are non-negative measurable functions.
(i) If A \in M and f \le g on A, then

    \int_A f \, dm \le \int_A g \, dm.

(ii) If B \subseteq A, A, B \in M, then

    \int_B f \, dm \le \int_A f \, dm.

(iii) For a \ge 0,

    \int_A a f \, dm = a \int_A f \, dm.

(iv) If A is null, then

    \int_A f \, dm = 0.

(v) If A, B \in M, A \cap B = \emptyset, then

    \int_{A \cup B} f \, dm = \int_A f \, dm + \int_B f \, dm.
Proof
(i) Notice that Y(A, f) \subseteq Y(A, g) (there is more room to squeeze simple
functions under g than under f) and the sup of a bigger set is larger.
(ii) If \varphi is a simple function lying below f on B, then extending it by zero
outside B we obtain a simple function which is below f on A. The integrals of
these simple functions are the same, so Y(B, f) \subseteq Y(A, f) and we conclude as
in (i).
(iii) The elements of the set Y(A, af) are of the form a \times x where x \in Y(A, f),
so the same relation holds between their suprema.
(iv) For any simple function \varphi \le f, \int_A \varphi \, dm = 0. To see this, take \varphi = \sum c_i 1_{E_i},
say; then m(A \cap E_i) = 0 for each i, so Y(A, f) = {0}.
(v) The elements of Y(A \cup B, f) are of the form \int_{A \cup B} \varphi \, dm, so by Theorem 4.6
(ii) they are of the form \int_A \varphi \, dm + \int_B \varphi \, dm. So Y(A \cup B, f) \subseteq Y(A, f) +
Y(B, f), and taking suprema this yields \int_{A \cup B} f \, dm \le \int_A f \, dm + \int_B f \, dm. For
the opposite inequality, suppose that the simple functions \varphi and \psi satisfy:
\varphi \le f on A and \varphi = 0 off A, while \psi \le f on B and \psi = 0 off B. Since
A \cap B = \emptyset, we can construct a new simple function \gamma \le f by setting \gamma = \varphi on
A, \gamma = \psi on B and \gamma = 0 outside A \cup B. Then

    \int_A \varphi \, dm + \int_B \psi \, dm = \int_{A \cup B} \gamma \, dm \le \int_{A \cup B} f \, dm.

On the right we have an upper bound which remains valid for all simple
functions that lie below f on A \cup B. Thus taking suprema over \varphi and \psi separately
on the left gives \int_A f \, dm + \int_B f \, dm \le \int_{A \cup B} f \, dm. \square
Exercise 4.3
Prove the following mean value theorem for the integral: if a \le f(x) \le b
for x \in A, then a \, m(A) \le \int_A f \, dm \le b \, m(A).
We now confirm that null sets are precisely the 'negligible sets' for integration theory:
Theorem 4.8
Suppose f is a non-negative measurable function. Then f = 0 a.e. if and only
if \int_\mathbb{R} f \, dm = 0.
Proof
First, note that if f = 0 a.e. and 0 \le \varphi \le f is a simple function, then \varphi = 0
a.e., since neither f nor \varphi takes a negative value. Thus \int_\mathbb{R} \varphi \, dm = 0 for all
such \varphi, and so \int_\mathbb{R} f \, dm = 0 also.
Conversely, given \int_\mathbb{R} f \, dm = 0, let E = {x : f(x) > 0}. Our goal is to show
that m(E) = 0. Put

    E_n = f^{-1}([1/n, \infty))   for n \ge 1.

Clearly, (E_n) increases to E with

    E = \bigcup_{n=1}^{\infty} E_n.

To show that m(E) = 0 it is sufficient to prove that m(E_n) = 0 for all n (see
Theorem 2.19). The function \varphi = (1/n) 1_{E_n} is simple and \varphi \le f by the
definition of E_n. So

    \int_\mathbb{R} \varphi \, dm = \frac{1}{n} m(E_n) \le \int_\mathbb{R} f \, dm = 0,

hence m(E_n) = 0 for all n. \square
Using the results proved so far, the following 'a.e.' version of the monotonicity
of the integral is not difficult to prove:

Proposition 4.9
If f and g are measurable, then f \le g a.e. implies \int f \, dm \le \int g \, dm.

Hint: Let A = {x : f(x) \le g(x)}; then B = A^c is null and f 1_A \le g 1_A. Now
use Theorems 4.7 and 4.8.
Using Theorems 3.5 and 3.9 you should now provide a second proof of a
result we already noted in Proposition 3.8 but repeat here for emphasis:
Proposition 4.10
The function f : \mathbb{R} \to \mathbb{R} is measurable iff both f^+ and f^- are measurable.
4.2 Monotone convergence theorems
The crux of Lebesgue integration is its convergence theory. We can make a
start on that by giving the following famous result:
Theorem 4.11 (Fatou's lemma)
If {f_n} is a sequence of non-negative measurable functions then

    \liminf_{n\to\infty} \int_E f_n \, dm \ge \int_E (\liminf_{n\to\infty} f_n) \, dm.
Proof
Write f = \liminf_{n\to\infty} f_n and recall that

    f = \lim_{n\to\infty} g_n

where g_n = \inf_{k \ge n} f_k (the sequence g_n is non-decreasing). Let \varphi be a simple
function, \varphi \le f. To show that

    \int_E f \, dm \le \liminf_{n\to\infty} \int_E f_n \, dm

it is sufficient to see that

    \int_E \varphi \, dm \le \liminf_{n\to\infty} \int_E f_n \, dm

for any such \varphi.
The set where f = 0 is irrelevant since it does not contribute to \int_E f \, dm,
so we can assume, without loss of generality, that f > 0 on E. Put

    \tilde{\varphi}(x) = \varphi(x) - \varepsilon   if \varphi(x) > 0,
    \tilde{\varphi}(x) = 0                          if \varphi(x) = 0 or x \notin E,

where \varepsilon > 0 is sufficiently small to ensure \tilde{\varphi} \ge 0.
Now \tilde{\varphi} < f and g_n \nearrow f, so 'eventually' g_n \ge \tilde{\varphi}. We make the last
statement more precise: put

    A_n = {x : g_n(x) \ge \tilde{\varphi}(x)}

and we have

    \bigcup_{n=1}^{\infty} A_n = \mathbb{R}.

Next,
    \int_{A_n \cap E} \tilde{\varphi} \, dm \le \int_{A_n \cap E} g_n \, dm   (as g_n dominates \tilde{\varphi} on A_n)
                                \le \int_{A_n \cap E} f_k \, dm   for k \ge n (by the definition of g_n)
                                \le \int_E f_k \, dm              (as E is the larger set)

for k \ge n. Hence

    \int_{A_n \cap E} \tilde{\varphi} \, dm \le \liminf_{k\to\infty} \int_E f_k \, dm.        (4.1)

Now we let n \to \infty: writing \tilde{\varphi} = \sum_{i=1}^{l} c_i 1_{B_i} for some c_i \ge 0,
B_i \in M, i \le l, we have \int_{A_n \cap E} \tilde{\varphi} \, dm = \sum_i c_i m(B_i \cap A_n \cap E) \to
\sum_i c_i m(B_i \cap E) = \int_E \tilde{\varphi} \, dm, and the inequality (4.1) remains true
in the limit:

    \int_E \tilde{\varphi} \, dm \le \liminf_{k\to\infty} \int_E f_k \, dm.
We are close: all we need is to replace \tilde{\varphi} by \varphi in the last relation. This will be
done by letting \varepsilon \to 0, but some care will be needed.
Suppose that m({x : \varphi(x) > 0}) < \infty. Then

    \int_E \tilde{\varphi} \, dm = \int_E \varphi \, dm - \varepsilon \, m({x : \varphi(x) > 0})

and we get the result by letting \varepsilon \to 0.
The case m({x : \varphi(x) > 0}) = \infty has to be treated separately. Here
\int_E \varphi \, dm = \infty, so \int_E f \, dm = \infty. We have to show that

    \liminf_{k\to\infty} \int_E f_k \, dm = \infty.

Let c_i be the values of \varphi and let a = (1/2) min{c_i} ({c_i} is a finite set!). Similarly
to the above, put

    D_n = {x : g_n(x) > a},

and m(D_n \cap E) \to \infty since D_n \nearrow \mathbb{R}. As before,

    a \, m(D_n \cap E) \le \int_{D_n \cap E} g_n \, dm \le \int_{D_n \cap E} f_k \, dm \le \int_E f_k \, dm

for k \ge n, so \liminf_{k\to\infty} \int_E f_k \, dm has to be infinite. \square
Example 4.12
Let f_n = 1_{[n,n+1]}. Clearly \int f_n \, dm = 1 for all n, while \liminf f_n = 0 (= \lim f_n),
so the above inequality may be strict and we have

    \int (\lim f_n) \, dm \ne \lim \int f_n \, dm.
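Example 4.12 is easy to see concretely (a sketch, not from the text): each f_n is a simple function whose integral is just m([n, n+1]) = 1, while at any fixed point the sequence is eventually zero.

```python
def f(n, x):
    """f_n = indicator of [n, n+1]."""
    return 1.0 if n <= x <= n + 1 else 0.0

# each f_n is simple, so its integral is m([n, n+1]) = 1:
integrals = [float((n + 1) - n) for n in range(10)]

# yet for each fixed x, f_n(x) = 0 once n > x, so the pointwise limit is 0:
limit_tail = [f(n, 3.7) for n in range(4, 20)]
```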
Exercise 4.4
Construct an example of a sequence of functions with the strict inequality
as above, such that all In are zero outside the interval [0,1].
It is now easy to prove one of the two main convergence theorems.

Theorem 4.13 (monotone convergence theorem)
If {f_n} is a sequence of non-negative measurable functions, and {f_n(x) : n \ge 1}
increases monotonically to f(x) for each x, i.e. f_n \nearrow f pointwise, then

    \lim_{n\to\infty} \int_E f_n \, dm = \int_E f \, dm.
Proof
Since f_n \le f, we have \int_E f_n \, dm \le \int_E f \, dm and so

    \limsup_{n\to\infty} \int_E f_n \, dm \le \int_E f \, dm.

Fatou's lemma gives

    \int_E f \, dm \le \liminf_{n\to\infty} \int_E f_n \, dm,

which together with the basic relation

    \liminf_{n\to\infty} \int_E f_n \, dm \le \limsup_{n\to\infty} \int_E f_n \, dm

gives

    \int_E f \, dm = \liminf_{n\to\infty} \int_E f_n \, dm = \limsup_{n\to\infty} \int_E f_n \, dm,

hence the sequence \int_E f_n \, dm converges to \int_E f \, dm. \square
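The theorem can be watched in action numerically (a sketch with an illustrative unbounded f, not an example from the text): take f(x) = x^{-1/2} on (0,1] and its truncations f_n = min(f, n), which increase to f; the integrals 2 - 1/n converge to \int_{(0,1]} f \, dm = 2.

```python
def truncated_integral(n):
    """Exact value of the integral of min(x^(-1/2), n) over (0, 1]: the integrand
    equals n on (0, 1/n^2] and x^(-1/2) on (1/n^2, 1], giving 1/n + (2 - 2/n)."""
    return n * (1.0 / n**2) + (2.0 - 2.0 * (1.0 / n**2) ** 0.5)

values = [truncated_integral(n) for n in (1, 10, 100, 1000)]
```

The values increase monotonically towards 2, as the theorem predicts.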
Corollary 4.14
Suppose {f_n} and f are non-negative and measurable. If {f_n} increases to f
almost everywhere, then we still have \int_E f_n \, dm \nearrow \int_E f \, dm for all
measurable E.

Proof
Suppose that f_n \nearrow f a.e. and A is the set where the convergence holds, so
that A^c is null. We can define

    g_n = f_n on A, g_n = 0 on A^c;   g = f on A, g = 0 on A^c.

Then using E = [E \cap A^c] \cup [E \cap A] we get

    \int_E g_n \, dm = \int_{E \cap A} f_n \, dm + \int_{E \cap A^c} 0 \, dm
        = \int_{E \cap A} f_n \, dm + \int_{E \cap A^c} f_n \, dm
        = \int_E f_n \, dm

(since E \cap A^c is null), and similarly \int_E g \, dm = \int_E f \, dm. The convergence
g_n \to g holds everywhere, so by Theorem 4.13, \int_E g_n \, dm \to \int_E g \, dm. \square
To apply the monotone convergence theorem it is convenient to approximate
non-negative measurable functions by increasing sequences of simple functions.

Proposition 4.15
For any non-negative measurable f there is a sequence s_n of non-negative
simple functions such that s_n \nearrow f.

Hint: Put

    s_n(x) = \sum_{k=0}^{n 2^n - 1} \frac{k}{2^n} 1_{f^{-1}([k/2^n, (k+1)/2^n))}(x) + n \, 1_{f^{-1}([n, \infty))}(x)

(see Figure 4.2).

Figure 4.2 Approximation by simple functions
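The standard dyadic construction behind the hint can be sketched in code (illustrative only; the test function x^2 on [0, 1) is an arbitrary choice):

```python
import math

def s_n(f, n):
    """Dyadic approximation: s_n(x) = k / 2^n on the set where
    k/2^n <= f(x) < (k+1)/2^n (with k < n 2^n), and s_n(x) = n where f(x) >= n."""
    def s(x):
        fx = f(x)
        if fx >= n:
            return float(n)
        return math.floor(fx * 2**n) / 2**n
    return s

f = lambda x: x * x
s8 = s_n(f, 8)
max_err = max(f(x / 100.0) - s8(x / 100.0) for x in range(100))  # error on [0, 1)
```

Since f < 1 on [0, 1), the error is at most 2^{-n}, and the approximations increase with n.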
4.3 Integrable functions
All the hard work is done: we can extend the integral very easily to general
real functions, using the positive part f^+ = max(f, 0) and the negative part
f^- = max(-f, 0) of any measurable function f : \mathbb{R} \to \mathbb{R}. We will not use the
non-negative measurable function |f| alone: as we saw in Proposition 3.8, |f|
can be measurable without f being measurable!
Definition 4.16
If E \in M and the measurable function f has both \int_E f^+ \, dm and \int_E f^- \, dm
finite, then we say that f is integrable, and define

    \int_E f \, dm = \int_E f^+ \, dm - \int_E f^- \, dm.

The set of all functions that are integrable over E is denoted by L^1(E). In what
follows E will be fixed and we often simply write L^1 for L^1(E).
Exercise 4.5
For which \alpha is f(x) = x^\alpha in L^1(E), where (a) E = (0,1); (b) E = (1, \infty)?
Hint: See the remark following Example 4.27 below.
Note that f is integrable iff |f| is integrable, and that

    \int_E |f| \, dm = \int_E f^+ \, dm + \int_E f^- \, dm.

Thus the Lebesgue integral is an 'absolute' integral: we cannot 'make' a function
integrable by cancellation of large positive and negative parts. This has the
consequence that some functions which have improper Riemann integrals fail
to be Lebesgue-integrable (see Section 4.5).
The properties of the integral of non-negative functions extend to general, not
necessarily non-negative, integrable functions.

Proposition 4.17
If f and g are integrable and f \le g, then \int f \, dm \le \int g \, dm.

Hint: If f \le g, then f^+ \le g^+ but f^- \ge g^-.
Remark 4.18
We observe (following [13], 5.12) that many proofs of results concerning
integrable functions follow a standard pattern, utilising linearity and monotone
convergence properties. To prove that a 'linear' result holds for all functions in
a space such as L^1(E) we proceed in four steps:
1. verify that the required property holds for indicator functions: this is
usually so by definition,
2. use linearity to extend the property to non-negative simple functions,
3. then use monotone convergence to show that the property is shared by all
non-negative measurable functions,
4. finally, extend to the whole class of functions by writing f = f^+ - f^- and
using linearity again.
The next result gives a good illustration of the technique.
We wish to show that the mapping f \mapsto \int_A f \, dm is linear. This fact is
interesting on its own, but will also allow us to show that L^1 is a vector space.

Theorem 4.19
For any integrable functions f, g their sum f + g is also integrable and

    \int_E (f + g) \, dm = \int_E f \, dm + \int_E g \, dm.
Proof
We apply the technique described in Remark 4.18. In this instance, we can
conveniently begin with simple functions, thus bypassing Step 1.
Step 2. Suppose first that f and g are non-negative simple functions. The
result is a matter of routine calculation: let f = \sum_i a_i 1_{A_i}, g = \sum_j b_j 1_{B_j}.
The sum f + g is also a simple function, which can be written in the form

    f + g = \sum_{i,j} (a_i + b_j) 1_{A_i \cap B_j}.

Therefore

    \int_E (f + g) \, dm = \sum_{i,j} (a_i + b_j) m(A_i \cap B_j \cap E)
        = \sum_i \sum_j a_i m(A_i \cap B_j \cap E) + \sum_j \sum_i b_j m(A_i \cap B_j \cap E)
        = \sum_i a_i m(A_i \cap E) + \sum_j b_j m(B_j \cap E)
        = \int_E f \, dm + \int_E g \, dm,

where we have used the additivity of m and the fact that the A_i cover \mathbb{R} and
the same is true for the B_j.
Step 3. Now suppose that f, g are non-negative measurable (not necessarily
simple) functions. By Proposition 4.15 we can find sequences s_n, t_n of simple
functions such that s_n \nearrow f and t_n \nearrow g. Clearly s_n + t_n \nearrow f + g, hence
using the monotone convergence theorem and the additivity property for simple
functions we obtain

    \int_E (f + g) \, dm = \lim_{n\to\infty} \int_E (s_n + t_n) \, dm
        = \lim_{n\to\infty} \int_E s_n \, dm + \lim_{n\to\infty} \int_E t_n \, dm
        = \int_E f \, dm + \int_E g \, dm.

This, in particular, implies that the integral of f + g is finite if the integrals of f
and g are finite.
Step 4. Finally, let f, g be arbitrary integrable functions. Since

    \int_E |f + g| \, dm \le \int_E (|f| + |g|) \, dm,

we can use Step 3 to deduce that the left-hand side is finite.
We have

    f + g = (f + g)^+ - (f + g)^-,
    f + g = (f^+ - f^-) + (g^+ - g^-),

so

    (f + g)^+ - (f + g)^- = (f^+ - f^-) + (g^+ - g^-).

We rearrange the equality to have only additions on both sides:

    (f + g)^+ + f^- + g^- = (f + g)^- + f^+ + g^+.

We have non-negative functions on both sides, so by what we have proved so far,

    \int_E (f+g)^+ dm + \int_E f^- dm + \int_E g^- dm = \int_E (f+g)^- dm + \int_E f^+ dm + \int_E g^+ dm,

hence

    \int_E (f+g)^+ dm - \int_E (f+g)^- dm = \int_E f^+ dm - \int_E f^- dm + \int_E g^+ dm - \int_E g^- dm.

By definition of the integral, the last relation implies the claim of the theorem.
\square
The following result is a routine application of monotone convergence:

Proposition 4.20
If f is integrable and c \in \mathbb{R}, then

    \int_E (cf) \, dm = c \int_E f \, dm.

Hint: Approximate f by a sequence of simple functions.
We complete the proof that L^1 is a vector space:

Theorem 4.21
For any measurable E, L^1(E) is a vector space.

Proof
Let f, g \in L^1. To show that f + g \in L^1 we have to prove that |f + g| is
integrable:

    \int_E |f + g| \, dm \le \int_E (|f| + |g|) \, dm = \int_E |f| \, dm + \int_E |g| \, dm < \infty.

Now let c be a constant:

    \int_E |cf| \, dm = \int_E |c| |f| \, dm = |c| \int_E |f| \, dm < \infty,

so that cf \in L^1(E). \square
We can now answer an important question on the extent to which the integral
determines the integrand.

Theorem 4.22
If \int_A f \, dm \le \int_A g \, dm for all A \in M, then f \le g almost everywhere. In
particular, if \int_A f \, dm = \int_A g \, dm for all A \in M, then f = g almost everywhere.

Proof
By additivity of the integral (Theorem 4.19) it is sufficient to show that
\int_A h \, dm \ge 0 for all A \in M implies h \ge 0 a.e. (and then take h = g - f).
Write A = {x : h(x) < 0}; then A = \bigcup A_n where A_n = {x : h(x) \le -1/n}. By
monotonicity of the integral,

    \int_{A_n} h \, dm \le -\frac{1}{n} m(A_n),

and the left-hand side is non-negative by hypothesis, which can only happen if
m(A_n) = 0. The sequence of sets A_n increases with n, hence m(A) = 0, and so
h(x) \ge 0 almost everywhere.
A similar argument shows that if \int_A h \, dm \le 0 for all A, then h \le 0 a.e.
This implies the second claim of the theorem: put h = g - f; then \int_A h \, dm is
both non-negative and non-positive, hence h \ge 0 and h \le 0 a.e., thus h = 0
a.e. \square
The next proposition lists further important properties of integrable functions,
whose straightforward proofs are typical applications of the results proved so
far.

Proposition 4.23
(i) An integrable function is a.e. finite.
(ii) For measurable f and A,

    m(A) \inf_A f \le \int_A f \, dm \le m(A) \sup_A f.

(iii) |\int_A f \, dm| \le \int_A |f| \, dm.
(iv) Assume that f \ge 0 and \int f \, dm = 0. Then f = 0 a.e.
The following theorem gives us the possibility of constructing many interesting
measures, and is essential for the development of probability distributions.

Theorem 4.24
Let f \ge 0. Then A \mapsto \int_A f \, dm is a measure.

Proof
Denote \mu(A) = \int_A f \, dm. The goal is to show

    \mu(\bigcup_{i=1}^{\infty} E_i) = \sum_{i=1}^{\infty} \mu(E_i)

for pairwise disjoint E_i. To this end consider the sequence g_n = f 1_{\bigcup_{i=1}^{n} E_i}
and note that g_n \nearrow g, where g = f 1_{\bigcup_{i=1}^{\infty} E_i}. Now

    \int g_n \, dm = \sum_{i=1}^{n} \mu(E_i)

and the monotone convergence theorem completes the proof. \square
4.4 The dominated convergence theorem

Many questions in analysis centre on conditions under which the order of two
limit processes, applied to certain functions, can be interchanged. Since
integration is a limit process applied to measurable functions, it is natural to ask
under what conditions on a pointwise (or pointwise a.e.) convergent sequence
(f_n) the limit of the integrals is the integral of the pointwise limit function f,
i.e. when can we state that \lim \int f_n \, dm = \int (\lim f_n) \, dm? The monotone
convergence theorem (Theorem 4.13) provided the answer that this conclusion is
valid for monotone increasing sequences of non-negative measurable functions,
though in that case, of course, the limits may equal +\infty. The following example
shows that for general sequences of integrable functions the conclusion will not
hold without some further conditions:
Example 4.25
Let f_n(x) = n 1_{[0,1/n]}(x). Clearly f_n(x) \to 0 for all x > 0, hence a.e., but
\int f_n(x) \, dx = 1.
The limit theorem which turns out to be the most useful in practice states
that convergence holds for an a.e. convergent sequence which is dominated by
an integrable function. Again Fatou's lemma holds the key to the proof.
Theorem 4.26 (dominated convergence theorem)
Suppose E \in M. Let (f_n) be a sequence of measurable functions such that
|f_n| \le g a.e. on E for all n \ge 1, where g is integrable over E. If f = \lim_{n\to\infty} f_n
a.e. then f is integrable over E and

    \lim_{n\to\infty} \int_E f_n \, dm = \int_E f \, dm.
Proof
Suppose for the moment that f_n \ge 0. Fatou's lemma gives

    \int_E f \, dm \le \liminf_{n\to\infty} \int_E f_n \, dm.

It is therefore sufficient to show that

    \limsup_{n\to\infty} \int_E f_n \, dm \le \int_E f \, dm.        (4.2)

Fatou's lemma applied to g - f_n gives

    \int_E \lim_{n\to\infty} (g - f_n) \, dm \le \liminf_{n\to\infty} \int_E (g - f_n) \, dm.

On the left we have

    \int_E (g - f) \, dm = \int_E g \, dm - \int_E f \, dm.

On the right

    \liminf_{n\to\infty} \int_E (g - f_n) \, dm = \liminf_{n\to\infty} (\int_E g \, dm - \int_E f_n \, dm)
        = \int_E g \, dm - \limsup_{n\to\infty} \int_E f_n \, dm,

where we have used the elementary fact that

    \liminf_{n\to\infty} (-a_n) = -\limsup_{n\to\infty} a_n.

Putting this together we get

    \int_E g \, dm - \int_E f \, dm \le \int_E g \, dm - \limsup_{n\to\infty} \int_E f_n \, dm.

Finally, subtract \int_E g \, dm (which is finite) and multiply by -1 to arrive at (4.2).
Now consider a general, not necessarily non-negative, sequence (f_n). Since by
the hypothesis

    -g(x) \le f_n(x) \le g(x),

we have

    0 \le f_n(x) + g(x) \le 2g(x),

and we can apply the result proved for non-negative functions to the sequence
f_n(x) + g(x) (the function 2g is of course integrable). \square
Example 4.27
Going back to the example preceding the theorem, f_n = n 1_{[0,1/n]}, we can see
that an integrable g to dominate f_n cannot be found. The least upper bound is
g(x) = \sup_n f_n(x), which gives g(x) = k on (1/(k+1), 1/k], so

    \int g(x) \, dx = \sum_{k=1}^{\infty} k \left( \frac{1}{k} - \frac{1}{k+1} \right) = \sum_{k=1}^{\infty} \frac{1}{k+1} = +\infty.

For a typical positive example consider

    f_n(x) = \frac{n \sin x}{1 + n^2 x^{1/2}}

for x \in (0,1). Clearly f_n(x) \to 0. To conclude that \lim_n \int f_n \, dm = 0 we need
an integrable dominating function. This is usually where some ingenuity is
needed; however, in the present example the most straightforward estimate will
suffice:

    \left| \frac{n \sin x}{1 + n^2 x^{1/2}} \right| \le \frac{n}{1 + n^2 x^{1/2}} \le \frac{n}{n^2 x^{1/2}} = \frac{1}{n x^{1/2}} \le \frac{1}{x^{1/2}}.

(To see from first principles that the dominating function g : x \mapsto 1/\sqrt{x}
is integrable over [0,1] can be rather tedious; cf. the worked example in Chapter 1
for the Riemann integral of x \mapsto \sqrt{x}. However, we shall show shortly that the
Lebesgue and Riemann integrals of a bounded function coincide if the latter
exists, and hence we can apply the fundamental theorem of the calculus to
confirm the integrability of g.)
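A quick numerical look at the second example (a sketch; the midpoint rule below is only an illustrative approximation of the integral):

```python
import math

def f_n(n, x):
    return n * math.sin(x) / (1 + n * n * math.sqrt(x))

def approx_integral(n, steps=20_000):
    """Midpoint-rule approximation of the integral of f_n over (0, 1)."""
    h = 1.0 / steps
    return sum(f_n(n, (i + 0.5) * h) for i in range(steps)) * h

vals = [approx_integral(n) for n in (1, 10, 100, 1000)]
```

The approximate integrals shrink towards 0 as n grows, consistent with the dominated convergence theorem.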
The following facts will be useful later.

Proposition 4.28
Suppose f is integrable and define g_n = f 1_{[-n,n]}, h_n = min(f, n) (both truncate
f in some way: the g_n vanish outside a bounded interval, the h_n are bounded).
Then \int |f - g_n| \, dm \to 0 and \int |f - h_n| \, dm \to 0.

Hint: Use the dominated convergence theorem.
Exercise 4.6
Use the dominated convergence theorem to find

    \lim_{n\to\infty} \int_1^{\infty} f_n(x) \, dx

where
Exercise 4.7
Investigate the convergence of
for a > 0, and for a = 0.
Exercise 4.8
Investigate the convergence of
We will need the following extension of Theorem 4.19:

Proposition 4.29
For a sequence of non-negative measurable functions f_n we have

    \int \sum_{n=1}^{\infty} f_n \, dm = \sum_{n=1}^{\infty} \int f_n \, dm.

Hint: The sequence g_k = \sum_{n=1}^{k} f_n is increasing and converges to \sum_{n=1}^{\infty} f_n.

We cannot yet conclude that the sum of the series on the right-hand side is
a.e. finite, so \sum_{n=1}^{\infty} f_n need not be integrable. However:
Theorem 4.30 (Beppo-Levi)
Suppose that

    \sum_{k=1}^{\infty} \int |f_k| \, dm

is finite. Then the series \sum_{k=1}^{\infty} f_k(x) converges for almost all x, its sum is
integrable, and

    \int \sum_{k=1}^{\infty} f_k \, dm = \sum_{k=1}^{\infty} \int f_k \, dm.
Proof
The function \varphi(x) = \sum_{k=1}^{\infty} |f_k(x)| is non-negative, measurable, and by
Proposition 4.29

    \int \varphi \, dm = \sum_{k=1}^{\infty} \int |f_k| \, dm.

This is finite, so \varphi is integrable. Therefore \varphi is finite a.e. Hence the series
\sum_{k=1}^{\infty} |f_k(x)| converges a.e., and so the series \sum_{k=1}^{\infty} f_k(x) converges
(since it converges absolutely) for almost all x. Let f(x) = \sum_{k=1}^{\infty} f_k(x) (put
f(x) = 0 for x for which the series diverges; the value we choose is irrelevant
since the set of such x is null). For all partial sums we have

    \left| \sum_{k=1}^{n} f_k(x) \right| \le \varphi(x),

so we can apply the dominated convergence theorem to find

    \int f \, dm = \int \lim_{n\to\infty} \sum_{k=1}^{n} f_k \, dm = \lim_{n\to\infty} \int \sum_{k=1}^{n} f_k \, dm
        = \lim_{n\to\infty} \sum_{k=1}^{n} \int f_k \, dm = \sum_{k=1}^{\infty} \int f_k \, dm

as required. \square
Example 4.31
Recalling that
$$\sum_{k=1}^{\infty} k x^{k-1} = \frac{1}{(1-x)^2},$$
we can use the Beppo-Levi theorem to evaluate the integral $\int_0^1 \big(\frac{\log x}{1-x}\big)^2\,dx$: first let $f_n(x) = n x^{n-1} (\log x)^2$ for $n \ge 1$, $x \in (0,1)$, so that $f_n \ge 0$, $f_n$ is continuous, hence measurable, and
$$\sum_{n=1}^{\infty} f_n(x) = \frac{(\log x)^2}{(1-x)^2} = f(x)$$
is finite for $x \in (0,1)$. By Beppo-Levi the sum is integrable and
$$\int_0^1 f(x)\,dx = \sum_{n=1}^{\infty} \int_0^1 f_n(x)\,dx.$$
To calculate $\int_0^1 f_n(x)\,dx$ we first use integration by parts to obtain $\int_0^1 x^{n-1} (\log x)^2\,dx = \frac{2}{n^3}$, so that $\int_0^1 f_n(x)\,dx = \frac{2}{n^2}$ and hence
$$\int_0^1 \Big(\frac{\log x}{1-x}\Big)^2\,dx = \sum_{n=1}^{\infty} \frac{2}{n^2} = \frac{\pi^2}{3}.$$
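This evaluation can be cross-checked numerically (a sketch; the grid size and the series cutoff are arbitrary choices of mine, and the closed value $\pi^2/3$ is my computation via the series $\sum 2/n^2$):

```python
import math

def integrand(x):
    # f(x) = (log x / (1 - x))^2 from Example 4.31
    return (math.log(x) / (1.0 - x)) ** 2

# midpoint rule on (0,1); the endpoint singularity of (log x)^2 at 0 is integrable
N = 200_000
h = 1.0 / N
approx = sum(integrand((i + 0.5) * h) for i in range(N)) * h

# partial sums of  sum 2/n^2, which converge to pi^2/3
series = sum(2.0 / n ** 2 for n in range(1, 10_000))

print(approx, series, math.pi ** 2 / 3)
```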
Exercise 4.9
The following are variations on the above theme:
(a) For which values of $a \in \mathbb{R}$ does the power series $\sum_{n>0} n^a x^n$ define an integrable function on $[-1,1]$?

(b) Show that
$$\int_0^{\infty} \frac{x}{e^x - 1}\,dx = \frac{\pi^2}{6}.$$
4.5 Relation to the Riemann integral
Our prime motivation for introducing the Lebesgue integral has been to provide
a sound theoretical foundation for the twin concepts of measure and integral,
and to serve as the model upon which an abstract theory of measure spaces can
be built. Such a general theory has many applications, a principal one being
the mathematical foundations of the theory of probability. At the same time,
Lebesgue integration has greater scope and more flexibility in dealing with limit
operations than does its Riemann counterpart.
However, just as with the Riemann integral, the computation of specific integrals from first principles is laborious, and we have, as yet, no simple 'recipes'
for handling particular functions. To link the theory with the convenient techniques of elementary calculus we therefore need to take two further steps: to
prove the fundamental theorem of the calculus as stated in Chapter 1 and to
show that the Lebesgue and Riemann integrals coincide whenever the latter
exists. In the process we shall find necessary and sufficient conditions for the
existence of the Riemann integral.
In fact, given Proposition 4.23 the proof of the fundamental theorem becomes a simple application of the intermediate value theorem for continuous
functions, and is left to the reader:
Proposition 4.32
If $f : [a,b] \to \mathbb{R}$ is continuous then $f$ is integrable and the function $F$ given by $F(x) = \int_a^x f\,dm$ is differentiable for $x \in (a,b)$, with derivative $F' = f$.

Hint: Note that if $f \in \mathcal{L}^1$ and $A, B \in \mathcal{M}$ are disjoint, then $\int_{A \cup B} f\,dm = \int_A f\,dm + \int_B f\,dm$. Thus show that we can write $F(x+h) - F(x) = \int_x^{x+h} f\,dm$ for fixed $[x, x+h] \subset (a,b)$.
We turn to showing that Lebesgue's theory extends that of Riemann:
Theorem 4.33
Let $f : [a,b] \to \mathbb{R}$ be bounded.

(i) $f$ is Riemann-integrable if and only if $f$ is a.e. continuous with respect to Lebesgue measure on $[a,b]$.

(ii) Riemann-integrable functions on $[a,b]$ are integrable with respect to Lebesgue measure on $[a,b]$, and the integrals are the same.
Proof
We need to prepare a little for the proof by recalling notation and some basic facts. Recall from Chapter 1 that any partition
$$P = \{a_i : a = a_0 < a_1 < \cdots < a_n = b\}$$
of the interval $[a,b]$, with $\Delta_i = a_i - a_{i-1}$ $(i = 1, 2, \dots, n)$ and with $M_i$ (resp. $m_i$) the sup (resp. inf) of $f$ on $I_i = [a_{i-1}, a_i]$, induces upper and lower Riemann sums $U_P = \sum_{i=1}^n M_i \Delta_i$ and $L_P = \sum_{i=1}^n m_i \Delta_i$. But these are just the Lebesgue integrals of the simple functions $u_P = \sum_{i=1}^n M_i \mathbf{1}_{I_i}$ and $l_P = \sum_{i=1}^n m_i \mathbf{1}_{I_i}$, by definition of the integral for such functions.

Choose a sequence of partitions $(P_n)$ such that each $P_{n+1}$ refines $P_n$ and the length of the largest subinterval in $P_n$ goes to $0$; writing $u_n$ for $u_{P_n}$ and $l_n$ for $l_{P_n}$, we have $l_n \le f \le u_n$ for all $n$. Apply this on the measure space $([a,b], \mathcal{M}_{[a,b]}, m)$, where $m = m_{[a,b]}$ denotes Lebesgue measure restricted to $[a,b]$. Then $u = \inf_n u_n$ and $l = \sup_n l_n$ are measurable functions, and both sequences are monotone, since
$$l_1 \le l_2 \le \cdots \le l \le f \le u \le \cdots \le u_2 \le u_1. \tag{4.3}$$
Thus $u = \lim_n u_n$ and $l = \lim_n l_n$ (pointwise) and all functions in (4.3) are bounded on $[a,b]$ by $M = \sup\{|f(x)| : x \in [a,b]\}$, which is integrable on $[a,b]$. By dominated convergence we conclude that
$$\lim_n U_n = \lim_n \int_a^b u_n\,dm = \int_a^b u\,dm, \qquad \lim_n L_n = \lim_n \int_a^b l_n\,dm = \int_a^b l\,dm,$$
and the limit functions $u$ and $l$ are (Lebesgue-)integrable.
Now suppose that $x$ is not an endpoint of any of the intervals in the partitions $(P_n)$, which excludes only countably many points of $[a,b]$. Then we have:
f is continuous at x iff u(x) = f(x) = l(x).
This follows at once from the definition of continuity, since the length of each
subinterval approaches 0 and so the variation of f over the intervals containing
x approaches 0 iff f is continuous at x.
The Riemann integral $\int_a^b f(x)\,dx$ was defined as the common value of $\lim_n U_n = \int_a^b u\,dm$ and $\lim_n L_n = \int_a^b l\,dm$, whenever these limits are equal.
To prove (i), assume first that $f$ is Riemann-integrable, so that the upper and lower integrals coincide: $\int_a^b u\,dm = \int_a^b l\,dm$. But $l \le f \le u$, hence $\int_a^b (u - l)\,dm = 0$ means that $u = l = f$ a.e. by Theorem 4.22. Hence $f$ is continuous a.e. by the above characterization of continuity of $f$ at $x$, which only excludes a further null set of partition points.

Conversely, if $f$ is a.e. continuous, then $u = f = l$ a.e. and $u$ and $l$ are Lebesgue-measurable, hence so is $f$ (note that this uses the completeness of Lebesgue measure!). But $f$ is also bounded by hypothesis, so it is Lebesgue-integrable over $[a,b]$, and as the functions are a.e. equal, the integrals coincide (but note that $\int_a^b f\,dm$ denotes the Lebesgue integral of $f$!):
$$\int_a^b l\,dm = \int_a^b f\,dm = \int_a^b u\,dm. \tag{4.4}$$
Since the outer integrals are the same, $f$ is by definition also Riemann-integrable, which proves (i).
To prove (ii), note simply that if f is Riemann-integrable, (i) shows that f
is a.e. continuous, hence measurable, and then (4.4) shows that its Lebesgue
integral coincides with the two outer integrals, hence with its Riemann integral.
□
Example 4.34
Recall the following example from Section 1.2: Dirichlet's function defined on
[0,1] by
$$f(x) = \begin{cases} 0 & \text{if } x \notin \mathbb{Q} \\ \frac{1}{n} & \text{if } x = \frac{m}{n} \in \mathbb{Q} \text{ (in lowest terms)} \end{cases}$$
is a.e. continuous, hence Riemann-integrable, and its Riemann integral equals
its Lebesgue integral, which is 0, since f is zero outside the null set Q.
We have now justified the unproven claims made in earlier examples when
evaluating integrals, since, at least for any continuous functions on bounded
intervals, the techniques of elementary calculus also give the Lebesgue integrals
of the functions concerned. Since the integral is additive over disjoint domains
use of these techniques also extends to piecewise continuous functions.
Example 4.35 (improper Riemann integrals)
Dealing with improper Riemann integrals involves an additional limit operation; we define such an integral by:
$$\int_{-\infty}^{\infty} f(x)\,dx := \lim_{a \to -\infty,\, b \to \infty} \int_a^b f(x)\,dx$$
whenever the double limit exists. (Other cases of 'improper integrals' are discussed in Remark 4.37.)
Now suppose that for the function $f : \mathbb{R} \to \mathbb{R}$ this improper Riemann integral exists. Then the Riemann integral $\int_a^b f(x)\,dx$ exists for each bounded interval $[a,b]$, so that $f$ is a.e. continuous on each $[a,b]$, and thus on $\mathbb{R}$. The converse is false, however: the function $f$ which takes the value $1$ on $[n, n+1)$ when $n$ is even, and $-1$ when $n$ is odd, is a.e. continuous (and thus Lebesgue-measurable on $\mathbb{R}$), but clearly the above limits fail to exist.

More generally, it is not hard to show that if $f \in \mathcal{L}^1(\mathbb{R})$ then the above double limits will always exist. On the other hand, the existence of the double limit does not by itself guarantee that $f \in \mathcal{L}^1$ without further conditions: consider the function (see Figure 4.3)
$$f(x) = \begin{cases} \dfrac{(-1)^n}{n+1} & \text{if } x \in [n, n+1),\ n \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$$

Figure 4.3 Graph of f
Clearly the improper Riemann integral exists,
$$\int_{-\infty}^{\infty} f(x)\,dx = \sum_{n=0}^{\infty} \frac{(-1)^n}{n+1},$$
and the series converges. However, $f \notin \mathcal{L}^1$, since $\int_{\mathbb{R}} |f|\,dm = \sum_{n=0}^{\infty} \frac{1}{n+1}$, which diverges.
This yields another illustration of the 'absolute' nature of the Lebesgue integral: $f \in \mathcal{L}^1$ iff $|f| \in \mathcal{L}^1$, so we cannot expect a finite sum for an integral whose 'pieces' make up a conditionally convergent series. For non-negative functions these problems do not arise; we have:
Theorem 4.36
If $f \ge 0$ and the above improper Riemann integral of $f$ exists, then the Lebesgue integral $\int_{\mathbb{R}} f\,dm$ exists and equals the improper integral.
Proof
To see this, simply note that the sequence $(f_n)$ with $f_n = f\mathbf{1}_{[-n,n]}$ increases monotonically to $f$, hence $f$ is Lebesgue-measurable. Since $f_n$ is Riemann-integrable on $[-n,n]$, the integrals coincide there, i.e.
$$\int_{\mathbb{R}} f_n\,dm = \int_{-n}^{n} f(x)\,dx$$
for each $n$, so that $f_n \in \mathcal{L}^1(\mathbb{R})$ for all $n$. By hypothesis the double limit
$$\lim_n \int_{-n}^{n} f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx$$
exists. On the other hand,
$$\lim_n \int_{\mathbb{R}} f_n\,dm = \int_{\mathbb{R}} f\,dm$$
by monotone convergence, and so $f \in \mathcal{L}^1(\mathbb{R})$ and
$$\int_{\mathbb{R}} f\,dm = \int_{-\infty}^{\infty} f(x)\,dx,$$
as required. □
Exercise 4.10
Show that the function $f$ given by $f(x) = \frac{\sin x}{x}$ $(x \ne 0)$ has an improper Riemann integral over $\mathbb{R}$, but is not in $\mathcal{L}^1$.
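Numerically this behaviour is easy to watch (a sketch; the cut-offs $10\pi$ and $400\pi$ and the trapezoid grids are my own choices): on $[0,\infty)$ the signed integrals settle near $\pi/2$, while the integrals of $|\sin x|/x$ keep growing like a harmonic series.

```python
import math

def trapezoid(func, a, b, n):
    # composite trapezoid rule
    h = (b - a) / n
    s = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        s += func(a + i * h)
    return s * h

def sinc(x):
    return math.sin(x) / x if x != 0.0 else 1.0

I_large = trapezoid(sinc, 0.0, 400 * math.pi, 500_000)            # near pi/2
A_small = trapezoid(lambda x: abs(sinc(x)), 0.0, 10 * math.pi, 50_000)
A_large = trapezoid(lambda x: abs(sinc(x)), 0.0, 400 * math.pi, 500_000)

print(I_large, math.pi / 2)   # close: the improper integral converges
print(A_small, A_large)       # keeps growing: no absolute convergence
```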
Remark 4.37
A second kind of improper Riemann integral is designed to handle functions which have asymptotes on a bounded interval, such as $f(x) = \frac{1}{\sqrt{x}}$ on $(0,1)$. For such cases we can define
$$\int_a^b f(x)\,dx = \lim_{\varepsilon \searrow 0} \int_{a+\varepsilon}^{b} f(x)\,dx$$
when the limit exists. (Similar remarks apply to the upper limit of integration.)
4.6 Approximation of measurable functions
The previous section provided an indication of the extent of the additional 'freedom' gained by developing the Lebesgue integral: Riemann integration binds us to functions whose discontinuities form an $m$-null set, while we can still find the Lebesgue integral of functions that are nowhere continuous, such as $\mathbf{1}_{\mathbb{Q}}$. We may ask, however, how real this additional generality is: can we, for example, approximate an arbitrary $f \in \mathcal{L}^1$ by continuous functions? In fact, since continuity is a local property, can we do this for arbitrary measurable functions? And this, in turn, provides a link with simple functions, since every measurable function is a limit of simple functions. We can go further, and ask whether for a simple function $g$ approximating a given measurable function $f$ we can choose the inverse image $g^{-1}(\{a_i\})$ of each element of the range of $g$ to be an interval (such a $g$ is usually called a step function; $g = \sum_n c_n \mathbf{1}_{I_n}$, where the $I_n$ are intervals). We shall tackle this question first:
Theorem 4.38
If $f$ is a bounded measurable function on $[a,b]$ and $\varepsilon > 0$ is given, then there exists a step function $h$ such that $\int_a^b |f - h|\,dm < \varepsilon$.
Proof
First assume additionally that $f \ge 0$. Then $\int_a^b f\,dm$ is well defined as
$$\sup\Big\{ \int_a^b \varphi\,dm : 0 \le \varphi \le f,\ \varphi \text{ simple} \Big\}.$$
Since $f \ge \varphi$ we have $|f - \varphi| = f - \varphi$, so we can find a simple function $\varphi$ satisfying
$$\int_a^b |f - \varphi|\,dm = \int_a^b f\,dm - \int_a^b \varphi\,dm < \frac{\varepsilon}{2}.$$
It then remains to approximate an arbitrary simple function $\varphi$ which vanishes off $[a,b]$ by a step function $h$. The finite range $\{a_1, a_2, \dots, a_n\}$ of the function $\varphi$ partitions $[a,b]$, yielding disjoint measurable sets $E_i = \varphi^{-1}(\{a_i\})$ such that $\bigcup_{i=1}^n E_i = [a,b]$. We now approximate each $E_i$ by intervals: note that since $\varphi$ is simple, $M = \sup\{\varphi(x) : x \in [a,b]\} < \infty$. By Theorem 2.17 we can find open sets $O_i$ such that $E_i \subset O_i$ and $m(O_i \setminus E_i) < \frac{\varepsilon}{4nM}$ for $i \le n$. Since each $E_i$ has finite measure, so do the $O_i$, hence each $O_i$ can in turn be approximated by a finite union of disjoint open intervals: we know that $O_i = \bigcup_{j=1}^{\infty} I_{ij}$, where the open intervals can be chosen disjoint, so that $m(O_i) = \sum_{j=1}^{\infty} m(I_{ij}) < \infty$. As the series converges, we can find $k_i$ such that $m(O_i) - m\big(\bigcup_{j=1}^{k_i} I_{ij}\big) < \frac{\varepsilon}{4nM}$. Thus with $G_i = \bigcup_{j=1}^{k_i} I_{ij}$ we have
$$\int_a^b |\mathbf{1}_{E_i} - \mathbf{1}_{G_i}|\,dm = m(E_i \triangle G_i) < \frac{\varepsilon}{2nM}$$
for each $i \le n$. So set $h = \sum_{i=1}^{n} a_i \mathbf{1}_{G_i}$. This step function satisfies $\int_a^b |\varphi - h|\,dm < \frac{\varepsilon}{2}$, and hence
$$\int_a^b |f - h|\,dm < \varepsilon.$$
The extension to general $f$ is clear: $f^+$ and $f^-$ can be approximated to within $\frac{\varepsilon}{2}$ by step functions $h_1$ and $h_2$ say, so with $h = h_1 - h_2$ we obtain
$$\int_a^b |f - h|\,dm \le \int_a^b |f^+ - h_1|\,dm + \int_a^b |f^- - h_2|\,dm < \varepsilon,$$
which completes the proof. □
Figure 4.4 Approximation by continuous functions
The 'payoff' is now immediate: with $f$ and $h$ as above, we can re-order the intervals $I_{ij}$ into a single finite sequence $(J_m)_{m \le n}$ with $J_m = (c_m, d_m)$ and $h = \sum_{m=1}^{n} a_m \mathbf{1}_{J_m}$. We may assume that $l(J_m) = d_m - c_m > 2\delta$, and approximate $\mathbf{1}_{J_m}$ by a continuous function $g_m$ by setting $g_m = 1$ on the slightly smaller interval $(c_m + \delta, d_m - \delta)$ and $0$ outside $J_m$, while extending linearly in between (see Figure 4.4). It is obvious that $g_m$ is continuous and $\int_a^b |\mathbf{1}_{J_m} - g_m|\,dm < 2\delta$. Repeating this for each $J_m$ and taking $\delta < \frac{\varepsilon}{4nK}$, where $K = \max_{m \le n} |a_m|$, shows that the continuous function $g = \sum_{m=1}^{n} a_m g_m$ satisfies $\int_a^b |h - g|\,dm < \frac{\varepsilon}{2}$. Combining this inequality with Theorem 4.38 yields:
Theorem 4.39
Given $f \in \mathcal{L}^1$ and $\varepsilon > 0$, we can find a continuous function $g$, vanishing outside some finite interval, such that $\int |f - g|\,dm < \varepsilon$.
Proof
The preceding argument has verified this when $f$ is a bounded measurable function vanishing off some interval $[a,b]$. For a given $f \in \mathcal{L}^1[a,b]$ we can again assume without loss of generality that $f \ge 0$. Let $f_n = \min(f, n)$; then the $f_n$ are bounded measurable functions dominated by $f$, and $f_n \to f$, so that $\int_a^b |f - f_N|\,dm < \frac{\varepsilon}{2}$ for some $N$. We can now find a continuous $g$, vanishing outside a finite interval, such that $\int_a^b |f_N - g|\,dm < \frac{\varepsilon}{2}$. Thus $\int_a^b |f - g|\,dm < \varepsilon$.

Finally, let $f \in \mathcal{L}^1(\mathbb{R})$ with $f \ge 0$ be given. Choose $n$ large enough to ensure that $\int_{\{|x| \ge n\}} f\,dm < \frac{\varepsilon}{3}$ (which we can do as $\int_{\mathbb{R}} |f|\,dm$ is finite; Proposition 4.28), and simultaneously choose a continuous $g$ with $\int_{\{|x| \ge n\}} |g|\,dm < \frac{\varepsilon}{3}$ which satisfies $\int_{-n}^{n} |f - g|\,dm < \frac{\varepsilon}{3}$. Thus $\int_{\mathbb{R}} |f - g|\,dm < \varepsilon$. □
The well-known Riemann-Lebesgue lemma, which is very useful in the discussion of Fourier series, can be easily deduced from the above approximation
theorems:
Lemma 4.40 (Riemann-Lebesgue)
Suppose $f \in \mathcal{L}^1(\mathbb{R})$. Then the sequences $s_k = \int_{-\infty}^{\infty} f(x) \sin kx\,dx$ and $c_k = \int_{-\infty}^{\infty} f(x) \cos kx\,dx$ both converge to $0$ as $k \to \infty$.
Proof
We prove this for $(s_k)$, leaving the other, similar, case to the reader. For simplicity of notation write $\int$ for $\int_{-\infty}^{\infty}$. The transformation $x = y + \frac{\pi}{k}$ shows that
$$s_k = \int f\Big(y + \frac{\pi}{k}\Big) \sin(ky + \pi)\,dy = -\int f\Big(y + \frac{\pi}{k}\Big) \sin(ky)\,dy.$$
Since $|\sin x| \le 1$,
$$\int \Big| f(x) - f\Big(x + \frac{\pi}{k}\Big) \Big|\,dx \ge \Big| \int \Big( f(x) - f\Big(x + \frac{\pi}{k}\Big) \Big) \sin kx\,dx \Big| = 2|s_k|.$$
It will therefore suffice to prove that $\int |f(x) - f(x+h)|\,dx \to 0$ when $h \to 0$. This is most easily done by approximating $f$ by a continuous $g$ which vanishes outside some finite interval $[a,b]$, and such that $\int |f - g|\,dm < \frac{\varepsilon}{3}$ for a given $\varepsilon > 0$. For $|h| < 1$, the continuous function $g_h(x) = g(x+h)$ then vanishes off $[a-1, b+1]$ and
$$\int |f(x+h) - f(x)|\,dm \le \int |f(x+h) - g(x+h)|\,dm + \int |g(x+h) - g(x)|\,dm + \int |g(x) - f(x)|\,dm.$$
The first and last integrals on the right are less than $\frac{\varepsilon}{3}$, while the integrand of the second can be made less than $\frac{\varepsilon}{3(b-a+2)}$ whenever $|h| < \delta$, by an appropriate choice of $\delta > 0$, as $g$ is uniformly continuous. As $g$ vanishes outside $[a-1, b+1]$, the second integral is also less than $\frac{\varepsilon}{3}$. Thus if $|h| < \delta$, $\int |f(x+h) - f(x)|\,dm < \varepsilon$. This proves that $\lim_{k \to \infty} \int f(x) \sin kx\,dx = 0$. □
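For a concrete instance of the lemma, take $f = \mathbf{1}_{[0,1]}$, for which $s_k = \int_0^1 \sin kx\,dx = (1 - \cos k)/k$ can also be computed exactly (the choice of $f$, the grid and the values of $k$ are mine):

```python
import math

ks = (1, 10, 100, 1000)

def s_k(k, n=20_000):
    # midpoint approximation of  int_0^1 sin(kx) dx  for f = 1_{[0,1]}
    h = 1.0 / n
    return sum(math.sin(k * (i + 0.5) * h) for i in range(n)) * h

numeric = [s_k(k) for k in ks]
exact = [(1.0 - math.cos(k)) / k for k in ks]
print(numeric)
print(exact)   # both sequences tend to 0 as k grows
```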
4.7 Probability
4.7.1 Integration with respect to probability distributions
Let X be a random variable with probability distribution Px. The following
theorem shows how to perform a change of variable when integrating a function
of X. In other words, it shows how to change the measure in an integral. This
is fundamental in applying integration theory to probabilities. We emphasise
again that only the closure properties of $\sigma$-fields and the countable additivity of measures are needed for the theorems we shall apply here, so that we can use an abstract formulation of a probability space $(\Omega, \mathcal{F}, P)$ in discussing their applications.
Theorem 4.41
Given a random variable $X : \Omega \to \mathbb{R}$,
$$\int_{\Omega} g(X(\omega))\,dP(\omega) = \int_{\mathbb{R}} g(x)\,dP_X(x). \tag{4.5}$$
Proof
We employ the technique described in Remark 4.18. For the indicator function $g = \mathbf{1}_A$ we have $P(X \in A)$ on both sides. Then by linearity we have the result for simple functions. Approximation of non-negative measurable $g$ by a monotone sequence of simple functions, combined with the monotone convergence theorem, gives the equality for such $g$. The case of general $g \in \mathcal{L}^1$ follows as before from the linearity of the integral, using $g = g^+ - g^-$. □
The formula is useful in the case where the form of Px is known and allows
one to carry out explicit computations.
Before we proceed to these situations, consider a very simple case as an illustration of the formula. Suppose that $X$ is constant, i.e. $X(\omega) \equiv a$. Then on the left in (4.5) we have the integral of a constant function, which equals $g(a) P(\Omega) = g(a)$ according to the general scheme of integrating indicator functions. On the right $P_X = \delta_a$, and thus we have a method of computing an integral with respect to Dirac measure: $\int g(x)\,d\delta_a = g(a)$.
For discrete $X$ taking values $a_i$ with probabilities $p_i$ we have
$$\int g(X)\,dP = \sum_i g(a_i)\,p_i,$$
which is a well-known formula from elementary probability theory (see also Section 3.5.3). In this case we have $P_X = \sum_i p_i \delta_{a_i}$ and, on the right, the integral with respect to the combination of measures is the combination of the integrals:
$$\int g(x)\,dP_X = \sum_i p_i \int g(x)\,d\delta_{a_i}(x).$$
In fact, this is a general property.
Theorem 4.42
If $P_X = \sum_i p_i P_i$, where the $P_i$ are probability measures, $\sum_i p_i = 1$, $p_i \ge 0$, then
$$\int g(x)\,dP_X(x) = \sum_i p_i \int g(x)\,dP_i(x).$$
Proof
The method is the same as above: first consider an indicator function $\mathbf{1}_A$, for which the claim is just the definition of $P_X$: on the left we have $P_X(A)$, on the right $\sum_i p_i P_i(A)$. Then by additivity we get the formula for simple functions, and finally, approximation and use of the monotone convergence theorem completes the proof as before. □
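The discrete expectation formula can be checked by simulation (a sketch; the values $a_i$, the weights $p_i$ and the function $g$ are toy choices of mine, not from the text):

```python
import random

a = [0.0, 1.0, 4.0]      # values of a discrete random variable X
p = [0.5, 0.3, 0.2]      # their probabilities

def g(x):
    return x * x + 1.0

# right-hand side:  sum_i g(a_i) p_i
rhs = sum(g(ai) * pi for ai, pi in zip(a, p))

# left-hand side:  int g(X) dP, approximated by a Monte Carlo average
random.seed(0)
n = 200_000
sample = random.choices(a, weights=p, k=n)
lhs = sum(g(x) for x in sample) / n

print(lhs, rhs)   # the two agree up to Monte Carlo error
```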
4.7.2 Absolutely continuous measures: examples of densities
The measures $P$ of the form
$$A \mapsto P(A) = \int_A f\,dm$$
with non-negative integrable $f$ will be called absolutely continuous, and the function $f$ will be called a density of $P$ with respect to Lebesgue measure, or simply a density. Clearly, for $P$ to be a probability we have to impose the condition
$$\int f\,dm = 1.$$
Students of probability often have an oversimplified mental picture of the
world of random variables, believing that a random variable is either discrete
or absolutely continuous. This image stems from the practical computational
approach of many elementary textbooks, which present probability without the
necessary background in measure theory. We have already provided a simple
example which shows this to be a false dichotomy (Example 3.17).
The simplest example of a density is this: let $\Omega \subset \mathbb{R}$ be a Borel set with finite Lebesgue measure and put
$$f(x) = \begin{cases} \frac{1}{m(\Omega)} & \text{if } x \in \Omega \\ 0 & \text{otherwise.} \end{cases}$$
We have already come across this sort of measure in the previous chapter,
that is, the probability distribution of a specific random variable. We say that
in this case the measure (distribution) is uniform. It corresponds to the case
where the values of the random variable are spread evenly across some set,
typically an interval, such as in choosing a number at random (Example 2.34).
Slightly more complicated is the so-called triangle distribution with the
density of the form shown in Figure 4.5.
The most famous is the Gaussian or normal density
$$n(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\Big\{ -\frac{(x-\mu)^2}{2\sigma^2} \Big\}. \tag{4.6}$$
This function is symmetric with respect to $x = \mu$, and vanishes at infinity, i.e. $\lim_{x \to -\infty} n(x) = 0 = \lim_{x \to \infty} n(x)$ (see Figure 4.6).
Figure 4.5 Triangle distribution
Figure 4.6 Gaussian distribution
Exercise 4.11
Show that $\int_{-\infty}^{\infty} n(x)\,dx = 1$.

Hint: First consider the case $\mu = 0$, $\sigma = 1$, and then transform the general case to this.

The meaning of the number $\mu$ will become clear below, and $\sigma$ will be explained in the next chapter.
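A numerical version of this exercise (a sketch; the truncation at ten standard deviations and the grid size are arbitrary choices of mine):

```python
import math

def n_density(x, mu, sigma):
    # the normal density (4.6)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def total_mass(mu, sigma, k=10.0, steps=200_000):
    # midpoint rule over [mu - k*sigma, mu + k*sigma]; the tail beyond is negligible
    lo, hi = mu - k * sigma, mu + k * sigma
    h = (hi - lo) / steps
    return sum(n_density(lo + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h

print(total_mass(0.0, 1.0), total_mass(2.0, 3.0))   # both close to 1
```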
Another widely used example is the Cauchy density:
$$c(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$
This density gives rise to many counterexamples to 'theorems' which are too
good to be true.
Exercise 4.12
Show that $\int_{-\infty}^{\infty} c(x)\,dx = 1$.
The exponential density is given by
$$f(x) = \begin{cases} c\,e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Exercise 4.13
Find the constant $c$ for $f$ to be the density of a probability distribution.
The gamma distribution is really a large family of distributions, indexed by
a parameter $t > 0$. It contains the exponential distribution as the special case where $t = 1$. Its density is defined as
$$f(x) = \begin{cases} \frac{1}{\Gamma(t)}\,\lambda^t x^{t-1} e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$
where the gamma function is $\Gamma(t) = \int_0^{\infty} x^{t-1} e^{-x}\,dx$.
The gamma distribution contains another widely used distribution as a special case: the distribution obtained from the density $f$ when $\lambda = \frac{1}{2}$ and $t = \frac{d}{2}$ for some $d \in \mathbb{N}$ is denoted by $\chi^2(d)$ and called the chi-squared distribution with $d$ degrees of freedom.
The (cumulative) distribution function corresponding to a density is given
by
$$F(y) = \int_{-\infty}^{y} f(x)\,dx.$$
If $f$ is continuous then $F$ is differentiable and $F'(x) = f(x)$ by the fundamental
theorem of calculus (see Proposition 4.32). We say that F is absolutely continuous if this relation holds with integrable f, and then f is the density of
the probability measure induced by F. The following example due to Lebesgue
shows that continuity of F is not sufficient for the existence of a density.
Example 4.43
Recall the Lebesgue function $F$ defined on page 20. We have $F(y) = 0$ for $y \le 0$, $F(y) = 1$ for $y \ge 1$, $F(y) = \frac{1}{2}$ for $y \in [\frac{1}{3}, \frac{2}{3})$, $F(y) = \frac{1}{4}$ for $y \in [\frac{1}{9}, \frac{2}{9})$, $F(y) = \frac{3}{4}$ for $y \in [\frac{7}{9}, \frac{8}{9})$, and so on. The function $F$ is constant on the intervals removed in the process of constructing the Cantor set (see Figure 4.7).

It is differentiable almost everywhere and the derivative is zero. So $F$ cannot be absolutely continuous, since then $f$ would be zero almost everywhere, but on the other hand its integral is 1.
Figure 4.7 Lebesgue's function
We now define the (cumulative) distribution function of a random variable
$X : \Omega \to \mathbb{R}$, where, as above, $(\Omega, \mathcal{F}, P)$ is a given probability space:
$$F_X(y) = P(\{\omega : X(\omega) \le y\}) = P_X((-\infty, y]).$$
Proposition 4.44
(i) $F_X$ is non-decreasing ($y_1 \le y_2$ implies $F_X(y_1) \le F_X(y_2)$),

(ii) $\lim_{y \to \infty} F_X(y) = 1$, $\lim_{y \to -\infty} F_X(y) = 0$,

(iii) $F_X$ is right-continuous (if $y \to y_0$, $y \ge y_0$, then $F_X(y) \to F_X(y_0)$).
Exercise 4.14
Show that $F_X$ is continuous if and only if $P_X(\{y\}) = 0$ for all $y$.
Exercise 4.15
Find Fx for
(a) a constant random variable $X$, $X(\omega) = a$ for all $\omega$,

(b) $X : [0,1] \to \mathbb{R}$ given by $X(\omega) = \min\{\omega, 1-\omega\}$ (the distance to the nearest endpoint of the interval $[0,1]$),

(c) $X : [0,1]^2 \to \mathbb{R}$, the distance to the nearest edge of the square $[0,1]^2$.
The fact that we are doing probability on subsets of $\mathbb{R}^n$ as sample spaces turns out not to be restrictive. In fact, the interval $[0,1]$ is sufficient, as the following Skorokhod representation theorem shows.
Theorem 4.45
If a function $F : \mathbb{R} \to [0,1]$ satisfies conditions (i)-(iii) of Proposition 4.44, then there is a random variable $X : [0,1] \to \mathbb{R}$, defined on the probability space $([0,1], \mathcal{B}, m_{[0,1]})$, such that $F = F_X$.
Proof
We write, for $\omega \in [0,1]$,
$$X^+(\omega) = \inf\{x : F(x) > \omega\}, \qquad X^-(\omega) = \sup\{x : F(x) < \omega\}.$$
Figure 4.8 Construction of X-; continuity point
Figure 4.9 Construction of X-; discontinuity point
Three possible cases are illustrated in Figures 4.8, 4.9 and 4.10. We show that $F_{X^-} = F$, and for that we have to show that $F(y) = m(\{\omega : X^-(\omega) \le y\})$. The set $\{\omega : X^-(\omega) \le y\}$ is an interval with left endpoint $0$. We are done if we show that its right endpoint is $F(y)$, i.e. if $X^-(\omega) \le y$ is equivalent to $\omega \le F(y)$.
Figure 4.10 Construction of X-; 'flat' piece
Suppose that $\omega \le F(y)$. Then
$$\{x : F(x) < \omega\} \subset \{x : F(x) < F(y)\} \subset \{x : x \le y\}$$
(the last inclusion by the monotonicity of $F$), hence $X^-(\omega) = \sup\{x : F(x) < \omega\} \le y$.

Suppose that $X^-(\omega) \le y$. By monotonicity $F(X^-(\omega)) \le F(y)$. By the right-continuity of $F$, $\omega \le F(X^-(\omega))$ (if $\omega > F(X^-(\omega))$, then there is $x_0 > X^-(\omega)$ such that $F(X^-(\omega)) < F(x_0) < \omega$, which is impossible since $x_0$ is in the set whose supremum is taken to get $X^-(\omega)$), so $\omega \le F(y)$.
For future use we also show that $F_{X^+} = F$. It is sufficient to see that $m(\{\omega : X^-(\omega) < X^+(\omega)\}) = 0$ (which is intuitively clear, as this may happen only where the graph of $F$ is 'flat', and there are countably many values corresponding to the 'flat' pieces, their Lebesgue measure being zero). More rigorously,
$$\{\omega : X^-(\omega) < X^+(\omega)\} = \bigcup_{q \in \mathbb{Q}} \{\omega : X^-(\omega) \le q < X^+(\omega)\}$$
and $m(\{\omega : X^-(\omega) \le q < X^+(\omega)\}) = m(\{\omega : X^-(\omega) \le q\} \setminus \{\omega : X^+(\omega) \le q\}) = F(q) - F(q) = 0$. □
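The construction in the proof is exactly the 'inverse transform' recipe used in simulation. A small sketch (the exponential $F$, the bisection bracket and the grid are my own choices) computes $X^-$ by bisection and checks that $m(\{\omega : X^-(\omega) \le y\}) = F(y)$ on a uniform grid of $\omega$:

```python
import math

def F(x):
    # exponential distribution function F(x) = 1 - e^{-x} for x >= 0 (a test case)
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def X_minus(w, lo=-1.0, hi=100.0, iters=100):
    # X^-(w) = sup{x : F(x) < w}, computed by bisection
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# empirical check of F_{X^-} = F on the grid w = (k + 1/2)/n in [0,1]
n = 4_000
xs = [X_minus((k + 0.5) / n) for k in range(n)]
for y in (0.5, 1.0, 2.0):
    frac = sum(1 for x in xs if x <= y) / n
    print(y, frac, F(y))   # frac matches F(y) up to roughly 1/n
```

For this strictly increasing continuous $F$, $X^-$ is just the quantile function $-\log(1-\omega)$.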
The following theorem provides a powerful method for calculating integrals
relative to absolutely continuous distributions. The result holds for general
measures but we formulate it for a probability distribution of a random variable
in order not to overload or confuse the notation.
Theorem 4.46
If $P_X$ defined on $\mathbb{R}^n$ is absolutely continuous with density $f_X$, and $g : \mathbb{R}^n \to \mathbb{R}$ is integrable with respect to $P_X$, then
$$\int_{\mathbb{R}^n} g(x)\,dP_X(x) = \int_{\mathbb{R}^n} f_X(x)\,g(x)\,dx.$$
Proof
For an indicator function $g(x) = \mathbf{1}_A(x)$ we have $P_X(A)$ on the left, which equals $\int_A f_X(x)\,dx$ by the form of $P_X$, and consequently is equal to $\int_{\mathbb{R}^n} \mathbf{1}_A(x) f_X(x)\,dx$, i.e. the right-hand side. Extension to simple functions by linearity and to general integrable $g$ by limit passage is routine. □
Corollary 4.47
In the situation of the previous theorem we have
$$\int_{\Omega} g(X)\,dP = \int_{\mathbb{R}^n} f_X(x)\,g(x)\,dx.$$

Proof

This is an immediate consequence of the above theorem and Theorem 4.41. □
We conclude this section with a formula for a density of a function of a
random variable with given density. Suppose that $f_X$ is known and we want to find the density of $Y = g(X)$.
Theorem 4.48
If $g : \mathbb{R} \to \mathbb{R}$ is increasing and differentiable (thus invertible), then
$$f_Y(y) = f_X(g^{-1}(y))\,\frac{d}{dy} g^{-1}(y).$$

Proof

Consider the distribution function:
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y)).$$
Differentiate with respect to $y$ to get the result. □
Remark 4.49
A similar result holds if $g$ is decreasing. The same argument as above gives
$$f_Y(y) = -f_X(g^{-1}(y))\,\frac{d}{dy} g^{-1}(y).$$
Example 4.50
If $X$ has the standard normal distribution
$$n(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2}$$
(i.e. $\mu = 0$ and $\sigma = 1$ in (4.6)), then the density of $Y = \mu + \sigma X$ is given by (4.6). This follows at once from Theorem 4.48: $g^{-1}(y) = \frac{y - \mu}{\sigma}$; its derivative is equal to $\frac{1}{\sigma}$.
Exercise 4.16
Find the density of $Y = X^3$ where $f_X = \mathbf{1}_{[0,1]}$.
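A grid check of this exercise (a sketch; the answer used below is my own computation from Theorem 4.48, not quoted from the text):

```python
# Applying Theorem 4.48 to g(x) = x^3 with f_X = 1_{[0,1]} gives (my computation):
#   f_Y(y) = f_X(y^{1/3}) * (1/3) y^{-2/3} = (1/3) y^{-2/3}  on (0,1),
# equivalently F_Y(y) = y^{1/3}.

n = 100_000
xs = [(k + 0.5) / n for k in range(n)]   # X "uniform" on a fine grid of [0,1]
ys = [x ** 3 for x in xs]                # corresponding values of Y = X^3

def F_emp(y):
    # empirical distribution function of Y on the grid
    return sum(1 for v in ys if v <= y) / n

for y in (0.001, 0.125, 0.729):
    print(y, F_emp(y), y ** (1.0 / 3.0))   # F_emp(y) matches y^{1/3}
```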
4.7.3 Expectation of a random variable
If $X$ is a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$ then we introduce the following notation:
$$\mathbb{E}(X) = \int_{\Omega} X\,dP,$$
and we call this abstract integral the (mathematical) expectation of $X$.

Using the results from the previous section we immediately have the following formulae: the expectation can be computed using the probability distribution,
$$\mathbb{E}(X) = \int_{-\infty}^{\infty} x\,dP_X(x),$$
and for absolutely continuous $X$ we have
$$\mathbb{E}(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx.$$
Example 4.51
Suppose that $P_X = \frac{1}{2} P_1 + \frac{1}{2} P_2$, where $P_1 = \delta_a$ and $P_2$ has a density $f_2$. Then
$$\mathbb{E}(X) = \frac{1}{2}\,a + \frac{1}{2} \int x\,f_2(x)\,dx.$$
So, going back to Example 3.17, we can compute the expectation of the random variable considered there:
$$\mathbb{E}(X) = \frac{1}{2} \cdot 0 + \frac{1}{2} \int x\,f(x)\,dx = 6.25,$$
with $f$ the density from that example.
Exercise 4.17
Find the expectation of
(a) a constant random variable $X$, $X(\omega) = a$ for all $\omega$,

(b) $X : [0,1] \to \mathbb{R}$ given by $X(\omega) = \min\{\omega, 1-\omega\}$ (the distance to the nearest endpoint of the interval $[0,1]$),

(c) $X : [0,1]^2 \to \mathbb{R}$, the distance to the nearest edge of the square $[0,1]^2$.
Exercise 4.18
Find the expectation of a random variable with
(a) uniform distribution over the interval [a, b],
(b) triangle distribution,
(c) exponential distribution.
4.7.4 Characteristic function
In what follows we will need the integrals of some complex functions. The
theory is a straightforward extension of the real case.
Let $Z = X + iY$, where $X$, $Y$ are real-valued random variables, and define
$$\int Z\,dP = \int X\,dP + i \int Y\,dP.$$
Clearly, linearity of the integral and the dominated convergence theorem hold in the complex case. Another important relation which remains true is:
$$\Big| \int Z\,dP \Big| \le \int |Z|\,dP.$$
To see this, consider the polar decomposition $\int Z\,dP = |\int Z\,dP|\,e^{-i\theta}$. Then, with $\Re(z)$ as the real part of the complex number $z$, $|\int Z\,dP| = e^{i\theta} \int Z\,dP = \int e^{i\theta} Z\,dP$ is real, hence equal to $\int \Re(e^{i\theta} Z)\,dP$; but $\Re(e^{i\theta} Z) \le |e^{i\theta} Z| = |Z|$ and we are done.
The function we wish to integrate is $\exp\{itX\}$, where $X$ is a real random variable and $t \in \mathbb{R}$. Then
$$\int \exp\{itX\}\,dP = \int \cos(tX)\,dP + i \int \sin(tX)\,dP,$$
which always exists, by the boundedness of $x \mapsto \exp\{itx\}$.
Definition 4.52
For a random variable $X$ we write
$$\varphi_X(t) = \mathbb{E}(e^{itX})$$
for $t \in \mathbb{R}$. We call $\varphi_X$ the characteristic function of $X$.

To compute $\varphi_X$ it is sufficient to know the distribution of $X$:
$$\varphi_X(t) = \int e^{itx}\,dP_X(x),$$
and in the absolutely continuous case
$$\varphi_X(t) = \int e^{itx} f_X(x)\,dx.$$
Some basic properties of the characteristic function are given below. Other
properties are explored in Chapters 6 and 8.
Theorem 4.53
The function $\varphi_X$ satisfies

(i) $\varphi_X(0) = 1$, $|\varphi_X(t)| \le 1$,

(ii) $\varphi_{aX+b}(t) = e^{itb} \varphi_X(at)$.

Proof

(i) The value at $0$ is $1$ since the expectation of a constant function is its value. The estimate follows from Proposition 4.23 (iii): $|\int e^{itx}\,dP_X(x)| \le \int |e^{itx}|\,dP_X(x) = 1$.

(ii) Here we use the linearity of the expectation:
$$\varphi_{aX+b}(t) = \mathbb{E}(e^{it(aX+b)}) = \mathbb{E}(e^{itaX} e^{itb}) = e^{itb}\,\mathbb{E}(e^{i(ta)X}) = e^{itb}\,\varphi_X(ta),$$
as required. □
Exercise 4.19
Find the characteristic function of a random variable with
(a) uniform distribution over the interval [a, b],
(b) exponential distribution,
(c) Gaussian distribution.
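For part (a), the closed form $\varphi_X(t) = \frac{e^{itb} - e^{ita}}{it(b-a)}$ is my worked answer (a standard computation, not quoted from the text); a quick numerical cross-check against the integral definition:

```python
import cmath

def phi_uniform(t, a, b):
    # closed form (e^{itb} - e^{ita}) / (i t (b - a)); my worked answer to (a)
    if t == 0.0:
        return 1.0 + 0.0j
    return (cmath.exp(1j * t * b) - cmath.exp(1j * t * a)) / (1j * t * (b - a))

def phi_numeric(t, a, b, n=100_000):
    # midpoint approximation of  int e^{itx} f_X(x) dx  with f_X = 1/(b-a) on [a,b]
    h = (b - a) / n
    s = sum(cmath.exp(1j * t * (a + (i + 0.5) * h)) for i in range(n)) * h
    return s / (b - a)

for t in (0.0, 0.5, 1.0, 3.0):
    print(t, phi_uniform(t, -1.0, 2.0), phi_numeric(t, -1.0, 2.0))
```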
4.7.5 Applications to mathematical finance
Consider a derivative security of European type, that is, a random variable of the form $f(S(N))$, where $S(n)$, $n = 1, \dots, N$, is the price of the underlying security, which we call a stock for simplicity. (Or we write $f(S(T))$, where the underlying security is described in continuous time $t \in [0,T]$ with prices $S(t)$.) One of the crucial problems in finance is to find the price $Y(0)$ of such a security. Here we assume that the reader is familiar with the following fact, which is true for certain specific models consisting of a probability space and random variables representing the stock prices:
$$Y(0) = \exp\{-rT\}\,\mathbb{E}(f(S(T))), \tag{4.7}$$
where $r$ is the risk-free interest rate for continuous compounding. This will be explained in some detail in Section 7.4.4, but here we just want to draw some conclusions from this formula using the experience gathered in the present chapter.
In particular, taking account of the form of the payoff functions for the European call ($f(x) = (x-K)^+$) and put ($f(x) = (K-x)^+$), we have the following general formulae for the values of the call and put, respectively:
$$C = \exp\{-rT\}\,\mathbb{E}(S(T) - K)^+, \qquad P = \exp\{-rT\}\,\mathbb{E}(K - S(T))^+.$$
Without relying on any particular model one can prove the following relation, called call-put parity (see [4] for instance):
$$S(0) = C - P + K \exp\{-rT\}. \tag{4.8}$$
Proposition 4.54
The right-hand side of the call-put parity identity is independent of K.
Remark 4.55
This proposition allows us to make an interesting observation, which is a version
of a famous result in finance, namely the Miller-Modigliani theorem which says
that the value of a company does not depend on the way it is financed. Let us
very briefly recall that the value of a company is the sum of equity (represented
by stock) and debt, so the theorem says that the level of debt has no impact on
the company's value. Assume that the company borrowed $K \exp\{-rT\}$ at the rate
equal to r and it has to pay back the amount of K at time T. Should it fail, the
company goes bankrupt. So the stockholders, who control the company, can
'buy it back' by paying K. This will make sense only if the value S(T) of the
company exceeds K. The stock can be regarded as a call option so its value is
C. The value of the debt is thus K exp{ -rT} - P, less than the present value
of K, which captures the risk that the sum K may not be recovered in full.
We now evaluate the expectation to establish explicit forms of the general
formula (4.7) in the two most widely used models.
Consider first the binomial model introduced in Section 2.6.3. Assume that the probability space is equipped with a measure determined by the probability $p = \frac{R-D}{U-D}$ for the up movement in a single step, where $R = \exp\{rh\}$, $h$ being the length of one step. (This probability is called risk-neutral; observe that $\mathbb{E}(\eta) = R$.) To ensure that $0 < p < 1$ we assume throughout that $D < R < U$.
Proposition 4.56
In the binomial model the price C of a call option with exercise time T = hN is given by the Cox-Ross-Rubinstein formula

C = S(0)Ψ(A, N, pU exp{−rh}) − K exp{−rT} Ψ(A, N, p),

where Ψ(A, N, p) = Σ_{k=A}^{N} (N choose k) p^k (1 − p)^{N−k} and A is the first integer k such that S(0)U^k D^{N−k} > K.
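A minimal sketch of the formula, with all function names and parameter values ours; q = pU exp{−rh} (h = T/N) is the shifted probability in the first term, and the second routine evaluates the defining expectation e^{−rT} E(S(N) − K)⁺ directly, so the two should agree to rounding error.

```python
import math
from math import comb

def crr_call(S0, K, r, T, N, U, D):
    """Cox-Ross-Rubinstein call price via the complementary binomial function."""
    h = T / N
    R = math.exp(r * h)
    p = (R - D) / (U - D)            # risk-neutral up-probability; needs D < R < U
    q = p * U * math.exp(-r * h)
    # A = first integer k with S0 * U^k * D^(N-k) > K
    A = next(k for k in range(N + 1) if S0 * U**k * D**(N - k) > K)
    def psi(a, n, pr):               # complementary binomial distribution function
        return sum(comb(n, k) * pr**k * (1 - pr)**(n - k) for k in range(a, n + 1))
    return S0 * psi(A, N, q) - K * math.exp(-r * T) * psi(A, N, p)

def crr_call_direct(S0, K, r, T, N, U, D):
    """Same price computed directly as exp(-rT) E(S(N) - K)^+."""
    h = T / N
    p = (math.exp(r * h) - D) / (U - D)
    ev = sum(comb(N, k) * p**k * (1 - p)**(N - k)
             * max(S0 * U**k * D**(N - k) - K, 0.0) for k in range(N + 1))
    return math.exp(-r * T) * ev
```

The agreement of the two routines is exactly the algebraic rewriting carried out in the proof of the proposition (Section 4.8).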
In the famous continuous-time Black-Scholes model, the stock price at time T is of the form

S(T) = S(0) exp{(r − σ²/2)T + σw(T)},

where r is the risk-free rate, σ > 0 and w(T) is a random variable with Gaussian distribution with mean 0 and variance T. (The reader familiar with finance will notice that we again assume that the probability space is equipped with a risk-neutral measure.)
Proposition 4.57
We have the following Black-Scholes formula for C:

C = S(0)N(d₁) − K exp{−rT} N(d₂),

where

d₁ = (ln(S(0)/K) + (r + σ²/2)T)/(σ√T),   d₂ = d₁ − σ√T,

and N denotes the cumulative distribution function of the standard Gaussian (normal) distribution.
Exercise 4.20
Find the formula for the put option.
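The call formula is easy to evaluate with only the standard library, since N can be written through the error function. The routine below is a sketch (parameter names are ours), cross-checked against a midpoint Riemann-sum evaluation of the defining integral e^{−rT} E(S(T) − K)⁺ over the Gaussian density.

```python
import math

def norm_cdf(x):
    """Standard Gaussian cumulative distribution function N."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S0, K, r, sigma, T):
    """Black-Scholes call price C = S0 N(d1) - K exp(-rT) N(d2)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def bs_call_by_integration(S0, K, r, sigma, T, n=100000, width=10.0):
    """Midpoint Riemann sum for exp(-rT) E(S(T) - K)^+ over the Gaussian density."""
    h = 2.0 * width / n
    total = 0.0
    for i in range(n):
        y = -width + (i + 0.5) * h
        s = S0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * y)
        total += max(s - K, 0.0) * math.exp(-0.5 * y * y)
    return math.exp(-r * T) * total * h / math.sqrt(2.0 * math.pi)
```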
4.8 Proofs of propositions
Proof (of Proposition 4.5)
Let f = Σᵢ cᵢ1_{Aᵢ}. We have to show that

Σᵢ cᵢ m(Aᵢ ∩ E) = sup Y(E, f).

First, we may take φ = f in the definition of Y(E, f), so the number on the left belongs to Y(E, f) and so

Σᵢ cᵢ m(Aᵢ ∩ E) ≤ sup Y(E, f).

For the converse take any a ∈ Y(E, f), so

a = ∫_E ψ dm = Σⱼ dⱼ m(E ∩ Bⱼ)

for some simple ψ ≤ f. Now

a = Σⱼ Σᵢ dⱼ m(E ∩ Bⱼ ∩ Aᵢ)

by the properties of measure (the Aᵢ form a partition of ℝ). For x ∈ Bⱼ ∩ Aᵢ, f(x) = cᵢ and ψ(x) = dⱼ, and so dⱼ ≤ cᵢ (provided Bⱼ ∩ Aᵢ ≠ ∅). Hence

a ≤ Σⱼ Σᵢ cᵢ m(E ∩ Bⱼ ∩ Aᵢ) = Σᵢ cᵢ m(E ∩ Aᵢ)

since the Bⱼ partition ℝ. □
Proof (of Proposition 4.9)
Let A = {x : f(x) ≤ g(x)}; then Aᶜ is null and f1_A ≤ g1_A. So ∫ f1_A dm ≤ ∫ g1_A dm by Theorem 4.7. But since Aᶜ is null, ∫ f1_{Aᶜ} dm = 0 = ∫ g1_{Aᶜ} dm. So by (v) of the same theorem,

∫_ℝ f dm = ∫_A f dm + ∫_{Aᶜ} f dm = ∫_A f dm ≤ ∫_A g dm = ∫_A g dm + ∫_{Aᶜ} g dm = ∫_ℝ g dm. □
Proof (of Proposition 4.10)
If both f⁺ and f⁻ are measurable then so is f, since f = f⁺ − f⁻. Conversely, (f⁺)⁻¹([a, ∞)) = ℝ if a ≤ 0 and (f⁺)⁻¹([a, ∞)) = f⁻¹([a, ∞)) otherwise; in each case this is a measurable set. Similarly for f⁻. □
Proof (of Proposition 4.15)
Put

sₙ = Σ_{k=0}^{2^{2n}} (k/2ⁿ) 1_{f⁻¹([k/2ⁿ, (k+1)/2ⁿ))},

which are measurable since the sets Aₖ = f⁻¹([k/2ⁿ, (k+1)/2ⁿ)) are measurable. The sequence increases since if we pass from n to n + 1, then each Aₖ is split in half, and to each component of the sum there correspond two new components; the two new values of the fraction are equal to or greater than the old one, respectively. The convergence holds since for each x the values sₙ(x) are fractions of the form k/2ⁿ approximating f(x). Figure 4.2 illustrates the above argument. □
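The dyadic construction can be sampled pointwise. The helper below (our naming) evaluates sₙ at a point where f takes the value v ≥ 0, and the assertions check the two claims of the proof: monotonicity in n and convergence to v.

```python
import math

def s_n(v, n):
    """Value of the approximating simple function s_n where f equals v >= 0."""
    # dyadic floor k/2^n with k <= v*2^n, capped at the top level 2^n
    return min(math.floor(v * 2 ** n) / 2 ** n, float(2 ** n))

for v in [0.0, 0.3, 1.7, math.pi, 10.4]:
    for n in range(1, 20):
        assert s_n(v, n) <= s_n(v, n + 1) <= v  # increasing, and below v
    # once v <= 2^n, the error is at most 2^(-n)
    assert v - s_n(v, 12) <= 2 ** -12
```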
Proof (of Proposition 4.17)
If f ≤ g, then f⁺ ≤ g⁺ but f⁻ ≥ g⁻. These inequalities imply ∫ f⁺ dm ≤ ∫ g⁺ dm and ∫ g⁻ dm ≤ ∫ f⁻ dm. Adding and rearranging gives the result. □
Proof (of Proposition 4.20)
The claim for simple functions f = Σ aᵢ1_{Aᵢ} is just elementary algebra. For non-negative measurable f and positive c, take sₙ ↗ f, note that csₙ ↗ cf, and so

∫ cf dm = lim ∫ csₙ dm = lim c ∫ sₙ dm = c lim ∫ sₙ dm = c ∫ f dm.

Finally, for arbitrary f and c we employ the usual trick of introducing the positive and negative parts. □
Proof (of Proposition 4.23)
(i) Suppose that f(x) = ∞ for x ∈ A with m(A) > 0. Then the simple functions sₙ = n1_A satisfy sₙ ≤ f, but ∫ sₙ dm = n·m(A) and the supremum here is ∞. Thus ∫ f dm = ∞, a contradiction.

(ii) The simple function s(x) = c1_A with c = inf_A f has integral (inf_A f)·m(A) and satisfies s ≤ f, which proves the first inequality. Put t(x) = d1_A with d = sup_A f; then f ≤ t, so ∫ f dm ≤ ∫ t dm, which is the second inequality.

(iii) Note that −|f| ≤ f ≤ |f|, hence −∫ |f| dm ≤ ∫ f dm ≤ ∫ |f| dm and we are done.

(iv) Let Eₙ = f⁻¹([1/n, ∞)) and E = ⋃_{n=1}^∞ Eₙ. The sets Eₙ are measurable and so is E. The function s = (1/n)1_{Eₙ} is simple, with s ≤ f. Hence ∫ s dm ≤ ∫ f dm = 0, so ∫ s dm = 0, hence (1/n)·m(Eₙ) = 0. Thus m(Eₙ) = 0 for all n. Since Eₙ ⊂ Eₙ₊₁, m(E) = lim m(Eₙ) = 0. But E = {x : f(x) > 0}, so f is zero outside the null set E. □
Proof (of Proposition 4.28)
If n → ∞ then 1_{[−n,n]} → 1, hence gₙ = f1_{[−n,n]} → f. The convergence is dominated: |gₙ| ≤ |f|, and by the dominated convergence theorem we have ∫ |f − gₙ| dm → 0. Similarly, hₙ = min(f, n) → f as n → ∞ and |hₙ| ≤ |f|, so ∫ |f − hₙ| dm → 0. □
Proof (of Proposition 4.29)
Using ∫ (f + g) dm = ∫ f dm + ∫ g dm we easily obtain (by induction)

∫ Σ_{k=1}^n fₖ dm = Σ_{k=1}^n ∫ fₖ dm

for any n. The sequence Σ_{k=1}^n fₖ is increasing (fₖ ≥ 0) and converges to f = Σ_{k=1}^∞ fₖ. So the monotone convergence theorem gives

∫ Σ_{k=1}^∞ fₖ dm = lim_{n→∞} ∫ Σ_{k=1}^n fₖ dm = lim_{n→∞} Σ_{k=1}^n ∫ fₖ dm = Σ_{k=1}^∞ ∫ fₖ dm,

as required. □
Proof (of Proposition 4.32)
Continuous functions are measurable, and f is bounded on [a, b], hence f ∈ L¹[a, b]. Fix a < x < x + h < b; then F(x + h) − F(x) = ∫_x^{x+h} f dm, since the intervals [a, x] and (x, x + h] are disjoint, so that the integral is additive with respect to the upper endpoint. By the mean value property the value of the right-hand integral is contained in the interval [Ah, Bh], where A = inf{f(t) : t ∈ [x, x + h]} and B = sup{f(t) : t ∈ [x, x + h]}. Both extrema are attained, as f is continuous, so we can find t₁, t₂ in [x, x + h] with A = f(t₁), B = f(t₂). Thus

f(t₁) ≤ (1/h) ∫_x^{x+h} f dm ≤ f(t₂).

The intermediate value theorem provides θ ∈ [0, 1] such that

f(x + θh) = (1/h) ∫_x^{x+h} f dm = (F(x + h) − F(x))/h.

Letting h → 0, the continuity of f ensures that F′(x) = f(x). □
Proof (of Proposition 4.44)
(i) If y₁ ≤ y₂, then {ω : X(ω) ≤ y₁} ⊂ {ω : X(ω) ≤ y₂} and by the monotonicity of measure

F_X(y₁) = P({ω : X(ω) ≤ y₁}) ≤ P({ω : X(ω) ≤ y₂}) = F_X(y₂).

(ii) Let n → ∞; then ⋃ₙ{ω : X(ω) ≤ n} = Ω (the sets increase). Hence P({ω : X(ω) ≤ n}) → P(Ω) = 1 by Theorem 2.19 (i), and so lim_{y→∞} F_X(y) = 1. For the second claim consider F_X(−n) = P({ω : X(ω) ≤ −n}) and note that lim_{y→−∞} F_X(y) = P(⋂ₙ{ω : X(ω) ≤ −n}) = P(∅) = 0.

(iii) This follows directly from Theorem 2.19 (ii) with Aₙ = {ω : X(ω) ≤ yₙ}, yₙ ↘ y, because F_X(y) = P(⋂ₙ{ω : X(ω) ≤ yₙ}). □
Proof (of Proposition 4.54)
Inserting the formulae for the option prices into call-put parity, we have

S(0) = exp{−rT} ( ∫_Ω (S(T) − K)⁺ dP − ∫_Ω (K − S(T))⁺ dP + K )
     = exp{−rT} ( ∫_{S(T)≥K} (S(T) − K) dP − ∫_{S(T)<K} (K − S(T)) dP + K )
     = exp{−rT} ∫_Ω S(T) dP,

which is independent of K, as claimed. □
Proof (of Proposition 4.56)
The general formula C = e^{−rT} E(S(N) − K)⁺, where S(N) has binomial distribution, gives

C = e^{−rT} Σ_{k=0}^{N} (N choose k) p^k (1 − p)^{N−k} (S(0)U^k D^{N−k} − K)⁺
  = e^{−rT} Σ_{k=A}^{N} (N choose k) p^k (1 − p)^{N−k} (S(0)U^k D^{N−k} − K).

We can rewrite this as follows: note that if we set q = pU e^{−rh} then 1 − q = (1 − p)D e^{−rh}, so that 0 ≤ q ≤ 1 and

C = S(0) Σ_{k=A}^{N} (N choose k) q^k (1 − q)^{N−k} − K e^{−rT} Σ_{k=A}^{N} (N choose k) p^k (1 − p)^{N−k},

which we write concisely as

C = S(0)Ψ(A, N, q) − K e^{−rT} Ψ(A, N, p)     (4.9)

where Ψ(A, N, p) = Σ_{k=A}^{N} (N choose k) p^k (1 − p)^{N−k} is the complementary binomial distribution function. □
Proof (of Proposition 4.57)
To compute the expectation E(e^{−rT}(S(T) − K)⁺) we employ the density of the Gaussian distribution, hence

C = e^{−rT} (1/√(2π)) ∫_{−∞}^{+∞} (S(0) e^{(r − σ²/2)T + σ√T y} − K)⁺ e^{−y²/2} dy.

The integration reduces to the values of y satisfying

S(0) e^{−σ²T/2} e^{yσ√T} − K e^{−rT} ≥ 0,

since otherwise the integrand is zero. Solving for y gives

y ≥ d = (ln(K e^{−rT}/S(0)) + σ²T/2) / (σ√T).

For those y we can drop the positive part and employ linearity. The first term on the right is

S(0) (1/√(2π)) ∫_d^{+∞} e^{yσ√T − σ²T/2} e^{−y²/2} dy = S(0) (1/√(2π)) ∫_d^{+∞} e^{−(y − σ√T)²/2} dy.

Hence the substitution z = y − σ√T yields

S(0) (1/√(2π)) ∫_{d−σ√T}^{+∞} e^{−z²/2} dz = S(0) N(σ√T − d),

where N is the cumulative distribution function of the Gaussian (normal) distribution. The second term is

−K e^{−rT} (1/√(2π)) ∫_d^{+∞} e^{−y²/2} dy = −K e^{−rT} N(−d),

so writing d₁ = −d + σ√T, d₂ = −d, we are done. □
5
Spaces of integrable functions
Until now we have treated the points of the measure space (ℝ, M, m) and, more generally, of any abstract probability space (Ω, F, P), as the basic objects, and regarded measurable or integrable functions as mappings associating real numbers with them. We now alter our point of view a little, by treating an integrable function as a 'point' in a function space, or, more precisely, as an element of a normed vector space. For this we need some extra structure on the space of functions we deal with, and we need to come to terms with the fact that the measure and integral cannot distinguish between functions which are almost everywhere equal.

The additional structure we require is a concept of distance (i.e. a metric) between integrable functions. By analogy with the familiar Euclidean distance for vectors in ℝⁿ we shall obtain the distance between two functions as the length, or norm, of their difference, thus utilising the vector space structure of the space of functions. We shall be able to do this in a variety of ways, each with its own advantages. Unlike the situation in ℝⁿ, where all norms turn out to be equivalent, we now obtain genuinely different distance functions.

It is worth noting that the spaces of functions we shall discuss are all infinite-dimensional vector spaces: this can be seen already by considering the vector space C([a, b], ℝ) of real-valued continuous functions defined on [a, b] and noting that a polynomial function of degree n cannot be represented as a linear combination of polynomials of lower degree.
Finally, recall that in introducing characteristic functions at the end of the previous chapter, we needed to extend the concept of integrability to complex-valued functions. We observed that for f = u + iv the integral defined by ∫_E f dm = ∫_E u dm + i ∫_E v dm is linear, and that the inequality |∫_E f dm| ≤ ∫_E |f| dm remains valid. When considering measurable functions f : E → ℂ in defining the appropriate spaces of integrable functions in this chapter, this inequality will show that ∫_E f dm ∈ ℂ is well defined.

The results proved below extend to the case of complex-valued functions, unless otherwise specified. When wishing to emphasise that we are dealing with complex-valued functions in particular applications or examples, we shall use notation such as f ∈ L¹(E, ℂ) to indicate this. Complex-valued functions will be of particular interest when we consider the important space L²(E) of 'square-integrable' functions.
5.1 The space L¹

First we recall the definition of a general concept of 'distance' between points of a set:

Definition 5.1
Let X be any set. The function d : X × X → ℝ is a metric on X (and (X, d) is called a metric space) if it satisfies:

(i) d(x, y) ≥ 0 for all x, y ∈ X,

(ii) d(x, y) = 0 if and only if x = y,

(iii) d(y, x) = d(x, y) for all x, y ∈ X,

(iv) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X.
The final property is known as the triangle inequality and generalizes the well-known inequality of that name for vectors in ℝⁿ. When X is a vector space (as will be the case in almost all our examples) there is a very simple way to generate a metric by defining the distance between two vectors as the 'length' of their difference. For this we require a further definition:
Definition 5.2
Let X be a vector space over ℝ (or ℂ). The function x ↦ ‖x‖ from X into ℝ is a norm on X if it satisfies:
(i) ‖x‖ ≥ 0 for all x ∈ X,

(ii) ‖x‖ = 0 if and only if x = 0,

(iii) ‖ax‖ = |a|‖x‖ for all a ∈ ℝ (or ℂ) and x ∈ X,

(iv) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ X.

Clearly a norm x ↦ ‖x‖ on X induces a metric by setting d(x, y) = ‖x − y‖. The triangle inequality follows once we observe that ‖x − z‖ = ‖(x − y) + (y − z)‖ and apply (iv).
We naturally wish to use the integral to define the concept of distance between functions in 𝓛¹(E), for measurable E ⊂ ℝ. The presence of null sets in (ℝ, M, m) means that the integral cannot distinguish between the function that is identically 0 and one which is 0 a.e. The natural idea of defining the 'length' of the vector f as ∫_E |f| dm thus runs into trouble, since it would be possible for non-zero elements of 𝓛¹(E) to have 'zero length'.

The solution adopted is to identify functions which are a.e. equal, by defining an equivalence relation on 𝓛¹(E) and defining the length function for the resulting equivalence classes of functions, rather than for the functions themselves. Thus we define

L¹(E) = 𝓛¹(E)/≡,

where the equivalence relation is given by:

f ≡ g if and only if f(x) = g(x) for almost all x ∈ E

(that is, {x ∈ E : f(x) ≠ g(x)} is null).

Write [f] for the equivalence class containing the function f ∈ 𝓛¹(E). Thus h ∈ [f] iff h(x) = f(x) a.e.
Exercise 5.1

Check that ≡ is an equivalence relation on 𝓛¹(E).
We now show that L¹(E) is a vector space; 𝓛¹ is a vector space by Theorem 4.21. However, this requires that we explain what we mean by a linear combination of equivalence classes. This can be done quite generally for any equivalence relation; here we shall focus on what is needed in our particular case: define [f] + [g] as the class [f + g] of f + g, i.e. h ∈ [f] + [g] iff h(x) = f(x) + g(x) except possibly on some null set. This is consistent, since the union of two null sets is null. The definition clearly does not depend on the choice of representative taken from each equivalence class. Similarly for multiplication by constants: a[f] = [af] for a ∈ ℝ. Hence L¹(E) is a vector space with these operations.
Convention Strictly speaking we should continue to distinguish between the equivalence class [f] ∈ L¹(E) and the function f ∈ 𝓛¹(E) which is a representative of this class. To do so consistently in all that follows would, however, obscure the underlying ideas, and there is no serious loss of clarity in treating f interchangeably as a member of L¹ and of 𝓛¹, depending on the context. In other words, by treating the equivalence class [f] as if it were the function f, we implicitly identify two functions as soon as they are a.e. equal. With this convention it will be clear that the 'length function' defined below is a genuine norm on L¹(E).

For f ∈ 𝓛¹(E) we write

‖f‖₁ = ∫_E |f| dm.
We verify that this satisfies the conditions of Definition 5.2:

(i) ‖f‖₁ = 0 if and only if f = 0 a.e., so that f ≡ 0 as an element of L¹(E),

(ii) ‖cf‖₁ = ∫_E |cf| dm = |c| ∫_E |f| dm = |c|·‖f‖₁ (c ∈ ℝ),

(iii) ‖f + g‖₁ = ∫_E |f + g| dm ≤ ∫_E |f| dm + ∫_E |g| dm = ‖f‖₁ + ‖g‖₁.
The most important feature of L¹(E), from our present perspective, is the fact that it is a complete normed vector space. The precise definition is given below. Completeness of the real line ℝ and of the Euclidean spaces ℝⁿ is what guides the analysis of real functions, and here we seek an analogue which has a similar impact in the infinite-dimensional context provided by function spaces. The definition will be stated for general normed vector spaces:
Definition 5.3
Let X be a vector space with norm ‖·‖_X. We say that a sequence (fₙ) in X is Cauchy if

∀ε > 0 ∃N : ∀n, m ≥ N, ‖fₙ − fₘ‖_X < ε.

If each Cauchy sequence converges to some element of X, then we say that X is complete.
Example 5.4
Let fₙ(x) = (1/x)1_{[n,n+1]}(x), and suppose that n ≠ m. Then

‖fₙ − fₘ‖₁ = ∫_0^∞ (1/x)|1_{[n,n+1]} − 1_{[m,m+1]}| dx = ∫_n^{n+1} (1/x) dx + ∫_m^{m+1} (1/x) dx = log((n+1)/n) + log((m+1)/m).

If aₙ → 1, then log aₙ → 0, and so the right-hand side can be made as small as we wish: given ε > 0, take N such that log((N+1)/N) < ε/2. So (fₙ) is a Cauchy sequence in L¹(0, ∞). (When E = (a, b), we write L¹(a, b) for L¹(E), etc.)
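The closed form in Example 5.4 is easy to confirm numerically; the Riemann-sum helper below (our construction) approximates ‖fₙ − fₘ‖₁ and compares it with log((n+1)/n) + log((m+1)/m).

```python
import math

def l1_dist(n, m, steps=100000):
    """Midpoint Riemann sum for ||f_n - f_m||_1, f_k = (1/x) 1_[k,k+1], n != m."""
    def piece(k):  # approximates the integral of 1/x over [k, k+1]
        h = 1.0 / steps
        return sum(h / (k + (i + 0.5) * h) for i in range(steps))
    return piece(n) + piece(m)  # the supports are disjoint

def exact(n, m):
    return math.log((n + 1) / n) + math.log((m + 1) / m)
```

Because log((n+1)/n) → 0, the distances shrink as n and m grow, which is the Cauchy property.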
Exercise 5.2

Decide whether each of the following is Cauchy as a sequence in L¹(0, ∞):

(a) fₙ = 1_{[n,n+1]},

(b) fₙ = (1/n)1_{(0,n)},

(c) fₙ = (1/x²)1_{(0,n)}.
The proof of the main result below makes essential use of the Beppo-Levi theorem in order to transfer the main convergence question to that of series of real numbers; its role is essentially to provide the analogue of the fact that in ℝ (and hence in ℂ) absolutely convergent series always converge. (The Beppo-Levi theorem clearly extends to complex-valued functions, just as we showed for the dominated convergence theorem, but we shall concentrate on the real case in the proof below, since the extension to ℂ is immediate.)
We digress briefly to recall how this property of series ensures completeness in ℝ: let (xₙ) be a Cauchy sequence in ℝ, and extract a subsequence (x_{n_k}) such that |xₙ − x_{n_k}| < 2⁻ᵏ for all n ≥ nₖ, as follows:

• find n₁ such that |xₙ − x_{n₁}| < 2⁻¹ for all n ≥ n₁,

• find n₂ > n₁ such that |xₙ − x_{n₂}| < 2⁻² for all n ≥ n₂,

• …

• find nₖ > nₖ₋₁ with |xₙ − x_{n_k}| < 2⁻ᵏ for all n ≥ nₖ.

The Cauchy property ensures each time that such nₖ can be found. Now consider the telescoping series with partial sums

x_{n₁} + Σ_{j=1}^{k−1} (x_{n_{j+1}} − x_{n_j}) = x_{n_k},

which converges absolutely, since |x_{n_{j+1}} − x_{n_j}| < 2⁻ʲ. Thus this series converges; in other words (x_{n_k}) converges in ℝ, and its limit is also that of the whole Cauchy sequence (xₙ).

To apply the Beppo-Levi theorem below we therefore need to extract a 'rapidly convergent' subsequence from the given Cauchy sequence in L¹(E). This provides an a.e. limit for the original sequence, and the Fatou lemma does the rest.
Theorem 5.5
The space L¹(E) is complete.
Proof
Suppose that (fₙ) is a Cauchy sequence. Let ε = 1/2. There is N₁ such that for n ≥ N₁

‖fₙ − f_{N₁}‖₁ ≤ 1/2.

Next, let ε = 1/4, and for some N₂ > N₁ we have

‖fₙ − f_{N₂}‖₁ ≤ 1/4

for n ≥ N₂. In this way we construct a subsequence (f_{N_k}) satisfying

‖fₙ − f_{N_k}‖₁ ≤ 2⁻ᵏ

for all n ≥ Nₖ. Hence the series Σ_{n≥1} ‖f_{N_{n+1}} − f_{N_n}‖₁ converges, and by the Beppo-Levi theorem the series

f_{N₁}(x) + Σ_{n=1}^∞ [f_{N_{n+1}}(x) − f_{N_n}(x)]

converges a.e.; denote the sum by f(x). Since

f_{N₁}(x) + Σ_{n=1}^k [f_{N_{n+1}}(x) − f_{N_n}(x)] = f_{N_{k+1}}(x),

the left-hand side converges to f(x), so f_{N_{k+1}}(x) converges to f(x). Since the sequence of real numbers fₙ(x) is Cauchy and the above subsequence converges, the whole sequence converges to the same limit f(x).

We have to show that f ∈ 𝓛¹ and ‖fₖ − f‖₁ → 0. Let ε > 0. The Cauchy condition gives an N such that

∀n, m ≥ N, ‖fₙ − fₘ‖₁ < ε.

By Fatou's lemma,

‖f − fₘ‖₁ = ∫ |f − fₘ| dm ≤ lim inf_{k→∞} ∫ |f_{N_k} − fₘ| dm = lim inf_{k→∞} ‖f_{N_k} − fₘ‖₁ ≤ ε.     (5.1)

So f − fₘ ∈ 𝓛¹, which implies f = (f − fₘ) + fₘ ∈ 𝓛¹, and (5.1) also gives ‖f − fₘ‖₁ → 0. □
5.2 The Hilbert space L²
The space we now introduce plays a special role in the theory. It provides the closest analogue of the Euclidean space ℝⁿ among the spaces of functions, and its geometry is closely modelled on that of ℝⁿ. It is possible, using the integral, to induce the norm via an inner product, which in turn provides a concept of orthogonality (and hence 'angles') between functions. This gives L² many pleasant properties, such as a 'Pythagoras theorem' and the concept of orthogonal projection, which plays a vital role in many applications.

To define the norm, and hence the space L²(E) for a given measurable set E ⊂ ℝ, let

‖f‖₂ = (∫_E |f|² dm)^{1/2},

and define 𝓛²(E) as the set of measurable functions for which this quantity is finite. (Note that, as for L¹, we require non-negative integrands; it is essential that the integral is non-negative in order for the square root to make sense. Although we always have f²(x) = (f(x))² ≥ 0 when f(x) is real, the modulus is needed to include the case of complex-valued functions f : E → ℂ. This also makes the notation consistent with that of the other Lᵖ-spaces we shall consider below, where |f|² is replaced by |f|ᵖ for arbitrary p ≥ 1.)

We introduce L²(E) as the set of equivalence classes of elements of 𝓛²(E), under the equivalence relation f ≡ g iff f = g a.e., exactly as for L¹(E), and continue the convention of treating the equivalence classes as functions. If f : E → ℂ satisfies ∫_E |f|² dm < ∞ we write f ∈ L²(E, ℂ), again using f interchangeably as a representative of its equivalence class and to denote the class itself.
It is straightforward to prove that L²(E) is a vector space: clearly, for a ∈ ℝ, |af|² is integrable if |f|² is, while

|f + g|² ≤ 2² max{|f|², |g|²} ≤ 4(|f|² + |g|²)

shows that L²(E) is closed under addition.
5.2.1 Properties of the L²-norm

We provide a simple proof that the map f ↦ ‖f‖₂ is a norm: to see that it satisfies the triangle inequality requires a little work, but the ideas will be very familiar from elementary analysis in ℝⁿ, as is the terminology, though the context is rather different. We state and prove the result for the general case of L²(E, ℂ).

Theorem 5.6 (Schwarz inequality)

If f, g ∈ L²(E, ℂ) then fḡ ∈ L¹(E, ℂ) and

|∫_E fḡ dm| ≤ ‖fḡ‖₁ ≤ ‖f‖₂‖g‖₂     (5.2)

where ḡ denotes the complex conjugate of g.
Proof
Replacing f, g by |f|, |g| we may assume that f and g are non-negative (the first inequality has already been verified, since ‖fḡ‖₁ = ∫_E |fḡ| dm, and the second only involves the modulus in each case). Since we do not know in advance that ∫_E fg dm is finite, we shall first restrict attention to bounded measurable functions by setting fₙ = min{f, n} and gₙ = min{g, n}, and confine our domain of integration to the bounded set Eₖ = E ∩ [−k, k].

For any t ∈ ℝ we have

0 ≤ ∫_{Eₖ} (fₙ + tgₙ)² dm = ∫_{Eₖ} fₙ² dm + 2t ∫_{Eₖ} fₙgₙ dm + t² ∫_{Eₖ} gₙ² dm.

As a quadratic in t this does not have two distinct roots, so the discriminant is non-positive. Thus for all n ≥ 1

(2 ∫_{Eₖ} fₙgₙ dm)² ≤ 4 ∫_{Eₖ} fₙ² dm ∫_{Eₖ} gₙ² dm ≤ 4 ∫_E |f|² dm ∫_E |g|² dm = 4‖f‖₂²‖g‖₂².

Monotone convergence now yields

(∫_{Eₖ} fg dm)² ≤ ‖f‖₂²‖g‖₂²

for each k, and since E = ⋃ₖ Eₖ we obtain finally that

∫_E fg dm ≤ ‖f‖₂‖g‖₂,

which implies the Schwarz inequality. □
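A quick numerical sanity check of (5.2): with integrals replaced by midpoint Riemann sums on [0, 1], the discretised inequality |∫ fg dm| ≤ ‖f‖₂‖g‖₂ holds for the two sample real-valued integrands below (arbitrary choices of ours).

```python
import math

n = 10000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]
f = [math.exp(-x) for x in xs]       # arbitrary sample function
g = [math.sin(3.0 * x) for x in xs]  # arbitrary sample function

inner = sum(a * b for a, b in zip(f, g)) * h   # approximates the integral of fg
norm_f = math.sqrt(sum(a * a for a in f) * h)  # approximates ||f||_2
norm_g = math.sqrt(sum(b * b for b in g) * h)  # approximates ||g||_2
assert abs(inner) <= norm_f * norm_g
```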
The triangle inequality for the norm on L²(E, ℂ) now follows at once; we need to show that ‖f + g‖₂ ≤ ‖f‖₂ + ‖g‖₂ for f, g ∈ L²(E, ℂ):

‖f + g‖₂² = ∫_E |f + g|² dm = ∫_E (f + g)(f̄ + ḡ) dm.

The latter integral is

∫_E |f|² dm + ∫_E (fḡ + f̄g) dm + ∫_E |g|² dm,

which is dominated by (‖f‖₂ + ‖g‖₂)², since the Schwarz inequality gives

∫_E (fḡ + f̄g) dm ≤ 2 ∫_E |fḡ| dm ≤ 2‖f‖₂‖g‖₂.

The result follows.
The other properties are immediate:

1. clearly ‖f‖₂ = 0 means that |f|² = 0 a.e., hence f = 0 a.e.,

2. for a ∈ ℂ, ‖af‖₂ = (∫_E |af|² dm)^{1/2} = |a|‖f‖₂.

Thus the map f ↦ ‖f‖₂ is a norm on L²(E, ℂ).

The proof that L²(E) is complete under this norm is similar to that for L¹(E), and will be given in Theorem 5.24 below for arbitrary Lᵖ-spaces (1 < p < ∞).
In general, without restriction on the domain set E, neither L¹ ⊆ L² nor L² ⊆ L¹. To see this consider E = [1, ∞) and f(x) = 1/x. Then f ∈ L²(E) but f ∉ L¹(E). Next put F = (0, 1) and g(x) = 1/√x. Now g ∈ L¹(F) but g ∉ L²(F).

For finite measure spaces (and hence for probability spaces!) we do have a useful inclusion:

Proposition 5.7

If the set D has finite measure (that is, m(D) < ∞), then L²(D) ⊂ L¹(D).
Hint Estimate |f| by means of |f|² and then use the fact that the integral of |f|² is finite.
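The hint rests on a pointwise estimate: splitting D into {|f| ≤ 1} and {|f| > 1} gives ∫_D |f| dm ≤ m(D) + ∫_D |f|² dm, since |f| ≤ 1 on the first set and |f| ≤ |f|² on the second. A small sketch (the test function f(x) = x^(−1/3) on D = (0, 1] is our invented example):

```python
# Pointwise bound behind Proposition 5.7: |t| <= 1 + t^2 for every real t.
for t in [0.0, 0.5, -0.5, 1.0, -3.2, 100.0]:
    assert abs(t) <= 1.0 + t * t

# Consequence on the finite-measure set D = (0, 1], with f(x) = x^(-1/3):
# exactly, the integral of |f| is 3/2 and that of |f|^2 is 3, and 3/2 <= 1 + 3.
steps = 100000
h = 1.0 / steps
int_f = sum(((i + 0.5) * h) ** (-1.0 / 3.0) * h for i in range(steps))
int_f2 = sum(((i + 0.5) * h) ** (-2.0 / 3.0) * h for i in range(steps))
assert int_f <= 1.0 + int_f2
```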
Before exploring the geometry induced on L2 by its norm, we consider
examples of sequences in L2 to provide a little practice in determining which
are Cauchy sequences for the L 2 -norm, and compare this with their behaviour
as elements of L1.
Example 5.8
We show that the sequence fₙ = (1/x)1_{[n,n+1]} is Cauchy in L²(0, ∞). For n ≠ m the supports are disjoint, so

‖fₙ − fₘ‖₂² = ∫_n^{n+1} (1/x²) dx + ∫_m^{m+1} (1/x²) dx ≤ 1/n + 1/m,

and for ε > 0 let N be such that 2/N < ε². Then ‖fₘ − fₙ‖₂ < ε whenever m, n ≥ N.
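As in Example 5.4, the claim can be confirmed numerically; here the exact value is ‖fₙ − fₘ‖₂² = 1/(n(n+1)) + 1/(m(m+1)), since ∫ₖ^{k+1} x⁻² dx = 1/(k(k+1)). The helper names are ours.

```python
def l2_dist_sq(n, m, steps=100000):
    """Midpoint Riemann sum for ||f_n - f_m||_2 squared, f_k = (1/x) 1_[k,k+1]."""
    def piece(k):  # approximates the integral of x^(-2) over [k, k+1]
        h = 1.0 / steps
        return sum((k + (i + 0.5) * h) ** -2 * h for i in range(steps))
    return piece(n) + piece(m)  # valid for n != m: the supports are disjoint

def exact_piece(k):
    return 1.0 / (k * (k + 1))  # closed form of the integral of x^(-2) on [k, k+1]
```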
Exercise 5.3

Is the sequence

gₙ(x) = (1/x²)1_{(n,∞)}(x)

a Cauchy sequence in L²(ℝ)?

Exercise 5.4

Decide whether each of the following is Cauchy as a sequence in L²(0, ∞):

(a) fₙ = 1_{(0,n)},

(b) fₙ = (1/n)1_{(0,n)},

(c) fₙ = (1/x²)1_{(0,n)}.
5.2.2 Inner product spaces
We are ready for the additional structure specific (among the Lebesgue function spaces) to L²:

(f, g) = ∫ fḡ dm     (5.3)

defines an inner product, which induces the L²-norm:

(f, f) = ∫ |f|² dm = ‖f‖₂².

To explain what this means we verify the following properties, all of which follow easily from the integration theory we have developed:

Proposition 5.9

Linearity (in the first argument):

(f + g, h) = (f, h) + (g, h),   (cf, h) = c(f, h).

Conjugate symmetry: (f, g) is the complex conjugate of (g, f).

Positive definiteness:

(f, f) ≥ 0,   (f, f) = 0 ⟺ f = 0.

Hint Use the additivity of the integral in the first part, and recall for the last part that if f = 0 a.e. then f is the zero element of L²(E, ℂ).

As an immediate consequence we get conjugate linearity with respect to the second argument:

(f, cg + h) = c̄(f, g) + (f, h).

Of course, if f, g ∈ L² are real-valued, the inner product is real and linear in the second argument also.
Examination of the proof of the Schwarz inequality reveals that the particular form of the inner product defined here on L²(E, ℂ) is entirely irrelevant for this result: all we need for the proof is that the map defined in (5.3) has the properties proved for it in the last Proposition. We shall therefore make the following important definition, which will be familiar from the finite-dimensional context of ℝⁿ, and which we now wish to apply more generally.
Definition 5.10

An inner product on a vector space H over ℂ is a map (·,·) : H × H → ℂ which satisfies the three conditions listed in Proposition 5.9. The pair (H, (·,·)) is called an inner product space.

Example 5.11

The usual scalar product in ℝⁿ makes this space into a real inner product space, and ℂⁿ equipped with (z, w) = Σ_{i=1}^n zᵢw̄ᵢ is a complex one.

Proposition 5.9 shows that L²(E, ℂ) is a (complex) inner product space. With the obvious simplifications in the definitions the vector space L²(E, ℝ) = L²(E) is a real inner product space, i.e. with ℝ as the set of scalars.
The following identities are immediate consequences of the above definitions.

Proposition 5.12

Let (H, (·,·)) be a complex inner product space, with induced norm ‖·‖. The following identities hold for all h₁, h₂ ∈ H:

(i) Parallelogram law:

‖h₁ + h₂‖² + ‖h₁ − h₂‖² = 2‖h₁‖² + 2‖h₂‖².

(ii) Polarisation identity:

(h₁, h₂) = ¼(‖h₁ + h₂‖² − ‖h₁ − h₂‖² + i‖h₁ + ih₂‖² − i‖h₁ − ih₂‖²).
Remark 5.13
These identities, while trivial consequences of the definitions, are useful in
checking that certain norms cannot be induced by inner products. An example
is given in Exercise 5.5 below. With the addition of completeness, the identities
serve to characterise inner product norms: it can be proved that in the class
of complete normed spaces (known as Banach spaces), those whose norms are
induced by inner products (i.e. are Hilbert spaces) are precisely those for which
the parallelogram law holds, and the inner product is then recovered from the
norm via the polarisation identity. We shall not prove this here.
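The parallelogram test from Remark 5.13 is easy to run numerically: below, both norms are discretised on a grid over [0, 1] (our construction), and the 'defect' ‖f+g‖² + ‖f−g‖² − 2‖f‖² − 2‖g‖² is computed for f(x) = 1 and g(x) = x. It vanishes for the inner-product (L²-type) norm but not for the sup norm.

```python
xs = [i / 1000.0 for i in range(1001)]
f = [1.0 for _ in xs]
g = [x for x in xs]

def sup_norm(v):
    return max(abs(t) for t in v)

def l2_norm(v):  # discretised L^2([0,1]) norm
    return (sum(t * t for t in v) / len(v)) ** 0.5

def defect(norm):
    plus = [a + b for a, b in zip(f, g)]
    minus = [a - b for a, b in zip(f, g)]
    return norm(plus) ** 2 + norm(minus) ** 2 - 2 * norm(f) ** 2 - 2 * norm(g) ** 2

assert abs(defect(l2_norm)) < 1e-9          # parallelogram law holds
assert abs(defect(sup_norm) - 1.0) < 1e-9   # fails for the sup norm: defect = 1
```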
Exercise 5.5

Show that it is impossible to define an inner product on the space C[0, 1] of continuous functions f : [0, 1] → ℝ which will induce the sup norm ‖f‖_∞ = sup{|f(x)| : x ∈ [0, 1]}.

Hint Try to verify the parallelogram law with the functions f, g ∈ C[0, 1] given by f(x) = 1, g(x) = x for all x.
Exercise 5.6

Show that it is impossible to define an inner product on the space L¹([0, 1]) with the norm ‖·‖₁.

Hint Try to verify the parallelogram law with the functions given by f(x) = ½ − x, g(x) = x − ½.
5.2.3 Orthogonality and projections

We have introduced the concept of inner product space in a somewhat roundabout way, in order to emphasise that this structure is the natural additional tool available in the space L², which remains our principal source of interest. The additional structure does, however, allow us to simplify many arguments and prove results which are not available for other function spaces. In a sense, mathematical life in L² is 'as good as it gets' in an infinite-dimensional vector space, since the structure is so similar to that of the more familiar spaces ℝⁿ and ℂⁿ.

As an example of the power of this new tool, recall that for vectors in ℝⁿ we have the important notion of orthogonality, which means that the scalar product of two vectors is zero. This extends to any inner product space, though we shall first state it and produce explicit examples for L²: the functions f, g are orthogonal if

(f, g) = 0.
Example 5.14

If f = 1_{[0,1]}, then (f, g) = 0 if and only if ∫₀¹ g(x) dx = 0, for example if g(x) = x − ½.
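A quick check of the example with a midpoint Riemann sum (our discretisation): the inner product of 1_{[0,1]} with g(x) = x − ½ vanishes, while g(x) = x gives ∫₀¹ x dx = ½.

```python
steps = 100000
h = 1.0 / steps

def inner_with_indicator(g):
    """Midpoint Riemann sum for (1_[0,1], g), i.e. the integral of g over [0,1]."""
    return sum(g((i + 0.5) * h) * h for i in range(steps))

orth = inner_with_indicator(lambda x: x - 0.5)  # orthogonal to 1_[0,1]
not_orth = inner_with_indicator(lambda x: x)    # inner product 1/2

assert abs(orth) < 1e-9
assert abs(not_orth - 0.5) < 1e-9
```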
Exercise 5.7

Show that f(x) = sin nx, g(x) = cos mx for x ∈ [−π, π], and 0 outside, are orthogonal.

Show that f(x) = sin nx, g(x) = sin mx for x ∈ [−π, π], and 0 outside, are orthogonal for n ≠ m.
In fact, in any inner product space we can define the angle between the elements g, h by setting

cos θ = (g, h) / (‖g‖‖h‖).

Note that by the Schwarz inequality this quantity, which (as we shall see below) also has a natural interpretation as the correlation between two centred random variables, lies in [−1, 1], and that (g, h) = 0 means that cos θ = 0, i.e. θ is an odd multiple of π/2. It is therefore natural to say that g is orthogonal to h if (g, h) = 0.

Orthogonality of vectors in a complex inner product space (H, (·,·)) provides a way of formulating Pythagoras' theorem in H: since ‖g + h‖² = (g + h, g + h) = (g, g) + (h, h) + (g, h) + (h, g) = ‖g‖² + ‖h‖² + (g, h) + (h, g), we see at once that if g and h are orthogonal in H, then ‖g + h‖² = ‖g‖² + ‖h‖².

Now restrict attention to the case where (H, (·,·)) is complete in the inner product norm ‖·‖; recall that this means (see Definition 5.3) that if (hₙ) is a Cauchy sequence in H then there exists h ∈ H such that lim_{n→∞} ‖hₙ − h‖ = 0. As noted in Remark 5.13, we call H a Hilbert space if this holds. We content ourselves with one basic fact about such spaces:

Let K be a complete subspace of H, so that the above condition also holds for (hₙ) in K and then yields h ∈ K. Just as the horizontal side of a right-angled triangle in standard position in the (x, y)-plane is the projection of the hypotenuse onto the horizontal axis, and the vertical side is orthogonal to that axis, we now prove the existence of orthogonal projections of a vector in H onto the subspace K.
Theorem 5.15

Let K be a complete subspace of the Hilbert space H. For each h ∈ H we can find a unique h′ ∈ K such that h″ = h − h′ is orthogonal to every element of K. Equivalently, ‖h − h′‖ = inf{‖h − k‖ : k ∈ K}.
Proof
The two conditions defining h′ are equivalent: assume that h′ ∈ K has been found so that h″ = h − h′ is orthogonal to every k ∈ K. Given k ∈ K, note that, as (h′ − k) ∈ K, (h″, h′ − k) = 0, so Pythagoras' theorem implies

‖h − k‖² = ‖(h − h′) + (h′ − k)‖² = ‖h″‖² + ‖h′ − k‖² > ‖h″‖²

unless k = h′. Hence ‖h″‖ = ‖h − h′‖ = inf{‖h − k‖ : k ∈ K} = δ_K, say.

Conversely, having found h′ ∈ K such that ‖h − h′‖ = δ_K, then for any real t and k ∈ K, h′ + tk ∈ K, so that

‖h − (h′ + tk)‖² ≥ ‖h − h′‖².

Multiplying out the inner products and writing h″ = h − h′, this means that −t[(h″, k) + (k, h″)] + t²‖k‖² ≥ 0. This can only hold for all t near 0 if (h″, k) = 0, so that h″ ⊥ k for every k ∈ K.

To find h′ ∈ K with ‖h − h′‖ = δ_K, first choose a sequence (kₙ) in K such that ‖h − kₙ‖ → δ_K as n → ∞. Then apply the parallelogram law (Proposition 5.12 (i)) to the vectors h₁ = h − ½(kₘ + kₙ) and h₂ = ½(kₘ − kₙ). Note that h₁ + h₂ = h − kₙ and h₁ − h₂ = h − kₘ. Hence the parallelogram law reads

‖h − kₙ‖² + ‖h − kₘ‖² = 2‖h − ½(kₘ + kₙ)‖² + 2‖½(kₘ − kₙ)‖²,

and since ½(kₘ + kₙ) ∈ K, ‖h − ½(kₘ + kₙ)‖² ≥ δ_K². As m, n → ∞ the left-hand side converges to 2δ_K², hence the final term on the right must converge to 0. Thus the sequence (kₙ) is Cauchy in K, and so converges to an element h′ of K. But since ‖h − kₙ‖ → δ_K while ‖kₙ − h′‖ → 0 as n → ∞, ‖h − h′‖ ≤ ‖h − kₙ‖ + ‖kₙ − h′‖ shows that ‖h − h′‖ = δ_K. This completes the proof. □
In writing h = h′ + h″ we have decomposed the vector h ∈ H as the sum of two vectors, the first being its orthogonal projection onto K, while the second is orthogonal to all vectors in K. We say that h″ is orthogonal to K, and denote the set of all vectors orthogonal to K by K^⊥. This exhibits H as a direct sum H = K ⊕ K^⊥, with each vector of the first factor being orthogonal to each vector in the second factor.

We shall use the existence of orthogonal projections onto subspaces of L²(Ω, F, P) to construct the conditional expectation of a random variable with respect to a σ-field in Section 5.4.3.
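Theorem 5.15 in miniature: in the Hilbert space ℝ³ (vectors and subspace invented for illustration) we project h onto K = span{u, v} and verify both characterisations, that the residual h″ = h − h′ is orthogonal to K and that h′ minimises the distance from h to K.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

h = [1.0, 2.0, 3.0]
u = [1.0, 0.0, 0.0]   # u and v are chosen orthogonal to each other
v = [0.0, 1.0, 1.0]

# orthogonal projection h' of h onto span{u, v}
coeff_u = dot(h, u) / dot(u, u)
coeff_v = dot(h, v) / dot(v, v)
h_proj = [coeff_u * a + coeff_v * b for a, b in zip(u, v)]
h_perp = [a - b for a, b in zip(h, h_proj)]   # h'' = h - h'

assert abs(dot(h_perp, u)) < 1e-12 and abs(dot(h_perp, v)) < 1e-12

# minimality: ||h - k|| >= ||h''|| for perturbed k = h' + t*u
best = dot(h_perp, h_perp) ** 0.5
for t in [-1.0, -0.1, 0.1, 1.0]:
    k = [a + t * b for a, b in zip(h_proj, u)]
    diff = [a - b for a, b in zip(h, k)]
    assert dot(diff, diff) ** 0.5 >= best
```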
Remark 5.16
The foregoing discussion barely scratches the surface of the structure of inner
product spaces, such as L2(E), which is elegantly explained, for example in [11].
On the one hand, the concept of orthogonality in an inner product space leads
140
Measure, Integral and Probability
to consideration of orthonormal sets, i.e. families $(e_\alpha)$ in $H$ that are mutually orthogonal and have norm 1. A natural question arises whether every element of $H$ can be represented (or at least approximated) by linear combinations of the $(e_\alpha)$. In $L^2([-\pi, \pi])$ this leads, for example, to Fourier series representations of functions in the form $f(x) = \sum_{n=0}^{\infty} (f, \psi_n)\psi_n$, where the orthonormal functions are $\psi_0 = \frac{1}{\sqrt{2\pi}}$, $\psi_{2n}(x) = \frac{1}{\sqrt{\pi}} \cos nx$, $\psi_{2n-1}(x) = \frac{1}{\sqrt{\pi}} \sin nx$, and the series converges in $L^2$-norm. The completeness of $L^2$ is crucial in ensuring the existence of such an orthonormal basis.
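The $L^2$-norm convergence of such Fourier sums can be checked numerically. A sketch (assuming the standard example $f(x) = x$ on $[-\pi, \pi]$, whose Fourier series is $2\sum_{n\ge1} \frac{(-1)^{n+1}}{n}\sin nx$), estimating the $L^2$ error of the $N$-term partial sum by a midpoint Riemann sum:

```python
import math

# L^2([-pi, pi]) error of the N-term Fourier partial sum of f(x) = x;
# the error should decrease as N grows, illustrating convergence in L^2-norm.

def l2_error(N, steps=2000):
    total = 0.0
    dx = 2 * math.pi / steps
    for i in range(steps):
        x = -math.pi + (i + 0.5) * dx
        s = sum(2 * (-1) ** (n + 1) / n * math.sin(n * x) for n in range(1, N + 1))
        total += (x - s) ** 2 * dx
    return math.sqrt(total)

errs = [l2_error(N) for N in (1, 5, 20)]
print(errs)  # strictly decreasing
```

Note that the pointwise behaviour at the endpoints (the Gibbs phenomenon) does not prevent the $L^2$ error from shrinking; convergence in norm is the right notion here.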
5.3 The $L^p$ spaces: completeness
More generally, the space $L^p(E)$ is obtained when we integrate the $p$th powers of $|f|$. For $p \ge 1$, we say that $f \in L^p$ (and similarly for $L^p(E)$ and $L^p(E, \mathbb{C})$) if $|f|^p$ is integrable (with the same convention of identifying $f$ and $g$ when they are a.e. equal). Some work will be required to check that $L^p$ is a vector space and that the 'natural' generalisation of the norm introduced for $L^2$ is in fact a norm. We shall need $p \ge 1$ to achieve this.
Definition 5.17
For each $p$ with $1 \le p < \infty$, we define (identifying classes and functions)
$$L^p(E) = \Big\{ f : \int_E |f|^p \, dm \text{ is finite} \Big\}$$
and the norm on $L^p$ is defined by
$$\|f\|_p = \left( \int_E |f|^p \, dm \right)^{1/p}.$$
(With this in mind, we denoted the norm in $L^1(E)$ by $\|f\|_1$ and that in $L^2(E)$ by $\|f\|_2$.)
Recall Definition 3.14, where we considered non-negative measurable functions $f : E \to [0, \infty]$. Now we define the essential supremum of a measurable real-valued $f$ on $E$ by
$$\operatorname{ess\,sup} f := \inf\{c : |f| \le c \text{ a.e.}\}.$$
More precisely, if $F = \{c \ge 0 : m(|f|^{-1}((c, \infty])) = 0\}$, we set $\operatorname{ess\,sup} f = \inf F$ (with the convention $\inf \emptyset = +\infty$). It is easy to see that the infimum belongs to $F$.
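The gap between the supremum and the essential supremum can be illustrated numerically. A sketch (hypothetical example: $f(x) = x$ on $[0,1]$, modified to take the value 100 on the single point $\frac{1}{2}$, a null set), estimating $m\{|f| > c\}$ on a grid:

```python
# f equals x on [0,1] except on the null set {0.5}, where f = 100.
# sup f = 100, but ess sup f = 1: the set {f > c} has measure 0 for every c >= 1.

def f(x):
    return 100.0 if x == 0.5 else x

def measure_above(c, steps=100001):
    # crude estimate of m{x in [0,1] : f(x) > c} using a uniform grid
    hits = sum(1 for i in range(steps) if f(i / (steps - 1)) > c)
    return hits / steps

sup_f = max(f(i / 100000) for i in range(100001))
print(sup_f)               # 100.0 (the grid happens to contain 0.5)
print(measure_above(1.0))  # close to 0: only the single point 0.5 exceeds 1
```

Since $f$ and the unmodified function $x \mapsto x$ agree a.e., they define the same element of $L^\infty([0,1])$, with norm 1, not 100.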
5. Spaces of integrable functions
Definition 5.18
A measurable function $f$ satisfying $\operatorname{ess\,sup}|f| < \infty$ is said to be essentially bounded, and the set of all essentially bounded functions on $E$ is denoted by $L^\infty(E)$ (again with the usual identification of functions with a.e. equivalence classes), with the norm $\|f\|_\infty = \operatorname{ess\,sup}|f|$.
We shall need to justify the notation by showing that for each $p$ ($1 \le p \le \infty$), $(L^p(E), \|\cdot\|_p)$ is a normed vector space.

First we observe that $L^p(E)$ is a vector space for $1 \le p < \infty$. If $f$ and $g$ belong to $L^p$, then they are measurable and hence so are $cf$ and $f + g$. We have $|cf(x)|^p = |c|^p |f(x)|^p$, hence
$$\|cf\|_p = \left( \int |cf(x)|^p \, dx \right)^{1/p} = |c| \left( \int |f(x)|^p \, dx \right)^{1/p} = |c| \, \|f\|_p.$$
Next, $|f(x) + g(x)|^p \le 2^p \max\{|f(x)|^p, |g(x)|^p\}$, and so $\|f + g\|_p$ is finite if $\|f\|_p$ and $\|g\|_p$ are. Moreover, if $\|f\|_p = 0$ then $|f(x)|^p = 0$ almost everywhere, and so $f(x) = 0$ almost everywhere. The converse is obvious.
The triangle inequality
$$\|f + g\|_p \le \|f\|_p + \|g\|_p$$
is by no means obvious for general $p \ge 1$: we need to derive a famous inequality due to Hölder, which is also extremely useful in many contexts, and generalises the Schwarz inequality.
Remark 5.19
Before tackling this, we observe that the case of $L^\infty(E)$ is rather easier: $|f + g| \le |f| + |g|$ at once implies that $\|f + g\|_\infty \le \|f\|_\infty + \|g\|_\infty$ (see Proposition 3.15), and similarly $|af| = |a| |f|$ gives $\|af\|_\infty = |a| \, \|f\|_\infty$. Thus $L^\infty(E)$ is a vector space, and since $\|f\|_\infty = 0$ obviously holds if and only if $f = 0$ a.e., it follows that $\|\cdot\|_\infty$ is a norm on $L^\infty(E)$. Exercises 3.6 and 5.6 show that this norm cannot be induced by any inner product on $L^\infty$.
Lemma 5.20
For any non-negative real numbers $x, y$ and all $\alpha, \beta \in (0,1)$ with $\alpha + \beta = 1$ we have
$$x^\alpha y^\beta \le \alpha x + \beta y.$$
Proof
If $x = 0$ the claim is obvious. So take $x > 0$. Consider $f(t) = (1 - \beta) + \beta t - t^\beta$ for $t \ge 0$ and $\beta$ as given. We have $f'(t) = \beta - \beta t^{\beta - 1} = \beta(1 - t^{\beta - 1})$, and since $0 < \beta < 1$, $f'(t) < 0$ on $(0,1)$, so $f$ decreases on $[0,1]$, while $f'(t) > 0$ on $(1, \infty)$, hence $f$ increases on $[1, \infty)$. So $f(1) = 0$ is the only minimum point of $f$ on $[0, \infty)$; that is, $f(t) \ge 0$ for $t \ge 0$. Now set $t = \frac{y}{x}$; then $(1 - \beta) + \beta \frac{y}{x} - (\frac{y}{x})^\beta \ge 0$, that is, $(\frac{y}{x})^\beta \le \alpha + \beta \frac{y}{x}$. Writing $x = x^{\alpha + \beta}$, we have $x^{\alpha + \beta} (\frac{y}{x})^\beta \le \alpha x + \beta x \frac{y}{x}$, so that $x^\alpha y^\beta \le \alpha x + \beta y$ as required. $\square$
Theorem 5.21 (Hölder's inequality)
If $\frac{1}{p} + \frac{1}{q} = 1$, $p > 1$, then for $f \in L^p(E)$, $g \in L^q(E)$, we have $fg \in L^1(E)$ and
$$\|fg\|_1 \le \|f\|_p \|g\|_q,$$
that is,
$$\int |fg| \, dm \le \left( \int |f|^p \, dm \right)^{1/p} \left( \int |g|^q \, dm \right)^{1/q}.$$
Proof
Step 1. Assume that $\|f\|_p = \|g\|_q = 1$, so we only need to show that $\|fg\|_1 \le 1$. We apply Lemma 5.20 with $\alpha = \frac{1}{p}$, $\beta = \frac{1}{q}$, $x = |f|^p$, $y = |g|^q$; then we have
$$|fg| \le \frac{1}{p} |f|^p + \frac{1}{q} |g|^q.$$
Integrating, we obtain
$$\int |fg| \, dm \le \frac{1}{p} \int |f|^p \, dm + \frac{1}{q} \int |g|^q \, dm = \frac{1}{p} + \frac{1}{q} = 1,$$
since $\int |f|^p \, dm = 1$ and $\int |g|^q \, dm = 1$. So we have $\|fg\|_1 \le 1$ as required.

Step 2. For general $f \in L^p$ and $g \in L^q$ we write $\|f\|_p = a$, $\|g\|_q = b$ for some $a, b > 0$. (If either $a$ or $b$ is zero, then one of the functions is zero almost everywhere and the inequality is trivial.) The functions $\tilde{f} = \frac{1}{a} f$, $\tilde{g} = \frac{1}{b} g$ satisfy the assumption of Step 1, and so $\|\tilde{f}\tilde{g}\|_1 \le \|\tilde{f}\|_p \|\tilde{g}\|_q$. This yields
$$\frac{1}{ab} \|fg\|_1 \le \frac{1}{a} \|f\|_p \cdot \frac{1}{b} \|g\|_q,$$
and after multiplying by $ab$ the result is proved. $\square$
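Hölder's inequality can be spot-checked numerically; with counting measure on a finite set it reads $\sum_i |f_i g_i| \le (\sum_i |f_i|^p)^{1/p} (\sum_i |g_i|^q)^{1/q}$. A sketch with random sequences (hypothetical data, seeded for reproducibility):

```python
import random

random.seed(0)

def holder_gap(f, g, p):
    q = p / (p - 1)                      # conjugate exponent: 1/p + 1/q = 1
    lhs = sum(abs(x * y) for x, y in zip(f, g))
    rhs = (sum(abs(x) ** p for x in f) ** (1 / p)
           * sum(abs(y) ** q for y in g) ** (1 / q))
    return rhs - lhs                     # Hölder's inequality says this is >= 0

f = [random.uniform(-1, 1) for _ in range(50)]
g = [random.uniform(-1, 1) for _ in range(50)]
print(all(holder_gap(f, g, p) >= 0 for p in (1.5, 2, 3, 10)))  # True
```

For $p = q = 2$ this is the Schwarz inequality of Corollary 5.22 below.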
Letting $p = q = 2$ and recalling the definition of the scalar product in $L^2$, we obtain the following now familiar special case of Hölder's inequality.

Corollary 5.22 (Schwarz inequality)
If $f, g \in L^2$, then
$$\|fg\|_1 \le \|f\|_2 \|g\|_2.$$

We may now complete the verification that $\|\cdot\|_p$ is a norm on $L^p(E)$.
Theorem 5.23 (Minkowski's inequality)
For each $p \ge 1$ and $f, g \in L^p(E)$,
$$\|f + g\|_p \le \|f\|_p + \|g\|_p.$$
Proof
Assume $1 < p < \infty$ (the case $p = 1$ was done earlier). We have
$$|f + g|^p = |(f + g)(f + g)^{p-1}| \le |f| \, |f + g|^{p-1} + |g| \, |f + g|^{p-1},$$
and, taking $q$ such that $\frac{1}{p} + \frac{1}{q} = 1$, in other words $p + q = pq$, we obtain
$$|f + g|^{(p-1)q} = |f + g|^p < \infty.$$
Hence $(f + g)^{p-1} \in L^q$ and
$$\|(f + g)^{p-1}\|_q = \left( \int |f + g|^p \, dm \right)^{1/q}.$$
We may apply Hölder's inequality:
$$\int |f + g|^p \, dm \le \int |f| \, |f + g|^{p-1} \, dm + \int |g| \, |f + g|^{p-1} \, dm$$
$$\le \left( \int |f|^p \, dm \right)^{1/p} A + \left( \int |g|^p \, dm \right)^{1/p} A = \left( \left( \int |f|^p \, dm \right)^{1/p} + \left( \int |g|^p \, dm \right)^{1/p} \right) A,$$
with $A = \left( \int |f + g|^p \, dm \right)^{1/q}$. If $A = 0$ then $\|f + g\|_p = 0$ and there is nothing to prove. So suppose $A > 0$ and divide by $A$:
$$\|f + g\|_p = \left( \int |f + g|^p \, dm \right)^{1 - \frac{1}{q}} \le \left( \int |f|^p \, dm \right)^{1/p} + \left( \int |g|^p \, dm \right)^{1/p} = \|f\|_p + \|g\|_p,$$
which was to be proved. $\square$
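The triangle inequality just proved can likewise be checked on sequences (counting measure), where it reads $(\sum_i |f_i + g_i|^p)^{1/p} \le (\sum_i |f_i|^p)^{1/p} + (\sum_i |g_i|^p)^{1/p}$. A sketch with hypothetical seeded data:

```python
import random

random.seed(1)

def lp_norm(f, p):
    return sum(abs(x) ** p for x in f) ** (1 / p)

f = [random.gauss(0, 1) for _ in range(100)]
g = [random.gauss(0, 1) for _ in range(100)]
s = [x + y for x, y in zip(f, g)]

for p in (1, 1.5, 2, 4):
    # Minkowski: ||f + g||_p <= ||f||_p + ||g||_p
    assert lp_norm(s, p) <= lp_norm(f, p) + lp_norm(g, p)
print("Minkowski holds for all tested p")
```

Trying an exponent $p < 1$ here typically breaks the inequality, which is why the definition of $\|\cdot\|_p$ requires $p \ge 1$.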
Next we prove that $L^p(E)$ is an example of a complete normed space (i.e. a Banach space) for $1 < p < \infty$, i.e. that every Cauchy sequence in $L^p(E)$ converges in norm to an element of $L^p(E)$. We sometimes refer to convergence of sequences in the $L^p$-norm as convergence in $p$th mean.
The proof is quite similar to the case $p = 1$.
Theorem 5.24
The space $L^p(E)$ is complete for $1 < p < \infty$.

Proof
Given a Cauchy sequence $(f_n)$ (that is, $\|f_n - f_m\|_p \to 0$ as $n, m \to \infty$), we find a subsequence $(f_{n_k})$ with
$$\|f_{n_{k+1}} - f_{n_k}\|_p \le 2^{-k}$$
for all $k \ge 1$, and we set
$$g_k = \sum_{i=1}^{k} |f_{n_{i+1}} - f_{n_i}|, \qquad g = \lim_{k \to \infty} g_k = \sum_{i=1}^{\infty} |f_{n_{i+1}} - f_{n_i}|.$$
The triangle inequality yields $\|g_k\|_p \le \sum_{i=1}^{k} 2^{-i} < 1$, and we can apply Fatou's lemma to the non-negative measurable functions $g_k^p$, $k \ge 1$, so that
$$\|g\|_p^p = \int \lim_{k \to \infty} g_k^p \, dm \le \liminf_{k \to \infty} \int g_k^p \, dm \le 1.$$
Hence $g$ is almost everywhere finite, and $f_{n_1} + \sum_{i \ge 1} (f_{n_{i+1}} - f_{n_i})$ converges absolutely almost everywhere, defining a measurable function $f$ as its sum.

We need to show that $f \in L^p$. Note first that $f = \lim_{k \to \infty} f_{n_k}$ a.e., and given $\varepsilon > 0$ we can find $N$ such that $\|f_n - f_m\|_p < \varepsilon$ for $m, n \ge N$. Applying Fatou's lemma to the sequence $(|f_{n_i} - f_m|^p)_{i \ge 1}$, letting $i \to \infty$, we have
$$\int |f - f_m|^p \, dm \le \liminf_{i \to \infty} \int |f_{n_i} - f_m|^p \, dm \le \varepsilon^p.$$
Hence $f - f_m \in L^p$, and so $f = f_m + (f - f_m) \in L^p$, and we have $\|f - f_m\|_p \le \varepsilon$ for all $m \ge N$. Thus $f_m \to f$ in $L^p$-norm as required. $\square$
The space $L^\infty(E)$ is also complete, since for any Cauchy sequence $(f_n)$ in $L^\infty(E)$ the union of the null sets where $|f_k(x)| > \|f_k\|_\infty$ or $|f_n(x) - f_m(x)| > \|f_n - f_m\|_\infty$, for $k, m, n \in \mathbb{N}$, is still a null set, $F$ say. Outside $F$ the sequence $(f_n)$ converges uniformly to a bounded function, $f$ say. It is clear that $\|f_n - f\|_\infty \to 0$ and $f \in L^\infty(E)$, so we are done.
Exercise 5.8
Is the sequence
Cauchy in L4?
We have the following relations between the $L^p$ spaces for different $p$, which generalise Proposition 5.7.

Theorem 5.25
If $E$ has finite Lebesgue measure, then $L^q(E) \subseteq L^p(E)$ when $1 \le p \le q \le \infty$.

Proof
Note that $|f(x)|^p \le 1$ if $|f(x)| \le 1$, while if $|f(x)| \ge 1$, then $|f(x)|^p \le |f(x)|^q$. Hence $|f(x)|^p \le 1 + |f(x)|^q$, and
$$\int_E |f|^p \, dm \le \int_E 1 \, dm + \int_E |f|^q \, dm = m(E) + \int_E |f|^q \, dm < \infty,$$
so if $m(E)$ and $\int_E |f|^q \, dm$ are finite, the same is true for $\int_E |f|^p \, dm$. $\square$
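The pointwise bound $|f|^p \le 1 + |f|^q$ used in the proof can be checked by direct integration. A sketch (hypothetical example $f(x) = 2x$ on $E = [0,1]$, with $p = 2$, $q = 4$), comparing $\int_E |f|^p \, dm$ with $m(E) + \int_E |f|^q \, dm$ via midpoint Riemann sums:

```python
# For f(x) = 2x on [0,1]: integral of f^2 is 4/3, while
# m(E) + integral of f^4 is 1 + 16/5 = 4.2, so the proof's bound holds easily.

def integral(h, steps=100000):
    return sum(h((i + 0.5) / steps) for i in range(steps)) / steps

lhs = integral(lambda x: (2 * x) ** 2)        # ~ 1.3333
rhs = 1.0 + integral(lambda x: (2 * x) ** 4)  # ~ 4.2
print(lhs, rhs, lhs <= rhs)
```

The finiteness of $m(E)$ is essential: on all of $\mathbb{R}$ the inclusion $L^q \subseteq L^p$ fails.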
5.4 Probability
5.4.1 Moments
Random variables belonging to the spaces $L^p(\Omega)$, where the exponent $p \in \mathbb{N}$, play an important role in probability.
Definition 5.26
The moment of order $n$ of a random variable $X \in L^n(\Omega)$ is the number
$$\mathbb{E}(X^n), \qquad n = 1, 2, \dots.$$
Writing $\mathbb{E}(X) = \mu$, the central moments are given by
$$\mathbb{E}((X - \mu)^n), \qquad n = 1, 2, \dots.$$
By Theorem 4.41 moments are determined by the probability distribution:
$$\mathbb{E}(X^n) = \int x^n \, dP_X(x), \qquad \mathbb{E}((X - \mu)^n) = \int (x - \mu)^n \, dP_X(x),$$
and if $X$ has a density $f_X$ then by Theorem 4.46 we have
$$\mathbb{E}(X^n) = \int x^n f_X(x) \, dx, \qquad \mathbb{E}((X - \mu)^n) = \int (x - \mu)^n f_X(x) \, dx.$$
Proposition 5.27
If $\mathbb{E}(X^n)$ is finite for some $n$, then $\mathbb{E}(X^k)$ is finite for all $k \le n$. If $\mathbb{E}(X^n)$ is infinite, then the same is true of $\mathbb{E}(X^k)$ for $k \ge n$.

Hint Use Theorem 5.25.
Exercise 5.9
Find $X$ such that $\mathbb{E}(X^2) = \infty$ while $\mathbb{E}(X) < \infty$. Can such an $X$ have $\mathbb{E}(X) = 0$?
Hint You may use some previous examples in this chapter.
Definition 5.28
The variance of a random variable is the central moment of second order:
$$\operatorname{Var}(X) = \mathbb{E}\big( (X - \mathbb{E}(X))^2 \big).$$
Clearly, writing $\mu = \mathbb{E}(X)$,
$$\operatorname{Var}(X) = \mathbb{E}(X^2 - 2\mu X + \mu^2) = \mathbb{E}(X^2) - 2\mu \, \mathbb{E}(X) + \mu^2 = \mathbb{E}(X^2) - \mu^2.$$
This shows that the first two moments determine the second central moment. This may be generalised to arbitrary order and, what is more, the relationship also goes the other way round.
Proposition 5.29
Central moments of order $n$ are determined by moments of order $k$ for $k \le n$.

Hint Use the binomial theorem and linearity of the integral.

Proposition 5.30
Moments of order $n$ are determined by central moments of order $k$ for $k \le n$.

Hint Write $\mathbb{E}(X^n)$ as $\mathbb{E}((X - \mu + \mu)^n)$ and then use the binomial theorem.
Exercise 5.10
Find Var(aX) in terms of Var(X).
Example 5.31
If $X$ has the uniform distribution on $[a, b]$, that is, $f_X(x) = \frac{1}{b-a} \mathbf{1}_{[a,b]}(x)$, then
$$\mathbb{E}(X) = \int x f_X(x) \, dx = \frac{1}{b-a} \int_a^b x \, dx = \frac{1}{b-a} \cdot \frac{1}{2} x^2 \Big|_a^b = \frac{1}{2}(a + b).$$
Exercise 5.11
Show that for uniformly distributed $X$, $\operatorname{Var}(X) = \frac{1}{12}(b - a)^2$.
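Both the mean just computed and the variance of Exercise 5.11 can be verified by numerical integration against the density $f_X = \frac{1}{b-a}\mathbf{1}_{[a,b]}$. A sketch (hypothetical choice $a = 2$, $b = 5$):

```python
# E(X) and Var(X) for X uniform on [a,b], via midpoint Riemann sums of
# x f_X(x) and (x - mean)^2 f_X(x); expect (a+b)/2 and (b-a)^2/12.

a, b = 2.0, 5.0
steps = 200000
dx = (b - a) / steps
xs = [a + (i + 0.5) * dx for i in range(steps)]

mean = sum(x / (b - a) for x in xs) * dx
var = sum((x - mean) ** 2 / (b - a) for x in xs) * dx
print(mean, var)  # ~ 3.5 and ~ 0.75 = (5-2)^2 / 12
```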
Exercise 5.12
Find the variance of
(a) a constant random variable $X$, $X(\omega) = a$ for all $\omega$,
(b) $X : [0,1] \to \mathbb{R}$ given by $X(\omega) = \min\{\omega, 1 - \omega\}$ (the distance to the nearest endpoint of the interval $[0,1]$),
(c) $X : [0,1]^2 \to \mathbb{R}$, the distance to the nearest edge of the square $[0,1]^2$.
We shall see that for the Gaussian distribution the first two moments determine the remaining ones. First we compute the expectation:

Theorem 5.32
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mathbb{R}} x \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \mu.$$

Proof
Make the substitution $z = \frac{x - \mu}{\sigma}$; then, writing $\int$ for $\int_{\mathbb{R}}$,
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int x \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \frac{\sigma}{\sqrt{2\pi}} \int z \, e^{-\frac{z^2}{2}} \, dz + \frac{\mu}{\sqrt{2\pi}} \int e^{-\frac{z^2}{2}} \, dz.$$
Notice that the first integral is zero, since the integrand is an odd function. The second integral is $\sqrt{2\pi}$, hence the result. $\square$
So the parameter $\mu$ in the density is the mathematical expectation. We show now that $\sigma^2$ is the variance.

Theorem 5.33
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mathbb{R}} (x - \mu)^2 e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \sigma^2.$$

Proof
Make the same substitution as before, $z = \frac{x - \mu}{\sigma}$; then
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int (x - \mu)^2 e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \frac{\sigma^2}{\sqrt{2\pi}} \int z^2 e^{-\frac{z^2}{2}} \, dz.$$
Integrate by parts with $u = z$, $v' = z e^{-z^2/2}$, to get
$$\frac{\sigma^2}{\sqrt{2\pi}} \int z^2 e^{-\frac{z^2}{2}} \, dz = -\frac{\sigma^2}{\sqrt{2\pi}} \, z e^{-\frac{z^2}{2}} \Big|_{-\infty}^{+\infty} + \frac{\sigma^2}{\sqrt{2\pi}} \int e^{-\frac{z^2}{2}} \, dz = \sigma^2,$$
since the first term vanishes. $\square$
Note that the odd central moments of a Gaussian random variable are zero: the integrals
$$\frac{1}{\sqrt{2\pi}\,\sigma} \int (x - \mu)^{2k+1} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx$$
vanish, since after the above substitution we integrate an odd function. By repeating the integration by parts argument one can prove that
$$\mathbb{E}\big( (X - \mu)^{2k} \big) = 1 \cdot 3 \cdots (2k-1) \, \sigma^{2k}.$$
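These Gaussian moment facts are easy to confirm by quadrature of the density over a wide truncation. A sketch (hypothetical parameters $\mu = 1$, $\sigma = 2$), checking the mean, the variance, and the vanishing of the third central moment:

```python
import math

# Central moments of the N(mu, sigma^2) density by midpoint Riemann sums
# on [mu - 10 sigma, mu + 10 sigma]; the tail outside is negligible.

mu, sigma = 1.0, 2.0

def central_moment(n, steps=100000):
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        dens = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        total += (x - mu) ** n * dens * dx
    return total

print(central_moment(1))  # ~ 0  (so E X = mu)
print(central_moment(2))  # ~ 4 = sigma^2
print(central_moment(3))  # ~ 0  (odd central moment vanishes)
```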
Example 5.34
Let us consider the Cauchy density $\frac{1}{\pi} \frac{1}{1+x^2}$ and try to compute the expectation (we shall see that this is impossible):
$$\frac{1}{\pi} \int_{-\infty}^{+\infty} \frac{x}{1+x^2} \, dx = \frac{1}{2\pi} \left( \lim_{x_n \to +\infty} \ln(1 + x_n^2) - \lim_{y_n \to -\infty} \ln(1 + y_n^2) \right)$$
for some sequences $x_n$, $y_n$. The result, if finite, should not depend on their choice; however, if we set for example $x_n = a y_n$, then we have
$$\frac{1}{2\pi} \left( \lim_{x_n \to +\infty} \ln(1 + x_n^2) - \lim_{y_n \to -\infty} \ln(1 + y_n^2) \right) = \frac{1}{2\pi} \lim_{y_n \to \infty} \ln \frac{1 + a^2 y_n^2}{1 + y_n^2} = \frac{1}{2\pi} \ln a^2,$$
which depends on $a$ — a contradiction. As a consequence, we see that for the Cauchy density the moments do not exist.
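The dependence on the truncation can be seen numerically. A sketch computing $\frac{1}{\pi}\int_{-R}^{aR} \frac{x}{1+x^2}\,dx$ exactly via the antiderivative $\frac{1}{2\pi}\ln(1+x^2)$ (hypothetical cutoffs): as $R$ grows, the value approaches $\frac{\ln a}{\pi}$ rather than a limit independent of $a$.

```python
import math

# Truncated Cauchy 'expectation' (1/pi) * integral of x/(1+x^2) over [-R, a*R],
# evaluated via the antiderivative F(x) = (1/(2*pi)) * ln(1 + x^2).

def truncated_mean(R, a):
    F = lambda x: math.log(1 + x * x) / (2 * math.pi)
    return F(a * R) - F(-R)

for R in (10, 1000, 10 ** 6):
    print(truncated_mean(R, a=2.0))  # tends to ln(2)/pi ~ 0.2206, not to 0
```

Different truncation ratios $a$ give different limits, which is exactly the contradiction in Example 5.34.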
Remark 5.35
We give without proof a simple relation between the characteristic function and the moments (recall that $\varphi_X(t) = \mathbb{E}(e^{itX})$; see Definition 4.52):
If $\varphi_X$ is $k$-times continuously differentiable, then $X$ has finite $k$th moment and
$$\mathbb{E}(X^k) = \frac{1}{i^k} \frac{d^k}{dt^k} \varphi_X(0).$$
Conversely, if $X$ has finite $k$th moment, then $\varphi_X(t)$ is $k$-times differentiable and the above formula holds.
5.4.2 Independence
The expectation provides a useful criterion for the independence of two random
variables.
Theorem 5.36
The random variables $X, Y$ are independent if and only if
$$\mathbb{E}(f(X)g(Y)) = \mathbb{E}(f(X))\,\mathbb{E}(g(Y)) \tag{5.4}$$
holds for all Borel measurable bounded functions $f, g$.
Proof
Suppose that (5.4) holds and take any Borel sets $B_1, B_2$. Let $f = \mathbf{1}_{B_1}$, $g = \mathbf{1}_{B_2}$; an application of (5.4) gives
$$\mathbb{E}(\mathbf{1}_{B_1}(X)\mathbf{1}_{B_2}(Y)) = \mathbb{E}(\mathbf{1}_{B_1}(X))\,\mathbb{E}(\mathbf{1}_{B_2}(Y)).$$
The left-hand side equals
$$\int_\Omega \mathbf{1}_{B_1 \times B_2}(X(\omega), Y(\omega)) \, dP(\omega) = P((X \in B_1) \cap (Y \in B_2)),$$
whereas the right-hand side is $P(X \in B_1)P(Y \in B_2)$, thus proving the independence of $X$ and $Y$.

Suppose now that $X, Y$ are independent. Then (5.4) holds for $f = \mathbf{1}_{B_1}$, $g = \mathbf{1}_{B_2}$, with $B_1, B_2$ Borel sets, by the above argument. By linearity we extend the formula to simple functions $\varphi = \sum_i b_i \mathbf{1}_{B_i}$, $\psi = \sum_j c_j \mathbf{1}_{C_j}$:
$$\mathbb{E}(\varphi(X)\psi(Y)) = \mathbb{E}\Big( \sum_i b_i \mathbf{1}_{B_i}(X) \sum_j c_j \mathbf{1}_{C_j}(Y) \Big) = \sum_{i,j} b_i c_j \, \mathbb{E}(\mathbf{1}_{B_i}(X)\mathbf{1}_{C_j}(Y)) = \sum_{i,j} b_i c_j \, \mathbb{E}(\mathbf{1}_{B_i}(X))\,\mathbb{E}(\mathbf{1}_{C_j}(Y)) = \mathbb{E}(\varphi(X))\,\mathbb{E}(\psi(Y)).$$
We approximate general $f, g$ by simple functions, and the dominated convergence theorem ($f, g$ are bounded) extends the formula to $f, g$. $\square$
Proposition 5.37
Let $X, Y$ be independent random variables. If $\mathbb{E}(X) = 0$ and $\mathbb{E}(Y) = 0$, then $\mathbb{E}(XY) = 0$.

Hint The above theorem cannot be applied directly with $f(x) = x$, $g(x) = x$ (these functions are not bounded), so some approximation is required.
The expectation is nothing but an integral, so the number $(X, Y) = \mathbb{E}(XY)$ is the inner product in the space $L^2(\Omega)$ of random variables square integrable with respect to $P$. Hence independence implies orthogonality in this space. If the expectation of a random variable is non-zero, we modify the notion of orthogonality. The idea is that adding (or subtracting) a constant neither destroys nor improves independence.
Definition 5.38
For a random variable with finite $\mu = \mathbb{E}(X)$ we write $X_c = X - \mathbb{E}(X)$ and call $X_c$ a centred random variable (clearly $\mathbb{E}(X_c) = 0$). The covariance of $X$ and $Y$ is defined as
$$\operatorname{Cov}(X, Y) = (X_c, Y_c) = \mathbb{E}\big( (X - \mathbb{E}(X))(Y - \mathbb{E}(Y)) \big).$$
The correlation is the cosine of the angle between $X_c$ and $Y_c$: if $\|X_c\|_2$ and $\|Y_c\|_2$ are non-zero, then we write
$$\rho_{X,Y} = \frac{(X_c, Y_c)}{\|X_c\|_2 \|Y_c\|_2} = \frac{\operatorname{Cov}(X, Y)}{\|X_c\|_2 \|Y_c\|_2}.$$
We say that $X, Y$ are uncorrelated if $\rho_{X,Y} = 0$.
Note that some elementary algebra gives a more convenient expression for the covariance:
$$\operatorname{Cov}(X, Y) = \mathbb{E}(XY) - \mathbb{E}(X)\,\mathbb{E}(Y).$$
Thus uncorrelated $X, Y$ satisfy $\mathbb{E}(XY) = \mathbb{E}(X)\,\mathbb{E}(Y)$. Clearly independent random variables are uncorrelated; it is sufficient to take $f(x) = x - \mathbb{E}(X)$, $g(x) = x - \mathbb{E}(Y)$ in Theorem 5.36. The converse is not true in general, although, as we shall see in Chapter 6, it holds for Gaussian random variables.
Example 5.39
Let $\Omega = [-1, 1]$ with Lebesgue measure: $P = \frac{1}{2} m|_{[-1,1]}$, $X = x$, $Y = x^2$. Then $\mathbb{E}(X) = 0$, $\mathbb{E}(XY) = \frac{1}{2}\int_{-1}^1 x^3 \, dx = 0$, hence $\operatorname{Cov}(X, Y) = 0$ and thus $\rho_{X,Y} = 0$. However, $X, Y$ are not independent. Intuitively this is clear, since $Y = X^2$, so that each of $X, Y$ is a function of the other. Specifically, take $A = B = [-\frac{1}{2}, \frac{1}{2}]$ and compare (as required by Definition 3.18) the probabilities $P(X^{-1}(A) \cap Y^{-1}(A))$ and $P(X^{-1}(A))P(Y^{-1}(A))$. We obtain $X^{-1}(A) = A$, $Y^{-1}(A) = [-\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}]$, hence $P(X^{-1}(A) \cap Y^{-1}(A)) = \frac{1}{2}$, $P(X^{-1}(A)) = \frac{1}{2}$, $P(Y^{-1}(A)) = \frac{1}{\sqrt{2}}$, and so $X$ and $Y$ are not independent.
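The computations of Example 5.39 can be reproduced numerically: the covariance integral vanishes, yet $P(X^{-1}(A) \cap Y^{-1}(A))$ and $P(X^{-1}(A))P(Y^{-1}(A))$ disagree. A sketch by midpoint Riemann sums:

```python
# Omega = [-1,1] with P = (1/2) m, X(w) = w, Y(w) = w^2, A = [-1/2, 1/2].

steps = 200000
dw = 2.0 / steps
ws = [-1.0 + (i + 0.5) * dw for i in range(steps)]

cov = sum(w * w ** 2 for w in ws) * dw / 2  # E(XY); equals Cov(X,Y) since E(X) = 0
p_x = sum(1 for w in ws if abs(w) <= 0.5) * dw / 2       # P(X in A) = 1/2
p_y = sum(1 for w in ws if w ** 2 <= 0.5) * dw / 2       # P(Y in A) = 1/sqrt(2)
p_xy = sum(1 for w in ws if abs(w) <= 0.5 and w ** 2 <= 0.5) * dw / 2

print(cov)             # ~ 0: uncorrelated
print(p_xy, p_x * p_y) # 0.5 versus ~ 0.3536: not independent
```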
Exercise 5.13
Find the correlation $\rho_{X,Y}$ if $X = 2Y + 1$.
Exercise 5.14
Take $\Omega = [0, 1]$ with Lebesgue measure and let $X(\omega) = \sin 2\pi\omega$, $Y(\omega) = \cos 2\pi\omega$. Show that $X, Y$ are uncorrelated but not independent.
We close the section with two further applications.
Proposition 5.40
The variance of the sum of uncorrelated random variables is the sum of their variances:
$$\operatorname{Var}\Big( \sum_{i=1}^n X_i \Big) = \sum_{i=1}^n \operatorname{Var}(X_i).$$

Hint To avoid cumbersome notation, first prove the formula for two random variables,
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y),$$
using the formula $\operatorname{Var}(X) = \mathbb{E}(X^2) - (\mathbb{E}X)^2$.
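A quick Monte Carlo sanity check of this additivity, with two independent uniform samples (a hypothetical simulation, seeded for reproducibility):

```python
import random

random.seed(42)
n = 200000
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]  # drawn independently of xs

def var(sample):
    m = sum(sample) / len(sample)
    return sum((s - m) ** 2 for s in sample) / len(sample)

sums = [x + y for x, y in zip(xs, ys)]
# Var(X + Y) should be close to Var(X) + Var(Y), each ~ 1/12 for uniforms
print(var(sums), var(xs) + var(ys))
```

Replacing `ys` by, say, `xs` itself (perfectly correlated samples) makes the two numbers visibly disagree, since the cross term $2\operatorname{Cov}(X,Y)$ no longer vanishes.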
Proposition 5.41
Suppose that $X, Y$ are independent random variables. Then we have the following formula for the characteristic function:
$$\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t).$$
More generally, if $X_1, \dots, X_n$ are independent, then
$$\varphi_{X_1 + \cdots + X_n}(t) = \varphi_{X_1}(t) \cdots \varphi_{X_n}(t).$$

Hint Use the definition of characteristic functions and Theorem 5.36.
5.4.3 Conditional expectation (first construction)
The construction of orthogonal projections in complete inner product spaces, undertaken in Section 5.2.3, allows us to provide a preview of perhaps the most important concept in modern probability theory: the conditional expectation of an $\mathcal{F}$-measurable integrable random variable $X$, given a $\sigma$-field $\mathcal{G}$ contained in $\mathcal{F}$ (we call $\mathcal{G}$ a sub-$\sigma$-field of $\mathcal{F}$). We study this idea in detail in Chapter 7, where we will also justify the definition below by reference to more familiar concepts, but the construction of the conditional expectation as a $\mathcal{G}$-measurable random variable can be achieved for any integrable $X$ with the tools we have readily to hand. Our argument owes much to the elegant construction given in [13].

We begin with $X \in L^2 = L^2(\Omega, \mathcal{F}, P)$. Thinking of $\sigma$-fields as containing 'information' about random events, consider the subspace $L^2(\mathcal{G})$ of $L^2$ consisting of $\mathcal{G}$-measurable functions. (In accordance with our convention in Section 5.1, we work with functions rather than with equivalence classes.) By Theorem 5.24 the inner product space $L^2$ is complete, and the vector subspace $L^2(\mathcal{G})$ is a complete subspace. Thus the construction of the orthogonal projection in Section 5.2.3 applies, and we denote by $Y$ the orthogonal projection of $X$ on $L^2(\mathcal{G})$. We can interpret $Y$ as the 'best predictor' of $X$ among the class of $\mathcal{G}$-measurable functions. We know that $X - Y$ is orthogonal to $L^2(\mathcal{G})$, and by definition of the inner product in $L^2$ this means that
$$(X - Y, Z)_{L^2} = \int_\Omega (X - Y)Z \, dP = 0$$
for every $Z \in L^2(\mathcal{G})$. In particular, since $\mathbf{1}_G \in L^2(\mathcal{G})$ for every $G \in \mathcal{G}$, we have
$$\int_G Y \, dP = \int_G X \, dP. \tag{5.5}$$
This motivates the following definition:

Definition 5.42
Let $(\Omega, \mathcal{F}, P)$ be a probability space and suppose that $\mathcal{G}$ is a sub-$\sigma$-field of $\mathcal{F}$. Suppose that for an $\mathcal{F}$-measurable $X$ there exists a unique (up to a null set) $\mathcal{G}$-measurable $Y$ such that (5.5) holds for every $G \in \mathcal{G}$. We write $Y = \mathbb{E}(X|\mathcal{G})$ and call $Y$ the conditional expectation of $X$ given $\mathcal{G}$.

The above considerations show that the conditional expectation is well defined for $X \in L^2$. However, condition (5.5) makes sense for $X, Y$ in just $L^1$. This gives hope of extending the scope of the definition to such $X$. We shall show how to construct $\mathbb{E}(X|\mathcal{G}) \in L^1(\mathcal{G})$ for an arbitrary $X \in L^1(\mathcal{F})$.
Step 1. The case of non-negative bounded $X$.
If $X$ is bounded, it is in $L^2(\mathcal{F})$ by Theorem 5.25, since $P(\Omega)$ is finite. Hence it has a conditional expectation $Y$.
We shall see that $Y \ge 0$ $P$-a.s. To this end, suppose that $Y$ takes negative values with positive probability. Then there exists $n \ge 1$ such that the set $G = \{Y < -\frac{1}{n}\} \in \mathcal{G}$ has $P(G) > 0$. Thus $\int_G Y \, dP < -\frac{1}{n} P(G) < 0$. But $\int_G Y \, dP = \int_G X \, dP \ge 0$ by Proposition 4.17, a contradiction.

Step 2. Approximating sequence for non-negative $X \in L^1$.
Take an arbitrary $X \ge 0$ in $L^1(\mathcal{F})$, and for $n \ge 1$ set $X_n = \min(X, n)$. Then $X_n$ is bounded and non-negative, so Step 1 applies to $X_n$, yielding a non-negative $Y_n \in L^2(\mathcal{G})$ with $\int_G Y_n \, dP = \int_G X_n \, dP$.
Since $(X_n)$ is increasing with $n$, so is $(\int_G X_n \, dP)$ for each $G \in \mathcal{G}$. Hence for each $G$, $\int_G Y_n \, dP \le \int_G Y_{n+1} \, dP$, therefore $Y_{n+1} - Y_n \ge 0$ a.s.; that is, the sequence $(Y_n)$ increases a.s.

Step 3. Taking the limit.
Set $Y(\omega) = \limsup_{n \to \infty} Y_n(\omega)$ for each $\omega \in \Omega$. By Theorem 3.9, $Y$ is $\mathcal{G}$-measurable.
Moreover, for $G \in \mathcal{G}$, $\int_G Y_n \, dP = \int_G X_n \, dP \le \int_G X \, dP < \infty$ for all $n$. The monotone convergence theorem shows that the integrals $(\int_G Y_n \, dP)_{n \ge 1}$ increase to $\int_G Y \, dP$, and that this last integral is finite, so that $Y \in L^1(\mathcal{G})$.
On the other hand, $\int_G Y_n \, dP = \int_G X_n \, dP$ and the latter integrals also increase to $\int_G X \, dP$, so this implies $\int_G Y \, dP = \int_G X \, dP$ for all $G \in \mathcal{G}$.

Step 4. General $X$ (not necessarily non-negative).
We consider $X = X^+ - X^-$. By Step 3 there are $Y^+, Y^- \in L^1(\mathcal{G})$ such that for $G \in \mathcal{G}$ both $\int_G Y^+ \, dP = \int_G X^+ \, dP$ and $\int_G Y^- \, dP = \int_G X^- \, dP$. Subtracting on both sides, we obtain $\int_G Y \, dP = \int_G X \, dP$, where $Y = Y^+ - Y^- \in L^1(\mathcal{G})$.

Step 5. Uniqueness.
Theorem 4.22, applied to $P$ instead of $m$ and to $\mathcal{G}$ instead of $\mathcal{M}$, implies that $Y$ is $P$-a.s. unique: if $Z \in L^1(\mathcal{G})$ also satisfies $\int_G Z \, dP = \int_G X \, dP$ for every $G \in \mathcal{G}$, then $Z = Y$, $P$-a.s.
This is often expressed by saying that $Y$ is a version of $\mathbb{E}(X|\mathcal{G})$: by definition of $L^1(\Omega, \mathcal{G}, P)$ the uniqueness claim is that all versions belong to the same equivalence class in $L^1(\Omega, \mathcal{G}, P)$ under the equivalence relation $f \equiv g$ iff $P(\{\omega \in \Omega : f(\omega) \neq g(\omega)\}) = 0$.
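For a $\sigma$-field generated by a finite partition, the conditional expectation is simply the average of $X$ over each atom, and property (5.5) can be verified directly. A sketch (hypothetical example: $\Omega = [0,1]$ with Lebesgue measure, $\mathcal{G}$ generated by the partition $\{[0, \frac{1}{2}), [\frac{1}{2}, 1]\}$, $X(\omega) = \omega^2$):

```python
# E(X|G) for G generated by the partition {[0, 1/2), [1/2, 1]} of [0,1]:
# on each atom G it equals (1/P(G)) * integral of X over G, and (5.5) holds.

steps = 100000
dw = 1.0 / steps
ws = [(i + 0.5) * dw for i in range(steps)]
X = lambda w: w ** 2

atoms = [lambda w: w < 0.5, lambda w: w >= 0.5]
cond_exp = {}
for k, ind in enumerate(atoms):
    mass = sum(dw for w in ws if ind(w))             # P(G)
    integ = sum(X(w) * dw for w in ws if ind(w))     # integral of X over G
    cond_exp[k] = integ / mass                       # value of E(X|G) on the atom G

# check (5.5): integral of Y over G equals integral of X over G, for each atom
for k, ind in enumerate(atoms):
    lhs = sum(cond_exp[k] * dw for w in ws if ind(w))
    rhs = sum(X(w) * dw for w in ws if ind(w))
    assert abs(lhs - rhs) < 1e-6

print(cond_exp)  # ~ {0: 1/12 ~ 0.0833, 1: 7/12 ~ 0.5833}
```

This piecewise-constant $Y$ is exactly the orthogonal projection of $X$ onto the (two-dimensional) space of $\mathcal{G}$-measurable functions in $L^2$.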
5.5 Proofs of propositions
Proof (of Proposition 5.7)
Suppose that $\int_D f^2(x) \, dx$ is finite. Then using $a \le 1 + a^2$ (which follows from $(a - 1)^2 \ge 0$) we have
$$\int_D f(x) \, dx \le \int_D 1 \, dx + \int_D f^2(x) \, dx = m(D) + \int_D f^2(x) \, dx < \infty. \qquad \square$$
Proof (of Proposition 5.9)
We verify the first two properties using the linearity of the integral:
$$(f + g, h) = \int (f(x) + g(x))h(x) \, dx = \int f(x)h(x) \, dx + \int g(x)h(x) \, dx = (f, h) + (g, h),$$
$$(cf, h) = \int cf(x)h(x) \, dx = c \int f(x)h(x) \, dx = c(f, h).$$
The symmetry is obvious, since $f(x)g(x) = g(x)f(x)$ under the integral. $\square$
Proof (of Proposition 5.12)
Parallelogram law: $\|h_1 + h_2\|^2 = (h_1 + h_2, h_1 + h_2) = (h_1, h_1) + (h_1, h_2) + (h_2, h_1) + (h_2, h_2)$ and $\|h_1 - h_2\|^2 = (h_1 - h_2, h_1 - h_2) = (h_1, h_1) - (h_1, h_2) - (h_2, h_1) + (h_2, h_2)$; adding, we get the result.
Polarization identity: subtracting the above gives $\|h_1 + h_2\|^2 - \|h_1 - h_2\|^2 = 2(h_1, h_2) + 2(h_2, h_1)$; replace $h_2$ by $ih_2$ to get $\|h_1 + ih_2\|^2 - \|h_1 - ih_2\|^2 = 2(h_1, ih_2) + 2(ih_2, h_1)$. Insert the expressions obtained into the right-hand side of the identity in question. On the left we have $2[(h_1, h_2) + (h_2, h_1) + i(h_1, ih_2) + i(ih_2, h_1)] = 2[(h_1, h_2) + (h_2, h_1) + i(-i)(h_1, h_2) + i^2 (h_2, h_1)] = 4(h_1, h_2)$. $\square$
Proof (of Proposition 5.27)
Suppose $\mathbb{E}(X^n) < \infty$, which means that $X \in L^n(\Omega)$; then, since the measure of $\Omega$ is finite, we may apply Theorem 5.25, and so $X \in L^k(\Omega)$ for all $k \le n$. If $\mathbb{E}(X^n) = \infty$, the same must be true for $\mathbb{E}(X^k)$ for $k \ge n$, since otherwise $\mathbb{E}(X^k) < \infty$ would imply $\mathbb{E}(X^n) < \infty$, a contradiction. $\square$
Proof (of Proposition 5.29)
Using the binomial expansion we have
$$(X - \mu)^n = \sum_{k=0}^{n} \binom{n}{k} (-\mu)^{n-k} X^k,$$
and so by linearity of the expectation
$$\mathbb{E}((X - \mu)^n) = \sum_{k=0}^{n} \binom{n}{k} (-\mu)^{n-k} \, \mathbb{E}(X^k). \qquad \square$$

Proof (of Proposition 5.30)
We have
$$\mathbb{E}(X^n) = \mathbb{E}((X - \mu + \mu)^n) = \sum_{k=0}^{n} \binom{n}{k} \mu^{n-k} \, \mathbb{E}((X - \mu)^k). \qquad \square$$
Proof (of Proposition 5.37)
Let $f_n(x) = \max\{-n, \min\{x, n\}\}$. By Theorem 5.36, since $X, Y$ are independent, we have $\mathbb{E}(f_n(X)f_n(Y)) = \mathbb{E}(f_n(X))\,\mathbb{E}(f_n(Y))$. Integrability of $X$ and $Y$ enables us to pass to the limit, which is 0 on the right. $\square$
Proof (of Proposition 5.40)
Let $X, Y$ be uncorrelated random variables. Then
$$\operatorname{Var}(X + Y) = \mathbb{E}\big( ((X + Y) - \mathbb{E}(X + Y))^2 \big) = \mathbb{E}(X + Y)^2 - (\mathbb{E}(X) + \mathbb{E}(Y))^2$$
$$= \mathbb{E}X^2 + 2\mathbb{E}(XY) + \mathbb{E}Y^2 - (\mathbb{E}X)^2 - 2\mathbb{E}(X)\mathbb{E}(Y) - (\mathbb{E}Y)^2$$
$$= \mathbb{E}X^2 - (\mathbb{E}X)^2 + \mathbb{E}Y^2 - (\mathbb{E}Y)^2 + 2[\mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y)] = \operatorname{Var}(X) + \operatorname{Var}(Y),$$
since $\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)$. The general case for $n$ random variables follows by induction, or by repeated use of the formula for two:
$$\operatorname{Var}(X_1 + \cdots + X_n) = \operatorname{Var}(X_1 + [X_2 + \cdots + X_n]) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2 + [X_3 + \cdots + X_n]) = \cdots. \qquad \square$$
Proof (of Proposition 5.41)
By definition,
$$\varphi_{X+Y}(t) = \mathbb{E}(e^{it(X+Y)}) = \mathbb{E}(e^{itX} e^{itY}) = \mathbb{E}(e^{itX})\,\mathbb{E}(e^{itY}) \quad \text{(by Theorem 5.36)} \quad = \varphi_X(t)\varphi_Y(t).$$
The generalisation to $n$ components is straightforward, by induction or step-by-step application of the result for two. $\square$
6
Product measures
6.1 Multi-dimensional Lebesgue measure
In Chapter 2 we constructed Lebesgue measure on the real line. The basis for that was the notion of the length of an interval. Consider now the plane $\mathbb{R}^2$ in place of $\mathbb{R}$. Here by 'interval' we understand a rectangle of any sort:
$$R = I_1 \times I_2,$$
where $I_1, I_2$ are any intervals. The 'length' of a rectangle is its area:
$$a(R) = l(I_1) l(I_2).$$
The concept of null set is introduced as in the one-dimensional case. As before, countable sets are null. It is worth noting that in the plane we have more sophisticated null sets, such as, for example, a line segment or the graph of a function.
The whole construction goes through without change, and the resulting measure is the Lebesgue measure $m_2$ on the plane, defined on the $\sigma$-field generated by the rectangles.
A subtle point which clearly illustrates the difference from linear measure is the following: any set of the form $A \times \{a\}$, $a \in \mathbb{R}$, is null and hence Lebesgue-measurable in the plane. An interesting case of this is when $A$ is a non-measurable set on the real line!
Next, we consider $\mathbb{R}^3$. By 'interval' here we mean a cube, and the 'length' is its volume:
$$v(C) = l(I_1) l(I_2) l(I_3).$$
Now surfaces are examples of null sets. Following the same construction we obtain Lebesgue measure $m_3$ in $\mathbb{R}^3$.
Finally, we consider $\mathbb{R}^n$ (this includes the particular cases $n = 1, 2, 3$, so for a true mathematician this is the only case worth attention, as it covers all the others). 'Intervals' are now $n$-dimensional cubes:
$$I = I_1 \times \cdots \times I_n,$$
and generalised 'length' is given by
$$l(I) = l(I_1) \cdots l(I_n).$$
An interesting example of a null set is a hyperplane.
The above provides motivation for what follows. The multi-dimensional
Lebesgue measures will emerge again from the considerations below where we
will work with general measure spaces. Bearing in mind that we are principally
interested in probabilistic applications, we stick to the notation of probability
theory.
6.2 Product σ-fields
Let $(\Omega_1, \mathcal{F}_1, P_1)$, $(\Omega_2, \mathcal{F}_2, P_2)$ be two measure spaces. Put
$$\Omega = \Omega_1 \times \Omega_2.$$
We want to define a measure $P$ on $\Omega$ to 'agree' with the measures given on $\Omega_1$, $\Omega_2$. Before we construct $P$ we need to specify its domain, that is, a $\sigma$-field on $\Omega$.

Definition 6.1
Let $\mathcal{F}$ be the smallest $\sigma$-field of subsets of $\Omega$ containing the 'rectangles' $A_1 \times A_2$ for all $A_1 \in \mathcal{F}_1$, $A_2 \in \mathcal{F}_2$. We call $\mathcal{F}$ the product $\sigma$-field of $\mathcal{F}_1$ and $\mathcal{F}_2$. In other words, the product $\sigma$-field is generated by the family of sets ('rectangles')
$$\mathcal{R} = \{A_1 \times A_2 : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2\}.$$
The notation used for the product $\sigma$-field is simply $\mathcal{F} = \mathcal{F}_1 \times \mathcal{F}_2$.

There are many ways in which the same product $\sigma$-field may be generated, each of which will prove useful in the sequel.
Theorem 6.2
(i) The product $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is generated by the family of sets ('cylinders' or 'strips')
$$\mathcal{C} = \{A_1 \times \Omega_2 : A_1 \in \mathcal{F}_1\} \cup \{\Omega_1 \times A_2 : A_2 \in \mathcal{F}_2\}.$$
(ii) The product $\sigma$-field $\mathcal{F}$ is the smallest $\sigma$-field such that the projections
$$\operatorname{Pr}_1 : \Omega \to \Omega_1, \quad \operatorname{Pr}_1(\omega_1, \omega_2) = \omega_1, \qquad \operatorname{Pr}_2 : \Omega \to \Omega_2, \quad \operatorname{Pr}_2(\omega_1, \omega_2) = \omega_2,$$
are measurable.

Proof
Recall that we write $\sigma(\mathcal{E})$ for the $\sigma$-field generated by a family $\mathcal{E}$.
Clearly $\mathcal{C}$ is contained in $\mathcal{R}$, hence the $\sigma$-field generated by $\mathcal{C}$ is smaller than the $\sigma$-field generated by $\mathcal{R}$. On the other hand,
$$A_1 \times A_2 = (A_1 \times \Omega_2) \cap (\Omega_1 \times A_2),$$
hence the rectangles belong to the $\sigma$-field generated by the cylinders: $\mathcal{R} \subset \sigma(\mathcal{C})$. This implies $\sigma(\mathcal{R}) \subset \sigma(\sigma(\mathcal{C})) = \sigma(\mathcal{C})$, which completes the proof of (i).
For (ii) note that
$$\operatorname{Pr}_1^{-1}(A_1) = A_1 \times \Omega_2, \qquad \operatorname{Pr}_2^{-1}(A_2) = \Omega_1 \times A_2,$$
hence the projections are measurable by (i). But the smallest $\sigma$-field such that they are both measurable is the smallest $\sigma$-field containing the cylinders, which is $\mathcal{F}$ by (i) again. $\square$
Consider a particular case: $\Omega_1 = \Omega_2 = \mathbb{R}$, $\mathcal{F}_1 = \mathcal{F}_2 = \mathcal{B}$, the Borel sets. Then we have two ways of producing $\mathcal{F} = \mathcal{B}^2$, the Borel sets on the plane: we can use the family of products of Borel sets or the family of products of intervals. The following result shows that they give the same collection of subsets of $\mathbb{R}^2$.

Proposition 6.3
The $\sigma$-fields generated by
$$\{B_1 \times B_2 : B_1, B_2 \in \mathcal{B}\} \qquad \text{and} \qquad \mathcal{I} = \{I_1 \times I_2 : I_1, I_2 \text{ are intervals}\}$$
are the same.

Hint Use the idea of the proof of the preceding theorem.
We may easily generalise to $n$ factors: suppose we are given $n$ measure spaces $(\Omega_i, \mathcal{F}_i, P_i)$, $i = 1, \dots, n$; then the product $\sigma$-field in $\Omega = \Omega_1 \times \cdots \times \Omega_n$ is the $\sigma$-field generated by the sets
$$\{A_1 \times \cdots \times A_n : A_i \in \mathcal{F}_i\}.$$
6.3 Construction of the product measure
Recall that $(\Omega_1, \mathcal{F}_1, P_1)$, $(\Omega_2, \mathcal{F}_2, P_2)$ are arbitrary measure spaces. We shall construct a measure $P$ on $\Omega = \Omega_1 \times \Omega_2$ which is determined by $P_1, P_2$ in a natural way. A technical assumption is needed here: $P_1$ and $P_2$ are taken to be $\sigma$-finite, that is, there is a sequence of measurable sets $A_n$ with $\bigcup_{n=1}^\infty A_n = \Omega_1$ and $P_1(A_n)$ finite (and the same for $P_2$, $\Omega_2$). This is of course true for probability measures, and also for Lebesgue measure (in the latter case $A_n = [-n, n]$ will do, for example). For simplicity we assume that $P_1, P_2$ are finite. (The extension to the $\sigma$-finite case is obtained by a routine limit passage $n \to \infty$ in the results obtained for the restrictions of the measures to $A_n$.)
The motivation provided by the construction of multi-dimensional Lebesgue measures gives the following natural condition on $P$:
$$P(A_1 \times A_2) = P_1(A_1) P_2(A_2) \tag{6.1}$$
for $A_1 \in \mathcal{F}_1$, $A_2 \in \mathcal{F}_2$.
We want to generalise (6.1) to all sets from the product $\sigma$-field. To do this we introduce the notion of a section of a subset $A$ of $\Omega_1 \times \Omega_2$: for $\omega_2 \in \Omega_2$,
$$A_{\omega_2} = \{\omega_1 \in \Omega_1 : (\omega_1, \omega_2) \in A\}.$$
A similar construction is carried out for $\omega_1 \in \Omega_1$:
$$A_{\omega_1} = \{\omega_2 \in \Omega_2 : (\omega_1, \omega_2) \in A\}.$$
Theorem 6.4
If $A$ is in the product $\sigma$-field $\mathcal{F}$, then for each $\omega_2$, $A_{\omega_2} \in \mathcal{F}_1$, and for each $\omega_1$, $A_{\omega_1} \in \mathcal{F}_2$.

Proof
Let
$$\mathcal{G} = \{A \subset \Omega : A_{\omega_2} \in \mathcal{F}_1 \text{ for each } \omega_2\}.$$

[Figure 6.1: $\omega_1$ section of a set]
[Figure 6.2: $\omega_2$ section of a set]

If we show that $\mathcal{G}$ is a $\sigma$-field containing the rectangles, then $\mathcal{G} = \mathcal{F}$, since $\mathcal{F}$ is the smallest $\sigma$-field with this property.
If $A = A_1 \times A_2$, $A_1 \in \mathcal{F}_1$, then
$$A_{\omega_2} = \begin{cases} A_1 & \text{if } \omega_2 \in A_2 \\ \emptyset & \text{if } \omega_2 \notin A_2 \end{cases}$$
is in $\mathcal{F}_1$, so the rectangles are in $\mathcal{G}$.
If $A \in \mathcal{G}$, then $\Omega \setminus A \in \mathcal{G}$, since
$$(\Omega \setminus A)_{\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in \Omega \setminus A\} = \Omega_1 \setminus \{\omega_1 : (\omega_1, \omega_2) \in A\} = \Omega_1 \setminus A_{\omega_2}.$$
Finally, let $A_n \in \mathcal{G}$. Since
$$\Big( \bigcup_{n=1}^\infty A_n \Big)_{\omega_2} = \bigcup_{n=1}^\infty (A_n)_{\omega_2},$$
the union $\bigcup_{n=1}^\infty A_n$ is also in $\mathcal{G}$.
The proof for $A_{\omega_1}$ is exactly the same. $\square$
If $A = A_1 \times A_2$, then the function $\omega_2 \mapsto P_1(A_{\omega_2})$ is a step function:
$$P_1(A_{\omega_2}) = P_1(A_1) \mathbf{1}_{A_2}(\omega_2),$$
and hence we have
$$\int_{\Omega_2} P_1(A_{\omega_2}) \, dP_2(\omega_2) = P_1(A_1) P_2(A_2).$$
This motivates the general formula: for any $A$ we write
$$P(A) = \int_{\Omega_2} P_1(A_{\omega_2}) \, dP_2(\omega_2). \tag{6.2}$$
We call $P$ the product measure, and we will sometimes denote it by $P = P_1 \times P_2$. We already know that $P_1(A_{\omega_2})$ makes sense, since $A_{\omega_2} \in \mathcal{F}_1$ as shown before. For the integral to make sense we need more:

Theorem 6.5
Suppose that $P_1, P_2$ are finite. If $A \in \mathcal{F}$, then the functions
$$\omega_2 \mapsto P_1(A_{\omega_2}), \qquad \omega_1 \mapsto P_2(A_{\omega_1})$$
are measurable with respect to $\mathcal{F}_2$, $\mathcal{F}_1$, respectively, and
$$\int_{\Omega_2} P_1(A_{\omega_2}) \, dP_2(\omega_2) = \int_{\Omega_1} P_2(A_{\omega_1}) \, dP_1(\omega_1). \tag{6.3}$$
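Formula (6.3) can be illustrated numerically for Lebesgue measure on the unit square. A sketch (hypothetical set: the triangle $A = \{(\omega_1, \omega_2) \in [0,1]^2 : \omega_2 \le \omega_1\}$), integrating the section measures in either order:

```python
# A = {(w1, w2) in [0,1]^2 : w2 <= w1}.
# Section measures: m(A_{w2}) = 1 - w2 (w1 ranges over [w2, 1]),
#                   m(A_{w1}) = w1     (w2 ranges over [0, w1]).
# Both iterated integrals should give m2(A) = 1/2, as (6.3) asserts.

steps = 100000
dw = 1.0 / steps
mid = [(i + 0.5) * dw for i in range(steps)]

int_over_w2 = sum((1 - w2) * dw for w2 in mid)
int_over_w1 = sum(w1 * dw for w1 in mid)
print(int_over_w2, int_over_w1)  # both ~ 0.5
```

Computing a set's measure by integrating the measures of its sections, in whichever order is convenient, is precisely the content of formula (6.2) and Theorem 6.5.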
Before we prove this theorem we allow ourselves a digression concerning the ways in which $\sigma$-fields can be produced. This will greatly simplify several proofs. Given a family of sets $\mathcal{A}$, let $\mathcal{F}$ be the smallest $\sigma$-field containing $\mathcal{A}$. Suppose that $\mathcal{A}$ is a field, i.e. it is closed with respect to finite unions and differences: $A, B \in \mathcal{A}$ implies $A \cup B, A \setminus B \in \mathcal{A}$. Then we have an alternative way of characterising $\sigma$-fields by means of so-called monotone classes.

Definition 6.6
A monotone class is a family of sets closed under countable unions of increasing sets and countable intersections of decreasing sets. That is, $\mathcal{G}$ is a monotone class if
$$A_1 \subset A_2 \subset \cdots, \ A_i \in \mathcal{G} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{G}, \qquad A_1 \supset A_2 \supset \cdots, \ A_i \in \mathcal{G} \implies \bigcap_{i=1}^\infty A_i \in \mathcal{G}.$$
Lemma 6.7 (monotone class theorem)
The smallest monotone class gA containing a field A coincides with the a-field
:FA generated by A.
Proof
A a-field is a monotone class, so gA C :FA (since gA is the smallest monotone
class containing A).
To prove the converse, we first show that gA is a field. The family of sets
contains A since A E A implies AC E A egA. We observe that it is a monotone
class. Suppose that Al c A2 C ... , are such that Ai EgA. We have to show
that (U Ai)C EgA· We have Ai J A~ J ... and hence Ai E gA because gA
is a monotone class. By de Morgan's law Ai = (U Ai)C so the union satisfies
the required condition. The proof for the intersection is similar: Al J A2 J ...
implies Ai C A~ c ... , hence U Ai E gA and so
Ai)C = U Ai also belongs
to gAo
We conclude that
n
n
(n
so gA is closed with respect to taking complements.
Now consider unions. First fix A ∈ A and consider the family

    {B : A ∪ B ∈ G_A}.

This family contains A (if B ∈ A, then A ∪ B ∈ A ⊂ G_A) and is a monotone
class. For, let B_1 ⊂ B_2 ⊂ ⋯ be such that A ∪ B_i ∈ G_A. Then A ∪ B_1 ⊂
A ∪ B_2 ⊂ ⋯, hence ∪_i (A ∪ B_i) ∈ G_A, thus A ∪ (∪_i B_i) ∈ G_A. Similar
arguments work for the intersection of a decreasing chain of sets, so for this
fixed A,

    G_A ⊂ {B : A ∪ B ∈ G_A}.

This means that for A ∈ A and B ∈ G_A we have A ∪ B ∈ G_A.

Now take an arbitrary A ∈ G_A. By what we have just observed, the whole field
A is contained in {B : A ∪ B ∈ G_A}, and, by the same argument as before, the
latter family is a monotone class. So

    G_A ⊂ {B : A ∪ B ∈ G_A},

this time for general A ∈ G_A, which completes the proof that G_A is a field.
Now, having shown that G_A is a field, we observe that it is a σ-field. This
is clear, since for a sequence A_i ∈ G_A we have A_1 ⊂ A_1 ∪ A_2 ⊂ ⋯, all these
finite unions are in G_A (by the field property), and so is their countable
union (since G_A is a monotone class).

Therefore G_A is a σ-field containing A, so it contains the σ-field generated
by A:

    F_A ⊂ G_A,

which completes the proof. □
The family R of rectangles introduced above is not a field, so it cannot be
used in the above result. Therefore we take A to be the family of all finite
unions of disjoint rectangles.
Proof (of Theorem 6.5)

Write

    G = {A ∈ F : ω_2 ↦ P_1(A_{ω_2}), ω_1 ↦ P_2(A_{ω_1}) are measurable and
         ∫_{Ω_2} P_1(A_{ω_2}) dP_2(ω_2) = ∫_{Ω_1} P_2(A_{ω_1}) dP_1(ω_1)}.

The idea of the proof is this. First we show that R ⊂ G, then A ⊂ G, and
finally we show that G is a monotone class. By Lemma 6.7, G = F, which means
that the claim of the theorem holds for all sets from F.

If A is a rectangle, A = A_1 × A_2, then, as we noticed before, ω_2 ↦ P_1(A_{ω_2}),
ω_1 ↦ P_2(A_{ω_1}) are indicator functions multiplied by constants, and (6.3)
holds, each side being equal to P_1(A_1)P_2(A_2).
Next let A = (A_1 × A_2) ∪ (B_1 × B_2) be the union of two disjoint rectangles.
Disjoint means that either A_1 ∩ B_1 = ∅ or A_2 ∩ B_2 = ∅. Assume the former,
for example. Then

    P_1(A_{ω_2}) = P_1(A_1) + P_1(B_1)   if ω_2 ∈ A_2 ∩ B_2,
    P_1(A_{ω_2}) = P_1(A_1)              if ω_2 ∈ A_2 \ B_2,
    P_1(A_{ω_2}) = P_1(B_1)              if ω_2 ∈ B_2 \ A_2,
    P_1(A_{ω_2}) = 0                     otherwise,

and

    ∫_{Ω_2} P_1(A_{ω_2}) dP_2(ω_2)
      = [P_1(A_1) + P_1(B_1)] P_2(A_2 ∩ B_2) + P_1(A_1) P_2(A_2 \ B_2)
        + P_1(B_1) P_2(B_2 \ A_2)
      = P_1(A_1)[P_2(A_2 ∩ B_2) + P_2(A_2 \ B_2)]
        + P_1(B_1)[P_2(A_2 ∩ B_2) + P_2(B_2 \ A_2)]
      = P_1(A_1) P_2(A_2) + P_1(B_1) P_2(B_2).
On the other hand, the same reasoning applied to the sections A_{ω_1} gives

    ∫_{Ω_1} P_2(A_{ω_1}) dP_1(ω_1) = P_1(A_1) P_2(A_2) + P_1(B_1) P_2(B_2)

as before.
The general case of finitely many rectangles can be proved in the same way.
This is easy but tedious, and we skip the argument, hoping that presenting it
in detail for two rectangles is sufficient to guide the reader through the
general case. It remains true that the functions ω_2 ↦ P_1(A_{ω_2}),
ω_1 ↦ P_2(A_{ω_1}) are simple functions, and so the verification of (6.3) is
just elementary algebra.

It remains to verify that G is a monotone class. Let A_1 ⊂ A_2 ⊂ ⋯ be sets
from G; hence the functions ω_2 ↦ P_1((A_i)_{ω_2}), ω_1 ↦ P_2((A_i)_{ω_1}) are
measurable. They increase with i, since the sections (A_i)_{ω_2}, (A_i)_{ω_1}
are increasing. If i → ∞, then

    P_1((A_i)_{ω_2}) → P_1(∪_i (A_i)_{ω_2}) = P_1((∪_i A_i)_{ω_2}),

and so the function ω_2 ↦ P_1((∪_i A_i)_{ω_2}) is measurable. The same argument
shows that the function ω_1 ↦ P_2((∪_i A_i)_{ω_1}) is measurable. The equality
(6.3) holds for each i, and by the monotone convergence theorem it is preserved
in the limit. Thus (6.3) holds for the union ∪_i A_i.
For intersections the argument is similar. The sequences in question are
decreasing; the functions ω_2 ↦ P_1((∩_i A_i)_{ω_2}), ω_1 ↦ P_2((∩_i A_i)_{ω_1})
are measurable as limits of measurable functions, and (6.3) holds by the
monotone convergence theorem. □
Theorem 6.8

Suppose that P_1, P_2 are finite measures. The set function P given by (6.2) is
countably additive. Any other measure coinciding with P on rectangles is equal
to P on the product σ-field.
Proof

Let A_i ∈ F be pairwise disjoint. Then the sections (A_i)_{ω_2} are also
pairwise disjoint, and

    P(∪_i A_i) = ∫_{Ω_2} P_1((∪_i A_i)_{ω_2}) dP_2(ω_2)
               = ∫_{Ω_2} P_1(∪_i (A_i)_{ω_2}) dP_2(ω_2)
               = ∫_{Ω_2} Σ_i P_1((A_i)_{ω_2}) dP_2(ω_2)
               = Σ_i ∫_{Ω_2} P_1((A_i)_{ω_2}) dP_2(ω_2)
               = Σ_i P(A_i),

where we have employed the fact that the section of a union is the union of
the sections (see Figure 6.3) and the Beppo Levi theorem.

Figure 6.3 Section of a union
For uniqueness, let Q be a measure defined on the product σ-field F such
that P(A_1 × A_2) = Q(A_1 × A_2) for A_1 ∈ F_1, A_2 ∈ F_2. Let

    H = {A ⊂ Ω_1 × Ω_2 : A ∈ F, P(A) = Q(A)}.

This family contains all rectangles by the hypothesis. It contains unions of
disjoint rectangles by the additivity of both P and Q; in other words, H
contains the field A.

It remains to show that H is a monotone class, since then it coincides with F
by Lemma 6.7. This is quite straightforward, using again the fact that P and
Q are measures. If A_1 ⊂ A_2 ⊂ ⋯, A_i ∈ H, then P(A_i) = Q(A_i), P(∪A_i) =
lim P(A_i), Q(∪A_i) = lim Q(A_i), hence P(∪A_i) = Q(∪A_i), which means that
H is closed with respect to monotone unions. The argument for monotone
intersections is exactly the same: if A_1 ⊃ A_2 ⊃ ⋯, then P(∩A_i) = lim P(A_i),
Q(∩A_i) = lim Q(A_i), hence P(A_i) = Q(A_i) implies P(∩A_i) = Q(∩A_i). □
Remark 6.9

The uniqueness part of the proof of Theorem 6.8 illustrates an important
technique: in order to show that two measures on a σ-field coincide, it
suffices to prove that they coincide on the generating sets of that σ-field,
by an application of the monotone class theorem.

As an immediate consequence of Theorem 6.5 we have:

The completion of the product σ-field M × M built from the σ-field of
Lebesgue-measurable sets is the σ-field M_2 on which m_2 is defined.

It is easy to see that two-dimensional Lebesgue measure coincides with the
completion of the product of one-dimensional ones. First, they agree on
rectangles built from intervals. As a consequence, they agree on the σ-field
generated by such rectangles, which is the Borel σ-field on the plane. The
completion of the Borel sets gives the σ-field of Lebesgue-measurable sets in
the same way as in the one-dimensional case.
6.4 Fubini's theorem

We wish to integrate functions defined on the product of the spaces
(Ω_1, F_1, P_1), (Ω_2, F_2, P_2) by exploiting integration with respect to the
measures P_1, P_2 individually.

We tackle the issue of measurability first.

Theorem 6.10

If a non-negative function f : Ω_1 × Ω_2 → ℝ is measurable with respect to
F_1 × F_2, then for each ω_1 ∈ Ω_1 the function (which we shall call a section
of f, see Figure 6.4) ω_2 ↦ f(ω_1, ω_2) is F_2-measurable, and for each
ω_2 ∈ Ω_2 the section ω_1 ↦ f(ω_1, ω_2) is F_1-measurable.
Figure 6.4 Section of f
Proof

First we approximate f by simple functions, in similar fashion to
Proposition 4.15; we write

    f_n(ω_1, ω_2) = k/2^n   if f(ω_1, ω_2) ∈ [k/2^n, (k+1)/2^n), 0 ≤ k < n·2^n,
    f_n(ω_1, ω_2) = n       if f(ω_1, ω_2) ≥ n,

and as n → ∞, f_n ↗ f.

The sections of simple measurable functions are simple and measurable.
This is clear for the indicator functions, as observed above, and next we use
the fact that the section of a sum is the sum of the sections.

Finally, it is clear that the sections of f_n converge to the sections of f,
and since measurability is preserved in the limit, the theorem is proved. □
Corollary 6.11

The functions

    ω_1 ↦ ∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2),    ω_2 ↦ ∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)

are F_1-, F_2-measurable, respectively.

Proof

The integrals may be taken (being possibly infinite) due to the measurability
of the functions in question. By the monotone convergence theorem, they are
limits of the integrals of the sections of f_n. The integrals
∫_{Ω_1} f_n(ω_1, ω_2) dP_1(ω_1), ∫_{Ω_2} f_n(ω_1, ω_2) dP_2(ω_2) are simple
functions, and hence the limits are measurable. □
Theorem 6.12

Let f be a measurable non-negative function defined on Ω_1 × Ω_2. Then

    ∫_{Ω_1} (∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2)) dP_1(ω_1)
      = ∫_{Ω_2} (∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)) dP_2(ω_2).    (6.4)

Proof

For the indicator function of a rectangle A_1 × A_2 each side of (6.4) just
becomes P_1(A_1)P_2(A_2). Then by the additivity of the integral the formula is
true for simple functions. Monotone approximation of any measurable f by simple
functions allows us to extend this formula to the general case. □
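Theorem 6.12 is easy to probe numerically for product Lebesgue measure on the unit square. The sketch below uses the assumed example f(x, y) = x²y on [0,1]² (not from the text) and compares the two iterated integrals computed by midpoint Riemann sums; both should approximate the true value 1/6.

```python
# Numerical sketch of Theorem 6.12: for a non-negative f, the two iterated
# integrals agree.  f(x, y) = x^2 * y on [0,1]^2 is an assumed example.

def f(x, y):
    return x * x * y

def iterated(outer_is_x, n=400):
    """Midpoint-rule approximation of the iterated integral over [0,1]^2."""
    h = 1.0 / n
    pts = [(k + 0.5) * h for k in range(n)]
    total = 0.0
    for u in pts:                       # outer variable
        inner = sum(f(u, v) if outer_is_x else f(v, u) for v in pts) * h
        total += inner * h
    return total

I_x_outer = iterated(True)    # integrate over y first, then x
I_y_outer = iterated(False)   # integrate over x first, then y
```

Both numbers agree to within the quadrature error, as the theorem predicts for a non-negative integrand.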
Theorem 6.13 (Fubini's theorem)

If f ∈ L¹(Ω_1 × Ω_2), then the sections are integrable in the appropriate
spaces, the functions

    ω_1 ↦ ∫_{Ω_2} f(ω_1, ω_2) dP_2(ω_2),    ω_2 ↦ ∫_{Ω_1} f(ω_1, ω_2) dP_1(ω_1)

are in L¹(Ω_1), L¹(Ω_2), respectively, and (6.4) holds; in concise form it reads

    ∫_{Ω_1} ∫_{Ω_2} f dP_2 dP_1 = ∫_{Ω_2} ∫_{Ω_1} f dP_1 dP_2.

Proof

This relation is immediate from the decomposition f = f⁺ − f⁻ and the result
proved for non-negative functions. The integrals are finite since, if f ∈ L¹,
then f⁺, f⁻ ∈ L¹ and all the integrals on the right are finite. □
Remark 6.14

The whole procedure may be extended to the product of an arbitrary finite
number of spaces. In particular, we have a method of constructing n-dimensional
Lebesgue measure as the completion of the product of n copies of
one-dimensional Lebesgue measure.
Example 6.15

Let Ω_1 = Ω_2 = [0, 1], P_1 = P_2 = m_{[0,1]},

    f(x, y) = 1/x²   if 0 < y < x < 1,
    f(x, y) = 0      otherwise.

We shall see that the integral of f over the square is infinite. For this we
take a non-negative simple function dominated by f and compute its integral.
Let φ(x, y) = n if f(x, y) ∈ [n, n + 1). Then φ(x, y) = n if x > y and
x ∈ (1/√(n+1), 1/√n]. The area of this set is ½(1/n − 1/(n+1)), and

    ∫_{[0,1]²} φ dm_2 = Σ_{n=1}^∞ n · ½ (1/n − 1/(n+1)) = Σ_{n=1}^∞ 1/(2(n+1)) = ∞.

Hence the function

    g(x, y) =  1/x²   if 0 < y < x < 1,
    g(x, y) = −1/y²   if 0 < x < y < 1,
    g(x, y) =  0      otherwise

is not integrable, since the integral of g⁺ is infinite (the same is true for
the integral of g⁻).
Exercise 6.1

For g from the above example show that

    ∫_0^1 ∫_0^1 g(x, y) dx dy = −1,    ∫_0^1 ∫_0^1 g(x, y) dy dx = 1,

which shows that the iterated integrals need not be equal when the
integrability hypothesis of Fubini's theorem is violated.
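The inner integrals of g can be evaluated in closed form, which makes a quick numerical sanity check possible. The sketch below is an illustration of the computation behind the exercise (the closed-form antiderivatives in the comments are worked by hand), not a substitute for the measure-theoretic argument.

```python
# For g(x,y) = 1/x^2 on 0<y<x<1 and g(x,y) = -1/y^2 on 0<x<y<1, each inner
# integral is constant, so the two iterated integrals over [0,1]^2 differ.

def inner_dy(x):
    # ∫_0^1 g(x,y) dy = ∫_0^x x^{-2} dy - ∫_x^1 y^{-2} dy = 1/x - (1/x - 1) = 1
    return x / (x * x) - (1.0 / x - 1.0)

def inner_dx(y):
    # ∫_0^1 g(x,y) dx = -∫_0^y y^{-2} dx + ∫_y^1 x^{-2} dx = -1/y + (1/y - 1) = -1
    return -y / (y * y) + (1.0 / y - 1.0)

xs = [k / 100.0 for k in range(1, 100)]
I_dy_first = sum(inner_dy(x) for x in xs) / len(xs)   # ≈ ∫_0^1 (+1) dx = 1
I_dx_first = sum(inner_dx(y) for y in xs) / len(xs)   # ≈ ∫_0^1 (-1) dy = -1
```

The two orders of integration give 1 and −1, exactly the discrepancy the exercise asks for.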
The following proposition opens the way for many applications of product
measures and Fubini's theorem.

Proposition 6.16

Let f : ℝ → ℝ be measurable and non-negative. Consider the set of all points
in the upper half-plane lying below the graph of f:

    A_f = {(x, y) : 0 ≤ y < f(x)}.

Show that A_f is m_2-measurable and m_2(A_f) = ∫ f(x) dx.

Hint For measurability, 'fill' A_f with rectangles using the approximation of f
by simple functions. Then apply the definition of the product measure.
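The identity m_2(A_f) = ∫ f dx can be seen numerically by comparing a two-dimensional grid count of the region under the graph with a one-dimensional quadrature. The sketch assumes the example f(x) = x² on [0,1] (not from the text), for which both quantities approximate 1/3.

```python
# Grid sketch of Proposition 6.16: the m2-measure of {(x,y): 0 <= y < f(x)}
# over [0,1] matches the one-dimensional integral of f.  f(x) = x^2 assumed.

def f(x):
    return x * x

n = 1000
h = 1.0 / n
# 2D: fraction of midpoint grid cells of [0,1] x [0,1] lying under the graph
area2d = sum(1 for i in range(n) for j in range(n)
             if (j + 0.5) * h < f((i + 0.5) * h)) * h * h
# 1D: midpoint rule for ∫_0^1 f(x) dx
integral = sum(f((i + 0.5) * h) for i in range(n)) * h
```

The column-by-column count approximates each section's length f(x), which is exactly how the product measure is assembled.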
Exercise 6.2

Compute ∫_{[0,3]×[−1,2]} x²y dm_2.

Exercise 6.3

Compute the area of the region inside the ellipse x²/a² + y²/b² = 1.
6.5 Probability

6.5.1 Joint distributions

Let X, Y be two random variables defined on the same probability space
(Ω, F, P). Consider the random vector

    (X, Y) : Ω → ℝ².

Its distribution is the measure defined for the Borel sets on the plane by

    P_{(X,Y)}(B) = P((X, Y) ∈ B),    B ⊂ ℝ² Borel.

If this measure can be written as

    P_{(X,Y)}(B) = ∫_B f_{(X,Y)}(x, y) dm_2(x, y)

for some integrable f_{(X,Y)}, then we say that X, Y have a joint density.

The joint distribution determines the distributions of the one-dimensional
random variables X, Y:

    P_X(A) = P_{(X,Y)}(A × ℝ),
    P_Y(A) = P_{(X,Y)}(ℝ × A),

for Borel A ⊂ ℝ; these are called marginal distributions. If X, Y have a joint
density, then both X and Y are absolutely continuous, with densities given by

    f_X(x) = ∫_ℝ f_{(X,Y)}(x, y) dy,
    f_Y(y) = ∫_ℝ f_{(X,Y)}(x, y) dx.
The following example shows that the converse is not true in general.

Example 6.17

Let Ω = [0, 1] with P = m_{[0,1]}, and let (X, Y)(ω) = (ω, ω). This vector does
not have a density, since P_{(X,Y)}({(x, y) : x = y}) = 1, while for any
integrable function f : ℝ² → ℝ, ∫_{{(x,y): x=y}} f(x, y) dm_2(x, y) = 0; a
contradiction. However, the marginal distributions P_X, P_Y are absolutely
continuous with densities f_X = f_Y = 1_{[0,1]}.
Example 6.18

A simple example of a joint density is the uniform one: f = (1/m_2(A)) 1_A,
with Borel A ⊂ ℝ². A particular case is A = [0,1] × [0,1]; then clearly the
marginal densities are 1_{[0,1]}.

Exercise 6.4

Take A to be the square with corners at (0,1), (1,0), (2,1), (1,2). Find
the marginal densities of f = 1_A.

Exercise 6.5

Let f_{X,Y}(x, y) = (1/50)(x² + y²) if 0 < x < 2, 1 < y < 4, and zero
otherwise. Find P(X + Y > 4), P(Y > X).
The two-dimensional standard Gaussian (normal) density is given by

    n(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² − 2ρxy + y²)/(2(1 − ρ²))),    (6.5)

where −1 < ρ < 1. It can be shown that ρ is the correlation of X, Y, random
variables whose densities are the marginal densities of n(x, y) (see [10]).

Joint densities enable us to compute the distributions of various functions
of random variables. Here is an important example.
Theorem 6.19

If X, Y have joint density f_{X,Y}, then the density of their sum is given by

    f_{X+Y}(z) = ∫_ℝ f_{X,Y}(x, z − x) dx.    (6.6)

Proof

We employ the distribution function:

    F_{X+Y}(z) = P(X + Y ≤ z)
               = P_{X,Y}({(x, y) : x + y ≤ z})
               = ∫_{{(x,y): x+y≤z}} f_{X,Y}(x, y) dx dy
               = ∫_ℝ ∫_{−∞}^{z−x} f_{X,Y}(x, y) dy dx
               = ∫_{−∞}^{z} ∫_ℝ f_{X,Y}(x, y′ − x) dx dy′

(we have used the substitution y′ = y + x and Fubini's theorem), which by
differentiation gives the result. □
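Formula (6.6) can be probed numerically. The sketch below assumes the illustrative joint density f_{X,Y}(x, y) = x + y on [0,1]² (not from the text) and evaluates the defining integral by a midpoint rule; on the overlap {x ∈ [0,1], z − x ∈ [0,1]} the integrand f(x, z − x) equals z, so the exact values are easy to predict.

```python
# f_{X+Y}(z) = ∫ f_{X,Y}(x, z - x) dx for the assumed density f(x,y) = x + y
# on [0,1]^2.  The exact values are z * (overlap length):
# 0.25, 1.0, 0.75 at z = 0.5, 1.0, 1.5.

def f_joint(x, y):
    return x + y if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

def f_sum(z, n=2000):
    h = 1.0 / n                      # x ranges over [0, 1]
    return sum(f_joint((k + 0.5) * h, z - (k + 0.5) * h) for k in range(n)) * h

vals = [f_sum(z) for z in (0.5, 1.0, 1.5)]
```

One can also check that f_sum integrates to 1 over [0, 2], as a density must.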
Exercise 6.6

Find f_{X+Y} if f_{X,Y} = 1_{[0,1]×[0,1]}.
6.5.2 Independence again

Suppose that the random variables X, Y are independent. Then for a Borel
rectangle B = B_1 × B_2 we have

    P_{(X,Y)}(B_1 × B_2) = P((X, Y) ∈ B_1 × B_2)
                         = P((X ∈ B_1) ∩ (Y ∈ B_2))
                         = P(X ∈ B_1) P(Y ∈ B_2)
                         = P_X(B_1) P_Y(B_2),

and so the distribution P_{(X,Y)} coincides with the product measure P_X × P_Y
on rectangles; therefore they are the same. The converse is also true:
Theorem 6.20

The random variables X, Y are independent if and only if

    P_{(X,Y)} = P_X × P_Y.

Proof

The 'only if' part was shown above. Suppose that P_{(X,Y)} = P_X × P_Y, and
take any Borel sets B_1, B_2. The same computation shows that
P((X ∈ B_1) ∩ (Y ∈ B_2)) = P(X ∈ B_1) P(Y ∈ B_2), i.e. X and Y are
independent. □

We have a useful version of this theorem in the case of absolutely continuous
random variables.

Theorem 6.21

If X, Y have a joint density, then they are independent if and only if

    f_{(X,Y)}(x, y) = f_X(x) f_Y(y).    (6.7)

If X and Y are absolutely continuous and independent, then they have a joint
density, and it is given by (6.7).
Proof

Suppose f_{(X,Y)} is the joint density of X, Y. If they are independent, then

    ∫_{B_1×B_2} f_{(X,Y)}(x, y) dm_2(x, y) = P((X, Y) ∈ B_1 × B_2)
        = P(X ∈ B_1) P(Y ∈ B_2)
        = ∫_{B_1} f_X(x) dm(x) ∫_{B_2} f_Y(y) dm(y)
        = ∫_{B_1×B_2} f_X(x) f_Y(y) dm_2(x, y),

which implies (6.7). The same computation shows the converse:

    P((X, Y) ∈ B_1 × B_2) = ∫_{B_1×B_2} f_{(X,Y)}(x, y) dm_2(x, y)
        = ∫_{B_1×B_2} f_X(x) f_Y(y) dm_2(x, y)
        = P(X ∈ B_1) P(Y ∈ B_2).

For the final claim, note that the function f_X(x) f_Y(y) plays the role of the
joint density if X and Y are independent. □
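Theorem 6.21 can be illustrated numerically with a density that does factorize. The assumed example below (not from the text) is f(x, y) = 4xy on [0,1]², whose marginals are 2x and 2y; by symmetry the same routine computes both marginals.

```python
# Checking (6.7) for the assumed joint density f(x,y) = 4xy on [0,1]^2:
# the marginals are f_X(x) = 2x, f_Y(y) = 2y, and f factorizes, so X and Y
# are independent.

def f_joint(x, y):
    return 4.0 * x * y if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

def marginal(t, n=2000):
    # Integrate out the other variable; by symmetry of f_joint this gives
    # both f_X(t) and f_Y(t).
    h = 1.0 / n
    return sum(f_joint(t, (k + 0.5) * h) for k in range(n)) * h

pts = [(0.2, 0.7), (0.5, 0.5), (0.9, 0.1)]
gaps = [abs(f_joint(x, y) - marginal(x) * marginal(y)) for x, y in pts]
```

The gaps are zero up to quadrature error; for a density that does not factorize (e.g. one supported on a triangle), the same check would fail at interior points.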
Corollary 6.22

If jointly Gaussian random variables are orthogonal (uncorrelated), then they
are independent.

Proof

Inserting ρ = 0 into (6.5), we immediately see that the two-dimensional
Gaussian density is the product of the one-dimensional ones. □
Proposition 6.23

The density of the sum of independent random variables with densities f_X, f_Y
is given by the convolution

    f_{X+Y}(z) = ∫_ℝ f_X(x) f_Y(z − x) dx.

Exercise 6.7

Suppose that the joint density of X, Y is 1_A, where A is the square with
corners at (0,1), (1,0), (2,1), (1,2). Are X, Y independent?

Exercise 6.8

Find P(Y > X) and P(X + Y > 1), if X, Y are independent with
f_X = 1_{[0,1]}, f_Y = ½ · 1_{[0,2]}.
6.5.3 Conditional probability

We consider the case of two random variables X, Y with joint density
f_{X,Y}(x, y). Given Borel sets A, B with P(X ∈ A) > 0, we compute

    P(Y ∈ B | X ∈ A) = P(X ∈ A, Y ∈ B) / P(X ∈ A)
                     = ∫_{A×B} f_{(X,Y)}(x, y) dm_2(x, y) / ∫_A f_X(x) dm(x)
                     = ∫_B ( ∫_A f_{(X,Y)}(x, y) dx ) dy / ∫_A f_X(x) dx,

using Fubini's theorem. So the conditional distribution of Y given X ∈ A has
a density

    h(y | X ∈ A) = ∫_A f_{(X,Y)}(x, y) dx / ∫_A f_X(x) dx.

The case where A = {a} does not make sense here, since then we would have
zero in the denominator. However, formally we may put

    h(y | X = a) = f_{(X,Y)}(a, y) / f_X(a),

which makes sense provided f_X(a) ≠ 0. This restriction turns out to be
irrelevant, since
    P((X, Y) ∈ {(x, y) : f_X(x) = 0})
        = ∫_{{(x,y): f_X(x)=0}} f_{(X,Y)}(x, y) dx dy
        = ∫_{{x: f_X(x)=0}} ( ∫_ℝ f_{(X,Y)}(x, y) dy ) dx
        = ∫_{{x: f_X(x)=0}} f_X(x) dx
        = 0.
We may thus define the conditional probability of Y ∈ B given X = a by
means of h(y | X = a), which we briefly write as h(y | a):

    P(Y ∈ B | X = a) = ∫_B h(y | a) dy,

and the conditional expectation

    E(Y | X = a) = ∫_ℝ y h(y | a) dy.

This can be viewed as a random variable with X as the source of randomness.
Namely, for ω ∈ Ω we write

    E(Y | X)(ω) = ∫_ℝ y h(y | X(ω)) dy.

This function is of course measurable with respect to the σ-field generated
by X.
The expectation of this random variable can be computed using Fubini's
theorem:

    E(E(Y | X)) = E( ∫_ℝ y h(y | X(ω)) dy )
                = ∫_ℝ ∫_ℝ y h(y | x) dy f_X(x) dx
                = ∫_ℝ ∫_ℝ y f_{(X,Y)}(x, y) dx dy
                = ∫_ℝ y f_Y(y) dy
                = E(Y).

More generally, for A ⊂ Ω, A = X⁻¹(B), B Borel,

    ∫_A E(Y | X) dP = ∫_Ω 1_B(X) E(Y | X) dP
        = ∫_Ω 1_B(X(ω)) ( ∫_ℝ y h(y | X(ω)) dy ) dP(ω)
        = ∫_ℝ ∫_ℝ 1_B(x) y h(y | x) dy f_X(x) dx
        = ∫_ℝ ∫_ℝ 1_B(x) y f_{(X,Y)}(x, y) dx dy
        = ∫_Ω 1_B(X) Y dP
        = ∫_A Y dP.
This provides a motivation for a general notion of conditional expectation of
a random variable Y, given a random variable X: E(Y | X) is a random variable,
measurable with respect to the σ-field F_X generated by X, and such that for
all A ∈ F_X

    ∫_A E(Y | X) dP = ∫_A Y dP.

We will pursue these ideas further in the next chapter. Recall that we have
already outlined a general construction in Section 5.4.3, using the existence
of orthogonal projections in L².
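The identity E(E(Y|X)) = E(Y) derived above is easy to test numerically. The sketch below assumes the joint density f(x, y) = 8xy on the triangle 0 < y < x < 1 (an illustrative choice, not from the text); integrating by hand gives f_X(x) = 4x³, h(y|x) = 2y/x², E(Y|X = x) = 2x/3, f_Y(y) = 4y(1 − y²), and both sides of the identity equal 8/15.

```python
# Check E(E(Y|X)) = E(Y) for the assumed density f(x,y) = 8xy on 0 < y < x < 1.
# Hand computation: f_X(x) = 4x^3, E(Y|X=x) = 2x/3, f_Y(y) = 4y(1-y^2),
# and both expectations equal 8/15.

def cond_exp(x, m=1000):
    h = x / m                        # ∫_0^x y * 8xy dy by the midpoint rule
    num = sum(8.0 * x * ((k + 0.5) * h) ** 2 for k in range(m)) * h
    return num / (4.0 * x ** 3)      # divide by f_X(x)

n = 1000
h = 1.0 / n
xs = [(k + 0.5) * h for k in range(n)]
E_tower = sum(cond_exp(x) * 4.0 * x ** 3 for x in xs) * h   # E(E(Y|X))
E_Y = sum(y * 4.0 * y * (1.0 - y * y) for y in xs) * h      # E(Y) via f_Y
```

The two routes to the expectation (through the conditional density and through the marginal of Y) agree, as the Fubini computation predicts.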
Exercise 6.9

Let f_{X,Y} = 1_A, where A is the triangle with corners at (0,0), (2,0),
(0,1). Find the conditional density h(y | x) and the conditional expectation
E(Y | X = 1).

Exercise 6.10

Let f_{X,Y}(x, y) = (x + y) 1_A, where A = [0,1] × [0,1]. Find E(X | Y = y)
for each y ∈ ℝ.
6.5.4 Characteristic functions determine distributions

We now have sufficient tools to prove a fundamental property of characteristic
functions.

Theorem 6.24 (inversion formula)

If the cumulative distribution function of a random variable X is continuous
at a, b ∈ ℝ, a < b, then

    F_X(b) − F_X(a) = lim_{c→∞} (1/2π) ∫_{−c}^{c} (e^{−iua} − e^{−iub})/(iu) φ_X(u) du.
Proof

First, by the definition of φ_X,

    (1/2π) ∫_{−c}^{c} (e^{−iua} − e^{−iub})/(iu) φ_X(u) du
      = (1/2π) ∫_{−c}^{c} (e^{−iua} − e^{−iub})/(iu) ∫_ℝ e^{iux} dP_X(x) du.

We may apply Fubini's theorem, since

    |(e^{−iua} − e^{−iub})/(iu) · e^{iux}| = |∫_a^b e^{−iux} dx| ≤ b − a,

which is integrable with respect to P_X × m_{[−c,c]}. We compute the integral
in u:

    (1/2π) ∫_{−c}^{c} (e^{iu(x−a)} − e^{iu(x−b)})/(iu) du
      = (1/2π) ∫_{−c}^{c} (sin u(x−a) − sin u(x−b))/u du
        + (1/2π) ∫_{−c}^{c} (cos u(x−a) − cos u(x−b))/(iu) du.

The second integral vanishes, since the integrand is an odd function of u. We
change variables in the first: y = u(x − a), z = u(x − b), and then it takes
the form

    I(x, c) = (1/2π) ∫_{−c(x−a)}^{c(x−a)} (sin y)/y dy
              − (1/2π) ∫_{−c(x−b)}^{c(x−b)} (sin z)/z dz
            = I_1(x, c) − I_2(x, c),
say. We employ the following elementary fact without proof:

    ∫_s^t (sin y)/y dy → π    as t → ∞, s → −∞.

Consider the following cases:

1. x < a: then also x < b, and c(x−a) → −∞, c(x−b) → −∞, −c(x−a) → ∞,
−c(x−b) → ∞ as c → ∞. Hence I_1(x, c) → −½, I_2(x, c) → −½, and so
I(x, c) → 0.

2. x > b: then also x > a, and c(x−a) → ∞, c(x−b) → ∞, −c(x−a) → −∞,
−c(x−b) → −∞ as c → ∞, so I_1(x, c) → ½, I_2(x, c) → ½, and the result
is the same as in 1.

3. a < x < b: here I_1(x, c) → ½, I_2(x, c) → −½, and the limit of the whole
expression is 1.
Write f(x) = lim_{c→∞} I(x, c) (we have not discussed the values x = a, x = b,
but they are irrelevant, as will be seen). Then

    lim_{c→∞} (1/2π) ∫_{−c}^{c} (e^{−iua} − e^{−iub})/(iu) φ_X(u) du
      = lim_{c→∞} ∫_ℝ I(x, c) dP_X(x)
      = ∫_ℝ f(x) dP_X(x)

by Lebesgue's dominated convergence theorem. The integral of f can be easily
computed, since f is a simple function:

    ∫_ℝ f(x) dP_X(x) = P_X((a, b]) = F_X(b) − F_X(a)

(P_X({a}) = P_X({b}) = 0, since F_X is continuous at a and b). □
Corollary 6.25

The characteristic function determines the probability distribution.

Proof

Since F_X is monotone, it is continuous except (possibly) at countably many
points, where it is right-continuous. Its values at discontinuity points can be
approximated from above by its values at continuity points. The latter are
determined by the characteristic function via the inversion formula.

Finally, we see that F_X determines the measure P_X. This is certainly so
for B = (a, b]: P_X((a, b]) = F_X(b) − F_X(a). Next we show the same for any
interval, then for finite unions of intervals, and the final extension to any
Borel set is via the monotone class theorem. □
Theorem 6.26

If φ_X is integrable, then X has a density, which is given by

    f_X(x) = (1/2π) ∫_{−∞}^{∞} e^{−iux} φ_X(u) du.
Proof

The function f_X is well defined. To show that it is a density of X, we first
show that it gives the right values for the probability distribution of
intervals (a, b] at whose endpoints F_X is continuous:

    ∫_a^b f_X(x) dx = (1/2π) ∫_{−∞}^{∞} φ_X(u) ( ∫_a^b e^{−iux} dx ) du
      = lim_{c→∞} (1/2π) ∫_{−c}^{c} φ_X(u) ( ∫_a^b e^{−iux} dx ) du
      = lim_{c→∞} (1/2π) ∫_{−c}^{c} φ_X(u) (e^{−iua} − e^{−iub})/(iu) du
      = F_X(b) − F_X(a)

by the inversion formula. This extends to all a, b, since F_X is
right-continuous and the integral on the left is continuous with respect to a
and b. Moreover, F_X is non-decreasing, so ∫_a^b f_X(x) dx ≥ 0 for all a ≤ b,
hence f_X ≥ 0. Finally,

    ∫_{−∞}^{∞} f_X(x) dx = lim_{b→∞} F_X(b) − lim_{a→−∞} F_X(a) = 1,

so f_X is a density. □
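Theorem 6.26 is easy to test numerically for the standard normal law, whose characteristic function e^{−u²/2} is integrable; truncating the integral at |u| = 10 is harmless, since the tail is below e^{−50}. This is an illustrative sketch, not part of the text.

```python
import cmath
import math

def phi(u):
    # characteristic function of N(0,1)
    return math.exp(-0.5 * u * u)

def density(x, U=10.0, n=20000):
    """(1/2π) ∫_{-U}^{U} e^{-iux} φ(u) du by the midpoint rule."""
    h = 2.0 * U / n
    s = 0.0 + 0.0j
    for k in range(n):
        u = -U + (k + 0.5) * h
        s += cmath.exp(-1j * u * x) * phi(u)
    return (s * h / (2.0 * math.pi)).real

f0 = density(0.0)   # expect 1/sqrt(2*pi)
f1 = density(1.0)   # expect exp(-1/2)/sqrt(2*pi)
```

The recovered values match the standard normal density, confirming the inversion formula for this example.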
6.5.5 Application to mathematical finance

Classical portfolio theory is concerned with an analysis of the balance between
risk and return. This balance is of fundamental importance, particularly in
corporate finance, where the key concept is the cost of capital, which is a
rate of return based on the level of risk of an investment. In probabilistic
terms, return is represented by the expectation and risk by the variance. A
theory which deals only with two moments of a random variable is relevant if we
assume the normal (Gaussian) distribution of the random variables in question,
since in that case these two moments determine the distribution uniquely. We
give a brief account of the basic facts of portfolio theory under this
assumption.

Let k be the return on some investment over a single period, that is,
k(ω) = (V(1)(ω) − V(0))/V(0), where V(0) is the known amount invested at the
beginning and V(1) is the random terminal value. A typical example which should
be kept in mind is buying and selling one share of some stock. With a number of
stocks available, we are facing a sequence (k_i) of returns on stocks S_i,
k_i = (S_i(1) − S_i(0))/S_i(0), but for simplicity we restrict our attention to
just two, k_1, k_2. A portfolio is formed by deciding the percentage split,
between holdings in S_1 and S_2, of the initial wealth V(0), by choosing the
weights w = (w_1, w_2), w_1 + w_2 = 1. Then, as is well known and elementary to
verify, the portfolio of n_1 = w_1 V(0)/S_1(0) shares of stock number one and
n_2 = w_2 V(0)/S_2(0) shares of stock number two has return

    k_w = w_1 k_1 + w_2 k_2.

We assume that the vector (k_1, k_2) is jointly normal with correlation
coefficient ρ. We denote the expectations and variances of the ingredients by
μ_i = E(k_i), σ_i² = Var(k_i). It is convenient to introduce the following
matrix:

    C = [ c_11  c_12 ]
        [ c_21  c_22 ],

where c_ii = σ_i² and c_12 = c_21 is the covariance between k_1 and k_2.
Assume (which is not elegant to do, but saves us an algebraic detour) that C is
invertible, with C⁻¹ = [d_ij]. By definition, the joint density has the form

    n(x_1, x_2) = (1/(2π√(det C))) exp(−½ Σ_{i,j=1}^{2} d_ij (x_i − μ_i)(x_j − μ_j)).

It is easy to see that (6.5) is a particular case of this formula, with
μ_i = 0, σ_i = 1, −1 < ρ < 1. It is well known that the characteristic function
φ(t_1, t_2) = E(exp{i(t_1 k_1 + t_2 k_2)}) of the vector (k_1, k_2) is of the
form

    φ(t_1, t_2) = exp{ i Σ_{i=1}^{2} t_i μ_i − ½ Σ_{i,j=1}^{2} c_ij t_i t_j }.    (6.8)
We shall show that the return on the portfolio is also normally distributed
and we shall find the expectation and standard deviation. This can all be done
in one step.
Theorem 6.27

The characteristic function φ_w of k_w is of the form

    φ_w(t) = exp{ it Σ_{i=1}^{2} w_i μ_i − ½ t² Σ_{i,j=1}^{2} c_ij w_i w_j }.

Proof

By definition φ_w(t) = E(exp{it k_w}), and using the form of k_w we have

    φ_w(t) = E(exp{it(w_1 k_1 + w_2 k_2)})
           = E(exp{itw_1 k_1 + itw_2 k_2})
           = φ(tw_1, tw_2)

by the definition of the characteristic function of a vector. Since the vector
is normal, (6.8) immediately gives the result. □

The multi-dimensional version of Corollary 6.25 (which is easy to believe
after mastering the one-dimensional case, but slightly tedious to prove, so we
take it for granted, referring the reader to any probability textbook) shows
that k_w has normal distribution with

    μ_w = w_1 μ_1 + w_2 μ_2,    σ_w² = Σ_{i,j=1}^{2} w_i w_j c_ij.

The fact that the variance of a portfolio can be lower than the variances of
its components is crucial. These formulae are valid in the general case (i.e.
without the assumption of a normal distribution) and can be easily proved using
the formula for k_w. The main goal of this section was to see that the
portfolio return is normally distributed.
Example 6.28

Suppose that the second component is not random, i.e. S_2(1) is a constant
independent of ω. Then the return k_2 is risk-free and is denoted by r (the
notation is usually reserved for the case where the length of the period is one
year). It can be thought of as a bank account, and it is convenient to assume
that S_2(0) = 1. Then the portfolio of n shares purchased at the price S_1(0)
and m units of the bank account has the value V(1) = nS_1(1) + m(1 + r) at the
end of the period, and the expected return is μ_w = w_1 μ_1 + w_2 r, with
w_1 = nS_1(0)/V(0), w_2 = 1 − w_1. The assumption of jointly normal returns is
violated, but the standard deviation of this portfolio can be easily computed
directly from the definition, giving σ_w = w_1 σ_1 (σ_2 = 0 of course, and the
formula is consistent with the above).
Remark 6.29

The above considerations can be immediately generalised to portfolios built of
any finite number of ingredients, with the following key formulae:

    k_w = Σ_i w_i k_i,    μ_w = Σ_i w_i μ_i,    σ_w² = Σ_{i,j} w_i w_j c_ij.

This is just the beginning of the story started in the 1950s by Nobel prize
winner Harry Markowitz. A vast number of papers and books on this topic have
been written since, proving the general observation that 'simple is beautiful'.
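The key formulae of Remark 6.29 can be sketched in a few lines. The three-asset numbers below (expected returns, covariance matrix, weights) are assumed for illustration only; the point is that the portfolio variance comes out below every individual variance when the correlations are small.

```python
# Remark 6.29's formulae: mu_w = sum w_i mu_i, sigma_w^2 = sum w_i w_j c_ij.
# All numbers are assumed example data.

mu = [0.08, 0.10, 0.15]           # expected returns mu_i
C = [[0.04, 0.01, 0.00],          # covariance matrix (c_ij); c_ii = sigma_i^2
     [0.01, 0.09, 0.02],
     [0.00, 0.02, 0.25]]
w = [0.5, 0.3, 0.2]               # portfolio weights, summing to 1

mu_w = sum(wi * mi for wi, mi in zip(w, mu))
var_w = sum(w[i] * w[j] * C[i][j] for i in range(3) for j in range(3))

# Here var_w = 0.0335 < 0.04 = min sigma_i^2: diversification reduces risk.
```

This is the diversification effect the text calls crucial: the quadratic form mixes in the (small) covariances, pulling the portfolio variance below the smallest single-asset variance.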
6.6 Proofs of propositions

Proof (of Proposition 6.3)

Denote by F_R the σ-field generated by the Borel 'rectangles'
R = {B_1 × B_2 : B_1, B_2 ∈ B}, and by F_I the σ-field generated by the true
rectangles I = {I_1 × I_2 : I_1, I_2 are intervals}.

Since I ⊂ R, obviously F_I ⊂ F_R.

To show the reverse inclusion, we show that the Borel cylinders B_1 × Ω_2 and
Ω_1 × B_2 are in F_I. For that, write D = {A : A × Ω_2 ∈ F_I}; note that this
is a σ-field containing all intervals, hence B ⊂ D, as required. □

Proof (of Proposition 6.16)

Let s_n = Σ_k c_k 1_{A_k} be an increasing sequence of simple functions
convergent to f. Let R_k = A_k × [0, c_k]; the measure of the union of such
rectangles is in fact ∫ s_n dm. Then ∪_{n=1}^∞ ∪_k R_k = A_f, so A_f is
measurable.

For the second claim, take the section of A_f at x, which is the interval
[0, f(x)). Its measure is f(x), and by the definition of the product measure
m_2(A_f) = ∫ f(x) dx. □

Proof (of Proposition 6.23)

The joint density is the product of the densities, f_{X,Y}(x, y) = f_X(x) f_Y(y),
and substituting this into (6.6) immediately gives the result. □
7

The Radon-Nikodym theorem

In this chapter we shall consider the relationship between a real Borel
measure ν and the Lebesgue measure m. Key to such relationships is
Theorem 4.24, which shows that for each non-negative integrable real function
f, the set function

    A ↦ ν(A) = ∫_A f dm    (7.1)

defines a (Borel) measure ν on (ℝ, M). The natural question to ask is the
converse: exactly which real Borel measures can be found in this way? We shall
find a complete answer to this question in this chapter, and, in keeping with
our approach in Chapters 5 and 6, we shall phrase our results in terms of
general measures on an abstract set Ω.
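As a concrete instance of (7.1), the sketch below takes the assumed density f(x) = 2x on [0,1] and checks two defining properties of the induced set function ν on intervals: additivity over a split, and vanishing on an m-null set.

```python
# Sketch of (7.1): nu([a,b]) = ∫_a^b 2x dx for the assumed density f(x) = 2x
# on [0,1].  Exactly, nu([a,b]) = b^2 - a^2.

def nu(a, b, n=1000):
    # midpoint rule for ∫_a^b 2x dx (exact for linear integrands)
    h = (b - a) / n
    return sum(2.0 * (a + (k + 0.5) * h) for k in range(n)) * h

whole = nu(0.0, 1.0)                  # = 1, so nu is a probability measure
split = nu(0.0, 0.3) + nu(0.3, 1.0)  # additivity over a partition
null = nu(0.5, 0.5)                  # a single point is m-null, so nu = 0
```

The question the chapter answers is the converse: which measures ν arise from some density in this way.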
7.1 Densities and conditioning

The results we shall develop in this chapter also allow us to study probability
densities (introduced in Section 4.7.2), conditional expectations and
conditional probabilities (see Sections 5.4.3 and 6.5.3) in much greater
detail. For ν as defined above to be a probability measure, we clearly require
∫ f dm = 1. In particular, if ν = P_X is the distribution of a random variable
X, the function f = f_X corresponding to ν in (7.1) was called the density
of X.

In similar fashion we defined the joint density f_{(X,Y)} of two random
variables in Section 6.5.1, by reference of their joint distribution to
two-dimensional
Lebesgue measure m_2: if X and Y are real random variables defined on some
probability space (Ω, F, P), their joint distribution is the measure defined on
Borel subsets B of ℝ² by P_{(X,Y)}(B) = P((X, Y) ∈ B). In the special case
where this measure, relative to m_2, is given as above by an integrable
function f_{(X,Y)}, we say that X and Y have this function as their joint
density.

This, in turn, leads naturally (see Section 6.5.3) to the concepts of the
conditional density

    h(y|a) = h(y|X = a) = f_{(X,Y)}(a, y) / f_X(a)

and the conditional expectation

    E(Y|X = a) = ∫_ℝ y h(y|a) dy.

Recalling that X : Ω → ℝ, the last equation can be written as
E(Y|X)(ω) = ∫_ℝ y h(y|X(ω)) dy, displaying the conditional expectation as a
random variable E(Y|X) : Ω → ℝ, measurable with respect to the σ-field F_X
generated by X. An application of Fubini's theorem leads to a fundamental
identity, valid for all A ∈ F_X:

    ∫_A E(Y|X) dP = ∫_A Y dP.    (7.2)

The existence of this random variable in the general case, irrespective of the
existence of a joint density, is of great importance in both theory and
applications - Williams [13] calls it 'the central definition of modern
probability'. It is essential for the concept of a martingale, which plays such
a crucial role in many applications, and which we introduce at the end of this
chapter.

As we described in Section 5.4.3, the existence of orthogonal projections
in L² allows one to extend the scope of the definition further still: instead
of restricting ourselves to random variables measurable with respect to
σ-fields of the form F_X, we specify any sub-σ-field G of F and ask for a
G-measurable random variable E(Y|G) to play the role of E(Y|X) = E(Y|F_X) in
(7.2). As was the case for product measures, the most natural context for
establishing the properties of the conditional expectation is that of general
measures; note that the proof of Theorem 4.24 simply required monotone
convergence to establish the countable additivity of ν. We therefore develop
the comparison of abstract measures further, as always guided by the specific
examples of random variables and distributions.
7.2 The Radon-Nikodym theorem

In the special case where the measure ν has the form ν(A) = ∫_A f dm for some
non-negative integrable function f, we said (Section 4.7.2) that ν is
absolutely continuous with respect to m. It is immediate that ∫_A f dm = 0
whenever m(A) = 0 (see Theorem 4.7 (iv)). Hence m(A) = 0 implies ν(A) = 0 when
the measure ν is given by a density. We use this as a definition for the
general case of two given measures.

Definition 7.1

Let Ω be a set and let F be a σ-field of its subsets. (The pair (Ω, F) is a
measurable space.) Suppose that ν and μ are measures on (Ω, F). We say that
ν is absolutely continuous with respect to μ if μ(A) = 0 implies ν(A) = 0 for
A ∈ F. We write this as ν ≪ μ.

Exercise 7.1

Let λ_1, λ_2 and μ be measures on (Ω, F). Show that if λ_1 ≪ μ and λ_2 ≪ μ,
then (λ_1 + λ_2) ≪ μ.

It will not be immediately obvious what this definition has to do with the
usual notion of continuity of functions. We shall see later in this chapter how
it fits with the concept of absolute continuity of real functions. For the
present, we note the following reformulation of the definition, which is not
needed for the main result we will prove, but serves to bring the relationship
between ν and μ a little 'closer to home' and is useful in many applications:

Proposition 7.2

Let ν and μ be finite measures on the measurable space (Ω, F). Then ν ≪ μ if
and only if for every ε > 0 there exists a δ > 0 such that for F ∈ F,
μ(F) < δ implies ν(F) < ε.

Hint Suppose the (ε, δ)-condition fails. We can then find ε > 0 and sets (F_n)
such that for all n ≥ 1, μ(F_n) < 1/2ⁿ but ν(F_n) > ε. Consider μ(A) and ν(A)
for A = ∩_{n≥1} (∪_{i≥n} F_i).

We generalise from the special case of Lebesgue measure: if μ is any measure
on (Ω, F) and f : Ω → ℝ is a non-negative measurable function for which
∫ f dμ exists, then ν(F) = ∫_F f dμ defines a measure ν ≪ μ. (This follows
exactly as for m, since μ(F) = 0 implies ∫_F f dμ = 0. Note that we employ the
convention 0 × ∞ = 0.)

For σ-finite measures, the following key result asserts the converse:
190
Measure, Integral and Probability
Theorem 7.3 (Radon-Nikodym)
Given two σ-finite measures ν, μ on a measurable space (Ω, F), with ν ≪ μ,
there is a non-negative measurable function h : Ω → ℝ such that ν(F) =
∫_F h dμ for every F ∈ F. The function h is unique up to μ-null sets: if g also
satisfies ν(F) = ∫_F g dμ for all F ∈ F, then g = h a.e. (μ).
Since the most interesting case for applications arises for probability spaces,
and then h ∈ L¹(μ), we shall initially restrict attention to the case where μ and
ν are finite measures. In fact, it is helpful initially to take μ to be a probability
measure, i.e. μ(Ω) = 1. From among several different approaches to this very
important theorem, we base our argument on one given by R.C. Bradley in
the American Mathematical Monthly (Vol. 96, no. 5, May 1989, pp. 437-440),
since it offers the most 'constructive' and elementary treatment of which we
are aware.
It is instructive to begin with a special case: suppose (until further notice)
that μ(Ω) = 1. We say that the measure μ dominates ν when 0 ≤ ν(F) ≤ μ(F)
for every F ∈ F. This obviously implies ν ≪ μ. In this simplified situation we
shall construct the required function h explicitly. First we generalise the idea of
partitions and their refinements, which we used to good effect in constructing
the Riemann integral, to measurable subsets in (Ω, F).
Definition 7.4
Let (Ω, F) be a measurable space. A finite (measurable) partition of Ω is a
finite collection of disjoint subsets P = (A_i)_{i≤n} in F whose union is Ω. The
finite partition P′ is a refinement of P if each set in P is a disjoint union of
sets in P′.
Exercise 7.2
Let P₁ and P₂ be finite partitions of Ω. Show that the coarsest partition
(i.e. with the least number of sets) which refines them both consists of all
intersections A ∩ B, where A ∈ P₁, B ∈ P₂.
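The construction in Exercise 7.2 is easy to carry out concretely. A minimal sketch, with two invented partitions of a twelve-point set, forms all pairwise intersections (discarding empty ones) and checks that the result is a partition refining both:

```python
# Sketch of Exercise 7.2: the coarsest common refinement of two finite
# partitions consists of the non-empty intersections A ∩ B.
def common_refinement(p1, p2):
    return [a & b for a in p1 for b in p2 if a & b]

omega = set(range(12))
p1 = [set(range(0, 6)), set(range(6, 12))]                     # halves
p2 = [set(range(0, 4)), set(range(4, 8)), set(range(8, 12))]   # thirds

q = common_refinement(p1, p2)
# q is again a partition: disjoint cells whose union is omega ...
assert set().union(*q) == omega
assert sum(len(c) for c in q) == len(omega)
# ... and it refines both p1 and p2: each cell lies inside a set of each.
assert all(any(c <= a for a in p1) and any(c <= b for b in p2) for c in q)
```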
The following is a simplified 'Radon-Nikodym theorem' for dominated measures:
Theorem 7.5
Suppose that μ(Ω) = 1 and 0 ≤ ν(F) ≤ μ(F) for every F ∈ F. Then there
exists a non-negative F-measurable function h on Ω such that ν(F) = ∫_F h dμ
for all F ∈ F.
We shall prove this in three steps: in Step 1 we define the required function
h_P for sets in a finite partition P and compare the functions h_{P₁} and h_{P₂}
when the partition P₂ refines P₁. This enables us to show that the integrals
∫_Ω h_P² dμ are non-decreasing if we take successive refinements. Since they are
also bounded above (by 1), c = sup_P ∫_Ω h_P² dμ exists in ℝ. In Step 2 we then
construct the desired function h by a careful limit argument, using the convergence
theorems of Chapter 4. In Step 3 we show that h has the desired properties.
Step 1. The function h_P for a finite partition.
Suppose that 0 ≤ ν(F) ≤ μ(F) for every F ∈ F. Let P = {A₁, A₂, ..., A_k}
be a finite partition of Ω such that each A_i ∈ F. Define the simple function
h_P : Ω → ℝ by setting

h_P(ω) = c_i = ν(A_i)/μ(A_i)

for ω ∈ A_i when μ(A_i) > 0, and h_P(ω) = 0 otherwise.
Since h_P is constant on each 'atom' A_i, ν(A_i) = ∫_{A_i} h_P dμ. Then h_P has the
following properties:
(i) For each finite partition P of Ω, 0 ≤ h_P(ω) ≤ 1 for all ω ∈ Ω.
(ii) If A = ⋃_{j∈J} A_j for an index set J ⊆ {1, 2, ..., k} then ν(A) = ∫_A h_P dμ.
Thus ν(Ω) = ∫_Ω h_P dμ.
(iii) If P₁ and P₂ are finite partitions of Ω and P₂ refines P₁ then, with
h_n = h_{P_n} (n = 1, 2), we have
(a) for all A ∈ P₁, ∫_A h₁ dμ = ν(A) = ∫_A h₂ dμ,
(b) for all A ∈ P₁, ∫_A h₁h₂ dμ = ∫_A h₁² dμ.
(iv) ∫_Ω (h₂² − h₁²) dμ = ∫_Ω (h₂ − h₁)² dμ and therefore

∫_Ω h₂² dμ = ∫_Ω h₁² dμ + ∫_Ω (h₂ − h₁)² dμ ≥ ∫_Ω h₁² dμ.

We now prove these assertions in turn.
(i) This is trivial by construction of h_P, since μ dominates ν.
(ii) Let A = ⋃_{j∈J} A_j for some index set J ⊆ {1, 2, ..., k}. Since the {A_j} are
disjoint and ν(A_j) = 0 whenever μ(A_j) = 0, we have

ν(A) = Σ_{j∈J} ν(A_j) = Σ_{j∈J, μ(A_j)>0} [ν(A_j)/μ(A_j)] μ(A_j)
     = Σ_{j∈J, μ(A_j)>0} c_j μ(A_j) = Σ_{j∈J} ∫_{A_j} h_P dμ = ∫_A h_P dμ.

In particular, since P partitions Ω, this holds for A = Ω.
(iii) (a) With the P_n, h_n as above (n = 1, 2) we can write A = ⋃_{j∈J} B_j for each
A ∈ P₁, where J is a finite index set and B_j ∈ P₂. The sets B_j are pairwise
disjoint, and again ν(B_j) = 0 when μ(B_j) = 0, so that

∫_A h₁ dμ = ν(A) = Σ_{j∈J} ν(B_j) = Σ_{j∈J, μ(B_j)>0} [ν(B_j)/μ(B_j)] μ(B_j)
          = Σ_{j∈J} ∫_{B_j} h₂ dμ = ∫_A h₂ dμ.

(b) With A as in part (a) and μ(A) > 0, note that h₁ = ν(A)/μ(A) is constant
on A, so that

∫_A h₁h₂ dμ = [ν(A)/μ(A)] ∫_A h₂ dμ = [ν(A)]²/μ(A) = ∫_A [ν(A)/μ(A)]² dμ = ∫_A h₁² dμ.
(iv) By (iii) (b), ∫_A h₁(h₂ − h₁) dμ = 0 for every A ∈ P₁. Since the sets of P₁
partition Ω, we also have ∫_Ω h₁(h₂ − h₁) dμ = 0. Hence

∫_Ω (h₂ − h₁)² dμ = ∫_Ω (h₂² − 2h₁h₂ + h₁²) dμ
                 = ∫_Ω [h₂² − 2h₁(h₂ − h₁) − h₁²] dμ
                 = ∫_Ω (h₂² − h₁²) dμ,

and thus ∫_Ω h₂² dμ = ∫_Ω h₁² dμ + ∫_Ω (h₂ − h₁)² dμ.
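Before passing to the limit, the identities of Step 1 can be checked numerically. The sketch below uses an invented dominated pair (μ, ν) on a six-point space and a partition P₂ refining P₁; exact rational arithmetic confirms property (iv) and the resulting monotonicity of ∫ h_P² dμ under refinement:

```python
from fractions import Fraction

# Invented example: mu is a probability measure on {0,...,5}, nu is
# dominated by mu (nu({w}) <= mu({w}) pointwise, hence on every set).
mu = {w: Fraction(1, 6) for w in range(6)}
nu = {0: Fraction(1, 12), 1: Fraction(1, 12), 2: Fraction(1, 6),
      3: Fraction(0), 4: Fraction(1, 24), 5: Fraction(1, 24)}

def h_on_atoms(partition):
    """h_P is constant on each atom A, equal to nu(A)/mu(A)."""
    h = {}
    for atom in partition:
        m = sum(mu[w] for w in atom)
        n = sum(nu[w] for w in atom)
        c = n / m if m else Fraction(0)
        for w in atom:
            h[w] = c
    return h

def integral(f, g=None):
    """Compute the integral of f (or f*g) with respect to mu."""
    return sum(f[w] * (g[w] if g else 1) * mu[w] for w in mu)

p1 = [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
p2 = [frozenset({0, 1}), frozenset({2}), frozenset({3}), frozenset({4, 5})]
h1, h2 = h_on_atoms(p1), h_on_atoms(p2)

# Step 1 (iv): the square integrals differ exactly by the L2 distance ...
lhs = integral(h2, h2)
diff = {w: h2[w] - h1[w] for w in mu}
assert lhs == integral(h1, h1) + integral(diff, diff)
# ... so the integrals of h_P^2 can only grow under refinement.
assert lhs >= integral(h1, h1)
```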
Step 2. Passage to the limit - construction of h.
In Step 1 we showed that the integrals ∫_Ω h_P² dμ are non-decreasing over
successive refinements of a finite partition of Ω. Moreover, by (i) above, each
function h_P satisfies 0 ≤ h_P(ω) ≤ 1 for all ω ∈ Ω. Thus, setting c = sup ∫_Ω h_P² dμ,
where the supremum is taken over all finite partitions of Ω, we have 0 ≤ c ≤ 1.
(Here we use the assumption that μ(Ω) = 1.)
For each n ≥ 1 let P_n be a finite measurable partition of Ω such that
∫_Ω h_{P_n}² dμ > c − 1/4ⁿ. Let Q_n be the smallest common refinement of the
partitions P₁, P₂, ..., P_n. For each n, Q_n refines P_n by construction, and Q_{n+1}
refines Q_n, since each Q_k consists of all intersections A₁ ∩ A₂ ∩ ... ∩ A_k, where
A_i ∈ P_i, i ≤ k. Hence each set in Q_n is a disjoint union of sets in Q_{n+1}. We
therefore have the inequalities:

c − 1/4ⁿ < ∫_Ω h_{P_n}² dμ ≤ ∫_Ω h_{Q_n}² dμ ≤ ∫_Ω h_{Q_{n+1}}² dμ ≤ c.
Using the identity proved in Step 1 (iv), we now have

∫_Ω (h_{Q_{n+1}} − h_{Q_n})² dμ = ∫_Ω (h_{Q_{n+1}}² − h_{Q_n}²) dμ < 1/4ⁿ.

The Schwarz inequality applied with f = |h_{Q_{n+1}} − h_{Q_n}| and g ≡ 1 then yields,
for each n ≥ 1,

∫_Ω |h_{Q_{n+1}} − h_{Q_n}| dμ < 1/2ⁿ.

By the Beppo-Levi theorem, since Σ_{n≥1} ∫_Ω |h_{Q_{n+1}} − h_{Q_n}| dμ is finite, we
conclude that the series Σ_{n≥1} (h_{Q_{n+1}} − h_{Q_n}) converges almost everywhere (μ),
so that the limit function

h = h_{Q₁} + Σ_{n≥1} (h_{Q_{n+1}} − h_{Q_n}) = lim_{n→∞} h_{Q_n}

(noting that Q₁ = P₁) is well-defined almost everywhere (μ). We complete the
construction by setting h = 0 on the exceptional μ-null set.
Step 3. Verification of the properties of h.
By Step 1 (i) it follows that 0 ≤ h(ω) ≤ 1, and it is clear from its construction
that h is F-measurable.
We need to show that ν(F) = ∫_F h dμ for every F ∈ F. Fix any such
measurable set F and let n ≥ 1. Define R_n as the smallest common refinement
of the two partitions Q_n (defined as in Step 2) and {F, Fᶜ}. Since F is a finite
disjoint union of sets in R_n, we have ν(F) = ∫_F h_{R_n} dμ from Step 1 (ii).
By Step 2, c − 1/4ⁿ < ∫_Ω h_{Q_n}² dμ ≤ ∫_Ω h_{R_n}² dμ ≤ c, so, as before, we can
conclude that ∫_Ω (h_{R_n} − h_{Q_n})² dμ < 1/4ⁿ, and using the Schwarz inequality once
more, this time with g = 1_F, we have

|∫_F (h_{R_n} − h_{Q_n}) dμ| ≤ ∫_F |h_{R_n} − h_{Q_n}| dμ < 1/2ⁿ.

For all n, ν(F) = ∫_F h_{R_n} dμ = ∫_F (h_{R_n} − h_{Q_n}) dμ + ∫_F h_{Q_n} dμ. The first integral
on the right converges to 0 as n → ∞, while the second converges to ∫_F h dμ
by the dominated convergence theorem (since for all n ≥ 1, 0 ≤ h_{Q_n} ≤ 1 and
μ(Ω) is finite). Thus we have verified that ν(F) = ∫_F h dμ, as required.
It is straightforward to check that the assumption μ(Ω) = 1 is not essential,
since for any finite positive measure μ we can repeat the above arguments
using μ/μ(Ω) instead of μ. We write the function h defined above as dν/dμ and
call it the Radon-Nikodym derivative of ν with respect to μ. Its relationship to
derivatives of functions will become clear when we consider real functions of
bounded variation.
Exercise 7.3
Let Ω = [0, 1] with Lebesgue measure and consider measures μ, ν given
by densities 1_A, 1_B respectively. Find a condition on the sets A, B so
that μ dominates ν, and find the Radon-Nikodym derivative dν/dμ by
applying the above definition of the function h.
Exercise 7.4
Suppose fl is a finite set equipped with the algebra of all subsets. Let fL
and v be two measures on fl such that fL( {w}) ::f 0, v( {w}) ::f 0, for all
wE fl. Decide under which conditions fL dominates v and find ~~.
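In the setting of Exercise 7.4 the Radon-Nikodym derivative can be written down directly: on a finite Ω every set is a finite union of singletons, so h(ω) = ν({ω})/μ({ω}) works whenever μ charges every point. A sketch with invented masses:

```python
from fractions import Fraction

# Finite Omega with all subsets measurable; both measures charge every
# point, so nu << mu and the derivative is the pointwise mass ratio.
mu = {"x": Fraction(1, 2), "y": Fraction(1, 3), "z": Fraction(1, 6)}
nu = {"x": Fraction(1, 4), "y": Fraction(1, 2), "z": Fraction(1, 4)}

h = {w: nu[w] / mu[w] for w in mu}   # dnu/dmu, pointwise

def nu_via_density(A):
    """Recover nu(A) as the integral of h over A with respect to mu."""
    return sum(h[w] * mu[w] for w in A)

assert all(nu_via_density({w}) == nu[w] for w in mu)
assert nu_via_density({"x", "y", "z"}) == 1   # nu(Omega) = 1 here
```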
The next observation is an easy application of the general procedure highlighted in Remark 4.18:
Proposition 7.6
If μ and φ are finite measures with 0 ≤ μ ≤ φ, and if h_μ = dμ/dφ is constructed
as above, then for any non-negative F-measurable function g on Ω we have

∫_Ω g dμ = ∫_Ω g h_μ dφ.

The same identity holds for any g ∈ L¹(μ).
Hint Begin with indicator functions, use linearity of the integral to extend to
simple functions, and monotone convergence for general non-negative g. The
rest is obvious from the definitions.
For finite measures we can now prove the general result announced earlier:
Theorem 7.7 (Radon-Nikodym)
Let ν and μ be finite measures on the measurable space (Ω, F) and suppose
that ν ≪ μ. Then there is a non-negative F-measurable function h on Ω such
that ν(A) = ∫_A h dμ for all A ∈ F.
Proof
Let φ = ν + μ. Then φ is a positive finite measure which dominates both
ν and μ. Hence the Radon-Nikodym derivatives h_ν = dν/dφ and h_μ = dμ/dφ
are well defined by the earlier constructions. Consider the sets F = {h_μ > 0} and
G = {h_μ = 0} in F. Clearly μ(G) = ∫_G h_μ dφ = 0, hence also ν(G) = 0, since
ν ≪ μ. Define h = (h_ν/h_μ) 1_F, and let A ∈ F, A ⊆ F. By the previous proposition,
with h·1_A instead of g, we have

ν(A) = ∫_A h_ν dφ = ∫_A h h_μ dφ = ∫_A h dμ

as required. Since μ and ν are both null on G this proves the theorem. □
Exercise 7.5
Let Ω = [0,1] with Lebesgue measure and consider probability measures
μ, ν given by densities f, g respectively. Find a condition characterising
the absolute continuity ν ≪ μ and find the Radon-Nikodym derivative
dν/dμ.
Exercise 7.6
Suppose Ω is a finite set equipped with the algebra of all subsets and
let μ and ν be two measures on Ω. Characterise the absolute continuity
ν ≪ μ and find dν/dμ.
You can now easily complete the picture for σ-finite measures and verify
that the function h is 'essentially unique':
Proposition 7.8
The Radon-Nikodym theorem remains valid if the measures ν and μ are σ-finite:
for any two such measures with ν ≪ μ we can find a finite-valued non-negative
measurable function h on Ω such that ν(F) = ∫_F h dμ for all F ∈ F.
The function h so defined is unique up to μ-null sets, i.e. if g : Ω → ℝ⁺ also
satisfies ν(F) = ∫_F g dμ for all F ∈ F then g = h a.e. (with respect to μ).
Hint There are sequences (A_n), (B_m) of sets in F with μ(A_n), ν(B_m) finite
for all m, n ≥ 1 and ⋃_{n≥1} A_n = Ω = ⋃_{m≥1} B_m. We can choose these to be
sequences of disjoint sets (why?). Hence display Ω as the disjoint union of the
sets A_n ∩ B_m (m, n ≥ 1).
Radon-Nikodym derivatives of measures obey simple combination rules
which follow from the uniqueness property. We illustrate this with the sum
and composition of two Radon-Nikodym derivatives, and leave the 'inverse
rule' as an exercise.
Proposition 7.9
Assume we are given σ-finite measures λ, ν, μ satisfying λ ≪ μ and ν ≪ μ, with
Radon-Nikodym derivatives dλ/dμ and dν/dμ, respectively.
(i) With φ = λ + ν we have dφ/dμ = dλ/dμ + dν/dμ a.s. (μ).
(ii) If λ ≪ ν then dλ/dμ = (dλ/dν)(dν/dμ) a.s. (μ).
Exercise 7.7
Show that if μ, ν are equivalent measures, i.e. both ν ≪ μ and μ ≪ ν
are true, then dμ/dν = (dν/dμ)⁻¹ a.s.
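The combination rules of Proposition 7.9, and the inverse rule of Exercise 7.7, are transparent on a finite space, where each derivative is a pointwise ratio of point masses. A numerical sketch (all masses invented and strictly positive, so all the measures involved are mutually absolutely continuous):

```python
from fractions import Fraction

# Three invented measures on a four-point space, all charging every point.
pts = range(4)
mu  = {w: Fraction(1, 4) for w in pts}
nu  = {0: Fraction(1, 2), 1: Fraction(1, 4),
       2: Fraction(1, 8), 3: Fraction(1, 8)}
lam = {0: Fraction(1, 4), 1: Fraction(1, 8),
       2: Fraction(1, 2), 3: Fraction(1, 8)}

def d(a, b):
    """Radon-Nikodym derivative da/db: the pointwise mass ratio."""
    return {w: a[w] / b[w] for w in pts}

# (i) sum rule: d(lam + nu)/dmu = dlam/dmu + dnu/dmu
phi = {w: lam[w] + nu[w] for w in pts}
assert d(phi, mu) == {w: d(lam, mu)[w] + d(nu, mu)[w] for w in pts}

# (ii) chain rule: dlam/dmu = (dlam/dnu) * (dnu/dmu)
assert d(lam, mu) == {w: d(lam, nu)[w] * d(nu, mu)[w] for w in pts}

# Exercise 7.7 (inverse rule): dmu/dnu = 1 / (dnu/dmu)
assert d(mu, nu) == {w: 1 / d(nu, mu)[w] for w in pts}
```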
Given a pair of σ-finite measures λ, μ on (Ω, F) it is natural to ask whether
we can identify the sets for which μ(E) = 0 implies λ(E) = 0. This would
mean that we can split the mass of λ into two pieces, one being represented
by a μ-integral, and the other 'concentrated' on μ-null sets, i.e. away from the
mass of μ. We turn this idea of 'separating' the masses of two measures into
the following:
Definition 7.10
If there is a set E ∈ F such that λ(F) = λ(E ∩ F) for every F ∈ F then λ is
concentrated on E. If two measures μ, ν are concentrated on disjoint subsets of
Ω, we say that they are mutually singular and write μ ⊥ ν.
Clearly, if λ is concentrated on E and E ∩ F = ∅, then λ(F) = λ(E ∩ F) = 0.
Conversely, if for all F ∈ F, F ∩ E = ∅ implies λ(F) = 0, consider λ(F) =
λ(F ∩ E) + λ(F∖E). Since (F∖E) ∩ E = ∅ we must have λ(F∖E) = 0, so
λ(F) = λ(F ∩ E). We have proved that λ is concentrated on E if and only if
for all F ∈ F, F ∩ E = ∅ implies λ(F) = 0. We gather some simple facts about
mutually singular measures:
Proposition 7.11
If μ, ν, λ₁, λ₂ are measures on a σ-field F, the following are true:
(i) If λ₁ ⊥ μ and λ₂ ⊥ μ then also (λ₁ + λ₂) ⊥ μ.
(ii) If λ₁ ≪ μ and λ₂ ⊥ μ then λ₂ ⊥ λ₁.
(iii) If ν ≪ μ and ν ⊥ μ then ν = 0.
Hint For (i), with i = 1, 2 let A_i, B_i be disjoint sets with λ_i concentrated on
A_i, μ on B_i. Consider A₁ ∪ A₂ and B₁ ∩ B₂. For (ii) use the remark preceding
the proposition.
The next result shows that a unique 'mass-splitting' of a σ-finite measure
relative to another is always possible:
Theorem 7.12 (Lebesgue decomposition)
Let λ, μ be σ-finite measures on (Ω, F). Then λ can be expressed uniquely as
a sum of two measures, λ = λ_a + λ_s, where λ_a ≪ μ and λ_s ⊥ μ.
Proof
Existence: we consider finite measures; the extension to the σ-finite case is
routine. Since 0 ≤ λ ≤ λ + μ = φ, i.e. φ dominates λ, there is 0 ≤ h ≤ 1 such
that λ(E) = ∫_E h dφ for all measurable E. Let A = {ω : h(ω) < 1} and B =
{ω : h(ω) = 1}. Set λ_a(E) = λ(A ∩ E) and λ_s(E) = λ(B ∩ E) for every E ∈ F.
Now if E ⊆ A and μ(E) = 0 then λ(E) = ∫_E h dφ = ∫_E h dλ, so that
∫_E (1 − h) dλ = 0. But h < 1 on A, hence also on E. Therefore we must have
λ(E) = 0. Hence if E ∈ F and μ(E) = 0, λ_a(E) = λ(A ∩ E) = 0 as A ∩ E ⊆ A.
So λ_a ≪ μ. On the other hand, if E ⊆ B we obtain λ(E) = ∫_E h dφ =
∫_E 1 d(λ + μ) = λ(E) + μ(E), so that μ(E) = 0. As A = Bᶜ we have shown
that μ(E) = 0 whenever E ∩ A = ∅, so that μ is concentrated on A. Since λ_s
is concentrated on B this shows that λ_s and μ are mutually singular. □
Uniqueness is left to the reader. (Hint: Employ Proposition 7.11.)
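On a finite space the proof's construction collapses to something very concrete: μ is concentrated on the set A of points it charges, and λ splits according to whether its mass falls inside or outside A. A sketch with invented masses:

```python
from fractions import Fraction

# Lebesgue decomposition on a finite space: split lam into the part
# carried by {mu > 0} (absolutely continuous) and by {mu = 0} (singular).
pts = ["a", "b", "c", "d"]
mu  = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": 0, "d": 0}
lam = {"a": Fraction(1, 4), "b": Fraction(1, 4),
       "c": Fraction(1, 3), "d": Fraction(1, 6)}

A = {w for w in pts if mu[w] > 0}          # mu is concentrated on A
lam_a = {w: (lam[w] if w in A else 0) for w in pts}
lam_s = {w: (0 if w in A else lam[w]) for w in pts}

assert all(lam[w] == lam_a[w] + lam_s[w] for w in pts)
# lam_a << mu: lam_a vanishes wherever mu does ...
assert all(lam_a[w] == 0 for w in pts if mu[w] == 0)
# ... and lam_s is concentrated on the complement of A, so lam_s ⊥ mu.
assert all(lam_s[w] == 0 for w in A)

# Density h of the absolutely continuous part, as in Corollary 7.13.
h = {w: (lam_a[w] / mu[w] if mu[w] else 0) for w in pts}
assert sum(h[w] * mu[w] for w in pts) == sum(lam_a.values())
```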
Combining this with the Radon-Nikodym theorem, we can describe the
structure of λ with respect to μ as 'basis measure':
Corollary 7.13
With μ, λ, λ_a, λ_s as in the theorem, there is a μ-a.s. unique non-negative
measurable function h such that

λ(E) = ∫_E h dμ + λ_s(E)

for every E ∈ F.
Remark 7.14
This result is reminiscent of the structure theory of finite-dimensional vector
spaces: if x ∈ ℝⁿ and m < n, we can write x = y + z, where y = Σ_{i=1}^m y_i e_i
is the orthogonal projection onto ℝᵐ and z is orthogonal to this subspace.
We also exploited similar ideas for Hilbert space. In this sense the measure μ
has the role of a 'basis' providing the 'linear combination' which describes the
projection of the measure λ onto a subspace of the space of measures on Ω.
Exercise 7.8
Consider the following measures on the real line: P₁ = δ₀, P₂ =
(1/25) m 1_{[0,25]}, P₃ = ½P₁ + ½P₂ (see Example 3.17). For which i ≠ j do
we have P_i ≪ P_j? Find the Radon-Nikodym derivative in each such
case.
Exercise 7.9
Let λ = δ₀ + m 1_{[1,3]}, μ = δ₁ + m 1_{[2,4]}, and find λ_a, λ_s, and h as in
Corollary 7.13.
7.3 Lebesgue-Stieltjes measures
Recall (see Section 3.5.3) that given any random variable X : Ω → ℝ, we
define its probability distribution as the measure P_X = P ∘ X⁻¹ on Borel
sets on ℝ (i.e. we set P(X ≤ x) = P ∘ X⁻¹((−∞, x]) = P_X((−∞, x]) and
extend this to B). Setting F_X(x) = P_X((−∞, x]), we verified in Proposition 4.44
that the distribution function F_X so defined is monotone increasing,
right-continuous, with limits at infinity F_X(−∞) = lim_{x→−∞} F_X(x) = 0 and
F_X(+∞) = lim_{x→∞} F_X(x) = 1.
In Chapter 4 we studied the special case where F_X(x) = P_X((−∞, x]) =
∫_{−∞}^x f_X dm for some real function f_X, the density of P_X with respect to
Lebesgue measure m. Proposition 4.32 showed that if f_X is continuous, then
F_X is differentiable and has the density f_X as its derivative at every x ∈ ℝ.
On the other hand, the Lebesgue function in Example 4.43 illustrated that
continuity of F_X is not sufficient to guarantee the existence of a density.
Moreover, when F_X has a density f_X, the measure P_X was said to be
'absolutely continuous' with respect to m. In the context of the Radon-Nikodym
theorem we should reconcile the terminology of this special case with the general
one considered in the present chapter. Trivially, when F_X has a density
we have P_X ≪ m, so that P_X has a Radon-Nikodym derivative dP_X/dm with
respect to m. The a.s. uniqueness ensures that dP_X/dm = f_X a.s.
Later in this chapter we shall establish the precise analytical requirements
on the cumulative distribution function F_X which will guarantee the existence
of a density.
7.3.1 Construction of Lebesgue-Stieltjes measures
To do this we first study, only slightly more generally, measures defined on
(ℝ, B) which correspond in similar fashion to increasing, right-continuous
functions on ℝ. Their construction mirrors that of Lebesgue measure, with only
a few changes, by generalising the concept of 'interval length'. The measures
we obtain are known as Lebesgue-Stieltjes measures. In this context we call a
function F : ℝ → ℝ a distribution function if F is monotone increasing and
right-continuous. It is clear that every finite measure μ defined on (ℝ, B) defines
such a function by F(x) = μ((−∞, x]), with F(−∞) = lim_{x→−∞} F(x) = 0,
F(+∞) = lim_{x→∞} F(x) = μ(ℝ).
Our principal concern, however, is with the converse: given a monotone
right-continuous F : ℝ → ℝ, can we always associate with F a measure on
(ℝ, B), and if so, what is its relation to Lebesgue measure?
The first question is answered by looking back carefully at the construction
of Lebesgue measure m on ℝ in Chapter 2: first we defined the natural concept
of interval length, l(I) = b − a, for any interval I with endpoints a, b (a < b),
and by analogy with our discussion of null sets, we defined Lebesgue outer
measure m* for an arbitrary subset A of ℝ as the infimum of the total lengths
Σ_{n=1}^∞ l(I_n) of all sequences (I_n)_{n≥1} of intervals covering A. To generalise this
idea, we should clearly replace b − a by F(b) − F(a) to obtain a 'generalised
interval length' relative to F, but since F is only right-continuous we will need
to take care of possible discontinuities. Thus we need to identify the possible
discontinuities of monotone increasing functions - fortunately these functions
are rather well behaved, as you can easily verify in the following:
Proposition 7.15
If F : ℝ → ℝ is monotone increasing (i.e. x₁ ≤ x₂ implies F(x₁) ≤ F(x₂))
then the left-limit F(x−) and the right-limit F(x+) exist at every x ∈ ℝ and
F(x−) ≤ F(x) ≤ F(x+). Hence F has at most countably many discontinuities,
and these are jump discontinuities, i.e. F(x−) < F(x+).
Hint For any x, consider sup{F(y) : y < x} and inf{F(y) : x < y} to verify the
first claim. For the second, note that F(x−) < F(x+) if F has a discontinuity
at x. Use the fact that ℚ is dense in ℝ to show that there can only be countably
many such points.
Since F is monotone, it remains bounded on bounded sets. For simplicity
we assume that lim_{x→−∞} F(x) = 0. We define the 'length relative to F' of the
bounded interval (a, b] by

l_F(a, b] = F(b) − F(a).

Note that we have restricted ourselves to left-open, right-closed intervals.
Since F is right-continuous, F(x+) = F(x) for all x, including a, b. Thus
l_F(a, b] = F(b+) − F(a+), and all jumps of F have the form F(x) − F(x−).
By restricting to intervals of this type we also ensure that l_F is additive over
adjoining intervals: if a < c < b then l_F(a, b] = l_F(a, c] + l_F(c, b].
We generalise Definition 2.2 as follows:
Definition 7.16
The F-outer measure of any set A ⊆ ℝ is the element of [0, ∞]

m_F*(A) = inf Z_F(A),

where

Z_F(A) = { Σ_{n=1}^∞ l_F(I_n) : I_n = (a_n, b_n], A ⊆ ⋃_{n=1}^∞ I_n }.

Our 'covering intervals' are now also restricted to be left-open and right-closed.
This is essential to 'make things fit together', but does not affect measurability:
recall (Theorem 2.25) that the Borel σ-field is generated whether we
start from the family of all intervals or from various sub-families.
Now consider the proof of Theorem 2.6 in more detail: our purpose there
was to prove that the outer measure of an interval equals its length. We show
how to adapt the proof to make this claim valid for m_F* and l_F applied to
intervals of the form (a, b]. It will therefore be helpful to review the proof of
Theorem 2.6 before reading on!
Step 1. The proof that m_F*((a, b]) ≤ l_F(a, b] remains much the same:
to see that l_F(a, b] ∈ Z_F((a, b]), we cover (a, b] by (I_n) with I₁ = (a, b],
I_n = (a, a] = ∅ for n > 1. The total length of this sequence is F(b) − F(a) =
l_F(a, b], hence the result follows by definition of inf.
Step 2. It remains to show that l_F(a, b] ≤ m_F*((a, b]). Here we need to be
careful always to 'approach points from the right' in order to make use of the
right-continuity of F and thus to avoid its jumps.
Fix ε > 0 and 0 < δ < b − a. By definition of inf we can find a covering of
I = (a, b] by intervals I_n = (a_n, b_n] such that Σ_{n=1}^∞ l_F(I_n) < m_F*(I) + ε/2.
Next, let J_n = (a_n, b_n′), where by right-continuity of F, for each n ≥ 1 we can
choose b_n′ > b_n with F(b_n′) − F(b_n) < ε/2^{n+1}. Then F(b_n′) − F(a_n) <
{F(b_n) − F(a_n)} + ε/2^{n+1}.
The (J_n)_{n≥1} then form an open cover of the compact interval [a + δ, b], so
that by the Heine-Borel theorem there is a finite subfamily (J_n)_{n≤N} which
also covers [a + δ, b]. Re-ordering these N intervals J_n, we can assume that their
right-hand endpoints form an increasing sequence, and then

F(b) − F(a + δ) = l_F(a + δ, b] ≤ Σ_{n=1}^N {F(b_n′) − F(a_n)}
≤ Σ_{n=1}^∞ {F(b_n) − F(a_n) + ε/2^{n+1}} < Σ_{n=1}^∞ l_F(I_n) + ε/2
< m_F*(I) + ε.
This holds for all ε > 0, hence F(b) − F(a + δ) ≤ m_F*(I) for every δ > 0.
By right-continuity of F, letting δ ↓ 0 we obtain l_F(a, b] = lim_{δ↓0} l_F(a + δ, b] ≤
m_F*((a, b]). This completes the proof that m_F*((a, b]) = l_F(a, b].
This is the only substantive change needed from the construction that led
to Lebesgue measure. The proof that m_F* is an outer measure, i.e.

m_F*(A) ≥ 0, m_F*(∅) = 0, m_F*(A) ≤ m_F*(B) if A ⊆ B,
m_F*(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ m_F*(A_i),

is word-for-word identical with that given for Lebesgue outer measure
(Proposition 2.5, Theorem 2.7). Hence, as in Definition 2.9, we say that a set E is
measurable (for the outer measure m_F*) if for each A ⊆ ℝ,

m_F*(A) = m_F*(A ∩ E) + m_F*(A ∩ Eᶜ).

Again, the proof of Theorem 2.11 goes through verbatim, and we denote the
resulting Lebesgue-Stieltjes measure, i.e. m_F* restricted to the σ-field M_F of
Lebesgue-Stieltjes measurable sets, by m_F. By construction, just like Lebesgue
measure, m_F is a complete measure: subsets of m_F-null sets are in M_F. However,
as we shall see later, M_F does not always coincide with the σ-field M of
Lebesgue-measurable sets, although both contain all the Borel sets. It is also
straightforward to verify that the properties of Lebesgue measure proved in
Section 2.4 hold for general Lebesgue-Stieltjes measures, with one exception:
the outer measure m_F* will not, in general, be translation-invariant. We can
see this at once for intervals, since l_F((a + t, b + t]) = F(b + t) − F(a + t) will
not usually equal F(b) − F(a); simply take F(x) = arctan x, for example. In
fact, it can be shown that Lebesgue measure is the unique translation-invariant
measure on ℝ.
Note, moreover, that a singleton {a} is now not necessarily a null set for
m_F: we have, by the analogue of Theorem 2.19, that

m_F({a}) = lim_{n→∞} m_F((a − 1/n, a]) = F(a) − lim_{n→∞} F(a − 1/n) = F(a) − F(a−).

Thus, the measure of the set {a} is precisely the size of the jump at a (if any).
From this it is easy to see by similar arguments how the 'length' of an interval
depends on the presence or absence of its endpoints: given that m_F((a, b]) =
F(b) − F(a), we see that m_F((a, b)) = F(b−) − F(a), m_F([a, b]) = F(b) − F(a−),
m_F([a, b)) = F(b−) − F(a−).
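These endpoint formulas are easy to exercise on a concrete distribution function. The sketch below invents an F with a single jump of size 1/2 at 0 followed by a uniform part on (0, 1], and evaluates all four interval types:

```python
from fractions import Fraction

# An invented distribution function: jump of size 1/2 at 0, then uniform,
# so F(x) = 0 for x < 0, F(x) = (1 + x)/2 on [0, 1], and F(x) = 1 for x > 1.
def F(x):
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x <= 1:
        return (1 + x) / 2
    return Fraction(1)

def F_minus(x):
    """Left limit F(x-); for this F it differs from F(x) only at the jump."""
    return Fraction(0) if x <= 0 else F(x)

m_half_open = lambda a, b: F(b) - F(a)              # m_F((a, b])
m_open      = lambda a, b: F_minus(b) - F(a)        # m_F((a, b))
m_closed    = lambda a, b: F(b) - F_minus(a)        # m_F([a, b])
m_clopen    = lambda a, b: F_minus(b) - F_minus(a)  # m_F([a, b))

assert m_closed(0, 0) == Fraction(1, 2)   # the singleton {0} carries the jump
assert m_half_open(-1, 1) == 1            # total mass of the distribution
assert m_open(0, 1) == Fraction(1, 2)     # open interval excludes the atom
assert m_clopen(0, 1) == 1                # half-open [0, 1) includes it
```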
Example 7.17
When F = 1_{[a,∞)} we obtain m_F = δ_a, the Dirac measure concentrated at
a. Similarly, we can describe a general discrete probability distribution, where
the random variable X takes the values {a_i : i = 1, 2, ..., n} with probabilities
{p_i : i = 1, 2, ..., n}, as the Lebesgue-Stieltjes measure arising from the function
F = Σ_{i=1}^n p_i 1_{[a_i,∞)}.
Mixtures of discrete and continuous distributions, such as described in
Example 3.17, clearly also fit into this picture. Of course, Lebesgue measure m is
the special case where the distribution is uniform, i.e. if F(x) = x for all x ∈ ℝ
then m_F = m.
Example 7.18
Only slightly more generally, every finite Borel measure μ on ℝ corresponds to a
Lebesgue-Stieltjes measure, since the distribution function F(x) = μ((−∞, x])
is obviously increasing and is right-continuous by Theorem 2.19 applied to μ and
the intervals I_n = (−∞, x + 1/n]. The corresponding Lebesgue-Stieltjes measure
m_F is equal to μ since they coincide on the generating family of intervals of
the above form. Hence they coincide on the σ-field B of Borel sets. By our
construction of m_F as a complete measure it follows that m_F is the completion
of μ.
Example 7.19
Return to the Lebesgue function F discussed in Example 4.43. Since F is
continuous and monotone increasing, it induces a Lebesgue-Stieltjes measure
m_F on the interval [0,1], whose properties we now examine. On each 'middle
thirds' interval F is constant, hence these intervals are null sets for m_F, and as
there are countably many of them, so is their union, the 'middle thirds' set D.
Hence the Cantor set C = Dᶜ satisfies

1 = F(1) − F(0) = m_F([0, 1]) = m_F(C).

(Note that since F is continuous, m_F({0}) = F(0) − F(0−) = 0; in fact, each
singleton is m_F-null.) We thus conclude that m_F is concentrated on a null set
for Lebesgue measure m, i.e. m_F ⊥ m, and that in the Lebesgue decomposition
of m_F relative to m there is no absolutely continuous component (by uniqueness
of the decomposition).
Exercise 7.10
Suppose the monotone increasing function F is non-constant at at most
countably many points (as would be the case for a discrete distribution).
Show that every subset of ℝ is m_F-measurable.
Hint Consider m_F over the bounded interval [−M, M] first.
Exercise 7.11
Find the Lebesgue-Stieltjes measure m_F generated by

F(x) = { 0 if x < 0;  2x if x ∈ [0, 1];  2 if x > 1 }.
7.3.2 Absolute continuity of functions
We now address the requirements on a distribution F which ensure that it has
a density. As we saw in Example 4.43, continuity of a probability distribution
function does not guarantee the existence of a density. The following stronger
restriction, however, does the trick:
Definition 7.20
A real function F is absolutely continuous on the interval [a, b] if, given ε > 0,
there is δ > 0 such that for every finite set of disjoint intervals J_k = (x_k, y_k),
k = 1, 2, ..., n, contained in [a, b] and with Σ_{k=1}^n (y_k − x_k) < δ, we have
Σ_{k=1}^n |F(y_k) − F(x_k)| < ε.
This condition will allow us to identify those distribution functions which
generate Lebesgue-Stieltjes measures that are absolutely continuous (in the
sense of measures) relative to Lebesgue measure. We will see shortly that
absolutely continuous functions are also 'of bounded variation': this describes
functions which do not 'vary too much' over small intervals. First we verify that
the indefinite integral (see Proposition 4.32) relative to a density is absolutely
continuous.
Proposition 7.21
If f ∈ L¹([a, b]), where the interval [a, b] is finite, then the function F(x) =
∫_a^x f dm is absolutely continuous.
Hint Use the absolute continuity of μ(G) = ∫_G |f| dm with respect to Lebesgue
measure m.
Exercise 7.12
Decide which of the following functions are absolutely continuous:
(a) f(x) = |x|, x ∈ [−1, 1],
(b) g(x) = √x, x ∈ [0, 1],
(c) the Lebesgue function.
The next result is the important converse to the above example, and shows
that all Stieltjes integrals arising from absolutely continuous functions lead
to measures which are absolutely continuous relative to Lebesgue measure,
and hence have a density. Together with Example 7.19 this characterises the
distributions arising from densities (under the conditions we have imposed on
distribution functions).
Theorem 7.22
If F is monotone increasing and absolutely continuous on ℝ, let m_F be the
Lebesgue-Stieltjes measure it generates. Then every Lebesgue-measurable set
is m_F-measurable, and on these sets m_F ≪ m.
Proof
We first show that if the Borel set B has m(B) = 0, then also m_F(B) = 0.
Recall that, given δ > 0, we can find an open set O containing B with m(O) < δ
(Theorem 2.17), and there is a sequence of disjoint open intervals (I_k)_{k≥1},
I_k = (a_k, b_k), with union O. Since the intervals are disjoint, their total length
is less than δ. By the absolute continuity of F, given any ε > 0, we can find
δ > 0 such that for every finite sequence of intervals J_k = (x_k, y_k), k ≤ n, with
total length Σ_{k=1}^n (y_k − x_k) < δ, we have

Σ_{k=1}^n {F(y_k) − F(x_k)} < ε/2.

Applying this to the sequence (I_k)_{k≤n} for a fixed n we obtain

Σ_{k=1}^n {F(b_k) − F(a_k)} < ε/2.

As this holds for every n, we also have Σ_{k=1}^∞ {F(b_k) − F(a_k)} ≤ ε/2 < ε. This
is the total length of a sequence of disjoint intervals covering O ⊇ B, hence
m_F(B) < ε for every ε > 0, so m_F(B) = 0.
Now for every Lebesgue-measurable set E with m(E) = 0 we can find a
Borel set B ⊇ E with m(B) = 0. Thus also m_F(B) = 0. Now E is a subset of
an m_F-null set, hence it is also m_F-null. Hence all m-measurable sets are
m_F-measurable and m-null sets are m_F-null, i.e. m_F ≪ m when both are regarded
as measures on M. □
Together with the Radon-Nikodym theorem, the above result helps to
clarify the structural relationship between Lebesgue measure and Lebesgue-Stieltjes
measures generated by monotone increasing right-continuous functions,
and thus, in particular, for probability distributions: when the function
F is absolutely continuous it has a density f, and can therefore be written
as its 'indefinite integral'. Since the Lebesgue-Stieltjes measure m_F ≪ m, the
Radon-Nikodym derivative dm_F/dm is well defined. Conversely, for the density f
of F to exist, the function F must be absolutely continuous. It now remains to
clarify the relationship between the Radon-Nikodym derivative dm_F/dm and the
density f. It is natural to expect from the example of a continuous f (Proposition
4.32) that f should be the derivative of F (at least m-a.e.). So we need to
understand which conditions on F will ensure that F′(x) exists for m-almost
all x ∈ ℝ.
We shall address this question in the somewhat wider context where the
'integrator' function F is no longer necessarily monotone increasing, but has
bounded variation, as introduced in the next section.
7.3.3 Functions of bounded variation
Since in general we need to handle set functions that can take negative values,
for example the map

E → ∫_E g dm, where g ∈ L¹(m)

(here L¹(m) denotes the space of L¹ functions with respect to the measure m),
we need a concept of 'generalised length functions' which are expressed as the
difference of two monotone increasing functions. We first need to characterise
such functions. This is done by introducing the following:
Definition 7.23
A real function F is of bounded variation on [a, b] (briefly F ∈ BV[a, b]) if
T_F[a, b] < ∞, where for any x ∈ [a, b]

T_F[a, x] = sup{ Σ_{k=1}^n |F(x_k) − F(x_{k−1})| }

with the supremum taken over all finite partitions of [a, x] with a = x₀ < x₁ <
... < x_n = x.
We introduce two further non-negative functions by setting

P_F[a, x] = sup{ Σ_{k=1}^n [F(x_k) − F(x_{k−1})]⁺ }

and

N_F[a, x] = sup{ Σ_{k=1}^n [F(x_k) − F(x_{k−1})]⁻ }

where the supremum is again taken over all partitions of [a, x]. The functions
T_F (P_F, N_F) are known respectively as the total (positive, negative) variation
functions of F. We shall keep a fixed in what follows, and consider these as
functions of x for x ≥ a.
We can easily verify the following basic relationships between these definitions:
Proposition 7.24
If F is of bounded variation on [a, b], we have F(x) − F(a) = P_F(x) − N_F(x), while T_F(x) = P_F(x) + N_F(x) for x ∈ [a, b].

Hint Consider p(x) = Σ_{k=1}^n [F(x_k) − F(x_{k−1})]^+ and n(x) = Σ_{k=1}^n [F(x_k) − F(x_{k−1})]^− for a fixed partition of [a, x] and note that F(x) − F(a) = p(x) − n(x). Now use the definition of the supremum. For the second identity consider T_F(x) ≥ p(x) + n(x) = 2p(x) − F(x) + F(a) and use the first identity.
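Proposition 7.24 can be checked numerically for a fixed partition. The following sketch (the helper name, the choice F = sin and the grid are ours, not the book's) computes the three partition sums and verifies both identities:

```python
import numpy as np

def variation_sums(F, xs):
    """Partition sums for the variation functions of F over the grid xs:
    the sum of |dF| and the sums of the positive/negative parts of dF."""
    dF = np.diff([F(x) for x in xs])
    return np.abs(dF).sum(), np.clip(dF, 0, None).sum(), np.clip(-dF, 0, None).sum()

# F(x) = sin x on [0, 2*pi]; a fine grid through the turning points
# attains the suprema, so t, p, n approximate T_F, P_F, N_F.
xs = np.linspace(0.0, 2 * np.pi, 100_001)
t, p, n = variation_sums(np.sin, xs)

# For any fixed partition: F(x) - F(a) = p - n, and t = p + n.
assert abs((np.sin(xs[-1]) - np.sin(xs[0])) - (p - n)) < 1e-9
assert abs(t - (p + n)) < 1e-9
```

Here T_F[0, 2π] = 4 for sin (it rises by 1, falls by 2, then rises by 1), and the grid sum reproduces this because the partition contains the turning points π/2 and 3π/2.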
Proposition 7.25
If F is of bounded variation and a ≤ x ≤ b then T_F[a, b] = T_F[a, x] + T_F[x, b]. Similar results hold for P_F and N_F. Hence all three variation functions are monotone increasing in x for fixed a ∈ ℝ. Moreover, if F has bounded variation on [a, b], then it has bounded variation on any [c, d] ⊂ [a, b].
Hint Adding a point to a partition can only increase all three sums. On the other hand, putting together partitions of [a, x] and [x, b] we obtain a partition of [a, b].
We show that bounded variation functions on finite intervals are exactly
what we are looking for:
Theorem 7.26
Let [a, b] be a finite interval. A real function is of bounded variation on [a, b]
if and only if it is the difference of two monotone increasing real functions on
[a,b].
Proof
If F is of bounded variation, use F(x) = [F(a) + P_F(x)] − N_F(x) from Proposition 7.24 to represent F as the difference of two monotone increasing functions. Conversely, if F = g − h is the difference of two monotone increasing functions, then for any partition a = x_0 < x_1 < ... < x_n = b of [a, b] we obtain, since g, h are increasing,

Σ_{i=1}^n |F(x_i) − F(x_{i−1})| = Σ_{i=1}^n |g(x_i) − h(x_i) − g(x_{i−1}) + h(x_{i−1})|
≤ Σ_{i=1}^n [g(x_i) − g(x_{i−1})] + Σ_{i=1}^n [h(x_i) − h(x_{i−1})]
≤ g(b) − g(a) + h(b) − h(a).

Thus M = g(b) − g(a) + h(b) − h(a) is an upper bound independent of the choice of partition, and so T_F[a, b] ≤ M < ∞, as required. □
This decomposition is minimal: if F = F_1 − F_2 and F_1, F_2 are increasing, then for any partition a = x_0 < x_1 < ... < x_n = b we can write, for fixed i ≤ n,

{F(x_i) − F(x_{i−1})}^+ − {F(x_i) − F(x_{i−1})}^− = F(x_i) − F(x_{i−1})
= {F_1(x_i) − F_1(x_{i−1})} − {F_2(x_i) − F_2(x_{i−1})},

which shows, from the minimality property of the decomposition x = x^+ − x^− of a real number, that each term in the difference on the right dominates its counterpart on the left. Adding and taking suprema we conclude that P_F is dominated by the total variation of F_1 and N_F by that of F_2. In other words, in the collection of increasing functions whose difference is F, the functions F(a) + P_F and N_F have the smallest sum at every point of [a, b].
Exercise 7.13
(a) Let F be monotone increasing on [a, b]. Find T_F[a, b].

(b) Prove that if F ∈ BV[a, b] then F is continuous a.e. (m) and Lebesgue-measurable.

(c) Find a differentiable function which is not in BV[0, 1].

(d) Show that if there is a (Lipschitz) constant M > 0 such that |F(x) − F(y)| ≤ M|x − y| for all x, y ∈ [a, b], then F ∈ BV[a, b].
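For part (c) a standard candidate (our choice; the exercise itself does not name one) is f(x) = x² sin(1/x²) with f(0) = 0: it is differentiable on [0, 1], but its variation sums can be made arbitrarily large, as this sketch suggests:

```python
import math

def f(x):
    # Differentiable on [0, 1] (with f'(0) = 0), but not of bounded variation.
    return x * x * math.sin(1.0 / (x * x)) if x != 0.0 else 0.0

def variation_sum(K):
    """Partition sum of |f differences| over x_k = sqrt(2/((2k+1)pi)),
    k = 0, ..., K, the points where sin(1/x^2) alternates between +1 and -1."""
    xs = [math.sqrt(2.0 / ((2 * k + 1) * math.pi)) for k in range(K, -1, -1)]
    return sum(abs(f(b) - f(a)) for a, b in zip(xs, xs[1:]))

# The sums behave like a harmonic series, so they grow without bound:
print([round(variation_sum(10 ** j), 3) for j in range(1, 5)])
```

Since |f(x_k)| = x_k² = 2/((2k+1)π) at these points, the partition sums dominate a divergent series of odd reciprocals.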
The following simple facts link bounded variation and absolute continuity
for functions on a bounded interval [a, b]:
Proposition 7.27
Suppose the real function F is absolutely continuous on [a, b]; then we have:

(i) F ∈ BV[a, b],

(ii) if F = F_1 − F_2 is the minimal decomposition of F as the difference of two monotone increasing functions described in Theorem 7.26, then both F_1 and F_2 are absolutely continuous on [a, b].

Hint Given ε > 0 choose δ > 0 as in Definition 7.20. In (i), starting with an arbitrary partition (x_i) of [a, b] we cannot use the absolute continuity of F unless we know that the subintervals have length less than δ. So add enough new partition points to guarantee this and consider the sums they generate. For (ii), compare the various variation functions when summing over a partition where the sum of the interval lengths is bounded by δ.
Definition 7.28
If F ∈ BV[a, b], where a, b ∈ ℝ, let F = F_1 − F_2 be its minimal decomposition into monotone increasing functions. Define the Lebesgue-Stieltjes signed measure determined by F as the countably additive set function m_F given on the σ-field B of Borel sets by m_F = m_{F_1} − m_{F_2}, where m_{F_i} is the Lebesgue-Stieltjes measure determined by F_i (i = 1, 2).
We shall examine signed measures more generally in the next section. For
the present, we note the following:
Example 7.29
When considering the measure P_X(E) = ∫_E f_X dm induced on ℝ by a density f_X we restrict attention to f_X ≥ 0 to ensure that P_X is non-negative. But for a measurable function f : ℝ → ℝ we set (Definition 4.16)

∫_E f dm = ∫_E f^+ dm − ∫_E f^− dm whenever ∫_E |f| dm < ∞.

The set function ν defined by ν(E) = ∫_E f dm then splits naturally into the difference of two measures, i.e. ν = ν^+ − ν^−, where ν^+(E) = ∫_E f^+ dm and ν^−(E) = ∫_E f^− dm. Restricting to a function f supported on [a, b] and setting F(x) = ∫_a^x f dm, we obtain m_F = ν, and if F = F_1 − F_2 as in the above definition, then m_{F_1} = ν^+, m_{F_2} = ν^− by the minimality properties of the splitting of F.
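The splitting ν = ν⁺ − ν⁻ of Example 7.29 can be illustrated on a discretised interval (the grid, the density sin 2πx and the set E are our illustrative choices):

```python
import numpy as np

# Discretise [0, 1]; a 'set' E is a boolean mask over the grid cells.
n = 100_000
dx = 1.0 / n
xs = (np.arange(n) + 0.5) * dx          # cell midpoints
f = np.sin(2 * np.pi * xs)              # a density taking both signs

def nu(mask):       return (f * mask).sum() * dx                     # integral of f  over E
def nu_plus(mask):  return (np.clip(f, 0, None) * mask).sum() * dx   # integral of f+ over E
def nu_minus(mask): return (np.clip(-f, 0, None) * mask).sum() * dx  # integral of f- over E

E = xs < 0.75                           # E = [0, 3/4)
assert abs(nu(E) - (nu_plus(E) - nu_minus(E))) < 1e-12
# Exact values here: nu+(E) = 1/pi, nu-(E) = 1/(2*pi), nu(E) = 1/(2*pi).
assert abs(nu(E) - 1.0 / (2 * np.pi)) < 1e-4
```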
7.3.4 Signed measures
The above example and the definition of Lebesgue-Stieltjes measures generated
by a BV function motivate the following abstract definition and the subsequent
search for a similar decomposition into the difference of two measures. We proceed to outline briefly the structure of signed measures in the abstract setting,
which provides a general context for the above development of Stieltjes integrals and distribution functions. Our results will enable us to define integrals
of functions relative to signed measures by reference to the decomposition of
the signed measure into 'positive and negative parts', exactly as above. We also
obtain a more general Lebesgue decomposition and Radon-Nikodym theorem,
thus completing the description of the structure of a bounded signed measure
relative to a given σ-finite measure. This leads to the general version of the
fundamental theorem of the calculus signalled earlier.
Definition 7.30
A signed measure on a measurable space (Ω, F) is a set function ν : F → (−∞, +∞] satisfying

(i) ν(∅) = 0,

(ii) ν(⋃_{i=1}^∞ E_i) = Σ_{i=1}^∞ ν(E_i) if E_i ∈ F and E_i ∩ E_j = ∅ for i ≠ j.

We need to avoid ambiguities like ∞ − ∞ by demanding that ν should take at most one of the values ±∞; therefore we consistently demand that ν(E) > −∞ for all sets E in its domain. Note also that in (ii) either both sides are +∞, or they are both finite, so that the series converges in ℝ. Since the left side is unaffected by any re-arrangement of the terms of the series, it follows that the series converges absolutely whenever it converges, i.e. Σ_{i=1}^∞ |ν(E_i)| < ∞ if and only if |ν(⋃_{i=1}^∞ E_i)| < ∞. The convergence is clear in the motivating example, since for any E ⊆ ℝ we have

|ν(E)| = |∫_E f dm| ≤ ∫_E |f| dm < ∞ when f ∈ L¹(ℝ).

Note that ν is finitely additive (let E_i = ∅ for all i > n in (ii); then (i) implies ν(⋃_{i=1}^n E_i) = Σ_{i=1}^n ν(E_i) if E_i ∈ F and E_i ∩ E_j = ∅ for i ≠ j, i, j ≤ n). Hence if F ⊆ E, F ∈ F, and |ν(E)| < ∞, then |ν(F)| < ∞, since both sides of ν(E) = ν(F) + ν(E \ F) are finite and ν(E \ F) > −∞ by hypothesis.
Signed measures do not inherit the properties of measures without change:
as a negative result we have:
Proposition 7.31
A signed measure ν defined on a σ-field F is monotone increasing (F ⊆ E implies ν(F) ≤ ν(E)) if and only if ν is a measure on F.

Hint ∅ is a subset of every E ∈ F.
On the other hand, a signed measure attains its bounds at some sets in F. More precisely: given a signed measure ν on (Ω, F) one can find sets A and B in F such that ν(A) = inf{ν(F) : F ∈ F} and ν(B) = sup{ν(F) : F ∈ F}.

Rather than prove this result directly we shall deduce it from the Hahn-Jordan decomposition theorem. This basic result shows how the set A and its complement can be used to define two (positive) measures ν^+, ν^− such that ν = ν^+ − ν^−, with ν^+(F) = ν(F ∩ A^c) and ν^−(F) = −ν(F ∩ A) for all F ∈ F. The decomposition is minimal: if ν = λ_1 − λ_2 where the λ_i are measures, then ν^+ ≤ λ_1 and ν^− ≤ λ_2.
Restricting attention to bounded signed measures (which suffices for applications to probability theory), we can derive this decomposition by applying the Radon-Nikodym theorem. (Our account is a special case of the treatment given in [11], Ch. 6, for complex-valued set functions.) First, given a bounded signed measure ν : F → ℝ, we seek the smallest (positive) measure μ that dominates ν, i.e. satisfies μ(E) ≥ |ν(E)| for all E ∈ F. Defining

|ν|(E) = sup { Σ_{i=1}^∞ |ν(E_i)| : (E_i)_{i≥1} a measurable partition of E }
produces a set function which satisfies |ν|(E) ≥ |ν(E)| for every E. The requirement μ(E_i) ≥ |ν(E_i)| for all i then yields

μ(E) = Σ_{i=1}^∞ μ(E_i) ≥ Σ_{i=1}^∞ |ν(E_i)|

for any measure μ dominating ν. Hence to prove that |ν| has the desired properties we only need to show that it is countably additive. We call |ν| the total variation of ν. Note that we use countable partitions of Ω here, just as we used sequences of intervals when defining Lebesgue measure in ℝ.
Theorem 7.32
The total variation |ν| of a bounded signed measure is a (positive) measure on F.
Proof
Partitioning E ∈ F into sets (E_i), choose (a_i) in ℝ^+ such that a_i < |ν|(E_i) for all i. Partition each E_i in turn into sets {A_ij}_j; by definition of the supremum we can choose these to ensure that a_i < Σ_j |ν(A_ij)| for every i ≥ 1. But the (A_ij) also partition E, hence Σ_i a_i < Σ_{i,j} |ν(A_ij)| ≤ |ν|(E). Taking the supremum over all sequences (a_i) satisfying these requirements ensures that Σ_i |ν|(E_i) = sup Σ_i a_i ≤ |ν|(E).

For the converse inequality consider any partition (B_k) of E and note that for fixed k, (B_k ∩ E_i)_{i≥1} partitions B_k, while for fixed i, (B_k ∩ E_i)_{k≥1} partitions E_i. This means that

Σ_k |ν(B_k)| = Σ_k |Σ_i ν(B_k ∩ E_i)| ≤ Σ_k Σ_i |ν(B_k ∩ E_i)|.

Since the terms of the double series are all non-negative, we can exchange the order of summation, so that finally

Σ_k |ν(B_k)| ≤ Σ_i Σ_k |ν(B_k ∩ E_i)| ≤ Σ_i |ν|(E_i).

But the partition (B_k) of E was arbitrary, so the estimate on the right also dominates |ν|(E). This completes the proof that |ν| is a measure. □
We now define the positive (resp. negative) variation of the signed measure ν by setting

ν^+ = (1/2)(|ν| + ν),  ν^− = (1/2)(|ν| − ν).

Clearly both are positive measures on F, and we have

ν = ν^+ − ν^−,  |ν| = ν^+ + ν^−.
With these definitions we can immediately extend the Radon-Nikodym and Lebesgue decomposition theorems to the case where ν is a bounded signed measure (we keep the notation used in Section 7.3.2, so here μ remains positive!):
Theorem 7.33
Let μ be a σ-finite (positive) measure and suppose that ν is a bounded signed measure. Then there is a unique decomposition ν = ν_a + ν_s into two signed measures, with ν_a ≪ μ and ν_s ⊥ μ. Moreover, there is a unique (up to sets of μ-measure 0) h ∈ L¹(μ) such that ν_a(F) = ∫_F h dμ for all F ∈ F.
Proof
Given ν = ν^+ − ν^− we wish to apply the Lebesgue decomposition and Radon-Nikodym theorems to the pairs of finite measures (ν^+, μ) and (ν^−, μ). First we need to check that for a signed measure λ ≪ μ we also have |λ| ≪ μ (for then clearly both λ^+ ≪ μ and λ^− ≪ μ). But if μ(E) = 0 and (F_i) partitions E, then each μ(F_i) = 0, hence λ(F_i) = 0, so that Σ_{i≥1} |λ(F_i)| = 0. As this holds for each partition, |λ|(E) = 0.

Similarly, if λ is concentrated on a set A, and A ∩ E = ∅, then for any partition (F_i) of E we will have λ(F_i) = 0 for every i ≥ 1. Thus |λ|(E) = 0, so |λ| is also concentrated on A. Hence if two signed measures are mutually singular, so are their total variation measures, and thus also their positive and negative variations. Applying the Lebesgue decomposition and Radon-Nikodym theorems to the measures ν^+ and ν^− provides (positive) measures (ν^+)_a, (ν^+)_s, (ν^−)_a, (ν^−)_s such that ν^+ = (ν^+)_a + (ν^+)_s with (ν^+)_a(F) = ∫_F h′ dμ, while ν^− = (ν^−)_a + (ν^−)_s with (ν^−)_a(F) = ∫_F h″ dμ, for non-negative functions h′, h″ ∈ L¹(μ), and with the measures (ν^+)_s, (ν^−)_s each mutually singular with μ. Letting ν_a = (ν^+)_a − (ν^−)_a we obtain a signed measure ν_a ≪ μ, and a function h = h′ − h″ ∈ L¹(μ) with ν_a(F) = ∫_F h dμ for all F ∈ F. The signed measure ν_s = (ν^+)_s − (ν^−)_s is clearly singular to μ, and h is unique up to μ-null sets, since this holds for h′, h″ and the decomposition ν = ν^+ − ν^− is uniquely determined. □
Example 7.34
If g ∈ L¹(μ) then ν(E) = ∫_E g dμ is a signed measure and ν ≪ μ. The Radon-Nikodym theorem shows that (with our conventions) all signed measures ν ≪ μ have this form.
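Example 7.34 can be made concrete on a finite space, where a signed measure is determined by its atom weights and |ν|(E) is just the sum of their absolute values; the Jordan parts are then ν⁺ = (1/2)(|ν| + ν) and ν⁻ = (1/2)(|ν| − ν). The weights below are arbitrary illustrative values:

```python
weights = {'a': 0.5, 'b': -1.25, 'c': 2.0, 'd': -0.25}    # nu({w}) for each atom w

def nu(E):        return sum(weights[w] for w in E)
def total_var(E): return sum(abs(weights[w]) for w in E)  # singletons attain the sup
def nu_plus(E):   return 0.5 * (total_var(E) + nu(E))
def nu_minus(E):  return 0.5 * (total_var(E) - nu(E))

omega = set(weights)
assert nu(omega) == nu_plus(omega) - nu_minus(omega)          # nu  = nu+ - nu-
assert total_var(omega) == nu_plus(omega) + nu_minus(omega)   # |nu| = nu+ + nu-
# nu+ vanishes on the negative atoms, nu- on the positive ones:
assert nu_plus({'b', 'd'}) == 0.0 and nu_minus({'a', 'c'}) == 0.0
```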
We are nearly ready for the general form of the fundamental theorem of calculus. First we confirm, as may be expected from the proof of the Radon-Nikodym theorem, the close relationship between the bounded variation function F induced by a bounded signed (Borel) measure ν on ℝ and its derivative f = F′:
Theorem 7.35
If ν is a bounded signed measure on ℝ and F(x) = ν((−∞, x]) then for any a ∈ ℝ the following are equivalent:

(i) F is differentiable at a, and F′(a) = L;

(ii) given ε > 0 there exists δ > 0 such that |ν(J)/m(J) − L| < ε if the open interval J contains a and l(J) < δ.
Proof
We may assume that L = 0; otherwise consider ρ = ν − Lm instead, restricted to a bounded interval containing a. If (i) holds with L = 0 and ε > 0 is given, we can find δ > 0 such that

|F(y) − F(a)| < ε|y − a| whenever |y − a| < δ.

Let J = (x, y) be an open interval containing a with (y − x) < δ. For sufficiently large N we can ensure that a > x + 1/N > x, and so for k ≥ 1, y_k = x + 1/(N + k) is bounded above by a and decreases to x as k → ∞. Thus

|ν((y_k, y])| = |F(y) − F(y_k)| ≤ |F(y) − F(a)| + |F(a) − F(y_k)| ≤ ε{(y − a) + (a − y_k)} < ε m(J).

But since y_k → x, ν((y_k, y]) → ν((x, y]) and we have shown that |ν(J)/m(J)| < ε. Hence (ii) holds.

For the converse, let ε, δ be as in (ii), so that with x < a < y and y − x < δ, (ii) implies |ν((x, y + 1/n])| < ε(y + 1/n − x) for all large enough n. But as (x, y] = ⋂_n (x, y + 1/n], we also have

|ν((x, y])| = |F(y) − F(x)| ≤ ε(y − x).   (7.3)

Finally, since (ii) holds, |ν({a})| ≤ |ν(J)| < ε l(J) for any small enough open interval J containing a. Thus F(a) = F(a−) and so F is continuous at a. Since x < a < y < x + δ, we conclude that (7.3) holds with a instead of x, which shows that the right-hand derivative of F at a is 0, and with a instead of y, which shows the same for the left-hand derivative. Thus F′(a) = 0, and so (i) holds. □
Theorem 7.36 (fundamental theorem of calculus)
Let F be absolutely continuous on [a, b]. Then F is differentiable m-a.e. and its Lebesgue-Stieltjes signed measure m_F has Radon-Nikodym derivative dm_F/dm = F′ m-a.e. Moreover, for each x ∈ [a, b],

F(x) − F(a) = m_F([a, x]) = ∫_a^x F′(t) dt.

You should compare this theorem with the elementary version given in Proposition 4.32.
Proof
The Radon-Nikodym theorem provides dm_F/dm = h ∈ L¹(m) such that m_F(E) = ∫_E h dm for all E ∈ B. Choosing the partitions

P_n = {(t_i, t_{i+1}] : t_i = a + (i/2^n)(b − a), 0 ≤ i < 2^n},

we obtain, successively, each P_n as the smallest common refinement of the partitions P_1, P_2, ..., P_{n−1}. Thus, setting h_n(a) = 0 and

h_n(x) = m_F((t_i, t_{i+1}]) / m((t_i, t_{i+1}]) for x ∈ (t_i, t_{i+1}],

we obtain a sequence (h_n) corresponding to the sequence constructed in Step 2 of the proof of the Radon-Nikodym theorem. It follows that h_n(x) → h(x) m-a.e. But for any fixed x ∈ (a, b), condition (ii) in Theorem 7.35, applied to the function F on each interval (t_i, t_{i+1}] with length less than δ, and with L = h(x), shows that h = F′ m-a.e. The final claim is now obvious from the definitions. □
The following result is therefore immediate and it justifies the terminology
'indefinite integral' in this general setting.
Corollary 7.37
If F is absolutely continuous on [a, b] and F′ = 0 m-a.e. then F is constant.
A final corollary now completes the circle of ideas for distribution functions
and their densities:
Corollary 7.38
If f ∈ L¹([a, b]) and F(x) = ∫_a^x f dm for each x ∈ [a, b] then F is differentiable m-a.e. and F′(x) = f(x) for almost every x ∈ [a, b].
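Corollary 7.38 can be illustrated numerically: build F as a cumulative integral of a step density f and check that difference quotients of F recover f away from the jump, i.e. m-a.e. (the density, grid and step sizes are our choices):

```python
import numpy as np

n = 100_000
dx = 1.0 / n
xs = np.linspace(0.0, 1.0, n + 1)
f = np.where(xs < 0.5, 1.0, -2.0)        # a step density in L^1([0, 1])

# F(x) = integral_0^x f dm via cumulative trapezoid sums: F is absolutely continuous.
F = np.concatenate([[0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * dx)])

h = 100                                  # difference-quotient step h*dx = 1e-3
dq = (F[h:] - F[:-h]) / (h * dx)
mid = xs[:-h] + 0.5 * h * dx             # centre of each quotient window
away = np.abs(mid - 0.5) > h * dx        # avoid windows straddling the jump at 1/2
target = np.where(mid < 0.5, 1.0, -2.0)
assert np.max(np.abs(dq[away] - target[away])) < 1e-6
assert abs(F[-1] - (-0.5)) < 1e-4        # F(1) = 0.5*1 + 0.5*(-2) = -0.5
```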
7.3.5 Hahn-Jordan decomposition
As a further application of the Radon-Nikodym theorem we derive the Hahn-Jordan decomposition of ν which was outlined earlier. First we need the following:
Theorem 7.39
Let ν be a bounded signed measure and let |ν| be its total variation. Then we can find a measurable function h such that |h(ω)| = 1 for all ω ∈ Ω and ν(E) = ∫_E h d|ν| for all E ∈ F.
Proof
The Radon-Nikodym theorem provides a measurable function h with ν(E) = ∫_E h d|ν| for all E ∈ F, since every |ν|-null set is ν-null ({E, ∅, ∅, ...} is a partition of E). Let C_α = {ω : |h(ω)| < α} for α > 0. Then, for any partition {E_i} of C_α,

Σ_{i≥1} |ν(E_i)| = Σ_{i≥1} |∫_{E_i} h d|ν|| ≤ Σ_{i≥1} α|ν|(E_i) = α|ν|(C_α).

As this holds for any partition, it holds for the supremum, i.e. |ν|(C_α) ≤ α|ν|(C_α). For α < 1 we must conclude that C_α is |ν|-null, and hence also ν-null. Therefore |h| ≥ 1 |ν|-a.e.

To show that |h| ≤ 1 |ν|-a.e. we note that if E has positive |ν|-measure, then, by definition of h,

|1/|ν|(E) ∫_E h d|ν|| = |ν(E)|/|ν|(E) ≤ 1.

That this implies |h| ≤ 1 |ν|-a.e. follows from the proposition below, applied with ρ = |ν|. Thus the set where |h| ≠ 1 is |ν|-null, hence also ν-null, and we can redefine h there, so that |h(ω)| = 1 for all ω ∈ Ω. □
Proposition 7.40
Given a finite measure ρ and a function f ∈ L¹(ρ), suppose that for every E ∈ F with ρ(E) > 0 we have |1/ρ(E) ∫_E f dρ| ≤ 1. Then |f(ω)| ≤ 1, ρ-a.e.

Hint Let E = {f > 1}. If ρ(E) > 0 consider 1/ρ(E) ∫_E f dρ.
We are ready to derive the Hahn-Jordan decomposition very simply:
Proposition 7.41
Let ν be a bounded signed measure. There are disjoint measurable sets A, B such that A ∪ B = Ω and ν^+(F) = ν(B ∩ F), ν^−(F) = −ν(A ∩ F) for all F ∈ F. Consequently, if ν = λ_1 − λ_2 for measures λ_1, λ_2 then λ_1 ≥ ν^+ and λ_2 ≥ ν^−.

Hint Since dν = h d|ν| and |h| = 1, let A = {h = −1}, B = {h = 1}. Use the definition of ν^+ to show that ν^+(F) = (1/2) ∫_F (1 + h) d|ν| = ν(F ∩ B) for every F.
Exercise 7.14
Let ν be a bounded signed measure. Show that for all F, ν^+(F) = sup_{G⊆F} ν(G) and ν^−(F) = −inf_{G⊆F} ν(G), all the sets concerned being members of F.

Hint ν(G) ≤ ν^+(G) ≤ ν(B ∩ G) + ν((B ∩ F) \ (B ∩ G)) = ν(B ∩ F).
Exercise 7.15
Show that when ν(F) = ∫_F f dμ where f ∈ L¹(μ) and μ is a (positive) measure, the Hahn decomposition sets are A = {f < 0} and B = {f ≥ 0}, and ν^+(F) = ∫_F f^+ dμ, while ν^−(F) = ∫_F f^− dμ.
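A sketch of Exercise 7.15 on a finite space (the random values and the event F below are illustrative): with A = {f < 0} and B = {f ≥ 0}, the Hahn sets split ν into its positive and negative parts:

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.normal(size=60)            # f(w) on a 60-point space, mu({w}) = 1/60
mu_atom = 1.0 / 60

def nu(mask): return (vals * mask).sum() * mu_atom   # nu(F) = integral of f over F

A = vals < 0                          # Hahn set carrying the negative part
B = ~A                                # Hahn set carrying the positive part

F = np.arange(60) % 3 == 0            # an arbitrary event
nu_p = nu(F & B)                      # nu+(F) = nu(F ∩ B)
nu_m = -nu(F & A)                     # nu-(F) = -nu(F ∩ A)

assert nu_p >= 0.0 and nu_m >= 0.0
assert abs(nu(F) - (nu_p - nu_m)) < 1e-12
# nu+ integrates f+ (and similarly nu- integrates f-):
assert abs(nu_p - (np.clip(vals, 0, None) * F).sum() * mu_atom) < 1e-12
```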
We finally arrive at a general definition of integrals relative to signed measures:
Definition 7.42
Let μ be a signed measure and f a measurable function on F ∈ F. Define the integral ∫_F f dμ by

∫_F f dμ = ∫_F f dμ^+ − ∫_F f dμ^−

whenever both terms on the right are finite, or at least are not of the form ±(∞ − ∞). The function f is sometimes called summable if the integral so defined is finite. Note that the earlier definition of a Lebesgue-Stieltjes signed measure fits into this general framework. We normally restrict attention to the case when both terms are finite, which clearly holds when μ is bounded.
Exercise 7.16
Verify the following: let μ be a finite measure and define the signed measure ν by ν(F) = ∫_F g dμ, where g ∈ L¹(μ). Prove that f ∈ L¹(ν) if and only if fg ∈ L¹(μ), and that ∫_E f dν = ∫_E fg dμ for all μ-measurable sets E.
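Exercise 7.16 in discretised form (the grid and the choices of g and f are ours): integrating f against the Jordan parts of ν agrees with integrating fg against μ:

```python
import numpy as np

n = 100_000
dx = 1.0 / n
xs = (np.arange(n) + 0.5) * dx        # midpoint grid on [0, 1]; mu = Lebesgue measure
g = xs - 0.4                          # density of nu with respect to mu (both signs)
f = np.cos(3.0 * xs)

# integral of f d(nu) via the Jordan parts of nu, whose densities are g+ and g-:
lhs = (f * np.clip(g, 0, None)).sum() * dx - (f * np.clip(-g, 0, None)).sum() * dx
# integral of f*g d(mu):
rhs = (f * g).sum() * dx
assert abs(lhs - rhs) < 1e-12
```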
7.4 Probability
7.4.1 Conditional expectation relative to a σ-field
Suppose we are given a random variable X ∈ L¹(P), where (Ω, F, P) is a probability space. In Chapter 5 we defined the conditional expectation E(X|G) of X ∈ L²(P) relative to a sub-σ-field G of F as the a.s. unique random variable Y ∈ L²(G) (meaning G-measurable) satisfying the condition

∫_G Y dP = ∫_G X dP for all G ∈ G.   (7.4)

The construction was a consequence of orthogonal projections in the Hilbert space L², with the extension to all integrable random variables undertaken 'by hand', which required a little care. With the Radon-Nikodym theorem at our disposal we can verify the existence of conditional expectations for integrable random variables very simply:

The (possibly signed) bounded measure ν(F) = ∫_F X dP is absolutely continuous with respect to P. Restricting both measures to (Ω, G) maintains this relationship, so that there is a G-measurable, P-a.s. unique random variable Y such that ν(G) = ∫_G Y dP for every G ∈ G. But by definition ν(G) = ∫_G X dP, so the defining equation (7.4) of Y = E(X|G) has been verified.
Remark 7.43
In particular, this shows that for X ∈ L²(F) its orthogonal projection onto L²(G) is a version of the Radon-Nikodym derivative of the measure ν : F ↦ ∫_F X dP.
We shall write E(X|G) instead of Y from now on, always keeping in mind that we have freedom to choose a particular 'version', i.e. as long as the results we seek demand only that relations concerning E(X|G) hold P-a.s., we can alter this random variable on a null set without affecting the truth of the defining equation:

Definition 7.44

A random variable E(X|G) is called the conditional expectation of X relative to a σ-field G if

(i) E(X|G) is G-measurable,

(ii) ∫_G E(X|G) dP = ∫_G X dP for all G ∈ G.
We investigate the properties of the conditional expectation. To begin with, the simplest are left for the reader as a proposition. In this and the subsequent theorem we make the following assumptions:

1. All random variables concerned are defined on a probability space (Ω, F, P);

2. X, Y and all (X_n) used below are assumed to be in L¹(Ω, F, P);

3. G and H are sub-σ-fields of F.

The properties listed in the next proposition are basic, and are used time and again. Where appropriate we give a verbal description of their 'meaning' in terms of information about X.
Proposition 7.45
The conditional expectation E(X|G) has the following properties:

(i) E(E(X|G)) = E(X)
(more precisely: any version of the conditional expectation of X has the same expectation as X).

(ii) If X is G-measurable, then E(X|G) = X
(if, given G, we already 'know' X, our 'best estimate' of it is perfect).

(iii) If X is independent of G, then E(X|G) = E(X)
(if G 'tells us nothing' about X, our best guess of X is its average value).

(iv) (Linearity) E((aX + bY)|G) = aE(X|G) + bE(Y|G) for any real numbers a, b
(note again that this really says that each linear combination of versions of the right-hand side is a version of the left-hand side).
Theorem 7.46
The following properties hold for E(X|G) as defined above:

(i) If X ≥ 0 then E(X|G) ≥ 0 a.s. (positivity).

(ii) If (X_n)_{n≥1} are non-negative and increase a.s. to X, then (E(X_n|G))_{n≥1} increase a.s. to E(X|G) ('monotone convergence' of conditional expectations).

(iii) If Y is G-measurable and XY is integrable, then E(XY|G) = Y E(X|G) ('taking out a known factor').

(iv) If H ⊆ G then E([E(X|G)]|H) = E(X|H) (the tower property).

(v) If φ : ℝ → ℝ is a convex function and φ(X) ∈ L¹(P), then

E(φ(X)|G) ≥ φ(E(X|G)).

(This is known as the conditional Jensen inequality; a similar result holds for expectations. Recall that a real function φ is convex on (a, b) if for all x, y ∈ (a, b), φ(px + (1 − p)y) ≤ pφ(x) + (1 − p)φ(y); the graph of φ stays on or below the straight line joining (x, φ(x)) and (y, φ(y)).)
Proof
(i) For each k ≥ 1 the set E_k = {E(X|G) < −1/k} ∈ G, so that

∫_{E_k} X dP = ∫_{E_k} E(X|G) dP.

As X ≥ 0, the left-hand side is non-negative, while the right-hand side is bounded above by −(1/k)P(E_k). This forces P(E_k) = 0 for each k, hence also P(E(X|G) < 0) = P(⋃_k E_k) = 0. Thus E(X|G) ≥ 0 a.s.
(ii) For each n let Y_n be a version of E(X_n|G). By (i) and as in Section 5.4.3, the (Y_n) are non-negative and increase a.s. Letting Y = limsup_n Y_n provides a G-measurable random variable such that the real sequence (Y_n(ω))_n converges to Y(ω) for almost all ω. Corollary 4.14 then shows that (∫_G Y_n dP)_{n≥1} increases to ∫_G Y dP for each G ∈ G. But we have ∫_G Y_n dP = ∫_G X_n dP for each n, and (X_n) increases pointwise to X. By the monotone convergence theorem it follows that (∫_G X_n dP)_{n≥1} increases to ∫_G X dP, so that ∫_G X dP = ∫_G Y dP. This shows that Y is a version of E(X|G) and therefore proves our claim.
(iii) We can restrict attention to X ≥ 0, since the general case follows from this by linearity. Now first consider the case of indicators: if Y = 1_E for some E ∈ G, we have, for all G ∈ G,

∫_G 1_E E(X|G) dP = ∫_{E∩G} E(X|G) dP = ∫_{E∩G} X dP = ∫_G 1_E X dP,

so that 1_E E(X|G) satisfies the defining equation and hence is a version of the conditional expectation of the product XY. So E(XY|G) = Y E(X|G) has been verified when Y = 1_E and E ∈ G. By the linearity property this extends to simple functions, and for arbitrary Y ≥ 0 we now use (ii) and a sequence (Y_n) of simple functions increasing to Y to deduce that, for non-negative X, E(XY_n|G) = Y_n E(X|G) increases to E(XY|G) on the one hand and to Y E(X|G) on the other. Thus if X and Y are both non-negative we have verified (iii). Linearity allows us to extend this to general Y = Y^+ − Y^−.
(iv) We have ∫_G E(X|G) dP = ∫_G X dP for G ∈ G and ∫_H E(X|H) dP = ∫_H X dP for H ∈ H ⊆ G. Hence for H ∈ H we obtain ∫_H E(X|G) dP = ∫_H E(X|H) dP. Thus E(X|H) satisfies the condition defining the conditional expectation of E(X|G) with respect to H, so that E([E(X|G)]|H) = E(X|H).
(v) A convex function can be written as the supremum of a sequence of affine functions, i.e. there are sequences (a_n), (b_n) of reals such that φ(x) = sup_n(a_n x + b_n) for every x ∈ ℝ. Fix n; then since φ(X(ω)) ≥ a_n X(ω) + b_n for all ω, the positivity and linearity properties ensure that

E(φ(X)|G)(ω) ≥ a_n E(X|G)(ω) + b_n

for all ω ∈ Ω \ A_n, where P(A_n) = 0. Since A = ⋃_n A_n is also null, it follows that for all n ≥ 1, E(φ(X)|G)(ω) ≥ a_n E(X|G)(ω) + b_n a.s. Hence the inequality also holds when we take the supremum on the right, so that E(φ(X)|G)(ω) ≥ φ[E(X|G)(ω)] a.s. This proves (v). □
An immediate consequence of (v) is that the L^p-norm of E(X|G) is bounded by that of X for p ≥ 1, since the function φ(x) = |x|^p is then convex: we obtain

|E(X|G)|^p = φ(E(X|G)) ≤ E(φ(X)|G) = E(|X|^p | G) a.s.

so that

E(|E(X|G)|^p) ≤ E(E(|X|^p | G)) = E(|X|^p),

where the penultimate step applies (i) of Proposition 7.45 to |X|^p. Take pth roots to obtain ‖E(X|G)‖_p ≤ ‖X‖_p.
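On a finite sample space the contraction can be checked directly: when G is generated by a finite partition, E(X|G) is just the vector of cell averages, and conditional Jensen holds exactly for the empirical measure (the partition and distribution below are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
labels = rng.integers(0, 4, size=100_000)        # a 4-cell partition generating G
X = labels - 1.5 + rng.normal(size=labels.size)  # cell-dependent means plus noise

# E(X|G): constant on each cell, equal to the cell average
EXG = np.empty_like(X)
for k in range(4):
    EXG[labels == k] = X[labels == k].mean()

p = 3.0
lp = lambda Z: (np.abs(Z) ** p).mean() ** (1.0 / p)   # L^p norm (empirical measure)
assert lp(EXG) <= lp(X) + 1e-12                       # ||E(X|G)||_p <= ||X||_p
```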
Exercise 7.17
Let Ω = [0, 1] with Lebesgue measure and let X(ω) = ω. Find E(X|G) if

(a) G = {[0, 1/2], (1/2, 1], [0, 1], ∅},

(b) G is generated by the family of sets {B : B ⊂ [0, 1/2], B Borel}.
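A Monte Carlo sketch of part (a) (the sample size and seed are arbitrary): for G generated by the partition {[0, 1/2], (1/2, 1]}, E(X|G) is constant on each cell, equal to the cell average of X:

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 1.0, 1_000_000)   # draws from ([0, 1], m), X(w) = w
X = omega

cells = [omega <= 0.5, omega > 0.5]        # the partition generating G
EXG = np.empty_like(X)
for c in cells:
    EXG[c] = X[c].mean()                   # E(X|G) is the cell average on each cell

# Defining property (7.4) on the generating sets:
for c in cells:
    assert abs(EXG[c].sum() - X[c].sum()) < 1e-6
# The cell averages estimate 1/4 on [0, 1/2] and 3/4 on (1/2, 1]:
assert abs(EXG[cells[0]][0] - 0.25) < 0.005 and abs(EXG[cells[1]][0] - 0.75) < 0.005
```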
7.4.2 Martingales
Suppose we wish to model the behaviour of some physical phenomenon by a sequence (X_n) of random variables. The value X_n(ω) might be the outcome of the nth toss of a 'fair' coin which is tossed 1000 times, with 'heads' recorded as 1, 'tails' as 0. Then Y(ω) = Σ_{n=1}^{1000} X_n(ω) would record the number of times that the coin had landed 'heads'. Typically, we would perform this random experiment a large number of times before venturing to make statements about the probability of 'heads' for this coin. We could average our results, i.e. seek to compute E(Y). But we might also be interested in guessing what the value of X_n(ω) might be after k < n tosses have been performed, i.e. for a fixed ω ∈ Ω, does knowing the values of (X_i(ω))_{i≤k} give us any help in predicting the value of X_n(ω) for n > k? In an 'idealised' coin-tossing experiment it is assumed that it does not, that is, the successive tosses are assumed to be independent - a fact which often perplexes the beginner in probability theory.

There are many situations where the (X_n) would represent outcomes where the past behaviour of the process being modelled can reasonably be taken to influence its future behaviour, e.g. if X_n records whether it rains on day n. We seek a mathematical description of the way in which our knowledge of past behaviour of (X_n) can be codified. A natural idea is to use the σ-field F_k = σ{X_i : 0 ≤ i ≤ k} generated by the sequence (X_n)_{n≥0} as representing the knowledge gained from knowing the first k outcomes of our experiment. We call (X_n)_{n≥0} a (discrete) stochastic process to emphasise that our focus is now
on the 'dynamics' of the sequence of outcomes as it unfolds. We include a 0th stage for notational convenience, so that there is a 'starting point' before the experiment begins, and then F_0 represents our knowledge before any outcome is observed.

So the information available to us by 'time' k (i.e. after k outcomes have been recorded) about the 'state of the world' ω is given by the values (X_i(ω))_{0≤i≤k}, and this is encapsulated in knowing which sets of F_k contain the point ω. But we can postulate a sequence of σ-fields (F_n)_{n≥0} quite generally, without reference to any sequence of random variables. Again, our knowledge of any particular ω is then represented at stage k ≥ 1 by knowing which sets in F_k contain ω. A simple example is provided by the binomial stock price model of Section 2.6.3 (see Exercise 2.13). Guided by this example, we turn this into a general definition.
Definition 7.47
Given a probability space (Ω, F, P), a (discrete) filtration is an increasing sequence of sub-σ-fields (F_n)_{n≥0} of F, i.e.

F_0 ⊆ F_1 ⊆ F_2 ⊆ ... ⊆ F_n ⊆ ... ⊆ F.

We write 𝔽 = (F_n)_{n≥0}. We say that the sequence (X_n)_{n≥0} of random variables is adapted to the filtration 𝔽 if X_n is F_n-measurable for every n ≥ 0. The tuple (Ω, F, (F_n)_{n≥0}, P) is called a filtered probability space.
We shall normally assume in our applications that F_0 = {∅, Ω}, so that we begin with 'no information', and very often we shall assume that the 'final' σ-field generated by the whole sequence, i.e. F_∞ = σ(⋃_{n≥0} F_n), is all of F (so that, by the end of the experiment, 'we know all there is to know'). Clearly (X_n) is adapted to its natural filtration (F_n), where F_n = σ(X_i : 0 ≤ i ≤ n) for each n, and it is adapted to every filtration which contains this one. But equally, if F_n = σ(X_i : 0 ≤ i ≤ n) for some process (X_n), it may be that for some other process (Y_n), each Y_n is F_n-measurable, i.e. (Y_n) is adapted to (F_n). Recall that by Proposition 3.24 this implies that for each n ≥ 1 there is a Borel-measurable function f_n : ℝ^{n+1} → ℝ such that Y_n = f_n(X_0, X_1, X_2, ..., X_n).
We come to the main concept introduced in this section:
Definition 7.48
Let (Ω, F, (F_n)_{n≥0}, P) be a filtered probability space. A sequence of random variables (X_n)_{n≥0} on (Ω, F, P) is a martingale relative to the filtration 𝔽 = (F_n)_{n≥0}, provided:

(i) (X_n) is adapted to 𝔽,

(ii) each X_n is in L¹(P),

(iii) for each n ≥ 0, E(X_{n+1}|F_n) = X_n.
We note two immediate consequences of this definition which are used over and over again:

1. If m > n ≥ 0 then E(X_m|F_n) = X_n. This follows from the tower property of conditional expectations, since (a.s.)

E(X_m|F_n) = E(E(X_m|F_{m−1})|F_n) = E(X_{m−1}|F_n) = ... = E(X_{n+1}|F_n) = X_n.

2. Any martingale (X_n) has constant expectation:

E(X_n) = E(E(X_n|F_0)) = E(X_0)

holds for every n ≥ 0, by 1. and by (i) in Proposition 7.45.
A martingale represents a 'fair game' in gambling: betting, for example, on the outcome of the coin tosses, our winnings in 'game $n$' (the outcome of the $n$th toss) would be $\Delta X_n = X_n - X_{n-1}$, that is, the difference between what we had before and after that game. (We assume that $X_0 = 0$.) If the games are fair we would predict at time $n-1$, before the $n$th outcome is known, that $\mathbb{E}(\Delta X_n|\mathcal{F}_{n-1}) = 0$, where $\mathcal{F}_k = \sigma(X_i : i \le k)$ are the $\sigma$-fields of the natural filtration of the process $(X_n)$. This follows because our knowledge at time $n-1$ is encapsulated in $\mathcal{F}_{n-1}$, and in a fair game we would expect our incremental winnings at any stage to be zero on average. Hence in this situation the $(X_n)$ form a martingale.

Similarly, in a game favourable to the gambler we should expect that $\mathbb{E}(\Delta X_n|\mathcal{F}_{n-1}) \ge 0$, i.e. $\mathbb{E}(X_n|\mathcal{F}_{n-1}) \ge X_{n-1}$ a.s. We call a sequence satisfying this inequality (and (i), (ii) of Definition 7.48) a submartingale, while a game unfavourable to the gambler (hence favourable to the casino!) is represented similarly by a supermartingale, which has $\mathbb{E}(X_n|\mathcal{F}_{n-1}) \le X_{n-1}$ a.s. for every $n$. Note that for a submartingale the expectations of the $(X_n)$ increase with $n$, while for a supermartingale they decrease. Finally, note that the properties of these processes do not change if we replace $X_n$ by $X_n - X_0$ (as long as $X_0 \in L^1(\mathcal{F}_0)$, to retain integrability and adaptedness), so that we can work without loss of generality with processes that start with $X_0 = 0$.
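The fair coin-toss game just described is easy to probe numerically. The following is a minimal sketch (the function name, seed and sample sizes are our own illustrative choices, not from the text): a symmetric $\pm 1$ walk should keep constant expectation $\mathbb{E}(X_n) = \mathbb{E}(X_0) = 0$.

```python
import random

def coin_toss_walk(n_steps, rng):
    """Winnings X_n after n fair +/-1 coin tosses, with X_0 = 0."""
    x = [0]
    for _ in range(n_steps):
        x.append(x[-1] + rng.choice([1, -1]))
    return x

# A martingale has constant expectation: E(X_n) = E(X_0) = 0 for all n.
rng = random.Random(42)
paths = [coin_toss_walk(20, rng) for _ in range(20000)]
means = [sum(p[n] for p in paths) / len(paths) for n in range(21)]
print(all(abs(m) < 0.2 for m in means))
```

The sample means hover near 0 for every $n$, consistent with $\mathbb{E}(\Delta X_n|\mathcal{F}_{n-1}) = 0$.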
7. The Radon-Nikodym theorem
225

Example 7.49

The most obvious, yet in some ways quite general, example of a martingale consists of a sequence of conditional expectations: given a random variable $X \in L^1(\mathcal{F})$ and a filtration $(\mathcal{F}_n)_{n\ge 0}$ of sub-$\sigma$-fields of $\mathcal{F}$, let $X_n = \mathbb{E}(X|\mathcal{F}_n)$ for every $n$. Then $\mathbb{E}(X_{n+1}|\mathcal{F}_n) = \mathbb{E}(\mathbb{E}(X|\mathcal{F}_{n+1})|\mathcal{F}_n) = \mathbb{E}(X|\mathcal{F}_n) = X_n$, using the tower property again. We can interpret this by regarding each $X_n$ as giving us the information available at time $n$, i.e. contained in the $\sigma$-field $\mathcal{F}_n$, about the random variable $X$. (Remember that the conditional expectation is the 'best guess' of $X$, with respect to mean-square errors, when we work in $L^2$.)

For a finite filtration $\{\mathcal{F}_n : 0 \le n \le N\}$ with $\mathcal{F}_N = \mathcal{F}$ it is obvious that $\mathbb{E}(X|\mathcal{F}_N) = X$. For an infinite sequence we might hope similarly that 'in the limit' we will have 'full' information about $X$, which suggests that we should be able to retrieve $X$ as the limit of the $(X_n)$ in some sense. The conditions under which such limits exist require careful study; see e.g. [13] and [5] for details.
A second standard example of a martingale is:
Example 7.50
Suppose $(Z_n)_{n\ge 1}$ is a sequence of independent random variables with zero mean. Let $X_0 = 0$, $\mathcal{F}_0 = \{\emptyset, \Omega\}$, set $X_n = \sum_{k=1}^n Z_k$ and define $\mathcal{F}_n = \sigma(Z_k : k \le n)$ for each $n \ge 1$. Then $(X_n)_{n\ge 0}$ is a martingale relative to the filtration $(\mathcal{F}_n)$. To see this, recall that for each $n$, $Z_n$ is independent of $\mathcal{F}_{n-1}$, so that $\mathbb{E}(Z_n|\mathcal{F}_{n-1}) = \mathbb{E}(Z_n) = 0$. Hence $\mathbb{E}(X_n|\mathcal{F}_{n-1}) = \mathbb{E}(X_{n-1}|\mathcal{F}_{n-1}) + \mathbb{E}(Z_n) = X_{n-1}$, since $X_{n-1}$ is $\mathcal{F}_{n-1}$-measurable. (You should check carefully which properties of the conditional expectation we used here!)
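The defining identity $\mathbb{E}(X_n|\mathcal{F}_{n-1}) = X_{n-1}$ of Example 7.50 can be illustrated empirically. In the sketch below (the helper name and sample sizes are our own illustrative choices) we take $Z_k = \pm 1$ fair coin tosses and estimate $\mathbb{E}(X_5 \mid X_4)$ by averaging $X_5$ over sample paths grouped by the value of $X_4$; for this walk that conditional expectation agrees with $\mathbb{E}(X_5|\mathcal{F}_4)$.

```python
import random
from collections import defaultdict

def conditional_avgs(n_paths, rng):
    """Average X_5 within groups of sample paths sharing the same X_4."""
    buckets = defaultdict(list)
    for _ in range(n_paths):
        x4 = sum(rng.choice([1, -1]) for _ in range(4))  # X_4
        x5 = x4 + rng.choice([1, -1])                    # X_5 = X_4 + Z_5
        buckets[x4].append(x5)
    return {x4: sum(v) / len(v) for x4, v in buckets.items()}

# E(X_5 | X_4 = x) should be x itself: the next increment has mean zero.
print(conditional_avgs(40000, random.Random(1)))
```

Each bucket's average of $X_5$ comes out close to its $X_4$ value, up to Monte Carlo noise.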
A 'multiplicative' version of this example is the following:

Exercise 7.18

Let $Z_n \ge 0$ be a sequence of independent random variables with $\mathbb{E}(Z_n) = \mu = 1$. Let $\mathcal{F}_n = \sigma\{Z_k : k \le n\}$ and show that $X_0 = 1$, $X_n = Z_1 Z_2 \cdots Z_n$ ($n \ge 1$) defines a martingale for $(\mathcal{F}_n)$, provided all the products are integrable random variables, which holds, e.g., if all $Z_n \in L^\infty(\Omega, \mathcal{F}, P)$.

Exercise 7.19

Let $(Z_n)_{n\ge 1}$ be a sequence of independent random variables with mean $\mu = \mathbb{E}(Z_n) \ne 0$ for all $n$. Show that the sequence of their partial sums $X_n = Z_1 + Z_2 + \cdots + Z_n$ is not a martingale for the filtration $(\mathcal{F}_n)_n$, where $\mathcal{F}_n = \sigma\{Z_k : k \le n\}$. How can we 'compensate' for this by altering $X_n$?
7.4.3 Doob decomposition

Let $X = (X_n)_{n\ge 0}$ be a martingale for the filtration $\mathbb{F} = (\mathcal{F}_n)_{n\ge 0}$ (with our above conventions); briefly, we simply refer to the martingale $(X, \mathbb{F})$. The quadratic function is convex, hence by Jensen's inequality (Theorem 7.46) we have $\mathbb{E}(X_{n+1}^2|\mathcal{F}_n) \ge (\mathbb{E}(X_{n+1}|\mathcal{F}_n))^2 = X_n^2$, so $X^2$ is a submartingale. We investigate whether it is possible to 'compensate', as in Exercise 7.19, to make the resulting process again a martingale. Note that the expectations of the $X_n^2$ are increasing, so we will need to subtract an increasing process from $X^2$ to achieve this.

In fact, the construction of this 'compensator' process is quite general. Let $Y = (Y_n)$ be any adapted process with each $Y_n \in L^1$. For any process $Z$ write its increments as $\Delta Z_n = Z_n - Z_{n-1}$ for all $n$. Recall that in this notation the martingale property can be expressed succinctly as $\mathbb{E}(\Delta Z_n|\mathcal{F}_{n-1}) = 0$; we shall use this repeatedly in what follows.

We define two new processes $A = (A_n)$ and $M = (M_n)$ with $A_0 = 0$, $M_0 = 0$, via their successive increments:
$$\Delta A_n = \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1}), \qquad \Delta M_n = \Delta Y_n - \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1}).$$

We obtain $\mathbb{E}(\Delta M_n|\mathcal{F}_{n-1}) = \mathbb{E}([\Delta Y_n - \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1})]|\mathcal{F}_{n-1}) = \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1}) - \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1}) = 0$, as $\mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1})$ is $\mathcal{F}_{n-1}$-measurable. Hence $M$ is a martingale. Moreover, the process $A$ is increasing if and only if
$$0 \le \Delta A_n = \mathbb{E}(\Delta Y_n|\mathcal{F}_{n-1}) = \mathbb{E}(Y_n|\mathcal{F}_{n-1}) - Y_{n-1},$$
which holds if and only if $Y$ is a submartingale. Note that $A_n = \sum_{k=1}^n \Delta A_k = \sum_{k=1}^n [\mathbb{E}(Y_k|\mathcal{F}_{k-1}) - Y_{k-1}]$ is $\mathcal{F}_{n-1}$-measurable. Thus the value of $A_n$ is 'known' by time $n-1$. A process with this property is called predictable, since we can 'predict' its future values one step ahead. It is a fundamental property of martingales that they are not predictable: in fact, if $X$ is a predictable martingale, then we have
$$X_{n-1} = \mathbb{E}(X_n|\mathcal{F}_{n-1}) = X_n \quad \text{a.s. for every } n,$$
where the first equality is the definition of martingale, while the second follows since $X_n$ is $\mathcal{F}_{n-1}$-measurable. Hence a predictable martingale is a.s. constant, and if it starts at 0 it will stay there. This fact gives the decomposition of an adapted process $Y$ into the sum of a martingale and a predictable process a useful uniqueness property: first, since $M_0 = 0 = A_0$, we have $Y_n = Y_0 + M_n + A_n$ for the processes $M, A$ defined above. If also $Y_n = Y_0 + M_n' + A_n'$, where $M_n'$ is a martingale and $A_n'$ is predictable, then
$$M_n - M_n' = A_n' - A_n \quad \text{a.s.}$$
is a predictable martingale, equal to 0 at time 0. Hence both sides are 0 for every $n$, and so the decomposition is a.s. unique.
We call this the Doob decomposition of an adapted process. It takes on special importance when applied to the submartingale $Y = X^2$ which arises from a martingale $X$. In that case, as we saw above, the predictable process $A$ is increasing, so that $A_n \le A_{n+1}$ a.s. for every $n$, and the Doob decomposition reads:
$$X^2 = X_0^2 + M + A.$$
In particular, if $X_0 = 0$ (as we can assume without loss of generality), we have written $X^2 = M + A$ as the sum of a martingale $M$ and a predictable increasing process $A$. The significance of this is revealed in a very useful property of martingales, which was a key component of the proof of the Radon-Nikodym theorem (see Step 1 (iv) of Theorem 7.5, where the martingale connection is well hidden!): for any martingale $X$ we can write, with $(\Delta X_n)^2 = (X_n - X_{n-1})^2$:
$$\mathbb{E}((\Delta X_n)^2|\mathcal{F}_{n-1}) = \mathbb{E}([X_n^2 - 2X_nX_{n-1} + X_{n-1}^2]|\mathcal{F}_{n-1}) = \mathbb{E}(X_n^2|\mathcal{F}_{n-1}) - 2X_{n-1}\mathbb{E}(X_n|\mathcal{F}_{n-1}) + X_{n-1}^2 = \mathbb{E}([X_n^2 - X_{n-1}^2]|\mathcal{F}_{n-1}),$$
where in the second step we took out 'what is known', then used the martingale property and cancelled the resulting terms. Hence, given the martingale $X$ with $X_0 = 0$, the decomposition $X^2 = M + A$ yields, since $M$ is also a martingale:
$$0 = \mathbb{E}(\Delta M_n|\mathcal{F}_{n-1}) = \mathbb{E}([(\Delta X_n)^2 - \Delta A_n]|\mathcal{F}_{n-1}) = \mathbb{E}([X_n^2 - X_{n-1}^2]|\mathcal{F}_{n-1}) - \mathbb{E}(\Delta A_n|\mathcal{F}_{n-1}).$$
In other words, since $A$ is predictable,
$$\Delta A_n = \mathbb{E}((\Delta X_n)^2|\mathcal{F}_{n-1}), \tag{7.5}$$
which exhibits the process $A$ as a conditional 'quadratic variation' process of the original martingale $X$. Taking expectations: $\mathbb{E}((\Delta X_n)^2) = \mathbb{E}(\Delta A_n)$.
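The compensator construction can be checked numerically for a concrete walk. In the sketch below (the increment distribution, helper name and sample sizes are our own illustrative choices) we take $Z_k$ uniform on $\{-2,-1,1,2\}$, so $\Delta A_n = \mathbb{E}(Z_n^2) = 2.5$, $A_n = 2.5n$, and $M_n = X_n^2 - 2.5n$ should have constant mean 0.

```python
import random

def doob_M(path, var_step=2.5):
    """M_n = X_n^2 - A_n, where A_n = n * E(Z^2) for i.i.d. increments."""
    return [x * x - n * var_step for n, x in enumerate(path)]

# Z uniform on {-2, -1, 1, 2}: zero mean, E(Z^2) = 2.5, so Delta A_n = 2.5.
rng = random.Random(3)
n_paths, n_steps = 30000, 12
total = [0.0] * (n_steps + 1)
for _ in range(n_paths):
    x, path = 0, [0]
    for _ in range(n_steps):
        x += rng.choice([-2, -1, 1, 2])
        path.append(x)
    for n, m in enumerate(doob_M(path)):
        total[n] += m
print(all(abs(t / n_paths) < 1.5 for t in total))
```

The sample means of $M_n$ stay near 0 for every $n$, as the martingale part of the decomposition should.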
Example 7.51

Note also that $\mathbb{E}(X_n^2) = \mathbb{E}(M_n) + \mathbb{E}(A_n) = \mathbb{E}(A_n)$ (why?), so that both sides are bounded for all $n$ if and only if the martingale $X$ is bounded as a sequence in $L^2(\Omega, \mathcal{F}, P)$. Since $(A_n)$ is increasing, the a.s. limit $A_\infty(\omega) = \lim_{n\to\infty} A_n(\omega)$ exists, and the boundedness of the integrals ensures in that case that $\mathbb{E}(A_\infty) < \infty$.
Exercise 7.20

Suppose $(Z_n)_{n\ge 1}$ is a sequence of Bernoulli random variables, with each $Z_n$ taking the values $1$ and $-1$, each with probability $\frac12$. Let $X_0 = 0$, $X_n = Z_1 + Z_2 + \cdots + Z_n$, and let $(\mathcal{F}_n)$ be the natural filtration generated by the $(Z_n)$. Verify that $(X_n^2)$ is a submartingale, and find the increasing process $(A_n)$ in its Doob decomposition. What 'unexpected' property of $(A_n)$ can you detect in this example?
In the discrete setting we now have the tools to construct 'stochastic integrals' and show that they preserve the martingale property. In fact, as we saw for Lebesgue-Stieltjes measures, for discrete distributions the 'integral' is simply an appropriate linear combination of increments of the distribution function. If we wish to use a martingale $X$ as an integrator, we therefore need to deal with linear combinations of the increments $\Delta X_n = X_n - X_{n-1}$. Since we are now dealing with stochastic processes (that is, functions of both $n$ and $\omega$) rather than real functions, measurability conditions will help determine what constitutes an 'appropriate' linear combination. So, if for $\omega \in \Omega$ we set $I_0(\omega) = 0$ and form the sums
$$I_n(\omega) = \sum_{k=1}^n c_k(\omega)(\Delta X_k)(\omega) = \sum_{k=1}^n c_k(\omega)(X_k(\omega) - X_{k-1}(\omega)) \quad \text{for } n \ge 1,$$
we look for measurability properties of the process $(c_n)$ which ensure that the new process $(I_n)$ has useful properties. We investigate this when $(c_n)$ is a bounded predictable process and $X$ is a martingale for a given filtration $(\mathcal{F}_n)$. Some texts call the process $(I_n)$ a martingale transform; we prefer the term discrete stochastic integral. We calculate the conditional expectation of $I_n$:
$$\mathbb{E}(I_n|\mathcal{F}_{n-1}) = I_{n-1} + \mathbb{E}(c_n \Delta X_n|\mathcal{F}_{n-1}) = I_{n-1} + c_n \mathbb{E}(\Delta X_n|\mathcal{F}_{n-1}) = I_{n-1},$$
since $c_n$ is $\mathcal{F}_{n-1}$-measurable and $\mathbb{E}(\Delta X_n|\mathcal{F}_{n-1}) = \mathbb{E}(X_n|\mathcal{F}_{n-1}) - X_{n-1} = 0$. Therefore, when the process $c = (c_n)$ which is integrated against the martingale $X = (X_n)$ is predictable, the martingale property is preserved under the discrete stochastic integral: $I = (I_n)$ is also a martingale with respect to the filtration $(\mathcal{F}_n)$. We shall write this stochastic integral as $c \cdot X$, meaning that for all $n \ge 0$, $I_n = (c \cdot X)_n$. The result has sufficient importance for us to record it as a theorem:

Theorem 7.52

Let $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n\ge 0}, P)$ be a filtered probability space. If $X$ is a martingale and $c$ is a bounded predictable process, then the discrete stochastic integral $c \cdot X$ is again a martingale.
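Theorem 7.52 can be illustrated by simulation. In the sketch below (the helper name and the 'double the stake after a loss' rule are our own illustrative choices) the stake may depend only on the past, so $c$ is predictable, and the transformed game stays fair:

```python
import random

def transform(stake_fn, zs):
    """Discrete stochastic integral (c . X)_n for X_n = z_1 + ... + z_n.
    stake_fn sees only the history X_0, ..., X_{n-1}: c is predictable."""
    hist, integral, out = [0], 0, [0]
    for z in zs:
        c = stake_fn(hist)       # stake chosen before z is revealed
        integral += c * z
        hist.append(hist[-1] + z)
        out.append(integral)
    return out

# 'Double the stake when losing' against a fair +/-1 coin: still fair.
strategy = lambda hist: 2 if hist[-1] < 0 else 1
rng = random.Random(5)
finals = [transform(strategy, [rng.choice([1, -1]) for _ in range(10)])[-1]
          for _ in range(30000)]
print(abs(sum(finals) / len(finals)) < 0.15)
```

The sample mean of $(c \cdot X)_{10}$ stays near 0: the strategy does not tilt the game.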
Note that we use the boundedness assumption in order to ensure that $c_k \Delta X_k$ is integrable, so that its conditional expectation makes sense. For $L^2$-martingales (which are what we obtain in most applications) we can relax this condition and demand merely that $c_n \in L^2(\mathcal{F}_{n-1})$ for each $n$.

While the preservation of the martingale property may please mathematicians, it is depressing news for gamblers! We can interpret the process $c$ as representing the size of the stake the gambler ventures in every game, so that $c_n$ is the amount (s)he bets in game $n$. Note that $c_n$ could be $0$, which means that the gambler 'sits out' game $n$ and places no bet. It also seems reasonable that the size of the stake depends on the outcomes of the previous games, hence $c_n$ is $\mathcal{F}_{n-1}$-measurable, and thus $c$ is predictable.

The conclusion that $c \cdot X$ is then a martingale means that 'clever' gambling strategies will be of no avail when the game is fair. It remains fair, whatever strategy the gambler employs! And, of course, if it starts out unfavourable to the gambler, so that $X$ is a supermartingale ($X_{n-1} \ge \mathbb{E}(X_n|\mathcal{F}_{n-1})$), the above calculation shows that, as long as $c_n \ge 0$ for each $n$, then $\mathbb{E}(I_n|\mathcal{F}_{n-1}) \le I_{n-1}$, so that the game remains unfavourable, whatever non-negative stakes the gambler places (and negative bets seem unlikely to be accepted, after all ...). You will verify immediately, of course, that a submartingale $X$ produces a submartingale $c \cdot X$ when $c$ is a non-negative process. Sadly, such favourable games are hard to find in practice.
Combining the definition of $(I_n)$ with the Doob decomposition of the submartingale $X^2$, we obtain an identity which illustrates why martingales make useful 'integrators'. We calculate the expected value of the square of $(c \cdot X)_n$ when $c = (c_n)$ is predictable and $X = (X_n)$ is a martingale:
$$\mathbb{E}((c \cdot X)_n^2) = \mathbb{E}\Big(\sum_{k=1}^n c_k \Delta X_k\Big)^2 = \sum_{j,k=1}^n \mathbb{E}(c_j c_k \Delta X_j \Delta X_k).$$
Consider the terms in the double sum separately: when $j < k$ we have
$$\mathbb{E}(c_j c_k \Delta X_j \Delta X_k) = \mathbb{E}(\mathbb{E}(c_j c_k \Delta X_j \Delta X_k|\mathcal{F}_{k-1})) = \mathbb{E}(c_j c_k \Delta X_j\, \mathbb{E}(\Delta X_k|\mathcal{F}_{k-1})) = 0,$$
since the first three factors are all $\mathcal{F}_{k-1}$-measurable, while $\mathbb{E}(\Delta X_k|\mathcal{F}_{k-1}) = 0$ since $X$ is a martingale. With $j, k$ interchanged this also shows that these terms are $0$ when $k < j$.

The remaining terms have the form
$$\mathbb{E}(c_k^2 (\Delta X_k)^2) = \mathbb{E}(c_k^2\, \mathbb{E}((\Delta X_k)^2|\mathcal{F}_{k-1})) = \mathbb{E}(c_k^2 \Delta A_k).$$
By linearity, therefore, we have the fundamental identity for stochastic integrals relative to martingales (also called the Ito isometry):
$$\mathbb{E}((c \cdot X)_n^2) = \sum_{k=1}^n \mathbb{E}(c_k^2 \Delta A_k) = \mathbb{E}\Big(\sum_{k=1}^n c_k^2 \Delta A_k\Big).$$
Remark 7.53
The sum inside the expectation sign on the right is a 'Stieltjes sum' for the increasing process, so that it is now at least plausible that this identity allows us
to define martingale integrals in the continuous-time setting, using approximation of processes by simple processes, much as was done throughout this book
for real functions. The Ito isometry is of critical importance in the definition of
stochastic integrals relative to processes such as Brownian motion: in defining
Lebesgue-Stieltjes integrals our integrators were of bounded variation. Typically, the paths of Brownian motion (a process we shall not discuss in detail in
this book - see (e.g.) [3] for its basic properties) are not of bounded variation,
but the Ito isometry shows that their quadratic variation can be handled in
the (much subtler) continuous-time version of the above framework, and this
enables one to define integrals of a wide class of functions, using Brownian
motion (and more general martingales) as the 'integrator'.
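The discrete Ito isometry is easy to test by simulation. A minimal sketch follows (the stake rule and sample sizes are our own illustrative choices); for the $\pm 1$ walk $\Delta A_k = 1$, so the right-hand side reduces to $\mathbb{E}(\sum_k c_k^2)$:

```python
import random

# Check E[(c.X)_n^2] = E[sum_k c_k^2 * Delta A_k] for a +/-1 walk,
# where Delta A_k = E((Delta X_k)^2 | F_{k-1}) = 1.
rng = random.Random(9)
n_paths, n_steps = 40000, 8
lhs_sum, rhs_sum = 0.0, 0.0
for _ in range(n_paths):
    x, integral, sq = 0, 0.0, 0.0
    for _ in range(n_steps):
        c = 1.0 if x >= 0 else 0.5   # predictable: depends on X_{k-1} only
        z = rng.choice([1, -1])
        integral += c * z
        sq += c * c                  # c_k^2 * Delta A_k with Delta A_k = 1
        x += z
    lhs_sum += integral ** 2
    rhs_sum += sq
print(round(abs(lhs_sum - rhs_sum) / n_paths, 2))
```

The two sample averages agree to within Monte Carlo error, as the isometry predicts.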
We turn finally to the idea of stopping a martingale at a random time.
Definition 7.54

A random variable $\tau : \Omega \to \{0, 1, 2, \ldots, n, \ldots\} \cup \{\infty\}$ is a stopping time relative to the filtration $(\mathcal{F}_n)$ if for every $n \ge 0$ the event $\{\tau = n\}$ belongs to $\mathcal{F}_n$.

Note that we include the value $\tau(\omega) = \infty$, so that we need $\{\tau = \infty\} \in \mathcal{F}_\infty = \sigma(\bigcup_{n\ge 1} \mathcal{F}_n)$, the 'limit $\sigma$-field'. Stopping times are also called random times, to emphasise that the 'time' $\tau$ is a random variable.

For a stopping time $\tau$ the event $\{\tau \le n\} = \bigcup_{k=0}^n \{\tau = k\}$ is in $\mathcal{F}_n$, since for each $k \le n$, $\{\tau = k\} \in \mathcal{F}_k$ and the $\sigma$-fields increase with $n$. On the other hand, given that for each $n$ the event $\{\tau \le n\} \in \mathcal{F}_n$, then
$$\{\tau = n\} = \{\tau \le n\} \setminus \{\tau \le n-1\} \in \mathcal{F}_n.$$
Thus we could equally well have taken the condition $\{\tau \le n\} \in \mathcal{F}_n$ for all $n$ as the definition of stopping time.
Example 7.55

A gambler may decide to stop playing after a random number of games, depending on whether the winnings $X$ have reached a predetermined level $L$ (or his funds are exhausted!). The time $\tau = \min\{n : X_n \ge L\}$ is the first time at which the process $X$ hits the interval $[L, \infty)$; more precisely, for $\omega \in \Omega$, $\tau(\omega) = n$ if $X_n(\omega) \ge L$ while $X_k(\omega) < L$ for all $k < n$. Since $\{\tau = n\}$ is thus determined by the value of $X_n$ and those of the $X_k$ for $k < n$, it is now clear that $\tau$ is a stopping time.
Example 7.56

Similarly, we may decide to sell our shares in a stock $S$ if its value falls below 75% of its current (time 0) price. Thus we sell at the random time $\tau = \min\{n : S_n < \frac{3}{4}S_0\}$, which is again a stopping time. This is an example of a 'stop-loss strategy', and is much in evidence in a bear market.

Quite generally, the first hitting time $\tau_A$ of a Borel set $A \subset \mathbb{R}$ by an adapted process $X$ is defined by setting $\tau_A = \min\{n \ge 0 : X_n \in A\}$. For any $n \ge 0$ we have $\{\tau_A \le n\} = \bigcup_{k\le n}\{X_k \in A\} \in \mathcal{F}_n$. To cater for the possibility that $X$ never hits $A$ we use the convention $\min \emptyset = \infty$, so that $\{\tau_A = \infty\} = \Omega \setminus (\bigcup_{n\ge 0}\{\tau_A \le n\})$ represents this event. But its complement is in $\mathcal{F}_\infty = \sigma(\bigcup_{n\ge 0} \mathcal{F}_n)$, thus so is $\{\tau_A = \infty\}$. We have proved that $\tau_A$ is a stopping time.
Returning to the gambling theme, we see that stopping is simply a particular form of gambling strategy, and it should thus come as no surprise that the martingale property is preserved under stopping (with similar conclusions for super- and submartingales). For any adapted process $X$ and stopping time $\tau$, we define the stopped process $X^\tau$ by setting $X^\tau_n(\omega) = X_{n\wedge\tau(\omega)}(\omega)$ at each $\omega \in \Omega$. (Recall that for real $x, y$ we write $x \wedge y = \min\{x, y\}$.)

The stopped process $X^\tau$ is again adapted to the filtration $(\mathcal{F}_n)$, since $\{X_{\tau\wedge n} \in A\}$ means that either $\tau > n$ and $X_n \in A$, or $\tau = k$ for some $k \le n$ and $X_k \in A$. Now the event $\{X_n \in A\} \cap \{\tau > n\} \in \mathcal{F}_n$, while for each $k$ the event $\{\tau = k\} \cap \{X_k \in A\} \in \mathcal{F}_k$. For all $k \le n$ these events therefore all belong to $\mathcal{F}_n$. Hence so does $\{X_{\tau\wedge n} \in A\}$, which proves that $X^\tau$ is adapted.
Theorem 7.57

Let $(\Omega, \mathcal{F}, (\mathcal{F}_n), P)$ be a filtered probability space, and let $X$ be a martingale with $X_0 = 0$. If $\tau$ is a stopping time, the stopped process $X^\tau$ is again a martingale.

Proof

We use the preservation of the martingale property under discrete stochastic integrals ('gambling strategies'). Let $c_n = 1_{\{\tau \ge n\}}$ for each $n \ge 1$. This defines a bounded predictable process $c = (c_n)$, since it takes only the values $0, 1$ and $\{c_n = 0\} = \{\tau \le n-1\} \in \mathcal{F}_{n-1}$, so that also $\{c_n = 1\} = \Omega \setminus \{c_n = 0\} \in \mathcal{F}_{n-1}$. Hence by Theorem 7.52 the process $c \cdot X$ is again a martingale. But by construction $(c \cdot X)_0 = 0 = X_0 = X^\tau_0$, while for any $n \ge 1$,
$$(c \cdot X)_n = \sum_{k=1}^n 1_{\{\tau \ge k\}}(X_k - X_{k-1}) = X_{\tau\wedge n} - X_0 = X^\tau_n;$$
hence $X^\tau$ is a martingale. $\square$
Since $c_n \ge 0$ as defined in the proof, it follows that the supermartingale and submartingale properties are also preserved under stopping. For a martingale we have, in addition, that expectation is preserved, i.e. (in general) $\mathbb{E}(X_{\tau\wedge n}) = \mathbb{E}(X_0)$. Similarly, expectations increase for stopped submartingales, and decrease for stopped supermartingales.

None of this, however, guarantees that the random variable $X_\tau$ defined by $X_\tau(\omega) = X_{\tau(\omega)}(\omega)$ has finite expectation; to obtain a result which relates its expectation to that of $X_0$ we generally need to satisfy much more stringent conditions. For bounded stopping times (where there is a uniform upper bound $N$ with $\tau(\omega) \le N$ for all $\omega \in \Omega$), matters are simple: if $X$ is a martingale, $X_{\tau\wedge n}$ is integrable for all $n$, and by the above theorem $\mathbb{E}(X_{\tau\wedge n}) = \mathbb{E}(X_0)$. Now apply this with $n = N$, so that $X_{\tau\wedge n} = X_{\tau\wedge N} = X_\tau$. Thus we have $\mathbb{E}(X_\tau) = \mathbb{E}(X_0)$ whenever $\tau$ is a bounded stopping time. We shall not delve any further, but refer the reader instead to texts devoted largely to martingale theory, e.g. [13], [8]. Bounded stopping times suffice for many practical applications, for example in the analysis of discrete American options in finance.
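The bounded-stopping conclusion can be checked by simulation. A sketch follows (the level, horizon, helper name and sample count are our own illustrative choices): a fair $\pm 1$ walk stopped when it first reaches a level $L$, truncated at $N$ steps, should still have mean 0.

```python
import random

def stopped_value(n_max, level, rng):
    """X_{tau ^ n_max} for a fair +/-1 walk, tau = first n with X_n >= level."""
    x = 0
    for _ in range(n_max):
        if x >= level:       # tau has occurred: the walk is frozen
            return x
        x += rng.choice([1, -1])
    return x

# E(X_{tau ^ N}) = E(X_0) = 0 for the bounded stopping time tau ^ N.
rng = random.Random(11)
vals = [stopped_value(50, 5, rng) for _ in range(40000)]
print(abs(sum(vals) / len(vals)) < 0.25)
```

The occasional gains of size $L$ are exactly offset, on average, by the paths that never reach $L$ within the horizon.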
7.4.4 Applications to mathematical finance
A major task and challenge for the theory of finance is to price assets and
securities by building models consistent with market practice. This consistency
means that any deviation from the theoretical price should be penalized by the
market. Specifically, if a market player quotes a price different from the price
provided by the model, she should be bound to lose money.
The problem is the unknown future, which has somehow to be reflected in the price, since market participants express their beliefs about the future by agreeing on prices. Mathematically, an ideal situation is one where the price process $X(t)$ is a martingale. Then we would have the obvious pricing formula $X(0) = \mathbb{E}(X(T))$, and in addition the whole range of formulae for the intermediate prices by means of conditional expectations based on the information gathered.
However, the situation where the prices follow a martingale is incompatible with the market fact that money can be invested risk-free, which creates a benchmark for the expectations of investors investing in risky securities. So we modify our task by insisting that the discounted values $Y(t) = X(t)\exp\{-rt\}$
form a martingale. The modification is by means of a deterministic constant,
so it does not create a problem for asset valuation.
A particular goal is pricing derivative securities, where we are given the terminal value (a claim) of the form $f(S(T))$, where $f$ is known and the probability distributions of the values of the underlying asset $S(t)$ are assumed to be known by taking some mathematical model. The above martingale idea would solve the pricing problem by constructing a process $X(t)$ in such a way that $X(T) = f(S(T))$. (We call $X$ a replication of the claim.)
We can summarise the tasks: build a model of the prices of the underlying security $S(t)$ such that

1. there is a replication $X(t)$ such that $X(T) = f(S(T))$,
2. the process $Y(t) = X(t)\exp\{-rt\}$ is a martingale, so $Y(0)$ is the price of the security described by $f$,
3. any deviation from the resulting prices leads to a loss.

Steps 1 and 2 are mathematical in nature, but Step 3 is related to real market activities.
We perform the task for the single-step binomial model.
Step 1. Recall that the prices in this model are $S(0)$, $S(1) = S(0)\eta$, where $\eta = U$ or $\eta = D$ with some probabilities. Let $f(x) = (x - K)^+$. We can easily find $X(0)$ so that $X(1) = (S(1) - K)^+$ by building a portfolio of $n$ shares $S$ and $m$ units of the bank account (see Section 6.5.5) after solving the system
$$nS(0)U + mR = (S(0)U - K)^+,$$
$$nS(0)D + mR = (S(0)D - K)^+.$$
Note that $X(1)$ is obtained by means of the data at time $t = 0$ and the model parameters.

Step 2. Write $R = \exp\{r\}$. The martingale condition we need is trivial: $X(0)R = \mathbb{E}(X(1))$. The task here is to find the probability measure ($X$ is defined, and we have no influence on $R$, so this is the only parameter we can adjust). Recall that we assume $D < R < U$. We solve
$$S(0)R = pS(0)U + (1 - p)S(0)D,$$
which gives
$$p = \frac{R - D}{U - D},$$
and we are done. Hence the theoretical price of a call is
$$C = R^{-1}\big[p(S(0)U - K)^+ + (1 - p)(S(0)D - K)^+\big] = \exp\{-r\}\,\mathbb{E}(S(1) - K)^+.$$
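The two steps above can be sketched in code (the numbers and the helper `one_step_call_price` are our own illustrative choices, not from the text). It prices the call both by the replicating portfolio and by the discounted risk-neutral expectation, and the two agree:

```python
import math

def one_step_call_price(s0, u, d, r, k):
    """Single-step binomial call price, two ways: replicating portfolio
    (n shares, m in the bank) and discounted risk-neutral expectation."""
    big_r = math.exp(r)                        # R = exp(r); D < R < U assumed
    pay_up, pay_dn = max(s0 * u - k, 0.0), max(s0 * d - k, 0.0)
    # Step 1: solve n*S(0)*U + m*R = pay_up, n*S(0)*D + m*R = pay_dn.
    n = (pay_up - pay_dn) / (s0 * (u - d))
    m = (pay_up - n * s0 * u) / big_r
    replication_cost = n * s0 + m
    # Step 2: risk-neutral probability p = (R - D) / (U - D).
    p = (big_r - d) / (u - d)
    expectation = (p * pay_up + (1 - p) * pay_dn) / big_r
    return replication_cost, expectation

cost, exp_price = one_step_call_price(100.0, 1.2, 0.9, 0.05, 100.0)
print(round(cost, 4), round(exp_price, 4))
```

The equality of the two numbers is exactly the statement that the replication cost is the risk-neutral price.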
Step 3. To see that within our model the price is right, suppose that someone is willing to buy a call for $C' > C$. We sell it, immediately investing the proceeds in the portfolio from Step 1. At exercise our portfolio matches the call payoff, and we have earned the difference $C' - C$. So we keep selling the call until the buyer realises the mistake and the quoted price falls. Similarly, if someone is selling a call at $C' < C$ we generate cash by forming the $(-n, -m)$ portfolio, buy a call (which, as a result of replication, settles our portfolio liability at maturity) and have a profit until the call prices quoted hit the theoretical price $C$.
For continuous time the task becomes quite difficult. In the Black-Scholes model $S(t) = S(0)\exp\{(r - \frac12\sigma^2)t + \sigma w(t)\}$, where $w(t)$ is a stochastic process (called the Wiener process or Brownian motion) with $w(0) = 0$ and independent increments such that $w(t) - w(s)$ is Gaussian with zero mean and variance $t - s$, $s < t$.
Exercise 7.21

Show that $\exp\{-rt\}S(t)$ is a martingale.
Existence of the replication process $X(t)$ is not easy to establish but can be proved, as can the fact that the process $Y(t)$ is a martingale. (Recall that we proved the martingale property of the discounted prices in the binomial model.) This results in the same general pricing formula: $\exp\{-rT\}\,\mathbb{E}(f(S(T)))$. Using the density of $w(T)$, this number can be written in explicit form for particular derivative securities (see Section 4.7.5, where the formulae for the prices of call and put options were derived).
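The martingale property of the discounted Black-Scholes price, the subject of Exercise 7.21, can also be probed by Monte Carlo. A sketch (the parameters, helper name and sample size are our own illustrative choices): discounting should give back $S(0)$ in expectation.

```python
import math
import random

def discounted_terminal(s0, r, sigma, t, rng):
    """One sample of exp(-rt) * S(t), with S(t) = S(0) exp((r - sigma^2/2) t
    + sigma * w(t)) and w(t) ~ N(0, t)."""
    w = rng.gauss(0.0, math.sqrt(t))
    return math.exp(-r * t) * s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * w)

# The martingale property of Y(t) = exp(-rt) S(t) gives E(Y(t)) = S(0).
rng = random.Random(13)
draws = [discounted_terminal(100.0, 0.05, 0.2, 1.0, rng) for _ in range(100000)]
print(abs(sum(draws) / len(draws) - 100.0) < 1.0)
```

The $-\frac12\sigma^2 t$ correction in the exponent is exactly what makes the discounted mean come out at $S(0)$.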
Remark 7.58
The reader familiar with finance theory will notice that we are focused on
pricing derivative securities and this results in considering the model where the
discounted prices form a martingale. This model is a mathematical creation
not necessarily consistent with real life, which requires a different probability
space and different parameters within the same framework. The link between
the real world and the martingale one is provided by a special application of the Radon-Nikodym theorem, which links the probability measures, but this is a
story we shall not pursue here; we refer the reader to numerous books devoted
to the subject (for example [5]).
7.5 Proofs of propositions
Proof (of Proposition 7.2)

Suppose the $(\varepsilon, \delta)$-condition fails. With $A$ as in the hint, we have $\mu(A) \le \mu(\bigcup_{i>n} F_i) \le \sum_{i>n} 2^{-i} = 2^{-n}$ for every $n \ge 1$. Thus $\mu(A) = 0$. But $\nu(F_n) \ge \varepsilon$ for every $n$, hence $\nu(E_n) \ge \varepsilon$, where $E_n = \bigcup_{i>n} F_i$. The sequence $(E_n)$ decreases, so that, as $\nu(E_1)$ is finite, Theorem 2.19 (i) gives: $\nu(A) = \nu(\lim E_n) = \lim \nu(E_n) \ge \varepsilon$. Thus $\nu$ is not absolutely continuous with respect to $\mu$. Conversely, if the $(\varepsilon, \delta)$-condition holds and $\mu(F) = 0$, then $\mu(F) < \delta$ for any $\delta > 0$, and so, for every given $\varepsilon > 0$, $\nu(F) < \varepsilon$. Hence $\nu(F) = 0$. So $\nu \ll \mu$. $\square$
Proof (of Proposition 7.6)

We proceed as in Remark 4.18. Begin with $g$ as the indicator function $1_G$ for $G \in \mathcal{G}$. Then we have: $\int_\Omega g\,d\mu = \mu(G) = \int_G h_\mu\,d\varphi$ by construction of $h_\mu$. Next, let $g = \sum_{i=1}^n a_i 1_{G_i}$ for sets $G_1, G_2, \ldots, G_n$ in $\mathcal{G}$ and reals $a_1, a_2, \ldots, a_n$; then linearity of the integrals yields
$$\int_\Omega g\,d\mu = \sum_{i=1}^n a_i \mu(G_i) = \sum_{i=1}^n a_i \int_{G_i} h_\mu\,d\varphi = \int_\Omega g h_\mu\,d\varphi.$$
Finally, any $\mathcal{G}$-measurable non-negative function $g$ is approximated from below by an increasing sequence of $\mathcal{G}$-simple functions $g_n$; its integral $\int_\Omega g\,d\mu$ is the limit of the increasing sequence $(\int_\Omega g_n h_\mu\,d\varphi)_n$. But since $0 \le h_\mu \le 1$ by construction, $(g_n h_\mu)$ increases to $g h_\mu$ pointwise, hence the sequence $(\int_\Omega g_n h_\mu\,d\varphi)_n$ also increases to $\int_\Omega g h_\mu\,d\varphi$, so the limits are equal. For integrable $g = g^+ - g^-$ apply the above to each of $g^+$, $g^-$ separately, and use linearity. $\square$
Proof (of Proposition 7.8)

If $\bigcup_{n\ge 1} A_n = \Omega$ and the $A_n$ are not disjoint, replace them by $E_n$, where $E_1 = A_1$, $E_n = A_n \setminus (\bigcup_{i=1}^{n-1} E_i)$, $n > 1$. The same can be done for the $B_m$, and hence we can take both sequences as disjoint. Now $\Omega = \bigcup_{m,n\ge 1}(A_n \cap B_m)$ is also a disjoint union, and $\nu, \mu$ are both finite on each $A_n \cap B_m$. Re-order these sets into a single sequence $(C_n)_{n\ge 1}$ and fix $n \ge 1$. Restricting both measures to the $\sigma$-field $\mathcal{F}_n = \{F \cap C_n : F \in \mathcal{F}\}$ yields them as finite measures on $(\Omega, \mathcal{F}_n)$, so that the Radon-Nikodym theorem applies and provides a non-negative $\mathcal{F}_n$-measurable function $h_n$ such that $\nu(E) = \int_E h_n\,d\mu$ for each $E \in \mathcal{F}_n$. But any set $F \in \mathcal{F}$ has the form $F = \bigcup_n F_n$ for $F_n \in \mathcal{F}_n$, so we can define $h$ by setting $h = h_n$ on $C_n$ for every $n \ge 1$. Now $\nu(F) = \sum_{n=1}^\infty \int_{F_n} h_n\,d\mu = \int_F h\,d\mu$. The uniqueness is clear: if $g$ has the same properties as $h$, then $\int_F (h - g)\,d\mu = 0$ for each $F \in \mathcal{F}$, so $h - g = 0$ a.e. by Theorem 4.22. $\square$
Proof (of Proposition 7.9)

(i) This is trivial, since $\varphi = \lambda + \nu$ is $\sigma$-finite and absolutely continuous with respect to $\mu$, and we have, for $F \in \mathcal{F}$:
$$\int_F \frac{d\varphi}{d\mu}\,d\mu = \varphi(F) = (\lambda + \nu)(F) = \lambda(F) + \nu(F) = \int_F \Big(\frac{d\lambda}{d\mu} + \frac{d\nu}{d\mu}\Big)\,d\mu.$$
The integrands on the left and right extremes are thus a.s. ($\mu$) equal, so the result follows.

(ii) Write $\frac{d\nu}{d\lambda} = g$ and $\frac{d\lambda}{d\mu} = h$. These are non-negative measurable functions, and we need to show that, for $F \in \mathcal{F}$,
$$\nu(F) = \int_F g\,d\lambda = \int_F gh\,d\mu.$$
First consider this when $g$ is replaced by a simple function of the form $\phi = \sum_{i=1}^n c_i 1_{E_i}$. Then we obtain:
$$\int_F \phi\,d\lambda = \sum_{i=1}^n c_i \lambda(F \cap E_i) = \sum_{i=1}^n c_i \int_{F \cap E_i} h\,d\mu = \int_F \phi h\,d\mu.$$
Now let $(\phi_n)$ be a sequence of simple functions increasing pointwise to $g$. Then by the monotone convergence theorem:
$$\int_F g\,d\lambda = \lim_n \int_F \phi_n\,d\lambda = \lim_n \int_F \phi_n h\,d\mu = \int_F gh\,d\mu,$$
since $(\phi_n h)$ increases to $gh$. This proves our claim. $\square$
Proof (of Proposition 7.11)

Use the hint: $\lambda_1 + \lambda_2$ is concentrated on $A_1 \cup A_2$, while $\mu$ is concentrated on $B_1 \cap B_2$. But $A_1 \cup A_2$ is disjoint from $B_1 \cap B_2$, hence the measures $\lambda_1 + \lambda_2$ and $\mu$ are mutually singular. This proves (i). For (ii), choose a set $E$ for which $\mu(E) = 0$ while $\lambda_2$ is concentrated on $E$. Let $F \subset E$, so that $\mu(F) = 0$ and hence $\lambda_1(F) = 0$ (since $\lambda_1 \ll \mu$). This shows that $\lambda_1$ is concentrated on $E^c$, hence $\lambda_1$ and $\lambda_2$ are mutually singular. Finally, (ii) applied with $\lambda_1 = \lambda_2 = \nu$ shows that $\nu \perp \nu$, which can only happen when $\nu = 0$. $\square$
Proof (of Proposition 7.15)

Fix $x \in \mathbb{R}$. The set $A = \{F(y) : y < x\}$ is bounded above by $F(x)$, while $B = \{F(y) : x < y\}$ is bounded below by $F(x)$. Hence $\sup A = K_1 \le F(x)$ and $\inf B = K_2 \ge F(x)$ both exist in $\mathbb{R}$, and for any $\varepsilon > 0$ we can find $y_1 < x$ such that $K_1 - \varepsilon < F(y_1)$ and $y_2 > x$ such that $K_2 + \varepsilon > F(y_2)$. But since $F$ is increasing this means that $K_1 - \varepsilon < F(y) \le K_1$ throughout the interval $(y_1, x)$ and $K_2 \le F(y) < K_2 + \varepsilon$ throughout $(x, y_2)$. Thus both one-sided limits $F(x-) = \lim_{y\uparrow x} F(y)$ and $F(x+) = \lim_{y\downarrow x} F(y)$ are well defined, and by their definition $F(x-) \le F(x) \le F(x+)$.

Now let $C = \{x \in \mathbb{R} : F \text{ is discontinuous at } x\}$. For any $x \in C$ we have $F(x-) < F(x+)$. Hence we can find a rational $r = r(x)$ in the open interval $(F(x-), F(x+))$. No two distinct $x$ can have the same $r(x)$, since if $x_1 < x_2$ we obtain $F(x_1+) \le F(x_2-)$ from the definition of these limits. Thus the correspondence $x \mapsto r(x)$ defines a one-one correspondence between $C$ and a subset of $\mathbb{Q}$, so $C$ is at most countable. At each discontinuity we have $F(x-) < F(x+)$, so all discontinuities result from jumps of $F$. $\square$
Proof (of Proposition 7.21)

Fix $\varepsilon > 0$, and let a finite set of disjoint intervals $J_k = (x_k, y_k)$ be given. Let $E = \bigcup_k J_k$. Then
$$\sum_{k=1}^n |F(y_k) - F(x_k)| = \sum_{k=1}^n \Big|\int_{x_k}^{y_k} f\,dm\Big| \le \sum_{k=1}^n \int_{x_k}^{y_k} |f|\,dm = \int_E |f|\,dm.$$
But since $f \in L^1$, the measure $\mu(G) = \int_G |f|\,dm$ is absolutely continuous with respect to Lebesgue measure $m$, and hence, by Proposition 7.2, there exists $\delta > 0$ such that $m(E) < \delta$ implies $\mu(E) < \varepsilon$. But if the total length of the intervals $J_k$ is less than $\delta$, then $m(E) < \delta$, hence $\mu(E) < \varepsilon$. This proves that the function $F$ is absolutely continuous. $\square$
Proof (of Proposition 7.24)

Use the functions defined in the hint: for any partition $(x_k)_{k\le n}$ of $[a, x]$ we have
$$F(x) - F(a) = \sum_{k=1}^n [F(x_k) - F(x_{k-1})] = \sum_{k=1}^n [F(x_k) - F(x_{k-1})]^+ - \sum_{k=1}^n [F(x_k) - F(x_{k-1})]^- = p(x) - n(x),$$
so that $p(x) = n(x) + [F(x) - F(a)] \le N_F(x) + [F(x) - F(a)]$ by definition of sup. This holds for all partitions, hence $P_F(x) = \sup p(x) \le N_F(x) + [F(x) - F(a)]$. On the other hand, writing $n(x) = p(x) + [F(a) - F(x)]$ yields $N_F(x) \le P_F(x) + [F(a) - F(x)]$. Thus $P_F(x) - N_F(x) = F(x) - F(a)$. Now for any fixed partition we have
$$T_F(x) \ge \sum_{k=1}^n |F(x_k) - F(x_{k-1})| = p(x) + n(x) = 2p(x) - [F(x) - F(a)].$$
Take the supremum on the right: $T_F(x) \ge 2P_F(x) - [F(x) - F(a)] = 2P_F(x) - P_F(x) + N_F(x) = P_F(x) + N_F(x)$. But we can also write $\sum_{k=1}^n |F(x_k) - F(x_{k-1})| = p(x) + n(x) \le P_F(x) + N_F(x)$ for any partition, so taking the sup on the left provides $T_F(x) \le P_F(x) + N_F(x)$. So the two sides are equal. $\square$
Proof (of Proposition 7.25)

It will suffice to prove this for $T_F$, as the other cases are similar. If the partition $P$ of $[a, b]$ produces the sum $t(P)$ for the absolute differences, and if $P' = P \cup \{c\}$ for some $c \in (a, b)$, then $t(P)[a, b] \le t(P')[a, c] + t(P')[c, b]$, and this is bounded above by $T_F[a, c] + T_F[c, b]$ for all partitions. Thus it bounds $T_F[a, b]$ also. On the other hand, any partitions of $[a, c]$ and $[c, b]$ together make up a partition of $[a, b]$, so that $T_F[a, b]$ bounds their joint sums. So the two sides must be equal. In particular, fixing $a$, $T_F[a, c] \le T_F[a, b]$ when $c \le b$, hence $T_F(x) = T_F[a, x]$ is increasing with $x$. The same holds for $P_F$ and $N_F$. The final statement is obvious. $\square$
Proof (of Proposition 7.27)

(i) Given $\varepsilon > 0$, find $\delta > 0$ such that $\sum_{i=1}^n (y_i - x_i) < \delta$ implies $\sum_{i=1}^n |F(y_i) - F(x_i)| < \varepsilon$. Given a partition $(t_i)_{i\le K}$ of $[a, b]$, we add further partition points, uniformly spaced and at a distance less than $\delta$ from each other, to ensure that the combined partition $(z_i)_{i\le N}$ has all its points less than $\delta$ apart. To do this we simply need to add $M$ points, where $M$ is the integer part of $(b-a)/\delta + 1$. Since the $(t_j)$ form a subset of the partition points $(z_i)_{i=0,1,\ldots,N}$ it follows that
$$\sum_{i=1}^K |F(t_i) - F(t_{i-1})| \le \sum_{i=1}^N |F(z_i) - F(z_{i-1})|.$$
The latter sum can be re-ordered into $M$ groups of terms, where each group begins and ends with two consecutive new partition points: the $k$th group then contains (say) $m_k$ points altogether, and by their construction the sum of their consecutive distances (i.e. the distance between the two new endpoints!) is less than $\delta$, so for each $k \le M$, $\sum_{i=1}^{m_k} |F(w_{i,k}) - F(w_{i-1,k})| < \varepsilon$, where the $(w_{i,k})$ are the re-ordered points $(z_i)$. Thus the whole sum is bounded by $M\varepsilon$, a bound independent of the original partition $(t_i)_{i\le K}$. This shows that $F \in BV[a, b]$.

For (ii), note first that, by (i), the function $F$ has bounded variation on $[a, b]$, so that over any subinterval $[x_i, y_i]$ the total variation function $T_F[x_i, y_i]$ is finite. Again take $\varepsilon, \delta$ as given in the definition of absolutely continuous functions. If $(x_i, y_i)_{i\le n}$ are disjoint subintervals with $\sum_{i=1}^n |y_i - x_i| < \delta$ then $\sum_{i=1}^n |F(y_i) - F(x_i)| < \varepsilon$. As in the previous proposition this implies that $\sum_{i=1}^n T_F[x_i, y_i] \le \varepsilon$. Thus both $\sum_{i=1}^n P_F[x_i, y_i]$ and $\sum_{i=1}^n N_F[x_i, y_i]$ are at most $\varepsilon$, so that the functions $F_1$ and $F_2$ are absolutely continuous. $\square$
Proof (of Proposition 7.31)

Obviously $\nu(\emptyset) = 0$, and $\emptyset \subset E$ for any $E$. So if $\nu$ is monotone increasing, $\nu(E) \ge \nu(\emptyset) = 0$. Hence $\nu$ is a measure. Conversely, if $\nu$ is a measure and $F \subset E$, then $\nu(E) = \nu(F) + \nu(E \setminus F) \ge \nu(F)$. $\square$
Proof (of Proposition 7.40)

If $E = \{f > 1\}$ has $\rho(E) > 0$, then $\frac{1}{\rho(E)}\int_E f\,d\rho$ is well defined and $\frac{1}{\rho(E)}\int_E f\,d\rho > 1$. This contradicts the hypothesis, so $\rho(E) = 0$. Similarly, $F = \{f < -1\}$ has $\rho(F) = 0$. Hence $|f| \le 1$ $\rho$-a.e. $\square$
Proof (of Proposition 7.41)
Choose h, A, B as in the hint. Recall that ν⁺ = ½(|ν| + ν), and note that
½(1 + h) = h·1_B, so that, for F ∈ 𝓕,

ν⁺(F) = ½ ∫_F (1 + h) d|ν| = ∫_{F∩B} h d|ν| = ν(F ∩ B).

But then, since B = Aᶜ,

ν⁻(F) = −[ν(F) − ν⁺(F)] = −[ν(F) − ν(F ∩ B)] = −ν(F ∩ A).

Finally, if ν = λ₁ − λ₂ where the λ_i are measures, then ν ≤ λ₁, so that
ν⁺(F) = ν(F ∩ B) ≤ λ₁(F ∩ B) ≤ λ₁(F) by monotonicity. This proves the final
statement of the proposition.
□
Proof (of Proposition 7.45)
(i) is immediate, as ∫_Ω 𝔼(X|𝒢) dP = ∫_Ω X dP by definition (take G = Ω).

(ii) If both integrands are 𝒢-measurable and ∫_G 𝔼(X|𝒢) dP = ∫_G X dP for all
G ∈ 𝒢, then the integrands are a.s. equal by Theorem 4.22, and thus X is a
version of 𝔼(X|𝒢).

(iii) For any G ∈ 𝒢, 1_G and X are independent random variables, so that

∫_G X dP = 𝔼(X 1_G) = 𝔼(X)𝔼(1_G) = ∫_G 𝔼(X) dP.

Hence by definition 𝔼(X) is a version of 𝔼(X|𝒢). But 𝔼(X) is constant, so the
identity holds everywhere.

(iv) Use the linearity of the integral:

∫_G (aX + bY) dP = a ∫_G X dP + b ∫_G Y dP
= a ∫_G 𝔼(X|𝒢) dP + b ∫_G 𝔼(Y|𝒢) dP
= ∫_G [a𝔼(X|𝒢) + b𝔼(Y|𝒢)] dP,

so the result follows.
□
8
Limit theorems
In this chapter we present the classical limit theorems of probability theory.
The reader may wish to omit the more technically demanding proofs at a
first reading, in order to gain an overview of the principal limit theorems for
sequences of random variables. We put the spotlight firmly on probability to
derive substantive applications of the preceding theory.
First, however, we discuss some basic modes of convergence of sequences of
functions of a real variable. Then we move to the probabilistic setting to which
this chapter is largely devoted.
8.1 Modes of convergence
Let E be a Borel subset of ℝⁿ. For a given sequence (f_n) in L^p(E), p ≥ 1, we
can express the statement 'f_n → f as n → ∞' in a number of distinct ways:
Definition 8.1
(i) f_n → f uniformly on E: given ε > 0, there exists N = N(ε) such that, for
all n ≥ N,

‖f_n − f‖_∞ = sup_{x∈E} |f_n(x) − f(x)| < ε.

(Note that we need f_n ∈ L^∞(E) for the sup to be finite in general.)
(ii) f_n → f pointwise on E: for each x ∈ E, given ε > 0, there exists N =
N(ε, x) such that |f_n(x) − f(x)| < ε for all n ≥ N.

(iii) f_n → f almost everywhere (a.e.) on E: there is a null set F ⊂ E such that
f_n → f pointwise on E \ F.

(iv) f_n → f in L^p-norm (in the pth mean): ‖f_n − f‖_p → 0 as n → ∞, i.e.
given ε > 0, there exists N = N(ε) such that ‖f_n − f‖_p < ε for all n ≥ N.
Handling these different modes of convergence requires some care; at this
point we have not even shown that they are all genuinely different. Clearly,
pointwise (and a.e.) limits are often easier to 'guess'; however, we cannot always
be certain that the limit function, if it exists, is again a member of L^p(E).
For mean convergence this is ensured by the completeness of L^p(E),
and similarly for uniform convergence, which is just convergence in the L^∞-norm. Note that the conclusions of the dominated and monotone convergence
theorems yield the mean convergence of a.e. convergent sequences (f_n), but
only by imposing additional conditions on the (f_n).
Theorem 8.2
With (f_n) as above, the only valid implications are the following: (i) ⇒ (ii) ⇒
(iii). For finite measures, (i) ⇒ (iv).

Proof

The above implications are obvious. It is also obvious that (iii) does not imply (ii).

To see that (ii) does not imply (i), take E = [0, 1], f_n = 1_{(0,1/n)}, which converges to f = 0
at all points, but sup_{x∈E} f_n(x) = 1 for all n.

For (iii) not implying (iv), take f_n = n·1_{(0,1/n]}; f_n converges to 0 pointwise, but

∫₀¹ f_nᵖ dm = nᵖ · (1/n) = n^{p−1} ≥ 1.

To see that (iv) does not imply (iii), let E = [0, 1] and put g₁ = 1_{[0,1]},
g₂ = 1_{[0,1/2]}, g₃ = 1_{[1/2,1]}, g₄ = 1_{[0,1/4]}, g₅ = 1_{[1/4,1/2]}, and so on,
running through the dyadic subintervals. Then

∫₀¹ g_nᵖ dm = ∫₀¹ g_n dm → 0,

but for each x ∈ [0, 1], g_n(x) = 1 for infinitely many n, so g_n(x) does not
converge at any x in E.
□
Example 8.3
We investigate h_n(x) = xⁿ for convergence in each of these modes on E = [0, 1].

The sequence converges everywhere to the function h with h(x) = 0 for x ∈ [0, 1)
and h(1) = 1, and so it also converges almost everywhere (to 0, since {1} is null).

It does not converge uniformly, since sup_{[0,1]} |h_n(x) − h(x)| = 1 for all n.

It converges to 0 in L^p for p ≥ 1, since ∫₀¹ x^{np} dx = 1/(np + 1) → 0 as n → ∞.
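A quick numerical check of this example (a sketch; the grid size and the sample exponents are arbitrary choices of ours):

```python
# h_n(x) = x^n on [0,1): the sup norm of h_n - 0 stays near 1 (no uniform
# convergence to 0 away from the endpoint), while the L^1 norm 1/(n+1) -> 0.

def sup_norm(n, grid=10_000):
    # sup over [0,1) of x^n, approximated on a uniform grid
    # (adequate only for n much smaller than the grid size)
    return max((i / grid) ** n for i in range(grid))

def l1_norm(n):
    # exact integral of x^n over [0,1]
    return 1.0 / (n + 1)

for n in (1, 10, 100):
    print(n, sup_norm(n), l1_norm(n))
```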
Remark 8.4
There are still other modes of convergence that can be considered for sequences of measurable functions, and the relations between them
are quite complex in general. We will not pursue this theme in full generality, but
specialise instead to probability spaces, where we derive additional relationships
between the different limit processes.
Exercise 8.1
For each of the following decide whether f_n → 0 in L^p, uniformly, pointwise, a.e.:

(a) f_n = 1_{[n, n+1/n]},

(b) f_n = n·1_{[0,1/n]} − n·1_{[−1/n,0]}.
8.2 Probability
The remainder of this chapter is devoted to a discussion of the basic limit
theorems for random variables in probability theory. The very definition of
'probabilities' relies on a belief in such results, i.e. that we can ascribe a meaning
to the 'limiting average' of successes in a sequence of independent identically
distributed trials. Then the purpose of the 'endless repetition' of tossing a coin
is to use the 'limiting frequency' of heads as the definition of the probability of
heads.
Similarly, the pivotal role ascribed in statistics to the Gaussian density
has its origin in the famous central limit theorem (of which there are actually
many versions), which shows this density to describe the limit distribution of a
sequence of distributions under appropriate conditions. Convergence of distributions therefore provides yet another important limit concept for sequences
of random variables.
In both cases the concept of independence plays a crucial role, and first of
all we need to extend this concept to infinite sequences of random variables. In
what follows (X_n)_{n≥1} will denote a sequence of random variables defined on a
probability space.
Definition 8.5
We say that the random variables (X_n)_{n≥1} are independent if for any k ∈ ℕ the
variables X₁, …, X_k are independent (see Definition 3.18).
An alternative is to demand that every finite collection of the X_n be independent.
This condition implies independence as just defined, since the finite collections
include the initial segments X₁, …, X_k.

Conversely, take any finite collection of the X_n and let k be the greatest index
appearing in it. Then X₁, …, X_k are independent, and every sub-collection of an
independent collection is independent; this applies, in particular, to the chosen one.
We study the sequence of partial sums S_n = X₁ + ⋯ + X_n.
If all X_n have the same distribution (we say that they are identically distributed), then S_n/n is the average value of X₁ (or of any X_k, it does not matter)
after n repetitions of the same experiment.

We study the behaviour of S_n as n goes to infinity. The two main questions
we address are:

1. When do the random variables S_n/n converge to a certain number? Here
there is an immediate question of the appropriate mode of convergence. Positive
answers to such questions are known as laws of large numbers.

2. When do the distributions of the (suitably normalised) sums S_n converge to a
measure? Under what conditions is this limit measure Gaussian? The results
we obtain in response are known as central limit theorems.
8.2.1 Convergence in probability
Our first additional mode of convergence, convergence in probability, is sometimes termed 'convergence in measure'.
Definition 8.6
A sequence (X_n) converges to X in probability if for each ε > 0

P(|X_n − X| > ε) → 0

as n → ∞.
Exercise 8.2
Go back to the proof of Theorem 8.2 (with E = [0,1]) to see which of the
sequences of random variables constructed there converge in probability.
Exercise 8.3
Find an example of a sequence of random variables on [0, 1] that does
not converge to 0 in probability.
We begin by showing that convergence almost surely (i.e. almost everywhere) is stronger than convergence in probability. But first we prove an auxiliary result.
Lemma 8.7
The following conditions are equivalent:
(i) Y_n → 0 almost surely,

(ii) for each ε > 0,

lim_{N→∞} P( ⋃_{n=N}^∞ {ω : |Y_n(ω)| ≥ ε} ) = 0.
Proof
Convergence almost surely, expressed succinctly, means that

P({ω : ∀ε > 0 ∃N ∈ ℕ : ∀n ≥ N, |Y_n(ω)| < ε}) = 1.

Writing this set of full measure another way, we have

P( ⋂_{ε>0} ⋃_{N∈ℕ} ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) = 1.

The probability of the outer intersection (over all ε > 0) is less than or equal to the probability of any of its terms, but being already 1 it cannot increase, hence for all
ε > 0

P( ⋃_{N∈ℕ} ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) = 1.

We have a union of increasing sets, so

lim_{N→∞} P( ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) = 1,

thus

lim_{N→∞} ( 1 − P( ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) ) = 0;

but we can write 1 = P(Ω), so that

P(Ω) − P( ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) = P( Ω \ ⋂_{n≥N} {ω : |Y_n(ω)| < ε} ) = P( ⋃_{n≥N} {ω : |Y_n(ω)| ≥ ε} )

by de Morgan's law. Hence (i) implies (ii). Working backwards, these steps also
prove the converse.
□
Theorem 8.8
If Xn -+ X almost surely then Xn -+ X in probability.
Proof
For simplicity of notation consider the difference Y_n = X_n − X, so that the problem
reduces to the convergence of Y_n to zero. We have

P( ⋃_{n=N}^∞ {ω : |Y_n(ω)| ≥ ε} ) ≥ P(|Y_N| ≥ ε),

and by Lemma 8.7 the limit on the left is zero, hence so is that on the right.
□
Note that the two sides of the inequality neatly summarize the difference
between convergence a.s. and in probability. For convergence in probability we
consider the probabilities that individual Y_n are at least ε away from the limit,
while for almost sure convergence we need to consider the whole tail sequence
(Y_n)_{n≥N} simultaneously.
The following example shows that the implication in the above theorem
cannot be reversed, and also shows that convergence in L^p does not imply
almost sure convergence.
Example 8.9
Consider the following sequence of random variables defined on Ω = [0, 1] with
Lebesgue measure: X₁ = 1_{[0,1]}, X₂ = 1_{[0,1/2]}, X₃ = 1_{[1/2,1]}, X₄ = 1_{[0,1/4]},
X₅ = 1_{[1/4,1/2]}, and so on (as in the proof of Theorem 8.2). The sequence
clearly converges to zero in probability and in L^p, but for each ω ∈ [0, 1],
X_n(ω) = 1 for infinitely many n, so it fails to converge pointwise.
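The two claims can be checked by enumerating the intervals; a sketch, with an indexing convention of our own for the sequence of Example 8.9:

```python
from fractions import Fraction

def typewriter(n):
    # n-th interval of the sequence: [0,1], [0,1/2], [1/2,1],
    # [0,1/4], [1/4,1/2], ...; returned as exact fractions
    level, idx = 0, n
    while idx >= 2 ** level:
        idx -= 2 ** level
        level += 1
    width = Fraction(1, 2 ** level)
    return idx * width, (idx + 1) * width

# interval lengths shrink to 0, so P(X_n = 1) -> 0 (convergence in probability)
lengths = [typewriter(n)[1] - typewriter(n)[0] for n in range(15)]

# ...yet a fixed point, e.g. w = 1/3, lies in one interval of every dyadic
# level, so X_n(w) = 1 infinitely often and pointwise convergence fails
w = Fraction(1, 3)
hits = [n for n in range(63) if typewriter(n)[0] <= w <= typewriter(n)[1]]
```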
Convergence in probability has an additional useful feature:
Proposition 8.10
The function defined by d(X, Y) = 𝔼( |X − Y| / (1 + |X − Y|) ) is a metric, and convergence in
d is equivalent to convergence in probability.

Hint If X_n → X in probability then decompose the expectation into ∫_A + ∫_{Ω\A},
where A = {ω : |X_n(ω) − X(ω)| < ε}.
We now give a basic estimate of the probability that a non-negative random
variable takes values in a given set, by means of the moments of this random
variable.
Theorem 8.11 (Chebyshev's inequality)
If Y is a non-negative random variable, ε > 0 and 0 < p < ∞, then

P(Y ≥ ε) ≤ 𝔼(Yᵖ)/εᵖ.   (8.1)
Proof
This is immediate from basic properties of the integral: let A = {ω : Y(ω) ≥ ε};
then

𝔼(Yᵖ) ≥ ∫_A Yᵖ dP   (integration over a smaller set)
≥ P(A) εᵖ,

since Yᵖ(ω) ≥ εᵖ on A, which gives the result after dividing by εᵖ.
□
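An empirical sanity check of (8.1); the exponential distribution, the sample size and the threshold are arbitrary choices of ours:

```python
import random

random.seed(0)

# Empirical check of Chebyshev's inequality P(Y >= eps) <= E(Y^p)/eps^p
# for Y exponential with rate 1, p = 2, via a plain Monte Carlo sample.
N = 100_000
sample = [random.expovariate(1.0) for _ in range(N)]

p, eps = 2, 3.0
moment = sum(y ** p for y in sample) / N          # estimates E(Y^2) = 2
tail = sum(1 for y in sample if y >= eps) / N     # estimates P(Y >= 3) = e^-3
bound = moment / eps ** p

print(tail, bound)  # the empirical tail and the (much larger) Chebyshev bound
```

The bound is far from tight here, which is typical: Chebyshev trades sharpness for complete generality.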
Chebyshev's inequality will be used mainly with small ε. But let us see what
happens if ε is large.
Proposition 8.12
Assume that 𝔼(Yᵖ) < ∞. Then

εᵖ P(Y ≥ ε) → 0   as ε → ∞.

Hint Write

𝔼(Yᵖ) = ∫_{ω: Y(ω)≥ε} Yᵖ dP + ∫_{ω: Y(ω)<ε} Yᵖ dP

and estimate the first term as in the proof of Chebyshev's inequality.
Corollary 8.13
Let X be a random variable with finite expectation 𝔼(X) = μ and variance σ².
Let 0 < a < ∞; then

P(|X − μ| ≥ aσ) ≤ 1/a².

Proof

Using Chebyshev's inequality with Y = |X − μ|, p = 2 and ε = aσ, we find
that

P(|X − μ| ≥ aσ) ≤ 𝔼(|X − μ|²)/(aσ)² = σ²/(a²σ²) = 1/a²,

as required.
□
Remark 8.14
Chebyshev's inequality also shows that convergence in L^p implies convergence
in probability: let Y_n = |X_n − X| and assume that ‖X_n − X‖_p → 0. Then
P(Y_n ≥ ε) ≤ 𝔼(Y_nᵖ)/εᵖ → 0. The converse is false in general, as the next
example shows.
Example 8.15
Let Ω = [0, 1] with Lebesgue measure and let X_n = n·1_{[0,1/n]}. The sequence
X_n converges to 0 at every ω ∈ (0, 1], so we take X = 0, and we see that

‖X_n − X‖_pᵖ = ∫₀^{1/n} nᵖ dm = n^{p−1}.

If p ≥ 1, then ‖X_n − X‖_p does not converge to 0; however, as we have already
seen (Exercise 8.2),

P(|X_n − X| ≥ ε) = P(X_n = n) = 1/n → 0,

showing that X_n → X in probability.
8.2.2 Weak law of large numbers
The simplest 'law of large numbers' provides an L2-convergence result:
Theorem 8.16
If (X_n) are independent, 𝔼(X_n) = μ and Var(X_n) ≤ K < ∞, then S_n/n → μ in L²
and hence in probability.
Proof
First note that 𝔼(S_n) = 𝔼(X₁) + ⋯ + 𝔼(X_n) = nμ by the linearity of expectation. Hence 𝔼(S_n/n) = μ, and so 𝔼((S_n/n − μ)²) = Var(S_n/n). By the properties of
the variance (Proposition 5.40),

Var(S_n/n) = (1/n²) Var(S_n) = (1/n²)(Var(X₁) + ⋯ + Var(X_n)) ≤ (1/n²)·nK = K/n → 0

as n → ∞. This in turn implies convergence in probability, as we saw in Remark 8.14.
□
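A simulation sketch of the weak law, using die rolls (mean 3.5) as in Exercise 8.5; the sample sizes and the seed are arbitrary choices:

```python
import random

random.seed(1)

# Averages S_n/n of i.i.d. fair die rolls concentrate around the mean 3.5
# as n grows, illustrating Theorem 8.16.
def running_average(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

averages = {n: running_average(n) for n in (10, 1000, 100_000)}
print(averages)
```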
Exercise 8.4
Using Chebyshev's inequality, find a lower bound for the probability that
the average number of 'heads' in 100 tosses of a coin differs from ½ by
less than 0.1.
Exercise 8.5
Find a lower bound for the probability that the average number shown
on a die in 1000 tosses differs from 3.5 by less than 0.1.
We give some classical applications of the weak law of large numbers. The
Weierstrass theorem says that every continuous function can be uniformly approximated by polynomials. The Chebyshev inequality provides an easy proof.
Theorem 8.17 (Bernstein-Weierstrass approximation theorem)
If f : [0, 1] → ℝ is continuous, then the sequence of Bernstein polynomials

f_n(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}

converges to f uniformly.
Proof
The number f_n(x) has a probabilistic meaning: namely, f_n(x) = 𝔼(f(S_n/n)),
where S_n = X₁ + ⋯ + X_n and the X_i are independent with

X_i = 1 with probability x,  X_i = 0 with probability 1 − x.

Writing 𝔼_x instead of 𝔼 to emphasise the fact that the underlying probability depends on x, we have

sup_{x∈[0,1]} |f_n(x) − f(x)| = sup_{x∈[0,1]} |𝔼_x(f(S_n/n)) − f(x)|
≤ sup_{x∈[0,1]} 𝔼_x|f(S_n/n) − f(x)|.

Take any ε > 0 and find δ > 0 such that if |x − y| < δ then |f(x) − f(y)| < ε/2
(this is possible since f is uniformly continuous). Then

𝔼_x|f(S_n/n) − f(x)| = ∫_{ω: |S_n/n − x| < δ} |f(S_n/n) − f(x)| dP + ∫_{ω: |S_n/n − x| ≥ δ} |f(S_n/n) − f(x)| dP
≤ ε/2 + 2 sup_{y∈[0,1]} |f(y)| · P(|S_n/n − x| ≥ δ).

The last term converges to zero by the law of large numbers, since x = 𝔼(S_n/n)
and Var(S_n/n) is finite. This convergence is uniform in x:

P(|S_n/n − x| ≥ δ) ≤ Var(S_n/n)/δ² = x(1 − x)/(nδ²) ≤ 1/(4nδ²)

(due to 4x(1 − x) ≤ 1 and Var(S_n) = n Var(X_i) = nx(1 − x)). So, for sufficiently
large n the right-hand side is less than ε, which completes the proof.
□
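The Bernstein polynomials are easy to compute directly; a sketch, with a test function and grid of our choosing:

```python
from math import comb

def bernstein(f, n, x):
    # n-th Bernstein polynomial of f at x:
    # f_n(x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)   # continuous but not differentiable at 1/2

def sup_error(n, grid=100):
    # uniform distance between f and f_n, approximated on a grid
    return max(abs(bernstein(f, n, i / grid) - f(i / grid))
               for i in range(grid + 1))

print(sup_error(10), sup_error(200))  # the error shrinks as n grows
```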
In many practical situations it is impossible to compute integrals (in particular areas or volumes) directly. The law of large numbers is the basis of the
so-called Monte Carlo method, which gives an approximate solution by random
selection of points. The next two examples illustrate this method.
Example 8.18
We restrict ourselves to F ⊂ [0, 1] × [0, 1] for simplicity. Assume that F is
Lebesgue-measurable; our goal is to find its measure. Let X_n, Y_n be independent, uniformly distributed on [0, 1]. Let M_n = (1/n) Σ_{k=1}^n 1_F(X_k, Y_k). If we
draw the pairs of numbers n times, then this sum gives the proportion of hits
of the set F, and M_n → m(F) in probability. To see this, first observe that
𝔼(1_F(X_k, Y_k)) = P((X_k, Y_k) ∈ F) = m(F) by the assumption on the distribution of X_k, Y_k: independence guarantees that the distribution of the pair is
two-dimensional Lebesgue measure restricted to the square. Then, since
Var(1_F(X_k, Y_k)) = m(F)(1 − m(F)) ≤ m(F), Chebyshev's inequality gives

P(|M_n − m(F)| ≥ ε) ≤ m(F)/(nε²) → 0.
A similar example illustrates the use of the Monte Carlo method for computing integrals.
Example 8.19
Let f be an integrable function defined on [0, 1]. With (X_n) independent, uniformly distributed on [0, 1], we take

I_n = (1/n) Σ_{k=1}^n f(X_k)

and we show that

I_n → ∫₀¹ f(x) dx

in probability. First note that the distribution of X_k is Lebesgue measure on
[0, 1], hence

𝔼(f(X_k)) = ∫ f(x) dP_{X_k}(x) = ∫₀¹ f(x) dx,

and so 𝔼(I_n) = ∫₀¹ f(x) dx. The weak law provides the desired convergence:

P(|I_n − ∫₀¹ f(x) dx| ≥ ε) → 0.
We return to further weak laws of large numbers for sequences
of random variables. The assumption that 𝔼(X_n) and Var(X_n) are finite can
be relaxed. There is, however, a price to pay in the form of additional conditions.

First we introduce some convenient notation. For a given sequence of identically distributed random variables (X_n), introduce the truncated random variables

X_n(L) = X_n · 1_{ω: |X_n(ω)| ≤ L}

and set μ_L = 𝔼(X₁(L)). Note that μ_L = 𝔼(X_n(L)) for all n ≥ 1, since the
distributions of all X_n are the same. Also write

S_n(L) = X₁(L) + ⋯ + X_n(L)

for all n ≥ 1.
Theorem 8.20
If (X_n) are independent identically distributed random variables such that

a·P(|X₁| > a) → 0  as a → ∞,   (8.2)

then

S_n/n − μ_n → 0  in probability.   (8.3)
We shall need the following lemma, which is of interest in itself and will be
useful in what follows.
Lemma 8.21
If Y ≥ 0, Y ∈ L^p, 0 < p < ∞, then

𝔼(Yᵖ) = ∫₀^∞ p y^{p−1} P(Y > y) dy.

In particular (p = 1),

𝔼(Y) = ∫₀^∞ P(Y > y) dy.
Proof of the lemma
This is a simple application of Fubini's theorem:

∫₀^∞ p y^{p−1} P(Y > y) dy = ∫₀^∞ ∫_Ω p y^{p−1} 1_{ω: Y(ω)>y}(ω) dP(ω) dy
= ∫_Ω ∫₀^∞ p y^{p−1} 1_{ω: Y(ω)>y}(ω) dy dP(ω)   (by Fubini)
= ∫_Ω ∫₀^{Y(ω)} p y^{p−1} dy dP(ω)
= ∫_Ω Yᵖ(ω) dP(ω)   (computing the inner integral, ω fixed)
= 𝔼(Yᵖ),

as required.
□
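The identity is easy to check numerically for a concrete distribution; a sketch for Y uniform on [0, 1] and p = 2, where P(Y > y) = 1 − y and 𝔼(Y²) = 1/3 (the Riemann-sum step is an arbitrary choice):

```python
# Check E(Y^p) = integral of p*y^(p-1)*P(Y > y) over [0, infinity)
# for Y uniform on [0,1], p = 2: the integrand vanishes for y > 1.
p = 2
step = 1e-4
integral = sum(p * (k * step) ** (p - 1) * (1 - k * step) * step
               for k in range(int(1 / step)))
print(integral)  # should be close to 1/3
```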
Proof of the theorem
Take ε > 0; obviously

P(|S_n/n − μ_n| ≥ ε) ≤ P(|S_n(n)/n − μ_n| ≥ ε) + P(S_n(n) ≠ S_n).

We estimate the first term:

P(|S_n(n)/n − μ_n| ≥ ε) ≤ 𝔼(|S_n(n)/n − μ_n|²)/ε²   (by Chebyshev)
= 𝔼(|Σ_{k=1}^n X_k(n) − nμ_n|²)/(n²ε²)
= Var(Σ_{k=1}^n X_k(n))/(n²ε²).

Note that the truncated random variables are independent, being functions of
the original ones, hence we may continue the estimation:

P(|S_n(n)/n − μ_n| ≥ ε) ≤ Σ_{k=1}^n Var(X_k(n))/(n²ε²)
= Var(X₁(n))/(nε²)   (as the Var(X_k(n)) are the same)
≤ 𝔼(X₁²(n))/(nε²)   (by Var(Z) = 𝔼(Z²) − (𝔼Z)² ≤ 𝔼(Z²))
= (1/(nε²)) ∫₀^∞ 2y P(|X₁(n)| > y) dy   (by the lemma for p = 2)
≤ (1/(nε²)) ∫₀^n 2y P(|X₁| > y) dy.
The function y ↦ 2y P(|X₁| > y) converges to 0 as y → ∞ by hypothesis,
hence for given η > 0 there is y₀ such that for y ≥ y₀ this quantity is less than
½ηε², and we have

(1/(nε²)) ∫₀^n 2y P(|X₁| > y) dy = (1/(nε²)) ∫₀^{y₀} 2y P(|X₁| > y) dy + (1/(nε²)) ∫_{y₀}^n 2y P(|X₁| > y) dy
≤ (1/(nε²)) · 2y₀ · max_{y≥0} {y P(|X₁| > y)} + (1/(nε²)) · n · ½ηε²
≤ η,

provided n is sufficiently large. So the first term converges to 0. Now we tackle
the second term:
P(S_n(n) ≠ S_n) ≤ Σ_{k=1}^n P(X_k(n) ≠ X_k)   (by subadditivity of P)
= n P(X₁(n) ≠ X₁)   (the same distributions)
= n P(|X₁| > n) → 0 by hypothesis.

This completes the proof.
□
Remark 8.22
Note that we cannot generalise the last theorem to the case of merely uncorrelated
random variables, since we made essential use of independence. Although
the identity

Var(Σ_{k=1}^n X_k(n)) = Σ_{k=1}^n Var(X_k(n))

holds for uncorrelated random variables, we needed the independence of the (X_k)
(which implies that of the (X_k(n))) to conclude that the truncated random
variables are uncorrelated.
Theorem 8.23
If (X_n) are independent and identically distributed with 𝔼(|X₁|) < ∞, then (8.2) is
satisfied, μ_n → μ = 𝔼(X₁), and S_n/n → μ in probability. (Note that we do not
assume here that X₁ has finite variance.)
Proof
The finite expectation of |X₁| gives condition (8.2):

a P(|X₁| > a) = a ∫_Ω 1_{ω: |X₁(ω)|>a} dP
≤ ∫_Ω |X₁| 1_{ω: |X₁(ω)|>a} dP
= ∫_{ω: |X₁(ω)|>a} |X₁| dP → 0  as a → ∞

by the dominated convergence theorem. Hence S_n/n − μ_n → 0, but
μ_n = 𝔼(X₁ 1_{ω: |X₁(ω)|≤n}) → 𝔼(X₁) as n → ∞, so the result follows.
□
8.2.3 The Borel-Cantelli lemmas
The idea that a point ω ∈ Ω belongs to 'infinitely many' events of a given
sequence (A_n) of elements of 𝓕 can easily be made precise: for every n ≥ 1
we need to be able to find an m_n ≥ n such that ω ∈ A_{m_n}. This identifies
a subsequence (m_n) of indices such that for each n ≥ 1, ω ∈ A_{m_n}, i.e. ω ∈
⋃_{m≥n} A_m for every n ≥ 1. Thus we say that ω ∈ A_n infinitely often, and write
ω ∈ A_n i.o., if ω ∈ ⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m. We call this set the upper limit of the
sequence (A_n) and write it as

lim sup_{n→∞} A_n = ⋂_{n=1}^∞ ⋃_{m=n}^∞ A_m.
Exercise 8.6
Find lim sup_{n→∞} A_n for the sequence A₁ = [0, 1], A₂ = [0, ½], A₃ = [½, 1],
A₄ = [0, ¼], A₅ = [¼, ½], etc.
Given ε > 0, a sequence of random variables (X_n) and a random variable X,
for each n set A_n = {|X_n − X| > ε}. Then ω ∈ A_n i.o. precisely when
|X_n(ω) − X(ω)| > ε occurs for all elements of an infinite subsequence
(m_n) of indices, and if this happens for some ε > 0 then (X_n(ω)) fails to converge
to X(ω). Hence it follows that

X_n → X a.s. (P) ⟺ ∀ε > 0, P(lim sup_{n→∞} A_n) = P(|X_n − X| > ε i.o.) = 0.
Similarly, define

lim inf_{n→∞} A_n = ⋃_{n=1}^∞ ⋂_{m=n}^∞ A_m.

(We say that this set is the lower limit of the sequence (A_n).)
Proposition 8.24
(i) We have ω ∈ lim inf_{n→∞} A_n if and only if ω ∈ A_n for all except finitely many
n. (We say that ω ∈ A_n eventually.)

(ii) P(X_n → X) = lim_{ε→0} P(|X_n − X| < ε eventually).

(iii) If A = lim sup_{n→∞} A_n then Aᶜ = lim inf_{n→∞} A_nᶜ.

(iv) For any sequence (A_n) of events, P({ω ∈ A_n eventually}) ≤ P({ω ∈ A_n i.o.}).

Hint Use Fatou's lemma on the indicator functions of the sets in (iv).
The sets lim inf_{n→∞} A_n and lim sup_{n→∞} A_n are 'tail events' of the sequence
(A_n): we can only determine whether a point belongs to them by knowing the
whole sequence. It is frequently true that the probability of a tail event is either
0 or 1; such results are known as 0-1 laws. The simplest of these is provided
by combining the two Borel-Cantelli lemmas, to which we now turn: together
they show that for a sequence of independent events (A_n), lim sup_{n→∞} A_n has
either probability 0 or 1, depending on whether the series of the individual probabilities converges or diverges. In the first case we do not even need
independence, but can prove the result in general.
Theorem 8.25 (Borel-Cantelli lemma)
If

Σ_{n=1}^∞ P(A_n) < ∞,

then

P(lim sup_{n→∞} A_n) = 0,

i.e. 'ω ∈ A_n for infinitely many n' occurs only with probability zero.
Proof
First note that lim sup_{n→∞} A_n ⊂ ⋃_{n=k}^∞ A_n for all k, hence

P(lim sup_{n→∞} A_n) ≤ P(⋃_{n=k}^∞ A_n) ≤ Σ_{n=k}^∞ P(A_n)   (by subadditivity)
→ 0   as k → ∞,

since the tail of a convergent series converges to 0.
□
The basic application of the lemma provides a link between almost sure
convergence and convergence in probability.
Theorem 8.26
If Xn -t X in probability then there is a subsequence Xkn converging to X
almost surely.
Proof
We have to find a set of full measure on which a subsequence converges,
so the set on which the behaviour of the whole sequence is 'bad' should be
of measure zero. For this we employ the Borel-Cantelli lemma, whose conclusion is precisely that. We introduce sets A_n encapsulating the 'bad'
behaviour of the X_n, which from the point of view of convergence is expressed by inequalities of the type |X_n(ω) − X(ω)| > a. Specifically, we take a = 1, and since
P(|X_n − X| > 1) → 0 we find k₁ such that

P(|X_n − X| > 1) < 1  for n ≥ k₁.

Next, for a = ½ we find k₂ > k₁ such that for all n ≥ k₂,

P(|X_n − X| > ½) < ¼.

We continue this process, obtaining an increasing sequence of integers k_n with

P(|X_{k_n} − X| > 1/n) < 1/n².

The series Σ_{n=1}^∞ P(A_n) converges, where A_n = {ω : |X_{k_n}(ω) − X(ω)| > 1/n},
hence A = lim sup_{n→∞} A_n has probability zero.

We observe that for ω ∈ Ω \ A, lim_{n→∞} X_{k_n}(ω) = X(ω). For, if ω ∈ Ω \ A,
then for some k, ω ∈ ⋂_{n=k}^∞ (Ω \ A_n), so for all n ≥ k, |X_{k_n}(ω) − X(ω)| ≤ 1/n,
hence we have obtained the desired convergence.
□
The second Borel-Cantelli lemma partially completes the picture. Under
the additional condition of independence it shows when the probability that
infinitely many events occur is one.
Theorem 8.27
Suppose that the events (A_n) are independent. Then

Σ_{n=1}^∞ P(A_n) = ∞  ⟹  P(lim sup_{n→∞} A_n) = 1.
Proof
It is sufficient to show that for all k

P(⋃_{n=k}^∞ A_n) = 1,

since then the intersection over k will also have probability 1. Fix k and consider
the partial union up to m > k. The complements of the A_n are also independent,
hence

P(⋂_{n=k}^m A_nᶜ) = Π_{n=k}^m P(A_nᶜ) = Π_{n=k}^m (1 − P(A_n)).

Since 1 − x ≤ e^{−x},

Π_{n=k}^m (1 − P(A_n)) ≤ Π_{n=k}^m e^{−P(A_n)} = exp(−Σ_{n=k}^m P(A_n)).

The last expression converges to 0 as m → ∞ by the hypothesis, hence

P(⋂_{n=k}^m A_nᶜ) → 0,

but

P(⋂_{n=k}^m A_nᶜ) = P(Ω \ ⋃_{n=k}^m A_n) = 1 − P(⋃_{n=k}^m A_n).

The sets B_m = ⋃_{n=k}^m A_n form an increasing chain with ⋃_{m=k}^∞ B_m = ⋃_{n=k}^∞ A_n,
and so P(B_m), which as we know converges to 1, converges to P(⋃_{n=k}^∞ A_n).
Thus this quantity is also equal to 1.
□
D
Exercise 8.7
Let B_n = X₁ + X₂ + ⋯ + X_n describe the position after n steps of
a symmetric random walk on ℤᵈ. Using the asymptotic formula n! ~
(n/e)ⁿ √(2πn) and the Borel-Cantelli lemmas, show that the probability of
{B_n = 0 i.o.} is 1 when d = 1, 2 and 0 for d > 2.
Below we discuss strong laws of large numbers, where convergence in probability is strengthened to almost sure convergence. But already we can observe
some limitations of these improvements. Drawing on the second Borel-Cantelli
lemma, we give a negative result.
Theorem 8.28
Suppose that X₁, X₂, … are independent identically distributed random variables and assume that 𝔼(|X₁|) = ∞ (hence also 𝔼(|X_n|) = ∞ for all n). Then

(i) P({ω : |X_n(ω)| ≥ n for infinitely many n}) = 1,

(ii) P(lim_{n→∞} S_n/n exists and is finite) = 0.
Proof
(i) First,

𝔼(|X₁|) = ∫₀^∞ P(|X₁| > x) dx   (by Lemma 8.21)
= Σ_{k=0}^∞ ∫_k^{k+1} P(|X₁| > x) dx   (countable additivity)
≤ Σ_{k=0}^∞ P(|X₁| > k),

because the function x ↦ P(|X₁| > x) reaches its maximum on [k, k+1] at
x = k, since {ω : |X₁(ω)| > k} ⊃ {ω : |X₁(ω)| > x} if x ≥ k. By the hypothesis
this series is divergent, but P(|X₁| > k) = P(|X_k| > k) as the distributions are
identical, so

Σ_{k=0}^∞ P(|X_k| > k) = ∞.

The second Borel-Cantelli lemma is applicable, yielding the claim.
(ii) Denote by A the set where the limit of S_n/n exists (and is finite). Some
elementary algebra of fractions gives

S_n/n − S_{n+1}/(n+1) = [(n+1)S_n − nS_{n+1}] / [n(n+1)]
= [S_n − nX_{n+1}] / [n(n+1)]
= S_n/[n(n+1)] − X_{n+1}/(n+1).

For any ω₀ ∈ A the left-hand side converges to zero, and also S_n(ω₀)/[n(n+1)] → 0.
Hence X_{n+1}(ω₀)/(n+1) → 0 as well. This means that

ω₀ ∉ {ω : |X_k(ω)| ≥ k for infinitely many k} = B,

say, so A ⊂ Ω \ B. But P(B) = 1 by (i), hence P(A) = 0.
□
8.2.4 Strong law of large numbers
We shall consider several versions of the strong law of large numbers, first by
imposing additional conditions on the moments of the sequence (Xn ), and then
by gradually relaxing these until we arrive at Theorem 8.32, which provides the
most general positive result.
The first result is due to von Neumann. Note that we do not impose the
condition that the Xn have identical distributions. The price we pay is having
to assume that higher-order moments are finite. However, for many familiar
random variables, Gaussian for example, this is not a serious restriction.
Theorem 8.29
Suppose that the random variables X_n are independent, 𝔼(X_n) = μ and
𝔼(X_n⁴) ≤ K. Then

S_n/n = (1/n) Σ_{k=1}^n X_k → μ  a.s.
Proof
By considering X_n − μ we may assume that 𝔼(X_n) = 0 for all n. This simplifies
the following computation:

𝔼(S_n⁴) = 𝔼((Σ_{k=1}^n X_k)⁴)
= 𝔼( Σ_{k=1}^n X_k⁴ + Σ_{i≠j} X_i² X_j² + Σ_{i≠j} X_i X_j X_k² + Σ_{i,j,k,l distinct} X_i X_j X_k X_l ).

The last two terms vanish by independence:

𝔼(Σ_{i≠j} X_i X_j X_k²) = Σ_{i≠j} 𝔼(X_i X_j X_k²) = Σ_{i≠j} 𝔼(X_i)𝔼(X_j)𝔼(X_k²) = 0,

and similarly for the term with all indices distinct:

𝔼(Σ X_i X_j X_k X_l) = Σ 𝔼(X_i X_j X_k X_l) = Σ 𝔼(X_i)𝔼(X_j)𝔼(X_k)𝔼(X_l) = 0.

The first term is easily estimated by the hypothesis: 𝔼(Σ_{k=1}^n X_k⁴) ≤ nK.
To the remaining term we first apply the Schwarz inequality:

𝔼(Σ_{i≠j} X_i² X_j²) = Σ_{i≠j} 𝔼(X_i² X_j²) ≤ Σ_{i≠j} √𝔼(X_i⁴) √𝔼(X_j⁴) ≤ NK,

where N is the number of components of this kind. (We could do better by employing independence, but then we would have to estimate the second moments
by the fourth and it would boil down to the same.)

To find N, first note that a pair of two distinct indices can be chosen in
(n choose 2) = n(n−1)/2 ways. Having fixed i, j, the term X_i² X_j² arises in 6 ways, corresponding to the possible arrangements of 2 pairs of 2 indices: (i,i,j,j), (i,j,i,j),
(i,j,j,i), (j,j,i,i), (j,i,j,i), (j,i,i,j). So N = 3n(n−1), and we have

𝔼(S_n⁴) ≤ K(n + 3n(n−1)) = K(n + 3n² − 3n) ≤ 3Kn².

By Chebyshev's inequality,

P(|S_n/n| ≥ ε) = P(|S_n| ≥ nε) ≤ 𝔼(S_n⁴)/(nε)⁴ ≤ (3K/ε⁴) · (1/n²).

The series Σ 1/n² converges, and by Borel-Cantelli the set lim sup A_n, with A_n =
{ω : |S_n/n| ≥ ε}, has measure zero. Its complement is the set of full measure we
need, on which the sequence S_n/n converges to 0. To see this, let ω ∉ lim sup A_n,
which means that ω is in only finitely many A_n. So for a certain n₀ and all n ≥ n₀,
ω ∉ A_n, i.e. |S_n(ω)/n| < ε, and this is precisely what was needed
for the convergence in question.
□
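The combinatorial count N = 3n(n−1) (six arrangements per unordered pair) can be verified by brute force; a sketch:

```python
from itertools import product

# Count quadruples (i,j,k,l) in {0,...,n-1}^4 whose indices take exactly
# two distinct values, each appearing twice; these are the quadruples
# contributing X_i^2 X_j^2 terms to E(S_n^4).
def count_paired_quadruples(n):
    count = 0
    for q in product(range(n), repeat=4):
        distinct = set(q)
        if len(distinct) == 2 and all(q.count(v) == 2 for v in distinct):
            count += 1
    return count

print([count_paired_quadruples(n) for n in range(2, 6)])
```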
The next law will only require finite moments of order 2, not necessarily
uniformly bounded.

We precede it by an auxiliary but crucial inequality due to Kolmogorov.
It gives a better estimate than does the Chebyshev inequality. The latter says
that

P(|S_n| ≥ ε) ≤ Var(S_n)/ε².

In the theorem below the left-hand side is larger, hence the result is stronger.
Theorem 8.30
If Xl"'" Xn are independent with 0 expectation and finite variances, then for
any E > 0,
Var(Sn)
P( max ISkl 2: E) ~
2
'
E
l~k~n
where Sn
+ ... + X n .
=
Xl
Proof

We fix ε > 0 and describe the first instance at which |S_k| exceeds ε. Namely, we
write

φ_k = 1 if |S₁| < ε, …, |S_{k−1}| < ε and |S_k| ≥ ε;  φ_k = 0 otherwise.

For any ω at most one of the numbers φ_k(ω) may be 1, the remaining ones
being 0, hence their sum is either 0 or 1. Clearly

Σ_{k=1}^n φ_k = 0  ⟺  max_{1≤k≤n} |S_k| < ε,
Σ_{k=1}^n φ_k = 1  ⟺  max_{1≤k≤n} |S_k| ≥ ε.

Hence

P(max_{1≤k≤n} |S_k| ≥ ε) = P(Σ_{k=1}^n φ_k = 1) = 𝔼(Σ_{k=1}^n φ_k),

since the expectation is the integral of an indicator function. So it remains to show that

ε² 𝔼(Σ_{k=1}^n φ_k) ≤ 𝔼(S_n²) = Var(S_n),

the last equality because 𝔼(S_n) = 0. We estimate 𝔼(S_n²) from below:

𝔼(S_n²) ≥ 𝔼(S_n² Σ_{k=1}^n φ_k) = 𝔼(Σ_{k=1}^n [S_k + (S_n − S_k)]² φ_k)
≥ 𝔼(Σ_{k=1}^n [S_k² + 2S_k(S_n − S_k)] φ_k)   (non-negative term deleted)
= 𝔼(Σ_{k=1}^n S_k² φ_k) + 2𝔼(Σ_{k=1}^n S_k(S_n − S_k) φ_k).

We show that the last expectation is equal to 0. Observe that φ_k is a function of the random variables X₁, …, X_k, so it is independent of X_{k+1}, …, X_n, and
S_k is independent of X_{k+1}, …, X_n for the same reason. We compute one
component of this last sum:

𝔼(S_k(S_n − S_k)φ_k) = 𝔼(S_k φ_k (Σ_{i=k+1}^n X_i))   (by the definition of S_n)
= 𝔼(S_k φ_k) 𝔼(Σ_{i=k+1}^n X_i)   (by independence)
= 0   (since 𝔼(X_i) = 0 for all i).

In the remaining sum note that for each k ≤ n, φ_k S_k² ≥ φ_k ε² (this is 0 ≥ 0 if
φ_k = 0 and S_k² ≥ ε² if φ_k = 1, both true), hence

𝔼(Σ_{k=1}^n S_k² φ_k) ≥ 𝔼(ε² Σ_{k=1}^n φ_k) = ε² 𝔼(Σ_{k=1}^n φ_k),

which gives the desired inequality.
□
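An empirical sketch of the inequality for fair ±1 steps, a choice of ours (here Var(S_n) = n, so the bound is n/ε²):

```python
import random

random.seed(5)

# Empirical check of Kolmogorov's inequality for X_i = +/-1 fair steps:
# P(max_k |S_k| >= eps) <= Var(S_n)/eps^2 = n/eps^2.
def max_partial_sum(n):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    return m

n, eps, trials = 100, 25, 2000
freq = sum(1 for _ in range(trials) if max_partial_sum(n) >= eps) / trials
bound = n / eps ** 2
print(freq, bound)
```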
Theorem 8.31
Suppose that X₁, X₂, … are independent with 𝔼(X_n) = 0 and

Σ_{n=1}^∞ (1/n²) Var(X_n) < ∞.

Then

S_n/n → 0  almost surely.
Proof
We introduce the auxiliary random variables

Y_m = max_{1≤k≤2^m} |S_k|.

It is sufficient to show that Y_m/2^m → 0 almost surely (for 2^{m−1} < n ≤ 2^m we
have |S_n|/n ≤ 2Y_m/2^m), and by Lemma 8.7 it is
sufficient to show that for each ε > 0,

Σ_{m=1}^∞ P(|Y_m/2^m| ≥ ε) < ∞.

First take a single term P(|Y_m| ≥ 2^m ε) and estimate it by Kolmogorov's inequality (Theorem 8.30):

P(|Y_m| ≥ 2^m ε) ≤ Var(S_{2^m})/(4^m ε²).

The problem reduces to showing that

Σ_{m=1}^∞ Var(S_{2^m})/4^m < ∞.

We rearrange the components:

Σ_{m=1}^∞ (1/4^m) Var(S_{2^m}) = Σ_{m=1}^∞ (1/4^m) Σ_{k=1}^{2^m} Var(X_k)
= Var(X₁) Σ_{m=1}^∞ 1/4^m + Var(X₂) Σ_{m=1}^∞ 1/4^m
+ Var(X₃) Σ_{m=2}^∞ 1/4^m + Var(X₄) Σ_{m=2}^∞ 1/4^m
+ Var(X₅) Σ_{m=3}^∞ 1/4^m + ⋯ + Var(X₈) Σ_{m=3}^∞ 1/4^m + ⋯,

since Var(X₁), Var(X₂) appear in all components of the series in m (1, 2 ≤
2¹), Var(X₃), Var(X₄) appear in all except the first one (2¹ < 3, 4 ≤ 2²),
Var(X₅), …, Var(X₈) appear in all except the first two (2² < 5, 6, 7, 8 ≤ 2³),
and so on. We arrive at the series

Σ_{k=1}^∞ Var(X_k) Σ_{m: 2^m ≥ k} 1/4^m.

The inner sum is a geometric series with ratio ¼ and first term 1/4^j, where j is the least
integer such that 2^j ≥ k. Since 4^j = (2^j)² ≥ k², we get

Σ_{m: 2^m ≥ k} 1/4^m = (1/4^j) · 1/(1 − ¼) = (4/3)(1/4^j) ≤ (4/3)(1/k²),

and by the hypothesis

Σ_{k=1}^∞ Var(X_k) (4/3)(1/k²) < ∞,

which completes the proof.
□
Finally, we relax the conditions on moments even further; simultaneously
we need to impose the assumption that the random variables are identically
distributed.
Theorem 8.32
Suppose that $X_1, X_2, \dots$ are independent, identically distributed, with $\mathbb{E}(X_1) = \mu < \infty$. Then
\[
\frac{S_n}{n} \to \mu \quad \text{almost surely.}
\]
Proof
The idea is to use the previous theorem, where we needed finite variances. Since we do not have that here, we truncate $X_n$:
\[
Y_n = X_n\,\mathbf{1}_{\{|X_n| \le n\}}.
\]
The truncated random variables have finite variances since each is bounded: $|Y_n| \le n$ (the right-hand side is forced to be zero if $|X_n|$ crosses the level $n$). The new variables differ from the original ones if $|X_n| > n$. This, however, cannot happen too often, as the following argument shows. First,
\[
\sum_{n=1}^{\infty} P(X_n \neq Y_n) = \sum_{n=1}^{\infty} P(|X_n| > n)
\]
\[
= \sum_{n=1}^{\infty} P(|X_1| > n) \quad \text{(the distributions being the same)}
\]
\[
\le \sum_{n=1}^{\infty}\int_{n-1}^{n} P(|X_1| > x)\,\mathrm{d}x \quad \text{(as $P(|X_1| > x) \ge P(|X_1| > n)$ for $x \le n$)}
\]
\[
= \int_0^{\infty} P(|X_1| > x)\,\mathrm{d}x = \mathbb{E}(|X_1|) \quad \text{(by Lemma 8.21)}
\]
\[
< \infty.
\]
So by Borel-Cantelli, with probability one only finitely many of the events $X_n \neq Y_n$ happen; in other words, there is a set $\Omega'$ with $P(\Omega') = 1$ such that for $\omega \in \Omega'$, $X_n(\omega) = Y_n(\omega)$ for all except finitely many $n$. So, on $\Omega'$, if $\frac{Y_1 + \cdots + Y_n}{n}$ converges to some limit, the same holds for $\frac{S_n}{n}$.
To use the previous theorem we have to show the convergence of the series
\[
\sum_{n=1}^{\infty}\frac{\mathrm{Var}(Y_n)}{n^2},
\]
but since $\mathrm{Var}(Y_n) = \mathbb{E}(Y_n^2) - (\mathbb{E}Y_n)^2 \le \mathbb{E}(Y_n^2)$ it is sufficient to show the convergence of
\[
\sum_{n=1}^{\infty}\frac{\mathbb{E}(Y_n^2)}{n^2}.
\]
To this end, note first:
\[
\mathbb{E}(Y_n^2) = \int_0^{\infty} 2x\,P(|Y_n| > x)\,\mathrm{d}x \quad \text{(by Lemma 8.21)}
\]
\[
= \int_0^{n} 2x\,P(|Y_n| > x)\,\mathrm{d}x \quad \text{(since $P(|Y_n| > n) = 0$)}
\]
\[
\le \int_0^{n} 2x\,P(|X_n| > x)\,\mathrm{d}x \quad \text{(on $\{|Y_n| > x\}$ we have $Y_n = X_n$)}
\]
\[
= \int_0^{n} 2x\,P(|X_1| > x)\,\mathrm{d}x \quad \text{(identical distributions).}
\]
Next,
\[
\sum_{n=1}^{\infty}\frac{1}{n^2}\int_0^n 2x\,P(|X_1| > x)\,\mathrm{d}x = \sum_{n=1}^{\infty}\frac{1}{n^2}\int_0^{\infty} 2x\,\mathbf{1}_{[0,n)}(x)\,P(|X_1| > x)\,\mathrm{d}x = \int_0^{\infty} 2\Big(\sum_{n=1}^{\infty}\frac{1}{n^2}\,x\,\mathbf{1}_{[0,n)}(x)\Big)P(|X_1| > x)\,\mathrm{d}x.
\]
We examine the function $x \mapsto \sum_{n=1}^{\infty}\frac{1}{n^2}\,x\,\mathbf{1}_{[0,n)}(x)$.

If $0 \le x \le 1$ then none of the terms in the series is killed and
\[
\sum_{n=1}^{\infty}\frac{1}{n^2}\,x\,\mathbf{1}_{[0,n)}(x) = x\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}x \le \frac{\pi^2}{6},
\]
as is well known.

If $x > 1$, then we have the sum only over $n > x$. Let $m = \mathrm{Int}(x)$ (the integer part of $x$). We have
\[
\sum_{n > x}\frac{x}{n^2} \le x\sum_{n=m+1}^{\infty}\frac{1}{n^2} \le x\int_m^{\infty}\frac{\mathrm{d}t}{t^2} = \frac{x}{m} < \frac{m+1}{m} \le 2.
\]
In each case the function in question is bounded by 2, so
\[
\sum_{n=1}^{\infty}\frac{\mathbb{E}(Y_n^2)}{n^2} \le \int_0^{\infty} 4\,P(|X_1| > x)\,\mathrm{d}x = 4\,\mathbb{E}(|X_1|) < \infty.
\]
Consider $Y_n - \mathbb{E}(Y_n)$ and apply the previous theorem to get
\[
\frac{1}{n}\sum_{k=1}^{n}\big(Y_k - \mathbb{E}(Y_k)\big) \to 0 \quad \text{almost surely.}
\]
We have
\[
\mathbb{E}(Y_k) = \mathbb{E}\big(X_k\mathbf{1}_{\{\omega:|X_k(\omega)| \le k\}}\big) = \mathbb{E}\big(X_1\mathbf{1}_{\{\omega:|X_1(\omega)| \le k\}}\big) \to \mu,
\]
since $X_1\mathbf{1}_{\{\omega:|X_1(\omega)| \le k\}} \to X_1$, the sequence being dominated by $|X_1|$, which is integrable. So we have by the triangle inequality
\[
\Big|\frac{1}{n}\sum_{k=1}^{n} Y_k - \mu\Big| \le \Big|\frac{1}{n}\sum_{k=1}^{n}\big(Y_k - \mathbb{E}(Y_k)\big)\Big| + \frac{1}{n}\sum_{k=1}^{n}\big|\mathbb{E}(Y_k) - \mu\big| \to 0.
\]
As observed earlier, this implies almost sure convergence of $\frac{S_n}{n}$. □
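Theorem 8.32 asks only for a finite first moment, so it covers heavy-tailed cases out of reach of Theorem 8.31. A numerical sketch under our own assumptions (the Pareto exponent, sample sizes and seed are arbitrary choices, not from the text): for exponent $\alpha = 1.5$ the mean is $\frac{\alpha}{\alpha-1} = 3$ but the variance is infinite.

```python
import random

def pareto_sample_mean(n, alpha=1.5, seed=1):
    """Sample mean of n iid Pareto(alpha) variables, generated by inverse
    transform: X = (1-U)^(-1/alpha) with U uniform on [0,1), so X >= 1.
    For alpha = 1.5 the mean is 3 while Var(X) is infinite, so
    Theorem 8.31 does not apply but Theorem 8.32 does."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.random()
        total += (1.0 - u) ** (-1.0 / alpha)
    return total / n

for n in (1_000, 100_000):
    print(n, pareto_sample_mean(n))
```

The running means approach 3, though more slowly and less smoothly than in the finite-variance case, which is exactly what the weaker hypothesis permits.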
8.2.5 Weak convergence
In order to derive central limit theorems we first need to investigate the convergence of the distributions of the sequence (Xn) of random variables.
Definition 8.33
A sequence $P_n$ of Borel probability measures on $\mathbb{R}^n$ converges weakly to $P$ if and only if their cumulative distribution functions $F_n$ converge to the distribution function $F$ of $P$ at all points where $F$ is continuous. If $P_n = P_{X_n}$, $P = P_X$ are the distributions of some random variables, then we say that the sequence $(X_n)$ converges weakly to $X$.
The name 'weak' is justified, since this convergence is implied by the weakest
we have come across so far, i.e. convergence in probability.
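To see why the definition excludes discontinuity points of $F$, consider the standard example (our illustration, not from the text) of point masses $\delta_{1/n} \to \delta_0$: the distribution functions converge everywhere except at the single discontinuity $x = 0$.

```python
def F_n(x, n):
    """CDF of the point mass at 1/n."""
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    """CDF of the point mass at 0 (the weak limit)."""
    return 1.0 if x >= 0.0 else 0.0

# At a continuity point of F (any x != 0) the values converge for large n:
print([F_n(0.5, n) for n in (1, 2, 10)], "->", F(0.5))
print([F_n(-0.5, n) for n in (1, 2, 10)], "->", F(-0.5))
# At the discontinuity x = 0 they do not: F_n(0) = 0 for every n, F(0) = 1.
print([F_n(0.0, n) for n in (1, 2, 10)], "vs", F(0.0))
```

If convergence were demanded at every point, $\delta_{1/n}$ would fail to converge to $\delta_0$, which would be an unreasonable notion of convergence of distributions.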
Theorem 8.34
If Xn converge in probability to X, then the distributions of Xn converge
weakly.
Proof
Let $F(y) = P(X \le y)$. Fix $y$, a continuity point of $F$, and $\varepsilon > 0$. The goal is to obtain
\[
F(y) - \varepsilon < F_n(y) < F(y) + \varepsilon
\]
for sufficiently large $n$. By continuity of $F$ we can find $\delta > 0$ such that
\[
F(y) - \frac{\varepsilon}{2} < P(X \le y - \delta), \qquad P(X \le y + \delta) < F(y) + \frac{\varepsilon}{2}.
\]
By convergence in probability,
\[
P(|X_n - X| > \delta) < \frac{\varepsilon}{2}.
\]
Clearly, if $X_n \le y$ and $|X_n - X| < \delta$, then $X < y + \delta$, so
\[
P\big((X_n \le y) \cap (|X_n - X| < \delta)\big) \le P(X < y + \delta).
\]
We can estimate the left-hand side from below:
\[
P(X_n \le y) - \frac{\varepsilon}{2} < P\big((X_n \le y) \cap (|X_n - X| < \delta)\big).
\]
Putting all these together we get
\[
P(X_n \le y) < P(X \le y) + \varepsilon,
\]
and letting $\varepsilon \to 0$ we have achieved half of the goal. The other half is obtained similarly. □
However, it turns out that weak convergence in a certain sense implies
convergence almost surely. What we mean by 'in a certain sense' is explained
in the next theorem: it also gives a central role to the Borel sets and Lebesgue
measure in [0,1].
Theorem 8.35 (Skorokhod representation theorem)
If $P_n$ converge weakly to $P$, then there exist random variables $X_n$, $X$ defined on the probability space $([0,1], \mathcal{B}, m_{[0,1]})$ such that $P_{X_n} = P_n$, $P_X = P$ and $X_n \to X$ a.s.
Proof
Take $X_n^+$, $X_n^-$, $X^+$, $X^-$ corresponding to $F_n$, $F$, the distribution functions of $P_n$, $P$, as in Theorem 4.45. We have shown there that $F_{X^+} = F_{X^-} = F$, which implies $P(X^+ = X^-) = 1$. Fix an $\omega$ such that $X^+(\omega) = X^-(\omega)$. Let $y$ be a continuity point of $F$ such that $y > X^+(\omega)$. Then $F(y) > \omega$ and, by the weak convergence, for sufficiently large $n$ we have $F_n(y) > \omega$. Then, by the construction, $X_n^+(\omega) \le y$. This inequality holds for all except finitely many $n$, so it is preserved if we take the upper limit on the left:
\[
\limsup_{n\to\infty} X_n^+(\omega) \le y.
\]
Take a sequence $y_k$ of continuity points of $F$ converging to $X^+(\omega)$ from above (the set of discontinuity points of a monotone function is at most countable). For $y = y_k$ consider the above inequality and pass to the limit with $k$ to get
\[
\limsup_{n\to\infty} X_n^+(\omega) \le X^+(\omega).
\]
Similarly
\[
\liminf_{n\to\infty} X_n^-(\omega) \ge X^-(\omega),
\]
so
\[
X^-(\omega) \le \liminf_{n\to\infty} X_n^-(\omega) \le \limsup_{n\to\infty} X_n^+(\omega) \le X^+(\omega).
\]
The extremes are equal a.s., so the convergence holds a.s. □
The Skorokhod theorem is an important tool in probability. We will only
need it for the following result, which links convergence of distributions to that
of the associated characteristic functions.
Theorem 8.36
If $P_{X_n}$ converge weakly to $P_X$, then $\varphi_{X_n} \to \varphi_X$.
Proof
Take the Skorokhod representation $Y_n$, $Y$ of the measures $P_{X_n}$, $P_X$. Almost sure convergence of $Y_n$ to $Y$ implies that $\mathbb{E}(\mathrm{e}^{itY_n}) \to \mathbb{E}(\mathrm{e}^{itY})$ by the dominated convergence theorem. But the distributions of $X_n$, $X$ are the same as the distributions of $Y_n$, $Y$, so the characteristic functions are the same. □
Theorem 8.37 (Helly's theorem)
Let $F_n$ be a sequence of distribution functions of some probability measures. There exists $F$, the distribution function of a measure (not necessarily a probability measure), and a sequence $k_n$ such that $F_{k_n}(x) \to F(x)$ at the continuity points of $F$.
Proof
Arrange the rational numbers in a sequence: $\mathbb{Q} = \{q_1, q_2, \dots\}$. The sequence $F_n(q_1)$ is bounded (the values of a distribution function lie in $[0,1]$), hence it has a convergent subsequence,
\[
F_{k_n^1}(q_1) \to y_1.
\]
Next consider the sequence $F_{k_n^1}(q_2)$, which is again bounded, so for a subsequence $k_n^2$ of $k_n^1$ we have convergence
\[
F_{k_n^2}(q_2) \to y_2.
\]
Of course also
\[
F_{k_n^2}(q_1) \to y_1.
\]
Proceeding in this way we find $k_n^1, k_n^2, \dots$, each term a subsequence of the previous one, with
\[
F_{k_n^3}(q_m) \to y_m \ \text{ for } m \le 3, \qquad F_{k_n^4}(q_m) \to y_m \ \text{ for } m \le 4,
\]
and so on. The diagonal sequence $F_{k_n} = F_{k_n^n}$ converges at all rational points. We define $F_{\mathbb{Q}}$ on $\mathbb{Q}$ by
\[
F_{\mathbb{Q}}(q_m) = y_m,
\]
and next we write
\[
F(x) = \inf\{F_{\mathbb{Q}}(q) : q \in \mathbb{Q},\ q > x\}.
\]
We show that $F$ is non-decreasing. Since $F_n$ are non-decreasing, the same is true for $F_{\mathbb{Q}}$ ($q_1 < q_2$ implies $F_{k_n}(q_1) \le F_{k_n}(q_2)$, which remains true in the limit). Now let $x_1 < x_2$. $F(x_1) \le F_{\mathbb{Q}}(q)$ for all $q > x_1$, hence in particular for all $q > x_2$, so $F(x_1) \le \inf_{q > x_2} F_{\mathbb{Q}}(q) = F(x_2)$.
We show that $F$ is right-continuous. Let $x_n \searrow x$. By the monotonicity of $F$, $F(x) \le F(x_n)$, hence $F(x) \le \lim F(x_n)$. Suppose that $F(x) < \lim F(x_n)$. By the definition of $F$ there is $q \in \mathbb{Q}$, $x < q$, such that $F_{\mathbb{Q}}(q) < \lim F(x_n)$. For some $n_0$, $x \le x_{n_0} < q$, hence $F(x_{n_0}) \le F_{\mathbb{Q}}(q)$ again by the definition of $F$, thus $F(x_{n_0}) < \lim F(x_n)$, which is a contradiction.
Finally, we show that if $F$ is continuous at $x$, then $F_{k_n}(x) \to F(x)$. Let $\varepsilon > 0$ be arbitrary and find rationals $q_1 < q_2 < x < q_3$ such that
\[
F(x) - \varepsilon < F(q_1) \le F(x) \le F(q_3) < F(x) + \varepsilon.
\]
Since $F_{k_n}(q_2) \to F_{\mathbb{Q}}(q_2) \ge F(q_1)$, then for sufficiently large $n$,
\[
F(x) - \varepsilon < F_{k_n}(q_2).
\]
But $F_{k_n}$ is non-decreasing, so
\[
F_{k_n}(q_2) \le F_{k_n}(x) \le F_{k_n}(q_3).
\]
Finally, $F_{k_n}(q_3) \to F_{\mathbb{Q}}(q_3) \le F(q_3)$, so for sufficiently large $n$,
\[
F_{k_n}(q_3) < F(x) + \varepsilon.
\]
Putting together the above three inequalities we get
\[
F(x) - \varepsilon < F_{k_n}(x) < F(x) + \varepsilon,
\]
which proves the convergence. □
Remark 8.38
The limit distribution function need not correspond to a probability measure. Example: $F_n = \mathbf{1}_{[n,\infty)}$, $F_n \to 0$, so $F = 0$. This is a distribution function (non-decreasing, right-continuous) and the corresponding measure satisfies $P(A) = 0$ for all $A$. We then say informally that the mass escapes to infinity. The following concept is introduced to prevent this happening.
Definition 8.39
We say that a sequence of probabilities $P_n$ on $\mathbb{R}^d$ is tight if for each $\varepsilon > 0$ there is $M$ such that $P_n(\mathbb{R}^d \setminus [-M, M]) < \varepsilon$ for all $n$.

By an interval in $\mathbb{R}^d$ we understand the product of intervals: $[-M, M] = \{x = (x_1, \dots, x_d) \in \mathbb{R}^d : x_i \in [-M, M] \text{ for all } i\}$. It is important that the $M$ chosen for $\varepsilon$ is good for all $n$: the inequality is uniform. It is easy to find such an $M = M_n$ for each $P_n$ separately; this follows from the fact that $P_n([-M, M]) \to 1$ as $M \to \infty$.
Theorem 8.40 (Prokhorov's theorem)
If a sequence Pn is tight, then it has a subsequence convergent weakly to some
probability measure P.
Proof
By Helly's theorem a subsequence $F_{k_n}$ converges to some distribution function $F$. All we have to do is to show that $F$ corresponds to some probability measure $P$, which means we have to show that $F(\infty) = 1$ (i.e. $\lim_{y\to\infty} F(y) = 1$). Fix $\varepsilon > 0$ and find a continuity point $y$ of $F$ such that $F_n(y) = P_n((-\infty, y]) > 1 - \varepsilon$ for all $n$ (find $M$ from the definition of tightness and take a continuity point of $F$ which is larger than $M$). Hence $\lim_{n\to\infty} F_{k_n}(y) \ge 1 - \varepsilon$, but this limit is $F(y)$. This proves that $\lim_{y\to\infty} F(y) = 1$. □
We need to extend the notion of the characteristic function.
Definition 8.41
We say that $\varphi$ is the characteristic function of a Borel measure $P$ on $\mathbb{R}$ if
\[
\varphi(t) = \int \mathrm{e}^{itx}\,\mathrm{d}P(x).
\]
In the case where $P = P_X$ we obviously have $\varphi_P = \varphi_X$, so the two definitions are consistent.
Theorem 8.42
Suppose $(P_n)$ is tight and let $P$ be the limit of a subsequence of $(P_n)$ as provided by Prokhorov's theorem. If $\varphi_n(u) \to \varphi(u)$, where $\varphi_n$ are the characteristic functions of $P_n$ and $\varphi$ is the characteristic function of $P$, then $P_n \to P$ weakly.
Proof
Fix a continuity point of $F$. For every subsequence $F_{k_n}$ there is a further subsequence (a sub-subsequence) $l_{k_n}$, $l_n$ for brevity, such that $F_{l_n}$ converge to some function $H$ (Helly's theorem). Denote the corresponding measure by $P'$. Hence $\varphi_{l_n} \to \varphi_{P'}$, but on the other hand, $\varphi_{l_n} \to \varphi$. So $\varphi_{P'} = \varphi$ and consequently $P' = P$ (Corollary 6.25). The above is sufficient for the convergence of the sequence $F_n(y)$. □
8.2.6 Central limit theorem
The following lemma will be useful in what follows. It shows how to estimate
the 'weight' of the 'tails' of a probability measure, and will be a useful tool in
proving tightness.
Lemma 8.43
If $\varphi$ is the characteristic function of $P$, then
\[
P(\mathbb{R} \setminus [-M, M]) \le 7M\int_0^{1/M}\big[1 - \Re\varphi(u)\big]\,\mathrm{d}u.
\]
Proof
\[
M\int_0^{1/M}\big[1 - \Re\varphi(u)\big]\,\mathrm{d}u = M\int_0^{1/M}\Big[1 - \Re\int \mathrm{e}^{ixu}\,\mathrm{d}P(x)\Big]\,\mathrm{d}u
\]
\[
= M\int_0^{1/M}\Big[1 - \int \cos(xu)\,\mathrm{d}P(x)\Big]\,\mathrm{d}u
\]
\[
= M\int\int_0^{1/M}\big[1 - \cos(xu)\big]\,\mathrm{d}u\,\mathrm{d}P(x) \quad \text{(by Fubini)}
\]
\[
= \int\Big(1 - \frac{\sin\frac{x}{M}}{\frac{x}{M}}\Big)\,\mathrm{d}P(x)
\]
\[
\ge \int_{|\frac{x}{M}| > 1}\Big(1 - \frac{\sin\frac{x}{M}}{\frac{x}{M}}\Big)\,\mathrm{d}P(x)
\]
\[
\ge \inf_{|t| \ge 1}\Big(1 - \frac{\sin t}{t}\Big)\int_{|\frac{x}{M}| > 1}\mathrm{d}P(x) \ge \frac{1}{7}\,P(\mathbb{R} \setminus [-M, M]),
\]
since for $|t| \ge 1$
\[
1 - \frac{\sin t}{t} \ge 1 - \sin 1 \ge \frac{1}{7}. \qquad \square
\]
Theorem 8.44 (Levy's theorem)
Let $\varphi_n$ be the characteristic functions of $P_n$. Suppose that $\varphi_n \to \varphi$, where $\varphi$ is a function continuous at $0$. Then $\varphi$ is the characteristic function of a measure $P$ and $P_n \to P$ weakly.
Proof
It is sufficient to show that $P_n$ is tight (then $P_{k_n} \to P$ weakly for some $P$ and $k_n$, $\varphi_{k_n} \to \varphi_P$, $\varphi = \varphi_P$, and by the previous theorem we are done).
Applying Lemma 8.43 we have
\[
P_n(\mathbb{R} \setminus [-M, M]) \le 7M\int_0^{1/M}\big[1 - \Re\varphi_n(u)\big]\,\mathrm{d}u.
\]
Since $\varphi_n \to \varphi$, $|\varphi_n| \le 1$, and for fixed $M$ this upper bound is integrable, so by the dominated convergence theorem
\[
7M\int_0^{1/M}\big[1 - \Re\varphi_n(u)\big]\,\mathrm{d}u \to 7M\int_0^{1/M}\big[1 - \Re\varphi(u)\big]\,\mathrm{d}u.
\]
If $M \to \infty$, then
\[
7M\int_0^{1/M}\big[1 - \Re\varphi(u)\big]\,\mathrm{d}u \le 7M\,\frac{1}{M}\sup_{[0,1/M]}\big|1 - \Re\varphi(u)\big| \to 0
\]
by continuity of $\varphi$ at $0$ (recall that $\varphi(0) = 1$). Now let $\varepsilon > 0$; find $M_0$ such that $7M_0\int_0^{1/M_0}|1 - \Re\varphi(u)|\,\mathrm{d}u < \frac{\varepsilon}{2}$, and $n_0$ such that for $n \ge n_0$
\[
\Big|7M_0\int_0^{1/M_0}\big[1 - \Re\varphi_n(u)\big]\,\mathrm{d}u - 7M_0\int_0^{1/M_0}\big[1 - \Re\varphi(u)\big]\,\mathrm{d}u\Big| < \frac{\varepsilon}{2}.
\]
Hence
\[
P_n(\mathbb{R} \setminus [-M_0, M_0]) < \varepsilon
\]
for $n \ge n_0$. Now for each $n = 1, 2, \dots, n_0$ find $M_n$ such that $P_n([-M_n, M_n]) > 1 - \varepsilon$, and let $M = \max\{M_0, M_1, \dots, M_{n_0}\}$. Of course, since $M \ge M_k$, $P_n([-M, M]) \ge P_n([-M_k, M_k]) > 1 - \varepsilon$ for each $n$, which proves the tightness of $P_n$. □
For a sequence $X_k$ with $\mu_k = \mathbb{E}(X_k)$, $\sigma_k^2 = \mathrm{Var}(X_k)$ finite, let $S_n = X_1 + \cdots + X_n$ as usual and consider the normalized random variables
\[
T_n = \frac{S_n - \mathbb{E}(S_n)}{\sqrt{\mathrm{Var}(S_n)}}.
\]
Clearly $\mathbb{E}(T_n) = 0$ and $\mathrm{Var}(T_n) = 1$ (by $\mathrm{Var}(aX) = a^2\mathrm{Var}(X)$). Write $c_n^2 = \mathrm{Var}(S_n)$ (if the $X_n$ are independent, then, as we already know, $c_n^2 = \sum_{k=1}^n \sigma_k^2$). We state a condition under which the sequence of distributions of $T_n$ converges to the standard Gaussian measure $G$ (with the density $\frac{1}{\sqrt{2\pi}}\mathrm{e}^{-\frac{1}{2}x^2}$): for each $\varepsilon > 0$,
\[
\frac{1}{c_n^2}\sum_{k=1}^{n}\int_{\{|x - \mu_k| \ge \varepsilon c_n\}}(x - \mu_k)^2\,\mathrm{d}P_{X_k}(x) \to 0. \tag{8.4}
\]
In particular, if the distributions of the $X_n$ are the same, $\mu_k = \mu$, $\sigma_k = \sigma$, then this condition is satisfied. To see this, note that assuming independence we have $c_n^2 = n\sigma^2$, and then the expression in (8.4) becomes
\[
\frac{1}{\sigma^2}\int_{\{|x - \mu| \ge \varepsilon\sigma\sqrt{n}\}}(x - \mu)^2\,\mathrm{d}P_{X_1}(x),
\]
hence it converges to 0 as $n \to \infty$, since the set $\{x : |x - \mu| \ge \varepsilon\sigma\sqrt{n}\}$ decreases to $\emptyset$.
We are ready for the main theorem in probability. The proof is quite technical and advanced, and may be omitted at a first reading.
Theorem 8.45 (Lindeberg-Feller theorem)
Let $X_n$ be independent with finite expectations and variances. If condition (8.4) holds, then $P_{T_n} \to G$ weakly.
Proof
Assume for simplicity that $\mu_n = 0$ (the general case follows by considering $X_n - \mu_n$). It is sufficient to show that the characteristic functions $\varphi_{T_n}$ converge to the characteristic function of $G$, i.e. to show that
\[
\varphi_{T_n}(u) \to \mathrm{e}^{-\frac{1}{2}u^2}.
\]
We compute
\[
\varphi_{T_n}(u) = \mathbb{E}(\mathrm{e}^{iuT_n}) \quad \text{(by the definition of $\varphi_{T_n}$)}
\]
\[
= \mathbb{E}\big(\mathrm{e}^{i\frac{u}{c_n}\sum_{k=1}^n X_k}\big) = \mathbb{E}\Big(\prod_{k=1}^{n}\mathrm{e}^{i\frac{u}{c_n}X_k}\Big)
\]
\[
= \prod_{k=1}^{n}\mathbb{E}\big(\mathrm{e}^{i\frac{u}{c_n}X_k}\big) \quad \text{(by independence)}
\]
\[
= \prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{c_n}\Big) \quad \text{(by the definition of $\varphi_{X_k}$).}
\]
What we need to show is that
\[
\log\prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{c_n}\Big) \to -\frac{1}{2}u^2.
\]
We shall make use of the following formulae (particular cases of Taylor's formula for a complex variable):
\[
\log(1 + z) = z + \tfrac{1}{2}\theta_1|z|^2 \quad \text{for some $\theta_1$ with $|\theta_1| \le 1$,}
\]
\[
\mathrm{e}^{iy} = 1 + iy + \tfrac{1}{2}\theta_2 y^2 \quad \text{for some $\theta_2$ with $|\theta_2| \le 1$,}
\]
\[
\mathrm{e}^{iy} = 1 + iy - \tfrac{1}{2}y^2 + \tfrac{1}{6}\theta_3|y|^3 \quad \text{for some $\theta_3$ with $|\theta_3| \le 1$.}
\]
So for fixed $\varepsilon > 0$,
\[
\varphi_{X_k}(u) = \int_{|x| \ge \varepsilon c_n}\mathrm{e}^{iux}\,\mathrm{d}P_{X_k}(x) + \int_{|x| < \varepsilon c_n}\mathrm{e}^{iux}\,\mathrm{d}P_{X_k}(x)
\]
\[
= \int_{|x| \ge \varepsilon c_n}\Big(1 + iux + \frac{1}{2}\theta_2 u^2x^2\Big)\,\mathrm{d}P_{X_k}(x) + \int_{|x| < \varepsilon c_n}\Big(1 + iux - \frac{1}{2}u^2x^2 + \frac{1}{6}\theta_3|u|^3|x|^3\Big)\,\mathrm{d}P_{X_k}(x)
\]
\[
= 1 + \frac{1}{2}u^2\int_{|x| \ge \varepsilon c_n}\theta_2 x^2\,\mathrm{d}P_{X_k}(x) - \frac{1}{2}u^2\int_{|x| < \varepsilon c_n}x^2\,\mathrm{d}P_{X_k}(x) + \frac{1}{6}|u|^3\int_{|x| < \varepsilon c_n}\theta_3|x|^3\,\mathrm{d}P_{X_k}(x),
\]
since $\int x\,\mathrm{d}P_{X_k}(x) = 0$ (this is $\mathbb{E}(X_k)$). For clarity we introduce the following notation:
\[
\alpha_{nk} = \int_{|x| \ge \varepsilon c_n}x^2\,\mathrm{d}P_{X_k}(x), \qquad \beta_{nk} = \int_{|x| < \varepsilon c_n}x^2\,\mathrm{d}P_{X_k}(x).
\]
Observe that $\beta_{nk} \le \varepsilon^2 c_n^2$, because on the set over which we take the integral we have $x^2 < \varepsilon^2 c_n^2$ and we integrate with respect to a probability measure. Condition (8.4) now takes the form
\[
\frac{1}{c_n^2}\sum_{k=1}^{n}\alpha_{nk} \to 0. \tag{8.5}
\]
Since
\[
\sum_{k=1}^{n}(\alpha_{nk} + \beta_{nk}) = \sum_{k=1}^{n}\mathrm{Var}(X_k) = c_n^2,
\]
we have
\[
\sum_{k=1}^{n}\frac{1}{c_n^2}\beta_{nk} \to 1 \quad \text{as } n \to \infty. \tag{8.6}
\]
The numbers $\alpha_{nk}$, $\beta_{nk}$ are positive, so the last convergence is monotone (the sequence is increasing). The above relations hold for each $\varepsilon > 0$.
We now analyse some terms in the expression for $\varphi_{X_k}$. Since
\[
\Big|\int_{|x| \ge \varepsilon c_n}\theta_2 x^2\,\mathrm{d}P_{X_k}(x)\Big| \le \int_{|x| \ge \varepsilon c_n}x^2\,\mathrm{d}P_{X_k}(x),
\]
we have a $\theta_2'$ with $|\theta_2'| \le 1$ such that
\[
\int_{|x| \ge \varepsilon c_n}\theta_2 x^2\,\mathrm{d}P_{X_k}(x) = \theta_2'\int_{|x| \ge \varepsilon c_n}x^2\,\mathrm{d}P_{X_k}(x) = \theta_2'\alpha_{nk}.
\]
Next,
\[
\Big|\int_{|x| < \varepsilon c_n}\theta_3|x|^3\,\mathrm{d}P_{X_k}(x)\Big| \le \int_{|x| < \varepsilon c_n}|x|^3\,\mathrm{d}P_{X_k}(x) \le \int_{|x| < \varepsilon c_n}\varepsilon c_n x^2\,\mathrm{d}P_{X_k}(x)
\]
(we replace one $x$ by $\varepsilon c_n$, leaving the remaining two), hence for some $\theta_3'$ with $|\theta_3'| \le 1$,
\[
\int_{|x| < \varepsilon c_n}\theta_3|x|^3\,\mathrm{d}P_{X_k}(x) = \theta_3'\int_{|x| < \varepsilon c_n}\varepsilon c_n x^2\,\mathrm{d}P_{X_k}(x) = \theta_3'\varepsilon c_n\beta_{nk}.
\]
We substitute this to the expression for $\varphi_{X_k}$, obtaining
\[
\varphi_{X_k}(u) = 1 + \frac{1}{2}u^2\theta_2'\alpha_{nk} - \frac{1}{2}u^2\beta_{nk} + \frac{1}{6}|u|^3\theta_3'\varepsilon c_n\beta_{nk}.
\]
Replace $u$ by $\frac{u}{c_n}$ to get
\[
\varphi_{X_k}\Big(\frac{u}{c_n}\Big) = 1 + \gamma_{nk}, \qquad \gamma_{nk} = \frac{u^2}{2c_n^2}\theta_2'\alpha_{nk} - \frac{u^2}{2c_n^2}\beta_{nk} + \frac{|u|^3\varepsilon}{6c_n^2}\theta_3'\beta_{nk},
\]
and then, by (8.5) and (8.6),
\[
\sum_{k=1}^{n}\gamma_{nk} \to -\frac{1}{2}u^2 + \frac{1}{6}|u|^3\theta_3'\varepsilon. \tag{8.7}
\]
Recall that what we are really after is
\[
\log\prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{c_n}\Big) = \sum_{k=1}^{n}\log\varphi_{X_k}\Big(\frac{u}{c_n}\Big) = \sum_{k=1}^{n}\Big(\gamma_{nk} + \frac{1}{2}\theta_1|\gamma_{nk}|^2\Big),
\]
where we employed Taylor's formula for the logarithm, so we are not that far from the target. All we have to do is to show that
\[
\sum_{k=1}^{n}|\gamma_{nk}|^2 \to 0
\]
as $n \to \infty$.
So let $\delta > 0$ be arbitrary and $u$ fixed. Then
\[
\Big|\log\prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{c_n}\Big) + \frac{1}{2}u^2\Big| \le \Big|\sum_{k=1}^{n}\gamma_{nk} + \frac{1}{2}u^2\Big| + \frac{1}{2}|\theta_1|\sum_{k=1}^{n}|\gamma_{nk}|^2
\]
\[
\le \Big|\sum_{k=1}^{n}\gamma_{nk} + \frac{1}{2}u^2 - \frac{1}{6}|u|^3\theta_3'\varepsilon\Big| + \sum_{k=1}^{n}|\gamma_{nk}|^2 + \frac{1}{6}|u|^3\varepsilon|\theta_3'|.
\]
The first term on the right converges to zero by (8.7), so it is less than $\frac{\delta}{4}$ for sufficiently large $n$. It remains to show that
\[
\sum_{k=1}^{n}|\gamma_{nk}|^2 + \frac{1}{6}|u|^3\varepsilon < \frac{\delta}{2}
\]
for large $n$. We choose $\varepsilon$ so small that $\frac{1}{6}|u|^3\varepsilon < \frac{\delta}{4}$. It remains to show that
\[
\sum_{k=1}^{n}|\gamma_{nk}|^2 < \frac{\delta}{4}
\]
for large $n$. We have a formula for $\gamma_{nk}$, and we know something about $\sum_{k=1}^{n}\gamma_{nk}$, hence we use the following trick:
\[
\sum_{k=1}^{n}|\gamma_{nk}|^2 \le \Big(\max_{k=1,\dots,n}|\gamma_{nk}|\Big)\sum_{k=1}^{n}|\gamma_{nk}|.
\]
The first factor satisfies
\[
\max_{k=1,\dots,n}|\gamma_{nk}| \le \frac{1}{2}u^2\max_{k=1,\dots,n}\frac{\alpha_{nk}}{c_n^2} + \frac{1}{2}u^2\varepsilon^2 + \frac{1}{6}|u|^3\varepsilon^3
\]
and
\[
\max_{k=1,\dots,n}\frac{\alpha_{nk}}{c_n^2} \le \sum_{k=1}^{n}\frac{\alpha_{nk}}{c_n^2} \to 0.
\]
The second factor satisfies
\[
\sum_{k=1}^{n}|\gamma_{nk}| \le \frac{u^2}{2}\sum_{k=1}^{n}\frac{\alpha_{nk}}{c_n^2} + \frac{u^2}{2} + \frac{|u|^3\varepsilon}{6},
\]
where we used the fact $\sum_{k=1}^{n}\frac{\beta_{nk}}{c_n^2} \le 1$. Writing $a_n = \sum_{k=1}^{n}\frac{\alpha_{nk}}{c_n^2}$, $a_n \to 0$, for clarity we have, taking $\varepsilon \le 1$,
\[
\sum_{k=1}^{n}|\gamma_{nk}|^2 \le \Big(\frac{u^2}{2}a_n + \frac{u^2\varepsilon^2}{2} + \frac{|u|^3\varepsilon^3}{6}\Big)\Big(\frac{u^2}{2}a_n + \frac{u^2}{2} + \frac{|u|^3\varepsilon}{6}\Big) \le C a_n + D\varepsilon
\]
for some numbers $C$, $D$ (depending only on $u$). So finally choose $\varepsilon$ so that $D\varepsilon < \frac{\delta}{8}$ and then take $n_0$ so large that $C a_n < \frac{\delta}{8}$ for $n \ge n_0$. □
As a special case we immediately deduce a famous result:
Corollary 8.46 (de Moivre-Laplace theorem)
Let $X_n$ be identically distributed independent random variables with $P(X_n = 1) = P(X_n = -1) = \frac{1}{2}$. Then
\[
P\Big(a < \frac{S_n}{\sqrt{n}} < b\Big) \to \frac{1}{\sqrt{2\pi}}\int_a^b \mathrm{e}^{-\frac{1}{2}x^2}\,\mathrm{d}x.
\]
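The de Moivre-Laplace statement can be checked by direct simulation. The sketch below is our own illustration (sample sizes, number of trials and seed are arbitrary choices): it compares the empirical frequency of $a < S_n/\sqrt{n} < b$ for fair $\pm 1$ tosses with the Gaussian integral computed via the error function.

```python
import math
import random

def gauss_prob(a, b):
    """(1/sqrt(2*pi)) * integral from a to b of exp(-x^2/2) dx."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi(b) - Phi(a)

def empirical_prob(a, b, n=1000, trials=2000, seed=0):
    """Empirical P(a < S_n/sqrt(n) < b) for S_n a sum of n fair +-1 steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        if a < s / math.sqrt(n) < b:
            hits += 1
    return hits / trials

print("empirical:", empirical_prob(-1.0, 1.0))
print("gaussian :", gauss_prob(-1.0, 1.0))
```

The two numbers agree up to sampling noise and the lattice effect of the discrete sum, both of which shrink as `n` and `trials` grow.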
Exercise 8.8
Use the central limit theorem to estimate the probability that the number
of 'heads' in 1000 independent tosses differs from 500 by less than 2%.
Exercise 8.9
How many tosses of a coin are required to have the probability at least
0.99 that the average number of 'heads' differs from 0.5 by less than 1%?
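One possible way to attack both exercises numerically is through the normal approximation; the reading of "1%" below (a deviation of $0.005$ in the average, i.e. 1% of $0.5$) is our interpretation, so treat the printed numbers as a sketch rather than the official solutions.

```python
import math

def Phi(x):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within(n, dev):
    """Normal approximation to P(|heads - n/2| < dev) for n fair tosses;
    the number of heads has mean n/2 and standard deviation sqrt(n)/2."""
    sd = math.sqrt(n) / 2.0
    return 2.0 * Phi(dev / sd) - 1.0

# Exercise 8.8: 2% of 500 is 10 heads.
print("Exercise 8.8:", prob_within(1000, 10))

# Exercise 8.9 (one reading): smallest n with
# P(|heads/n - 0.5| < 0.005) >= 0.99.
n = 1
while prob_within(n, 0.005 * n) < 0.99:
    n += 1  # linear scan is crude but transparent
print("Exercise 8.9:", n)
```

Under this reading the first probability is roughly 0.47, and the required number of tosses is on the order of tens of thousands; a continuity correction would refine both figures slightly.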
8.2.7 Applications to mathematical finance
We shall show that the Black-Scholes model is a limit of a sequence of suitably defined CRR models, with option prices converging as well. For fixed $T$ recall that in the Black-Scholes model the stock price is of the form
\[
S(T) = S(0)\exp\{\xi(T)\},
\]
where $\xi(T) = (r - \frac{\sigma^2}{2})T + \sigma w(T)$ and $r$ is the risk-free rate for continuous compounding. Randomness enters only through $w(T)$, which is Gaussian with zero mean and variance $T$.
To construct the approximating sequence, fix $n$, decompose the time period into $n$ steps of length $\frac{T}{n}$, and write
\[
R_n = \exp\Big\{r\frac{T}{n}\Big\},
\]
which is the risk-free growth factor for one step. We construct a sequence $\eta_n(i)$, $i = 1, \dots, n$, of independent, identically distributed random variables with binomial distribution, so that the price in the binomial model after $n$ steps,
\[
S_n(T) = S(0)\prod_{i=1}^{n}\eta_n(i),
\]
converges to $S(T)$. Write
\[
\eta_n(i) = \begin{cases} U_n, \\ D_n, \end{cases}
\]
where each value is taken with probability $\frac{1}{2}$. Our task is to find $U_n$, $D_n$, assuming that $U_n > D_n$. The following condition
\[
\frac{1}{2}(U_n + D_n) = R_n \tag{8.8}
\]
guarantees that $S_n(T)$ is a martingale (see Section 7.4.4). We look at the logarithmic returns $\ln\eta_n(i)$.
We wish to apply the central limit theorem to the sequence $\ln\eta_n(i)$, so we adjust the variance. We want to have
\[
\mathrm{Var}\Big(\ln\Big(\prod_{i=1}^{n}\eta_n(i)\Big)\Big) = T\sigma^2.
\]
On the left, using independence,
\[
\mathrm{Var}\Big(\sum_{i=1}^{n}\ln\eta_n(i)\Big) = \sum_{i=1}^{n}\mathrm{Var}(\ln\eta_n(i)) = n\,\mathrm{Var}(\ln\eta_n(1)).
\]
For the binomial distribution with $p = \frac{1}{2}$,
\[
\mathrm{Var}(\ln\eta_n(1)) = \frac{1}{4}\Big(\ln\frac{U_n}{D_n}\Big)^2,
\]
so the condition needed is
\[
\frac{n}{4}\Big(\ln\frac{U_n}{D_n}\Big)^2 = T\sigma^2.
\]
Since $U_n > D_n$,
\[
\ln\frac{U_n}{D_n} = 2\sigma\sqrt{\frac{T}{n}},
\]
so finally
\[
U_n = D_n\exp\Big\{2\sigma\sqrt{\frac{T}{n}}\Big\}. \tag{8.9}
\]
We solve the system (8.8), (8.9) to get
\[
D_n = \frac{2R_n}{1 + \exp\{2\sigma\sqrt{T/n}\}}, \qquad U_n = \frac{2R_n\exp\{2\sigma\sqrt{T/n}\}}{1 + \exp\{2\sigma\sqrt{T/n}\}}.
\]
Consider the expected values
\[
\mathbb{E}\Big(\ln\Big(\prod_{i=1}^{n}\eta_n(i)\Big)\Big) = \mathbb{E}\Big(\sum_{i=1}^{n}\ln\eta_n(i)\Big) = n\,\mathbb{E}(\ln\eta_n(1)) = n\mu_n,
\]
say.
Exercise 8.10

Show that
\[
n\mu_n = n\,\mathbb{E}(\ln\eta_n(1)) \to \Big(r - \frac{\sigma^2}{2}\Big)T.
\]
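Assuming the limit in Exercise 8.10 is $n\mu_n \to (r - \frac{1}{2}\sigma^2)T$ (the value Theorem 8.47 below requires of $\mathbb{E}(A_n)$), the claim is easy to check numerically; the parameter values $r$, $\sigma$, $T$ below are arbitrary choices of ours.

```python
import math

def n_mu_n(n, r=0.05, sigma=0.2, T=1.0):
    """n * E(ln eta_n(1)) for the CRR parameters of (8.8)-(8.9):
    R_n = exp(r*T/n), D_n = 2*R_n/(1 + exp(2*sigma*sqrt(T/n))),
    U_n = D_n * exp(2*sigma*sqrt(T/n)), each branch with probability 1/2."""
    x = sigma * math.sqrt(T / n)
    R = math.exp(r * T / n)
    D = 2.0 * R / (1.0 + math.exp(2.0 * x))
    U = D * math.exp(2.0 * x)
    # E(ln eta) = (ln U + ln D)/2
    return n * 0.5 * (math.log(U) + math.log(D))

for n in (10, 1000, 100_000):
    print(n, n_mu_n(n))
print("limit:", 0.05 - 0.5 * 0.2**2)  # (r - sigma^2/2)*T = 0.03
```

The printed values approach the limit rapidly, consistent with the error being of order $1/n$ in this expansion.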
For each $n$ we have a sequence of $n$ independent identically distributed random variables $\xi_n(i) = \ln\eta_n(i)$, forming a so-called triangular array. We have the following version of the central limit theorem. It can be proved in exactly the same way as Theorem 8.45 (see also [2]).
Theorem 8.47
If $\xi_n(i)$ is a triangular array, $A_n = \sum_i \xi_n(i)$, $\mathbb{E}(A_n) \to \mu = (r - \frac{1}{2}\sigma^2)T$ and $\mathrm{Var}(A_n) \to \sigma^2 T$, then the sequence $(A_n)$ converges weakly to a Gaussian random variable with mean $\mu$ and variance $\sigma^2 T$.
The conditions stated in the theorem hold true with the values assigned above to $U_n$ and $D_n$, as we have seen. As a result, $\ln S_n(T)$ converges to $\ln S(T)$ weakly. The value of a put option in the binomial model is given by
\[
P_n(0) = \exp\{-rT\}\,\mathbb{E}(K - S_n(T))^+ = \exp\{-rT\}\,\mathbb{E}\big(g(\ln S_n(T))\big),
\]
where $g(x) = (K - \mathrm{e}^x)^+$ is a bounded, continuous function. Therefore
\[
P_n(0) \to \exp\{-rT\}\,\mathbb{E}\big(g(\ln S(T))\big),
\]
which, as we know (see Exercise 4.20), gives the Black-Scholes formula. The convergence of call prices follows immediately from the call-put parity, which holds universally, in each model.
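The convergence of put prices can also be observed numerically. The sketch below follows the construction above ((8.8), (8.9)) with probability $\frac{1}{2}$ for each branch; the market parameters are arbitrary choices of ours, and the closed-form put is the Black-Scholes formula referenced in Exercise 4.20.

```python
import math

def Phi(x):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_put(S0, K, r, sigma, T):
    """Black-Scholes put price (cf. Exercise 4.20)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return -S0 * Phi(-d1) + K * math.exp(-r * T) * Phi(-d2)

def crr_put(S0, K, r, sigma, T, n):
    """Binomial put price with U_n, D_n solving (8.8)-(8.9), p = 1/2."""
    R = math.exp(r * T / n)
    e = math.exp(2.0 * sigma * math.sqrt(T / n))
    D = 2.0 * R / (1.0 + e)
    U = D * e
    expectation = 0.0
    for k in range(n + 1):  # k up-moves out of n steps
        s = S0 * U**k * D**(n - k)
        expectation += math.comb(n, k) * 0.5**n * max(K - s, 0.0)
    return math.exp(-r * T) * expectation

bs = bs_put(100, 100, 0.05, 0.2, 1.0)
for n in (10, 100, 1000):
    print(n, crr_put(100, 100, 0.05, 0.2, 1.0, n), "vs", bs)
```

The binomial prices settle onto the Black-Scholes value as `n` grows, which is the content of the convergence argument above.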
8.3 Proofs of propositions
Proof (of Proposition 8.10)
Clearly d(X, Y) = d(Y, X). If d(X, Y) = 0 then 1E(IX - YI) = 0, hence X = Y
almost everywhere so, X = Y in £1. The triangle inequality follows from the
triangle inequality for the metric p(x, y) = 1~~~~1'
Next, assume that $X_n \to X$ in probability. Let $\varepsilon > 0$. Then
\[
d(X_n, X) = \int_{\{|X_n - X| < \frac{\varepsilon}{2}\}}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P + \int_{\{|X_n - X| \ge \frac{\varepsilon}{2}\}}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P
\]
\[
\le \frac{\varepsilon}{2} + P\Big(|X_n - X| \ge \frac{\varepsilon}{2}\Big),
\]
since the integrand in the first term is less than $\frac{\varepsilon}{2}$ (estimating the denominator by 1) and we estimate the integrand in the second term by 1. For $n$ large enough the second term is less than $\frac{\varepsilon}{2}$.
Conversely, let $E_{\varepsilon,n} = \{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\}$ and assume $0 < \varepsilon < 1$. Additionally, let $A_n = \{\omega : |X_n(\omega) - X(\omega)| < 1\}$ and write
\[
d(X_n, X) = \int_{A_n}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P + \int_{A_n^c}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P.
\]
We estimate from below each of the two terms. First,
\[
\int_{A_n}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P \ge \int_{A_n \cap E_{\varepsilon,n}}\frac{|X_n - X|}{2}\,\mathrm{d}P \ge \frac{\varepsilon}{2}\,P(A_n \cap E_{\varepsilon,n}),
\]
since $\frac{a}{1+a} > \frac{a}{2}$ if $a < 1$. Second,
\[
\int_{A_n^c}\frac{|X_n - X|}{1 + |X_n - X|}\,\mathrm{d}P \ge \int_{A_n^c}\frac{1}{2}\,\mathrm{d}P \ge \int_{A_n^c \cap E_{\varepsilon,n}}\frac{\varepsilon}{2}\,\mathrm{d}P \ge \frac{\varepsilon}{2}\,P(A_n^c \cap E_{\varepsilon,n}),
\]
since $\varepsilon < 1$. Hence $d(X_n, X) \ge \frac{\varepsilon}{2}P(E_{\varepsilon,n})$, so if $d(X_n, X) \to 0$ then $P(E_{\varepsilon,n}) \to 0$, and $(X_n)$ converges to $X$ in probability. □
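The metric of Proposition 8.10 can be small even when the $L^1$ distance is not, which is the point of metrizing convergence in probability. A standard example (our construction, not from the text): let $X_n - X$ equal $n$ with probability $\frac{1}{n}$ and $0$ otherwise, so $X_n \to X$ in probability while $\mathbb{E}|X_n - X| = 1$ for every $n$.

```python
def d_metric(n):
    """d(X_n, X) = E(|X_n - X| / (1 + |X_n - X|)) when |X_n - X| equals n
    with probability 1/n and 0 otherwise; analytically this is 1/(n+1)."""
    return (1.0 / n) * (n / (1.0 + n))

def l1_distance(n):
    """E|X_n - X| for the same pair: n * (1/n) = 1 for every n."""
    return (1.0 / n) * n

for n in (1, 10, 1000):
    print(n, d_metric(n), l1_distance(n))
```

So $d(X_n, X) \to 0$ even though the $L^1$ distances stay at 1: the metric $d$ captures convergence in probability, not convergence of means.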
Proof (of Proposition 8.12)
First,
\[
\mathbb{E}(Y^p) = \int_{\{\omega : Y(\omega) \ge c\}}Y^p\,\mathrm{d}P + \int_{\{\omega : Y(\omega) < c\}}Y^p\,\mathrm{d}P \ge c^p P(Y \ge c) + \int_{\{\omega : Y(\omega) < c\}}Y^p\,\mathrm{d}P.
\]
Now if we let $c \to \infty$, then the second term converges to $\int_{\Omega}Y^p\,\mathrm{d}P = \mathbb{E}(Y^p)$, and the first term has no choice but to converge to 0. □
Proof (of Proposition 8.24)
(i) Write $A = \liminf_{n\to\infty} A_n$. If $\omega \in A$ then there is an $N$ such that $\omega \in A_n$ for all $n \ge N$, which implies $\omega \in A_n$ for all except finitely many $n$. Conversely, if $\omega \in A_n$ eventually, then there exists $N$ such that $\omega \in A_n$ for all $n \ge N$, and $\omega \in A$.
(ii) Fix $\varepsilon > 0$. If $A_n^{\varepsilon} = \{|X_n - X| < \varepsilon\}$ then $\{|X_n - X| < \varepsilon \text{ ev.}\} = \liminf_{n\to\infty} A_n^{\varepsilon} = A^{\varepsilon}$, say. But
\[
\{X_n \to X\} = \bigcap_{\varepsilon > 0}\{|X_n - X| < \varepsilon \text{ ev.}\} = \bigcap_{\varepsilon > 0}A^{\varepsilon}
\]
and the sets $A^{\varepsilon}$ decrease as $\varepsilon \searrow 0$. Taking $\varepsilon = \frac{1}{n}$ successively shows that
\[
P(X_n \to X) = P\Big(\bigcap_{n=1}^{\infty}\liminf_{m\to\infty} A_m^{1/n}\Big) = \lim_{\varepsilon\to 0}P(A^{\varepsilon}) = \lim_{\varepsilon\to 0}P(|X_n - X| < \varepsilon \text{ ev.}).
\]
(iii) By de Morgan,
\[
\Big(\bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty}A_m\Big)^c = \bigcap_{n=1}^{\infty}\bigcup_{m=n}^{\infty}A_m^c.
\]
(iv) If $A = \liminf_{n\to\infty} A_n$ then $\mathbf{1}_A = \liminf_{n\to\infty}\mathbf{1}_{A_n}$ (since $\mathbf{1}_A(\omega) = 1$ if and only if $\mathbf{1}_{A_n}(\omega) = 1$ eventually), and if $B = \limsup_{n\to\infty} A_n$, then $\mathbf{1}_B = \limsup_{n\to\infty}\mathbf{1}_{A_n}$ (since $\mathbf{1}_B(\omega) = 1$ if and only if $\mathbf{1}_{A_n}(\omega) = 1$ i.o.). Fatou's lemma implies
\[
P(A_n \text{ ev.}) = \int_{\Omega}\liminf_{n\to\infty}\mathbf{1}_{A_n}\,\mathrm{d}P \le \liminf_{n\to\infty}\int_{\Omega}\mathbf{1}_{A_n}\,\mathrm{d}P,
\]
that is
\[
P(A) \le \liminf_{n\to\infty}P(A_n) \le \limsup_{n\to\infty}P(A_n).
\]
But
\[
\limsup_{n\to\infty}P(A_n) = \limsup_{n\to\infty}\int_{\Omega}\mathbf{1}_{A_n}\,\mathrm{d}P \le \int_{\Omega}\limsup_{n\to\infty}\mathbf{1}_{A_n}\,\mathrm{d}P
\]
by Fatou in reverse (see the proof of Theorem 4.26), hence
\[
\limsup_{n\to\infty}P(A_n) \le \int_{\Omega}\mathbf{1}_B\,\mathrm{d}P = P(B) = P(A_n \text{ i.o.}).
\]
□
Solutions
Chapter 2
2.1 If we can cover $A$ by open intervals with prescribed total length, then we can also cover $A$ by closed intervals with the same endpoints (closed intervals are bigger), and the total length is the same. The same is true for any other kind of intervals. For the converse, suppose that $A$ is null in the sense of covering by closed intervals. Let $\varepsilon > 0$; take a cover $C_n = [a_n, b_n]$ with $\sum_n(b_n - a_n) < \frac{\varepsilon}{2}$, and let $I_n = (a_n - \frac{\varepsilon}{2^{n+2}}, b_n + \frac{\varepsilon}{2^{n+2}})$. For each $n$, $I_n$ contains $C_n$, so the $I_n$ cover $A$; the total length is less than $\varepsilon$. In the same way we refine the cover by any other kind of intervals.
2.2 Write each element of $C$ in ternary form. Suppose $C$ is countable, so that its elements can be arranged in a sequence. Define a number which is not in this sequence but has a ternary expansion, and so is in $C$, by exchanging 0 and 2 at the $n$th position.
2.3 For continuity at $x \in [0,1]$ take any $\varepsilon > 0$ and find $F(x) - \varepsilon < a < F(x) < b < F(x) + \varepsilon$ of the form $a = \frac{k}{2^n}$, $b = \frac{l}{2^n}$. By the construction of $F$, these numbers are values of $F$ taken on some intervals $(a_1, a_2)$, $(b_1, b_2)$ with ternary ends, and $a_1 < x < b_2$. Take a $\delta$ such that $a_1 < x - \delta < x + \delta < b_2$ to get the continuity condition.
2.4 $m^*(B) \le m^*(A \cup B) \le m^*(A) + m^*(B) = m^*(B)$ by monotonicity (Proposition 2.5) and subadditivity (Theorem 2.7). Thus $m^*(A \cup B)$ is squeezed between $m^*(B)$ and $m^*(B)$, so has little choice.
2.5 Since $A \subset B \cup (A \triangle B)$, $m^*(A) \le m^*(B) + m^*(A \triangle B) = m^*(B)$ (monotonicity and subadditivity). Reversing the roles of $A$ and $B$ gives the opposite inequality.
2.6 Since $A \cup B = A \cup (B \setminus A) = A \cup (B \setminus (A \cap B))$, using additivity and Proposition 2.15 we have $m(A \cup B) = m(A) + m(B) - m(A \cap B)$. Similarly $m(A \cup B \cup C) = m(A) + m(B) + m(C) - m(A \cap B) - m(A \cap C) - m(B \cap C) + m(A \cap B \cap C)$. Note that the second part makes use of the first on the sets $A \cap C$ and $B \cap C$.
2.7 It is sufficient to note that
\[
(a, b) = \bigcup_{n=1}^{\infty}\Big(a, b - \frac{1}{n}\Big], \qquad (a, b) = \bigcup_{n=1}^{\infty}\Big[a + \frac{1}{n}, b\Big).
\]
2.8 If $E$ is Lebesgue-measurable, then the existence of $O$ and $F$ is given by Theorems 2.17 and 2.29, respectively. Conversely, let $\varepsilon = \frac{1}{n}$, find $O_n$, $F_n$, and then $m^*(\bigcap O_n \setminus E) = 0$, $m^*(E \setminus \bigcup F_n) = 0$, so $E$ is Borel up to a null set. Hence $E$ is in $\mathcal{M}$.
2.9 We can decompose $A$ into $\bigcup_{i=1}^{\infty}(A \cap H_i)$; the components are pairwise disjoint, so
\[
P(A) = \sum_{i=1}^{\infty}P(A \cap H_i) = \sum_{i=1}^{\infty}P(A|H_i)P(H_i),
\]
using the definition of conditional probability.
2.10 Since $A^c \cap B = (\Omega \setminus A) \cap B = B \setminus (A \cap B)$, $P(A^c \cap B) = P(B) - P(A \cap B) = P(B) - P(A)P(B) = P(B)(1 - P(A)) = P(B)P(A^c)$.
2.11 There are 32 paths altogether. $S(5) = 524.88 = 500U^2D^3$, so there are $\binom{5}{2} = 10$ paths. $S(5) > 900$ in two cases: $S(5) = 1244.16 = 500U^5$ or $S(5) = 933.12 = 500U^4D$, so we have 6 paths with probability $0.5^5 = 0.03125$ each, and the probability in question is $0.1875$.
2.12 There are $2^m$ paths of length $m$; $\mathcal{F}_m$ can be identified with the power set of the set of all such paths, thus it has $2^{2^m}$ elements.
2.13 $\mathcal{F}_m \subset \mathcal{F}_{m+1}$, since if the first $m + 1$ elements of a sequence are identical, so are the first $m$ elements.
2.14 It suffices to note that $P(A_m \cap A_k) = \frac{1}{4}$, which is the same as $P(A_m)P(A_k)$.
Chapter 3
3.1 If $f$ is monotone (in the sense that $x_1 \le x_2$ implies $f(x_1) \le f(x_2)$), the inverse image of the interval $(a, \infty)$ is either $[b, \infty)$ or $(b, \infty)$ with $b = \sup\{x : f(x) \le a\}$, so it is obviously measurable.
3.2 The level set $\{x : f(x) = a\}$ is the intersection of $f^{-1}([a, +\infty))$ and $f^{-1}((-\infty, a])$, each measurable by Theorem 3.3.
3.3 If $b \ge a$, then $(f^a)^{-1}((-\infty, b]) = \mathbb{R}$, and if $b < a$, then $(f^a)^{-1}((-\infty, b]) = f^{-1}((-\infty, b])$; in each case a measurable set.
3.4 Let $A$ be a non-measurable set and let $f(x) = 1$ if $x \in A$, $f(x) = -1$ otherwise. The set $f^{-1}([1, \infty))$ is non-measurable (it is $A$), so $f$ is non-measurable, but $f^2 \equiv 1$, which is clearly measurable.
3.5 Let $g(x) = \limsup f_n(x)$, $h(x) = \liminf f_n(x)$; they are measurable by Theorem 3.9. Their difference is also measurable, and the set where $f_n$ converges is the level set of this difference: $\{x : f_n \text{ converges}\} = \{x : (h - g)(x) = 0\}$, hence is measurable.
3.6 If $\sup f = \infty$ then there is nothing to prove. If $\sup f = M$, then $f(x) \le M$ for all $x$, so $M$ is one of the $z$ in the definition of $\operatorname{ess\,sup}$ (so the infimum of that set is at most $M$). Let $f$ be continuous and suppose $\operatorname{ess\,sup} f < M$ finite. Then we take $z$ between these numbers and, by the definition of $\operatorname{ess\,sup}$, $f(x) \le z$ a.e. Hence $A = \{x : f(x) > z\}$ is null. However, by continuity $A$ contains the set $f^{-1}((z, M))$, which is open and non-empty, a contradiction. If $\sup f$ is infinite, then for each $z$ the condition $f(x) \le z$ a.e. does not hold. The infimum of the empty set is $+\infty$, so we are done.
3.7 It is sufficient to notice that if $\mathcal{G}$ is any $\sigma$-field containing the inverse images of Borel sets, then $\mathcal{F}_X \subset \mathcal{G}$.
3.8 Let $f : \mathbb{R} \to \mathbb{R}$ be given by $f(x) = x^2$. The complement of the set $f(\mathbb{R})$, equal to $(-\infty, 0)$, cannot be of the form $f(A)$, since each of these is contained in $[0, \infty)$.
3.9 The payoff of a down-and-out European call is $f(S(0), S(1), \dots, S(N))$ with $f(x_0, x_1, \dots, x_N) = (x_N - K)^+\cdot\mathbf{1}_A$, where $A = \{(x_0, x_1, \dots, x_N) : \min\{x_0, x_1, \dots, x_N\} \ge L\}$.
Chapter 4
4.1 (a) $\int_0^{10}\mathrm{Int}(x)\,\mathrm{d}x = 0\,m([0,1)) + 1\,m([1,2)) + 2\,m([2,3)) + \cdots + 9\,m([9,10)) + 10\,m([10,10]) = 45$.
(b) $\int_0^{2}\mathrm{Int}(x^2)\,\mathrm{d}x = 0\,m([0,1]) + 1\,m([1,\sqrt{2})) + 2\,m([\sqrt{2},\sqrt{3})) + 3\,m([\sqrt{3},2)) = 5 - \sqrt{3} - \sqrt{2}$.
(c) $\int_0^{2\pi}\mathrm{Int}(\sin x)\,\mathrm{d}x = 0\,m([0,\frac{\pi}{2})) + 1\,m([\frac{\pi}{2},\frac{\pi}{2}]) + 0\,m((\frac{\pi}{2},\pi]) - 1\,m((\pi,2\pi]) = -\pi$.
4.2 We have $f(x) = \sum_{k=1}^{\infty}k\,\mathbf{1}_{A_k}(x)$, where $A_k$ is the union of the $2^{k-1}$ intervals of length $\frac{1}{3^k}$ each that are removed from $[0,1]$ at the $k$th stage. The convergence is monotone, so
\[
\int_{[0,1]}f\,\mathrm{d}m = \lim_{n\to\infty}\sum_{k=1}^{n}k\,\frac{2^{k-1}}{3^k} = \frac{1}{3}\sum_{k=1}^{\infty}k\Big(\frac{2}{3}\Big)^{k-1}.
\]
Since $\sum_{k=1}^{\infty}a^k = \frac{a}{1-a}$, differentiating term by term with respect to $a$ we have $\sum_{k=1}^{\infty}k a^{k-1} = \frac{1}{(1-a)^2}$. With $a = \frac{2}{3}$ we get $\int_{[0,1]}f\,\mathrm{d}m = 3$.
4.3 The simple function $a\mathbf{1}_A$ is less than $f$, so its integral, equal to $a\,m(A)$, is smaller than the integral of $f$. Next, $f \le b\mathbf{1}_A$, hence the second inequality.
4.4 Let $f_n = n\mathbf{1}_{(0,\frac{1}{n}]}$; $\lim f_n(x) = 0$ for all $x$, but $\int f_n\,\mathrm{d}m = 1$.
4.5 Let $a \neq -1$. We have $\int x^a\,\mathrm{d}x = \frac{1}{a+1}x^{a+1}$ (indefinite integral). First consider $E = (0,1)$:
\[
\int_0^1 x^a\,\mathrm{d}x = \frac{1}{a+1}x^{a+1}\Big|_0^1,
\]
which is finite if $a > -1$. Next $E = (1,\infty)$:
\[
\int_1^{\infty}x^a\,\mathrm{d}x = \frac{1}{a+1}x^{a+1}\Big|_1^{\infty},
\]
and for this to be finite we need $a < -1$.
4.6 The sequence $f_n(x)$ converges to 0 pointwise; each $f_n$ is dominated by $x^{-2.5}$, which is integrable on $[1,\infty)$, so the sequence of integrals converges to 0.
4.7 First $a = 0$. Substitute $u = nx$:
\[
\int_0^{\infty}\frac{n^2x\,\mathrm{e}^{-n^2x^2}}{1+x^2}\,\mathrm{d}x = \int_0^{\infty}\frac{u\,\mathrm{e}^{-u^2}}{1+\big(\frac{u}{n}\big)^2}\,\mathrm{d}u.
\]
The sequence of integrands converges to $u\mathrm{e}^{-u^2}$ for all $u \ge 0$; it is dominated by $g(u) = u\mathrm{e}^{-u^2}$, so
\[
\lim_{n\to\infty}\int_0^{\infty}f_n\,\mathrm{d}m = \int_0^{\infty}\lim_{n\to\infty}f_n\,\mathrm{d}m = \int_0^{\infty}u\mathrm{e}^{-u^2}\,\mathrm{d}u = \frac{1}{2}.
\]
Now $a > 0$. After the same substitution we have
\[
\int_a^{\infty}\frac{n^2x\,\mathrm{e}^{-n^2x^2}}{1+x^2}\,\mathrm{d}x = \int_{\mathbb{R}}\frac{u\,\mathrm{e}^{-u^2}}{1+\big(\frac{u}{n}\big)^2}\,\mathbf{1}_{[na,\infty)}(u)\,\mathrm{d}u = \int_{\mathbb{R}}f_n(u)\,\mathrm{d}u,
\]
say, and $f_n \to 0$, $f_n(u) \le u\mathrm{e}^{-u^2}$, so $\lim\int f_n\,\mathrm{d}m = 0$.
4.8 The sequence $f_n(x)$ converges for $x \ge 0$ to $\mathrm{e}^{-x}$. We find the dominating function. Let $n > 1$. For $x \in (0,1)$, $x^{-\frac{1}{n}} \le x^{-\frac{1}{2}}$ and $(1+\frac{x}{n})^{-n} \le 1$, so $f_n(x) \le \frac{1}{\sqrt{x}}$, which is integrable over $(0,1)$. For $x \in [1,\infty)$, $x^{-\frac{1}{n}} \le 1$, so $f_n(x) \le (1+\frac{x}{n})^{-n}$. Next
\[
\Big(1+\frac{x}{n}\Big)^n = 1 + x + \frac{n(n-1)}{2!}\Big(\frac{x}{n}\Big)^2 + \cdots \ge \frac{n-1}{2n}x^2 \ge \frac{1}{4}x^2,
\]
so $f_n(x) \le \frac{4}{x^2}$, which is integrable over $[1,\infty)$.
Therefore, by the dominated convergence theorem,
\[
\lim_{n\to\infty}\int_0^{\infty}f_n\,\mathrm{d}m = \int_0^{\infty}\mathrm{e}^{-x}\,\mathrm{d}x = 1.
\]
4.9 (a) $\int_{-1}^{1}|n^a x^n|\,\mathrm{d}x = n^a\int_{-1}^{1}|x|^n\,\mathrm{d}x = 2n^a\int_0^1 x^n\,\mathrm{d}x$ ($|x|^n$ is an even function) $= \frac{2n^a}{n+1}$. If $a < 0$, then the series $\sum_{n\ge 1}\frac{2n^a}{n+1}$ converges by comparison with $\frac{1}{n^{1-a}}$; we may apply the Beppo-Levi theorem, and the power series in question defines an integrable function. If $a = 0$ the series is $\sum_{n\ge 1}x^n = \frac{x}{1-x}$, which is not integrable, since $\int_{-1}^{1}\big(\sum_{n\ge 1}x^n\big)\,\mathrm{d}x = \sum_{n=1}^{\infty}\int_{-1}^{1}x^n\,\mathrm{d}x = \infty$. By comparison the series fails to give an integrable function if $a > 0$.
(b) Write $\frac{x}{\mathrm{e}^x - 1} = \sum_{n\ge 1}x\mathrm{e}^{-nx}$, $\int_0^{\infty}x\mathrm{e}^{-nx}\,\mathrm{d}x = x\big(-\frac{1}{n}\big)\mathrm{e}^{-nx}\big|_0^{\infty} - \big(-\frac{1}{n}\big)\int_0^{\infty}\mathrm{e}^{-nx}\,\mathrm{d}x = \frac{1}{n^2}$ (integration by parts) and, as is well known, $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$.
4.10 We extend $f$ by putting $f(0) = 1$, so that $f$ is continuous, hence Riemann-integrable on any finite interval. Let $a_n = \int_{n\pi}^{(n+1)\pi}f(x)\,\mathrm{d}x$. Since $f$ is even, $a_{-n} = a_n$ and hence $\int_{-\infty}^{\infty}f(x)\,\mathrm{d}x = 2\sum_{n=0}^{\infty}a_n$. The series converges since $a_n = (-1)^n|a_n|$, $|a_n| \le \frac{2}{n\pi}$ ($x \ge n\pi$, $|\int_{n\pi}^{(n+1)\pi}\sin x\,\mathrm{d}x| = 2$). However, for $f$ to be in $L^1$ we would need $\int_{\mathbb{R}}|f|\,\mathrm{d}m = 2\sum_{n=0}^{\infty}b_n$ finite, where $b_n = \int_{n\pi}^{(n+1)\pi}|f(x)|\,\mathrm{d}x$. This is impossible due to $b_n \ge \frac{2}{(n+1)\pi}$.
4.11 Denote $I = \int_{-\infty}^{\infty}\mathrm{e}^{-x^2}\,\mathrm{d}x$; then
\[
I^2 = \iint_{\mathbb{R}^2}\mathrm{e}^{-(x^2+y^2)}\,\mathrm{d}x\,\mathrm{d}y = \int_0^{2\pi}\!\!\int_0^{\infty}r\mathrm{e}^{-r^2}\,\mathrm{d}r\,\mathrm{d}\alpha = \pi,
\]
using polar coordinates and Fubini's theorem (Chapter 6). Substitute $x = \frac{z}{\sqrt{2}}$ in $I$: $\sqrt{\pi} = \int_{-\infty}^{\infty}\mathrm{e}^{-x^2}\,\mathrm{d}x = \frac{1}{\sqrt{2}}\int_{-\infty}^{\infty}\mathrm{e}^{-\frac{z^2}{2}}\,\mathrm{d}z$, which completes the computation.
4.12 $\int_{\mathbb{R}}\frac{1}{1+x^2}\,\mathrm{d}x = \arctan x\big|_{-\infty}^{+\infty} = \pi$, hence $\int_{-\infty}^{\infty}c(x)\,\mathrm{d}x = 1$.
4.13 $\int_0^{\infty}\mathrm{e}^{-\lambda x}\,\mathrm{d}x = -\frac{1}{\lambda}\mathrm{e}^{-\lambda x}\big|_0^{\infty} = \frac{1}{\lambda}$, hence $c = \lambda$.
4.14 Let $a_n \to 0$, $a_n \ge 0$. Then $P_X(\{y\}) = \lim_{n\to\infty}P_X((y - a_n, y]) = F_X(y) - \lim_{n\to\infty}F_X(y - a_n)$, which proves the required equivalence. (Recall that $F_X$ is always right-continuous.)
4.15 (a) $F_X(y) = 1$ for $y \ge a$ and zero otherwise.
(b) $F_X(y) = 0$ for $y < 0$, $F_X(y) = 1$ for $y \ge \frac{1}{2}$, and $F_X(y) = 2y$ otherwise.
(c) $F_X(y) = 0$ for $y < 0$, $F_X(y) = 1$ for $y \ge \frac{1}{2}$, and $F_X(y) = 1 - (1 - 2y)^2$ otherwise.
4.16 In this case $g(x) = x^3$, $g^{-1}(y) = \sqrt[3]{y}$, $\frac{d}{dy}g^{-1}(y) = \frac13 y^{-2/3}$, hence $f_{X^3}(y) = \mathbf{1}_{[0,1]}(\sqrt[3]{y})\,\frac13 y^{-2/3} = \frac13 y^{-2/3}\,\mathbf{1}_{[0,1]}(y)$.
4.17 (a) $\int_\Omega a\,dP = aP(\Omega) = a$ (a constant function is a simple function).
(b) Using Exercise 4.15, $f_X(x) = 2\cdot\mathbf{1}_{[0,\frac12]}(x)$, so $\mathbb{E}(X) = \int_0^{1/2} 2x\,dx = \frac14$.
(c) Again by Exercise 4.15, $f_X(x) = 4(1-2x)\,\mathbf{1}_{[0,\frac12]}(x)$, $\mathbb{E}(X) = \int_0^{1/2} 4x(1-2x)\,dx = \frac16$.
4.18 (a) With $f_X(x) = \frac{1}{b-a}\mathbf{1}_{[a,b]}$, $\mathbb{E}(X) = \frac{1}{b-a}\int_a^b x\,dx = \frac{1}{b-a}\cdot\frac12(b^2-a^2) = \frac12(a+b)$.
(b) Consider the simple triangle distribution with $f_X(x) = x+1$ for $x \in [-1,0]$, $f_X(x) = -x+1$ for $x \in (0,1]$ and zero elsewhere. Then immediately $\int_{-1}^1 x f_X(x)\,dx = 0$. A similar computation for the density $f_Y$, whose triangle's base is $[a,b]$, gives $\mathbb{E}(Y) = \frac{a+b}{2}$.
(c) $\lambda \int_0^\infty x e^{-\lambda x}\,dx = \frac{1}{\lambda}$ (integration by parts).
4.19 (a) $\varphi_X(t) = \frac{1}{(b-a)it}\,(e^{ibt} - e^{iat})$,
(b) $\varphi_X(t) = \lambda \int_0^\infty e^{(it-\lambda)x}\,dx = \frac{\lambda}{\lambda - it}$,
(c) $\varphi_X(t) = e^{i\mu t - \frac12 \sigma^2 t^2}$.
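The closed form in 4.19(b) can be verified by numerical integration. This Python fragment (an added check; the values $\lambda = 2$, $t = 1$ are chosen arbitrarily) compares the complex integral with $\lambda/(\lambda - it)$.

```python
import cmath

lam, t = 2.0, 1.0   # arbitrary rate and argument for the check
h = 1e-4

# Midpoint rule for lam * integral_0^30 e^{(it-lam)x} dx; the tail beyond 30
# has modulus at most e^{-lam*30}, which is negligible here.
phi = sum(lam * cmath.exp((1j * t - lam) * ((k + 0.5) * h)) * h
          for k in range(300000))
closed_form = lam / (lam - 1j * t)
print(phi, closed_form)
```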
4.20 Using call–put parity, the formula for the call and the symmetry of the Gaussian distribution, we have $P = S(0)(N(d_1) - 1) - Ke^{-rT}(N(d_2) - 1) = -S(0)N(-d_1) + Ke^{-rT}N(-d_2)$.
Chapter 5
5.1 First, $f \sim f$ as $f = f$ everywhere. Second, if $f = g$ a.e., then of course $g = f$ a.e. Third, if $f = g$ on a full set $F_1 \subset E$ ($m(E \setminus F_1) = 0$) and $g = h$ on a full set $F_2 \subset E$, then $f = h$ on $F_1 \cap F_2$, which is full as well.
5.2 (a) $\|f_n - f_m\|_1 = 2$ if $m < n$, so the sequence is not Cauchy.
(b) $\|f_n - f_m\|_1 = \int_n^m \frac{1}{x}\,dx = \log m - \log n$ (for simplicity assume that $n < m$); let $\varepsilon = 1$, take any $N$, let $n = N$ and take $m$ such that $\log m - \log N > 1$ ($\log m \to \infty$ as $m \to \infty$) - the sequence is not Cauchy.
(c) $\|f_n - f_m\|_1 = \int_n^m \frac{1}{x^2}\,dx = -\frac{1}{x}\big|_n^m = \frac{1}{n} - \frac{1}{m}$ ($n < m$ as before), and for any $\varepsilon > 0$ take $N$ such that $\frac{1}{N} < \varepsilon$; for $n, m \ge N$, clearly $\frac{1}{n} - \frac{1}{m} < \varepsilon$ - the sequence is Cauchy.
5.3 $\|g_n - g_m\|_2^2 = \int_n^m \frac{1}{x^4}\,dx = -\frac{1}{3x^3}\Big|_n^m = \frac13\Bigl(\frac{1}{n^3} - \frac{1}{m^3}\Bigr)$ - the sequence is Cauchy.
5.4 (a) $\|f_n - f_m\|_2^2 = m - n$ if $m > n$, so the sequence is not Cauchy.
(b) $\|f_n - f_m\|_2^2 = \int_n^m \frac{1}{x^2}\,dx = \frac{1}{n} - \frac{1}{m} \to 0$ - the sequence is Cauchy.
(c) $\|f_n - f_m\|_2^2 = \int_n^m \frac{1}{x^4}\,dx = \frac{1}{3n^3} - \frac{1}{3m^3}$ - the sequence is Cauchy.
5.5 $\|f+g\|^2 = 4$, $\|f-g\|^2 = 1$, $\|f\|^2 = 1$, $\|g\|^2 = 1$, and the parallelogram law is violated.
5.6 $\|f+g\|^2 = 0$, $\|f-g\|^2 = \frac12$, $\|f\|^2 = \frac14$, $\|g\|^2 = \frac14$, which contradicts the parallelogram law.
5.7 Since $\sin nx \cos mx = \frac12[\sin(n+m)x + \sin(n-m)x]$ and $\sin nx \sin mx = \frac12[\cos(n-m)x - \cos(n+m)x]$, it is easy to compute the indefinite integrals. They are periodic functions, so the integrals over $[-\pi,\pi]$ are zero (for the latter we need $n \ne m$).
5.8 No: take any $n, m$ (suppose $n < m$) and compute
$$\|g_n - g_m\|_1 = \int_{1/m}^{1/n} \frac{1}{x^2}\,dx = -\frac{1}{x}\Big|_{1/m}^{1/n} = m - n \ge 1,$$
so the sequence is not Cauchy.
5.9 Let $\Omega = [0,1]$ with Lebesgue measure, $X(\omega) = \frac{1}{\sqrt{\omega}}$; then $\mathbb{E}(X) = \int_\Omega X\,dm = \int_0^1 \frac{1}{\sqrt{x}}\,dx = 2$, $\mathbb{E}(X^2) = \int_0^1 \frac{1}{x}\,dx = \infty$. If we take $X(\omega) = \frac{1}{\sqrt{\omega}} - 2$ then $\mathbb{E}(X) = 0$ and $\mathbb{E}(X^2) = \infty$.
5.10 $\mathrm{Var}(aX) = \mathbb{E}((aX)^2) - (\mathbb{E}(aX))^2 = a^2(\mathbb{E}(X^2) - (\mathbb{E}(X))^2) = a^2\mathrm{Var}(X)$.
5.11 Let $f_X(x) = \frac{1}{b-a}\mathbf{1}_{[a,b]}$; $\mathbb{E}(X) = \frac{a+b}{2}$, $\mathrm{Var}\,X = \mathbb{E}(X^2) - \bigl(\frac{a+b}{2}\bigr)^2$, $\mathbb{E}(X^2) = \frac{1}{b-a}\int_a^b x^2\,dx = \frac{1}{b-a}\cdot\frac13(b^3 - a^3)$, and simple algebra gives the result.
5.12 (a) $\mathbb{E}(X) = a$, $\mathbb{E}((X-a)^2) = 0$ since $X = a$ a.s.
(b) By Exercise 4.15, $f_X(x) = 2\cdot\mathbf{1}_{[0,\frac12]}(x)$ and by Exercise 4.17, $\mathbb{E}(X) = \frac14$; so $\mathrm{Var}(X) = \int_0^{1/2} 2\bigl(x - \frac14\bigr)^2\,dx = 2\int_{-1/4}^{1/4} x^2\,dx = \frac{1}{48}$.
(c) By Exercise 4.15, $f_X(x) = (4 - 8x)\,\mathbf{1}_{[0,\frac12]}(x)$, and by Exercise 4.17, $\mathbb{E}(X) = \frac16$. So $\mathrm{Var}(X) = \mathbb{E}(X^2) - \frac{1}{36} = \frac{1}{24} - \frac{1}{36} = \frac{1}{72}$.
5.13 $\mathrm{Cov}(Y, 2Y+1) = \mathbb{E}(Y(2Y+1)) - \mathbb{E}(Y)\mathbb{E}(2Y+1) = 2\mathbb{E}(Y^2) - 2(\mathbb{E}(Y))^2 = 2\mathrm{Var}(Y)$, $\mathrm{Var}(2Y+1) = \mathrm{Var}(2Y) = 4\mathrm{Var}(Y)$, hence $\rho = 1$.
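The conclusion $\rho = 1$ in 5.13 holds for any $Y$ with finite nonzero variance. A small Monte Carlo illustration (editorial; the uniform choice of $Y$ is arbitrary):

```python
import random

random.seed(0)
n = 10000
ys = [random.random() for _ in range(n)]
zs = [2 * y + 1 for y in ys]          # the pair (Y, 2Y + 1) from the exercise

my, mz = sum(ys) / n, sum(zs) / n
cov = sum((y - my) * (z - mz) for y, z in zip(ys, zs)) / n
vy = sum((y - my) ** 2 for y in ys) / n
vz = sum((z - mz) ** 2 for z in zs) / n
rho = cov / (vy * vz) ** 0.5
print(cov, 2 * vy, rho)               # cov = 2 Var(Y) and rho = 1, up to rounding
```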
5.14 $X, Y$ are uncorrelated by Exercise 5.7. Take $a > 0$ so small that the sets $A = \{\omega : \sin 2\pi\omega > 1-a\}$, $B = \{\omega : \cos 2\pi\omega > 1-a\}$ are disjoint. Then $P(A \cap B) = 0$ but $P(A)P(B) \ne 0$.
Chapter 6
6.1 The function
$$g(x,y) = \begin{cases} \dfrac{1}{x^2} & \text{if } 0 < y < x < 1 \\ -\dfrac{1}{y^2} & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise} \end{cases}$$
is not integrable since the integral of $g^+$ is infinite (the same is true for the integral of $g^-$). However,
$$\int_0^1 \Bigl[\int_y^1 \frac{1}{x^2}\,dx - \frac{1}{y^2}\int_0^y dx\Bigr]dy = \int_0^1 \Bigl[-1 + \frac{1}{y} - \frac{1}{y}\Bigr]dy = -1,$$
while similarly
$$\int_0^1 \Bigl[\frac{1}{x^2}\int_0^x dy - \int_x^1 \frac{1}{y^2}\,dy\Bigr]dx = 1,$$
which shows that the iterated integrals may not be equal if the condition of Fubini's theorem is violated.
6.2 $\iint_{[0,3]\times[-1,2]} x^2 y\,dm_2 = \int_{-1}^2\!\int_0^3 x^2 y\,dx\,dy = \int_{-1}^2 9y\,dy = \frac{27}{2}$.
6.3 By symmetry it is sufficient to consider $x \ge 0$, $y \ge 0$, and hence the area is $4\,\frac{b}{a}\int_0^a \sqrt{a^2 - x^2}\,dx = ab\pi$.
6.4 Fix $x \in [0,2]$; $\int \mathbf{1}_A(x,y)\,dy = m(A_x)$, hence $f_X(x) = x$ for $x \in [0,1]$, $f_X(x) = 2 - x$ for $x \in (1,2]$ and zero otherwise (triangle distribution). By symmetry, the same holds for $f_Y$.
6.5 $P(X+Y > 4) = P(Y > -X+4) = \iint_A f_{X,Y}(x,y)\,dx\,dy$ where $A = \{(x,y) : y > 4-x\} \cap [0,2]\times[1,4]$, so $P(X+Y > 4) = \int_0^2\!\int_{4-x}^4 \frac{1}{50}(x^2+y^2)\,dy\,dx = \frac{1}{50}\int_0^2 \bigl(\tfrac43 x^3 - 4x^2 + 16x\bigr)dx = \frac{8}{15}$. Similarly $P(Y > X) = \iint_A f_{X,Y}(x,y)\,dx\,dy$ where $A = \{(x,y) : y > x\} \cap [0,2]\times[1,4]$, so we get $\frac{143}{150}$.
6.6 $f_{X+Y}(z) = \int f_{X,Y}(x, z-x)\,dx = \begin{cases} 0 & z < 0 \\ z & 0 \le z \le 1 \\ 2-z & 1 \le z \le 2 \\ 0 & 2 < z \end{cases}$
6.7 By Exercise 6.4, $f_{X,Y}(x,y) \ne f_X(x)f_Y(y)$, so $X, Y$ are not independent.
6.8 $f_{Y+(-X)}(z) = \int_{-\infty}^{\infty} f_Y(y) f_{-X}(z-y)\,dy = \int_{-\infty}^{\infty} \frac12\mathbf{1}_{[0,2]}(y)\,\mathbf{1}_{[-1,0]}(z-y)\,dy$, so
$$f_{Y-X}(z) = \begin{cases} 0 & z < -1 \text{ or } 2 < z \\ \frac12(z+1) & -1 \le z \le 0 \\ \frac12 & 0 \le z \le 1 \\ \frac12(2-z) & 1 \le z \le 2 \end{cases}$$
$f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z-x)\,dx = \int_{-\infty}^{\infty} \mathbf{1}_{[0,1]}(x)\,\frac12\mathbf{1}_{[0,2]}(z-x)\,dx$, hence
$$f_{X+Y}(z) = \begin{cases} 0 & z < 0 \text{ or } 3 < z \\ \frac12 z & 0 \le z \le 1 \\ \frac12 & 1 \le z \le 2 \\ \frac12(3-z) & 2 \le z \le 3 \end{cases}$$
$P(Y > X) = P(Y - X > 0) = \int_0^\infty f_{Y-X}(z)\,dz = \frac12 + \int_1^2 \frac12(2-z)\,dz = \frac12 + \frac14 = \frac34$;
$P(X+Y > 1) = \int_1^\infty f_{X+Y}(z)\,dz = \frac12 + \int_2^3 \frac12(3-z)\,dz = \frac12 + \frac14 = \frac34$.
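Both probabilities in 6.8 can be confirmed by simulation. A quick editorial check with $X$ uniform on $[0,1]$ and $Y$ uniform on $[0,2]$, independent:

```python
import random

random.seed(1)
n = 200_000
hits_diff = hits_sum = 0
for _ in range(n):
    x, y = random.random(), 2 * random.random()
    hits_diff += y > x        # event {Y > X}
    hits_sum += x + y > 1     # event {X + Y > 1}
print(hits_diff / n, hits_sum / n)  # both should be close to 3/4
```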
6.9 $f_X(x) = \int_0^{1-x/2} \mathbf{1}_A\,dy = 1 - \frac{x}{2}$, $h(y|x) = \frac{\mathbf{1}_A(x,y)}{f_X(x)}$, and $\mathbb{E}(Y|X = 1) = \int_0^{1/2} 2y\,dy = \frac14$.
6.10 $f_Y(y) = \int_0^1 (x+y)\,dx = \frac12 + y$, $h(x|y) = \frac{x+y}{\frac12 + y}$, and $\mathbb{E}(X|Y=y) = \int_0^1 \frac{x(x+y)}{\frac12 + y}\,dx = \frac{\frac13 + \frac{y}{2}}{\frac12 + y}$.
Chapter 7
7.1 If $\mu(A) = 0$ then $\lambda_1(A) = 0$ and $\lambda_2(A) = 0$, hence $(\lambda_1 + \lambda_2)(A) = 0$.
7.2 Let $Q$ be a finite partition of $\Omega$ which refines both $P_1$ and $P_2$. Each $A \in P_1$ and each $B \in P_2$ can be written as a disjoint union of the form $A = \bigcup_{i \le m} E_i$, $B = \bigcup_{j \le n} F_j$, where the $E_i$ and $F_j$ are sets in $Q$. Thus $A \cap B = \bigcup_{i,j}(E_i \cap F_j)$ is a finite disjoint union of sets in $Q$, since in each case $E_i \cap F_j$ equals $E_i$ (when $E_i = F_j$) or is empty (sets in $Q$ are disjoint). Thus $Q$ refines the partition $R = \{A \cap B : A \in P_1, B \in P_2\}$. On the other hand, any set in $P_1$ or $P_2$ is obviously a finite union of sets in $R$ (e.g. $A = A \cap \Omega = A \cap \bigl(\bigcup_{B \in P_2} B\bigr)$ for $A \in P_1$) as the partitions are finite.
7.3 We have to assume first that $m(A) \ne 0$. Then $B \subset A$ clearly implies that $\mu$ dominates $\nu$. (In fact $m(B \setminus A) = 0$ is slightly more general.) Then consider the partition $\{B, A \setminus B, \Omega \setminus A\}$ to see that $h = \mathbf{1}_B$. To double-check, $\nu(F) = m(F \cap B) = \int_{F \cap B} \mathbf{1}_B\,dm = \int_{F \cap B} \mathbf{1}_B\,d\mu$.
7.4 Clearly $\mu(\{\omega\}) \ge \nu(\{\omega\})$ is equivalent to $\mu$ dominating $\nu$. For each $\omega$ we have $\frac{d\nu}{d\mu}(\omega) = \frac{\nu(\{\omega\})}{\mu(\{\omega\})}$.
7.5 Since $\nu(E) = \int_E g\,dm$ and we wish to have $\nu(E) = \int_E \frac{d\nu}{d\mu}\,d\mu = \int_E \frac{d\nu}{d\mu} f\,dm$, it is natural to aim at taking $\frac{d\nu}{d\mu}(x) = \frac{g(x)}{f(x)}$. Then a sufficient condition for this to work is that if $A = \{x : f(x) = 0\}$ then $\nu(A) = 0$, i.e. $g(x) = 0$ a.e. on $A$. Then we put $\frac{d\nu}{d\mu}(x) = \frac{g(x)}{f(x)}$ on $A^c$ and $0$ on $A$, and we have $\nu(E) = \int_{E \cap A^c} g\,dm = \int_{E \cap A^c} \frac{d\nu}{d\mu} f\,dm = \int_E \frac{d\nu}{d\mu}\,d\mu$, as required.
7.6 Clearly $\nu \ll \mu$ is equivalent to $A = \{\omega : \mu(\{\omega\}) = 0\} \subset \{\omega : \nu(\{\omega\}) = 0\}$, and then $\frac{d\nu}{d\mu}(\omega) = \frac{\nu(\{\omega\})}{\mu(\{\omega\})}$ on $A^c$ and zero on $A$.
7.7 Since $\nu \ll \mu$ we may write $h = \frac{d\nu}{d\mu}$, so that $\nu(F) = \int_F h\,d\mu$. As $\mu(F) = 0$ if and only if $\nu(F) = 0$, the set $\{h = 0\}$ is both $\mu$-null and $\nu$-null. Thus $h^{-1} = \bigl(\frac{d\nu}{d\mu}\bigr)^{-1}$ is well defined a.s., and we can use (ii) in Proposition 7.9 with $\lambda = \mu$ to conclude that $1 = h^{-1}h$ implies $\frac{d\mu}{d\nu} = h^{-1}$, as required.
7.8 $\delta_0((0,25]) = 0$, but $\frac{1}{25}m|_{[0,25]}((0,25]) = 1$; $\frac{1}{25}m|_{[0,25]}(\{0\}) = 0$ but $\delta_0(\{0\}) = 1$, so neither $P_1 \ll P_2$ nor $P_2 \ll P_1$. Clearly $P_1 \ll P_3$ with $\frac{dP_1}{dP_3}(x) = 2\cdot\mathbf{1}_{\{0\}}(x)$ and $P_2 \ll P_3$ with $\frac{dP_2}{dP_3}(x) = 2\cdot\mathbf{1}_{(0,25]}(x)$.
7.9 $\lambda_a = m|_{[2,3]}$, $\lambda_s = \delta_0 + m|_{[1,2)}$, and $h = \mathbf{1}_{[2,3]}$.
7.10 Suppose $F$ is non-constant at $a_i$ with positive jumps $c_i$, $i = 1,2,\ldots$. Take $M \ne a_i$, with $-M \ne a_i$, and let $I = \{i : a_i \in [-M,M]\}$. Then
$$m_F([-M,M]) = F(M) - F(-M) = \sum_{i \in I} c_i = \sum_{i \in I} m_F(\{a_i\}),$$
which is finite since $F$ is bounded on a bounded interval. So any $A \subset [-M,M] \setminus \bigcup_{i \in I}\{a_i\}$ is $m_F$-null, hence measurable. But the $\{a_i\}$ are $m_F$-measurable, hence each subset of $[-M,M]$ is $m_F$-measurable. Finally, any subset $E$ of $\mathbb{R}$ is a union of sets of the form $E \cap [-M,M]$, so $E$ is $m_F$-measurable as well.
7.11 mF has density f(x) = 2 for x E [0,1] and zero otherwise.
7.12 (a) $|x| = 1 + \int_{-1}^x f(y)\,dy$, where $f(y) = -1$ for $y \in [-1,0]$ and $f(y) = 1$ for $y \in (0,1]$.
(b) Let $1 > \varepsilon > 0$ and take $\delta = \varepsilon^2$. Suppose $\sum_{k=1}^n (y_k - x_k) < \delta$ with $y_k \le x_{k+1}$; since the square root is concave, sliding the intervals together only increases the increments of $\sqrt{x}$, so we may assume $y_n - x_1 = \sum_{k=1}^n (y_k - x_k)$; then
$$\Bigl(\sum_{k=1}^n |\sqrt{x_k} - \sqrt{y_k}|\Bigr)^2 \le (\sqrt{y_n} - \sqrt{x_1})^2 = y_n - 2\sqrt{x_1 y_n} + x_1 \le y_n - x_1 = \sum_{k=1}^n (y_k - x_k) < \varepsilon^2.$$
(c) The Lebesgue function $f$ is a.e. differentiable with $f' = 0$ a.e. If it were absolutely continuous, it could be written as $f(x) = \int_0^x f'(y)\,dy = 0$, a contradiction.
7.13 (a) If $F$ is monotone increasing on $[a,b]$, $\sum_{i=1}^k |F(x_i) - F(x_{i-1})| = F(b) - F(a)$ for any partition $a = x_0 < x_1 < \cdots < x_k = b$. Hence $T_F[a,b] = F(b) - F(a)$.
(b) If $F \in BV[a,b]$ we can write $F = F_1 - F_2$ where both $F_1$, $F_2$ are monotone increasing, hence have only countably many points of discontinuity. So $F$ is continuous a.e. and thus Lebesgue-measurable.
(c) $f(x) = x^2 \cos\frac{\pi}{x^2}$ for $x \ne 0$ and $f(0) = 0$ is differentiable but does not belong to $BV[0,1]$.
(d) If $F$ is Lipschitz, $\sum_{i=1}^k |F(x_i) - F(x_{i-1})| \le M\sum_{i=1}^k |x_i - x_{i-1}| = M(b-a)$ for any partition, so $T_F[a,b] \le M(b-a)$ is finite.
7.14 Recall that $\nu^+(E) = \nu(E \cap B)$, where $B$ is the positive set in the Hahn decomposition. As in the hint, if $G \subseteq F$, $\nu(G) \le \nu^+(G) = \nu(G \cap B) \le \nu(G \cap B) + \nu((F \cap B) \setminus (G \cap B)) = \nu(F \cap B)$: since the set $(F \cap B) \setminus (G \cap B) \subseteq B$, its $\nu$-measure is non-negative. But $F \cap B \subseteq F$, so $\sup\{\nu(G) : G \subseteq F\}$ is attained and equals $\nu(F \cap B) = \nu^+(F)$. A similar argument shows $\nu^-(F) = \sup\{-\nu(G)\} = -\inf_{G \subseteq F}\{\nu(G)\}$.
7.15 For all $F \in \mathcal{F}$, $\nu^+(F) = \int_{B \cap F} f\,d\mu = \sup_{G \subset F} \int_G f\,d\mu$. If $f > 0$ on a set $C \subset A \cap F$ with $\mu(C) > 0$, then $\int_C f\,d\mu > 0$, so that $\int_C f\,d\mu + \int_{B \cap F} f\,d\mu > \sup_{G \subset F} \int_G f\,d\mu$. This is a contradiction since $C \cup (B \cap F) \subset F$. So $f \le 0$ a.s. ($\mu$) on $A \cap F$. We can take the set $\{f = 0\}$ into $B$, since it does not affect the integrals. Hence $\{f < 0\} \subset A$ and $\{f \ge 0\} \subset B$. But the two smaller sets partition $\Omega$, so we have equality in both cases.
Hence $f^+ = f\mathbf{1}_B$ and $f^- = -f\mathbf{1}_A$, and for all $F \in \mathcal{F}$
$$\nu^+(F) = \nu(B \cap F) = \int_{B \cap F} f\,d\mu = \int_F f^+\,d\mu,$$
$$\nu^-(F) = -\nu(A \cap F) = -\int_{A \cap F} f\,d\mu = \int_F f^-\,d\mu.$$
7.16 $f \in L^1(\nu)$ if and only if both $\int f^+\,d\nu$ and $\int f^-\,d\nu$ are finite. Then $\int_E f^+ g\,d\mu$ and $\int_E f^- g\,d\mu$ are well defined and finite, and their difference is $\int_E fg\,d\mu$. So $fg \in L^1(\mu)$, as $\int_E (f^+ + f^-)|g|\,d\mu < \infty$. Conversely, if $fg \in L^1(\mu)$ then both $\int_E f^+|g|\,d\mu$ and $\int_E f^-|g|\,d\mu$ are finite, hence so is their difference $\int_E f\,d\nu$.
7.17 (a) $\mathbb{E}(X|\mathcal{G})(\omega) = \omega$ if $\omega \in [0,\frac12]$, $\mathbb{E}(X|\mathcal{G})(\omega) = \frac34$ otherwise.
(b) $\mathbb{E}(X|\mathcal{G})(\omega) = \frac14$ if $\omega \in [0,\frac12]$, $\mathbb{E}(X|\mathcal{G})(\omega) = \frac34$ otherwise.
7.18 $\mathbb{E}(X_n|\mathcal{F}_{n-1}) = \mathbb{E}(Z_1 Z_2 \cdots Z_n|\mathcal{F}_{n-1}) = Z_1 Z_2 \cdots Z_{n-1}\mathbb{E}(Z_n|\mathcal{F}_{n-1})$, and since $Z_n$ is independent of $\mathcal{F}_{n-1}$, $\mathbb{E}(Z_n|\mathcal{F}_{n-1}) = \mathbb{E}(Z_n) = 1$, hence the result.
7.19 $\mathbb{E}(X_n) = n\mu \ne \mu = \mathbb{E}(X_1)$, so $(X_n)$ is not a martingale. Clearly $Y_n = X_n - n\mu$ is a martingale.
7.20 Apply Jensen's inequality to $X_n = Z_1 + Z_2 + \cdots + Z_n$. We obtain
$$\mathbb{E}(X_n^2|\mathcal{F}_{n-1}) \ge (\mathbb{E}(X_n|\mathcal{F}_{n-1}))^2 = X_{n-1}^2,$$
so $(X_n^2)$ is a submartingale. By (7.5) the compensator $(A_n)$ satisfies $\Delta A_n = \mathbb{E}((\Delta X_n)^2|\mathcal{F}_{n-1})$ for each $n$. But here $\Delta X_n = Z_n$ and $Z_n^2 \equiv 1$. So $\Delta A_n = 1$ and hence $A_n = n$ is deterministic.
7.21 For $s < t$, since the increments are independent and $w(t) - w(s)$ has the same distribution as $w(t-s)$,
$$\mathbb{E}\bigl(\exp(-\sigma w(t) - \tfrac12\sigma^2 t)\,\big|\,\mathcal{F}_s\bigr) = e^{-\sigma w(s) - \frac12\sigma^2 t}\,\mathbb{E}\bigl(\exp(-\sigma(w(t)-w(s)))\,\big|\,\mathcal{F}_s\bigr)$$
$$= e^{-\sigma w(s) - \frac12\sigma^2 t}\,\mathbb{E}\bigl(\exp(-\sigma(w(t)-w(s)))\bigr) = e^{-\sigma w(s) - \frac12\sigma^2 t}\,\mathbb{E}\bigl(\exp(-\sigma w(t-s))\bigr).$$
Now $\sigma w(t-s) \sim N(0, \sigma^2(t-s))$, so the expectation equals $\mathbb{E}(e^{-\sigma\sqrt{t-s}\,Z}) = e^{\frac12\sigma^2(t-s)}$ (where $Z \sim N(0,1)$), and so the result follows.
Chapter 8
8.1 (a) $f_n = \mathbf{1}_{[n,\, n+\frac1n]}$ converges to $0$ in $L^p$, pointwise and a.e., but not uniformly.
(b) $f_n = n\mathbf{1}_{[0,\frac1n]} - n\mathbf{1}_{[-\frac1n,0]}$ converges to $0$ pointwise and a.e. It converges neither in $L^p$ nor uniformly.
8.2 We have $\Omega = [0,1]$ with Lebesgue measure. The sequences $X_n = \mathbf{1}_{(0,\frac1n)}$, $g_n = n\mathbf{1}_{(0,\frac1n]}$ converge to $0$ in probability, since $P(|X_n| > \varepsilon) \le \frac1n$ and the same holds for the sequence $(g_n)$.
8.3 There are endless possibilities, the simplest being Xn(w) == 1 (but this
sequence converges to 1) or, to make sure that it does not converge to
anything, Xn(W) == n.
8.4 Let $X_n = 1$ indicate 'heads' and $X_n = 0$ 'tails'; then $\frac{S_{100}}{100}$ is the average number of 'heads' in 100 tosses. Clearly $\mathbb{E}(X_n) = \frac12$, $\mathbb{E}\bigl(\frac{S_{100}}{100}\bigr) = \frac12$, $\mathrm{Var}(X_n) = \frac14$, $\mathrm{Var}\bigl(\frac{S_{100}}{100}\bigr) = \frac{1}{100^2}\cdot 100 \cdot \frac14 = \frac{1}{400}$, so
$$P\Bigl(\Bigl|\frac{S_{100}}{100} - \frac12\Bigr| \ge 0.1\Bigr) \le \frac{1}{0.1^2 \cdot 400} = \frac14$$
and
$$P\Bigl(\Bigl|\frac{S_{100}}{100} - \frac12\Bigr| < 0.1\Bigr) \ge 1 - \frac{1}{0.1^2 \cdot 400} = \frac34.$$
8.5 Let $X_n$ be the number shown on the die; $\mathbb{E}(X_n) = 3.5$, $\mathrm{Var}(X_n) = \frac{35}{12} \approx 2.9167$, and in the same way $P\bigl(\bigl|\frac{S_{1000}}{1000} - 3.5\bigr| < 0.1\bigr) \ge 1 - \frac{2.9167}{1000 \cdot 0.1^2} \approx 0.7083$.
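The Chebyshev arithmetic in 8.4 and 8.5 is easy to package. A small helper (an editorial sketch, not from the text; the die case assumes $n = 1000$ rolls):

```python
def chebyshev_lower_bound(var, n, eps):
    """Chebyshev lower bound for P(|S_n/n - mu| < eps) given Var(X_1) = var."""
    return 1 - var / (n * eps * eps)

coin = chebyshev_lower_bound(0.25, 100, 0.1)     # Exercise 8.4: gives 3/4
die = chebyshev_lower_bound(35 / 12, 1000, 0.1)  # Exercise 8.5: about 0.7083
print(coin, die)
```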
8.6 The union $\bigcup_{m=n}^\infty A_m$ is equal to $[0,1]$ for all $n$, and so is $\limsup_{n\to\infty} A_n$.
8.7 Let $d = 1$. There are $\binom{2n}{n}$ paths that return to $0$, so $P(S_{2n} = 0) = \binom{2n}{n}\frac{1}{2^{2n}}$. Now
$$\frac{(2n)!}{(n!)^2} \sim \frac{\bigl(\frac{2n}{e}\bigr)^{2n}\sqrt{2\pi\cdot 2n}}{\bigl(\frac{n}{e}\bigr)^{2n}\,2\pi n} = \frac{2^{2n}}{\sqrt{\pi n}},$$
so $P(S_{2n} = 0) \sim \frac{c}{\sqrt{n}}$ with $c = \frac{1}{\sqrt{\pi}}$. Hence $\sum_{n=1}^\infty P(A_n)$ diverges and the second Borel–Cantelli lemma applies (as the $(A_n)$ are independent), so that $P(S_{2n} = 0 \text{ i.o.}) = 1$. The same holds for $d = 2$, since $P(A_n) \sim \frac{c}{n}$. But for $d > 2$, $P(A_n) \sim \frac{c}{n^{d/2}}$, the series converges, and by the first Borel–Cantelli lemma $P(S_{2n} = 0 \text{ i.o.}) = 0$.
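The Stirling asymptotics used in 8.7 can be sanity-checked with exact binomial coefficients (an added illustration, not part of the solution):

```python
import math

# P(S_2n = 0) = C(2n, n)/4^n versus the asymptotic 1/sqrt(pi*n).
for n in (10, 100, 1000):
    exact = math.comb(2 * n, n) / 4**n
    approx = 1 / math.sqrt(math.pi * n)
    print(n, exact, approx, exact / approx)  # the ratio tends to 1
```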
8.8 Write $S = S_{1000}$; $P(|S - 500| < 10) = P\bigl(\frac{|S - 500|}{\sqrt{250}} < 0.63\bigr) = N(0.63) - N(-0.63) \approx 0.4729$, where $N$ is the cumulative distribution function corresponding to the standard normal distribution.
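The normal-distribution value quoted in 8.8 can be reproduced with the error function (editorial check; $N(x) = \frac12(1+\operatorname{erf}(x/\sqrt2))$):

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 10 / math.sqrt(250)                  # about 0.63, as in the solution
p = std_normal_cdf(z) - std_normal_cdf(-z)
print(z, p)                              # p is close to the quoted 0.4729
```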
8.9 The condition on $n$ is $P\bigl(\bigl|\frac{S_n}{n} - 0.5\bigr| < 0.005\bigr) = P\Bigl(\frac{|S_n - \frac{n}{2}|}{\sqrt{n/4}} < 0.01\sqrt{n}\Bigr) \ge 0.99$, $N(0.01\sqrt{n}) \ge 0.995 = N(2.575835)$, hence $n \ge 66350$.
8.10 Write $x_n = e^{\sigma\sqrt{T/n}}$. Then
$$\frac12(\ln U_n + \ln D_n) = \frac12\ln(U_n D_n) = \frac12\ln\frac{(2R_n x_n)^2}{(1+x_n^2)^2} = \ln R_n - \ln\frac{1+x_n^2}{2x_n}.$$
So it suffices to show that the last term on the right is $\frac{\sigma^2 T}{2n} + o\bigl(\frac1n\bigr)$. But
$$\frac{1+x_n^2}{2x_n} = \frac{x_n^{-1} + x_n}{2} = \frac{e^{-\sigma\sqrt{T/n}} + e^{\sigma\sqrt{T/n}}}{2} = \cosh\bigl(\sigma\sqrt{T/n}\bigr) = 1 + \frac{\sigma^2 T}{2n} + o\Bigl(\frac1n\Bigr),$$
so that
$$\ln\frac{1+x_n^2}{2x_n} = \ln\Bigl(1 + \frac{\sigma^2 T}{2n} + o\Bigl(\frac1n\Bigr)\Bigr) = \frac{\sigma^2 T}{2n} + o\Bigl(\frac1n\Bigr).$$
Appendix
Existence of non-measurable and non-Borel sets
In Chapter 2 we defined the $\sigma$-field $\mathcal{B}$ of Borel sets and the larger $\sigma$-field $\mathcal{M}$ of Lebesgue-measurable sets, and all our subsequent analysis of the Lebesgue integral and its properties involved these two families of subsets of $\mathbb{R}$. The set inclusions
$$\mathcal{B} \subset \mathcal{M} \subset P(\mathbb{R})$$
are trivial; however, it is not at all obvious at first sight that they are strict, i.e. that there are sets in $\mathbb{R}$ which are not Lebesgue-measurable, and that there are Lebesgue-measurable sets which are not Borel sets. In this appendix we construct examples of such sets. Using the fact that $A \subset \mathbb{R}$ is measurable (resp. Borel-measurable) iff its indicator function $\mathbf{1}_A$ is $\mathcal{M}$-measurable (resp. $\mathcal{B}$-measurable), it follows that we will automatically have examples of non-measurable (resp. measurable but not Borel) functions.
The construction of a non-measurable set requires some set-theoretic preparation. This takes the form of an axiom which, while not needed for the consistent development of set theory, nevertheless enriches that theory considerably.
Its truth or falsehood cannot be proved from the standard axioms on which
modern set theory is based, but we shall accept its validity as an axiom, without
delving further into foundational matters.
The axiom of choice
Suppose that $\mathcal{A} = \{A_\alpha : \alpha \in \Gamma\}$ is a non-empty collection, indexed by some set $\Gamma$, of non-empty disjoint subsets of a fixed set $\Omega$. Then there exists a set $E \subset \Omega$ which contains precisely one element from each of the sets $A_\alpha$, i.e. there is a choice function $f : \Gamma \to \mathcal{A}$.
Remark
The Axiom may seem innocuous enough, yet it can be shown to be independent of the (Zermelo–Fraenkel) axioms of set theory. If the collection $\mathcal{A}$ has only finitely many members there is no problem in finding a choice function, of course.
To construct our example of a non-measurable set, first define the following equivalence relation on $[0,1]$: $x \sim y$ if $y - x$ is a rational number (which will be in $[-1,1]$). This relation is easily seen to be reflexive, symmetric and transitive. Hence it partitions $[0,1]$ into disjoint equivalence classes $(A_\alpha)$, where for each $\alpha$, any two elements $x, y$ of $A_\alpha$ differ by a rational, while elements of different classes always differ by an irrational. Thus each $A_\alpha$ is countable, since $\mathbb{Q}$ is, but there are uncountably many different classes, as $[0,1]$ is uncountable.
Now use the axiom of choice to construct a new set $E \subset [0,1]$ which contains exactly one member $a_\alpha$ from each of the $A_\alpha$. Now enumerate the rationals in $[-1,1]$: there are only countably many, so we can order them as a sequence $(q_n)$. Define a sequence of translates of $E$ by $E_n = E + q_n$. If $E$ is Lebesgue-measurable, then so is each $E_n$ and their measures are the same, by Proposition 2.15.
But the $(E_n)$ are disjoint: to see this, suppose that $z \in E_m \cap E_n$ for some $m \ne n$. Then we can write $a_\alpha + q_m = z = a_\beta + q_n$ for some $a_\alpha, a_\beta \in E$, and their difference $a_\alpha - a_\beta = q_n - q_m$ is rational. Since $E$ contains only one element from each class, $\alpha = \beta$ and therefore $m = n$. Thus $\bigcup_{n=1}^\infty E_n$ is a disjoint union containing $[0,1]$.
Thus we have $[0,1] \subset \bigcup_{n=1}^\infty E_n \subset [-1,2]$ and $m(E_n) = m(E)$ for all $n$. By countable additivity and monotonicity of $m$ this implies
$$1 = m([0,1]) \le \sum_{n=1}^\infty m(E_n) = m(E) + m(E) + \cdots \le 3.$$
This is clearly impossible, since the sum must be either $0$ or $\infty$. Hence we must conclude that $E$ is not measurable.
For an example of a measurable set that is not Borel, let $C$ denote the Cantor set, and define the Cantor function $f : [0,1] \to C$ as follows: for $x \in [0,1]$ write $x = 0.a_1a_2\ldots$ in binary form, i.e. $x = \sum_{n=1}^\infty \frac{a_n}{2^n}$, where each $a_n = 0$ or $1$ (taking non-terminating expansions where the choice exists). The function $x \mapsto a_n$ is determined by a system of finitely many binary intervals (i.e. the value of $a_n$ is fixed by $x$ satisfying finitely many linear inequalities) and so is measurable; hence so is the function $f$ given by $f(x) = \sum_{n=1}^\infty \frac{2a_n}{3^n}$. Since all the terms of $y = \sum_{n=1}^\infty \frac{2a_n}{3^n}$ have numerators $0$ or $2$, it follows that the range $R_f$ of $f$ is a subset of $C$. Moreover, the value of $y$ determines the sequence $(a_n)$, and hence $x$, uniquely, so that $f$ is invertible.
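The map $x \mapsto \sum 2a_n/3^n$ described above is easy to approximate in code. The sketch below (an editorial addition; floating point stands in for the exact non-terminating expansions) extracts binary digits and reassembles them with ternary weights.

```python
def cantor_map(x, digits=40):
    """Approximate f(x) = sum 2*a_n/3^n from the binary digits a_n of x."""
    y, power = 0.0, 1.0
    for _ in range(digits):
        power /= 3.0
        x *= 2.0
        if x >= 1.0:       # binary digit a_n = 1
            x -= 1.0
            y += 2.0 * power
    return y

# x = 1/3 = 0.010101..._2, so f(1/3) = 2*(1/9 + 1/81 + ...) = 1/4.
print(cantor_map(1 / 3))
```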
Now consider the image in C of the non-measurable set E constructed
above, i.e. let B = f(E). Then B is a subset of the null set C, hence by the
completeness of m it is also measurable and null. On the other hand, E =
f-1(B) is non-measurable. We show that this situation is incompatible with B
being a Borel set.
Given a set $B \in \mathcal{B}$ and a measurable function $g$, $g^{-1}(B)$ must be measurable. For, by definition of measurable functions, $g^{-1}(I)$ is measurable for every interval $I$, and we have
$$g^{-1}(\mathbb{R} \setminus A) = \mathbb{R} \setminus g^{-1}(A), \qquad g^{-1}\Bigl(\bigcup_{i=1}^\infty A_i\Bigr) = \bigcup_{i=1}^\infty g^{-1}(A_i),$$
quite generally for any sets and functions. Hence the collection of sets whose inverse images under the measurable function $g$ are again measurable forms a $\sigma$-field containing the intervals, and hence it also contains all Borel sets. But we have found a measurable function $f$ and a Lebesgue-measurable set $B$ for which $f^{-1}(B) = E$ is not measurable. Therefore the measurable set $B$ cannot be a Borel set, i.e. the inclusion $\mathcal{B} \subset \mathcal{M}$ is strict.
Index
a.e., 55
- convergence, 242
a.s., 56
absolutely continuous, 189
- function, 109, 204
- measure, 107
adapted, 223
additive
- countably, 27
additivity
- countable, 29
- finite, 39
- of measure, 35
almost everywhere, 55
almost surely, 56
American option, 72
angle, 138
Banach space, 136
Beppo-Levi
- theorem, 95
Bernstein
- polynomials, 250
Bernstein-Weierstrass
- theorem, 250
binomial
- tree, 50
Black-Scholes
- formula, 119
- model, 118
Borel
- function, 57
- measure, regular, 44
- set, 40
Borel-Cantelli lemma
- first, 256
- second, 258
bounded variation, 207
Brownian motion, 234
BV[a, b], 207
call option, 70
- down-and-out, 72
call-put parity, 117
Cantor
- function, 302
- set, 19
Cauchy
- density, 108
- sequence, 11, 128
central limit theorem, 276, 280
central moment, 146
centred
- random variable, 151
characteristic function, 116, 272
Chebyshev's inequality, 247
complete
- measure space, 43
- space, 128
completion, 43
concentrated, 197
conditional
- expectation, 153, 178, 179, 219
- probability, 47
contingent claim, 71
continuity of measure, 39
continuous
- absolutely, 204
convergence
- almost everywhere, 242
- in pth mean, 144
- in LP, 242
- in probability, 245
- pointwise, 242
- uniform, 11, 241
- weak, 268
correlation, 138, 151
countable
- additivity, 27, 29
covariance, 151
cover, 20
de Moivre-Laplace theorem, 280
de Morgan's laws, 3
density, 107
- Cauchy, 108
- Gaussian, 107, 174
- joint, 173
- normal, 107, 174
- triangle, 107
derivative
- Radon-Nikodym, 194
derivative security, 71
- European, 71
Dirac measure, 68
direct sum, 139
distance, 126
distribution
- function, 109, 110, 199
- gamma, 109
- geometric, 69
- marginal, 174
- Poisson, 69
- triangle, 107
- uniform, 107
dominated convergence theorem, 92
dominating measure, 190
Doob decomposition, 227
essential
- infimum, 66
- supremum, 66, 140
essentially bounded, 141
event, 47
eventually, 256
exotic option, 72
expectation
- conditional, 153, 178, 179, 219
- of random variable, 114
Fatou's lemma, 82
filtration, 51, 223
- natural, 223
first hitting time, 231
formula
- inversion, 180
Fourier series, 140
Fubini's theorem, 171
function
- Borel, 57
- Cantor, 302
- characteristic, 116, 272
- Dirichlet, 99
- essentially bounded, 141
- integrable, 86
- Lebesgue, 20
- Lebesgue measurable, 57
- simple, 76
- step, 102
fundamental theorem of calculus, 9, 97,
215
futures, 71
gamma distribution, 109
Gaussian density, 107, 174
geometric distribution, 69
Hölder inequality, 142
Hahn-Jordan decomposition, 211, 217
Helly's theorem, 270
Hilbert space, 136, 138
i.o., 255
identically distributed, 244
independent
- events, 48, 49
- random variables, 70, 244
- σ-fields, 49
indicator function, 4, 59
inequality
- Chebyshev, 247
- Hölder, 142
- Jensen, 220
- Kolmogorov, 262
- Minkowski, 143
- Schwarz, 132, 143
- triangle, 126
infimum, 6
infinitely often, 255
inner product, 135, 136
- space, 136
integrable function, 86
integral
- improper Riemann, 100
- Lebesgue, 77, 87
- of a simple function, 76
- Riemann, 7
invariance
- translation, 35
inversion formula, 180
Ito isometry, 229
Jensen inequality, 220
joint density, 173
Kolmogorov inequality, 262
L^1(E), 87
L^2(E), 131
L^p convergence, 242
L^p(E), 140
L^∞(E), 141
law of large numbers
- strong, 260, 266
- weak, 249
Lebesgue
- decomposition, 197
- function, 20
- integral, 76, 87
- measurable set, 27
- measure, 35
Lebesgue-Stieltjes
- measurable, 202
- measure, 199
lemma
- Borel-Cantelli, 256
- Fatou, 82
- Riemann-Lebesgue, 104
Lévy's theorem, 274
liminf,6
limsup,6
Lindeberg-Feller theorem, 276
lower limit, 256
lower sum, 77
marginal distribution, 174
martingale, 223
- transform, 228
mean value theorem, 81
measurable
- function, 57
- Lebesgue-Stieltjes, 202
- set, 27
- space, 189
measure, 29
- absolutely continuous, 107
- Dirac, 68
- F-outer, 201
- Lebesgue, 35
- Lebesgue-Stieltjes, 199
- outer, 20, 45
- probability, 46
- product, 164
- regular, 44
- σ-finite, 162
- signed, 209, 210
- space, 29
measures
- mutually singular, 197
metric, 126
Minkowski inequality, 143
model
- binomial, 50
- Black-Scholes, 118
- CRR,234
moment, 146
monotone class, 164
- theorem, 165
monotone convergence theorem, 84
monotonicity
- of integral, 81
- of measure, 21, 35
Monte Carlo method, 251
mutually singular measures, 197
negative part, 63
negative variation, 207
norm, 126
normal density, 107, 174
null set, 16
option
- American, 72
- European, 70
- exotic, 72
- lookback, 72
orthogonal, 137-139
orthonormal
- basis, 140
- set, 140
outer measure, 20, 45
parallelogram law, 136
partition, 190
path, 50
pointwise convergence, 242
Poisson distribution, 69
polarisation identity, 136
portfolio, 183
positive part, 63
positive variation, 207
power set, 2
predictable, 226
probability, 46
- conditional, 47
- distribution, 68
- measure, 46
- space, 46
probability space
- filtered, 223
process
- stochastic, 222
- stopped, 231
product
- measure, 164
- σ-field, 160
Prokhorov's theorem, 272
put option, 72
Radon-Nikodym
- derivative, 194
- theorem, 190, 195
random time, 230
random variable, 66
- centred, 151
rectangle, 3
refinement, 7, 190
replication, 233
return, 183
Riemann
- integral, 7
-- improper, 100
Riemann's criterion, 8
Riemann-Lebesgue lemma, 104
Schwarz inequality, 132, 143
section, 162, 169
sequence
- Cauchy, 11, 128
- tight, 272
set
- Borel, 40
- Cantor, 19
- Lebesgue-measurable, 27
- null, 16
σ-field, 29
- generated, 40
- - by random variable, 67
- product, 160
σ-finite measure, 162
signed measure, 209, 210
simple function, 76
Skorokhod representation theorem, 110,
269
space
- Banach, 136
- complete, 128
- Hilbert, 136, 138
- inner product, 136
- L2(E), 131
- L^p(E), 140
- measurable, 189
- measure, 29
- probability, 46
standard normal distribution, 114
step function, 102
stochastic
- integral, discrete, 228
- process, discrete, 222
stopped process, 231
stopping time, 230
strong law of large numbers, 266
subadditivity, 24
submartingale, 224
summable, 218
supermartingale, 224
supremum, 6
symmetric difference, 36
theorem
- Beppo-Levi, 95
- Bernstein-Weierstrass approximation,
250
- central limit, 276, 280
- de Moivre-Laplace, 280
- dominated convergence, 92
- Fubini, 171
- fundamental of calculus, 9, 97, 215
- Helly, 270
- intermediate value, 6
- Lévy, 274
- Lindeberg-Feller, 276
- mean value, 81
- Miller-Modigliani, 118
- monotone class, 165
- monotone convergence, 84
- Prokhorov, 272
- Radon-Nikodym, 190, 195
- Skorokhod representation, 110, 269
tight sequence, 272
total variation, 207
translation invariance
- of measure, 35
- of outer measure, 26
triangle inequality, 126
triangular array, 282
uncorrelated random variables, 151
uniform convergence, 11, 241
uniform distribution, 107
upper limit, 255
upper sum, 77
variance, 147
variation
- bounded, 207
- function, 207
- negative, 213
- positive, 213
- total, 207, 212
weak
- convergence, 268
- law of large numbers, 249
Wiener process, 234