Mathematical Statistics: Basic Ideas and Selected Topics, Volume II
Mathematical Statistics: Basic Ideas and Selected Topics, Volume II presents important statistical concepts, methods, and tools not covered in the authors' previous volume. This second volume focuses on inference in non- and semiparametric models. It not only reexamines the procedures introduced in the first volume from a more sophisticated point of view but also addresses new problems originating from the analysis of estimation of functions and other complex decision procedures and large-scale data analysis.

The book covers asymptotic efficiency in semiparametric models from the Le Cam and Fisherian points of view as well as some finite sample size optimality criteria based on Lehmann–Scheffé theory. It develops the theory of semiparametric maximum likelihood estimation with applications to areas such as survival analysis. It also discusses methods of inference based on sieve models and asymptotic testing theory. The remainder of the book is devoted to model and variable selection, Monte Carlo methods, nonparametric curve estimation, and prediction, classification, and machine learning topics.

Features
• Develops basic asymptotic tools, including weak convergence for random processes, empirical process theory, and the functional delta method
• Discusses the classical theory of statistical optimality in a decision-theoretic context
• Presents inference procedures and their properties in a variety of applications and models, such as Cox's regression model, models for censored data, and partial linear models
• Describes properties of Monte Carlo/simulation-based methods, including the bootstrap and Markov chain Monte Carlo (MCMC)
• Examines the nonparametric estimation of functions of one or more variables
• Covers many topics related to statistical learning, including support vector machines and classification and regression trees (CART)

Using the tools and methods developed in this textbook, you will be ready for advanced research in modern statistics. Numerous examples and problems illustrate statistical modeling and inference concepts. Measure theory is not required for understanding.
ISBN: 978-1-4987-2268-1
www.crcpress.com
Peter J. Bickel
Kjell A. Doksum

Mathematical Statistics: Basic Ideas and Selected Topics, Volume II
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Francesca Dominici, Harvard School of Public Health, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Statistical Theory: A Concise Introduction
F. Abramovich and Y. Ritov
Practical Multivariate Analysis, Fifth Edition
A. Afifi, S. May, and V.A. Clark
Practical Statistics for Medical Research
D.G. Altman
Interpreting Data: A First Course
in Statistics
A.J.B. Anderson
Introduction to Probability with R
K. Baclawski
Linear Algebra and Matrix Analysis for
Statistics
S. Banerjee and A. Roy
Mathematical Statistics: Basic Ideas and
Selected Topics, Volume I, Second Edition
P. J. Bickel and K. A. Doksum
Mathematical Statistics: Basic Ideas and
Selected Topics, Volume II
P. J. Bickel and K. A. Doksum
Analysis of Categorical Data with R
C. R. Bilder and T. M. Loughin
Statistical Methods for SPC and TQM
D. Bissell
Introduction to Probability
J. K. Blitzstein and J. Hwang
Bayesian Methods for Data Analysis,
Third Edition
B.P. Carlin and T.A. Louis
Statistics in Research and Development, Second Edition
R. Caulcutt
The Analysis of Time Series: An Introduction,
Sixth Edition
C. Chatfield
Introduction to Multivariate Analysis
C. Chatfield and A.J. Collins
Problem Solving: A Statistician’s Guide,
Second Edition
C. Chatfield
Statistics for Technology: A Course in Applied
Statistics, Third Edition
C. Chatfield
Bayesian Ideas and Data Analysis: An
Introduction for Scientists and Statisticians
R. Christensen, W. Johnson, A. Branscum,
and T.E. Hanson
Modelling Binary Data, Second Edition
D. Collett
Modelling Survival Data in Medical Research,
Third Edition
D. Collett
Introduction to Statistical Methods for
Clinical Trials
T.D. Cook and D.L. DeMets
Applied Statistics: Principles and Examples
D.R. Cox and E.J. Snell
Multivariate Survival Analysis and Competing
Risks
M. Crowder
Statistical Analysis of Reliability Data
M.J. Crowder, A.C. Kimber,
T.J. Sweeting, and R.L. Smith
An Introduction to Generalized
Linear Models, Third Edition
A.J. Dobson and A.G. Barnett
Nonlinear Time Series: Theory, Methods, and
Applications with R Examples
R. Douc, E. Moulines, and D.S. Stoffer
Introduction to Optimization Methods and
Their Applications in Statistics
B.S. Everitt
Extending the Linear Model with R:
Generalized Linear, Mixed Effects and
Nonparametric Regression Models
J.J. Faraway
Linear Models with R, Second Edition
J.J. Faraway
A Course in Large Sample Theory
T.S. Ferguson
Multivariate Statistics: A Practical
Approach
B. Flury and H. Riedwyl
Readings in Decision Analysis
S. French
Markov Chain Monte Carlo:
Stochastic Simulation for Bayesian Inference,
Second Edition
D. Gamerman and H.F. Lopes
Bayesian Data Analysis, Third Edition
A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson,
A. Vehtari, and D.B. Rubin
Multivariate Analysis of Variance and
Repeated Measures: A Practical Approach for
Behavioural Scientists
D.J. Hand and C.C. Taylor
Practical Longitudinal Data Analysis
D.J. Hand and M. Crowder
Logistic Regression Models
J.M. Hilbe
Richly Parameterized Linear Models:
Additive, Time Series, and Spatial Models
Using Random Effects
J.S. Hodges
Statistics for Epidemiology
N.P. Jewell
Stochastic Processes: An Introduction,
Second Edition
P.W. Jones and P. Smith
The Theory of Linear Models
B. Jørgensen
Principles of Uncertainty
J.B. Kadane
Graphics for Statistics and Data Analysis with R
K.J. Keen
Mathematical Statistics
K. Knight
Introduction to Multivariate Analysis:
Linear and Nonlinear Modeling
S. Konishi
Nonparametric Methods in Statistics with SAS
Applications
O. Korosteleva
Modeling and Analysis of Stochastic Systems,
Second Edition
V.G. Kulkarni
Exercises and Solutions in Biostatistical Theory
L.L. Kupper, B.H. Neelon, and S.M. O’Brien
Exercises and Solutions in Statistical Theory
L.L. Kupper, B.H. Neelon, and S.M. O’Brien
Design and Analysis of Experiments with R
J. Lawson
Design and Analysis of Experiments with SAS
J. Lawson
A Course in Categorical Data Analysis
T. Leonard
Statistics for Accountants
S. Letchford
Introduction to the Theory of Statistical
Inference
H. Liero and S. Zwanzig
Statistical Theory, Fourth Edition
B.W. Lindgren
Stationary Stochastic Processes: Theory and
Applications
G. Lindgren
Statistics for Finance
E. Lindström, H. Madsen, and J. N. Nielsen
The BUGS Book: A Practical Introduction to
Bayesian Analysis
D. Lunn, C. Jackson, N. Best, A. Thomas, and
D. Spiegelhalter
Introduction to General and Generalized
Linear Models
H. Madsen and P. Thyregod
Time Series Analysis
H. Madsen
Pólya Urn Models
H. Mahmoud
Randomization, Bootstrap and Monte Carlo
Methods in Biology, Third Edition
B.F.J. Manly
Introduction to Randomized Controlled
Clinical Trials, Second Edition
J.N.S. Matthews
Statistical Methods in Agriculture and
Experimental Biology, Second Edition
R. Mead, R.N. Curnow, and A.M. Hasted
Statistics in Engineering: A Practical Approach
A.V. Metcalfe
Statistical Inference: An Integrated Approach,
Second Edition
H. S. Migon, D. Gamerman, and
F. Louzada
Beyond ANOVA: Basics of Applied Statistics
R.G. Miller, Jr.
A Primer on Linear Models
J.F. Monahan
Decision Analysis: A Bayesian Approach
J.Q. Smith
Elements of Simulation
B.J.T. Morgan
Applied Statistics: Handbook of GENSTAT
Analyses
E.J. Snell and H. Simpson
Applied Stochastic Modelling, Second Edition
B.J.T. Morgan
Probability: Methods and Measurement
A. O’Hagan
Introduction to Statistical Limit Theory
A.M. Polansky
Applied Bayesian Forecasting and Time Series
Analysis
A. Pole, M. West, and J. Harrison
Time Series: Modeling, Computation, and Inference
R. Prado and M. West
Introduction to Statistical Process Control
P. Qiu
Sampling Methodologies with Applications
P.S.R.S. Rao
A First Course in Linear Model Theory
N. Ravishanker and D.K. Dey
Essential Statistics, Fourth Edition
D.A.G. Rees
Stochastic Modeling and Mathematical
Statistics: A Text for Statisticians and
Quantitative Scientists
F.J. Samaniego
Statistical Methods for Spatial Data Analysis
O. Schabenberger and C.A. Gotway
Bayesian Networks: With Examples in R
M. Scutari and J.-B. Denis
Large Sample Methods in Statistics
P.K. Sen and J. da Motta Singer
Spatio-Temporal Methods in Environmental
Epidemiology
G. Shaddick and J.V. Zidek
Analysis of Failure and Survival Data
P. J. Smith
Applied Nonparametric Statistical Methods,
Fourth Edition
P. Sprent and N.C. Smeeton
Data Driven Statistical Methods
P. Sprent
Generalized Linear Mixed Models:
Modern Concepts, Methods and Applications
W. W. Stroup
Survival Analysis Using S: Analysis of
Time-to-Event Data
M. Tableman and J.S. Kim
Applied Categorical and Count Data Analysis
W. Tang, H. He, and X.M. Tu
Elementary Applications of Probability Theory,
Second Edition
H.C. Tuckwell
Introduction to Statistical Inference and Its
Applications with R
M.W. Trosset
Understanding Advanced Statistical Methods
P.H. Westfall and K.S.S. Henning
Statistical Process Control: Theory and
Practice, Third Edition
G.B. Wetherill and D.W. Brown
Generalized Additive Models:
An Introduction with R
S. Wood
Epidemiology: Study Design and
Data Analysis, Third Edition
M. Woodward
Practical Data Analysis for Designed
Experiments
B.S. Yandell
Texts in Statistical Science

Mathematical Statistics: Basic Ideas and Selected Topics, Volume II
Peter J. Bickel
University of California
Berkeley, California, USA
Kjell A. Doksum
University of Wisconsin
Madison, Wisconsin, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20150909
International Standard Book Number-13: 978-1-4987-2270-4 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To Erich L. Lehmann
CONTENTS

PREFACE TO THE 2015 EDITION  xv

I  INTRODUCTION AND EXAMPLES  1
  I.0 Basic Ideas and Conventions  1
  I.1 Tests of Goodness of Fit and the Brownian Bridge  5
  I.2 Testing Goodness of Fit to Parametric Hypotheses  5
  I.3 Regular Parameters. Minimum Distance Estimates  6
  I.4 Permutation Tests  8
  I.5 Estimation of Irregular Parameters  8
  I.6 Stein and Empirical Bayes Estimation  10
  I.7 Model Selection  11
  I.8 Problems and Complements  15
  I.9 Notes  20

7  TOOLS FOR ASYMPTOTIC ANALYSIS  21
  7.1 Weak Convergence in Function Spaces  21
    7.1.1 Stochastic Processes and Weak Convergence  21
    7.1.2 Maximal Inequalities  28
    7.1.3 Empirical Processes on Function Spaces  31
  7.2 The Delta Method in Infinite Dimensional Space  38
    7.2.1 Influence Functions. The Gâteaux and Fréchet Derivatives  38
    7.2.2 The Quantile Process  47
  7.3 Further Expansions  51
    7.3.1 The von Mises Expansion  51
    7.3.2 The Hoeffding and Analysis of Variance Expansions  54
  7.4 Problems and Complements  62
  7.5 Notes  72

8  DISTRIBUTION-FREE, UNBIASED, AND EQUIVARIANT PROCEDURES  73
  8.1 Introduction  73
  8.2 Similarity and Completeness  74
    8.2.1 Testing  74
    8.2.2 Testing Optimality Theory  85
    8.2.3 Estimation  87
  8.3 Invariance, Equivariance, and Minimax Procedures  92
    8.3.1 Group Models  92
    8.3.2 Group Models and Decision Theory  94
    8.3.3 Characterizing Invariant Tests  96
    8.3.4 Characterizing Equivariant Estimates  102
    8.3.5 Minimaxity for Tests: Application to Group Models  104
    8.3.6 Minimax Estimation, Admissibility, and Steinian Shrinkage  107
  8.4 Problems and Complements  112
  8.5 Notes  123

9  INFERENCE IN SEMIPARAMETRIC MODELS  125
  9.1 Estimation in Semiparametric Models  125
    9.1.1 Selected Examples  125
    9.1.2 Regularization. Modified Maximum Likelihood  133
    9.1.3 Other Modified and Approximate Likelihoods  142
    9.1.4 Sieves and Regularization  145
  9.2 Asymptotics. Consistency and Asymptotic Normality  151
    9.2.1 A General Consistency Criterion  152
    9.2.2 Asymptotics for Selected Models  153
  9.3 Efficiency in Semiparametric Models  161
  9.4 Tests and Empirical Process Theory  175
  9.5 Asymptotic Properties of Likelihoods. Contiguity  181
  9.6 Problems and Complements  193
  9.7 Notes  209

10  MONTE CARLO METHODS  211
  10.1 The Nature of Monte Carlo Methods  211
  10.2 Three Basic Monte Carlo Methods  214
    10.2.1 Simple Monte Carlo  215
    10.2.2 Importance Sampling  216
    10.2.3 Rejective Sampling  217
  10.3 The Bootstrap  219
    10.3.1 Bootstrap Samples and Bias Corrections  220
    10.3.2 Bootstrap Variance and Confidence Bounds  224
    10.3.3 The General i.i.d. Nonparametric Bootstrap  227
    10.3.4 Asymptotic Theory for the Bootstrap  230
    10.3.5 Examples Where Efron's Bootstrap Fails. The m out of n Bootstraps  235
  10.4 Markov Chain Monte Carlo  237
    10.4.1 The Basic MCMC Framework  237
    10.4.2 Metropolis Sampling Algorithms  238
    10.4.3 The Gibbs Samplers  242
    10.4.4 Speed of Convergence and Efficiency of MCMC  246
  10.5 Applications of MCMC to Bayesian and Frequentist Inference  250
  10.6 Problems and Complements  256
  10.7 Notes  263

11  NONPARAMETRIC INFERENCE FOR FUNCTIONS OF ONE VARIABLE  265
  11.1 Introduction  265
  11.2 Convolution Kernel Estimates on R  266
    11.2.1 Uniform Local Behavior of Kernel Density Estimates  269
    11.2.2 Global Behavior of Convolution Kernel Estimates  271
    11.2.3 Performance and Bandwidth Choice  272
    11.2.4 Discussion of Convolution Kernel Estimates  273
  11.3 Minimum Contrast Estimates: Reducing Boundary Bias  274
  11.4 Regularization and Nonlinear Density Estimates  280
    11.4.1 Regularization and Roughness Penalties  280
    11.4.2 Sieves. Machine Learning. Log Density Estimation  281
    11.4.3 Nearest Neighbor Density Estimates  284
  11.5 Confidence Regions  285
  11.6 Nonparametric Regression for One Covariate  287
    11.6.1 Estimation Principles  287
    11.6.2 Asymptotic Bias and Variance Calculations  290
  11.7 Problems and Complements  297

12  PREDICTION AND MACHINE LEARNING  307
  12.1 Introduction  307
    12.1.1 Statistical Approaches to Modeling and Analyzing Multidimensional Data. Sieves  309
    12.1.2 Machine Learning Approaches  313
    12.1.3 Outline  315
  12.2 Classification and Prediction  315
    12.2.1 Multivariate Density and Regression Estimation  315
    12.2.2 Bayes Rule and Nonparametric Classification  320
    12.2.3 Sieve Methods  322
    12.2.4 Machine Learning Approaches  324
  12.3 Asymptotic Risk Criteria  333
    12.3.1 Optimal Prediction in Parametric Regression Models  334
    12.3.2 Optimal Rates of Convergence for Estimation and Prediction in Nonparametric Models  337
    12.3.3 The Gaussian White Noise (GWN) Model  347
    12.3.4 Minimax Bounds on IMSE for Subsets of the GWN Model  349
    12.3.5 Sparse Submodels  350
  12.4 Oracle Inequalities  352
    12.4.1 Stein's Unbiased Risk Estimate  354
    12.4.2 Oracle Inequality for Shrinkage Estimators  355
    12.4.3 Oracle Inequality and Adaptive Minimax Rate for Truncated Estimates  357
    12.4.4 An Oracle Inequality for Classification  359
  12.5 Performance and Tuning via Cross Validation  361
    12.5.1 Cross Validation for Tuning Parameter Choice  362
    12.5.2 Cross Validation for Measuring Performance  367
  12.6 Model Selection and Dimension Reduction  367
    12.6.1 A Bayesian Criterion for Model Selection  368
    12.6.2 Inference after Model Selection  372
    12.6.3 Dimension Reduction via Principal Component Analysis  374
  12.7 Topics Briefly Touched and Current Frontiers  377
  12.8 Problems and Complements  381

D  APPENDIX D. SUPPLEMENTS TO TEXT  399
  D.1 Probability Results  399
  D.2 Supplement to Section 7.1  401
  D.3 Supplement to Section 7.2  404
  D.4 Supplement to Section 9.2.2  405
  D.5 Supplement to Section 10.4  406
  D.6 Supplement to Section 11.6  410
  D.7 Supplement to Section 12.2.2  413
  D.8 Problems and Complements  419

E  SOLUTIONS FOR VOLUME II  423

REFERENCES  437

SUBJECT INDEX  455

AUTHOR INDEX  461
PREFACE TO THE 2015 EDITION
Volume II
This textbook represents our view of what a second course in mathematical statistics
for graduate students with a good mathematics background should be. The mathematics
background needed includes linear algebra, matrix theory and advanced calculus, but not
measure theory. Probability at the level of, for instance, Grimmett and Stirzaker’s Probability and Random Processes, is also needed. Appendix D1 in combination with Appendices
A and B in Volume I give the probability that is needed. However, the treatment is abridged
with few proofs.
This Volume II of the second edition presents what we think are some of the most important statistical concepts, methods, and tools developed since the first edition. Topics
included are: asymptotic efficiency in semiparametric models, semiparametric maximum
likelihood estimation, finite sample size optimality including Lehmann-Scheffé theory, survival analysis including Cox regression, prediction, classification, methods of inference
based on sieve models, model and variable selection, Monte Carlo methods such as the
bootstrap and Markov Chain Monte Carlo, nonparametric curve estimation, and machine
learning including support vector machines and classification and regression trees (CART).
The basic asymptotic tools developed or presented, in part in the text and in part in
appendices, are weak convergence for random processes, empirical process theory, and the
functional delta method. With the tools and concepts developed in this second volume
students will be ready for advanced research in modern statistics.
Volume II includes too many topics to be covered in one semester. Chapter 8 can
be omitted without losing much continuity. The following outline of Volume II chapter
contents can be used to select topics to be included in a one semester course. A course
leaning towards basic statistical theory could include most of Chapters 7, 8, 9, and 10 plus
Sections 11.6, 12.4, and 12.6, while a course leaning towards statistical learning could
include most of Chapters 7, 9, 10, and 12 plus Section 11.4. A great number of other
possible combinations can be constructed from the following chapter outline.
Volume II Outline
Chapter I is an introductory chapter that starts by presenting basic ideas about statistical
modeling and inference, then gives a number of examples that illustrate some of the basic
ideas. Chapter 7 gives the asymptotic tools to be used in parts of the rest of the book.
These tools include asymptotic empirical process theory, the delta method and derivatives
on function spaces and their use in deriving influence functions, as well as von Mises and
Hoeffding expansions.
Chapter 8 presents some of the classical theory of statistical optimality in a decision
theoretic context. Here, we derive for fixed sample sizes procedures which are optimal
over restricted classes of methods, such as distribution free tests and unbiased, invariant
tests or equivariant estimates. Alternatively, we study the weak but general property of
minimaxity, as well as admissibility. Some results important for Chapter 12 such as Stein’s
identity and estimate are also introduced. The asymptotic counterparts of these notions
are introduced in Chapter 9, enabling us to present generalizations which go well beyond
classical exponential family and invariant models. The methods of Chapter 7 are key in this
development.
In Chapter 9, we first introduce the concepts of regularization, modified maximum likelihood, and sieves. We then develop asymptotic efficiency for semiparametric models,
asymptotic optimality for tests, and Le Cam’s asymptotic theory of experiments. These
concepts are applied to estimation in Cox’s semiparametric proportional hazard regression
model, partially linear models, semiparametric linear models, biased sampling models, and
models for censored data. For testing, the asymptotic optimality of Neyman's Cα and Rao's score tests is established.
Chapter 10 develops Monte Carlo methods, that is, methods based on simulation. Simulation enables us to approximate stochastically quantities that are difficult to compute analytically. Examples range from distributions of test statistics and confidence bounds
to likelihood methods to estimates of risk. In this chapter we first give methods based on
simple random sampling where data are generated from specified distributions, including
simple posterior distributions in a Bayesian setting. Next, importance sampling and rejective sampling are introduced as methods that are able to generate random variables with a
desired distribution when simple random sampling is not possible. In Chapter 10 we also
present Efron’s bootstrap. Here, important features of the population distribution, that is,
parameters, are expressed as functionals of this unknown distribution and then the distribution is replaced by an estimate such as the empirical distribution of the experimental data.
The functional evaluated at the empirical distribution is approximated by drawing Monte
Carlo samples from the empirical distribution. Chapter 10 goes on to present Markov Chain
Monte Carlo (MCMC) techniques. These are appropriate when direct generation of independent identically distributed variables is not possible. In this approach, a sequence of
random variables is generated according to a homogeneous Markov chain in such a way
that variables far out in the chain are approximately distributed as a sample from a target
distribution. The Metropolis algorithm and the Gibbs sampler are developed as special
cases. MCMC is particularly important for Bayesian statistical inference, but it is also useful in frequentist contexts.
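The bootstrap recipe described above (express the parameter as a functional of the unknown distribution, plug in the empirical distribution, and approximate by Monte Carlo resampling) fits in a few lines of code. The following is our own minimal sketch, not code from the book; the exponential toy data, the choice of the median as the functional, and B = 2000 replications are arbitrary illustrative choices.

```python
import numpy as np

def bootstrap_se(x, statistic, B=2000, seed=0):
    """Approximate the standard error of `statistic` by evaluating it
    on B Monte Carlo samples drawn from the empirical distribution of
    x, i.e., by resampling the data with replacement."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    replicates = np.array(
        [statistic(x[rng.integers(0, n, size=n)]) for _ in range(B)]
    )
    return replicates.std(ddof=1)

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)   # arbitrary toy data
se_median = bootstrap_se(x, np.median)     # bootstrap SE of the sample median
```

The same function works for any plug-in functional; only `statistic` changes.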
Chapter 11 examines nonparametric estimation of functions of one variable, including density functions and the nonparametric regression of a response on one covariate.
Estimates considered are based on kernels, series expansions, roughness penalties, nearest
neighbors, and local polynomials. Asymptotic properties of mean squared errors are
developed.
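As a small illustration of the kernel estimates treated in Chapter 11, here is a minimal sketch of our own (not code from the book) of a convolution kernel density estimate with a Gaussian kernel; the normal reference bandwidth is one common default, assumed here purely for concreteness.

```python
import numpy as np

def kernel_density(x_grid, data, h=None):
    """Convolution kernel density estimate with a Gaussian kernel:
    f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    if h is None:
        # normal reference ("rule of thumb") bandwidth, an assumed default
        h = 1.06 * data.std(ddof=1) * n ** (-1 / 5)
    u = (np.asarray(x_grid, dtype=float)[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # standard normal kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(2)
data = rng.normal(size=500)            # toy sample
grid = np.linspace(-4.0, 4.0, 81)
f_hat = kernel_density(grid, data)     # estimated density on the grid
```

The bandwidth h governs the bias-variance trade-off discussed in Section 11.2.3; smaller h tracks the data more closely at the cost of higher variance.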
Chapter 12 has a number of topics related to what is known as statistical learning. Topics include prediction, classification, nonparametric estimation of multivariate densities and
regression functions, penalty estimation including the least absolute shrinkage and selection operator (Lasso), classification and regression trees (CART), support vector machines,
boosting, Gaussian white noise modeling, oracle inequalities, Steinian shrinkage estimation,
sparsity, Bayesian model selection, regularization, sieves, cross-validation, asymptotic risk
and optimality properties of predictors and classifiers, and more.
We thank Akichika Ozeki, Sangbum Choi, Sören Künzel, Joshua Cape, and many students for pointing out errors and John Kimmel and CRC Press for production support. For
word processing we thank Dee Frana and especially Anne Chong who processed 95% of
Volume II and helped with references, indexing and error detection. Kjell Doksum thanks
the statistics departments of Harvard, Columbia and Stanford Universities for support.
Last and most important we would like to thank our wives, Nancy Kramer Bickel and
Joan H. Fujimura, and our families for support, encouragement, and active participation in
an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has,
with the field, taken on a new life.
Volume I
For convenience we repeat the preface to Volume I. In recent years statistics has changed
enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data—terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once
been unthinkable.
The underlying sources of these changes have been the exponential change in computing speed (Moore’s “law”) and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing).
These techniques often have a strong intrinsic computational component. Tomographic
data are the result of mathematically based processing. Sequencing is done by applying
computational algorithms to raw gel electrophoresis data.
As a consequence the emphasis of statistical theory has shifted away from small sample
optimality results in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal
assumptions—asymptotic methods in non- and semiparametric models, models with
“infinite” number of parameters.
(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for
analytical results on asymptotic approximation. Multiparameter models are the rule.
(3) The use of methods of inference involving simulation as a key element such as the
bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in “closed mathematical form” but
rather through elaborate algorithms for which problems of existence of solutions are
important and far from obvious.
(5) The study of the interplay between numerical and statistical considerations. Despite
advances in computing speed, some methods run quickly in real time. Others do not
and some though theoretically attractive cannot be implemented in a human lifetime.
(6) The study of the interplay between the number of observations and the number of
parameters of a model and the beginnings of appropriate asymptotic theories.
There have been other important consequences such as the extensive development of
graphical and other exploratory methods for which theoretical development and connection
with mathematics have been minimal. These will not be dealt with in our work.
In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics, relating theory to practice.
Volume I Outline
This volume presents the basic classical statistical concepts at the Ph.D. level without requiring measure theory. It gives careful proofs of the major results and indicates how the
theory sheds light on the properties of practical methods. The topics include estimation,
prediction, testing, confidence sets, Bayesian analysis and the more general approach of
decision theory.
We include from the start in Chapter 1 non- and semiparametric models, then go to
parameters and parametric models stressing the role of identifiability. From the beginning
we stress function-valued parameters, such as the density, and function-valued statistics,
such as the empirical distribution function. We also, from the start, include examples that
are important in applications, such as regression experiments. There is extensive material
on Bayesian models and analysis and extended discussion of prediction and k-parameter
exponential families. These objects that are the building blocks of most modern models
require concepts involving moments of random vectors and convexity that are given in
Appendix B.
Chapter 2 deals with estimation and includes a detailed treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter
exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm (coordinate descent) for
convex optimization, applied in particular to computing MLEs in multiparameter exponential families. We also give an introduction to the EM algorithm, one of the main ingredients
of most modern algorithms for inference. Chapters 3 and 4 are on the theory of testing and
confidence regions, including some optimality theory for estimation as well and elementary
robustness considerations.
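The coordinate descent algorithm mentioned in the Chapter 2 summary above can be illustrated on a toy problem. The sketch below is ours, not the book's: it minimizes a strictly convex quadratic f(x) = 0.5 x'Ax - b'x, a stand-in for a smooth log-likelihood, where each coordinate update has a closed form; the matrix A and vector b are arbitrary illustrative choices.

```python
import numpy as np

def coordinate_descent(A, b, sweeps=100):
    """Cyclic coordinate descent for f(x) = 0.5*x'Ax - b'x with A
    positive definite: each update is the exact minimizer of f in one
    coordinate, the others held fixed (the Gauss-Seidel iteration)."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for j in range(len(b)):
            # closed-form minimizer of f in coordinate j
            x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # an arbitrary positive definite matrix
b = np.array([1.0, 1.0])
x_hat = coordinate_descent(A, b)          # converges to the minimizer A^{-1} b
```

For positive definite A the sweep converges linearly; the same one-coordinate-at-a-time idea carries over to concave log-likelihoods, where each update is a one-dimensional maximization.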
Chapter 5 is devoted to basic asymptotic approximations with one dimensional parameter models as examples. It includes proofs of consistency and asymptotic normality and
optimality of maximum likelihood procedures in inference and a section relating Bayesian
and frequentist inference via the Bernstein–von Mises theorem.
Finally, Chapter 6 is devoted to inference in multivariate (multiparameter) models.
Included are asymptotic normality and optimality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the
likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and
some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the one dimensional parameter case in Chapter 5. Chapter 6 also develops
the asymptotic joint normality of estimates that are solutions to estimating equations and
presents Huber’s Sandwich formula for the asymptotic covariance matrix of such estimates.
Generalized linear models, including binary logistic regression, are introduced as examples.
Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the
more advanced topics of Volume II.
Volume I includes Appendix A on basic probability and a larger Appendix B, which
includes more advanced topics from probability theory such as the multivariate Gaussian
distribution, weak convergence in Euclidean spaces, and probability inequalities as well as
more advanced topics in matrix theory and analysis. The latter include the principal axis
and spectral theorems for Euclidean space and the elementary theory of convex functions
on Rd as well as an elementary introduction to Hilbert space theory. As in the first edition,
we do not require measure theory but assume from the start that our models are what we
call “regular.” That is, we assume either a discrete probability whose support does not
depend on the parameter set, or the absolutely continuous case with a density. Hilbert space
theory is not needed, but for those who know this topic Appendix B points out interesting
connections to prediction and linear regression analysis.
Appendix B is as self-contained as possible with proofs of most statements, problems,
and references to the literature for proofs of the deepest results such as the spectral theorem.
The reason for these additions is the change in subject matter necessitated by the current
areas of importance in the field.
For the first volume of the second edition we would like to add thanks to Jianqing Fan,
Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students
who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager
for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for
proofreading that found not only typos but serious errors, and Prentice Hall for generous
production support.
Peter J. Bickel
bickel@stat.berkeley.edu
Kjell Doksum
doksum@stat.wisc.edu
Chapter I
INTRODUCTION AND EXAMPLES
I.0 Basic Ideas and Conventions
Recall from Volume I that in the field of statistics we represent important data-related problems and questions in terms of questions about distributions and their parameters. Thus our
goal is to use data X ∈ X to estimate or draw conclusions about aspects of the probability
distribution P of X. The probability distribution P is assumed to belong to a class P of distributions called the model. Examining what models are useful for answering data-related
questions is an important part of statistics. In Volume I we considered three cases with a
focus on the first:
(1) P is a member of a parametric class of distributions {Pθ : θ ∈ Θ}, Θ ⊂ Rd , and our
interest is in θ or some vector q(θ).
(2) P is arbitrary except for regularity conditions, such as finite second moments or
continuity of the distribution function, and our interest is in functionals ν(P ) that
may be real valued, vectors, or functions.
(3) Our class of distributions is neither smoothly parametrizable by a Euclidean parameter nor essentially unrestricted.
In Volume I we focussed mostly(1) on parametric cases and on situations where the number
of parameters we were dealing with was small in at least one of two ways:
(i) The complexity of the regular model {Pθ : θ ∈ Θ}, Θ ⊂ Rd , as measured by the
dimension d of the parametrization, was small in relation to the amount of information, as measured by the sample size n of the data. In particular, when examining the
properties of statistical procedures, d does not increase with n.
(ii) The procedures we considered, estimation of low dimensional Euclidean parameters,
testing, and confidence regions, corresponded to simple (finite or low dimensional)
action spaces A, where A is the range of the statistical decision procedure.
In this volume we will focus on inference in non- and semiparametric models. In doing
so, we will not only reexamine the procedures introduced in Volume I from a more sophisticated point of view but also come to grips with new problems originating from our analysis
of estimation of functions and other complex decision procedures that appear naturally in
these contexts. The mathematics needed for this work is often of a higher level than that
used in Volume I. But, as before, we present what is needed in the appendices with proofs
or references.
Modeling Conventions
The guiding principle of modern statistics was best formulated by George Box (see Section 1.1):
“Models, of course, are never true, but fortunately it is only necessary that they be useful.”
One implication of this statement is that the parameters we deal with are the parameters
of the distribution in our model class closest to the unknown true distribution. See Sections
2.2.2, 5.4.2, and 6.2.1. For instance, a linear regression model can detect linear trends
that provide useful information even if the true population relationship is not linear. See
Figure 1.4.1. This leads to an interesting dilemma and accompanying research questions:
The more general a class we postulate, the more closely we will be able to approximate the
true population distribution. However, using a very general class of models means more
parameters and more variability of statistical methods. Achieving a balance leads to useful
models. One approach is to use a nested sequence of “regular” parametric models (sieves)
that become more general as we add parameters and then select the number of parameters
by minimizing estimated prediction error (cross validation). See Chapter 12.
As in Volume I (see Section 1.1.3), except for Bayesian models, our parametric models
are restricted to be regular parametric models P = {Pθ : θ ∈ Θ}, Θ ⊂ Rd, where Pθ is
either continuous or discrete, and in the discrete case {x : p(x; θ) > 0} does not involve θ.
But see Section I.5 for a general concept of regular and irregular parameters that includes
semiparametric and nonparametric models.
As in our discussion of Bayesian models in Section 1.2, conditioning of continuous
variables by discrete variables and vice versa generally preserves the interpretation of the
conditional p as being a continuous or discrete case density, respectively. If X = (I, Y)T,
where I is discrete and Y is continuous, then p(i, y) is a density if it satisfies

P(I = i, Y ≤ y) = ∫_{−∞}^{y} p(i, t) dt.

Readers familiar with measure theory will recognize that all results
remain meaningful when p = dP/dµ, where µ is a σ-finite measure dominating all P under
discussion, and conditional densities are interpreted as being with respect to the appropriate
conditional measure. All of the proofs can be converted to this general case, subject only
to minor technicalities. Finally, we will write h = 0 when h = 0 a.s. (almost surely). More
generally, a.s. equality is denoted as equality.
As in Volume I, throughout this volume, for x ∈ Rd we shall use p(x) interchangeably for frequency functions p(x) = P[X = x] and for continuous case density functions
p(x). We will call p(·) a density function in both cases. When we write ∫ h(x) dP(x)
we will mean Σ_x h(x)P[X = x] or ∫ h(x)p(x) dx. Unless we indicate otherwise, statements which can be interpreted under either interpretation are valid under both, although
proofs in the text will be given under one formalism or the other. That is, when we write
∫ h(x)p(x) dx, for instance, we really mean ∫ h(x) dP(x) as interpreted above. When
d = 1, we let F(x) = P(−∞, x] and often write ∫ h(x) dF(x) for ∫ h(x) dP(x).
Selected Topics
Statistical methods in Volume II include the bootstrap, Markov Chain Monte Carlo
(MCMC), Steinian shrinkage, sieves, cross-validation, censored data analysis, Cox proportional hazard regression, nonparametric curve (kernel) estimation, model selection, classification, prediction, classification and regression trees (CART), penalty estimation such as
the Lasso, and Bayesian procedures. The effectiveness of statistical methods is examined
using classical concepts such as risk, Bayes risk, mean squared error, power, minimaxity,
admissibility, invariance, and equivariance. Statistical methods that are optimal based on
such criteria are obtained in a finite sample context in Chapter 8. However, most of the
book is concerned with asymptotic theory including empirical process theory and efficient
estimation in semiparametric models, as well as the development of the asymptotic properties of the statistical methods listed above. We present in Sections I.1–I.7 a few more details
of some of the topics in Volume II.
Notation
X ∼ F , X is distributed according to F
statistic, a function of observable data X only
L(X), the distribution, or law, of X
Xn ⇒ X or L(Xn) → L(X), Xn converges weakly (in law) to X
df, distribution function
J, identity matrix = diag(1, . . . , 1)
F̄ = 1 − F , the survival function
[t], greatest integer less than or equal to t
i.i.d., independent identically distributed
sample, X1 , . . . , Xn i.i.d. as X ∼ F
P̂ and P̂n, empirical probability of a sample X1, . . . , Xn
B(n, θ), binomial distribution with parameters n and θ
Ber(θ), Bernoulli distribution = B(1, θ)
E(λ), exponential distribution with parameter λ (mean 1/λ)
H(D, N, n), hypergeometric distribution with parameters D, N , n
M(n, θ1 , . . . , θq ), multinomial distribution with parameters n, θ1 , . . . , θq
N (µ, σ 2 ), normal (Gaussian) distribution with mean µ and variance σ 2
ϕ, N (0, 1) density
Φ, N (0, 1) df
zα , αth quantile of Φ : zα = Φ−1 (α)
N (µ1 , µ2 , σ12 , σ22 , ρ), bivariate normal (Gaussian) distribution
N (µ, Σ), multivariate normal (Gaussian) distribution
P(λ), Poisson distribution with parameter λ
U(a, b), uniform distribution on the interval (a, b)
d.f., degrees of freedom
χ2k , chi-square distribution with k d.f.
≡, defined to be equal to
⊥, orthogonal to, uncorrelated
1(·), indicator function
The OP , ≍P , and oP Notation
The following asymptotic order in probability notation is from Section B.7. Let Un
and Vn be random vectors in Rd and let | · | denote Euclidean distance.
Un = oP(1) iff Un converges to 0 in probability, that is, ∀ε > 0, P(|Un| > ε) → 0
Un = OP(1) iff ∀ε > 0, ∃M < ∞ such that P[|Un| ≥ M] ≤ ε for all n
Un = oP(Vn) iff |Un|/|Vn| = oP(1)
Un = OP(Vn) iff |Un|/|Vn| = OP(1)
Un ≍P Vn iff Un = OP(Vn) and Vn = OP(Un)
Un = ΩP(Vn) iff Un ≍P Vn
Note that

OP(1)oP(1) = oP(1),   OP(1) + oP(1) = OP(1),   (I.1)

and convergence in law, Un ⇒ U, implies Un = OP(1).
Suppose Z1, . . . , Zn are i.i.d. as Z with E|Z| < ∞. Set µ = E(Z); then Z̄n = µ + oP(1) by the weak law of large numbers. If E|Z|² < ∞, then Z̄n = µ + OP(n^{−1/2}) by the central limit theorem.
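As a small numerical illustration of these two rates (our sketch, not from the text), the following simulation checks that the mean absolute deviation of Z̄n from µ shrinks like n^{−1/2}, so that √n(Z̄n − µ) remains stochastically bounded:

```python
import numpy as np

# Illustration of the O_P / o_P rates above (our sketch, not from the text):
# for i.i.d. Z_i with E|Z|^2 < infinity, Zbar_n - mu shrinks like n^{-1/2},
# so sqrt(n)(Zbar_n - mu) stays stochastically bounded.

rng = np.random.default_rng(0)

def mean_abs_deviation(n, reps=2000):
    """Monte Carlo estimate of E|Zbar_n - mu| for Z ~ Exponential(1), mu = 1."""
    z = rng.exponential(1.0, size=(reps, n))
    return np.abs(z.mean(axis=1) - 1.0).mean()

dev_small, dev_large = mean_abs_deviation(100), mean_abs_deviation(10000)
ratio_raw = dev_small / dev_large                    # about sqrt(10000/100) = 10
ratio_scaled = (10 * dev_small) / (100 * dev_large)  # about 1 after sqrt(n) scaling
print(ratio_raw, ratio_scaled)
```

The raw deviations shrink by roughly the factor √(10000/100) = 10, while the √n-scaled deviations are stable, exactly the OP(n^{−1/2}) statement.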
I.1 Tests of Goodness of Fit and the Brownian Bridge
Let X1, . . . , Xn be i.i.d. as X with distribution P. For one dimensional observations, the
distribution function (df) F(·) = P[X ≤ ·] is a natural infinite dimensional parameter
to consider. In Example 4.1.5 we showed how one could use the Kolmogorov statistic
T(F̂, F0), where T(F̂, F) ≡ sup_{t∈R} |F̂(t) − F(t)| and F̂ is the empirical df, to construct
a test of the hypothesis H : F = F0. The test is designed so that one can expect it
to be consistent against all alternatives, so that our viewpoint is fully nonparametric. In
Example 4.4.6 we showed how to construct a simultaneous confidence band for F(·) using
the pivot T(F̂, F). In both cases we noted that the critical values needed for the test and
confidence band could be obtained by determining the distribution of T(F̂, U), where U
is the Unif[0, 1] distribution function, under F = Unif[0, 1], and stated that these values
could be determined by Monte Carlo simulation.
How does T(F̂, F) behave qualitatively? We will show in Section 7.1 that, although
infinite dimensional, F(·) is a “regular” parameter. In this case, what “regular” means is
that the stochastic process

En(x) ≡ √n (F̂(x) − F(x)),  x ∈ R,   (I.2)

converges in law in a strong sense (called “weak convergence”) to a Gaussian process
W⁰(F(·)). Here W⁰(u), 0 ≤ u ≤ 1, is a Gaussian process called the “Brownian bridge”
with mean 0 and covariance structure given by

Cov(W⁰(u1), W⁰(u2)) = u1(1 − u2),  u1 ≤ u2.
By “Gaussian” we mean that the distribution of W⁰(u1), . . . , W⁰(uk) is multivariate normal for all u1, . . . , uk. Note that

Cov(W⁰(F(x1)), W⁰(F(x2))) = Cov(En(x1), En(x2)) = F(x1)(1 − F(x2)),  x1 ≤ x2.

The weak convergence of En(·) to W⁰(F(·)), to be established in Section 7.1, will enable
us to derive the Kolmogorov theorem, that when F = Unif[0, 1], T(F̂, F) converges in
law to L(sup{|W⁰(u)| : 0 ≤ u ≤ 1}), which is known analytically. This approach is based
on heuristics due to Doob (1949) and developed in Donsker (1952). See also Doob (1953).
We will discuss the heuristics in Section 7.1 and apply them to this and other examples in
Section 7.2.
These results will provide approximate size α critical values for the Kolmogorov statistics and other interesting functionals of distribution functions. The critical values yield
confidence regions for distribution functions and related parameters. See Examples 4.4.6,
4.4.7 and Problems 4.4.17–4.4.19, 4.5.14–4.5.16.
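A quick Monte Carlo check of this limit (our sketch, not from the text): under F = Unif[0, 1], the 0.95 quantile of the simulated Kolmogorov statistic should approach the known 0.95 point of sup|W⁰|, approximately 1.358.

```python
import numpy as np

# Monte Carlo sketch (ours, not from the text) of the Kolmogorov limit: the
# 0.95 quantile of sqrt(n) * sup|Fhat - F| under F = Unif[0,1] should be close
# to 1.358, the 0.95 point of sup{|W0(u)| : 0 <= u <= 1}.

rng = np.random.default_rng(1)

def kolmogorov_stat(u):
    """sqrt(n) * sup_t |Fhat(t) - t| for a Unif(0,1) sample u; the sup is
    attained just before or at a jump of the empirical df."""
    u = np.sort(u)
    n = len(u)
    upper = np.arange(1, n + 1) / n - u
    lower = u - np.arange(0, n) / n
    return np.sqrt(n) * max(upper.max(), lower.max())

n, reps = 500, 4000
stats = np.array([kolmogorov_stat(rng.uniform(size=n)) for _ in range(reps)])
q95 = np.quantile(stats, 0.95)
print(q95)   # near the asymptotic critical value 1.358
```

This is the Monte Carlo determination of critical values mentioned above; the weak convergence of En explains why the same table works for every continuous F.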
✷
I.2 Testing Goodness of Fit to Parametric Hypotheses
In Examples 4.1.6 and 4.4.6 we considered the important problem of testing goodness-of-fit to a Gaussian distribution H : F(·) = Φ((· − µ)/σ) for some µ, σ. We deduced that the
goodness-of-fit statistic

sup_x |Ĝ(x) − Φ(x)|,

where Ĝ is the empirical distribution function of (Z1, . . . , Zn) with Zi = (Xi − X̄)/σ̂, has a
null distribution which does not depend on µ and σ, so that critical values can be calculated
by simulating from N(0, 1). In Section 8.2, we will consider other classes of hypothesis
models which admit reasonable tests whose critical values can be specified without knowledge of which particular hypothesized distribution is true. However, when we consider the
general problem of testing H : X ∼ P ∈ P where P = {Pθ : θ ∈ Θ} is a regular parametric model, we quickly come to situations where the methods of Chapter 8 will not apply.
For instance, suppose that in the Gaussian goodness-of-fit problem above, our observations
X1 , . . . , Xn which are i.i.d. as F (x) = Φ([x − µ]/σ) are truncated at 0, that is, we assume
that we observe Y1 , . . . , Yn i.i.d. distributed as X|X ≥ 0 where X ∼ N (µ, σ 2 ). Then,
P[0, ∞) = 1 and H is

P[Y ≤ t] ≡ G_{µ,σ²}(t) = P(0 ≤ X ≤ t)/P(X ≥ 0) = 1 − Φ((µ − t)/σ)/Φ(µ/σ),  t ≥ 0,
       = 0,  t < 0.
The only promising approach here is to estimate µ and σ² consistently using, for instance,
maximum likelihood or the method of moments (Problem I.2.1) by µ̂ and σ̂² and estimate
the null distribution of

Tn ≡ sup_{t≥0} √n |F̂(t) − G_{µ̂,σ̂²}(t)|

by simulating samples of size n from N(µ̂, σ̂²) truncated at 0, i.e., keep as observations
only the nonnegative ones. But can this method, called the “parametric bootstrap,” be
justified? To answer this question we need to consider asymptotics: It turns out that, under
H, Tn converges in law to a limit. This helps us little in approximating the null distribution
of Tn since an analytic form for its limiting distribution is not available. But, as we show
in Section 9.4, such results are essential in justifying the parametric bootstrap.
The more important “nonparametric bootstrap” and other methods for simulating or
approximately simulating observations from complicated distributions, often dependent on
the data, such as Markov Chain Monte Carlo, are developed in Chapter 10.
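The mechanics of the parametric bootstrap just described can be sketched in a few lines (ours, not from the text). For simplicity the "estimates" of (µ, σ) below are crude moment plug-ins rather than the MLE; the bootstrap logic — refit on each simulated sample, then recompute Tn — is the point:

```python
import math
import numpy as np

# Sketch of the parametric bootstrap for the truncated-Gaussian null above
# (ours, not from the text).  The estimators are crude plug-ins, standing in
# for the MLE or method-of-moments estimates of Problem I.2.1.

rng = np.random.default_rng(2)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def G(t, mu, sigma):
    """The truncated-normal df from the display above:
    1 - Phi((mu - t)/sigma) / Phi(mu/sigma), for t >= 0."""
    return 1.0 - np.array([Phi((mu - ti) / sigma) for ti in t]) / Phi(mu / sigma)

def sample_truncated(mu, sigma, n):
    """Draws from X | X >= 0 with X ~ N(mu, sigma^2), by rejection."""
    out = np.empty(0)
    while out.size < n:
        x = rng.normal(mu, sigma, size=2 * n)
        out = np.concatenate([out, x[x >= 0]])
    return out[:n]

def T_n(y, mu, sigma):
    y = np.sort(y)
    n = len(y)
    g = G(y, mu, sigma)
    return math.sqrt(n) * max((np.arange(1, n + 1) / n - g).max(),
                              (g - np.arange(0, n) / n).max())

n = 200
y = sample_truncated(1.0, 1.0, n)            # data generated under H
mu_hat, sig_hat = y.mean(), y.std()          # crude, illustrative plug-ins
t_obs = T_n(y, mu_hat, sig_hat)

boot = []
for _ in range(200):
    yb = sample_truncated(mu_hat, sig_hat, n)   # simulate under the fitted null
    boot.append(T_n(yb, yb.mean(), yb.std()))   # refit, then recompute T_n
boot = np.array(boot)
p_value = (boot >= t_obs).mean()
print(t_obs, p_value)
```

Whether this scheme is justified is exactly the question the text raises; Section 9.4's asymptotics for Tn are what make the answer yes.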
I.3 Regular Parameters. Minimum Distance Estimates
We have seen in Chapters 5 and 6 how to establish asymptotic normality and approximate
linearity of estimates that are solutions of estimating equations (M-estimates), and then used
these results to establish efficiency of the MLE under suitable conditions.
There are many types of estimates which cannot be characterized as solutions of estimating equations. Examples we have discussed are the linear combinations of order statistics, such as the trimmed mean introduced in Section 3.5,
X̄α = (n − 2[nα])^{−1} Σ_{i=[nα]+1}^{n−[nα]} X(i),
where X(1) ≤ · · · ≤ X(n) are the ordered Xi .
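In code, the trimmed mean is a one-line operation on the order statistics (our sketch, not the book's):

```python
import numpy as np

# The alpha-trimmed mean from the display above, computed directly from the
# order statistics (our sketch, not from the text).

def trimmed_mean(x, alpha):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(n * alpha)            # [n * alpha]
    return x[k:n - k].mean()      # average of X_([na]+1), ..., X_(n-[na])

x = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
print(trimmed_mean(x, 0.2))       # -> 3.0, unaffected by the outlier
print(np.mean(x))                 # -> 22.0, dragged by the outlier
```

The comparison with the untrimmed mean shows why L-estimates of this kind matter for robustness.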
Here is another important class of such estimates. Suppose the data X ∈ X have
probability distribution P . Let {Pθ : θ ∈ Θ, Θ ⊂ Rq } be a regular parametric model.
There may be a unique θ such that Pθ = P , or if not, we choose the θ that makes Pθ
closest to P in some metric d on the space of probability distributions on X . For instance,
if X = R, examples of such metrics are
d∞ (P, Q) = sup |P (−∞, t] − Q(−∞, t]|
(I.3)
t
and
d22 (P, Q) =
Z
∞
−∞
(P (−∞, t] − Q(−∞, t])2 ψ(t)dt
where ψ(·) is a nonnegative weight function of our choice with
(I.4)
R∞
−∞
ψ(t)dt < ∞.
A minimum distance estimate θ(P̂) is obtained by plugging the empirical distribution P̂ into the parameter

θ(P) = arg min{d(P, Pθ) : θ ∈ Θ},
where we assume that d(Q, Pθ) is well defined for Q in M, a general class of distributions
containing all distributions with finite support. Thus M contains the probability distribution P generating X and the empirical probability P̂. See Problems 7.2.10 and 7.2.18
for examples and properties of minimum distance estimates θ(P̂). These problems show
√n-consistency of θ(P̂). They also show that θ(P̂) may not have a linear approximation
in the sense of Section 7.2.1, and may not be asymptotically normally distributed.
Note that the minimum contrast estimates of Section 2.1 are of this form but, in this case,
d(Q, Pθ) = ∫ ρ(x, θ) dQ(x), which is not a metric but is linear in Q, whereas metrics are
not.
Can minimum distance estimates θ(P̂) be linearized in the sense of (6.2.3), and are
they asymptotically Gaussian, as we have shown M-estimates to be in Sections 6.2.1 and
6.2.2? When this is true, asymptotic inference is simple, as we have seen in Section 6.3. We
have effectively studied this question for X finite in Theorem 5.4.1. To do the general case,
we need to extend the notion of Taylor expansion to function spaces and apply so-called
maximal inequalities discussed in Section 7.1. In fact, we shall go further and examine
function valued estimates such as the quantile function and study conditions under which
these can be linearized and shown to be asymptotically Gaussian in the sense of weak convergence which will be rigorously defined in Section 7.1. Moreover, we want to conclude
that asymptotic Gaussianity holds uniformly in a suitable sense. This is an important issue
which we did not focus on in Volume I. As the Hodges Example 5.4.2 shows, it is possible
to have estimates whose asymptotic behavior is not a good guide to the finite n case because
of lack of uniformity of convergence when we vary the underlying distribution.
In Section 9.3 we will be concerned with regular parameters θ(P), ones whose plug-in
estimates θ(P̂) converge to θ(P) at rate n^{−1/2} uniformly over a suitable subset M0 of a
nonparametric family M of probability distributions.
Definition I.1. θ(P̂) converges to θ(P) at rate δn over M0 ⊂ M iff for all ε > 0 there
exists c < ∞ such that

sup{P[|θ(P̂) − θ(P)| ≥ cδn] : P ∈ M0} ≤ ε.
There is another important set of questions having to do with the extension of the notion
of efficient estimation from parametric to non- and semiparametric models. For instance,
consider the parameter corresponding to a minimum contrast estimate

ν(P) = arg min{∫ ρ(x, θ) dP(x) : θ ∈ Θ}.
Suppose that, as usual, we assume ν is defined for all P ∈ M, a nonparametric model.
Is there any estimate of ν(P) which behaves regularly and yet is able to achieve smaller
asymptotic variance than the minimum contrast estimate ν(P̂) at some P in M? Note
that this is not the same question as asking whether ν(P̂) is improvable by another such
estimator η(P̂) such that η(Pθ) = ν(Pθ) = θ on a parametric model {Pθ : θ ∈ Θ}.
Intuitively, ν(P̂) is to first order the only regular estimate of ν(P) on all of M and should
not be, and indeed is not, improvable. Regularity is essential here to exclude the Hodges
Example phenomena. We develop this theme further in the context of semiparametric as
well as nonparametric models in Section 9.3.
I.4 Permutation Tests
In Section 4.9.3 we considered the Gaussian two-sample problem. We want to compare
two samples X1 , . . . , Xn1 (control) and Y1 , . . . , Yn2 (treatment) from distributions F and
G and, in particular, want to test the hypothesis H : F = G of no treatment effect. We
studied the classical case in which F and G were N (µ, σ 2 ), N (µ + ∆, σ 2 ), respectively.
Then H becomes ∆ = 0 and we arrived at the classical two-sample t-test. In Example 5.3.7
we showed that, even if F = G was not Gaussian, if ∫ x² dF(x) < ∞, the asymptotic
level of the test is preserved as n1, n2 → ∞. This can be thought of as a result for testing
the hypothesis that the semiparametric model {(F, G) : F = G} holds within the full
nonparametric model {(F, G) : F, G arbitrary, ∫ x² dF(x) < ∞, ∫ y² dG(y) < ∞}. But
is it possible to construct tests with reasonable properties which have level 0 < α < 1 for
fixed n1 , n2 and all F, G? The answer is yes. We shall study such permutation tests and
their simple special subclass, the rank tests, in Sections 8.2 and 8.3 in the context of classes
of composite semiparametric and parametric hypotheses H : P ∈ P0 ⊂ P which allow the
construction of tests whose null distribution does not depend on where we are in P0 . We
have, in fact, already seen examples of such parametric hypotheses in the Gaussian one- and two-sample problems with unknown variance.
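The construction behind such exact-level tests can be sketched concretely (ours, not from the text): under H : F = G the pooled observations are exchangeable, so the permutation distribution of any statistic yields a level-α test for any common F = G, Gaussian or not.

```python
import numpy as np

# A two-sample permutation test of H : F = G based on the mean difference
# (our sketch, not from the text).  Under H the pooled sample is exchangeable,
# so relabeling uniformly at random gives a valid p-value for ANY F = G.

rng = np.random.default_rng(4)

def perm_test(x, y, n_perm=2000):
    pooled = np.concatenate([x, y])
    n1 = len(x)
    t_obs = abs(x.mean() - y.mean())
    hits = 0
    for _ in range(n_perm):
        z = rng.permutation(pooled)
        if abs(z[:n1].mean() - z[n1:].mean()) >= t_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one keeps the p-value positive

x = rng.exponential(1.0, size=30)          # control, heavily non-Gaussian
y = rng.exponential(1.0, size=40) + 1.0    # treatment shifted by Delta = 1
p_shift = perm_test(x, y)

y0 = rng.exponential(1.0, size=40)         # second sample from the null
p_null = perm_test(x, y0)
print(p_shift, p_null)
```

The shift alternative is detected with a small p-value, while the null comparison is not systematically rejected; Sections 8.2 and 8.3 develop the theory, including the rank-test subclass.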
I.5 Estimation of Irregular Parameters
We now show that the phenomena we encounter with irregular parameters are not simply
a function of infinite dimension but rather manifest themselves as soon as the parameter
space complexity p, as measured naively by parameter space dimension, is comparable
to the information in the data as measured naively by the sample size n. Although the
distribution function is a reasonable object to study for X = R, a much more visually
informative parameter, even for this dimension and certainly for X = Rd with d > 1, is
the density function, assuming that we postulate that only P which have continuous case
densities p (in the usual sense) are considered. If P ∈ P = {Pθ : θ ∈ Θ} is a regular
parametric model and Pθ has density p(·, θ), we can estimate the density by plugging in,
say, p̂ = p(·, θ̂), where θ̂ is the MLE. If P is essentially nonparametric, there is no natural
extension of the function valued parameter ν(P) ≡ p(·) to all of the nonparametric class
M since, if P̂ denotes the empirical probability (2.1.15), ν(P̂), the density of P̂, has no
meaning. This leads to a strategy of “regularization” by which we first approximate ν(P)
on P by νn(P), which extends smoothly to ν̃n on M, i.e., for which ν̃n(P̂) makes sense
and yet νn(P) − ν(P) → 0 in some uniform sense as n → ∞. “Regularization” refers
precisely to the change of estimation from the “irregular” ν to the “regular” νn. In essence,
there is now a natural decomposition of the estimation error,

ν̃n(P̂) − ν(P) = (ν̃n(P̂) − νn(P)) + (νn(P) − ν(P)).   (I.5)
The first term in parentheses can be interpreted as the source of random variability, and
the second as that of deterministic “bias.” We will loosely refer to this as the “bias–variance
decomposition.” For instance, consider the usual histogram estimate of a one dimensional
density p(·),

p̂h(t) = P̂[Ij(t)]/h,

where Ij = (jh, (j + 1)h], −∞ < j < ∞, and Ij(t) is the unique Ij which contains t.
Now p̂h(t) is the plug-in estimate for the parameter

ph(t) ≡ P[Ij(t)]/h.

Of course, ph ≠ p for h > 0, but if h = hn ↓ 0, then ph(t) → p(t) for all t and phn(·) is a
sequence of approximating parameters. Now,

E(p̂h(t) − ph(t)) = 0,   Var p̂h(t) = ph(t)(1 − hph(t))/nh.
Thus, the “variance” part of the decomposition tends to 0 only if h → 0 more slowly than n^{−1}.
On the other hand, the rate of convergence of the “bias” part to 0,

BIAS(h) ≡ (1/h) ∫_{jh}^{(j+1)h} (p(s) − p(t)) ds,

is fastest when h → 0 fastest. In fact, typically, at best h^{−1} BIAS(h) → c ≠ 0 (Problem
I.5.1). So we see a tension present here in choosing the rate at which h → 0, which is a
new phenomenon whose consequences we shall investigate in Chapters 11 and 12.
Irregular parameters play a critical role in nonparametric regression, classification, and
prediction which we shall also study in Chapters 11 and 12, as well as even in some aspects
of regular parameter estimation in semiparametric models. As mentioned in Section I.3, the
distinction between regular and irregular is loose, roughly corresponding to the distinction
between parameters which can at least asymptotically be estimated unbiasedly at rate n^{−1/2},
in the sense that MSE(θ̂) ≍ n^{−1/2} for some θ̂, and for which the usual asymptotic Gaussian-based methods of inference apply straightforwardly; and those which cannot be treated in
this way.
I.6 Stein and Empirical Bayes Estimation
For conceptual reasons, we consider the analysis of variance p-sample Gaussian model
(Example 6.1.3), specified by X = {Xij : 1 ≤ i ≤ n, 1 ≤ j ≤ p} where the Xij are
independent, N (µj , σ02 ), with µp = (µ1 , . . . , µp )T unknown. We write P(n, p) for this
class of distributions for X. The MLE of µp is
X̄p ≡ (X·1, . . . , X·p)T

where X·j ≡ n^{−1} Σ_{i=1}^{n} Xij. Evidently, X̄p ∼ N(µp, (σ0²/n)J) where Jp×p is the identity. Let our loss function be l(µp, d) = |µp − d|²/p, where |t| is the Euclidean norm of
the vector t. Then, the MSE is

R(µp, X̄p) = (1/p) Σ_{j=1}^{p} E(X·j − µj)² = σ0²/n.   (I.6)
We can show X̄p is minimax (Problem I.6.1) for each n and p and asymptotically efficient
as n → ∞ for p fixed. But, even if p is only ≥ 3, a remarkable phenomenon discovered by
Stein (1956(b)) occurs: X̄p is not admissible, whatever be n. That is, there exist minimax
estimates δ ∗ (X̄p ) such that R(µp , δ ∗ (X̄p )) < σ02 /n for all µp . A famous simple and
intuitively reasonable estimate is Stein’s positive part estimate,
δ*(X̄p) = (1 − (p − 2)/|X̄p|²)+ X̄p   (I.7)

where, if y is a scalar, y+ ≡ max(y, 0). This estimate shrinks X̄p towards 0 and, if the
distance of X̄p from 0 is smaller than √(p − 2), it declares the estimate to be 0.
This result can be made more plausible by considering why X̄p becomes a poor estimate
as p → ∞ for fixed n. Suppose n is fixed. Because X̄p is sufficient and normally distributed
with independent components we can without loss of generality set n = 1. In this case we
write X1 , . . . , Xp for the data. Put a prior distribution Π on Rp according to which the µi
are i.i.d. with density π0 on R. If π0 is known, the posterior mean of (µ1 , . . . , µp ) given
(X1 , . . . , Xp ), which minimizes the Bayes risk, is
δ0(X1, . . . , Xp) = (δ0(X1), . . . , δ0(Xp))

where

δ0(x) = [∫_{−∞}^{∞} µ φ((x − µ)/σ0) π0(µ) dµ] / [∫_{−∞}^{∞} φ((x − µ)/σ0) π0(µ) dµ] = x + σ0² f0′(x)/f0(x)   (I.8)
and

f0(x) = (1/σ0) ∫_{−∞}^{∞} φ((x − µ)/σ0) π0(µ) dµ,
the marginal density of X1 (Problem I.6.2). The Bayes risk of this estimate is just

r(π0, δ0) ≡ σ0²[1 − σ0² I(f0)] < σ0²   (I.9)

where I(f0) = ∫ {[f0′(x)]²/f0(x)} dx, the Fisher information for location of f0 (Problem
I.6.4).
Suppose now that π0 is unknown. We shall show, in Chapter 12, following Robbins
(1956), that we can construct estimates f̂0′/f̂0 using X1, . . . , Xp which, when plugged into
(I.8), yield a purely data dependent estimate δ̂0, such that if r(π0, δ) denotes Bayes risk
(see Section 3.3), then r(π0, δ̂0) → r(π0, δ0) as p → ∞ for all π0. This strictly improves
the performance of X̄p for n = 1, for all π0 (but not uniformly). Here δ̂0 is an example of
an empirical Bayes estimate.
What if both p and n tend to ∞? There is no change in our conclusion that δ0 is optimal
if π0 is fixed and known. An alternative and more informative analysis leads us to consider
priors π0n such that √n µ/σ0, the signal-to-noise ratio, stabilizes. The extent to which we
can or want to try to estimate π0n now depends on the family of priors. We shall consider
this as well as related questions in the context of the so-called Gaussian white noise model
in Chapter 12.
The next section looks at this situation from a different point of view.
I.7 Model Selection
How complex should our model be? Usually this question can be reduced to asking how
many parameters should be included in the model. We continue to consider the model
P(n, p) of Section I.6. We identify the model with its parameter space Rp for (µ1 , . . . , µp ).
Consider nested submodels Pt (n, p), 0 ≤ t ≤ p, specified by ωt = {µp : µt+1 = · · · =
µp = 0}. Here t is unknown and is to be selected on the basis of the data X. Consider the
problem of estimating µp as a vector with quadratic loss l(µp , d). If we knew t and, in fact,
that µp ∈ ωt , t < p we could use δt (X) = (X·1 , . . . , X·t , 0, . . . , 0)T and obtain
R(µp, δt) = tσ0²/n < pσ0²/n = R(µp, δp),   µp ∈ ωt,
where R is the risk function
R(µ, δ) = E|δ(X) − µ|2
and | · | denotes Euclidean distance.
This seemingly artificial situation is, in fact, a canonical model for a general
linear model replicated n times:

Yki = Σ_{j=1}^{p} zkj βj + eki,   1 ≤ k ≤ p, 1 ≤ i ≤ n,
where the eki are i.i.d. N(0, σ0²) and the zkj are distinct covariate (predictor) values. Then,
if β̂ denotes the MLE,

β̂p ≡ (β̂1, . . . , β̂p)T ∼ Np(βp, (σ0²/n)[ZpT Zp]^{−1})

where βp = (β1, . . . , βp)T, Zp = ‖zkj‖p×p, and our model P(n, p) is the special case
ZpT Zp = Jp×p. Our submodels {βp : βt+1 = · · · = βp = 0} are natural if we think the
p predictors or covariates Z1 , . . . , Zp whose influence on the distribution of Y is governed
by the βj can be ordered in terms of importance and we expect (p − t) of them to be
independent of the response Y . In the context of the model P(n, p), the questions we
address briefly now and more extensively in Chapter 12 are
(1) Suppose p = ∞ (the possible number of predictors we can measure is “very large”) and
we believe that µ ∈ ωt for some t < ∞. That is, µ ∈ {µ : µj = 0, j > t}.
Can we, without knowledge of t, estimate t by t̂ so that, as n → ∞,

E|δt̂(X) − µ|² = (tσ0²/n)(1 + o(1)),   (I.10)

where |a|² = Σ_{j=1}^{∞} aj² for a = (a1, a2, . . .)T? That is, can we asymptotically do as well not
knowing t as knowing it? The optimal procedure for the case where t is known is called
the oracle solution.
(2) Suppose that all µj can be nonzero but Σ_{j=1}^{∞} µj² < ∞. Then, we can write the risk as

Vn(t, µ) ≡ E|δt(X) − µ|² = tσ0²/n + Σ_{j=t+1}^{∞} µj².   (I.11)

There is clearly a best t(µ, n), one such that

t(µ, n) = arg min_t Vn(t, µ).
Note that t(µ, n) < ∞: taking t = ∞, that is, estimating each µj by X·j, is always a bad
idea, since putting t = 0 will always do better. Can we select t̂(n) such that

E|δt̂(n)(X) − µ|² / Vn(t(µ, n), µ) → 1   (I.12)

as n → ∞?
For question (1) we want (I.10) and (I.12) to hold uniformly over “moderately large”
sets of µ. It is natural to consider t̂ of the form: t̂ is the largest k such that, for a suitable
decreasing sequence {cn} of positive numbers, |X·j| ≤ cn for all j with k + 1 ≤ j ≤ n.
Suppose p ≤ n. Finding the cn such that this t̂ solves (I.10) then turns out to lead to a
special case of the solution to Schwarz’s Bayes criterion (SBC) (1978), which is also called
the “Bayes information criterion” (BIC). Let Pµ denote computation under µ; then take cn
such that

P0[max{|X·j| : 1 ≤ j ≤ n} ≥ cn] → 0   (I.13)

and

Pµ[|X·j| ≤ cn] → 0.   (I.14)

Since, for j ≥ t + 1, the X·j are i.i.d. N(0, σ0²/n), it is easy to see (Problem I.7.1) that

cn = σ0 √((2 log n)/n)
will achieve (I.10) without knowledge of t. To see this, compute for µ ∈ ωt,

Pµ[t̂ ≠ t] = Pµ[t̂ < t] + Pµ[t̂ > t]   (I.15)
≤ Pµ[|X·t| ≤ cn] + Pµ[max{|X·j| : t + 1 ≤ j ≤ n} > cn]

(Problem I.7.2). So, (I.13), (I.14), and (I.15) establish our answer to question (1).
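The threshold rule (I.13)–(I.15) is easy to simulate (our sketch, not from the text): σ0 = 1, X·j ∼ N(µj, 1/n) independently, the first t coordinates carrying signal, and t̂ the largest k such that |X·j| ≤ cn for all j > k, with cn = σ0 √(2 log n / n).

```python
import numpy as np

# The SBC/BIC threshold rule behind (I.13)-(I.15), simulated (our sketch,
# not from the text): t_hat is the largest index whose coordinate mean
# exceeds c_n = sigma_0 * sqrt(2 log n / n), here with sigma_0 = 1.

rng = np.random.default_rng(8)

def select_t(xbar, c):
    above = np.nonzero(np.abs(xbar) > c)[0]
    return 0 if above.size == 0 else int(above[-1]) + 1

n, t_true = 400, 5
mu = np.zeros(n)
mu[:t_true] = 1.0                    # strong signals in the first t coordinates
c_n = np.sqrt(2 * np.log(n) / n)

hits = sum(select_t(mu + rng.normal(size=n) / np.sqrt(n), c_n) == t_true
           for _ in range(200))
print(hits / 200)   # correct selection in most replications; the error
                    # probability tends to 0 (slowly) as n grows
```

With this n the noise coordinates still occasionally cross the threshold, so recovery of t is frequent but not certain — the convergence in (I.13) is logarithmically slow.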
The solution to question (2) is subtler and somewhat different and was first proposed in
a time series context by Akaike (1969) and in the regression context by Mallows (1973).
We will show that Σ_{j=t+1}^{n} X·j² can be used to obtain an unbiased estimate of Vn(t, µ).
Evidently

Eµ[Σ_{j=t+1}^{n} X·j²] = (n − t)σ0²/n + Σ_{j=t+1}^{n} µj².
Therefore,

Eµ[Σ_{j=t+1}^{n} X·j² + 2t σ0²/n − σ0²] = Vn(t, µ),

and it seems plausible, since σ0² doesn’t depend on t, that minimizing the contrast

ρn(X, k) ≡ Σ_{j=k+1}^{n} X·j² + 2k σ0²/n,   1 ≤ k ≤ n,
will yield bb
t which achieves (I.12). This can be shown under suitable conditions (Shibata
(1981)). However, it is interesting to note that this solution, which is called Mallows’ Cp ,
is quite different from the SBC solution. Because
ρn (X, j) − ρn (X, j − 1) = −X·j2 + 2
σ02
n
then ρ_n(X, j) ≤ ρ_n(X, j − 1) iff nX̄_{·j}²/σ₀² ≥ 2. That is, we prefer j parameters to j − 1 iff √n |X̄_{·j}|/σ₀ ≥ √2. Thus, Mallows’ C_p essentially uses a threshold of √2 on √n |X̄_{·j}|/σ₀
rather than the corresponding threshold √(2 log n) → ∞ for SBC. These methods and others will be discussed and contrasted with oracle solutions in Chapter 12.
✷
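The contrast between the two thresholds — √2 for Mallows’ C_p versus √(2 log n) for SBC/BIC — applied to the statistics √n|X̄_{·j}|/σ₀ can be illustrated by a hedged simulation; the particular n, t, and µ_j values below are arbitrary:

```python
import math, random

random.seed(1)
n = 400
sigma0 = 1.0
t_true = 5
# mu_j = 1 for j <= t_true and 0 beyond; Xbar_j ~ N(mu_j, sigma0^2/n).
mu = [1.0] * t_true + [0.0] * (n - t_true)
xbar = [random.gauss(m, sigma0 / math.sqrt(n)) for m in mu]

root_n_stat = [math.sqrt(n) * abs(x) / sigma0 for x in xbar]
cp_keep  = [j + 1 for j, s in enumerate(root_n_stat) if s >= math.sqrt(2)]
bic_keep = [j + 1 for j, s in enumerate(root_n_stat)
            if s >= math.sqrt(2 * math.log(n))]

print("Cp keeps", len(cp_keep), "coordinates; BIC keeps", len(bic_keep))
```

With the fixed threshold √2, C_p retains roughly a 16% fraction of the pure-noise coordinates, while the growing BIC threshold retains essentially only the truly nonzero ones.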
Remark I.7.1. Mallows’ Cp is usually discussed in the context of squared prediction error rather than squared estimation error. These two criteria are equivalent under general
conditions; see Problem I.7.5.
Remark I.7.2. More generally we can consider risks other than those based on squared
error. We shall do so in Chapter 12.
Remark I.7.3. This introductory chapter has barely touched on some of the main topics of Chapter 12. These topics include what is referred to as “Statistical Learning.” See Section 12.1 for an introduction to this topic.
Summary We have in this chapter introduced, through examples, some of the main issues,
topics, and procedures to be considered in this volume. One issue is the level of accuracy
possible in the estimation of a parameter. In Section I.3, we illustrated regular parameters, which are parameters that are relatively easy to estimate in the sense that √n times the estimation error is well behaved as n → ∞. We also discussed, in Section I.1, Doob’s heuristic, which suggests how to approximate the distribution of a functional of a sample-based
stochastic process by the distribution of the functional of the limiting stochastic process.
Other methods for obtaining approximations, Monte Carlo Methods and the bootstrap, are
illustrated in Section I.2. We also gave, in Section I.3, important parameters that are regular
but nevertheless do not fall under the framework of Vol. I and discussed efficient estimation
of parameters for a given model. In Section I.4 we discussed situations where we can construct tests, called permutations tests, whose significance level α remain the same for every
member of a general class of distributions P0 . An important subclass of the permutation
tests is the rank tests. In Section I.5 we illustrated irregular parameters, and the notion
of plugging into approximations of irregular parameters such as densities. In this context
we introduced the bias variance tradeoff. In Section I.6 we introduced Stein estimation,
which, in our illustration with p Gaussian population means, consists of providing vector
estimates with smaller average mean squared error than the vector of p sample means. We
then connected this approach to the empirical Bayes approach where we solve the Bayes
estimation problem with p Gaussian means assuming the prior is known and then use the
data to estimate the unknowns in the Bayes procedure. Finally, in Section I.7 we considered
model selection where the question is how many parameters should be included in a model
used to analyze the results of an experiment. A complex model with many parameters may
represent the true distribution of the data better than a simpler model with fewer parameters.
However, we illustrated that trying to estimate too many parameters for the given sample
size may result in worse inference than fitting a simpler model providing a less adequate
approximation to the truth. This is another reflection of the bias-variance tradeoff discussed
in Section I.5. We used a simple p-sample framework to outline the derivations of two procedures that give the most accurate result in the sense of minimizing average mean squared
error for the problem of estimating a vector of means. The first method, which applies to
the situation where all but a finite number of parameters are zero, turns out to be a special
case of the method generated by Schwarz’s Bayes criterion (also called BIC), while the
Section I.8
15
Problems and Complements
second method, which allows for an infinite sequence of nonzero means, yields a special
case of Mallows’ Cp .
I.8 PROBLEMS AND COMPLEMENTS
Problems for Section I.1
1. Let X₁, . . . , X_n be i.i.d. as X ∼ F, let F̂(x) = n⁻¹ Σ_{i=1}^n 1[X_i ≤ x] be the empirical distribution function, and let E_n(x) = √n[F̂(x) − F(x)], x ∈ R.

(a) Show that for fixed x₀, E_n(x₀) →^L N(0, F(x₀)[1 − F(x₀)]).

(b) Find Cov(E_n(x₁), E_n(x₂)) for x₁ ≤ x₂.

(c) Find the limiting law of (E_n(x₁), E_n(x₂)) for x₁ ≤ x₂ using the bivariate central limit theorem.
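For part (b), the standard indicator calculation gives Cov(E_n(x₁), E_n(x₂)) = F(x₁)[1 − F(x₂)] for x₁ ≤ x₂ (a well-known fact, stated here for the check rather than quoted from the text). A Monte Carlo sketch with uniform data:

```python
import random

random.seed(2)

def En(sample, x, F):
    """Empirical process sqrt(n) * (Fhat(x) - F(x))."""
    n = len(sample)
    return n ** 0.5 * (sum(xi <= x for xi in sample) / n - F(x))

F = lambda u: u                      # X ~ Uniform(0, 1), so F(x) = x
x1, x2, n, reps = 0.3, 0.7, 50, 4000

pairs = [(En(s, x1, F), En(s, x2, F))
         for s in ([random.random() for _ in range(n)] for _ in range(reps))]
m1 = sum(a for a, _ in pairs) / reps
m2 = sum(b for _, b in pairs) / reps
cov = sum((a - m1) * (b - m2) for a, b in pairs) / reps
print("simulated:", cov, "theory F(x1)(1 - F(x2)):", F(x1) * (1 - F(x2)))
```

Here the exact finite-sample covariance already equals the limiting one, so the simulated value differs from 0.3 · 0.3 = 0.09 only by Monte Carlo noise.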
2. Let X₁, . . . , X_m be i.i.d. as X ∼ F, Y₁, . . . , Y_n i.i.d. as Y ∼ G, and let the X’s and Y’s be independent (see Example 1.1.3). Let F̂(·) and Ĝ(·) be the empirical df’s, define N = m + n, and

E_N(t) = √(mn/N) {Ĝ(t) − F̂(t) − [G(t) − F(t)]} .

Assume that m/N → λ as N → ∞ with 0 < λ < 1.

(a) Show that for fixed t₀, as N → ∞,

E_N(t₀) →^L √λ Z₁(t₀) + √(1 − λ) Z₂(t₀)

where Z₁(t₀) and Z₂(t₀) are independent with Z₁(t₀) ∼ N(0, G(t₀)[1 − G(t₀)]) and Z₂(t₀) ∼ N(0, F(t₀)[1 − F(t₀)]).

(b) Describe the weak limit of E_N(t).
Problems for Section I.2
1. Let X₁, . . . , X_n be i.i.d. as X ∼ F(x) = Φ([x − µ]/σ). Suppose we observe Y₁, . . . , Y_m i.i.d. as Y ∼ (X | X ≥ 0). Let µ = E(X), σ² = Var(X), and θ = (µ, σ²)^T.

(a) Express E(Y) and Var(Y) as functions of θ. By replacing E(Y) and Var(Y) by Ȳ and σ̂_Y² in these two equations and solving them for θ numerically, we have method of moment estimates of µ and σ².

(b) Use Proposition 5.2.1 to outline an argument for the consistency of the estimates of µ and σ² in part (a).
(c) Use Theorem 5.2.3 to outline an argument for the consistency of the MLE of θ.
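Part (a) can be made concrete with the standard truncated normal formulas E(Y) = µ + σλ(α) and Var(Y) = σ²[1 + αλ(α) − λ(α)²], where α = −µ/σ and λ is the inverse Mills ratio (these formulas are standard facts, not taken from the text). A sketch with a crude grid search standing in for a proper numerical solver; the target moments 1.2 and 0.5 are made up:

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def upper_tail(x):
    """1 - Phi(x), computed stably via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def truncated_moments(mu, sigma):
    """E(Y), Var(Y) for Y ~ (X | X >= 0), X ~ N(mu, sigma^2),
    using standard truncated normal formulas."""
    a = -mu / sigma                      # standardized truncation point
    lam = phi(a) / upper_tail(a)         # inverse Mills ratio
    return mu + sigma * lam, sigma**2 * (1 + a * lam - lam * lam)

# Method of moments: match hypothetical sample moments (ybar, s2) by a
# crude grid search over (mu, sigma).
ybar, s2 = 1.2, 0.5
best = min(((mu / 100, sig / 100)
            for mu in range(-200, 201, 2) for sig in range(10, 300, 2)),
           key=lambda th: sum((u - v) ** 2
                              for u, v in zip(truncated_moments(*th),
                                              (ybar, s2))))
print("method of moments (mu, sigma):", best)
```

In practice one would replace the grid by a root finder, but the sketch shows that the two moment equations pin down θ.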
Problems for Section I.5
1. Suppose p is positive and Lipschitz continuous over I_j, that is, for some γ > 0 and all x, z ∈ I_j, |p(x) − p(z)| ≤ γ|x − z|. Show that

(a) BIAS p̂_h(t) = O(h), all t ∈ I_j.

(b) Var p̂_h(t) = O((nh)⁻¹), all t ∈ I_j.

(c) If p(x) is not constant on I_j, then for some constant c > 0,

c < inf_{h>0} h⁻¹|BIAS p̂_h(t)| ≤ sup_{h>0} h⁻¹|BIAS p̂_h(t)| ≤ c⁻¹ .

(d) Let t be as in (c) above. Assume that p′ exists and that 0 < |p′(x)| ≤ M < ∞ for x in a neighbourhood of t. Show that

inf_h MSE p̂_h(t) = inf_h {Var p̂_h(t) + [BIAS p̂_h(t)]²} ≍ n^{−2/3} ,

where A_n ≍ B_n iff A_n = O(B_n) and B_n = O(A_n).
Hint (a) and (b). By the mean value theorem, for some x₀ ∈ I_j,

P[I_j(t)] = ∫_{I_j} p(x) dx = h p(x₀) .

Now show that

|BIAS p̂_h(t)| ≤ γ|x₀ − t| ≤ γh,  Var p̂_h(t) ≤ p(x₀)/(nh) .

Hint (d). Show that MSE p̂_h(t) = b(nh)⁻¹ + ch² plus smaller order terms for some constants b and c. Now minimize A(h) ≡ b(nh)⁻¹ + ch² with respect to h.
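The calculus in Hint (d) can be sanity-checked numerically: setting A′(h) = 0 gives h* = (b/(2cn))^{1/3}, so the minimized A(h) scales exactly as n^{−2/3} (b = c = 1 is an arbitrary normalization):

```python
# Minimizing A(h) = b/(n h) + c h^2 from the hint: calculus gives
# h* = (b / (2 c n))^{1/3}, so A(h*) is of order n^{-2/3}.
def A_min(n, b=1.0, c=1.0):
    h_star = (b / (2 * c * n)) ** (1 / 3)
    return b / (n * h_star) + c * h_star**2

for n in (10**2, 10**4, 10**6):
    # The second column, A(h*) * n^{2/3}, stays constant across n.
    print(n, A_min(n), A_min(n) * n ** (2 / 3))
```

This is the bias-variance tradeoff of Section I.5 in miniature: undersmoothing (small h) inflates the variance term, oversmoothing inflates the bias term.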
Problems for Section I.6
1. Show that X̄ is the minimax estimate for the model P(n, p) and the risk (I.6).
Hint. See Example 3.3.3.
2. Derive formula (I.8).
Hint. The first equality is from (3.2.6).
3. Show that the Bayes risk for estimating q(θ) with quadratic loss when θ has prior π is

E Var(q(θ) | X) = ∫∫ [q(θ) − δ*(x)]² π(θ) p(x, θ) dθ dx ,

where δ*(x) is the Bayes estimate of q(θ).
4. Derive formula (I.9).
Hint. Use (I.8), (5.4.32), and Problem I.6.3.
Problems for Section I.7
1. Show that if X₁, . . . , X_n are i.i.d. N(0, 1), then

P[max(X₁, . . . , X_n) ≤ √(2 log n)] → 1 .
Hint. Use the following refinement of (7.1.9) (Feller (1968), p. 175)
(x−1 − x−3 )φ(x) ≤ (1 − Φ(x)) ≤ x−1 φ(x)
for all x > 0.
2. Verify (I.15).
Hint. If Z1 , . . . , Zn are i.i.d. N (0, 1), P [max{|Zi | : 1 ≤ i ≤ n} ≥ z] ≤ n P [|Z1 | ≥ z].
Use 1 − Φ(z) ≤ φ(z)/z for all z > 0.
3. Selecting the model to minimize MSE. The t-test revisited. Consider the simple linear regression model (e.g. Example 2.2.2)

Y_i = β₀ + β₁z_i + ε_i , i = 1, . . . , n  (I.8.1)

where ε₁, . . . , ε_n are i.i.d. as ε ∼ N(0, σ²) and the z_i’s are not all equal. Without loss of generality we assume Σ z_i = 0 and β₀ = E(Y). We are interested in estimating

µ_i ≡ µ(z_i) ≡ E(Y_i) = β₀ + β₁z_i , i = 1, . . . , n.
In this problem we assume that (I.8.1) is the true model. Even so, it may be that the MLE µ̂_{0i} = β̂₀ = Ȳ based on the submodel with β₁ = 0 has smaller risk than the MLE µ̂_{1i} ≡ Ȳ + β̂₁z_i based on the full model. For any estimate µ̂ = (µ̂₁, . . . , µ̂_n)^T we define the risk for estimating µ = (µ₁, . . . , µ_n)^T as

R(µ, µ̂) = n⁻¹ E|µ̂ − µ|² = n⁻¹ Σ_{i=1}^n E|µ̂_i − µ_i|²

where the expectation is computed for the full model (I.8.1).
(a) Show that for µ̂_j = (µ̂_{j1}, . . . , µ̂_{jn})^T,

R(µ, µ̂_j) = (1 + j)σ²/n + (1 − j)β₁² Σ_{i=1}^n z_i²/n , j = 0, 1.

Hint. See Example 6.1.2, pages 381–382.
(b) Show that unbiased estimates of R(µ, µ̂_j), j = 0, 1, are

R̂(µ, µ̂₀) ≡ n⁻¹[RSS₀ − RSS₁] = n⁻¹|µ̂₁ − µ̂₀|² = β̂₁² Σ_{i=1}^n z_i²/n  (I.8.2)

R̂(µ, µ̂₁) = 2s²/n

where RSS_j = Σ_{i=1}^n [Y_i − µ̂_{ji}]², j = 0, 1, and s² = RSS₁/(n − 2).

Hint. See Section 6.1.3.
(c) Show that the model selection rule that decides to keep β₁ in the model iff R̂(µ, µ̂₁) < R̂(µ, µ̂₀) is equivalent to a likelihood ratio test of H : β₁ = 0 vs K : β₁ ≠ 0. Show that the rule and the test can be written as

“Keep β₁ in the model iff t² > 2”

where t = β̂₁/(s/√n σ̂_z) with σ̂_z² = n⁻¹ Σ_{i=1}^n z_i² is the t-statistic for regression.

Hint. See Example 6.1.2, pages 381–382.
(d) Show that the limit as n → ∞ of the significance level of the test in (c) equals P(|Z| > √2) = 0.16, where Z ∼ N(0, 1). That is, we decide to keep β₁ if the p-value is less than 0.16. Show this result without the normality assumption. Instead assume that ε₁, . . . , ε_n are i.i.d. as ε with E(ε) = 0, E(ε²) = σ², E(ε⁴) < ∞. Also assume max_i{z_i²/Σ z_i²} → 0.

Hint. Use the Lindeberg–Feller and Slutsky theorems. See Example 5.3.3.
(e) Using (a), show that R(µ, µ̂₁) < R(µ, µ̂₀) iff τ² > 1, where τ = β₁/(σ/√n σ̂_z) is the noncentrality parameter of the distribution of the t-statistic.

Hint. See page 260, Section 4.9.2.
(f) Show that τ̂² ≡ ct² − 1 with c = (n − 4)/(n − 2) is an unbiased estimate of τ². Using τ̂², deciding in favor of keeping β₁ is now equivalent to t² > 2(n − 2)/(n − 4) for n ≥ 5.

Hint. t has the same distribution as (Z + τ)/√(V/(n − 2)), where Z ∼ N(0, 1), V ∼ χ²_{n−2}, and Z and V are independent. Use Problem B.2.4.
(g) Show that R̂(µ, µ̂₁) < R̂(µ, µ̂₀) is also equivalent to r² > 2/n, where r² is the sample correlation coefficient and we assume n ≥ 3. Also show t² ≥ 2(n − 2)/(n − 4) iff r² ≥ 2/(n − 2).

Hint. t² = nβ̂₁²σ̂_z²/s² and r² = β̂₁²σ̂_z²/σ̂_Y², where σ̂_Y² = n⁻¹ Σ_{i=1}^n (Y_i − Ȳ)². Now use the ANOVA decomposition nσ̂_Y² = RSS₁ + β̂₁² nσ̂_z² (see Section 6.1.3).
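The algebraic equivalence in (c) — R̂(µ, µ̂₁) < R̂(µ, µ̂₀) iff t² > 2 — can be checked on simulated regressions. A sketch; the design points and parameter values are arbitrary:

```python
import math, random

random.seed(3)
n = 40
z = [i - (n - 1) / 2 for i in range(n)]          # centered so sum(z) = 0
beta0, beta1, sigma = 2.0, 0.1, 1.0

def one_fit():
    y = [beta0 + beta1 * zi + random.gauss(0, sigma) for zi in z]
    ybar = sum(y) / n
    szz = sum(zi * zi for zi in z)
    b1 = sum(zi * (yi - ybar) for zi, yi in zip(z, y)) / szz
    rss1 = sum((yi - ybar - b1 * zi) ** 2 for yi, zi in zip(y, z))
    s2 = rss1 / (n - 2)
    R0_hat = b1 * b1 * szz / n        # (I.8.2)
    R1_hat = 2 * s2 / n
    t2 = b1 * b1 * szz / s2           # t^2 with t = b1 / (s / (sqrt(n) * sz))
    return R1_hat < R0_hat, t2 > 2

results = [one_fit() for _ in range(500)]
agree = sum(keep_risk == keep_t for keep_risk, keep_t in results)
print("rule agreement:", agree, "of", len(results))
```

The agreement is exact (not merely approximate), since both comparisons reduce to the same inequality nβ̂₁²σ̂_z²/s² > 2.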
4. Selecting the model to minimize MSE. The F-test revisited. Consider the linear model (see 6.1.26)

Y = Z₁β₁ + Z₂β₂ + ε  (I.8.3)

where Z₁ is n×q and Z₂ is n×(p−q), β₁ is q×1, β₂ is (p−q)×1, and ε = (ε₁, . . . , ε_n)^T with ε₁, . . . , ε_n i.i.d. as ε ∼ N(0, σ²). The entries of Z₁ and Z₂ are constants. Assume that Z ≡ (Z₁, Z₂) has rank p. We are interested in estimating the effect of the covariates on the response mean E(Y). Thus our parameters are

µ_i ≡ µ(z_i) ≡ E(Y_i) = z_i^{(1)} β₁ + z_i^{(2)} β₂ , i = 1, . . . , n

where z_i^{(j)} is the ith row of Z_j, j = 1, 2. It may be that the coefficients in β₂ are so small that in the bias-variance tradeoff we get more efficient estimates of the µ_i’s if we use the
model with β₂ = 0. Thus we want to compare the risks of the two estimates µ̂₀ = H₁Y and µ̂ = HY, where H₁ and H are the hat matrices (e.g., H₁ = Z₁(Z₁^T Z₁)⁻¹Z₁^T) for the models with β₂ = 0 and general β = (β₁^T, β₂^T)^T, respectively. The risk for estimating µ = (µ₁, . . . , µ_n)^T is

R(µ, µ̂) = n⁻¹ E|µ̂ − µ|² = n⁻¹ Σ_{i=1}^n E[µ̂_i − µ_i]²
where the expectation is computed for the full model.
(a) Set µ₀ = H₁µ. Show that

R_q ≡ R_q(µ, µ̂₀) = qσ²/n + |µ − µ₀|²/n , 1 ≤ q ≤ p .

Note that when q = p, then µ̂₀ = µ̂, and R_p(µ, µ̂) = pσ²/n.

Hint. See (6.1.15).
(b) (i) Let s² = [n − (p + 1)]⁻¹ Σ [Y_i − Ŷ_i]², where Ŷ_i is the predicted value of Y_i in the full model (I.8.3). Show that an unbiased estimate of R_p − R_q for model (I.8.3) is

R̂_p − R̂_q = 2(p − q)s²/n − |µ̂ − µ̂₀|²/n .
(ii) Show that deciding to keep Z₂β₂ in the model when R̂_p < R̂_q is equivalent to deciding β₂ ≠ 0 when F > 2(p − q), where “F > 2(p − q)” is a likelihood ratio test of H : β₂ = 0 vs K : β₂ ≠ 0.

Hint. See Example 6.1.2.
(c) (i) Use (a) to show that R_p < R_q is equivalent to θ² > p − q, where θ² = σ⁻²|µ − µ₀|² is the noncentrality parameter of the distribution of the F-statistic of Example 6.1.2.

(ii) Show that θ̂² = (n − p)⁻¹(p − q)(n − p − 2)F − (p − q) is an unbiased estimate of θ². Thus, using θ̂² > (p − q), we select the full model iff

F > (p − q + 1)(n − p) / [(p − q)(n − p − 2)] , n > p + 2 .
Hint. By Problem 8.3.13, F has the same distribution as [(p − q)⁻¹((Z₁ + θ)² + Σ_{i=2}^{p−q} Z_i²)]/[(n − p)⁻¹V], where the Z_i ∼ N(0, 1), V ∼ χ²_{n−p}, and Z₁, . . . , Z_{p−q}, V are independent.
5. Prediction and estimation are equivalent for squared error. Assume that E(Y_i) = µ_i and Var(Y_i) = σ_i², where µ_i depends on a vector z_i of covariates while σ_i² does not, i = 1, . . . , n. Let Y = (Y₁, . . . , Y_n)^T, µ = (µ₁, . . . , µ_n)^T, and let µ̂ be any estimate of µ based on Y with 0 < Var µ̂_i < ∞ for i = 1, . . . , n. Let Y₁⁰, . . . , Y_n⁰ be variables to be predicted using Y₁, . . . , Y_n. We assume that Y⁰ = (Y₁⁰, . . . , Y_n⁰)^T is independent of Y,
and that Y_i⁰ has the same mean and variance as Y_i, i = 1, . . . , n. We will use µ̂ to predict Y⁰ and define the mean squared prediction error as

MSPE = n⁻¹ E|µ̂ − Y⁰|² .

Show that selecting the model (covariates, as in Problem 4 preceding) that minimizes MSPE is equivalent to selecting the model (covariates) that minimizes the mean squared error MSE = n⁻¹ E|µ̂ − µ|².

Hint. µ̂_i − Y_i⁰ = [µ_i − Y_i⁰] + [µ̂_i − µ_i]. Complete the square keeping the square brackets intact. Because Y_i⁰ is independent of µ̂_i, we have

E[µ̂_i − Y_i⁰]² = σ_i² + E[µ̂_i − µ_i]² .  (I.8.4)

Note that the equivalence fails if Σ_{i=1}^n σ_i² depends on the covariates.
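The identity (I.8.4) behind Problem 5 is easy to verify by Monte Carlo; the means, variances, and the particular shrinkage estimate µ̂ below are arbitrary choices for the sketch:

```python
import random

random.seed(4)
n, reps = 5, 20000
mu = [1.0, 2.0, 3.0, 4.0, 5.0]
sig = [0.5, 0.8, 1.0, 1.2, 1.5]

mspe_sum = mse_sum = 0.0
for _ in range(reps):
    y  = [random.gauss(m, s) for m, s in zip(mu, sig)]   # training data
    y0 = [random.gauss(m, s) for m, s in zip(mu, sig)]   # independent copy
    # An illustrative estimate: shrink Y halfway toward the (known) mean.
    muhat = [(yi + mi) / 2 for yi, mi in zip(y, mu)]
    mspe_sum += sum((mh - y0i) ** 2 for mh, y0i in zip(muhat, y0)) / n
    mse_sum  += sum((mh - mi) ** 2 for mh, mi in zip(muhat, mu)) / n

mspe, mse = mspe_sum / reps, mse_sum / reps
mean_var = sum(s * s for s in sig) / n
print(mspe, mse + mean_var)   # (I.8.4): MSPE = MSE + n^{-1} sum sigma_i^2
```

Since the offset n⁻¹Σσ_i² does not depend on which covariates enter µ̂, ranking models by MSPE and by MSE gives the same ordering, as the problem asserts.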
I.9 Notes

Notes for Section I.1.

(1) Important exceptions are Sections 1.4 and 6.6.
Chapter 7
TOOLS FOR ASYMPTOTIC ANALYSIS
In this chapter we will present the basic tools needed for most of our subsequent chapters.
These tools include empirical process theory and maximal inequalities, the delta method
in infinite dimensional space, influence functions, functional derivatives, and Taylor type
expansions in function space including the von Mises and Hoeffding expansions. These
tools are essential for developing asymptotic inference for semiparametric models. Some
of the proofs and references will be deferred to Appendix D.
7.1 Weak Convergence in Function Spaces
7.1.1 Stochastic Processes and Weak Convergence
As we discussed in Section I.1, when our inference involves estimates of functions such as the distribution F and functionals ν(F) such as ν(F) = sup_x |F(x) − F₀(x)|, we are faced with problems involving the convergence of stochastic processes such as

{√n[F̂(x) − F₀(x)] : x ∈ R} .

We now turn to this subject. Our treatment largely parallels that of van der Vaart and Wellner (1996). Our story has to do with sequences of collections of random variables {Z(t) : t ∈ T}, also written as {Z(·)} or {Z(t; ω) : t ∈ T, ω ∈ Ω}, defined on a suitable probability space (Ω, A, P), which have a type of approximability property called separability by Doob (1953). Specifically, {Z(t) : t ∈ T} is separable if there exists a countable set C ≡ {t_j ∈ T : j ≥ 1} such that for any open interval (a, b) ⊂ R and any S ⊂ T,

{ω ∈ Ω₀ : Z(t; ω) ∈ (a, b), all t ∈ S} = {ω ∈ Ω₀ : Z(t; ω) ∈ (a, b), all t ∈ S ∩ C}  (7.1.1)

where Ω₀ = Ω − N with N a set in A with P(N) = 0. See Doob (1953) or van der Vaart and Wellner (1996) for more details.
The reason we need to introduce separability is that if T is uncountable, the distribution
of functionals such as sup{Z(t) : t ∈ T } is not determined by the finite dimensional
distributions. For instance, suppose Z(t) = c > 0 for all 0 ≤ t ≤ 1. Let τ be uniform on T = [0, 1], independent of Z(·), and let

Z*(t) = Z(t), t ≠ τ ;  Z*(τ) = 2c .

Then, for each fixed t, Z*(t) = c with probability 1, so Z*(·) has the same finite dimensional distributions as Z(·), but sup{Z*(t) : t ∈ T} = 2c ≠ c = sup{Z(t) : t ∈ T}.
Doob essentially shows that for certain collections of finite dimensional distributions
Lt1 ,...,tk , tj ∈ T , 1 ≤ j ≤ k, there is, on a suitable probability space, a separable stochastic
process {Z(t) : t ∈ T } with the same finite dimensional distributions — see Doob (1953)
for an extensive discussion.
What (7.1.1) means, in addition to guaranteeing that we can rigorously talk about the probability of the left hand side of (7.1.1), is that if, for all {t_{i1}, . . . , t_{im}} ⊂ S ∩ C and all m ≥ 1,

P[Z(t_{ij}) ∈ (a, b), 1 ≤ j ≤ m] ≥ 1 − ε,

then

P[Z(t) ∈ (a, b) for all t ∈ S] ≥ 1 − ε.
We shall call a collection {Z(t) : t ∈ T } satisfying (7.1.1) a stochastic process.
An important example of a sequence of processes which we have already encountered in Section I.1 is the classical empirical process,

E_n(t) ≡ √n (F̂(t) − F(t)) , t ∈ R,

where F̂ is the empirical df of X₁, . . . , X_n, i.i.d. F. Note, however, that this does not mean that we restrict T to be discrete or Euclidean(1).
Definition 7.1.1. Finite Dimensional Convergence. A sequence of stochastic processes Z_n(·) converges FIDI to the stochastic process Z(·) (written Z_n(·) →^{FIDI} Z(·)) iff

L(Z_n(t₁), . . . , Z_n(t_k)) → L(Z(t₁), . . . , Z(t_k))  (7.1.2)

as n → ∞ for all {t₁, . . . , t_k} ⊂ T, all k < ∞.
In Definition 7.1.1, Z(·) is not uniquely defined by (7.1.2), but we assume that a particular Z(·) is given.

As we noted in Section I.1, E_n(·) →^{FIDI} W⁰(F(·)), where {W⁰(u) : 0 ≤ u ≤ 1} is the Brownian bridge defined in Section I.1.
As the notation suggests, it is possible and important to think of any Z(·) as an abstract valued random quantity; at the very least Z(·) ∈ F(T), the set of all real valued functions on T. More precisely, if points of the set Ω, on which Z(·) is defined, are denoted by ω, then, for each ω ∈ Ω, Z(·; ω) is a function on T. If T is Euclidean, it is customary to speak of a realization of Z(·) as a sample function. It is clear that E_n(·) has bounded sample
functions. It is far less obvious but true (Problem D.2.3) that W⁰(·) has sample functions which are not only bounded but continuous:

P[ω : W⁰(·; ω) is continuous] = 1 .  (7.1.3)

Much more is known about W⁰(·), including the distribution of random variables such as sup{|W⁰(u)| : 0 ≤ u ≤ 1}, ∫₀¹ [W⁰(u)]² du, and ∫₀¹ W⁰(u) a(u) du (Problem 7.1.11). A discussion of some of its properties and the closely related Brownian motion or Wiener process W(·) defined in Section 7.1.2 is given in Billingsley (1968) and Shorack and Wellner (1986).
Here is an example which is important for asymptotic inference.
Example 7.1.1. The log likelihood process. Let X₁, . . . , X_n be i.i.d. P, where P ∈ P = {P_θ : θ ∈ R}, a regular 1 dimensional model satisfying conditions A0–A4 and A6 of Section 5.4.2 applied to ψ = ∂l/∂θ. Define as usual

l_n(θ) = Σ_{i=1}^n l(X_i, θ),

where l ≡ log p and p = p(·, θ) is the density of X under P_θ. Define for all t ∈ R,

Z_n(t) = l_n(θ₀ + t/√n) − l_n(θ₀)

with P_{θ₀} being the true underlying probability. Then, for all t ∈ R, by Problem 5.4.5, as n → ∞,

Z_n(t) = tZ_n⁰ − (t²/2) I(θ₀) + o_{P_{θ₀}}(1)  (7.1.4)

where I(θ₀) is the Fisher information and Z_n⁰ = n^{−1/2} Σ_{i=1}^n l′(X_i, θ₀) with l′ = ∂l/∂θ. By the
central limit theorem and Slutsky’s theorem, for all t₁, . . . , t_k,

(Z_n(t₁), . . . , Z_n(t_k)) →^L (Z(t₁), . . . , Z(t_k))

where

Z(t) ≡ tZ⁰ − (t²/2) I(θ₀)  (7.1.5)

and Z⁰ ∼ N(0, I(θ₀)). Therefore, Z_n(·) →^{FIDI} Z(·) for the index set T = R.
The result may readily be generalized to the case Θ ⊂ R^p, Θ open, with the model satisfying A0–A4 and A6 of Section 6.2.1. We take T = Θ and define, for all t ∈ Θ and n sufficiently large,

Z_n(t) ≡ l_n(θ₀ + t n^{−1/2}) − l_n(θ₀).
It follows (Problem 7.1.2) that

Z_n(t) = t^T Z_n⁰ − ½ t^T I(θ₀) t + o_{P_{θ₀}}(1)  (7.1.6)

where I(θ₀) is the Fisher information matrix, and

Z_n⁰ ≡ n^{−1/2} Σ_{i=1}^n l̇(X_i, θ₀)

with l̇ the gradient vector (∂l/∂θ₁, . . . , ∂l/∂θ_p)^T. Generalizing (7.1.5), Z_n(·) →^{FIDI} Z(·) where

Z(t) = t^T Z⁰ − ½ t^T I(θ₀) t

and Z⁰ ∼ N_p(0, I(θ₀)).
✷
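For the N(θ, 1) model, where I(θ) = 1 and the log likelihood is exactly quadratic in θ, the expansion (7.1.4) holds with no remainder; this sketch (sample size and grid of t values arbitrary) checks Z_n(t) = tZ_n⁰ − t²/2 numerically:

```python
import math, random

random.seed(5)
theta0, n = 0.0, 200
x = [random.gauss(theta0, 1) for _ in range(n)]

def ln(theta):
    # Log likelihood for the N(theta, 1) model; I(theta) = 1.
    return (sum(-0.5 * (xi - theta) ** 2 for xi in x)
            - n * 0.5 * math.log(2 * math.pi))

# Zn0 = n^{-1/2} sum l'(X_i, theta0), with l'(x, theta) = x - theta here.
Zn0 = sum(xi - theta0 for xi in x) / math.sqrt(n)

for t in (-2.0, -0.5, 1.0, 3.0):
    Zn_t = ln(theta0 + t / math.sqrt(n)) - ln(theta0)
    approx = t * Zn0 - 0.5 * t * t      # t Zn0 - (t^2/2) I(theta0), (7.1.4)
    print(t, Zn_t, approx)
```

For non-Gaussian regular models the two columns would instead agree up to an o_P(1) remainder, which is exactly what (7.1.4) asserts.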
We return to the general case where the index set T may not be Euclidean but may, for instance, be a function space. Let |h|_∞ ≡ sup{|h(t)| : t ∈ T}, where | · | is the Euclidean norm in R^d.
Definition 7.1.2. For the index set T , l∞ (T ) denotes the class of sample functions Z(·) on
T consisting of the set of all bounded real valued functions h on T endowed with the sup
norm |h|∞ .
Note that E_n(·) ∈ l∞(R) and W⁰(·) ∈ l∞([0, 1]). Doob’s (1949) heuristics, which started the fields of weak convergence and empirical processes, were of the following type: We know from the multivariate central limit theorem that E_n(·) →^{FIDI} W⁰(F(·)). This does not imply that the law of functions of E_n(·) such as sup_t |E_n(t)|, whose value depends on more than a fixed finite number of t’s, is necessarily close to that of the corresponding function of W⁰(·), but it seems plausible that further analysis will yield L(sup_t |E_n(t)|) −→ L(sup_t |W⁰(t)|). This heuristic was shown to be correct in this and related cases by Donsker (1952), but the theory has taken on its most satisfactory form only during the last two decades.
What are needed are conditions on processes Z_n(·) →^{FIDI} Z(·) and on functions q : F(T) −→ R such that L(q(Z_n(·))) −→ L(q(Z(·))). This strong and useful notion is called weak convergence.
Definition 7.1.3. Suppose that Z(·) and Z_n(·), n ≥ 1, are in l∞(T). Z_n(·) converges weakly to Z(·), written L(Z_n(·)) −→ L(Z(·)), or more commonly, Z_n =⇒ Z, iff

(i) Z_n →^{FIDI} Z.

(ii) For every continuous function q(·) : l∞(T) −→ R such that q(Z_n(·)) is a random variable,

L(q(Z_n(·))) −→ L(q(Z(·))) .
(iii) For each m, there exists a partition of T into a finite number of sets and a Z^{(m)}(·) ∈ l∞(T), constant on the members of the partition, such that |Z^{(m)} − Z|_∞ →^P 0 as m → ∞.
Continuity of q here means that if h_n, h ∈ l∞(T) and |h_n − h|_∞ → 0, then q(h_n) → q(h). Note that q(h) = |h|_∞ is certainly continuous, in fact, uniformly continuous, that is, for all ε > 0, there exists δ(ε) > 0 such that if |h₁ − h₂|_∞ ≤ δ(ε), then |q(h₁) − q(h₂)| ≤ ε. Even more, this q is Lipschitz continuous, that is,

|q(h₁) − q(h₂)| ≤ M|h₁ − h₂|_∞

for all h₁, h₂, for some constant M > 0.
Property (iii), finite approximation, is called tightness and any Z(·) with this property
is called tight. It implies separability.
We relate our definition of weak convergence to classical definitions such as the one in Billingsley (1968), or more modern ones in terms of outer measures such as the one in van der Vaart and Wellner (1996), by example as follows: Let C(T) be all real valued continuous functions on T. If P[Z(·) ∈ C(T)] = 1 and T = [0, 1], then (iii) is satisfied by taking t_{mj} = j/m, 0 ≤ j ≤ m, and Z^{(m)} the piecewise constant interpolation of Z(·) based on Z(t_{mj}), 0 ≤ j ≤ m. More generally, (iii) essentially says that with high probability, Z(·) is arbitrarily close to a compact subset of l∞(T) such as an equicontinuous uniformly bounded subset of l∞(T). See Theorem 7.1.1 below. See also van der Vaart and Wellner (1996) for further discussion.
We can now state the basic theorem giving sufficient (and essentially necessary) checkable conditions for weak convergence.
Theorem 7.1.1. Suppose {Z_n(·)}, Z(·) ∈ l∞(T), and

(1) Z_n(·) →^{FIDI} Z(·).

(2) There exist subsets T_{m1}, . . . , T_{mk_m} of T with ∪_{j=1}^{k_m} T_{mj} = T and δ_m, ε_m → 0 as m → ∞ such that, for all m,

lim sup_{n→∞} P[max{sup[|Z_n(s) − Z_n(t)| : s, t ∈ T_{mj}] : 1 ≤ j ≤ k_m} > ε_m] ≤ δ_m .
Then, Zn =⇒ Z.
A proof of a weaker form of this theorem is given in Appendix D.2. The full proof may be found in van der Vaart and Wellner (1996), for instance. The basic idea is that for processes Z_n of the form Z_n(t) = Σ_{j=1}^k Z_n(t_j) 1(t ∈ T_{mj}), where t_j ∈ T_{mj}, FIDI and weak convergence coincide. Condition (2) links uniform approximation of {Z_n(·)} by processes {Z_{nm}(·)} in a way which makes the weak limit of the Z_{nm}, call them Z^{(m)}, approximate Z in the sense of condition (iii) of Definition 7.1.3. Here is a development of
Example 7.1.1.
Example 7.1.2. (Example 7.1.1 continued) Weak convergence of likelihood processes. Note that, for θ ∈ R, conditions A0–A4 and A6 of Section 5.4.2 applied to ψ = ∂l/∂θ imply that, under P_{θ₀},

sup{|Z_n(t) − tZ_n⁰ + (t²/2) I(θ₀)| : |t| ≤ M} = o_{P_{θ₀}}(1).

We claim that weak convergence of Z_n(·) to Z(·) holds on all intervals T = [−M, M]. To see this, note first that Z_n(·) is separable on all such T since θ → l_n(θ) is continuous. FIDI convergence was shown in Example 7.1.1. Next, to show condition (2) of Theorem 7.1.1, let T_{mj} = M[(j − 1)/m, j/m], −m + 1 ≤ j ≤ m. Then,
sup{|(s − t)Z_n⁰ − ((s² − t²)/2) I(θ₀)| : s, t ∈ T_{mj}} ≤ (M/m)|Z_n⁰| + (M²/m) I(θ₀) .
It follows that

P[max{sup[|Z_n(s) − Z_n(t)| : s, t ∈ T_{mj}] : −m + 1 ≤ j ≤ m} > ε]
≤ Σ_{j=−m+1}^m P[sup{|(s − t)Z_n⁰ − ((s² − t²)/2) I(θ₀)| : s, t ∈ T_{mj}} > ε/2]
+ P[sup{|Z_n(s) − sZ_n⁰ + (s²/2) I(θ₀)| : |s| ≤ M} > ε/2]
≤ 2m P[|Z_n⁰| + M I(θ₀) > εm/(2M)] + o(1) .  (7.1.7)

Choose ε = ε_m → 0 so slowly that

P[|Z⁰| + M I(θ₀) > ε_m m/(2M)] ≤ 1/(2m²) .
Since Z⁰ is Gaussian, using Problem D.2.10 or (7.1.9) of the next subsection, ε_m = (log m)/m will do. Now take δ_m = 1/m, and weak convergence follows from Theorem 7.1.1.

This example also illustrates condition (iii) of Definition 7.1.3, since it is evident that we can approximate Z(t) on the partition member T_{mj} by

Z^{(m)}(t) = M(j/m) Z⁰ − ½ M²(j/m)² I(θ₀).
✷
Theorem 7.1.1 has an easy corollary (Problem 7.1.3). Let

p_m(ε) = lim sup_n max{P[sup{|Z_n(s) − Z_n(t)| : s, t ∈ T_{mj}} ≥ ε] : 1 ≤ j ≤ k_m} .  (7.1.8)

Corollary 7.1.1. If k_m p_m(ε) → 0 for all ε > 0 and Z_n →^{FIDI} Z, then Z_n ⇒ Z.
Here is a continuous mapping theorem for processes:
Proposition 7.1.1. If Zn =⇒ Z ∈ l∞ (T ) and g is a continuous map from l∞ (T ) to l∞ (S)
for some set S, then
g(Zn ) =⇒ g(Z) ∈ l∞ (S) .
Proof. We show that for any function q that maps l∞(S) to R continuously, L(q(g(Z_n))) → L(q(g(Z))). Because the continuity of q and g implies the continuity of the composition q ∘ g, (ii) of Definition 7.1.3 follows from the weak convergence of Z_n. We leave (i) and (iii) of Definition 7.1.3 for Problem 7.1.6.
✷
The previous concepts extend readily from processes Z_n(t) ∈ l∞(T) to processes (Z_n(t), U_n(s)) ∈ l∞(T) × l∞(S). For such processes a useful result is Slutsky’s theorem for processes.

Corollary 7.1.2. For sets T, S, and V, suppose g maps l∞(T) × l∞(S) to l∞(V) continuously and

(i) Z_n =⇒ Z,

(ii) U_n →^P u,

where Z_n ∈ l∞(T), U_n, u ∈ l∞(S), u is a nonrandom function, and →^P convergence is in l∞(S). Then,

g(Z_n, U_n) =⇒ g(Z, u) .
Proof. It is equivalent to show that (Z_n, U_n) =⇒ (Z, u). We establish the result for Z_n, U_n obeying the conditions of Theorem 7.1.1. Let {T_{mj}} be the sets for {Z_n}, {S_{mj}} for {U_n}. Take {Q_{ml}} = {S_{mj} × T_{mk} : all j, k}. Evidently {Q_{ml}} satisfy (2) of Theorem 7.1.1 for (Z_n, U_n). On the other hand,

(Z_n(t₁), . . . , Z_n(t_a), U_n(s₁), . . . , U_n(s_b)) →^{FIDI} (Z(t₁), . . . , Z(t_a), u(s₁), . . . , u(s_b))

by the “classical” Slutsky theorem (Theorem B.7.2). Thus, by Theorem 7.1.1, (Z_n, U_n) =⇒ (Z, u) and the result follows.
Example 7.1.3. The standardized empirical process. Let X₁, . . . , X_n be i.i.d. F on R with F continuous. Assume Donsker’s theorem 7.1.4,

E_n(·) =⇒ W⁰(·) .

For t such that F̂(t)(1 − F̂(t)) > 0, define

Z_n(t) = √n (F̂(t) − F(t)) / √(F̂(t)(1 − F̂(t))) ,

and set Z_n(t) = 0 otherwise. Let I ≡ [F⁻¹(ε), F⁻¹(1 − ε)] for ε > 0. Then

Z_n(·) =⇒ W⁰(F(·)) / √(F(·)(1 − F(·)))

on I. This follows by representing X_i as X_i = F⁻¹(U_i), i = 1, . . . , n, where the U_i are i.i.d. uniform(0, 1), so that

√n (F̂(t) − F(t)) = E_n(F(t)) ,

and using Corollary 7.1.2 — see Problem 7.1.9.
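At a fixed t in the interior interval I, the limit of the standardized process is N(0, 1), since W⁰(F(t)) ∼ N(0, F(t)(1 − F(t))). A Monte Carlo sketch with uniform data; the sample sizes are arbitrary:

```python
import math, random

random.seed(6)
n, reps, t0 = 200, 3000, 0.3
F = lambda u: u          # sample from Uniform(0, 1), so F(x) = x

vals = []
for _ in range(reps):
    s = [random.random() for _ in range(n)]
    Fhat = sum(xi <= t0 for xi in s) / n
    if 0 < Fhat < 1:
        vals.append(math.sqrt(n) * (Fhat - F(t0))
                    / math.sqrt(Fhat * (1 - Fhat)))

m = sum(vals) / len(vals)
v = sum((x - m) ** 2 for x in vals) / len(vals)
print("mean", m, "variance", v)   # at fixed t, limit is N(0, 1)
```

The self-normalization by √(F̂(1 − F̂)) is what makes the limit pivotal, which is the point of standardizing the process.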
Remark 7.1.1. Weak convergence in this general sense also has many of the important properties of FIDI convergence, mainly Theorems B.7.1, B.7.2, B.7.4, B.7.5 and generalizations of these to functions g : l∞(T) → l∞(S). We shall use these results as needed in this section and subsequently.
Summary. We introduced the concept of a stochastic process as a collection {Z(t) : t ∈ T} of random variables that are separable in the sense that the probability that Z(t) will stay in the interval (a, b) for all t in a set S ⊂ T can be computed by restricting t to the set S ∩ C for some countable set C. We consider stochastic processes that take values in the class l∞(T) of all bounded real valued functions h on T with the sup norm |h|_∞. A sequence {Z_n(t) : t ∈ T} of stochastic processes converges weakly to a stochastic process Z(t), t ∈ T, if each finite collection (Z_n(t₁), . . . , Z_n(t_k)) converges in law to (Z(t₁), . . . , Z(t_k)), if q(Z_n(·)) converges in law to q(Z(·)) for all continuous real valued functions q on l∞(T), and if Z(·) is tight in the sense that |Z^{(m)} − Z|_∞ →^P 0 as m → ∞ for some {Z^{(m)}(·)} ⊂ l∞(T), where each Z^{(m)}(·) takes on only a finite set of values. We give conditions on fluctuations of Z_n(·) over small sets that imply weak convergence in Theorem 7.1.1 and Corollary 7.1.1, and verify that they hold for the likelihood process, which is defined to be the increments of the log likelihood over intervals (θ₀, θ₀ + t/√n] for a smooth parametric model for i.i.d. X₁, . . . , X_n.
7.1.2 Maximal Inequalities
Corollary 7.1.1 suggests that the essential tools we need are bounds on P[sup{|Z_n(s) − Z_n(t)| : s, t ∈ S} ≥ λ] for subsets S of T of the form T_{mj} described in Theorem 7.1.1. There is a standard way of bounding tail probabilities for random variables (see (A.15.4)):

P[|W| ≥ λ] ≤ E h(|W|)/h(λ) for h ≥ 0, h non-decreasing.

By studying h(t) = exp(γt) for appropriate γ > 0, one can derive Bernstein’s and Hoeffding’s inequalities (B.9.5), (B.9.6). If W is N(0, 1), an important bound we shall use (Problem D.2.10) is, for λ ≥ 1,

P[|W| ≥ λ] ≤ 2 φ(λ)/λ ≤ 2 exp(−λ²/2) .  (7.1.9)
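Both bounds in (7.1.9) are easy to check numerically against the exact normal tail, with Φ computed via the error function:

```python
import math

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for lam in (1.0, 2.0, 3.0, 4.0):
    exact = 2 * (1 - Phi(lam))               # P[|W| >= lam], W ~ N(0, 1)
    bound1 = 2 * phi(lam) / lam              # first bound in (7.1.9)
    bound2 = 2 * math.exp(-lam * lam / 2)    # second bound in (7.1.9)
    print(lam, exact, bound1, bound2)
```

The first (Mills ratio) bound is sharp for large λ; the cruder exponential bound is the one used in calculations like the choice of ε_m in Example 7.1.2.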
An important refinement of Bernstein’s inequality (B.9.5) — see Hoeffding (1963) or Shorack and Wellner (1986) — is the following Hoeffding inequality. Let Y₁, . . . , Y_n be independent, |Y_i| ≤ M for some constant M > 0, E(Y_i) = 0 for i = 1, . . . , n, and σ_n² = Var(Y₁ + . . . + Y_n). Then for each λ > 0,

P[|Y₁ + . . . + Y_n|/σ_n > λ] ≤ 2 exp{−½ λ² (1 + Mλ/(3σ_n))⁻¹} .  (7.1.10)

The Mλ/(3σ_n) term makes this inequality weaker than (7.1.9) in the normal case.
If, however, we have a collection of random variables {Z(t) : t ∈ T}, tail bounds P(|Z(t)| > λ) on the individual Z(t) are not translatable into useful maximal inequalities without additional conditions. To take a trivial example, suppose T is the rational numbers and the Z(t) are independent N(0, 1). Then, no matter how small δ > 0 is, for each λ > 0,

P[sup{|Z(t)| : |t| ≤ δ} ≥ λ] = 1.
Yet, as we shall see in this section and Appendix D.2, it is quite possible to have inequalities
of the form (7.1.9) and (7.1.10) improved to maximal inequalities of the same form for
stochastic processes. Essential ingredients are appropriate tail bounds on |Z(s) − Z(t)|
and the structure of T . To develop intuition we begin with T = [0, 1]. The following
criterion, due to Kolmogorov, has been elaborated and extended in Billingsley ((1968),
p.89).
Proposition 7.1.2 Suppose that the separable stochastic process Z(·) on [0, 1] satisfies
P[|Z(t) − Z(s)| ≥ ε] ≤ M(ε)δ^{1+γ}   (7.1.11)
for all s, t, |s − t| ≤ δ, all δ > 0, and some γ > 0. Then, if the length of S ⊂ [0, 1] is ≤ δ0 ,
P[sup{|Z(t) − Z(s)| : s, t ∈ S} ≥ ε] ≤ KM(ε)δ_0^{1+γ}   (7.1.12)
for a universal constant K = K(γ).
Note that by (A.15.4), (7.1.11) follows if, for some c > 0, r > 0 and γ > 0,
E|Z(t) − Z(s)|^r ≤ c|t − s|^{1+γ}.   (7.1.13)
A remarkable feature of the proposition is that (7.1.11) is a condition which can be checked
using bivariate distributions.
In the following application, we will need
Definition 7.1.4. The Wiener process or Brownian motion on [0,1] is a separable Gaussian
process W (t), 0 ≤ t ≤ 1, with EW (t) = 0, Cov(W (s), W (t)) = s ∧ t.
Example 7.1.4. The partial sum process. The following process arises in statistics
primarily in the context of sequential analysis. Let X_1, . . . , X_n be i.i.d. as X, E(X) = 0, Var(X) = σ² < ∞, E|X|^{2+γ} < ∞, γ > 0. Define S_k = Σ_{j=1}^{k} X_j and

W_n(t) = n^{−1/2} S_{[nt]}/σ,  0 ≤ t ≤ 1,   (7.1.14)
Chapter 7 Tools for Asymptotic Analysis
where [x] is the largest integer ≤ x.
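For intuition, the process (7.1.14) is easy to simulate on the grid t = k/n; this sketch assumes standard normal increments, so σ = 1 is known rather than estimated:

```python
import math, random

random.seed(1)

def partial_sum_process(xs, sigma):
    # W_n(t) = n^{-1/2} S_[nt] / sigma evaluated on the grid t = k/n
    n = len(xs)
    s, path = 0.0, [0.0]
    for x in xs:
        s += x
        path.append(s / (sigma * math.sqrt(n)))
    return path  # path[k] = W_n(k/n)

n = 1000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
w = partial_sum_process(xs, sigma=1.0)
# for large n, w[-1] = W_n(1) is approximately N(0, 1)
```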
Proposition 7.1.3. If W (·) is the Wiener process on [0,1], then
Wn ⇒ W .
Proof: Note that

W_n(t) = ([nt]/n)^{1/2} S_{[nt]}/(σ[nt]^{1/2}),  n^{−1} ≤ t ≤ 1.
Let t1 < . . . < tk . Then (Problem 7.1.16), by the k variate central limit theorem,
(Wn (t1 ), Wn (t2 ) − Wn (t1 ), . . . , Wn (tk ) − Wn (tk−1 )) ⇒ (U1 , . . . , Uk ), (7.1.15)
where U1 , . . . , Uk are independent U1 ∼ N (0, t1 ), Uj ∼ N (0, tj − tj−1 ), 2 ≤ j ≤ k. It
is easy to see that
(U1 , U1 + U2 , . . . , U1 + . . . + Uk ) ∼ (W (t1 ), . . . , W (tk )) .
Now we show that convergence is not just FIDI but weak by using Proposition 7.1.1 and
Corollary 7.1.1. By Markov’s inequality
P[|W_n(t) − W_n(s)| ≥ λ] ≤ λ^{−(2+γ)} E|W_n(t) − W_n(s)|^{2+γ}
  = λ^{−(2+γ)} E|Σ_{j=[ns]+1}^{[nt]} X_j/(σ√n)|^{2+γ}.   (7.1.16)
Without loss of generality, take σ = 1. By an extension of the proof of (5.3.3) to all γ > 0 based on the Marcinkiewicz–Zygmund inequality (1939),

E|Σ_{j=[ns]+1}^{[nt]} X_j|^{2+γ} ≤ ([nt] − [ns])^{1+γ/2} E|X|^{2+γ},

and thus for M = E|X|^{2+γ},

E|W_n(t) − W_n(s)|^{2+γ} ≤ M|s − t|^{1+γ/2}.   (7.1.17)
We conclude from Proposition 7.1.2 that, for all fixed t0 ,
P[sup{|W_n(t_0) − W_n(s)| : t_0 ≤ s ≤ t_0 + δ} ≥ λ] ≤ λ^{−(2+γ)} Mδ^{1+γ/2}

and

P[sup{|W_n(t) − W_n(s)| : t_0 ≤ s, t ≤ t_0 + δ} ≥ λ]
  ≤ 2P[sup{|W_n(s) − W_n(t_0)| : t_0 ≤ s ≤ t_0 + δ} ≥ λ/2]
  ≤ λ^{−(2+γ)} 2^{γ+3} Mδ^{1+γ/2}.   (7.1.18)
Take T_mj = [j/m, (j + 1)/m], 0 ≤ j ≤ m − 1. Then δ = m^{−1} and, by (7.1.18), p_m(ε) of Corollary 7.1.1 is bounded above by cm^{−(1+γ/2)}. Thus, because k_m = m, Corollary 7.1.1 establishes weak convergence. ✷
Remark 7.1.2.
1) The structure of the processes Wn (·) appears only in a weak way, through the FIDI
convergence and the inequality (7.1.12). Thus, results such as Proposition 7.1.2 can be
proved for dependent Xj as well.
2) The condition E|X|^{2+γ} < ∞ in Example 7.1.4 is superfluous. Only second moments are needed; see Billingsley (1968).
3) The process W (·) has continuous sample functions (Problem D.2.3).
4) The property (7.1.15) reveals the fundamental property of the Wiener process. It has
stationary independent Gaussian increments.
✷
Summary. We gave a result which gives bounds on the probability that the maximum of
the fluctuations of a stochastic process over a set S exceeds ε. We introduced the Wiener
process and the partial sum process and showed how the result can be used to establish the
weak convergence of the partial sum process to a Wiener process.
7.1.3 Empirical Processes on Function Spaces
We could at this stage prove weak convergence results for the classical empirical process
of Section I.1. We choose instead to introduce and develop the general theory of weak
convergence of empirical processes which we use in Chapters 9–12.
We specialize in the rest of this section to a subset T of the set of functions h from X
to R, in the context of X_1, . . . , X_n ∈ X i.i.d. as X ∼ P, and E_P h²(X) < ∞. We define the empirical process on T by

E_n(h) = (1/√n) Σ_{i=1}^{n} (h(X_i) − E_P h(X_i))   (7.1.19)
for h ∈ T . The empirical process on the line that we defined earlier is a special case with
T = {h : h(·) = hx (·), x ∈ R} where hx (·) = 1(−∞, x](·). Note that we have made
a notational change by now writing En (hx ) for what was denoted by En (x) in Section I.1
and Section 7.1.1. Weak convergence of En for this T will be what we need for Donsker’s
theorem and, as we noted, maximal inequalities for En on suitable subsets of T are what
we’ll need for the proof.
Another En on a suitable T appeared in Examples 7.1.1 and 7.1.2 with
T = {log p(·, θ) : θ ∈ Θ} .
If we define, generally,
||En ||T ≡ sup{|En (h)| : h ∈ T }
we will find that remainder terms in the proofs of asymptotic results in this book will be
naturally bounded by ||En ||T for suitable T .
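For the classical case T = {1(−∞, x] : x ∈ R}, ||E_n||_T is √n times the Kolmogorov–Smirnov distance; this sketch (with a hypothetical function name) computes it for a uniform sample, where it is O_P(1):

```python
import math, random

random.seed(2)

def sup_empirical_process(xs):
    # ||E_n||_T = sqrt(n) * sup_x |Fhat(x) - F(x)| for T = {1(-inf, x]};
    # here F is the U(0,1) cdf, so F(x) = x
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Fhat jumps at each order statistic: check both one-sided gaps
        d = max(d, abs((i + 1) / n - x), abs(i / n - x))
    return math.sqrt(n) * d

vals = sup_empirical_process([random.random() for _ in range(500)])
assert vals > 0.0
```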
To obtain useful maximal inequalities for En (·), we need essentially to consider T
which are not too big. If T = {all bounded functions} for instance, we do not get useful bounds (Problem 7.1.7). There are many ways of restricting T . We limit ourselves here
to bounding the bracketing number. Note in passing that by specializing to En , we have
completely specified the joint behavior of En (h1 ), . . . , En (hk ) and hence FIDI limits of the
En (·). The bracketing numbers take advantage of this structure.
Definition 7.1.5. A bracket (f, f̄) is the set of functions h on X such that h ∈ (f, f̄), that is, f(x) ≤ h(x) ≤ f̄(x) for all x. We assume both f and f̄ are in L₂(P), but f and f̄ themselves don't have to belong to T.
Definition 7.1.6. The δ bracketing number N_{[ ]}(δ, T, L₂(P)) for a subset T of functions on X is the smallest number of brackets (f_i, f̄_i) with f_i, f̄_i ∈ L₂(P) such that
(a) for every h ∈ T, there exist f_i, f̄_i such that h ∈ (f_i, f̄_i);
(b) E_P(f̄_i − f_i)²(X) ≤ δ².
Definition 7.1.7. The envelope function of a set T as in Definition 7.1.6 is s(x) = sup{|h(x)| : h ∈ T}.
We can get upper bounds on N[ ] by finding the number of brackets for some collection
of brackets satisfying (b). Here is an example.
Example 7.1.5. The classical empirical process. Suppose X = R, F (x) = P (−∞, x]
is continuous and strictly increasing, and T = {1(−∞, x] : x ∈ R}. For a given δ ∈
(0, 1/2), set k = [1/δ²] where [t] is the greatest integer ≤ t, and let x(δ²), x(2δ²), . . . , x(kδ²) be the unique δ², 2δ², . . . , kδ² quantiles of P. We set x(0) = −∞ and x((k + 1)δ²) = ∞. Next, for i = 0, . . . , k, define the brackets

(f_i, f̄_i) = (1(−∞, x(iδ²)], 1(−∞, x((i + 1)δ²)]).

Then, for each h ∈ T, there is i ∈ {0, . . . , k} such that h ∈ (f_i, f̄_i), and

E_P(f̄_i − f_i)²(X) = P(x(iδ²) < X ≤ x((i + 1)δ²)) ≤ δ².
For this collection of brackets, the number of brackets is k + 1 = [1/δ 2 ] + 1. Thus,
N[ ] (δ, T, L2 (P )) ≤ 1 + [1/δ 2 ]. This upper bound is, in fact, sharp — but only upper
bounds are needed. For this X and T , the envelope function is s(T ) ≡ 1.
✷
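The count in Example 7.1.5 can be mirrored in code; this minimal sketch assumes P = U(0, 1), for which the iδ² quantile is simply iδ²:

```python
def bracket_count(delta):
    # Number of brackets used in Example 7.1.5 for P = U(0,1), where the
    # i*delta^2 quantile is x(i*delta^2) = i*delta^2: the count is
    # k + 1 with k = [1/delta^2].  (1e-9 guards against floating error.)
    k = int(1.0 / delta ** 2 + 1e-9)
    return k + 1

assert bracket_count(0.5) == 5     # [1/.25] + 1
assert bracket_count(0.1) == 101   # [1/.01] + 1
```

Each bracket has L₂(P) size at most δ, so these counts bound N_{[ ]}(δ, T, L₂(P)) from above.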
We continue by stating and proving a maximal inequality not for the empirical process
but for its Gaussian limit, the “Brownian bridge.” We then state the appropriate analogue
for the empirical process and discuss the new difficulties introduced but only give proofs of
the theorem in Appendix D.2.
Let T be a set of functions as above.
Definition 7.1.8. The Brownian bridge on T (with respect to P) is the Gaussian process W_P^0(f), f ∈ T, with

E W_P^0(f) = 0,  Cov(W_P^0(f), W_P^0(g)) = Cov_P(f(X), g(X)).
Note that by the central limit theorem, for h_j ∈ T, (E_n(h_1), . . . , E_n(h_k)) will converge in law to (W_P^0(h_1), . . . , W_P^0(h_k)), which incidentally establishes that W_P^0 is definable as a Gaussian process, as well as that E_n(·) converges FIDI to W_P^0(·).
The critical connection between W_P^0(f) and bracketing is that if (f, f̄) is a bracket of L₂(P) size δ and if f, g ∈ (f, f̄), then

E(W_P^0(g) − W_P^0(f))² ≤ E_P(g − f)²(X) ≤ E_P(f̄ − f)²(X) ≤ δ².
We state an analogue of Theorem 2.14.16 of van der Vaart and Wellner (1996), which
is also a crude version of a result of Talagrand (1994).
Theorem 7.1.2. Suppose that for all x ∈ X, f ∈ T:
(i) For some fixed envelope function s ∈ T, |f(x)| ≤ s(x).
(ii) For some positive constants c and d, N_{[ ]}(δ, T, L₂(P)) ≤ cδ^{−d} for all 0 < δ < 1.
(iii) For some positive constant γ, E_P f²(X) ≤ γ².
Then,

P[sup{|W_P^0(f)| : f ∈ T} ≥ λ] ≤ C(1 + λ/γ)^{2d(1+ε)} exp(−λ²/2γ²),   (7.1.20)

for all ε > 0 and C a constant.
Note that (ii) implies (i) (Problem D.2.9).
The proof we shall sketch in Appendix D.2 is essentially that of van der Vaart and
Wellner (1996) following Pollard (1984). A much better result due to Talagrand (1994)
shows that 2d(1+ǫ) can be replaced by d−1 with C replaced by Cγ −2d — see Proposition
A.2.7 in van der Vaart and Wellner. This is the best possible (Problem D.2.7).
In the future, we shall refer to the condition (ii) above as the “polynomial bracketing”
condition.
In the discussion following the proof of Theorem 7.1.2 in Appendix D.2 it is argued that
the result follows from a “chaining” argument which involves showing that we can write
W_P^0(f) = W_P^0(g_{m_0}) + Σ_{m=m_0}^{∞} [W_P^0(g_{m+1}) − W_P^0(g_m)]   (7.1.21)
where the {g_m} are representative members of the bracket sets (f_{mj}, f̄_{mj}). It is also argued that P[sup{|W_P^0(f)| : f ∈ T} ≥ λ] is of the same order exp{−λ²/2γ²} as the corresponding tail probability for a single W_P^0(f). This result enables us to show that W_P^0(f) is continuous in | · |_∞ on T; see below.
Remarkably, a result almost as good is available for En , independent of n. This is a
slightly cruder form of Theorem 2.14.16 of van der Vaart and Wellner (1996).
Theorem 7.1.3. Suppose T satisfies the conditions of Theorem 7.1.2 and the envelope function s(·) in condition (i) of Theorem 7.1.2 is uniformly bounded by M < ∞. Then,

P[||E_n||_T ≥ Mλ] ≤ Cγ^{−2d}(1 + λ/γ)^{4d} exp{−(1/2)λ²(γ² + (3 + λ)n^{−1/2})^{−1}}.   (7.1.22)
The argument for this theorem, which we discuss somewhat further in Appendix D.2,
replaces the use of the Gaussian tail inequality (7.1.9) by Hoeffding’s inequality (7.1.10)
and the change in the bound reflects this. The chaining method described in Appendix D.2
relies on bounds for the probabilities of deviation from 0 of increments with small variance. For En , this behaviour is intrinsically worse than for WP0 . For instance, the classical
empirical process sample functions, √n(F̂(·) − F(·)), have jumps while W_P^0(F(·)) has
continuous sample functions. This leads to the need for arguing that although individual
gm+1 − gm may be unusually large on their scale, having more than one such element of a
chain deviate is sufficiently unlikely. Furthermore, the requirement that s(·) is bounded has
to be imposed since, for an arbitrary F (·), a bound of the type given in (7.1.22) couldn’t
hold even if T were a single function.
We now apply Theorem 7.1.2 to a first example.
Corollary 7.1.3. Continuity of W_P^0(f). Suppose T satisfies the conditions of Theorem 7.1.2. Let ||f||_2 ≡ (E_P f²(X))^{1/2}. Then W_P^0(·) is continuous in probability with respect to || · ||_2, in the sense that, for f_0 ∈ T,

P[sup{|W_P^0(f) − W_P^0(f_0)| : ||f − f_0||_2 ≤ δ, f ∈ T} ≥ ε] → 0   (7.1.23)

for each ε > 0 as δ → 0.
Proof. Set W_0(f) = W_P^0(f) − W_P^0(f_0). Then W_0(f) is a Gaussian process with mean zero and covariance E[W_0(f)W_0(g)] = Cov_P((f − f_0)(X), (g − f_0)(X)). That is, W_0(f) has the same distribution as W_P^0(f − f_0). The result follows from the bound of Theorem 7.1.2. ✷
Note that (7.1.23) is weaker than the more natural definition of continuity with probability one: on a suitable probability space (Ω, A, P), we can define W_P^0(·, ω) such that

P[ω : sup{|W_P^0(f, ω) − W_P^0(f_0, ω)| : ||f − f_0||_2 ≤ δ, f ∈ T} → 0 as δ → 0] = 1.   (7.1.24)
But this can also be derived. See Problem D.2.3.
As an application of Corollary 7.1.3, take T = {1[0, u] : u ∈ [0, 1]} and P the uniform
distribution on [0, 1]. Identify 1[0, u] with u. Then W 0 (u) is the usual Brownian bridge of
Section I.1. We deduce from Corollary 7.1.3 that, for u0 ∈ [0, 1],
P [sup{|W 0 (u) − W 0 (u0 )| : |u − u0 | ≤ δ, u ∈ [0, 1]} ≥ ǫ] → 0
as δ → 0 for all ǫ > 0. To see this, note that if v > u,
E{1(X ∈ [0, v]) − 1(X ∈ [0, u])}2 = E{1(X ∈ (u, v])}2 = v − u,
and use Example 7.1.5 and Theorem 7.1.2. The statement (7.1.24) also holds in this example.
We continue with the identification of T with {1[0, u] : u ∈ [0, 1]} and write E_n(u) and W_P^0(u) for E_n(1[0, u]) and W_P^0(1[0, u]) as before.
Theorem 7.1.4 (Donsker (1952)). If T = {1[0, u] : u ∈ [0, 1]} and P = U[0, 1], then for
En and W 0 as above
En =⇒ W 0 .
Proof. The structure of this argument parallels that of Proposition 7.1.2. We need only
check the conditions of Corollary 7.1.1. FIDI convergence, as we noted previously, follows
from the multivariate central limit theorem. Let T_mj = (jδ_m, (j + 1)δ_m], 0 ≤ j ≤ [1/δ_m] − 1, with δ = δ_m to be chosen. Now E_n(v) − E_n(jδ_m) = E_n(1(jδ_m, v]). The set of all {1(jδ_m, v] : jδ_m < v ≤ (j + 1)δ_m} is easily seen to obey the conditions of Theorem 7.1.2 with d = 1, γ² = δ_m(1 − δ_m) by arguing as in Example 7.1.5. Thus, using (7.1.22),
P[sup{|E_n(v) − E_n(jδ_m)| : jδ_m < v ≤ (j + 1)δ_m} > λ/2]
  ≤ Cδ_m^{−2}(1 + λδ_m^{−1/2})^4 exp{−(1/2)(λ/2)²(δ_m + (3 + λ)n^{−1/2})^{−1}}.
Put λ = 2/m, δm = 1/m3 (say), and the condition of Corollary 7.1.1 holds. The theorem
follows.
✷
More generally, we state (see Problem 7.1.17)
Theorem 7.1.5. Suppose T can be partitioned into sets {Tmj }, j = 1, . . . , km , such
that each Tmj satisfies the conditions of Theorem 7.1.3, polynomial {c, d} bracketing, and
uniformly bounded envelope s(·) ≤ M . Suppose c, d, and M are independent of m, and
suppose
sup{E_P(f − g)²(X) : f, g ∈ T_mj} ≤ γ_m²

for all m, where γ_m² = o(1/log k_m). Then, if X_1, . . . , X_n are i.i.d. P and f ∈ T,

E_n(f) ⇒ W_P^0(f).
Note, for future reference, that the only properties specific to En that are used are the
maximal inequalities. Thus weak convergence results can be expected to hold for much
wider classes of processes, for instance, empirical processes for weakly dependent stationary sequences — see Doukhan (1995) for instance.
We continue with
Example 7.1.6. The Kolmogorov statistic. We immediately deduce from Donsker's theorem that if F_0 = U(0, 1), under H : F = F_0,

P[√n sup_u |F̂(u) − F_0(u)| > λ] → P[sup_u |W^0(u)| > λ] = 2 Σ_{j=1}^{∞} (−1)^{j+1} e^{−2j²λ²}   (7.1.25)
where the second equality is from Shorack and Wellner (1986). From Proposition 4.1.1, we
know that the left hand side of (7.1.25) in fact has the same distribution under F0 for any
F0 continuous. What if F0 is not continuous? It follows from Theorem 7.1.5 that, under
H : F = F0 , for T = {1(−∞, u] : u ∈ R}, with 1(−∞, u] identified with u,
E_n(·) ⇒ W_{F_0}^0(·) ≡ W^0(F_0(·)).

Although the distribution of sup_x |W_{F_0}^0(x)| depends on F_0 if F_0 is not continuous, it is still true that the distribution of √n sup_x |F̂(x) − F_0(x)| converges weakly to that of sup_x |W_{F_0}^0(x)|.
supx |WF00 (x)|. We will use this remark and the bootstrap in Chapter 10 to set confidence
bounds for F based on the Kolmogorov statistic which are narrower than those using the tabled values based on the right-hand side of (7.1.25). See Example 10.3.4.
✷
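The series on the right-hand side of (7.1.25) converges very quickly; a short sketch evaluates it and recovers the familiar tabled 5% critical value (about 1.358):

```python
import math

def kolmogorov_limit_tail(lam, terms=100):
    # right-hand side of (7.1.25):
    # P[sup_u |W0(u)| > lam] = 2 * sum_{j>=1} (-1)^{j+1} exp(-2 j^2 lam^2)
    return 2 * sum((-1) ** (j + 1) * math.exp(-2 * j ** 2 * lam ** 2)
                   for j in range(1, terms + 1))

# the familiar 5% critical value of the Kolmogorov statistic
assert abs(kolmogorov_limit_tail(1.358) - 0.05) < 1e-3
```

For λ ≥ 1 the first term alone is already accurate to several decimal places.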
Example 7.1.7. The Cramér-von Mises statistic of Problem 4.1.11 is

C_n ≡ n ∫_{−∞}^{∞} (F̂(x) − F_0(x))² dF_0(x).
If F_0 is continuous, the distribution of this statistic under H does not depend on F_0. Take F_0 = U(0, 1). Since the map f → ∫_0^1 f²(u) du is continuous from l_∞[0, 1] to R⁺, we immediately deduce from Donsker's theorem that

L_{F_0}(C_n) → L(∫_0^1 [W^0]²(u) du).
Although the density function of this limit does not have a nice analytic form (see Example 7.3.1), its characteristic function does:

E exp{it ∫_0^1 [W^0]²(u) du} = Π_{j=1}^{∞} (1 − 2λ_j it)^{−1/2}

with λ_j = (jπ)^{−2}, from Shorack and Wellner (1986), p. 215.
✷
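As a consistency check on the eigenvalues, Σ_j λ_j with λ_j = (jπ)^{−2} equals E∫_0^1 [W^0]²(u) du = ∫_0^1 u(1 − u) du = 1/6, which a one-line computation confirms numerically:

```python
import math

# sum_j lambda_j with lambda_j = (j*pi)^{-2}; this equals
# E integral_0^1 [W0]^2(u) du = integral_0^1 u(1-u) du = 1/6
s = sum((j * math.pi) ** -2 for j in range(1, 100000))
assert abs(s - 1 / 6) < 1e-4
```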
Classes of functions T for which En (·) =⇒ WP0 (·) are called Donsker classes. They
include indicators of quadrants and hyperplanes in Rd and much else. Validation of weak
convergence involves, as we have seen, being able to find partitions of the sets of functions
T such that the oscillations of En over each member of the partition tend to be small. This
requires using maximal inequalities in the way we have. An important, but in general
overly crude application of the maximal inequalities we have, establishes generalizations
of the law of large numbers. We obtain from Theorem 7.1.3
Proposition 7.1.4. If T obeys the conditions of Theorem 7.1.3,

sup{|(1/n) Σ_{i=1}^{n} (h(X_i) − E_P h(X))| : h ∈ T} →_P 0.   (7.1.26)

Proof. The result follows because Theorem 7.1.3 tells us that

sup{|E_n(h)| : h ∈ T} = O_P(1),
which implies (7.1.26) on multiplying both sides by n^{−1/2}. ✷
In fact, more can be shown: The convergence in (7.1.26) is almost sure (Problem
7.1.10). A famous special case is the Glivenko-Cantelli theorem: with probability 1,

sup_x |F̂(x) − F(x)| → 0.   (7.1.27)
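A direct simulation sketch of (7.1.27) for F = U(0, 1); the sup distance typically shrinks at rate n^{−1/2}:

```python
import random

random.seed(3)

def ks_sup(xs):
    # sup_x |Fhat(x) - F(x)| for F = U(0,1)
    xs = sorted(xs)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

d_small = ks_sup([random.random() for _ in range(100)])
d_large = ks_sup([random.random() for _ in range(10000)])
assert d_large < d_small   # the sup distance shrinks as n grows
```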
We will apply the weak convergence results further in Sections 7.2 and 9.2. The maximal inequalities will be directly useful both in Section 7.2 and in Chapters 9 and 11.
We close the discussion in this section noting that we do not enter further into the other beautiful ideas underlying maximal inequalities, such as symmetrization, viewing the sample obtained as a sample without replacement from a larger sample, and Vapnik-Chervonenkis theory, among others. Our focus in this section and the following sections
has been and will be the application of these ideas to develop the asymptotic tools needed
for statistical theory.
Summary. We introduced separable stochastic processes {Z(t) : t ∈ T } on a general
space T which typically is a space of functions and defined the weak convergence of a sequence of stochastic processes Zn (·) to a tight stochastic process Z(·). Weak convergence
allows us to conclude that the distribution of a l∞ continuous function q of Zn (·) converges to the distribution of q(Z(·)). We gave sufficient conditions for weak convergence
in terms of probability bounds on oscillations of Zn (·). This led to the consideration of
maximal inequalities that provide such bounds. We applied this framework and results to
the likelihood and partial sum processes and established weak convergence of the partial
sum process to the Wiener process. We devoted Section 7.1.3 to the empirical process
E_n(h) = n^{1/2} ∫ [h(x) − Eh(X)] dP̂(x)
for h in some space T of functions, where P̂ is the empirical probability of X_1, . . . , X_n
i.i.d. as X ∼ P . En (·) converges weakly to a Brownian bridge WP0 (·) on T , which is
defined to be a zero mean Gaussian process with
Cov(W_P^0(g), W_P^0(h)) = Cov_P(g(X), h(X)).
For such processes, we introduced the bracketing entropy (number) and established maximal inequalities for WP0 (·) and En (h) under envelope bounds on h(x) and polynomial
bounds on the bracketing number (the polynomial bracketing condition). We established
Donsker’s theorem which states that for uniform P the classical empirical process converges weakly to the Brownian bridge W 0 and we generalized this result to give conditions
under which {E_n(h) : h ∈ T} converges weakly to {W_P^0(h) : h ∈ T}. A class of functions T such that {E_n(h) : h ∈ T} converges weakly to {W_P^0(h) : h ∈ T} is called
a Donsker class. Finally, we used the results to derive the asymptotic distributions of the
Kolmogorov and Cramér-von Mises statistics, and to give a proof of the Glivenko-Cantelli
theorem.
7.2 The Delta Method in Infinite Dimensional Space
We have applied the stochastic delta method successfully to M estimates in Sections 5.4
and 6.2 by Taylor expanding the estimating equations defining the parameter ν(P ). There
are many examples, such as the minimum distance estimates and trimmed mean of Section
I.3, where it is unclear what a linear approximation to ν̂_n = ν(P̂) means. The first and
perhaps the most valuable part of this section is the development of a heuristic method
which yields the “correct” linear approximation if one exists. In this section, as in most of
the text, we are limited to the i.i.d. case.
7.2.1 Influence Functions. The Gâteaux and Fréchet Derivatives
Let P̂ denote the empirical distribution of X_1, . . . , X_n i.i.d. as P. What is a linear approximation to a statistic ν̂_n = ν(P̂) which is a consistent estimate of a parameter ν(P)? In
define it to be an expression of the form
ν(P) + (1/n) Σ_{i=1}^{n} ψ(X_i, P) = ν(P) + ∫ ψ(x, P) dP̂(x)   (7.2.1)
where E_P ψ(X, P) = 0 and E_P ψ²(X, P) < ∞. Since

(1/n) Σ_{i=1}^{n} ψ(X_i, P) = O_P(n^{−1/2}),

we shall say that (7.2.1) is a valid linear approximation to ν̂_n if

ν̂_n − ν(P) − ∫ ψ(x, P) dP̂(x) = o_P(n^{−1/2}).   (7.2.2)
Thus ν̂_n − ν(P) is asymptotically an average of i.i.d. variables and √n(ν̂_n − ν(P)) will be asymptotically normal. For instance, if ν̂_n = ν(P̂) = h(X̄) for some differentiable function h, then ψ(x, P) = h′(µ)(x − µ) and

√n(h(X̄) − h(µ)) = √n h′(µ)(X̄ − µ) + o_P(1).
See the proof of (A.14.17).
The function ψ(·, P) is unique if it exists (Problem 7.2.19) and we shall call it the influence function of ν(P̂).
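As an illustration (with an example choice h(t) = t², not from the text), the influence function ψ(x, P) = 2µ(x − µ) makes (7.2.2) directly checkable; the remainder below is exactly (X̄ − µ)² = O_P(n^{−1}):

```python
import random

random.seed(4)

def nu(sample):
    # plug-in estimate nu(Phat) = h(Xbar) with h(t) = t^2 (our example h)
    m = sum(sample) / len(sample)
    return m * m

# Influence function of h(Xbar): psi(x, P) = h'(mu)(x - mu) = 2*mu*(x - mu)
mu, sigma, n = 1.0, 1.0, 100000
xs = [random.gauss(mu, sigma) for _ in range(n)]
linear = mu ** 2 + sum(2 * mu * (x - mu) for x in xs) / n

# the remainder nu(Phat) - linear equals (Xbar - mu)^2 = O_P(1/n)
assert abs(nu(xs) - linear) < 0.01
```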
The terminology and the basic idea of the approximation come from Section 5.4.1. Let
ν:M→R
be a parameter where
(i) M ⊃ P, with P the model of primary interest.
(ii) M ⊃ all distributions with finite support ⊂ X , i.e., M includes the empirical distribution.
(iii) M is convex. If P and Q belong to M, then
(1 − ε)P + ε Q ∈ M for all 0 ≤ ε ≤ 1 .
Define, in accord with Huber (1981) the Gâteaux derivative of ν at P in the direction
of Q by
ψ_0(Q, P) = (∂/∂ε) ν((1 − ε)P + εQ)|_{ε=0}   (7.2.3)
where the one-sided derivative is assumed to exist. If the influence function as defined above exists, then under general conditions (e.g., Huber (1981); Bickel, Klaassen, Ritov and Wellner (1993, 1998)), it can be computed as a function of x ∈ X as

ψ(x, P) ≡ ψ_0(δ_x, P)   (7.2.4)
where δx is point mass at x. Under appropriate conditions (see Example 7.2.1 and Problem
7.2.3)
ψ_0(Q, P) = ∫ ψ(x, P) dQ(x)   (7.2.5)
for all Q ∈ M. In particular, since ψ0 (P, P ) = 0, we have a fundamental property of the
influence function,
E_P ψ(X, P) = 0.   (7.2.6)
Why should we expect (7.2.2) and (7.2.5) to hold? Here is an example where they are valid.
If X = {x1 , . . . , xk }, then we are in the multinomial case we studied in Section 5.4.1.
Example 7.2.1. The multinomial case. We can identify P with (p1 , . . . , pk )T where pj =
P {X = xj }, j = 1, . . . , k, and we can write the parameter ν(P ) as h(p1 , . . . , pk ) for some
h : Rk → R. In this case we know from Theorem 5.3.2 and Section 5.4.1 that the influence
function ψ(x, P) is obtained from the total differential. We take M = P = {q : 0 < q_i < 1, Σ_{i=1}^{k} q_i = 1}. Because of the restriction Σ_{i=1}^{k} q_i = 1, the total differential evaluated at x = x_j under the conditions of Section 5.4.1 is (see (B.8.14))

ψ(x_j, P) = lim_{ε↓0} (1/ε)[h((1 − ε)p_1, . . . , (1 − ε)p_{j−1}, (1 − ε)p_j + ε, (1 − ε)p_{j+1}, . . . , (1 − ε)p_k) − h(p_1, . . . , p_k)]
  = (∂h/∂p_j)(p) − Σ_{t=1}^{k} (∂h/∂p_t)(p) p_t = Σ_{t=1}^{k} (∂h/∂p_t)(p)(1(x_t = x_j) − p_t)   (7.2.7)
and if Q ↔ (q_1, . . . , q_k),

ψ_0(Q, P) = lim_{ε↓0} (1/ε)[h((1 − ε)p_1 + εq_1, . . . , (1 − ε)p_k + εq_k) − h(p_1, . . . , p_k)]
  = Σ_{j=1}^{k} (∂h/∂p_j)(p)(q_j − p_j) = Σ_{j=1}^{k} ψ(x_j, P) q_j,   (7.2.8)
which is just (7.2.5). We see that
ψ(X, P) = Σ_{j=1}^{k} (∂h/∂p_j)(p)(1(X = x_j) − p_j)
is the function appearing in (5.4.10) and that (5.4.10) exactly gives (7.2.2) with ν̂_n = ν(P̂), where P̂ is the empirical probability; that is,

ν(P̂) = h(N_1/n, . . . , N_k/n) = ν(P) + (1/n) Σ_{i=1}^{n} ψ(X_i, P) + o_P(n^{−1/2}).   (7.2.9)
Equation (7.2.9) just says that (7.2.1) is a valid linear approximation to ν(P̂) in this case.
✷
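The computation (7.2.7) can be checked numerically against a one-sided difference quotient; in this sketch, h is a hypothetical smooth function of (p_1, p_2, p_3):

```python
# Numerical Gateaux derivative psi0(delta_x, P) for a multinomial parameter,
# compared with the closed form (7.2.7); h below is a hypothetical example.
def h(p):
    return p[0] ** 2 + p[1] * p[2]   # example h : R^3 -> R

def psi_numeric(j, p, eps=1e-6):
    # one-sided difference quotient of nu((1 - eps)P + eps*delta_{x_j})
    q = [(1 - eps) * pi for pi in p]
    q[j] += eps
    return (h(q) - h(p)) / eps

def psi_exact(j, p):
    # (7.2.7): dh/dp_j (p) - sum_t dh/dp_t (p) p_t
    grad = [2 * p[0], p[2], p[1]]
    return grad[j] - sum(g * pt for g, pt in zip(grad, p))

p = [0.5, 0.3, 0.2]
for j in range(3):
    assert abs(psi_numeric(j, p) - psi_exact(j, p)) < 1e-4
```

Note that Σ_j ψ(x_j, P) p_j = 0, the discrete form of (7.2.6).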
In the multinomial case, continuity of ∂h/∂p_j(·) in a neighborhood of p is enough to legitimize the existence of a total differential for h at p, which is what (7.2.7) corresponds to. Also, all expressions such as ∫ ψ(x, P) dQ(x) naturally exist and are finite. There
are different generalizations of the total differential to infinite dimensional spaces. At this
point, we simply operate formally, derive the influence function, and show, by example, the
validity of (7.2.5), and, in some cases, (7.2.2) for a number of examples. We also relate the
influence function to the sensitivity curve of Section 3.5 (see Remark 3.5.1 and Problem
7.2.2).
Here is an extension of the multinomial example.
Example 7.2.2. Moments and functions of moments. Suppose M is as in (i), (ii), and (iii),

ν(P) = h(∫ g(x) dP(x))

where g(·) = (g_1, . . . , g_k)^T is a vector of functions from X to R such that

∫ |g|²(x) dP(x) < ∞ for all P ∈ M

and h : R^k → R has a total differential at µ_g(P) ≡ ∫ g(x) dP(x). Then, by definition, for P, Q ∈ M,

ν((1 − ε)P + εQ) = h(∫ g(x) dP(x) + ε ∫ g(x) d(Q − P)(x))
and, by definition,

ψ_0(δ_x, P) = (∂/∂ε) h(µ_g(P) + ε(g(x) − µ_g(P)))|_{ε=0}
  = Σ_{j=1}^{k} (∂h/∂µ_{g_j})(µ_g(P))(g_j(x) − µ_{g_j}(P))
  = ∇^T h(µ_g(P))(g(x) − µ_g(P)).   (7.2.10)
It can be shown (Problem 7.2.3) that (7.2.5) holds. Moreover, if we set ψ(x, P) = ψ_0(δ_x, P), then by (5.3.23),

ν(P̂) = ν(P) + (1/n) Σ_{i=1}^{n} ψ(X_i, P) + o_P(n^{−1/2})   (7.2.11)
and (7.2.2) holds. We have already seen in Chapters 5 and 6 how to apply (7.2.10) to get
expressions for the asymptotic behaviour of moments and cumulants.
✷
Recall that a minimum contrast (MC) estimate is defined in Section 5.4.2 as θ̂ = arg min_θ ∫ ρ(x, θ) dF̂(x) for some contrast function ρ, while an M estimate is defined as the solution to ∫ ψ(x, θ) dF̂(x) = 0 for some function ψ. An important special case of M estimates is obtained by choosing ψ as the derivative of a contrast function.
Example 7.2.3. Quantiles. The median is an MC estimate which can be shown to be consistent and asymptotically normal (Problem 5.4.1), yet does not satisfy the conditions of Theorem 5.4.2. Specifically, suppose X ∈ R has distribution function F with density f and uniquely defined median ν(F) = F^{−1}(.5), such that f(ν(F)) > 0. Then, from Problem 5.4.1, if F̂^{−1}(.5) is a sample median,

L(√n(F̂^{−1}(.5) − ν(F))) → N(0, σ²(F))

where σ²(F) = 1/[4f²(ν(F))].
The mean and median are informative about the center of a population. Sometimes quantiles are also of interest. For instance, in addition to median income we may want to examine the 10th or the 99th percentile of income. We next formally work out the influence function of the α quantile ν_α(F), 0 < α < 1, defined by

F(ν_α(F)) = α.   (7.2.12)
We assume that equation (7.2.12) has a unique solution for F ↔ P ∈ M. In general,
να (F ) corresponds (Problem 7.2.7) to the MC estimate, with
ρ(x, θ) = (1 − α)|x − θ|,  x < θ
        = α|x − θ|,        x ≥ θ.

If F is continuous and increasing, ν_α(F) is the parameter (corresponding to an M estimate) which solves ∫ φ_α(x, θ) dF(x) = 0 with

φ_α(x, θ) = −(1 − α),  x < θ
          = α,         x ≥ θ.   (7.2.13)
The median indeed corresponds to α = .5.
If F = F̂, as we know for the median, ν_α(F̂) is not uniquely defined in general. For convenience, we use the sample quantile (see (2.1.17))

x̂_α = (1/2)(F̂^{−1}(α) + F̂_U^{−1}(α)) = X_{([nα]+1)},  if nα ∉ {1, . . . , n}
    = (1/2)(X_{([nα])} + X_{([nα]+1)}),  otherwise,   (7.2.14)

where F̂_U^{−1}(α) = sup{x : F̂(x) ≤ α}, X_{(n+1)} = X_{(n)}, and X_{(1)} ≤ . . . ≤ X_{(n)} are the ordered X_1, . . . , X_n.
We now proceed formally from (7.2.12) assuming να (F ) is uniquely defined on M,
and that we are evaluating the influence function at F ↔ P such that f (να (F )) is defined
and > 0. Then, we can differentiate (7.2.12) in the form
((1 − ε)F + εG)(να ((1 − ε)F + εG)) = α
with respect to ε, evaluated at ε = 0, to get
(G − F)(ν_α(F)) + f(ν_α(F)) (∂/∂ε) ν_α(F + ε(G − F))|_{ε=0} = 0

which, by (7.2.4), yields, on substituting G(y) = 1(y ≥ x) corresponding to δ_x, the influence function

ψ_α(x, F) = −(1(ν_α(F) ≥ x) − α)/f(ν_α(F)).   (7.2.15)
Expressions (7.2.15) and (7.2.2) lead to the expectation that we have, for x_α = F^{−1}(α),

x̂_α = x_α + (1/n) Σ_{i=1}^{n} [−(1(X_i ≤ F^{−1}(α)) − α)/f(F^{−1}(α))] + o_P(n^{−1/2})   (7.2.16)
and hence, by the central limit theorem,
√n(x̂_α − x_α) ⇒ N(0, σ_α²(F))

where σ_α²(F) = α(1 − α)/f²(F^{−1}(α)). That (7.2.16) is, in fact, correct is shown in
Problem 7.2.8.
✷
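A Monte Carlo sketch of the variance formula σ_α²(F) = α(1 − α)/f²(F^{−1}(α)) for the median (α = .5) of U(0, 1), where σ²(F) = .25:

```python
import random

random.seed(5)

# Monte Carlo sketch of sigma_alpha^2(F) = alpha(1 - alpha)/f^2(F^{-1}(alpha))
# for the median (alpha = .5) of U(0,1): sigma^2 = .25/1 = .25
n, trials = 401, 3000
meds = []
for _ in range(trials):
    xs = sorted(random.random() for _ in range(n))
    meds.append(xs[n // 2])          # sample median (n odd)
mean = sum(meds) / trials
var_n = sum((m - mean) ** 2 for m in meds) / trials
assert abs(n * var_n - 0.25) < 0.05  # n * Var(median) -> 1/4
```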
The Fréchet derivative
To obtain a general theorem yielding (7.2.2) we need stronger notions of differentiation
than that due to Gâteaux. These necessarily rely on some definition of (metric) topology
on M, a convex subset of probabilities on X containing P, and all discrete distributions.
We can regard the Gâteaux derivative as a partial derivative in the direction (1 − ε)P +
εQ. We next turn to a stronger differentiability that corresponds to the existence of a total
differential.
Definition 7.2.1. d is a suitable metric on M, if it satisfies the usual a)–d) below, and e).
a) d ≥ 0
b) d(P, Q) = d(Q, P )
c) d(P, Q) = 0 iff P = Q
d) d(P, Q) ≤ d(P, R) + d(R, Q) all P, R, Q ∈ M
e) d((1 − ε)P + εQ, P ) ≤ εd(P, Q) all P, Q ∈ M
Examples of suitable metrics are
d(P, Q) = sup{|∫ f (dP − dQ)| : f ∈ C}

where C is a Donsker class satisfying: ∫ f dP = ∫ f dQ for all f ∈ C implies P = Q. For
where C is a Donsker class satisfying: f dP = f dQ for all f ∈ C =⇒ P = Q. For
instance, if X = R and C = {1z (·) : z ∈ R} where 1z (t) = 1(t ≤ z), then d is the
Kolmogorov (sup) metric.
The strongest (often too strong) definition of differentiability of a function ν : M → R is F (for Fréchet) differentiability. In the following we write “Rem” for the remainder of the approximation ∫ ψ(x, P) dQ(x) to ν(Q) − ν(P).
Definition 7.2.2. ν is F differentiable at P iff there exists ψ(·, P) such that
(a) ∫ ψ(x, P) dP(x) = 0.
(b) For all ε > 0, there exists δ(ε) > 0 such that if d(P, Q) ≤ δ(ε), then

Rem(P, Q) ≡ |ν(Q) − ν(P) − ∫ ψ(x, P) dQ(x)| ≤ ε d(P, Q).

Note that ∫ ψ(x, P) dQ(x) = ∫ ψ(x, P) d(Q − P)(x). Equivalently, if Q_m is any sequence such that d(Q_m, P) → 0, and if (a) holds, then (b) is equivalent to

Rem(P, Q_m) = o(d(P, Q_m)).
Write P̂_n for the empirical probability P̂ to clearly indicate the dependence on sample size.
Theorem 7.2.1. If ν is F differentiable at P with derivative ψ with respect to a suitable metric d, and
(i) d(P̂_n, P) = O_P(n^{−1/2}),
(ii) ψ(·, P) ∈ L₂(P),
then

ν(P̂_n) = ν(P) + (1/n) Σ_{i=1}^{n} ψ(X_i, P) + o_P(n^{−1/2}).

That is, ν̂_n = ν(P̂_n) has influence function ψ.
Proof. By hypothesis, for δ(ε) as in Definition 7.2.2,

P[d(P̂_n, P) ≤ δ(ε)] → 1 for every ε > 0.

Therefore, by F differentiability,

P[|ν(P̂_n) − ν(P) − ∫ ψ(x, P) d(P̂_n − P)(x)| ≤ ε′ d(P̂_n, P)] → 1 for every ε′ > 0.

By (i), for all ε > 0, there exists M such that

P[d(P̂_n, P) ≤ M n^{−1/2}] ≥ 1 − ε.

Take ε′ = ε/M; then, for every ε > 0,

P[|ν(P̂_n) − ν(P) − ∫ ψ(x, P) d(P̂_n − P)(x)| ≤ ε n^{−1/2}] → 1.   ✷
Example 7.2.4. Cramér-von Mises (C-vM) Statistic. We want to see how the C-vM statistic behaves when H : P = P_0 is false and X = R. The C-vM statistic of Example 7.1.7 is ν(F̂), where

ν(F) = ∫ (F(y) − F_0(y))² dF_0(y).
By a standard calculation (Problem 7.2.9), the influence function is given by

ψ(x, P) = 2 ∫_{−∞}^{∞} (1(x ≤ y) − F(y))(F(y) − F_0(y)) dF_0(y).   (7.2.17)
It can be shown (Problem 7.2.9) that, if we take d as the Kolmogorov metric
d(P, Q) = sup |P (−∞, y] − Q(−∞, y]|
y
then,
Rem(P, Q) = o(d(P, Q)).
Thus, if P ≠ P_0, the Cramér-von Mises statistic is asymptotically normal with mean ∫_{−∞}^{∞} (F − F_0)²(y) dF_0(y) and variance n^{−1} ∫_{−∞}^{∞} ψ²(x, P) dF(x), where ψ is given by (7.2.17). This is quite different from its behaviour under the hypothesis H : F = F_0, where its asymptotic behaviour is non-Gaussian, as we will see in Example 7.3.1.
✷
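A sketch of the first claim under a concrete alternative (our choice): for F(x) = x² on [0, 1] (i.e., X = √U), the plug-in value ν(F̂) concentrates around ν(F) = ∫_0^1 (x² − x)² dx = 1/30:

```python
import bisect, random

random.seed(6)

def nu_hat(xs, grid=2000):
    # Riemann-sum approximation to nu(Fhat) = integral (Fhat - F0)^2 dF0
    # with F0 = U(0,1)
    xs = sorted(xs)
    n = len(xs)
    total = 0.0
    for k in range(grid):
        y = (k + 0.5) / grid
        fhat = bisect.bisect_right(xs, y) / n   # Fhat(y)
        total += (fhat - y) ** 2
    return total / grid

# X = sqrt(U) has cdf F(x) = x^2 on [0,1], so nu(F) = 1/30
xs = [random.random() ** 0.5 for _ in range(20000)]
assert abs(nu_hat(xs) - 1 / 30) < 0.01
```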
There is a weaker but more broadly applicable differentiability notion due to Hadamard.
We refer the reader to van der Vaart (1998), Chapter 20, for a clear discussion of this notion
and its utility.
We shall give some more applications of Theorem 7.2.1 in the problems. Unfortunately, it is hard to apply, even when weakened to the Hadamard form. The problem is
that many parameters are defined implicitly and, what is most serious, the argument P appears as dP, as in the median, which minimizes ∫ |x − θ| dP(x). Convergence in a suitable metric having the √n consistency property, d(P̂, P) = O_P(n^{-1/2}), is not sufficient to have continuity of such ν(P) in general, much less differentiability. It is often the case
that, after formal calculation of the influence function, proving that it provides a suitable
approximation is more easily done by ad hoc methods. We illustrate this in the following
Theorem 7.2.2.
In this result, we also illustrate that the influence function gives us all relevant first order asymptotic information about ν(P̂) if we have a vector parameter

ν(P) = (ν₁(P), . . . , ν_d(P))ᵀ.

If the expansion (7.2.2) is valid for ν_j(P), j = 1, . . . , d, and ψ_j(x, P) is the influence function corresponding to ν_j(P), then

√n (ν(P̂) − ν(P)) = n^{-1/2} Σ_{i=1}^n ψ(X_i, P) + o_P(1)   (7.2.18)

where ψ(x, P) ≡ (ψ₁(x, P), . . . , ψ_d(x, P))ᵀ and E_P ψ(X, P) = 0, E_P |ψ|²(X, P) < ∞. Here ψ is the natural extension of the concept of an influence function from one-dimensional parameters to d-dimensional parameters. By the multivariate central limit theorem, we conclude that

√n (ν(P̂) − ν(P)) ⇒ N_d(0, E_P ψψᵀ(X, P)).   (7.2.19)
We now present a method of Huber (1967) for establishing influence function approximations for M estimates which covers situations like the median or its analogues in more than one dimension. It is an extension of Theorem 6.2.2.
Theorem 7.2.2. Suppose φ : X × R^p → R^p satisfies

A0′: θ̂_n is such that

∫ φ(x, θ̂_n) dP̂_n(x) = o_P(n^{-1/2}).   (7.2.20)

A1: θ(P) uniquely satisfies ∫ φ(x, θ(P)) dP(x) = 0 for all P ∈ Q.

A2: E_P |φ(X, θ(P))|² < ∞ for all P ∈ Q.

A5′: θ̂_n is consistent for θ(P) for P ∈ Q.

A4′: The set T_ε = {φ(·, θ) − φ(·, θ(P)) : |θ − θ(P)| ≤ ε} satisfies the conditions of Theorem 7.1.3 with envelope function s(·, ε) such that E_P s²(X, ε) = o(1) as ε → 0, where statements about the vector φ refer to each of its components φ_j.

A3′: The map θ → ∫ φ(x, θ) dP(x) has a total differential H(P) at θ(P) which is nonsingular,

H(P) = [ (∂/∂θ_j) E_P φ_i(X, θ(P)) ]_{1≤i,j≤p}
where [·] denotes a matrix. Then,

θ̂_n = θ(P) + (1/n) Σ_{i=1}^n [−H]⁻¹(P) φ(X_i, θ(P)) + o_P(n^{-1/2}).   (7.2.21)

Note (Problem 7.2.11) that formal calculation of the influence function for θ(P) as a solution of (6.2.2) yields [−H]⁻¹φ.
Proof: By A4′ and Theorem 7.1.3,

sup{ |E_n(φ(·, θ) − φ(·, θ(P)))| : |θ − θ(P)| ≤ ε_n } = o_P(1)   (7.2.22)

if ε_n → 0. By A5′, it follows that

E_n(φ(·, θ̂_n) − φ(·, θ(P))) = o_P(1).   (7.2.23)

Translating (7.2.23) by using (7.1.19) and using A0′ and A1, we obtain that

−n^{1/2} E_P φ(X, θ̂_n) = n^{-1/2} Σ_{i=1}^n φ(X_i, θ(P)) + o_P(1).   (7.2.24)

But, by A3′ and A1, since n^{-1/2} Σ_{i=1}^n φ(X_i, θ(P)) = O_P(1) by the central limit theorem,

E_P φ(X, θ̂_n) = H(P)(θ̂_n − θ(P)) + o_P(|θ̂_n − θ(P)|)   (7.2.25)

since E_P φ(X, θ(P)) = 0. Combining (7.2.24) and (7.2.25) we obtain the result.
✷
Here is an example.
Example 7.2.5. Multivariate “medians.” Define, for P on R^d,

θ(P) ≡ arg min_θ ∫ |x − θ|^r dP(x)

where |x| is the Euclidean norm and 1 ≤ r ≤ 2. It is not hard to show (Problem 7.2.12) that θ(P) is defined if ∫ |x|^r dP(x) < ∞ and is unique for r > 1. It is defined when r = 1, which corresponds to the median for d = 1, but then θ(P) may not be unique. If P has a density p > 0, then θ(P) is unique even if r = 1 (Problem 7.2.12). Suppose P is such that

(i) P has a positive density p.

(ii) ∫ |x|^{2(r−1)} p(x) dx < ∞.

Then,

θ(P̂_n) = θ(P) + (1/n) Σ_{i=1}^n ψ(X_i, P) + o_P(n^{-1/2})

where

ψ(x, P) = M(P)(x − θ(P))|x − θ(P)|^{r−2}
and, if X ≡ (X₁, . . . , X_d)ᵀ, δ_ij = 1[i = j],

M(P) = [ E_P( (r − 2)(X_i − θ_i(P))(X_j − θ_j(P)) / |X − θ(P)|^{4−r} + δ_ij |X − θ(P)|^{r−2} ) ]⁻¹_{1≤i,j≤d}.   (7.2.26)

The matrix M(P) is well defined for r > 1 and corresponds to [−H]⁻¹ in (7.2.21). For r = 1, expression (7.2.26) becomes ∞ − ∞, but passage to the limit as r decreases to 1 yields ψ(x, P) = sgn(x − θ(P))/2p(θ(P)) as in Problem 5.4.1(e). If r > 1, E_P |X|^{2r−2} < ∞ makes (7.2.26) well defined — see Problem 7.2.13. Checking the conditions of Theorem 7.2.2 otherwise requires some arguments — see Huber (1967) and Problems 7.2.11 and 7.2.12.
✷
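For d = 1 and r = 1 the limiting influence function above can be checked numerically; the following sketch (our own, for P = N(0, 1), where θ(P) = 0 and p(0) = 1/√(2π)) compares the sample median with its linear approximation:

```python
import numpy as np

rng = np.random.default_rng(2)

# d = 1, r = 1: theta(P) is the median.  For P = N(0,1),
# psi(x, P) = sgn(x - theta(P)) / (2 p(theta(P))) with p(0) = 1/sqrt(2 pi).
n = 100_000
x = rng.standard_normal(n)

theta_hat = np.median(x)
psi = np.sign(x) * np.sqrt(2 * np.pi) / 2
linear = 0.0 + psi.mean()              # theta(P) + (1/n) sum_i psi(X_i, P)

print(theta_hat, linear)
```

The two quantities agree to well within the n^{-1/2} scale, as the expansion predicts.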
7.2.2 The Quantile Process

We now give another approach to justifying the quantile representation (7.2.13) and deducing weak convergence for an infinite dimensional parameter, the quantile process defined by

Q_n(t) ≡ √n (F̂⁻¹(t) − F⁻¹(t)),   0 < t < 1,   (7.2.27)

for suitable F. The approach is due to Shorack (1982); see also Shorack and Wellner (1986). Let W⁰(u) denote the Brownian bridge on [0, 1]. Note that −W⁰(u) is also a Brownian bridge. We state the result as
Theorem 7.2.3. Suppose P on R has distribution F with continuous positive density f = F′. Then F(F⁻¹(t)) = t for 0 < t < 1 and

(i) For all 0 < ε ≤ 1/2, Q_n(t) can be approximated by −E_n(F⁻¹(t))/f(F⁻¹(t)) in the sense that

sup{ |F̂⁻¹(t) − F⁻¹(t) + (F̂(F⁻¹(t)) − t)/f(F⁻¹(t))| : ε ≤ t ≤ 1 − ε } = o_P(n^{-1/2}).   (7.2.28)

(ii) If T = [ε, 1 − ε], and Q_n(t) defined in (7.2.27) is viewed as a stochastic process on T, then

Q_n ⇒ W⁰(·)/f(F⁻¹(·))   (7.2.29)

where, by definition, W⁰(·)/f(F⁻¹(·)) is a Gaussian process with mean 0 on (0, 1),

Cov( W⁰(s)/f(F⁻¹(s)), W⁰(t)/f(F⁻¹(t)) ) = s(1 − t) / ( f(F⁻¹(s)) f(F⁻¹(t)) ),   s ≤ t,

which has continuous sample functions since W⁰(·) does.

(iii) If f(F⁻¹(t)) ≥ δ, 0 ≤ t ≤ 1, then ε can be taken as 0 in (i) and (ii).
(iv) If P = U(0, 1) and E_n(·) is the classical empirical process, we have

‖Q_n + E_n‖_∞ = o_P(1)   (7.2.30)

and Q_n(·) converges weakly to the Brownian bridge −W⁰(·).
Proof. Write

F̂⁻¹(t) − F⁻¹(t) = [ (F̂⁻¹(t) − F⁻¹(t)) / (F(F̂⁻¹(t)) − t) ] · (F(F̂⁻¹(t)) − t)
               = [ (F̂⁻¹(t) − F⁻¹(t)) / (F(F̂⁻¹(t)) − F(F⁻¹(t))) ] · ( F(F̂⁻¹(t)) − F̂(F̂⁻¹(t)) + ∆_{n1}(t) )
               ≡ I(t) × II(t)   (7.2.31)

where ∆_{n1}(t) = F̂(F̂⁻¹(t)) − t satisfies |∆_{n1}(t)| ≤ 1/n by the definition of F̂⁻¹(t) since F is continuous.
We prove (iv) first. If F(x) = x, 0 ≤ x ≤ 1,

‖F̂⁻¹ − F⁻¹‖_∞ = o_P(1)   (7.2.32)

by the Glivenko–Cantelli Theorem (Problem 7.2.16). Now, in the U(0, 1) case, I(t) ≡ 1. On the other hand,

II(t) = −( E_n(F̂⁻¹(t)) + O(n^{-1/2}) ) n^{-1/2} = −( E_n(t) + ∆_{n2}(t) + O(n^{-1/2}) ) n^{-1/2}   (7.2.33)

where

|∆_{n2}(t)| ≤ ‖∆_{n2}‖_∞ ≤ sup{ |E_n(u) − E_n(v)| : |u − v| ≤ ‖F̂⁻¹ − F⁻¹‖_∞ }.

Then

[ ‖∆_{n2}‖_∞ > ε ] ⊂ [ sup{ |E_n(u) − E_n(v)| : |u − v| ≤ δ } > ε ] ∪ [ ‖F̂⁻¹ − F⁻¹‖_∞ > δ ]

by intersecting the event on the left hand side with [‖F̂⁻¹ − F⁻¹‖_∞ ≤ δ] and its complement. Thus,

lim sup_n P[ ‖∆_{n2}‖_∞ > ε ] ≤ lim sup_n P[ sup{ |E_n(s) − E_n(t)| : |s − t| ≤ δ } > ε ]
                              + lim sup_n P[ ‖F̂⁻¹ − F⁻¹‖_∞ > δ ]   (7.2.34)

for all δ, ε > 0. But the first term in (7.2.34) is just

P[ sup{ |W⁰(s) − W⁰(t)| : |s − t| ≤ δ } > ε ]
and the second is 0 for all δ > 0 by the Glivenko-Cantelli theorem (Problem 7.2.16). Then
(7.2.30) for P = U(0, 1) follows from (7.2.33) and (7.2.34).
For the (i) case, let U₁, . . . , U_n be i.i.d. U(0, 1) and Ĝ be their empirical df. Then, if X_i ≡ F⁻¹(U_i), 1 ≤ i ≤ n, X₁, . . . , X_n are i.i.d. with df F. Taking X_i represented in this way as our sample from F, we can write

F̂⁻¹(t) − F⁻¹(t) = F⁻¹(Ĝ⁻¹(t)) − F⁻¹(G⁻¹(t))   (7.2.35)

and

F̂(F⁻¹(t)) − t = Ĝ(t) − t.   (7.2.36)

Set t_n = Ĝ⁻¹(t). Then t_n →^P G⁻¹(t) = t uniformly on [ε, 1 − ε] by (7.2.32). Thus, uniformly on [ε, 1 − ε],

( F⁻¹(t_n) − F⁻¹(t) ) / (t_n − t) →^P (∂/∂t) F⁻¹(t) = 1/f(F⁻¹(t)).
More precisely, by (7.2.30),

( F⁻¹(Ĝ⁻¹(t)) − F⁻¹(G⁻¹(t)) ) / ( Ĝ⁻¹(t) − G⁻¹(t) ) = ( 1/f(F⁻¹(t)) ) (1 + ∆_n(t))

where

sup{ |∆_n(t)| : ε ≤ t ≤ 1 − ε } →^P 0

because f(F⁻¹(t)) is continuous and bounded away from 0 on [ε, 1 − ε] if ε > 0, by assumption. Thus (i) follows. The case (iii) also follows. Finally, (ii) follows from Donsker's theorem (Theorem 7.1.4) and Slutsky's theorem generalized to weak convergence in function spaces (Corollary 7.1.2).
✷
Here is a first application of this result.
Example 7.2.6. Linear combinations of order statistics. A class of parameters which includes the trimmed means (3.5.3) are

ν(P) = ∫₀¹ F⁻¹(t) dΛ(t)   (7.2.37)

where Λ(t) is the df of a probability distribution on [0, 1]. If

Λ(t) = 0,  t < ε
     = 1,  t > 1 − ε,

or equivalently Λ[ε, 1 − ε] = 1, the parameter ν(P) is defined for all F. A special case is Λ_α with density

λ_α(t) = (1 − 2α)⁻¹,  α ≤ t ≤ 1 − α
       = 0,           otherwise   (7.2.38)
for 0 < α < 1/2. Then ν(P̂) is just the α-trimmed mean. From Theorem 7.2.3 (iii), since

h → ∫_ε^{1−ε} h(u) dΛ(u)

is continuous in ‖·‖_∞, we have that

∫₀¹ √n (F̂⁻¹(t) − F⁻¹(t)) dΛ(t) ⇒ ∫₀¹ [f(F⁻¹(t))]⁻¹ W⁰(t) dΛ(t).   (7.2.39)

The limit is N(0, σ²(Λ, f)) (Problem 7.2.16) with

σ²(Λ, f) = ∫₀¹ ∫₀¹ [f(F⁻¹(s)) f(F⁻¹(t))]⁻¹ (s ∧ t − st) dΛ(s) dΛ(t).   (7.2.40)
We can get a simpler expression for σ²(Λ, f) and more insight by using part (ii). We obtain, if λ(s) = 0 for s ≤ ε or s ≥ 1 − ε, ε > 0,

∫₀¹ (F̂⁻¹(s) − F⁻¹(s)) λ(s) ds = − ∫₀¹ [ (F̂(F⁻¹(s)) − s) / f(F⁻¹(s)) ] λ(s) ds + o_P(n^{-1/2}).

So the influence function of ∫₀¹ F̂⁻¹(s) λ(s) ds is

ψ(x, P) = − ∫₀¹ [ (1(x ≤ F⁻¹(s)) − s) / f(F⁻¹(s)) ] λ(s) ds
        = − ∫₀¹ (1(x ≤ F⁻¹(s)) − s) λ(s) dF⁻¹(s)
        = − ∫ (1(x ≤ y) − F(y)) λ(F(y)) dy
        = − ( ∫_x^∞ λ(F(y)) dy − E_P ∫_X^∞ λ(F(y)) dy ).   (7.2.41)
Specializing to the trimmed mean, with λ = λ_α given by (7.2.38), we get, as influence function,

ψ_α(x, P) = F⁻¹(α)/(1 − 2α) − μ_α(P),        x ≤ F⁻¹(α)
          = x/(1 − 2α) − μ_α(P),             F⁻¹(α) ≤ x ≤ F⁻¹(1 − α)
          = F⁻¹(1 − α)/(1 − 2α) − μ_α(P),    x ≥ F⁻¹(1 − α)   (7.2.42)

where

μ_α(P) = α(F⁻¹(α) + F⁻¹(1 − α))/(1 − 2α) + (1 − 2α)⁻¹ ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x).
From (7.2.41), the asymptotic variance of √n ν_α(P̂) is

σ²(Λ, f) = (1 − 2α)⁻² { (F⁻¹(α) − μ_α(P))² + (F⁻¹(1 − α) − μ_α(P))²
           + ∫_{F⁻¹(α)}^{F⁻¹(1−α)} (x − μ_α(P))² dF(x) }.   (7.2.43)

Note that if f is symmetric, f(a + x) = f(a − x) for some a, then

ν_α(P) = (1 − 2α)⁻¹ ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dF(x) = a.

Note the agreement between this formula and the sensitivity curve of Figure 3.5.2.
✷
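The symmetric-f conclusion can be illustrated numerically (a sketch with our own choice P = N(5, 4); the trimming routine below is a simple hypothetical implementation of the sample version of (7.2.37)–(7.2.38), not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)

# alpha-trimmed mean: discard a fraction alpha of the order statistics in
# each tail, then average the rest.
def trimmed_mean(x, alpha):
    xs = np.sort(x)
    k = int(alpha * len(xs))
    return xs[k:len(xs) - k].mean()

# Symmetric f about a = 5: nu_alpha(P) = a for every alpha, so trimmed means
# of all orders estimate the same center as the ordinary mean.
x = rng.normal(5.0, 2.0, size=100_000)
print(trimmed_mean(x, 0.1), trimmed_mean(x, 0.25), x.mean())
```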
Remark 7.2.1. Finally, note that (7.2.39) makes sense even if λ > 0 on all of (0,1) and
is correct for the mean itself, λ ≡ 1. In fact (7.2.39) and (7.2.41) are valid much more
generally; see Problems 7.2.21–7.2.24 and results in Bickel (1967), Chernoff, Gastwirth
and Johns (1967), and Stigler (1969).
✷
Summary. We introduced a generalization of the stochastic delta method from functions of i.i.d. vectors of dimension d to functions ν(P̂) of the empirical distribution. We did this by introducing Gâteaux and Fréchet derivatives of functions defined on infinite dimensional linear spaces. These derivatives correspond, respectively, to partial differentiation in a fixed direction and the existence of a total differential. They lead to the important notion of the influence function, which tells us what is the “right” approximation of the form n⁻¹ Σ_{i=1}^n ψ(X_i, P) to ν(P̂) we should aim for. We derive influence functions for functions of multinomial proportions, functions of sample moments, sample quantiles, the Cramér–von Mises statistic, multivariate medians, and linear combinations of order statistics. The Fréchet derivative is used to obtain the validity of the influence function approximation for statistics which are not simply functions of vector means or characterizable as M estimates. Empirical process theory is used to justify more general results for M estimates than those of Chapter 5, and in Section 7.2.2 yields weak convergence of the quantile process.
7.3 Further Expansions

7.3.1 The von Mises Expansion
The influence function gives us the analogue of the first derivative of a parameter. The von Mises expansion is the formal analogue of a Taylor series and tells us, for instance, what we can expect if the “first derivative” vanishes. Write, given ν : M → R,

ν(P + ε(Q − P)) = ν(P) + Σ_{k=1}^m [ ν^{(k)}(Q − P)/k! ] ε^k + o(ε^{m+1})   (7.3.1)

where ν^{(k)}(Q − P) = (∂^k/∂ε^k) ν(P + ε(Q − P))|_{ε=0},
the usual Taylor expansion. Suppose, as can easily be shown to hold under mild conditions in case X is finite (Problem 7.3.1), that

(∂^k/∂ε^k) ν(P + ε(Q − P))|_{ε=0} = ∫ · · · ∫ ψ(x₁, . . . , x_k, P) Π_{j=1}^k dQ(x_j)   (7.3.2)

where

(i) ψ is symmetric in x₁, . . . , x_k.

(ii) ∫ ψ(x, X₂, . . . , X_k, P) dP(x) = 0 with P probability 1 or, equivalently,

E_P( ψ(X₁, . . . , X_k, P) | X₂, . . . , X_k ) = 0.   (7.3.3)

From (ii) it follows that

∫ · · · ∫ ψ(x₁, . . . , x_k, P) Π_{j=1}^k dQ(x_j) = ∫ · · · ∫ ψ(x₁, . . . , x_k, P) Π_{j=1}^k d(Q − P)(x_j).   (7.3.4)
Further, ψ may be computed formally as follows. For Q₁, . . . , Q_k ∈ M, define

ψ₀(Q₁, . . . , Q_k, P) ≡ (∂^k/∂ε₁ · · · ∂ε_k) ν( P + Σ_{j=1}^k ε_j(Q_j − P) )|_{ε₁=···=ε_k=0}   (7.3.5)

where ε_j ≥ 0, Σ_{j=1}^k ε_j < 1. Then, extending (7.2.4),

ψ(x₁, . . . , x_k, P) = ψ₀(δ_{x₁}, . . . , δ_{x_k}, P).   (7.3.6)

It is easy to see that (7.3.5) and (7.3.2) imply (7.3.6) and (7.3.3) (Problem 7.3.2). Formally, (7.3.2) and (7.3.4) suggest that

ν(P̂) = ν(P) + n^{-1/2} ∫ ψ(x, P) dE_n(x) + (n⁻¹/2) ∫∫ ψ(x, y, P) dE_n(x) dE_n(y) + · · · .   (7.3.7)

Still formally, if ψ(·, P) ≠ 0, (7.3.7) suggests that n^{1/2}(ν(P̂) − ν(P)) is of order 1 and its behaviour is the same as that of ∫ ψ(x, P) dE_n(x). This is just the linear approximation (7.2.2).
Interpretation of von Mises expansion terms. Stochastic integrals

As we have already noted, by the multivariate central limit theorem, for T = L₂(P), E_n →^{FIDI} W⁰_P. Now, E_n(f) = ∫ f(x) dE_n(x) by definition. Can we similarly interpret W⁰_P(f) as ∫ f(x) dW⁰_P(x), where W⁰_P(x) is the (transformed) Brownian bridge? This cannot be done in the Riemann–Stieltjes or Lebesgue–Stieltjes way since the sample functions
of W⁰ are not of bounded variation (Breiman (1968), pp. 261–263). However, if f is piecewise constant,

f(x) = Σ_{j=1}^J a_j 1(c_j < x ≤ c_{j+1}),

then, if we naturally define ∫ f(x) dW⁰_P(x) ≡ Σ_{j=1}^J a_j ( W⁰_P(c_{j+1}) − W⁰_P(c_j) ), it is easy to see that, indeed, ∫ f(x) dW⁰_P(x) has a N(0, Var_P f(X)) distribution and can be identified with W⁰_P(f).

More generally, if f₁, . . . , f_k are k such functions, the joint distribution of

∫ f_j(x) dW⁰_P(x),   1 ≤ j ≤ k,

coincides with that of {W⁰_P(f_j) : 1 ≤ j ≤ k}. It may be shown, since any f ∈ L₂(P) can be approximated in the L₂(P) sense by such functions, that we can define ∫ f dW⁰_P for all f ∈ L₂(P) with the correct FIDI's for any finite set of such variables. It turns out that, for k ≥ 2, if ∫ b²(x) dP(x) < ∞, it is, unfortunately, possible to define

∫ · · · ∫ b(x₁, . . . , x_k) dW⁰_P(x₁) . . . dW⁰_P(x_k)

in two distinct ways, as the Itô or Stratonovich stochastic integral. However, if b(x) = 0 whenever x = (x₁, . . . , x_k) has at least two coordinates equal, then the two integrals coincide and our heuristics apply.
Our heuristics suggest that if ψ(·, P) = 0, then the expansion (7.3.7) is of order 2 and

2n(ν(P̂) − ν(P)) ⇒ ∫∫ ψ(x₁, x₂, P) dW⁰_P(x₁) dW⁰_P(x₂).   (7.3.8)

If P[X₁ = X₂] = 0, the right hand side is uniquely defined. What makes (7.3.8) valuable, if it holds, is that the distribution of ∫∫ ψ(x₁, x₂, P) dW⁰_P(x₁) dW⁰_P(x₂) is always of the form Σ_{k=1}^∞ λ_k Z_k², where the λ_k are the eigenvalues of the integral operator which sends h ∈ L₂(P) into ∫ ψ(x₁, x₂, P) h(x₁) dP(x₁) and the Z_k are i.i.d. N(0, 1). Here is an example.
Example 7.3.1. (Example 7.2.4 continued.) Null distribution of the Cramér–von Mises statistic. Consider ν(F) given in that example. By (7.2.17), ψ(x, P₀) = 0 if F₀ ↔ P₀. However, we can compute, if F₀ is continuous and G_j is the df of the probability Q_j,

(∂²/∂ε₁∂ε₂) ν( F₀ + ε₁(G₁ − F₀) + ε₂(G₂ − F₀) )|_{ε₁=ε₂=0} = 2 ∫ (G₁ − F₀)(G₂ − F₀)(z) dF₀(z).

Thus,

ψ(x, y, P₀) = 2 ∫ (1(z ≥ x) − F₀(z))(1(z ≥ y) − F₀(z)) dF₀(z)
            = 2F₀(x ∧ y) − (F₀²(x) + F₀²(y)) + 2/3.
Without loss of generality (Problem 4.1.11), we can take F₀(t) = t, 0 ≤ t ≤ 1, and obtain the required eigenvalues as λ_j = 1/(j²π²). It may be shown (Shorack and Wellner (1986), Ch. 5) that, indeed, ∫₀¹ [W⁰]²(u) du = ∫₀¹ ∫₀¹ ( 2(x ∧ y) − (x² + y²) + 2/3 ) dW⁰(x) dW⁰(y).
✷
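A numerical check of the null behaviour (our own sketch): the eigenvalue series gives E Σ_j Z_j²/(j²π²) = Σ_j 1/(j²π²) = 1/6, which should match the Monte Carlo mean of n ν(F̂) under H:

```python
import numpy as np

rng = np.random.default_rng(5)

# Null C-vM statistic n * nu(F_hat) for F0 = U(0,1), via the standard
# computing formula  n W_n^2 = 1/(12 n) + sum_i (U_(i) - (2i - 1)/(2n))^2.
n, reps = 500, 4000
u = np.sort(rng.uniform(size=(reps, n)), axis=1)
grid = (2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)
w2 = 1.0 / (12.0 * n) + ((u - grid) ** 2).sum(axis=1)

# The limit law sum_j Z_j^2/(j^2 pi^2) has mean sum_j 1/(j^2 pi^2) = 1/6.
print(w2.mean())   # close to 1/6
```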
Unfortunately, higher order terms in the von Mises expansion do not have closed form limiting distributions. However, the limit theorems we can prove do suggest how to properly estimate the limit distribution of n^{m/2}(ν(P̂) − ν(P)) if ψ(x₁, . . . , x_j, P) = 0 for all j < m, via the m out of n bootstrap discussed in Chapter 10.
7.3.2 The Hoeffding and Analysis of Variance Expansions

Here is another expansion of a symmetric statistic. Its terms are harder to compute but have important orthogonality properties which ease asymptotic analysis. Let P̂ be the empirical df of the sample X₁, . . . , X_n and think of ν(P̂) as a function q(X₁, . . . , X_n) where q is symmetric, and represent q as

q(x₁, . . . , x_n) = ∫ q(y) dH_{x₁}(y₁) . . . dH_{x_n}(y_n)

where H_x(y) = 1(y ≥ x) is the distribution function of δ_x, point mass at x. On the other hand, consider

E_P q(X₁, . . . , X_n) = ∫ q(y) dP(y₁) . . . dP(y_n).
Note the expansion to n terms for the product Π_{i=1}^n y_i around y = x,

Π_{i=1}^n y_i = Π_{i=1}^n x_i + Σ_{m=1}^n Σ_{J_m} {Π x_i : i ∈ J̄_m}{Π (y_i − x_i) : i ∈ J_m},

where J_m ranges over all distinct subsets of m integers out of {1, . . . , n} and J̄_m is the complement of J_m. We get, after some algebra,

q(X₁, . . . , X_n) = E_P q(X₁, . . . , X_n)
  + Σ_{m=1}^n Σ_{J_m} ∫ · · · ∫ q(y₁, . . . , y_n) Π{dP(y_i) : i ∈ J̄_m} Π{d(H_{X_i} − P)(y_i) : i ∈ J_m}.   (7.3.9)

Call the sums in the series M_q^{(1)}, . . . , M_q^{(n)} and let M_q^{(0)} = E_P q(X₁, . . . , X_n). Write

M_q^{(m)} = Σ_{J_m} M_q(J_m),

where M_q(J_m) is the integral in (7.3.9) for a fixed J_m. Note that M_q(J_m) is a function of {X_i : i ∈ J_m} only. Thus,

M_q({i}) = E_P{q(X₁, . . . , X_n)|X_i} − M_q^{(0)}.   (7.3.10)
So we may write, abusing notation, M_q({i}) as M_q(X_i), and similarly, M_q({i, j}) as M_q(X_i, X_j), given by

M_q({i, j}) = E_P{q(X₁, . . . , X_n)|X_i, X_j} − M_q({i}) − M_q({j}) + M_q^{(0)}.   (7.3.11)

If S is a subset of {1, . . . , n} with |S| = m + 1, where |S| denotes cardinality, then, more generally, expanding in (7.3.9) we get the recursion

M_q(S) = E_P{ q(X₁, . . . , X_n) | {X_i : i ∈ S} }
       − Σ { M_q(T) : T ⊂ S, |T| = m }
       + Σ { M_q(T) : T ⊂ S, |T| = m − 1 }
       − Σ { M_q(T) : T ⊂ S, |T| = m − 2 }
       + · · · + (−1)^{m+1} M_q^{(0)}.   (7.3.12)
This is the Moebius expansion of Mq (S) — see Stanley (1986), p.116.
Suppose we drop the assumption of symmetry on q and simply take X_i independent with X_i ∼ P_i, i = 1, . . . , n. Then our development remains valid once we replace P by P_i appropriately in (7.3.9) and, in particular,

q(X₁, . . . , X_n) = Σ_{j=0}^n M_q^{(j)} = Σ_{j=0}^n Σ { M_q(S) : |S| = j }   (7.3.13)

with M_q(S) defined by M_q^{(0)} = E q(X₁, . . . , X_n) and the recursion (7.3.12) (provided that E|q(X₁, . . . , X_n)| < ∞).
The expression simplifies considerably if q is symmetric and X₁, . . . , X_n are i.i.d. In that case, M_q(T) = w_q(X_i : i ∈ T) with |T| = m, and the integral in (7.3.9) is

w_q(x₁, . . . , x_m) ≡ ∫ · · · ∫ E q(y₁, . . . , y_m, X_{m+1}, . . . , X_n) Π_{i=1}^m d(H_{x_i} − P)(y_i)
                    = Σ_{r=0}^m (−1)^{m−r} Σ { v_q(x_{i₁}, . . . , x_{i_r}) : {i₁, . . . , i_r} ⊂ {1, . . . , m} }   (7.3.14)

where

v_q(x₁, . . . , x_r) = E q(x₁, . . . , x_r, X_{r+1}, . . . , X_n).
The importance of the expansion (7.3.9) comes from the following interpretation (Theorem 7.3.1) of the M_q^{(j)}, which does not make use of either the identical distribution of the X_i or the symmetry of q. Suppose X₁, . . . , X_n are independent. Let Λ₀ be the space of all constants, let Λ₁ be the linear subspace of L₂(P) of all random variables of the form Σ_{i=1}^n u_i(X_i), Λ₂ that of all random variables of the form Σ_{1≤i<j≤n} u_{ij}(X_i, X_j), and so on. Let V_j = Λ_j ∩ Λ_{j−1}^⊥, the linear subspace of all members of Λ_j orthogonal to all members of Λ_{j−1}, and let π(q|V_j) denote the projection of q on V_j in L₂(P). We write U ⊥ V iff E(UV) = 0. See Section B.10.3.2. We state our interpretation as
Theorem 7.3.1. With the definitions and assumptions of the previous paragraph, if Eq²(X₁, . . . , X_n) < ∞, then

M_q^{(j)} = π( q(X₁, . . . , X_n) | V_j )   (7.3.15)

and the representation

q(X₁, . . . , X_n) = Σ_{j=0}^n M_q^{(j)}   (7.3.16)

is a decomposition of q(X₁, . . . , X_n) into mutually orthogonal functions of X. In fact,

M_q(S) ⊥ M_q(T)   (7.3.17)

unless S = T, and so each of the M_q^{(j)} has an orthogonal representation as

M_q^{(j)} = Σ { M_q(S) : |S| = j }.   (7.3.18)
Proof. The key observation is that, if Eh²(X_i : i ∈ S) < ∞ and S ∩ T ≠ S, then

E[ ∫ h(y_i : i ∈ S) Π{d(H_{X_i} − P_i)(y_i) : i ∈ S} | X_k : k ∈ T ] = 0.   (7.3.19)

The reason is that the conditional expectation passes through the integral and the product, since the variables are independent, and

E[ Π{d(H_{X_i} − P_i)(y_i) : i ∈ S} | X_k : k ∈ T ] = Π{ E[d(H_{X_i} − P_i)(y_i) | X_k : k ∈ T] : i ∈ S } = 0   (7.3.20)

since one of the terms in the product must vanish given S ∩ T ≠ S. Statements like (7.3.20) are evidently rigorous if the P_i are discrete, so that

d(H_{X_i} − P_i)(y_i) = 1(X_i = y_i) − P[X_i = y_i].

The general case can be obtained by induction on the cardinality of S (Problem 7.3.7). Claim (7.3.17) follows since S ≠ T implies that either S ∩ T ≠ S or S ∩ T ≠ T or both, and

E M_q(S) M_q(T) = E[ M_q(S) E{M_q(T)|X_i : i ∈ S} ] = E[ M_q(T) E{M_q(S)|X_i : i ∈ T} ].

Then, (7.3.19) implies that one or the other conditional expectation is 0. Finally, we claim that

M_q^{(j)} ∈ V_j,   0 ≤ j ≤ n.   (7.3.21)
It is enough to argue that, if |S| = j, M_q(S) ∈ Λ_j, and that

M_q(S) ⊥ Σ { h_T(X_i : i ∈ T) : |T| = j − 1 }.

That M_q(S) ∈ Λ_j follows by definition. On the other hand,

E M_q(S) h_T(X_i : i ∈ T) = E[ E{M_q(S) | X_i : i ∈ T} h_T(X_i : i ∈ T) ] = 0

since |S| = j, |T| = j − 1 implies that S ∩ T ≠ S. Claim (7.3.21) follows. Write

π(q|V_j) = M_q^{(j)} + π( Σ_{i≠j} M_q^{(i)} | V_j ) = M_q^{(j)} + Σ_{i≠j} π( M_q^{(i)} | V_j ).

The second term is 0 since, by construction, M_q^{(i)} ⊥ V_j for i < j because V_j ⊂ Λ_{j−1}^⊥ and M_q^{(i)} ∈ Λ_{j−1}, while, if i > j, M_q^{(i)} ⊥ V_j since V_j ⊂ Λ_j ⊂ Λ_{i−1}. Claim (7.3.15) and the theorem follow.
✷
Theorem 7.3.1 has a number of interesting consequences. Let

τ²(S) = Var M_q(S).

Then,

Var q(X₁, . . . , X_n) = Σ_{m=1}^n Σ { τ²(S) : |S| = m } = Σ_{m=1}^n Var M_q^{(m)}.   (7.3.22)
If we specialize to q symmetric, X₁, . . . , X_n i.i.d., (7.3.22) becomes

Var q(X₁, . . . , X_n) = Σ_{r=1}^n \binom{n}{r} Var v_q(X₁, . . . , X_r).   (7.3.23)
A number of important inequalities and other results follow from (7.3.22) and (7.3.23); see
Serfling (1980) and van Zwet (1984) as well as the classic paper of Hoeffding (1948).
The first term M_q^{(1)} of the Hoeffding expansion of a statistic of the form ν(P̂) is, by (7.3.10), a sum of i.i.d. variables, and is, not surprisingly, closely related to the first term of the von Mises expansion — the influence function approximation to ν. Unfortunately, E{ν(P̂)|X₁ = x} is not, in general, easy to compute and, unlike ψ(x, P), depends on n. However, showing that it is a good approximation can be easier. Here is a simple result.

Theorem 7.3.2. Suppose X₁, . . . , X_n are i.i.d., q_n(X₁, . . . , X_n) is a sequence of symmetric statistics with 0 < Var q_n(X) < ∞, and

u_n(x) ≡ E(q_n(X)|X₁ = x).
Write q_n for q_n(X). Suppose

Var(q_n) = n Var u_n(X₁) + o(1).

Then,

q_n = E q_n + Σ_{i=1}^n ( u_n(X_i) − E u_n(X_i) ) + o_P(1).   (7.3.24)

If further L{ n^{-1/2} Σ_{i=1}^n [ u_n(X_i) − E u_n(X_i) ] } → N(0, σ²), then

L{ n^{-1/2}( q_n − E(q_n) ) } → N(0, σ²).   (7.3.25)

Proof. Define M_q^{(0)} = E q_n(X), and M_q^{(1)} via

M_q^{(1)} = Σ_{i=1}^n ( u_n(X_i) − E u_n(X_i) ).
The result follows because

E[ q_n − E(q_n) − Σ_{i=1}^n ( u_n(X_i) − E u_n(X_i) ) ]² = Var(q_n) − Var( Σ_{i=1}^n u_n(X_i) )

since

( q_n − E(q_n) ) − Σ_{i=1}^n ( u_n(X_i) − E u_n(X_i) ) ⊥ Σ_{i=1}^n ( u_n(X_i) − E u_n(X_i) )

by the definition of u_n and Theorem 7.3.1. So (7.3.24) follows, and then so does (7.3.25) by Slutsky's theorem.
✷
We illustrate with

Example 7.3.2. U statistics. Our model is X₁, . . . , X_n i.i.d. as X ∼ P ∈ P. Given a symmetric function φ : X^m → R, a U statistic of order m is given by

U_n ≡ \binom{n}{m}⁻¹ Σ { φ(X_{i₁}, . . . , X_{i_m}) : 1 ≤ i₁ < · · · < i_m ≤ n }.   (7.3.26)

If P = {P : E_P U_n < ∞}, then it may be shown that U_n is the UMVU estimate of θ(P) ≡ E_P φ(X₁, . . . , X_m) (Problem 7.3.7). A U statistic of order 1 is just a mean of i.i.d. variables such as the sample mean.

An example of a U statistic of order 2, for X = R, is the one-sample Wilcoxon rank statistic for testing symmetry of P,

W = \binom{n}{2}⁻¹ Σ_{i<j} 1(X_i + X_j > 0).
To apply Theorem 7.3.2, set q_n = (2/n) Σ_{i<j} 1(X_i + X_j > 0), and note that

E( 1(X₁ + X₂ > 0) | X₁ = x ) = P(x + X₂ > 0) = 1 − F(−x),

so that u_n(x) = (2(n − 1)/n)(1 − F(−x)) plus an additive constant. It follows that the projection of q_n is, up to centering,

q̂_n ≡ Σ_{i=1}^n u_n(X_i) = (2(n − 1)/n) Σ_{i=1}^n (1 − F(−X_i)).

Consider the hypothesis H : F is symmetric about zero. Then

Var_H(q_n) = 4(n − 2) Var(1 − F(−X₁)) + O(1)

and the assumptions of Theorem 7.3.2 are satisfied. Because 1 − F(−X₁) = F(X₁) ∼ U(0, 1) under H, Var(q_n) = (n − 2)/3 + O(1), and n^{-1/2}(q_n − E(q_n)) → N(0, 1/3).
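A simulation sketch (our own choices of n, F, and number of replications) of the resulting normal limit for the normalized statistic W of this example under H:

```python
import numpy as np

rng = np.random.default_rng(6)

# One-sample Wilcoxon W = binom(n,2)^{-1} sum_{i<j} 1(X_i + X_j > 0) under
# H: F symmetric about 0; Theorem 7.3.2 gives sqrt(n)(W - 1/2) -> N(0, 1/3).
n, reps = 200, 1000
iu = np.triu_indices(n, k=1)
z = np.empty(reps)
for k in range(reps):
    x = rng.standard_normal(n)
    s = x[:, None] + x[None, :]            # all pairwise sums X_i + X_j
    w = (s[iu] > 0).mean()                 # the U statistic W
    z[k] = np.sqrt(n) * (w - 0.5)

print(z.mean(), z.var())                   # near 0 and 1/3
```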
For X = R², X = (Z, V), Kendall's τ statistic for testing independence of Z and V,

τ = \binom{n}{2}⁻¹ Σ_{i<j} 1( (Z_i − Z_j)(V_i − V_j) > 0 ),

is also a U statistic of order 2. We will study these two statistics further in Problems 7.3.11 and 7.3.16.
Next we analyze U statistics generally using the Hoeffding and von Mises approaches. Note first that for a U statistic of order m with Eφ²(X₁, . . . , X_m) < ∞, the Hoeffding expansion is truncated, since U_n ∈ Λ_m:

U_n = Σ_{j=0}^m M_{U_n}^{(j)}.

Therefore, from (7.3.22) and (7.3.23), for

v_q(x₁, . . . , x_j) ≡ E U_n(x₁, . . . , x_j, X_{j+1}, . . . , X_n),

Var(U_n) = Σ_{j=1}^m \binom{n}{j} Var v_q(X₁, . . . , X_j).
By symmetry,

v_q(x) = E[ \binom{n}{m}⁻¹ Σ { φ(X_{i₁}, . . . , X_{i_m}) : 1 ≤ i₁ < · · · < i_m ≤ n } | X₁ = x ]
       = ( \binom{n−1}{m−1} / \binom{n}{m} ) Eφ(x, X₂, . . . , X_m)
       = (m/n) Eφ(x, X₂, . . . , X_m),   (7.3.27)

up to an additive constant (which does not affect variances).
More generally, for 1 ≤ r ≤ m,

v_q(x₁, . . . , x_r) = ( \binom{n−r}{m−r} / \binom{n}{m} ) Eφ(x₁, . . . , x_r, X_{r+1}, . . . , X_m),

so that

Var(U_n) = Σ_{r=1}^m \binom{m}{r} ( \binom{n−r}{m−r} / \binom{n}{m} ) ξ_r,   (7.3.28)

where

ξ_r ≡ Var E( φ(X₁, . . . , X_m) | X₁, . . . , X_r ).   (7.3.29)
For the one-sample Wilcoxon statistic we have E( φ(X₁, X₂) | X₁ = x ) = 1 − F(−x) and E( φ(X₁, X₂) | X₁, X₂ ) = 1(X₁ + X₂ > 0). It follows that ξ₁ = Var(1 − F(−X₁)), ξ₂ = Var(1(X₁ + X₂ > 0)), and

Var(W) = 2( 2(n − 2)ξ₁ + ξ₂ ) / ( n(n − 1) ).
✷
Note that closely related to U_n is the von Mises statistic,

V_n ≡ ∫ · · · ∫ φ(x₁, . . . , x_m) dP̂(x₁) . . . dP̂(x_m)
    = n^{−m} Σ { φ(X_{i₁}, . . . , X_{i_m}) : 1 ≤ i₁, . . . , i_m ≤ n }.

So,

V_n = ( m! \binom{n}{m} / n^m ) U_n + ∫ · · · ∫_S φ(x₁, . . . , x_m) dP̂(x₁) . . . dP̂(x_m)   (7.3.30)

where S = {(x₁, . . . , x_m) : x_i = x_j for some i ≠ j}. The influence function calculation readily yields, for V_n,

ψ(x, P) = m( ∫ · · · ∫ φ(x, x₂, . . . , x_m) dP(x₂) . . . dP(x_m) − E_P φ(X₁, . . . , X_m) ).

However, application of Theorem 7.3.2 for V_n is difficult since F appears as dF. On the other hand, consider U_n.
Proposition 7.3.1. For U_n as in (7.3.26), if 0 < Var v_q(X₁) < ∞, then

n^{1/2}( U_n − E_P φ(X₁, . . . , X_m) ) ⇒ N( 0, E_P ψ²(X₁, P) )

where

ψ(X, P) = m[ E_P( φ(X₁, . . . , X_m) | X₁ ) − E_P φ(X₁, . . . , X_m) ].

Proof. By (7.3.27) and (7.3.23), if q_n = n^{1/2} U_n,

v_q(X₁) = n^{-1/2} ψ(X₁, P)   (7.3.31)

and from (7.3.28),

Var(n^{1/2} U_n) = m² ξ₁ + O(n⁻¹),

since \binom{n}{r}⁻¹ = O(n^{−r}). Now apply Theorem 7.3.2 to q_n above. Then,

Σ_{i=1}^n u_n(X_i) = n^{-1/2} Σ_{i=1}^n ψ(X_i, P)

so that, since Var ψ(X₁, P) = m² ξ₁, the conditions of the theorem are satisfied and Proposition 7.3.1 follows.
✷
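Proposition 7.3.1 can be checked numerically with the order-2 kernel φ(x, y) = (x − y)²/2 (our own choice, for which U_n is the unbiased sample variance):

```python
import numpy as np

rng = np.random.default_rng(7)

# Order m = 2 kernel phi(x, y) = (x - y)^2 / 2: U_n is the unbiased sample
# variance.  For P = N(0,1), E[phi | X_1 = x] = (x^2 + 1)/2, so
# psi(x, P) = 2((x^2 + 1)/2 - 1) = x^2 - 1 and E_P psi^2 = 2.
n, reps = 400, 3000
x = rng.standard_normal((reps, n))
u = x.var(axis=1, ddof=1)                  # U_n for each replication
z = np.sqrt(n) * (u - 1.0)

print(z.mean(), z.var())                   # near 0 and E_P psi^2 = 2
```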
Remark 7.3.1. If ψ(·, P) ≡ 0, the limit of 2n[U_n − E(U_n)] is of the form

∫∫ ψ(x, y, P) dW⁰(F(x)) dW⁰(F(y)).

See (7.3.8). This is also true for V_n. That is, the second term in (7.3.30) is negligible if

Eφ²(X_{(i₁)}, X_{(i₂)}, . . . , X_{(i_r)}) < ∞

for all i₁ + · · · + i_r = m and X_{(i_j)} = (X_j, X_j, . . . , X_j), with the dimension of X_{(i_j)} being i_j. We leave this to Problem 7.3.9.

The von Mises and Hoeffding expansions can be compared quite precisely for U statistics if we assume that φ(x₁, . . . , x_m) = 0 whenever x_i = x_j for some i ≠ j. In that case,

V_n = n^{−m} m! \binom{n}{m} U_n

and further, for ψ(x₁, . . . , x_j, P) and w_q as defined in (7.3.6) and (7.3.14),

w_q(x₁, . . . , x_j) = ψ(x₁, . . . , x_j, P) n^{−j}   (7.3.32)

for j = 1, . . . , m. We leave this result to Problem 7.3.9. We again refer to Serfling (1980) for an extensive treatment of U statistics.
✷
The analysis of variance expansion

Suppose Z₁, . . . , Z_k are independent predictor variables, Z_j taking on values in 𝒵_j ≡ {z_{j1}, . . . , z_{jn_j}} with equal probabilities n_j⁻¹, for 1 ≤ j ≤ k. Then if Y is a “response” variable, we are interested in the regression surface

μ(z₁, . . . , z_k) ≡ E(Y | Z₁ = z₁, . . . , Z_k = z_k).

Here μ(Z₁, . . . , Z_k) is a general function of (Z₁, . . . , Z_k) and thus can be expanded in the Hoeffding sense as

μ(Z₁, . . . , Z_k) = Eμ(Z₁, . . . , Z_k) + Σ_{j=1}^k { E(μ(Z₁, . . . , Z_k)|Z_j) − Eμ(Z₁, . . . , Z_k) } + · · · + U_μ^{(k)}.

Here the first term is the mean response, which is

Eμ(Z₁, . . . , Z_k) ≡ μ(·, · · · , ·) = (1/(n₁ · · · n_k)) Σ { μ(z₁, . . . , z_k) : z_j ∈ 𝒵_j, 1 ≤ j ≤ k }.

The next term is

E(μ(Z₁, . . . , Z_k)|Z₁ = z) − Eμ(Z₁, . . . , Z_k) = μ(z, ·, · · · , ·) − μ(·, · · · , ·),

where we use dots in the usual ANOVA sense for averaging. But in the case where the Z's are treatments, this is just the effect of treatment 1 at level z. Continuing, U^{(2)}({1, 2})(z₁, z₂) is just the first order interaction between Z₁ and Z₂ at levels z₁ and z₂. So in this context, the Hoeffding expansion is the classical ANOVA expansion mentioned in terms of the corresponding variance decomposition in Example 6.1.3 and discussed extensively in books such as Box and Hunter (1978).
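The decomposition can be computed directly on a grid; the following sketch (our own, for k = 2 and an arbitrary surface) verifies the orthogonality of the terms and the variance decomposition (7.3.22):

```python
import numpy as np

rng = np.random.default_rng(8)

# ANOVA form of the Hoeffding expansion for k = 2 uniform factors:
# mu = grand mean + main effects + interaction, all mutually orthogonal.
n1, n2 = 4, 5
mu = rng.normal(size=(n1, n2))             # arbitrary surface mu(z1, z2)

grand = mu.mean()                          # mu(., .)
a = mu.mean(axis=1) - grand                # effect of factor 1 at each level
b = mu.mean(axis=0) - grand                # effect of factor 2 at each level
inter = mu - grand - a[:, None] - b[None, :]   # first order interaction

# Variance decomposition: Var mu = Var(main 1) + Var(main 2) + Var(inter).
total = ((mu - grand) ** 2).mean()
parts = (a ** 2).mean() + (b ** 2).mean() + (inter ** 2).mean()
print(total, parts)                        # equal up to rounding
```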
Summary. In Section 7.3.1 we consider a generalization of Taylor expansion due to von
Mises. The limiting behaviour of statistics whose first order Gâteaux derivative vanishes
is related to the Brownian bridge. Finally in Section 7.3.2 we introduce another type of
expansion due to Hoeffding and closely related to the analysis of variance. This is an
expansion into orthogonal terms and as such enables us to develop powerful bounds used
in establishing asymptotic approximations. The theory is applied to U statistics.
7.4 PROBLEMS AND COMPLEMENTS
Problems for Section 7.1
1. Establish (7.1.4) using Problem 5.4.5.
2. Establish (7.1.6) using the multivariate Taylor expansion and the multivariate CLT.
3. Prove Corollary 7.1.1 from Theorem 7.1.1.
Hint. Let A_{mj} be the event where

sup{ |Z_n(s) − Z_n(t)| : s, t ∈ T_{mj} } > ε.

Then P(∪_j A_{mj}) ≤ Σ_j P(A_{mj}).
4. Suppose T is a metric space, Z_n ⇒ Z on l∞(T), and P[Z is continuous] = 1. Let t_n →^P t on T, t_n possibly random but t a constant. Show that then

Z_n(t_n) ⇒ Z(t).
Hint. The map M : l∞(T) × T → R given by M(f, t) = f(t) is continuous at all (f₀, t₀) such that f₀ is continuous at t₀. Apply Corollary 7.1.2.
5. Assuming the existence of the Wiener process W(·) on [0, 1], show that if U(t) ≡ W(t) − tW(1), then U(·) ∼ W⁰(·), where W⁰(·) is the Brownian bridge.
6. Establish FIDI convergence and tightness in Proposition 7.1.1.
Hint. [nt]/n − [ns]/n = (t − s) + O(1/n).
7. Show that, if X = R, T = {all f with |f|_∞ ≤ 1}, and P is continuous, then

sup{ |E_n(f)| : f ∈ T } = n^{1/2}.
8. Show that in Definition 7.1.3, (ii) implies (i).
9. Prove Theorem 7.1.5.
10. Show that the convergence in (7.1.27) is almost sure.
Hint. Use the Borel-Cantelli Lemma; see Appendix D.1.
11. Let W⁰(u), 0 ≤ u ≤ 1, be the Brownian bridge.
(a) Show that ∫_0^1 W⁰(u)a(u)du has a N(0, σ²(a)) distribution where
σ²(a) = 2 ∫_0^1 ∫_0^1 1[s ≤ t] s(1 − t) a(s)a(t) ds dt.
(b) If A(v) ≡ ∫_0^v a(s)ds and ∫_0^1 a(s)ds = 0, show that
σ²(a) = ∫_0^1 A²(s)ds.
Hint. From Problem D.2.3, W⁰ is continuous.
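As a sanity check on (a) (an illustration, not part of the problem), here is a small Monte Carlo sketch: for the constant weight a ≡ 1 the double integral evaluates to 1/12, and a discretized Brownian bridge reproduces that variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_paths = 256, 20000               # grid size and number of simulated paths

# Brownian motion on a grid via cumulative sums of scaled Gaussian increments
dW = rng.standard_normal((n_paths, m)) / np.sqrt(m)
W = np.cumsum(dW, axis=1)
t = np.arange(1, m + 1) / m
bridge = W - t * W[:, -1:]            # W0(t) = W(t) - t W(1)

# Riemann sum for the integral of W0(u) a(u) du with a = 1
integral = bridge.mean(axis=1)

# theory: sigma^2(a) = 2 * double integral of s(1-t) over s <= t = 1/12
print(integral.var())                 # close to 1/12 = 0.0833...
```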
12. For the Section I.2 example show that the process t → √n(F̂(t) − Gμ̂,σ̂²(t)) converges weakly to a Gaussian process Z(· ; μ, σ²) if μ̂ → μ and σ̂² → σ² in probability, whenever the data are generated from the conditional distribution of Zn given Zn > 0, where Zn ∼ N(μn, σn²) and μn → μ, σn² → σ².
13. Suppose U, V, W1, W2, . . . are i.i.d. uniform [0, 1] and set
Z(t) = U + V t,  Zn(t) = Z(t) + n⁻¹ Σ_{i=1}^{rn} 1[Wi ≤ t],  t ∈ [0, 1],
where the integer rn satisfies (rn/n) → 0 as n → ∞. Show that
(a) Zn →_{FIDI} Z.
Hint. Compute |Zn − Z| and show that Zn − Z → 0. Apply Slutsky's theorem.
(b) Zn =⇒ Z.
Hint.
Zn(s) − Zn(t) = (s − t)V + n⁻¹ Σ_{i=1}^{rn} {1[Wi ≤ s] − 1[Wi ≤ t]}.
Introduce
Tmj = [(j − 1)/km, j/km],  j = 1, . . . , km,
and show that |Zn(s) − Zn(t)| is bounded on Tmj by km⁻¹ + rn/n.
14. Let U, V, W1, W2, . . . be i.i.d. uniform [0, 1], let Φ be the N(0, 1) df, and set
Z(t) = Φ((t − U)/(V + 1)),  Zn(t) = Z(t) + n⁻¹ Σ_{i=1}^{rn} Φ((t − Wi)/(V + 1)).
Show that if rn/n → 0 as n → ∞, then
(a) Zn(·) →_{FIDI} Z(·), (b) Zn(·) =⇒ Z(·).
15. If X1, . . . , Xn are i.i.d. with continuous df F and empirical df F̂, then, in l∞(R),
√n(F̂(·) − F(·)) =⇒ W⁰(F(·)).
Hint. Show that for any df F(·), if U1, . . . , Un are i.i.d. U(0, 1) then F⁻¹(U1), . . . , F⁻¹(Un) are i.i.d. F, where F⁻¹(t) = inf{x : F(x) ≥ t}.
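The hint's quantile transform is easy to illustrate numerically. The sketch below (an illustration, not a solution) pushes uniforms through the Exp(1) quantile function F⁻¹(t) = −log(1 − t) and compares the empirical df with F:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100000)

# quantile function of Exp(1): F^{-1}(t) = -log(1 - t)
x = -np.log(1 - u)

# empirical df of F^{-1}(U_i) versus F(q) = 1 - exp(-q)
for q in (0.5, 1.0, 2.0):
    print(q, (x <= q).mean(), 1 - np.exp(-q))
```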
16. Establish (7.1.15).
17. Establish Theorem 7.1.5 by extending the proof of Theorem 7.1.4.
18. (a) Show there exists a constant α > 0 (α ≈ 6.308) such that for every real random
variable X and every h > 0,
P(|X| ≥ h⁻¹) ≤ (α/2h) ∫_{−h}^{h} {1 − φX(t)} dt,
where φX(t) = E{exp(itX)} denotes the characteristic function of X.
Remark: In this problem, we do not have the assumption E(|X|) < ∞.
Hint. You can use (without proving) the inequality sin(x)/x ≤ sin(1), which holds for any |x| ≥ 1.
(b) Let X1 , X2 , . . . be a sequence of real random variables. Assume that φXn converges
pointwise to φ, as n → ∞, where φ is continuous (but not necessarily a characteristic
function). Show that Xn = OP (1).
19. Show the following:
(a) For δ > 0,
δe^{−δ²/2}/(1 + δ²) ≤ ∫_δ^∞ e^{−x²/2} dx ≤ e^{−δ²/2}/δ.
Hint. You may find integration by parts helpful for the first inequality.
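A quick numerical check of the two bounds in (a), writing the tail integral in terms of the complementary error function:

```python
import math

def upper_tail(delta):
    # integral of e^{-x^2/2} from delta to infinity = sqrt(pi/2) * erfc(delta/sqrt(2))
    return math.sqrt(math.pi / 2) * math.erfc(delta / math.sqrt(2))

for d in (0.5, 1.0, 2.0, 4.0):
    lower = d * math.exp(-d * d / 2) / (1 + d * d)
    upper = math.exp(-d * d / 2) / d
    assert lower <= upper_tail(d) <= upper
    print(d, lower, upper_tail(d), upper)
```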
(b) For X1, . . . , Xn independent and identically distributed N(0, 1) random variables,
(i) P(max_{1≤i≤n} |Xi| > √(2 log n)) → 0, as n → ∞,
(ii) √(log n) P(max_{1≤i≤n} |Xi| > √(2 log n)) → 1/√π, as n → ∞,
(iii) n^{c/2−1} √(log n) P(max_{1≤i≤n} |Xi| > √(c log n)) → √2/√(cπ), as n → ∞.
20. Let {Zn} be a sequence of independent and identically distributed N(0, 1) random
variables. Prove that there is a random variable X such that P(X < ∞) = 1 and
|Zn| ≤ X√(log n) for all n ≥ 2.
21. Let ε1, . . . , εn be independent and identically distributed random variables with P(ε1 = 1) = P(ε1 = −1) = 1/2.
(a) Let a1, . . . , an be real numbers and let ‖a‖² = Σ_{i=1}^n a_i². Prove that for any x > 0,
P(|Σ_{i=1}^n εi ai| > x) ≤ 2 exp(−x²/(2‖a‖²)).
Hint: Show that for any real λ, E exp(λε1) ≤ exp(λ²/2).
(b) Let Z1, . . . , Zn be independent and identically distributed random variables, independent of ε1, . . . , εn, such that 0 < EZ1² = τ < ∞. Prove that for any x > 0,
lim sup_{n→∞} P(|Σ_{i=1}^n εi Zi| > √n x given Z1, . . . , Zn) ≤ 2 exp(−x²/(2τ))
with probability one.
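The subgaussian bound in (a) can be checked by simulation. The sketch below (with arbitrary illustrative weights ai) estimates the tail probability and verifies it stays below 2 exp(−x²/2‖a‖²):

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([3.0, -1.0, 2.0, 0.5, -2.5, 1.5, 4.0, -0.5])  # illustrative weights
norm_sq = (a**2).sum()

eps = rng.choice([-1.0, 1.0], size=(200000, a.size))       # Rademacher signs
sums = eps @ a
for x in (2.0, 5.0, 10.0):
    emp = (np.abs(sums) > x).mean()
    bound = 2 * np.exp(-x * x / (2 * norm_sq))
    assert emp <= bound
    print(x, emp, bound)
```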
Problems for Section 7.2
1. Show that (7.2.5) holds if the map, Q → ψ(Q, P ), is continuous for weak convergence,
i.e. Qm =⇒ Q implies that ψ(Qm , P ) → ψ(Q, P ) .
Hint. Use the result of Section 5.4.1.
2. The sensitivity curve of ν(P̂) is defined by SCn(x) = n(ν(P̂_{n−1,x}) − ν(P̂_{n−1})) where
P̂_{n−1} is the empirical distribution of X1, . . . , X_{n−1}, and
P̂_{n−1,x} = ((n − 1)/n) P̂_{n−1} + (1/n) δx.
Suppose that the Gâteaux derivative of ν(·) is well defined in every direction and (7.2.2)
holds at all P ∈ M.
(a) Show that if ν̂ = ν(P̂) has influence function ψ, then
SCn(x; ν̂) = n ∫_0^{1/n} [ψ(x, (1 − t)P̂_{n−1} + tδx) + ψ(P̂_{n−1}, (1 − t)P̂_{n−1} + tδx)] dt.
(b) Deduce that if Pm =⇒ P implies that ψ(x, Pm) → ψ(x, P), then, as n → ∞,
SCn(x, ν̂) → ψ(x, P) in probability.
3. Show that ψ(x, P ) as given by (7.2.10) satisfies (7.2.5).
4. Let (X1, Y1), . . . , (Xn, Yn) be i.i.d. as (X, Y) ∼ P, where 0 < E(X⁴), E(Y⁴) < ∞. Let ν1(P) = Cov(X, Y) and let ν2(P) = ν1²(P)/[Var(X)Var(Y)] be the squared correlation coefficient. Compute the influence function of (a) ν1(P), (b) ν2(P).
Hint. Use (7.2.10) and Example 5.3.6.
5. In the Hardy-Weinberg model where three categories have frequencies
p1 = θ², p2 = 2θ(1 − θ), p3 = (1 − θ)², 0 < θ < 1
(Examples 2.1.4, 2.2.6), we can write the parameter θ as ν1(P) = √p1, ν2(P) = 1 − √p3,
or ν3(P) = p1 + ½p2. Use Example 7.2.1 to compute the influence functions ψ1, ψ2, and
ψ3 of ν1(P̂), ν2(P̂), and ν3(P̂). Which ψj has the smallest value of the asymptotic variance
EP ψj²(X, P)?
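The comparison asked for here can be previewed by Monte Carlo (a numerical illustration at θ = 0.5, not the requested analytic computation); it suggests that ν3(P̂) has the smallest asymptotic variance, matching the information bound θ(1 − θ)/2:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.5, 400, 4000
p = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta)**2])

counts = rng.multinomial(n, p, size=reps)
p_hat = counts / n
nu1 = np.sqrt(p_hat[:, 0])                 # sqrt(p1-hat)
nu2 = 1 - np.sqrt(p_hat[:, 2])             # 1 - sqrt(p3-hat)
nu3 = p_hat[:, 0] + p_hat[:, 1] / 2        # p1-hat + p2-hat/2

# n * Var approximates the asymptotic variance E_P psi_j^2(X, P)
for name, est in (("nu1", nu1), ("nu2", nu2), ("nu3", nu3)):
    print(name, n * est.var())
```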
6. (a) Show that if ψα is given by (7.2.13), then the definition of ν(F̂) in Problem 5.4(a) is equivalent to (7.2.14).
Hint. Eψα(X, θ) = α[1 − F̂−(θ)] − (1 − α)F̂−(θ).
(b) Show that for ρ and x̂α as defined in Example 7.2.3, x̂α = arg min_θ ∫ ρ(x, θ) dF̂(x).
7. Show that να(F) minimizes ∫ ρ(x, θ) dF(x) for ρ and να(F) as in Example 7.2.3.
8. Establish (7.2.16) under appropriate assumptions.
Hint. Use the assumptions and method of Problem 5.4.1 to show that
(√n(x̂α − F⁻¹(α)), n^{−1/2} Σ_{i=1}^n ψα(Xi, F))
have an asymptotically joint Gaussian distribution concentrating in the plane on the diagonal {(u, v) ∈ R² : u = v}, which establishes the claim.
9. (a) Establish (7.2.17).
(b) Give the details needed to establish the asymptotic normality of the Cramér-von
Mises statistics if the true F 6= F0 using Theorem 7.2.1.
10. Consistency for minimum distance estimates. Let d be a suitable metric and {Pθ :
θ ∈ Θ compact ⊂ Rp} be a regular parametric model such that θ is identifiable and
θ → d(P, Pθ) is continuous for fixed P. Let θ̂ = arg min_θ d(P̂, Pθ).
(a) Show that θ̂ is consistent.
Hint. If θ0 is true, d(P̂, P_{θ0}) = OP(n^{−1/2}). But
d(P_{θ0}, P_{θ̂}) ≤ d(P_{θ0}, P̂) + d(P_{θ̂}, P̂) ≤ 2d(P_{θ0}, P̂)
and 1-1 continuous maps on compacts have continuous inverses.
(b) Show that if the conditions of Theorem 7.2.1 are satisfied, then θ̂ is √n consistent in the sense that √n(θ̂ − θ) = OP(1).
11. In Theorem 7.2.2, show that if θ(Q) is defined as the unique solution of
∫ φ(x, θ(Q)) dQ(x) = 0
for all Q in a convex neighbourhood of P, and A3' holds, then the influence function of θ(P) is indeed
ψ(x, P) = [H⁻¹(P)]φ(x, θ(P)).
12. (a) In Example 7.2.5 show that if ∫|x|^r dP(x) < ∞, then ∫|x − θ|^r dP(x) takes its infimum over θ for r ≥ 1, and the minimizing θ(P) is unique if r > 1.
(b) Show that θ(P) = arg min ∫(|x − θ|^r − |x|^r) dP(x), which is defined iff ∫|x − θ|^{r−1} dP(x) < ∞.
Hint. If |θ| → ∞, ∫|x − θ|^r dP(x) → ∞, and θ → ∫|x − θ|^r dP(x) is strictly convex for r > 1.
(c) Show that θ(P) is uniquely defined even if r = 1, provided P has a positive density.
Hint. (i) Prove the result for d = 1, where θ(P) is just the median.
(ii) If the minimizer is not unique, then by convexity there exist θ0, Δ ≠ 0 such that all θ0 + εΔ, 0 ≤ ε ≤ 1, are minimizers. Rotate and scale so that Δ = (1, 0, . . . , 0).
13. (a) Show that Ψ(·, P ) ∈ L2 (P ) and M (P ) given by (7.2.26) is well defined if r > 1
and EP |X|2r−2 < ∞ or r = 1 and P has a positive density.
Hint. For r = 1, use spherical polar coordinates.
(b) Check the conditions of Theorem 7.2.2 for the multivariate medians.
14. Show that A4' in Theorem 7.2.2 is implied by:
A4'': sup{|ψ(x, θ) − ψ(x, θ0)| : |θ − θ0| ≤ γ} ≤ V(x, θ0)γ^α, α > 0,
for all θ0, γ > 0, where EP V²(X, θ) < ∞ for all |θ − θ(P)| ≤ ε.
Hint. Consider a box {θ : |θ − θ(P)|∞ ≤ ε} and break it up into a grid of (ε/δ)^p boxes of side δ. If θ(1), . . . , θ(K), where K = 2^p, are the vertices of such a box B, then for θ in this box, ψ(x, θ) ≥ f̲ where f̲ is defined as min{ψ1(x, θ) − ψ(x, θ(P)) : θ in B}, and similarly for f̄.
15. Apply Problem 7.2.14 to Example 7.2.4 to justify A4 for ψ(x, θ) ≡ (x − θ)|x − θ|^{r−1} for 1 ≤ r ≤ 2, p ≥ 2, by showing that
|ψ(x, θ) − ψ(x, θ′)| ≤ |θ − θ′| |x − θ|^{r−2}.
16. (a) Establish (7.2.32).
Hint. F⁻¹(F(t)) = t, t ∈ R.
(b) Show that (7.2.40) follows from (7.2.39).
17. Derive (7.2.43) from (7.2.41).
18. Let Fθ = F0(· − θ) be a location parameter family. Let θ̂ be the minimum distance estimate
θ̂ = arg min_θ ∫_{−∞}^{∞} (F̂(x) − Fθ(x))² dF0(x).
(a) Compute formally the influence function of θ̂.
(b) Show that the expansion (7.2.1) is valid if ∫_{−∞}^{∞} |f′(x)| dx < ∞.
Hint. Without loss of generality, take θ0 = 0. Note that θ(F) solves
∫ (F0(x − θ) − F(x)) f0(x − θ) dF0(x) = 0.
(c) Set d(F, Fθ) = ∫_{−∞}^{∞} (F(x) − F0(x − θ))² dF0(x). Show that θ̂ is consistent using Problem 7.2.10 and
∂d/∂θ (F0, F0) = 0,  ∂²d/∂θ² (F0, F0) ≠ 0.
(d) Let
θ̂ = arg min_θ sup_x |F̂(x) − Fθ(x)|.
Show that √n(θ̂ − θ) is not asymptotically normal and hence does not have an influence function.
Hint. Consider θ with |θ| = O(n^{−1/2}) and Zn(x; θ) = √n(F̂(x) − Fθ(x)) = √n(F̂(x) − F0(x)) + f0(x)√n θ + Op(1). Apply Donsker's theorem.
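A numerical sketch of the minimum distance estimate in Problem 18, with F0 = Φ (standard normal), a grid search over θ, and simulated N(1, 1) data; the grid and sample size are arbitrary illustrative choices:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
theta0, n = 1.0, 2000
data = np.sort(rng.standard_normal(n) + theta0)     # X_i ~ N(theta0, 1)

Phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))

grid = np.linspace(-5.0, 7.0, 600)
dx = grid[1] - grid[0]
F_hat = np.searchsorted(data, grid, side="right") / n   # empirical df on the grid
w = np.exp(-grid**2 / 2) / math.sqrt(2 * math.pi)       # dF0(x) weight

def cvm_distance(theta):
    # quadrature for the integral of (F_hat(x) - F0(x - theta))^2 dF0(x)
    return (((F_hat - Phi(grid - theta)) ** 2) * w).sum() * dx

thetas = np.linspace(-1.0, 3.0, 401)
theta_hat = thetas[np.argmin([cvm_distance(th) for th in thetas])]
print(theta_hat)   # close to theta0 = 1
```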
19. (a) Show that in (7.2.1), n⁻¹ Σ_{i=1}^n ψ(Xi, P) = OP(n^{−1/2}).
(b) Show that ψ is unique if the approximation is valid. That is, show that if
ν(P̂) = ν(P) + (1/n) Σ_{i=1}^n ψ1(Xi, F) + op(n^{−1/2})
     = ν(P) + (1/n) Σ_{i=1}^n ψ2(Xi, F) + op(n^{−1/2})
with EP ψj(X) = 0, EP ψj²(X) < ∞, j = 1, 2, then P[ψ1(X) = ψ2(X)] = 1.
Hint. Use the CLT and the fact that n^{−1/2} Σ_{i=1}^n Δ(Xi) = oP(1) for Δ ∈ L2(P) iff Δ = 0.
20. Let X(1) < . . . < X(n) be the order statistics of X1, . . . , Xn i.i.d. as X ∼ F where F is continuous. Show that if E|X|^ε < ∞ for ε > 0, then for all r > 0 there exist Cr > 0 and n(r, k) such that for n ≥ n(r, k),
E|X(k)|^r ≤ Cr if α ≤ k/n ≤ 1 − α, 0 < α < 1/2.
21. Show that if f = F′ > 0 and continuous, then, if k/n → s, l/n → t, s ≤ t,
n Cov(X(k), X(l)) → s(1 − t)/[f(F⁻¹(s)) f(F⁻¹(t))]
uniformly for α ≤ k/n, l/n ≤ 1 − α.
Hint. You may use the fact that if Un =⇒ U and E|Un|^{r+1} ≤ C for all n, then EUn^r → EU^r; if Un(·) =⇒ U(·) and sup_{t,n} E|Un(t)|^{r+1} ≤ C, then sup_t |EUn^r(t) − EU^r(t)| → 0. Use Problem 7.1.1.4 and the quantile process convergence.
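The covariance limit in Problem 21 is easy to check by simulation for F uniform on [0, 1], where f(F⁻¹) ≡ 1 and the limit is simply s(1 − t):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20000
k, l = 50, 150                        # k/n -> s = 0.25, l/n -> t = 0.75

u = np.sort(rng.uniform(size=(reps, n)), axis=1)
xk, xl = u[:, k - 1], u[:, l - 1]     # order statistics X_(k), X_(l)
n_cov = n * np.cov(xk, xl)[0, 1]

# for the uniform df, f(F^{-1}) = 1, so the limit is s(1 - t) = 0.0625
print(n_cov)
```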
22. If (U, V ) are such that E(U 2 + V 2 ) < ∞ and E(V | U = u) is nondecreasing in u,
then Cov(U, V ) ≥ 0.
Hint. Cov(U, V) = Cov(U, E(V|U)) = ½E[(U − U′)(E(V|U) − E(V′|U′))] where (U, V) and (U′, V′) are i.i.d. (See below A.11.14.)
23. Show that under the conditions of Problem 21, we always have
Cov(X(k) , X(l) ) ≥ 0 .
24. Suppose that λ = Λ is as in Example 7.2.5, save that we no longer require λ(t) = 0 for t < ε and t > 1 − ε but only that λ(t) ≤ C < ∞ and EX² < ∞. Show that the conclusion (7.2.39) still holds (Bickel (1967)).
Hint. Argue that
(i) ∫_ε^{1−ε} √n(F̂⁻¹(t) − F⁻¹(t))λ(t) dt =⇒ N(0, σε²(Λ));
(ii) E(∫_{1−ε}^1 √n(F̂⁻¹(s) − F⁻¹(s))λ(s) ds)² → 0 as ε → 0 if λ ≡ 1;
and Problem 7.2.20 ensures that we can pass to the general case.
25. Consider the nonparametric regression model Y = μ(X) + e, where μ(·) is unknown, E(e) = 0, Var(e) = σ², and X and e are independent. The nonparametric correlation is defined as η² = CORR²(μ(X), Y). Assume η² exists.
(a) Show that η² = (E[μ²(X)] − μ_Y²)/Var(Y).
(b) Let ν(P) = E[μ²(X)] = ∫ μ²(x)f(x)dx where X ∼ F, f = F′ is assumed to exist. Informally, show that the influence function is μ²(x) + 2μ(x)e by computing the Gâteaux derivative and evaluating it at Q = δ(x,y).
Note: For the model (1 − ε)P + εQ, both f(x) and μ(x) depend on ε.
(c) Find the influence function of η².
(d) Evaluate your answer to (c) when μ(X) = α + βX.
(e) Compare your answer to (d) with the influence function of the Pearson squared correlation coefficient ρ² when μ(X) = α + βX.
Hint. See Problem 9.2.4.
Problems for Section 7.3
1. Verify (7.3.2) when X is finite.
2. Show that (7.3.2) and (7.3.5) imply (7.3.3) and (7.3.6) and that ψ is symmetric in
x1 , . . . , xn .
3. (a) Verify that if h in Theorem 5.4.1 has continuous derivatives of order k, then
θ(P) ≡ h(p) can be represented as in (7.3.1).
(b) Identify the ψ(·, . . . , P ) functions in terms of the derivatives of h and expectations
of these.
(c) Verify (7.3.2) in this case.
4. A U statistic of order 2 is called degenerate iff
U = (n(n − 1)/2)⁻¹ Σ_{i<j} u(Xi, Xj)
with E{u(X1, X2)|X1} = 0 and hence Eu(X1, X2) = 0. Suppose u(x, x) = 0 and Eu²(X1, X2) < ∞. Show that
(a) nU = ∫ u(x, y) dEn(x) dEn(y).
(b) If L(nU ) → L0 for some probability law L0 , what form would you expect L0 to
have?
(c) Is the result you conjecture a consequence of weak convergence of En (·)?
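A simulation sketch of the degenerate case, using the product kernel u(x, y) = xy with standard normal data (so E{u(X1, X2)|X1} = 0); the limit one would conjecture in (b) is then Z² − 1 with Z ∼ N(0, 1), which has mean 0 and variance 2:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 5000

x = rng.standard_normal((reps, n))
s, q = x.sum(axis=1), (x**2).sum(axis=1)

# U = (n(n-1)/2)^{-1} sum_{i<j} x_i x_j, so nU = (s^2 - q)/(n - 1)
nU = (s**2 - q) / (n - 1)

# degenerate kernel u(x, y) = xy: the limit Z^2 - 1 has mean 0, variance 2
print(nU.mean(), nU.var())
```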
5. Establishing weak convergence for degenerate U statistics. Suppose φ(x, y) in (7.3.2) is symmetric. Then there exist real eigenvalues λj, j ≥ 1, and eigenfunctions ψj(x) of the kernel K(x, y) = u(x, y), such that
∫ ψj(y)K(x, y) dF(y) = λj ψj(x),  K(x, y) = Σ_{j=1}^∞ λj ψj(x)ψj(y),
and
∫ ψi(x)ψj(x) dF(x) = δij ≡ 1[i = j].
(a) Assuming the preceding, show that
∫ u(x, y) dEn(x) dEn(y) = Σ_{j=1}^∞ λj ∫ ψj(x)ψj(y) dEn(x) dEn(y) = Σ_{j=1}^∞ λj Zjn²,
where Zjn ≡ ∫ ψj(x) dEn(x).
(b) Show that Cov(Zin, Zjn) = δij and hence
(Z1n, . . . , Zkn) =⇒ (Z1, . . . , Zk), Zj i.i.d. N(0, 1).
(c) If Σ_{j=1}^∞ |λj| < ∞, then nU =⇒ Σ_{j=1}^∞ λj Zj².
Hint. ∫ Σ_{j=1}^k λj ψj(x)ψj(y) dEn(x) dEn(y) =⇒ Σ_{j=1}^k λj Zj² and
E(∫ (u(x, y) − Σ_{j=1}^k λj ψj(x)ψj(y)) dEn(x) dEn(y))² = E(Σ_{j=k+1}^∞ λj Zjn²)² = Var(Σ_{j=k+1}^∞ λj Zjn²) + (Σ_{j=k+1}^∞ λj)².
6. Let U ∼ Np(0, I) and let M (p × p) be symmetric. Show that
(a) UᵀMU ∼ Σ_{j=1}^p λj Zj² where the Zj are i.i.d. N(0, 1) and λ1, . . . , λp are the eigenvalues of M. That is,
M = PΛPᵀ    (7.4.1)
where Λ = diag(λ1, . . . , λp) and P is orthonormal. (You may assume (7.4.1), a generalization of the principal axis theorem (B.10.1.1).)
(b) Show that (7.4.1) is equivalent to the existence of unique v1, . . . , vp orthonormal such that
M vj = λj vj,  vjᵀM = λj vjᵀ,  1 ≤ j ≤ p.
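Part (a) can be checked on the first two moments: E UᵀMU = Σj λj and Var(UᵀMU) = 2 Σj λj², as for the mixture Σj λj Zj². A sketch with an arbitrary symmetric M:

```python
import numpy as np

rng = np.random.default_rng(7)
p, reps = 4, 200000

A = rng.standard_normal((p, p))
M = (A + A.T) / 2                            # a fixed symmetric matrix
lam = np.linalg.eigvalsh(M)                  # its eigenvalues

U = rng.standard_normal((reps, p))           # U ~ N_p(0, I)
quad = np.einsum('ij,jk,ik->i', U, M, U)     # U^T M U, one value per draw

# moments of sum_j lam_j Z_j^2: mean = sum(lam), variance = 2 sum(lam^2)
print(quad.mean(), lam.sum())
print(quad.var(), 2 * (lam**2).sum())
```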
7. Show that Un defined in (7.3.26) is the UMVU estimate of EP φ(X1 , . . . , Xm ).
8. Establish (7.3.20) rigorously.
9. Establish the conclusion of Remark 7.3.1 for Vn .
10. Establish (7.3.32).
11. (a) Show that the one-sample Wilcoxon statistic W is asymptotically normal with mean μ(F) = ∫(1 − F(−x)) dF(x) and variance σ²(F)/n, where
σ²(F) = ∫(1 − F(−x))² dF(x) − μ²(F)
and F(x) = P[X ≤ x].
(b) Show that if F(x) is continuous and F(x) = 1 − F(−x) for all x, then μ(F) = 1/2 and σ²(F) = 1/12.
12. Suppose W̃ ≡ n⁻² Σ_{i,j} 1(Xi + Xj > 0).
(a) Show that W̃ is a von Mises statistic and that, under the assumption of Problem 11(b), W̃ = W + Op(n^{−1/2}), where W is the one-sample Wilcoxon statistic.
(b) Show that the identity of (a) fails if F is not continuous.
13. Develop a multivariate central limit theorem for U statistics. We state it for U statistics of order 2 but the generalization is obvious. Let ϕ = (ϕ1, . . . , ϕk) be a function from X × X → Rk such that ∫ ϕj²(x, y) dF(x) dF(y) < ∞ for all j, and the ϕj are symmetric, 1 ≤ j ≤ k.
(a) Show that Ūn = (n(n − 1)/2)⁻¹ Σ_{i<j} ϕ(Xi, Xj) is asymptotically normal with expectation Eϕ(X1, X2) and variance 2Var(E[ϕ(X1, X2)|X1]).
(b) Extend this result to U statistics of orders m1, m2, . . . , mk.
14. Show that even if P[Xi = Xj] > 0 for i ≠ j, nevertheless the von Mises statistic
n⁻² Σ_{i,j} ϕ(Xi, Xj)
is asymptotically normal if Eϕ²(X1, X2) < ∞ and Eϕ²(X1, X1) < ∞.
Hint. Use Problem 13.
15. Apply Problem 14 to derive the asymptotic distribution of W̃ in Problem 12 generally.
16. Derive the results parallel to Problems 11, 12, and 15 for Kendall’s τ .
17. Establish (7.3.31).
7.5 Notes
Notes for Section 7.1
(1) It may be shown that if T is Euclidean, given the FIDIs of {X(t) : t ∈ T}, it is always possible to choose X(t) separable and X(·, ·) defined by X(u, v) = X(v) − X(u) separable — see Doob (1953), pp. 46–47. Unfortunately, empirical processes in general are defined on sets of functions T which may not be identified with a subset of a Euclidean, or even a separable, metric space. The only recourse is then to go to so-called "outer probabilities" — see van der Vaart and Wellner (1996) and Pollard (1990). However, in all cases of interest that we treat, separability of X(·) and X(·, ·) (and measurability) are more or less evident, so we shall ignore this technicality.
Chapter 8
DISTRIBUTION-FREE, UNBIASED,
AND EQUIVARIANT PROCEDURES
8.1 Introduction
In this chapter we leave asymptotic theory and elaborate on some general inference principles initiated in Chapter 1, Sections 1.2 and 1.3 and Chapter 3, Sections 3.3 and 3.4. The
results will guide the construction of procedures with optimal asymptotic properties for
models much more general than those of this chapter. Recent work relating to the topics in
this chapter deals with optimal inference after data based model selection. See Fithian, Sun
and Taylor (2014).
One motivating question is: when can we construct a test of a composite hypothesis whose power function (the probability of Type I error) is constant on the hypothesis and, if so, how can we do it? In examining answers to this question we will establish results important for answering other decision-theoretic questions having to do with estimation as well.
The first answer, elaborated on in Section 8.2, is based on the concept of a complete
sufficient statistic when the null hypothesis is our model. Such a statistic, which, if it exists,
must be minimal sufficient, can be conditioned on, and the resulting tests with conditional
Type I error probability equal to α have the desired property. An important example discussed is permutation tests. It turns out that the notion of completeness is also what’s
needed to construct uniformly minimum variance unbiased (UMVU) estimates in situations where the information bound for unbiased estimates is not achievable. This is a much
more powerful method than the one discussed in Section 3.4.
The second answer is based on the notion that some models exhibit symmetries which,
for loss functions having corresponding symmetries, lead us to consider decision procedures which are also symmetric. That is, there is a group acting on the model which is
isomorphic to a group acting on the action space A which in turn is isomorphic to a group
acting on decision procedures. For instance in testing H : θ = 0 vs K : θ 6= 0 when we
observe X1 , . . . , Xn i.i.d. with density f (x − θ) and f is symmetric about 0 it seems reasonable to restrict consideration to tests δ such that δ(X1 , . . . , Xn ) = δ(−X1 , . . . , −Xn ).
The argument, much elaborated in Section 8.3, is that if H holds for X1 , . . . , Xn it does
so for −X1 , . . . , −Xn as well and if it doesn’t then −X1 , . . . , −Xn are symmetrically
distributed about −θ 6= 0. So, it seems reasonable that X1 , . . . , Xn and −X1 , . . . , −Xn
should lead to the same action. Note, however, that, implicitly, we are postulating a problem of a malevolent nature which might flip the signs on the Xi we see to confuse us. This
suggests via Section 3.3 a connection to the minimax principle which we shall explore.
How do symmetries relate to our initial question? Symmetry of the procedure leads
to critical regions with constant probability of Type I error if θ = 0. If in the problem
we just discussed we specify that f (x) = (1/σ)f0 (x/σ), with f0 known and σ unknown,
then symmetry arguments lead us to require δ(aX1 , . . . , aXn ) = δ(X1 , . . . , Xn ), a 6= 0.
We pursue this argument to show that such tests, which are called invariant with respect to
multiplication, have constant probability of Type I error in σ if θ = 0.
These notions can be extended to requiring properties for estimators δ of θ, such as
δ(X1 + c, . . . , Xn + c) = δ(X1 , . . . , Xn ) + c.
(8.1.1)
Examples of estimators satisfying (8.1.1) are the mean and the median. See Problem 3.5.6.
Such estimates are called equivariant with respect to addition. Such notions turn out to be
closely linked to minimaxity so that we can find minimax procedures by restricting to such
procedures to begin with.
Although these ideas are initially applicable only in restricted classes of models, their
applicability where the X i are i.i.d. N (µ, Σ0 ) with Σ0 known can be used to establish
asymptotic versions of these results for all smooth, regular i.i.d. parametric models.
8.2 Similarity and Completeness
We have seen that in the case of the t test under the N (µ, σ 2 ) model, the t-statistic has
the same t-distribution for all values of σ 2 . In this section, we shall study a key property
that a number of models have which permits such a reduction. This reduction is based
on a set of ideas introduced and developed by Lehmann and Scheffé (1950, 1955) that
have another interesting application we shall discuss. Using these notions, it is possible to
essentially characterize situations in which we can expect a UMVU estimate to exist and
give a constructive method to obtain these procedures. As we shall see, this approach is
far more powerful than that of Section 3.4. We develop the concept of completeness for
a model we denote by P0 . In the testing context, P0 is the class of probabilities specified
by the hypothesis. In the estimation context, it is typically the class P of all probabilities
under consideration.
8.2.1 Testing
We consider the general testing problem in the Neyman–Pearson framework: we observe
X ∈ X, X ∼ P ∈ P, with hypothesis H : P ∈ P0, P0 ⊂ P, and alternative K : P ∉ P0.
Recall that a test φ has level α if EP φ(X) ≤ α for all P ∈ P0 and φ has size α if
sup{EP φ(X) : P ∈ P0 } = α.
Definition 8.2.1. A (randomized) level α test φ is similar(1) level α if EP φ(X) = α for
all P ∈ P0 .
Evidently, a similar test has size α as well as level α. A statistic T(X) is distribution-free if P(T(X) ≤ t) is the same for all P ∈ P0. For such T(X), tests of the form φ(X) = 1(T(X) ∈ A) are similar. Similar tests other than φ(X) ≡ α may not exist, and may exist only for some α (Problem 8.2.19). There is one very important general situation in which they may readily be constructed: when a complete, sufficient statistic exists.
Similarity is a property of a test and a model P0. Completeness is a property of a statistic T(X) and the model obtained when X ∼ P, P ∈ P0. We consider T(X) and the model Q0 = {PT⁻¹ : P ∈ P0}, the set of distributions of T(X) as P ranges over P0.
Definition 8.2.2. The model Q0 = {PT⁻¹ : P ∈ P0} and the statistic T are complete for Q0 and P0, respectively, if, for any function v(T), the system of equations
EQ v(T) = 0 for all Q ∈ Q0,    (8.2.1)
or, equivalently,
EP v(T(X)) = 0 for all P ∈ P0,
implies that
Q[v(T) = 0] = P[v(T(X)) = 0] = 1
for all Q ∈ Q0, respectively, P ∈ P0.
That is, the system (8.2.1) has only the trivial solution v = 0. The main utility of
completeness in testing comes from the following theorem.
Theorem 8.2.1. Let T(X) be a complete, sufficient statistic for P0. Let φ be a randomized test of H. Then, φ is similar level α iff
EP[φ(X)|T(X)] = α    (8.2.2)
with P probability 1 for all P ∈ P0.
Property (8.2.2) is sometimes called Neyman structure.
Proof. Evidently, by the iterated expectation theorem, B.1.20, (8.2.2) implies similarity. On the other hand, suppose that, for P ∈ P0,
EP φ(X) = EP[EP(φ(X)|T(X))] = α    (8.2.3)
where, by sufficiency, EP(φ(X)|T(X)) is a function of T(X) not depending on P, so that the subscript P on EP(φ(X)|T(X)) can be dropped. That is, E(φ(X)|T(X)) is a statistic. Rewrite (8.2.3) as
EP[E(φ(X)|T(X)) − α] = 0    (8.2.4)
for all P ∈ P0. Completeness leads to (8.2.2). ✷
The following simple but useful result follows from (8.2.3).
Proposition 8.2.1. If T is sufficient for P0 and if EP[φ(X)|T] ≤ α for P ∈ P0, then φ is level α.
Completeness and sufficiency are properties of a statistic that push in opposite directions. For instance, the trivial sufficient statistic T (X) = X is not complete unless P0
contains all point masses. On the other hand, if T (X) is a constant, then T (X) is complete
but sufficient only if P0 = {P0 }. To see if the concept of completeness is useful, we need
nontrivial examples of complete, sufficient statistics.
Example 8.2.1. Bernoulli trials. Suppose X1, . . . , Xn are i.i.d. as X where Pθ[X = 1] = θ = 1 − Pθ[X = 0], 0 < θ < 1. Then, T ≡ Σ_{i=1}^n Xi is sufficient for θ. We claim that T is complete. To see this, write (8.2.1) as
Σ_{k=0}^n v(k) C(n, k) θ^k (1 − θ)^{n−k} = 0, 0 < θ < 1,
or, equivalently, with ρ = θ/(1 − θ),
Σ_{k=0}^n v(k) C(n, k) ρ^k = 0    (8.2.5)
for all ρ > 0. But (8.2.5) evidently requires v(k) = 0, 0 ≤ k ≤ n, since any nonzero polynomial of degree at most n has at most n roots. ✷
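The root-counting argument in Example 8.2.1 can also be seen numerically (an illustration only): evaluating (8.2.5) at n + 1 distinct values of ρ gives a linear system in v(0), . . . , v(n) whose matrix has full rank, so v ≡ 0 is the only solution.

```python
import numpy as np
from math import comb

n = 6
rhos = np.linspace(0.5, 3.5, n + 1)      # any n + 1 distinct values of rho > 0

# row r: the left side of (8.2.5) as a linear form in v(0), ..., v(n)
A = np.array([[comb(n, k) * r**k for k in range(n + 1)] for r in rhos])

# full column rank: the only v solving the system is v = 0
print(np.linalg.matrix_rank(A))          # 7, i.e. n + 1
```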
Not surprisingly, Example 8.2.1 is a special case of a general result for exponential
families.
Theorem 8.2.2. Suppose p(x; θ), θ ∈ Θ ⊂ Rk, is an exponential family of rank k with sufficient statistic T(X) ≡ (T1(X), . . . , Tk(X))ᵀ,
p(x; θ) = exp{Σ_{j=1}^k ηj(θ)Tj(x) − A(η(θ))} h(x),
and η(Θ) has a nonempty interior. Then, T(X) is complete as well as sufficient.
Proof. The system of equations (8.2.1) implies that
∫_{Rk} h(t) exp{Σ_{j=1}^k ηj tj − A(η)} v(t) dt = 0
for all η belonging to an open set, or, equivalently,
∫_{Rk} h(t) exp{Σ_{j=1}^k ηj tj} v(t) dt = 0    (8.2.6)
for all η in an open set. Now (8.2.6) implies that h(t)v(t) = 0 by a classical theorem on multivariate Laplace transforms (Widder (1941)). ✷
We gave the proof of this result for the continuous case but it applies equally well in the
discrete case or for any exponential family dominated by some measure µ.
Here are some immediate consequences of Theorem 8.2.2.
(i) If X1, . . . , Xn are i.i.d. N(μ, σ²) with μ, σ² both unknown, then (X̄, σ̂²)ᵀ is complete and sufficient. If σ² is known, X̄ is complete, sufficient. If μ is known, Σ(Xi − μ)² is complete, sufficient.
(ii) If X1, . . . , Xn are i.i.d. Poisson(λ), λ > 0, then Σ_{i=1}^n Xi is complete, sufficient.
Completeness is not limited to exponential families.
Example 8.2.2. Completeness in the U(0, θ) family, θ > 0. If X1, . . . , Xn are i.i.d. U(0, θ), then Mn ≡ maxi{Xi} is sufficient. Note that P(Mn ≤ t) = P(all Xi ≤ t) = (t/θ)^n for all 0 ≤ t < θ, so that (8.2.1) becomes
∫_0^θ v(t) n (t/θ)^{n−1} (1/θ) dt = nθ^{−n} ∫_0^θ v(t) t^{n−1} dt = 0
or, equivalently, ∫_0^θ v(t)t^{n−1} dt = 0 for all θ > 0. Differentiating with respect to θ we obtain v(θ)θ^{n−1} = 0 =⇒ v(θ) = 0, and completeness is proved. ✷
Example 8.2.3. Completeness of the order statistics. Here is an example which will prove important for both testing and estimation. Let X1, . . . , Xn be i.i.d. f ∈ F where F is the set of all continuous case densities. Then, X(1) < . . . < X(n), the order statistics, are sufficient (Problem 1.5.8). We next argue that they are complete. Now, v(x1, . . . , xn) defined for x1 < x2 < · · · < xn can be extended to all of Rn by symmetry: v(x_{i1}, . . . , x_{in}) ≡ v(x1, . . . , xn) for all permutations (i1, . . . , in). So (8.2.1) becomes
∫ · · · ∫ v(x1, . . . , xn) f(x1) · · · f(xn) dx1 · · · dxn = 0    (8.2.7)
for all f ∈ F. Let g1, . . . , gn be arbitrary densities in F, so that Σ_{j=1}^n αj gj ∈ F if αj ≥ 0 for all j and Σ_{j=1}^n αj = 1. Then (8.2.7) implies
G(α1, . . . , αn, g1, . . . , gn) ≡ ∫ · · · ∫ v(x1, . . . , xn) Π_{i=1}^n (Σ_{j=1}^n αj gj(xi)) dxi = 0
for all α, (g1, . . . , gn)ᵀ as above. Write αj = wj / Σ_{k=1}^n wk where wk ≥ 0, 1 ≤ k ≤ n. Simplifying further,
G(w1, . . . , wn, g1, . . . , gn) = ∫ · · · ∫ v(x1, . . . , xn) Π_{i=1}^n (Σ_{j=1}^n wj gj(xi)) dxi = 0.    (8.2.8)
The coefficient of Π_{j=1}^n wj is
∂ⁿG/∂w1 · · · ∂wn (w1, . . . , wn, g1, . . . , gn) |_{w=0} = ∫ · · · ∫ v(x1, . . . , xn) Π_{i=1}^n gi(xi) dxi.    (8.2.9)
Put gj(x) = 1(x ∈ Aj)/|Aj|, the uniform density on Aj, where |A| is the length of A and A1, . . . , An are disjoint arbitrary intervals. Then (8.2.9) becomes
∫_{A1} · · · ∫_{An} v(x1, . . . , xn) dx1 · · · dxn = 0    (8.2.10)
for all A1, . . . , An, and (8.2.10) implies that v = 0. ✷
The order statistics are also complete for the case where Xi ∼ F with F continuous.
See Bell, Blackwell and Breiman (1960).
Let us now apply completeness to obtain distribution-free tests in a number of important
examples.
Testing hypotheses in k parameter exponential families
Suppose X ∼ {Pθ : θ ∈ Θ ⊂ Rk}, a canonical k parameter exponential family, with k ≥ 2 and Θ open, with density function
p(x; θ) = exp{Σ_{j=1}^k θj Tj(x) − A(θ)} h(x).
We want to test H : θ1 = θ10, where θ10 is a specified value; that is, θ ∈ Θ0 = {θ : θ1 = θ10, θ ∈ Θ}. Thus, H is Pθ ∈ P0 where P0 is the k − 1 parameter exponential family with densities
q(x; θ2, . . . , θk) = exp{Σ_{j=2}^k θj Tj(x) − B(θ2, . . . , θk)} h0(x)
where h0(x) = exp[θ10 T1(x)] h(x). By Theorem 8.2.2, (T2(X), . . . , Tk(X))ᵀ is complete and sufficient for P0. Thus, if φ(T), where T = (T1(X), . . . , Tk(X))ᵀ, is any test such that Eθ φ(T) = α, θ ∈ Θ0, we must have
Eθ[φ(T) | (T2, . . . , Tk)] = α.    (8.2.11)
Computing the conditional expectation in (8.2.11) is straightforward since the conditional distribution of T given Tj = tj, j ≥ 2, is determined by that of T1 given Tj = tj, j ≥ 2. By the k dimensional version of Theorem 1.6.1, (T1, . . . , Tk) has density of the form
p(t1, . . . , tk; θ) = exp{θᵀt − A1(θ)} h1(t).
Here (Problem 8.2.20), the conditional distribution of T1 given T[2,k] ≡ (T2, . . . , Tk)ᵀ has density of the form
p(t) = exp{θ10 t − B0(θ10, t2, . . . , tk)} h0(t2, . . . , tk)    (8.2.12)
so that (8.2.11) is just
∫ φ(t, t2, . . . , tk) p(t) dt = α.    (8.2.13)
The usual analogous formulae apply for the discrete case.
Conversely, suppose we start with any test statistic S(T) and
φ(T) = 1 if S(T) > s
= 0 if S(T) < s .
In general, we cannot choose s so that Eθ φ(T) ≤ α for all θ ∈ Θ0. However, consider
the randomized test,
φ∗ (T) = 1,
= γ,
= 0,
S(T) > s(T2 , . . . , Tk )
S(T) = s(T2 , . . . , Tk )
S(T) < s(T2 , . . . , Tk ) ,
where s(t2 , . . . , tk ) and γ(t2 , . . . , tk ) are determined by
Eθ {φ∗ (T)|T2 , . . . , Tk } = α
under H. This is a useful test since the conditional distribution of S(T) given T2 = t2 , . . .,
Tk = tk does not depend on θ2 , . . . , θk and it has level α by Theorem 8.2.1. By Proposition
8.2.1, if we are satisfied with Eθ φ(T ) ≤ α for all θ ∈ Θ0 , then we can dispense with
randomization in φ∗ (T ). In the continuous case φ∗ (T ) is similar without randomization.
Note that φ∗ (T) may not be the same as the test φ(T) we started with, but asymptotically
they are usually equivalent to first order.
Remark 8.2.1. We can start with any statistic S(T) to construct a similar test φ∗(T). What
makes the test we have just obtained appropriate? The test φ∗ (T ) is “aimed” at one-sided
alternatives. We shall see later in this section that the tests we have introduced do have
optimality properties.
✷
Example 8.2.4. Another look at testing location and scale for the Gaussian distribution. Suppose X1, . . . , Xn are i.i.d. N(μ, σ²). This is an exponential family with T1(X) = Σ_{i=1}^n Xi, T2(X) = Σ_{i=1}^n Xi², θ1 = μ/σ², θ2 = −1/2σ². We seek a test with guaranteed level α of H : μ = 0 vs K : μ > 0. This is the appropriate test if the Xi are case control differences. If σ² = σ0² is known, the UMP test is
φ(T1) = 1 iff Σ_{i=1}^n Xi ≥ √n σ0 Φ⁻¹(1 − α).
If we are wrong about \sigma_0, this test can have arbitrarily large level. We can apply the
principles of this section to conclude that the test

\phi(T_1, T_2) = 1 \text{ iff } T_1 \ge c(T_2, \alpha)

with T_1 = \sum X_i and

P_\theta[T_1 \ge c(t_2, \alpha) \mid T_2 = t_2] = \alpha    (8.2.14)
for all t_2 has Neyman structure. In our case, the distribution of T_1 \mid T_2 = t_2 can be determined from

(T_1, T_2) = \Big( \sum X_i,\ \sum (X_i - \bar X)^2 + \frac{(\sum X_i)^2}{n} \Big)

where, by Theorem B.3.3, \bar X = n^{-1} \sum X_i and \sum (X_i - \bar X)^2 are independent,
distributed as N(0, \sigma^2/n) and \sigma^2 \chi^2_{n-1}, respectively. Here V \sim \sigma^2 \chi^2
means that V/\sigma^2 \sim \chi^2. Consider the t-statistic

t = \frac{\sqrt{n}\, \bar X}{s} = \frac{n^{-1/2} T_1}{\{ (n-1)^{-1} [T_2 - T_1^2/n] \}^{1/2}} .
For fixed T_2 = t_2, t is an increasing function of T_1. Thus for each s(t_2) and T_2 = t_2,
T_1 \ge s(t_2) is equivalent to t \ge d(t_2) for some d(t_2), and it is enough to find d(t_2) such
that

P[T_1 \ge s(t_2) \mid T_2 = t_2] = P_H[t \ge d(t_2) \mid T_2 = t_2] = \alpha .    (8.2.15)

Here the distribution of t does not depend on \sigma; thus, if b_\alpha is such that

P_H(t \ge b_\alpha) = \alpha

then by the Neyman structure result (8.2.2), for all \sigma > 0,

P_H[t \ge b_\alpha \mid T_2 = t_2] = \alpha    (8.2.16)

and we can determine s(t_2) from (8.2.15) and (8.2.16). Note that (8.2.16) holding for all
\alpha \in (0, 1) is equivalent to independence of t and T_2 under H. This example illustrates
the computation of the conditional critical value s(t_2), and it shows connections to earlier
normal model procedures. In practice, we would carry out the test as in Vol. I using the
t-distribution of t.
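The monotonicity used above is easy to confirm numerically: for fixed T_2, the map T_1 \mapsto t is strictly increasing. The sketch below is my own illustration (the grid and the values of n and T_2 are arbitrary, not from the text):

```python
import math

def t_from_T(T1, T2, n):
    # t = (T1 / sqrt(n)) / sqrt((T2 - T1^2/n) / (n - 1))
    s2 = (T2 - T1 ** 2 / n) / (n - 1)
    return (T1 / math.sqrt(n)) / math.sqrt(s2)

n, T2 = 10, 25.0
# A grid of T1 values compatible with T2 (need T1^2 / n < T2).
grid = [-14 + 0.5 * j for j in range(57)]  # T1 in [-14, 14]; 14^2/10 = 19.6 < 25
tvals = [t_from_T(T1, T2, n) for T1 in grid]
increasing = all(a < b for a, b in zip(tvals, tvals[1:]))
```

Writing u = T_1/\sqrt{n T_2}, one has t = \sqrt{n-1}\, u/\sqrt{1-u^2}, which is strictly increasing in u; the grid check confirms this.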
We can extend the argument leading to (8.2.13) for testing the hypothesis
H : \theta_1 = \theta_{10} to H : \theta_S = \theta_S^0 where \theta_S \equiv \{\theta_i : i \in S\}, S \subset \{1, \dots, k\}. Here is a simple
application.
Example 8.2.5. Gaussian goodness-of-fit. Suppose we wish to test H : F \in \{N(\mu, \sigma^2) :
\mu \in R, \sigma^2 > 0\} versus a particular alternative with specified density f which is not
Gaussian. If we knew \mu = \mu_0 and \sigma^2 = \sigma_0^2, the Neyman-Pearson Lemma implies that the
most powerful statistic would be

S(X) = \sum_{i=1}^n \log f(X_i) + \frac{1}{2\sigma_0^2} \sum X_i^2 - \frac{\mu_0}{\sigma_0^2} \sum X_i + \frac{n \mu_0^2}{2\sigma_0^2} .
Using our \phi^*(T) construction, the similar size \alpha test based on S does not depend on \mu_0 and
\sigma_0^2. It is just, if n \ge 3,

\phi^*(X) = 1 \text{ iff } \sum_{i=1}^n \log f(X_i) \ge c\Big(\bar X, \sum_{i=1}^n (X_i - \bar X)^2\Big)
          = 0 \text{ otherwise}

where

P_H\Big[ \sum_{i=1}^n \log f(X_i) \ge c\Big(\bar X, \sum_{i=1}^n (X_i - \bar X)^2\Big) \,\Big|\, \sum X_i, \sum X_i^2 \Big] = \alpha .
is uniformly distributed on the surface
Since, given X̄ and i=1 (Xi −X̄) , (X1 , . . . , Xn )TP
of the n sphere with center (X̄, . . . , X̄) and radius ni=1 (Xi − X̄)2 , computing involves
evaluating the area of the intersection of the surface of the sphere with
)
(
n
X
T
log f (xi ) ≥ c
(x1 , . . . , xn ) :
i=1
which is, unfortunately, not trivial in general.
Pn
Note that unlike the test “Reject iff i=1 log f (Xi ) is large,” φ∗ (X) has probability of
Type I error α whatever be µ and σ 2 .
✷
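The conditional critical value can nevertheless be approximated by Monte Carlo: sample uniformly on the sphere by centering i.i.d. standard normals and rescaling to the right radius. The sketch below is my own illustration; the function names, the standard Cauchy choice of f, and the data are assumptions, not from the text:

```python
import math
import random

rng = random.Random(0)

def log_f(x):
    # log of the standard Cauchy density, a stand-in choice for the alternative f
    return -math.log(math.pi * (1.0 + x * x))

def sphere_resample(xbar, ss, n, rng):
    # Uniform draw from {x : mean(x) = xbar, sum((x_i - mean(x))^2) = ss}:
    # center i.i.d. normals, rescale to radius sqrt(ss), then shift by xbar.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zbar = sum(z) / n
    w = [zi - zbar for zi in z]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [xbar + math.sqrt(ss) * wi / norm for wi in w]

x = [0.3, -1.2, 0.8, 2.1, -0.5, 0.9, -1.7, 0.4]  # hypothetical data
n = len(x)
xbar = sum(x) / n
ss = sum((xi - xbar) ** 2 for xi in x)
obs = sum(log_f(xi) for xi in x)

B = 2000
count = sum(
    1 for _ in range(B)
    if sum(log_f(v) for v in sphere_resample(xbar, ss, n, rng)) >= obs
)
p_cond = (count + 1) / (B + 1)  # conditional Monte Carlo p-value
```

Each resample exactly preserves the conditioning statistics (\bar X, \sum(X_i - \bar X)^2), so the fraction of resampled statistics above the observed one estimates the conditional p-value.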
Here is an important and typical example where we cannot, in general, obtain similarity
without randomization but can guarantee level \alpha.
Example 8.2.6. Independence in contingency tables. Suppose (X_i, Y_i), 1 \le i \le n are,
as in Section 6.4, i.i.d. with X_i taking values x_1, \dots, x_r and Y_i values y_1, \dots, y_s, with
arbitrary probabilities \theta_{ab} = P[X = x_a, Y = y_b], 0 < \theta_{ab} < 1 for all a, b. We
can immediately reduce by sufficiency to N = \{N_{ab} : 1 \le a \le r, 1 \le b \le s\} with
N_{ab} \equiv \sum 1(X_i = x_a, Y_i = y_b), which has a multinomial M(n, \{\theta_{ab}\}) distribution. The
hypothesis of independence of X and Y is H : \theta_{ab} = \theta_{a+} \theta_{+b} for all a, b, where + indicates
summation over the index. The Wald test of Section 6.3.2 for this hypothesis is based on
the Pearson \chi^2 statistic,

\chi^2 = \sum_{a=1}^r \sum_{b=1}^s \frac{n \big( N_{ab} - \frac{N_{a+} N_{+b}}{n} \big)^2}{N_{a+} N_{+b}} ,
see (6.4.9). Note that we can write the density function of N under the hypothesis as

p(N; \eta) = \exp\Big\{ \sum_{a,b} N_{ab} (\eta_{a1} + \eta_{b2}) \Big\}

where \eta_{a1} = \log \theta_{a+}, \eta_{b2} = \log \theta_{+b}, an exponential family in canonical form with canonical
sufficient statistics N_{a+} \equiv \sum_{b=1}^s N_{ab}, 1 \le a \le r, and N_{+b} = \sum_{a=1}^r N_{ab}, 1 \le b \le s.
Let N_{r+} = (N_{1+}, \dots, N_{r+}) and N_{+s} = (N_{+1}, \dots, N_{+s}). Consider the test which rejects H iff

\chi^2 \ge c(N_{r+}, N_{+s}),

and let c be chosen so that

P_H[\chi^2 \ge c(n_{r+}, n_{+s}) \mid N_{r+} = n_{r+}, N_{+s} = n_{+s}] \le \alpha

and thus the test has level \alpha under H. Note that (see Problem 8.2.23), given N_{a+} = n_{a+},
N_{+b} = n_{+b} for all a, b,

\chi^2 = n \Big\{ \sum_{a,b} \frac{N_{ab}^2}{n_{a+} n_{+b}} - 1 \Big\} .    (8.2.17)
To implement the test, we need to compute, under H, the conditional distribution of
\{N_{ab}\} given N_{r+} = n_{r+}, N_{+s} = n_{+s}, which is multiple hypergeometric. In fact, computing
the conditional null distribution of this statistic is computationally difficult if r and
s are large. Exact results for low values of r and s are given in the package StatXact.
More generally, Markov chain Monte Carlo methods such as that of Diaconis and Sturmfels
(1998) can be used.
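Identity (8.2.17) can be checked numerically in a few lines. The sketch below is my own illustration on an arbitrary table (not from the text), comparing the usual Pearson form with the fixed-margin form:

```python
# Verify chi^2 = n * (sum_ab N_ab^2 / (n_a+ n_+b) - 1) when margins are fixed.
table = [[12, 7, 5], [8, 15, 3]]  # an arbitrary 2x3 table of counts
r, s = len(table), len(table[0])
n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(table[a][b] for a in range(r)) for b in range(s)]

chi2_pearson = sum(
    (table[a][b] - row_tot[a] * col_tot[b] / n) ** 2 / (row_tot[a] * col_tot[b] / n)
    for a in range(r) for b in range(s)
)
chi2_margin_form = n * (sum(
    table[a][b] ** 2 / (row_tot[a] * col_tot[b])
    for a in range(r) for b in range(s)
) - 1.0)
```

The two expressions agree exactly, as the cross terms cancel once the margins are held fixed.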
A particularly interesting example is Fisher’s exact test, where r = s = 2. We note that
the conditional χ2 test given N1+ = n1+ , N+1 = n+1 can be based on the distribution
of N11 given N1+ , N+1 which, under H, is hypergeometric (n, n1+ , n+1 ). The one-sided
version of the test, rejecting H for N11 large, has optimality properties. See Lehmann and
Romano (2005).
✷
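For r = s = 2 the conditional test can be carried out exactly, since N_{11} given the margins is hypergeometric under H. The sketch below is my own stdlib implementation (the function name and the tea-tasting numbers are illustrative, not from the text):

```python
from math import comb

def fisher_one_sided_p(n11, n1p, np1, n):
    """P[N11 >= n11 | margins] under independence (hypergeometric tail)."""
    lo = max(0, n1p + np1 - n)   # smallest feasible N11 given the margins
    hi = min(n1p, np1)           # largest feasible N11
    start = max(n11, lo)
    return sum(
        comb(n1p, k) * comb(n - n1p, np1 - k) for k in range(start, hi + 1)
    ) / comb(n, np1)

# Fisher's tea-tasting layout: n = 8, both margins 4; all four cups
# identified correctly corresponds to N11 = 4.
p = fisher_one_sided_p(4, 4, 4, 8)
```

Here p = 1/C(8,4) = 1/70, so the observed agreement is significant at level 0.05 but not 0.01.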
Example 8.2.7. Two-sample permutation tests. Consider the nonparametric two-sample
problem. We observe X1 , . . . , Xm i.i.d. F and Y1 , . . . , Yn i.i.d. G with the X’s independent of the Y ’s, and want to test H : F = G against the alternative that G tends to give
higher values than F , which can be formally expressed by K : Ḡ(x) ≥ F̄ (x) for all x,
which we refer to as G is stochastically larger than F — see Problems 1.1.4 and 8.3.11(d).
If F and G are N(\mu, \sigma^2) under H and N(\mu, \sigma^2), N(\mu + \Delta, \sigma^2) under K, we have the Gaussian
two-sample problem discussed in Section 4.9.5. The test for that parametric model is
to reject iff

T \equiv \sqrt{\frac{mn}{m+n}} \, \frac{\bar Y - \bar X}{s} \ge t_{m+n-2}(1 - \alpha)

where s^2 = (m + n - 2)^{-1} \big[ \sum_{i=1}^m (X_i - \hat\mu)^2 + \sum_{i=1}^n (Y_i - \hat\mu)^2 \big] and \hat\mu = \lambda \bar X + (1 - \lambda) \bar Y,
λ ≡ m/(m + n). This test is not similar for general H : F = G. What is the similar
version for this nonparametric hypothesis when F and G are continuous? Let Z1 , . . . , ZN
be X1 , . . . , Xm , Y1 , . . . Yn , N = m + n, and let Z(1) , . . . , Z(N ) be the ordered values.
Since (Z(1) , . . . , Z(N ) ) is a complete sufficient statistic under H by Example 8.2.3, the
similar test is

\psi(X, Y) = 1    if T > c(Z_{(1)}, \dots, Z_{(N)})
           = \gamma(Z_{(1)}, \dots, Z_{(N)})    if T = c(Z_{(1)}, \dots, Z_{(N)})
           = 0    otherwise

where c and \gamma are chosen so that

E[\psi(X, Y) \mid Z_{(1)}, \dots, Z_{(N)}] = \alpha .
We simplify by noting that

\sum_{i=1}^m X_i + \sum_{i=1}^n Y_i = \sum_{i=1}^N Z_{(i)}, \qquad \sum_{i=1}^m X_i^2 + \sum_{i=1}^n Y_i^2 = \sum_{i=1}^N Z_{(i)}^2 .

Then

T = \sqrt{\frac{mn}{N}} \Big[ \frac{1}{n} \sum_{i=1}^n Y_i - \frac{1}{m} \Big( \sum_{i=1}^N Z_{(i)} - \sum_{i=1}^n Y_i \Big) \Big] \Big\{ \frac{1}{m+n-2} \Big[ \sum_{i=1}^N Z_{(i)}^2 - \frac{1}{N} \Big( \sum_{i=1}^N Z_{(i)} \Big)^2 \Big] \Big\}^{-1/2} ,

a monotone function of \sum_{i=1}^n Y_i given Z_{(1)}, \dots, Z_{(N)} and hence given \sum_{i=1}^N Z_{(i)} and
\sum_{i=1}^N Z_{(i)}^2. Our test is therefore equivalent to
\psi(X, Y) = 1    if \sum_{i=1}^n Y_i > c(Z_{(1)}, \dots, Z_{(N)})
           = \gamma(Z_{(1)}, \dots, Z_{(N)})    if \sum_{i=1}^n Y_i = c(Z_{(1)}, \dots, Z_{(N)})
           = 0    otherwise .
How do we find c? Note that, given (Z_{(1)}, \dots, Z_{(N)}), the pooled values are known, but not
which n of them are the Y's. Each assignment is equally likely under H; thus, the conditional
distribution of Y_1, \dots, Y_n given (Z_{(1)}, \dots, Z_{(N)}) is

P[Y_j = Z_{(i_j)}, 1 \le j \le n] = \binom{N}{n}^{-1} .
The test is thus equivalent to

(i) Form s_{(1)} < s_{(2)} < \dots < s_{(\binom{N}{n})}, where the s_{(k)} are the ordered values of s = \sum_{j=1}^n Z_{(i_j)} as
(i_1, \dots, i_n) ranges over distinct permutations.

(ii) "Reject H if \sum_{j=1}^n Y_j is among the top [\binom{N}{n} \alpha] + 1 of the s_{(k)}, and reject with
probability \gamma(Z_{(1)}, \dots, Z_{(N)}) if \sum_{i=1}^n Y_i is the [\binom{N}{n} \alpha]th s_{(k)}."

This test is distribution-free in the sense that it does not require knowledge of the distribution
F = G under H. It is the so-called size \alpha permutation t-test introduced by R.A.
Fisher. In practice, randomization is not used and one settles for a level \alpha test with size
\le \alpha.
The main downside with permutation tests is that the calculation and ordering of the s_{(k)}
become prohibitive for m, n large. However, there is a simple alternative, the Monte Carlo
permutation test: Resample Z^*_{1b}, \dots, Z^*_{nb}, 1 \le b \le B, from \{Z_{(1)}, \dots, Z_{(N)}\} with equal
probability

P[Z^*_{11} = z_{(i_1)}, \dots, Z^*_{n1} = z_{(i_n)} \mid Z_{(i)} = z_{(i)}, 1 \le i \le N] = \binom{N}{n}^{-1} .

That is, draw n times without replacement from \{Z_{(1)}, \dots, Z_{(N)}\} to get (Z^*_{11}, \dots, Z^*_{n1}),
then repeat B times. Order the resulting B sums \sum_{i=1}^n Z^*_{ib}, 1 \le b \le B, and reject if
\sum_{i=1}^n Y_i is among the top [B\alpha] + 1. This test has level \alpha.
✷
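The Monte Carlo permutation test takes only a few lines. The sketch below is my own implementation (the function name, seed, and data are hypothetical), using the add-one counting rule corresponding to the top [B\alpha] + 1 prescription:

```python
import random

def mc_permutation_test(x, y, B=999, seed=1):
    """Monte Carlo two-sample permutation test based on sum(y); returns a p-value."""
    z = list(x) + list(y)
    n = len(y)
    obs = sum(y)
    rng = random.Random(seed)
    # Each draw is n values sampled without replacement from the pooled sample.
    count = sum(1 for _ in range(B) if sum(rng.sample(z, n)) >= obs)
    return (count + 1) / (B + 1)

x = [0.2, -0.4, 0.1, 0.5, -0.3]       # hypothetical control sample
y = [2.1, 1.8, 2.6, 1.9, 2.4]         # hypothetical sample, clearly shifted up
p_shift = mc_permutation_test(x, y)
p_null = mc_permutation_test(y, y)    # same distribution: p should not be small
```

With a clear shift, p_shift is near 1/(B+1); with identical samples, the p-value is moderate, as it should be under H.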
Basu’s theorem
We noted that in Example 8.2.4, we were able to eliminate the need for computing a
conditional distribution since the t-statistic and \sum_{i=1}^n X_i^2 are independent under H. The
existence of such statistics is closely linked to ancillarity.
Definition 8.2.3. A statistic S(X) is said to be ancillary for P = \{P_\theta : \theta \in \Theta\} if the law \mathcal{L}_\theta(S(X))
is the same for all \theta \in \Theta.
Theorem 8.2.3. (Basu) Suppose T (X) is sufficient and complete for P. If S(X) is ancillary for P, then T (X) and S(X) are statistically independent for all P ∈ P. If the support
of P is the same for all P ∈ P, then a converse holds: If T (X) and S(X) are independent
for all P , then S(X) is ancillary.
Proof. For any set A in the range of S write
P [S(X) ∈ A] = Eθ P [S(X) ∈ A|T ] .
(8.2.18)
Note that we need not put a θ subscript on the left hand side since S is ancillary, nor on the
inside term of the right hand side since T is sufficient. Now (8.2.18) becomes
Eθ P [S(X) ∈ A|T ] − P [S(X) ∈ A] = 0
for all θ. By completeness P [S(X) ∈ A|T ] = P [S(X) ∈ A] and the independence of S
and T follows. We leave the converse for Problem 8.2.21.
✷
Example 8.2.4 (Continued). Here T_2 = \sum X_i^2 is complete and sufficient, and t = \sqrt{n} \bar X / s is
ancillary under H : \mu = 0. Thus T_2 and t are independent under H.
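This independence can be watched in simulation (my own illustration; the sample size and replication count are arbitrary): the Monte Carlo correlation between t and T_2 under H is near zero.

```python
import math
import random

rng = random.Random(7)

def t_and_T2(n, sigma, rng):
    # Simulate X_1, ..., X_n i.i.d. N(0, sigma^2) and return (t-statistic, sum of squares).
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return math.sqrt(n) * xbar / math.sqrt(s2), sum(xi * xi for xi in x)

n, reps = 10, 4000
pairs = [t_and_T2(n, 1.0, rng) for _ in range(reps)]
ts = [p[0] for p in pairs]
T2s = [p[1] for p in pairs]

def corr(u, v):
    m = len(u)
    ubar, vbar = sum(u) / m, sum(v) / m
    cov = sum((a - ubar) * (b - vbar) for a, b in zip(u, v)) / m
    su = math.sqrt(sum((a - ubar) ** 2 for a in u) / m)
    sv = math.sqrt(sum((b - vbar) ** 2 for b in v) / m)
    return cov / (su * sv)

rho = corr(ts, T2s)
```

Basu's theorem asserts full independence, not just zero correlation; the correlation is simply the easiest symptom to check numerically.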
Example 8.2.7 (Continued). Rank tests. Let R_i be the rank of Y_i among Z_{(1)}, \dots, Z_{(N)},
that is, Y_i = Z_{(R_i)}. Then R = (R_1, \dots, R_n) is, under H, necessarily uniform on all
combinations \{r\} of size n from \{1, \dots, N\}. That is, P(R = r) = 1/\binom{N}{n}. So R is
ancillary and tests based on it are distribution-free. An example of such a test is the two-sample
Wilcoxon test, which is based on the statistic W = \sum_{i=1}^n R_i, or equivalently, U =
\sum_{j=1}^n \sum_{i=1}^m 1(X_i < Y_j). Most software provides p-values for the Wilcoxon test when
\min\{m, n\} \le 10. For \min\{m, n\} \ge 10, (U - \frac{1}{2} mn)/\sigma_W is close to N(0, 1), where
\sigma_W^2 = mn(m + n + 1)/12.
✷
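The rank statistics and the normal approximation are easy to compute. The sketch below is my own implementation for tie-free continuous data (the function name and data are hypothetical), and it also exhibits the identity U = W - n(n+1)/2:

```python
import math

def wilcoxon_stats(x, y):
    """Rank-sum W of the y's in the pooled sample, Mann-Whitney U, and the
    normal-approximation z-score for the one-sided test (assumes no ties)."""
    m, n = len(x), len(y)
    z = sorted(x + y)
    W = sum(z.index(yi) + 1 for yi in y)   # rank of each y in the pooled sample
    U = sum(1 for xi in x for yi in y if xi < yi)
    mean_U = m * n / 2.0
    sd_U = math.sqrt(m * n * (m + n + 1) / 12.0)
    return W, U, (U - mean_U) / sd_U

x = [1.1, 2.3, 0.7, 3.5, 1.9, 2.8, 0.2, 1.4, 3.1, 2.0, 0.9, 2.6]  # m = 12
y = [2.2, 4.1, 3.3, 1.6, 3.9, 2.9, 4.4, 3.0, 1.8, 3.6]            # n = 10
W, U, z_score = wilcoxon_stats(x, y)
```

Since each Y's rank counts the X's below it plus its position among the Y's, W = U + n(n+1)/2 whenever there are no ties.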
8.2.2 Testing Optimality Theory
We begin with
Definition 8.2.4. A test φ for H : θ ∈ Θ0 vs. K : θ ∈ Θ1 is unbiased level α iff
Eθ φ(X) ≤ α, θ ∈ Θ0 , and Eθ φ(X) ≥ α, θ ∈ Θ1 .
Thus an unbiased test does at least as well as the trivial test φ ≡ α at all θ. As with
unbiasedness of estimates, being biased in testing is not necessarily bad. For instance, in
testing H : \mu = 0 vs K : \mu \ne 0 on the basis of n observations from N(\mu, 1), the one-sided
tests, "Reject iff \sqrt{n} \bar X \ge z(1 - \alpha)" and "Reject iff \sqrt{n} \bar X \le z(\alpha)," are both biased yet
reasonable. Nevertheless, unbiasedness often leads to reasonable procedures. Recall that
UMP stands for “uniformly most powerful.”
Definition 8.2.5. The test \phi^* is a UMP unbiased level \alpha test if it is unbiased level \alpha and
E_\theta \phi^*(X) \ge E_\theta \phi(X) for all \theta \in \Theta_1 whenever \phi is also unbiased level \alpha.
Let \omega \equiv \partial\Theta_0 \cap \partial\Theta_1 denote the intersection of the boundaries of \Theta_0 and \Theta_1. We assume
that \omega is not empty. Note that if \phi is unbiased level \alpha and \theta \to E_\theta \phi(X) is continuous, then
\phi is similar on \omega, that is, E_\theta \phi(X) = \alpha for \theta \in \omega.
Lemma 8.2.1 (Lehmann). If \theta \to E_\theta \phi(X) is continuous for each test \phi, and if \phi^* is level
\alpha and UMP among the class of tests \phi satisfying E_\theta \phi(X) = \alpha for \theta \in \omega, then \phi^* is UMP
unbiased level \alpha.
Proof. The class of tests that satisfies Eθ φ(X) = α for θ ∈ ω contains the class of tests
that are unbiased level α.
✷
Remark 8.2.2. If P = {Pθ : θ ∈ Θ}, Θ ⊂ Rd is an exponential family of distributions,
then θ → Eθ φ(X) is continuous, for every test function φ.
The connection between unbiasedness and completeness comes via
Theorem 8.2.4. Suppose \theta \to E_\theta \phi(X) is continuous for all \phi and T(X) is sufficient and
complete for P_0 = \{P_\theta : \theta \in \omega\}. Suppose \theta_1 \in \Theta_1. Then the UMP unbiased level \alpha test
of H : \theta \in \Theta_0 vs K : \theta = \theta_1 is

\phi^*(X) = 1             if p_{\theta_1}(X \mid T(X)) > c(T(X), \theta_1)
          = \gamma(T)     if p_{\theta_1}(X \mid T(X)) = c(T(X), \theta_1)
          = 0             if p_{\theta_1}(X \mid T(X)) < c(T(X), \theta_1)

with c, \gamma determined by E_\theta[\phi^*(X) \mid T(X)] = \alpha, \theta \in \omega.
Proof. Optimality among all unbiased level \alpha tests follows from:

(i) Every unbiased level \alpha test is similar on \omega = \partial\Theta_0 \cap \partial\Theta_1.

(ii) Every test similar on \omega has Neyman structure on \omega.

(iii) The conditional power, E_{\theta_1}[\phi(X) \mid T(X) = t], is maximized for each t by \phi^*. Here
the conditional probability defining the Neyman-Pearson likelihood ratio is

p_{\theta_1}(x \mid T(X) = t) = \frac{P_{\theta_1}[X = x, T(X) = t]}{P_{\theta_1}[T(X) = t]} = \frac{p(x; \theta_1) \, 1[T(x) = t]}{P_{\theta_1}[T(X) = t]} .
By the iterated expectation theorem, (B.1.20), φ∗ is also unconditionally most powerful.
✷
A simple application of this result is
Theorem 8.2.5. Suppose X is distributed according to P \in P, a canonical exponential
family with \log p(x, \theta) = \theta^T T(x) - A(\theta) + b(x). Write T = (T_1, T_2^T)^T and \theta =
(\theta_1, \theta_2^T)^T. Then, the test

\phi^*(x) = 1                 if T_1(x) > c(T_2(x))    (8.2.19)
          = \gamma(T_2(x))    if T_1(x) = c(T_2(x))
          = 0                 otherwise

is UMP unbiased level \alpha for H : \theta_1 \le \theta_{10} vs K : \theta_1 > \theta_{10}, where c and \gamma are determined
by E_\theta[\phi^*(X) \mid T_2(X)] = \alpha, \theta \in \omega. Similarly, 1 - \phi^* is UMP unbiased for H : \theta_1 \ge \theta_{10} vs
K : \theta_1 < \theta_{10}.
Proof. The family of conditional distributions of X given T_2(X) = t is a one-parameter
exponential family of the form

q(x; t) = \exp\{\theta_1 T_1(x) - A(\theta_1, t)\} h(x, t) .

Thus, we can apply Theorem 8.2.4 to conclude that the test (8.2.19) is most powerful among
all unbiased level \alpha tests. The test \phi^* is UMP for K : \theta_1 > \theta_{10} because it does not depend
on \theta_1 .
Remark 8.2.3. It is possible to extend this theory to show that procedures such as the
two-sided t-test are UMP unbiased. See Lehmann and Romano (2005).
Example 8.2.6 (Continued). 2 \times 2 contingency tables. Suppose that in Example 8.2.6
we have r = s = 2 and we are interested in testing H vs the one-sided dependence alternative
K : \theta_{11} > \theta_{1+} \theta_{+1}. Write

p = \theta_{11}^{N_{11}} \theta_{10}^{N_{1+} - N_{11}} \theta_{01}^{N_{+1} - N_{11}} \theta_{00}^{n - N_{1+} - N_{+1} + N_{11}}
  = \gamma^{N_{11}} \lambda^{N_{1+}} \eta^{N_{+1}} \theta_{00}^n

where

\gamma = \frac{\theta_{11} \theta_{00}}{\theta_{01} \theta_{10}} = \frac{P[Y = 1 \mid X = 1] \, P[Y = 0 \mid X = 0]}{P[Y = 1 \mid X = 0] \, P[Y = 0 \mid X = 1]}    (8.2.20)

and \lambda = \theta_{10}/\theta_{00} and \eta = \theta_{01}/\theta_{00} are defined by matching exponents in (8.2.20). Note that
\gamma > 1 \iff \theta_{11} > \theta_{1+} \theta_{+1}, and \gamma = 1 corresponds to independence. Therefore, by Theorem 8.2.4, the
test which rejects H if N_{11} > c(N_{1+}, N_{+1}), randomizes on the boundary, and accepts otherwise
is UMP unbiased. By Remark 8.2.3, there is also a two-sided UMP unbiased test \phi^*
of the form: Reject if N_{11} > c_2(N_{1+}, N_{+1}) or N_{11} < c_1(N_{1+}, N_{+1}), with randomization
\gamma_1, \gamma_2 at both c_1 and c_2, determined by
(i) E_H[\phi^*(N) \mid N_{1+}, N_{+1}] = \alpha and

(ii) E_H[N_{11} \phi^*(N) \mid N_{1+}, N_{+1}] = \alpha E_H[N_{11} \mid N_{1+}, N_{+1}] = \alpha \frac{N_{1+} N_{+1}}{n} .

The condition (ii) comes from the requirement that the conditional power function of the
test has derivative 0 at the null value \gamma = 1, which is necessary for unbiasedness. Unfortunately, this test
doesn't coincide with the conditional \chi^2 test discussed earlier nor with the naive two-tailed
test: Reject iff N_{11} < c_1^*(N_{1+}, N_{+1}) or N_{11} > c_2^*(N_{1+}, N_{+1}), with randomization on the
boundary, all chosen so that, if N_1^+ denotes (N_{1+}, N_{+1}),

P_H[N_{11} < c_1^*(N_1^+) \mid N_1^+] + \gamma_1(N_1^+) P[N_{11} = c_1^*(N_1^+) \mid N_1^+] = \frac{\alpha}{2}

with a similar identity for c_2, \gamma_2 .
✷

Finally, consider
Example 8.2.7 (Continued). Permutation t-test. Consider an alternative with F =
N(\mu, \sigma^2) and G = N(\mu + \Delta, \sigma^2), \Delta > 0. Then the test of Theorem 8.2.4 can be written
as

Reject H if \frac{p(X, Y; \theta_1)}{p(X, Y; \theta_0)} > c(Z_{(1)}, \dots, Z_{(N)})    (8.2.21)

where \theta_1 = (\mu, \mu + \Delta, \sigma^2) and \theta_0 = (\mu, \mu, \sigma^2). This holds, since by sufficiency of
T(X, Y) = (Z_{(1)}, \dots, Z_{(N)})^T on \omega = \Theta_0 = \{\theta_0 : \mu \in R, \sigma^2 > 0\} and the general
principles of conditioning,

p_{\theta_0}[X, Y \mid T(X, Y) = t] = p(X, Y; \theta_0) \, 1[T(X, Y) = t] \big/ P_{\theta_0}[T(X, Y) = t]

for any \theta_0 \in \Theta_0 corresponding to H. But by Section 4.9.3 the ratio in (8.2.21) is a monotone
increasing function of \sum_{i=1}^n Y_i given Z_{(1)}, \dots, Z_{(N)}. Since \sum_{i=1}^n Y_i and Z_{(1)}, \dots, Z_{(N)}
do not involve any parameters, we conclude that the permutation t-test is UMP unbiased
against the usual \{N(\mu, \sigma^2), N(\mu + \Delta, \sigma^2) : \Delta > 0\} alternatives.
Remark 8.2.4. Recently the Lehmann-Scheffé theory of this section has been used to
derive most powerful unbiased selective tests for inference in exponential family models
after data-based model selection. See Fithian, Sun and Taylor (2014).
8.2.3 Estimation
We will now show how completeness provides the most powerful tool that we have for the
construction of uniformly minimum variance unbiased (UMVU) estimates. Recall from
Section 3.4.1 that given a model P ≡ {Pθ : θ ∈ Θ}, an estimate δ of a parameter
q(θ) ∈ R is unbiased iff
Eθ δ(X) = q(θ) for all θ .
A UMVU estimate δ ∗ is an unbiased estimate such that, for all θ and all unbiased δ,
Varθ δ ∗ (X) ≤ Varθ δ(X) .
Given any unbiased estimate δ(X) and sufficient statistic T (X) we now show how to
construct an unbiased Rao-Blackwell (RB) estimate depending on T only which is at least
as good.
Theorem 8.2.6 (Rao-Blackwell). Suppose \delta is an estimate of q(\theta) and E_\theta \delta^2(X) < \infty for
all \theta. Let T(X) be a sufficient statistic and denote the Rao-Blackwell estimate of q(\theta) by

\delta_{RB}(T(X)) = E\{\delta(X) \mid T(X)\} .    (8.2.22)

Then, for all \theta \in \Theta,

a) Bias_\theta(\delta_{RB}) = Bias_\theta(\delta).

b) MSE_\theta(\delta_{RB}(T)) \le MSE_\theta(\delta(X)) .    (8.2.23)

Equality holds in (8.2.23) iff \delta(X) is a function of T(X) (with probability 1).
Note. As in the case of conditional tests, sufficiency enables us to conclude that δRB does
not depend on θ so that it is a bona fide estimate.
Proof. Part a) is a consequence of the iterated expectation theorem,

E_\theta[\delta_{RB}(T(X))] = E_\theta[E(\delta(X) \mid T(X))] = E_\theta[\delta(X)] .    (8.2.24)

Further,

MSE_\theta(\delta_{RB}(T)) = E_\theta[\delta_{RB}(T(X)) - q(\theta)]^2
= E_\theta[E(\delta(X) \mid T(X)) - q(\theta)]^2
= E_\theta[E(\delta(X) - q(\theta) \mid T(X))]^2
\le E_\theta E[(\delta(X) - q(\theta))^2 \mid T(X)] = E_\theta[\delta(X) - q(\theta)]^2 = MSE_\theta(\delta(X))

by applying the inequality (EU)^2 \le EU^2 conditionally on T(X). By A.11.9, equality can
hold iff \delta(X) is constant given T(X). The theorem follows.
✷
Note. The theorem is a special case of Problem 3.4.2 which shows that δRB improves on
δ(X) for convex rather than just quadratic loss functions.
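The Rao-Blackwell improvement is easy to see numerically in the Bernoulli case, where \delta = X_1 is unbiased for p and \delta_{RB} = E(X_1 \mid \sum X_i) = \bar X. The sketch below is my own illustration (the constants are arbitrary):

```python
import random

rng = random.Random(3)
n, p, reps = 20, 0.3, 5000

mse_raw = 0.0
mse_rb = 0.0
for _ in range(reps):
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    delta = x[0]       # unbiased but crude: uses only one observation
    t = sum(x)
    delta_rb = t / n   # E(X1 | T) = T/n, the Rao-Blackwell improvement
    mse_raw += (delta - p) ** 2
    mse_rb += (delta_rb - p) ** 2
mse_raw /= reps
mse_rb /= reps
```

The true MSEs are p(1-p) = 0.21 and p(1-p)/n = 0.0105, so conditioning on T shrinks the MSE by a factor of n.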
The following Lehmann-Scheffé theorem (1950, 1955) completes the story of complete
sufficient statistics and UMVU estimates.
Corollary 8.2.2. (Lehmann–Scheffé) Suppose T (X) is complete as well as sufficient for
P and δ(X) is unbiased for q(θ). Then δRB (T ) is the unique UMVU estimate of q(θ).
Proof. By part a) of the theorem, \delta_{RB}(T) is unbiased. Suppose \tilde\delta, unbiased, is better than
\delta_{RB}(T), i.e. E_\theta[\tilde\delta(X) - q(\theta)]^2 \le E_\theta[\delta_{RB}(T) - q(\theta)]^2 with strict inequality for some \theta.
Then \tilde\delta_{RB}(T), the Rao-Blackwell version of \tilde\delta, has

E_\theta[\tilde\delta_{RB}(T) - q(\theta)]^2 \le E_\theta[\delta_{RB}(T) - q(\theta)]^2 .    (8.2.25)
On the other hand,
Eθ δRB T (X) − Eθ δeRB T (X) = Eθ δRB (T ) − δeRB (T ) = 0
for all θ. The result follows because completeness of T implies that
δRB (T ) − δeRB (T ) = 0 .
✷
Constructing UMVU estimates when a complete, sufficient T is available.

Method 1. Given q(\theta), find \delta such that E_\theta[\delta(T(X))] = q(\theta) for all \theta. Then \delta(T(X)) is
UMVU.

Method 2. Given q(\theta), find \delta(X) such that E_\theta[\delta(X)] = q(\theta) for all \theta. Then \delta^* \equiv
E[\delta(X) \mid T(X)] is UMVU.
Example 8.2.8. Exponential families. T(X) is UMVU for E_\theta T(X) \equiv \dot A(\theta). Recall two
special cases:

a) N/n is UMVU for p if N \sim M(n, p). All a^T N/n are UMVU estimates of a^T p.

b) \bar X is UMVU for \mu if X_1, \dots, X_n are i.i.d. N_p(\mu, \Sigma).

This example is also covered by Theorem 3.4.3.
✷
More examples will be given in the problems. However, consider the closely related

Example 8.2.9. Estimating the variance-covariance matrix \Sigma when X_1, \dots, X_n are i.i.d.
N_p(\mu, \Sigma) with both \mu, \Sigma unknown, for n \ge 2. By Theorem 8.2.2, a complete sufficient
statistic here is T(X) = (\sum_{i=1}^n X_i, \sum_{i=1}^n X_i X_i^T), where X = (X_{ij}), 1 \le i \le n, 1 \le j \le p.
We try Method 2. The MLE \hat\Sigma \equiv n^{-1} \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X)^T has

E(\hat\Sigma) = E\Big\{ \frac{1}{n} \sum_{i=1}^n (X_i - \mu)(X_i - \mu)^T - (\bar X - \mu)(\bar X - \mu)^T \Big\} = \Sigma \Big( 1 - \frac{1}{n} \Big) = \frac{n-1}{n} \Sigma .    (8.2.26)

Therefore (n/(n-1)) \hat\Sigma is an unbiased estimate of \Sigma and, since it is a function of T only, it
is UMVU. Note that \hat\Sigma is not a linear function of T, so the theory of Section 3.4 doesn't
apply.
✷
Remark 8.2.5. We could not establish in Section 3.4 that the unbiased estimate s2 of σ 2
in the Gaussian linear model is UMVU because Var(s2 ) does not achieve the information
lower bound. Now we have shown that s2 is UMVU.
Here is another example.
Example 8.2.10. Estimation of p for geometric observations. For illustration, we apply
both methods in this example. Suppose X_1, \dots, X_n are i.i.d. geometric(p),

P[X_1 = k] = q^k p, \quad 0 < p < 1, \quad k = 0, 1, \dots .

Then T \equiv \sum_{i=1}^n X_i is complete and sufficient and

P[T = k] = \binom{n + k - 1}{n - 1} q^k p^n, \quad k = 0, 1, \dots .
We want an unbiased estimate of p.

Method 1. We are supposed to find an estimate \delta such that

p = \sum_{k=0}^\infty \delta(k) \binom{n + k - 1}{n - 1} q^k p^n ,

or, equivalently,

1 = \sum_{k=0}^\infty \delta(k) \binom{n + k - 1}{n - 1} q^k p^{n-1} .

Then, for n \ge 2, by the unicity of power series and the fact that the negative binomial
distribution with parameters n - 1 and p assigns probability 1 to the nonnegative integers,

\delta(k) \binom{n + k - 1}{n - 1} = \binom{n + k - 2}{n - 2} .

Thus the solution is

\delta(k) = \binom{n + k - 2}{n - 2} \Big/ \binom{n + k - 1}{n - 1} = 1 - \frac{k}{n + k - 1} = \frac{n - 1}{n + k - 1} .

Then, \delta^*(T) = (n - 1)/(n + T - 1), n \ge 2, is UMVU.

Note that the MLE \hat p solves the likelihood equation,

E_{\hat p}(\bar X) = \bar X ,

or, equivalently,

\frac{1}{\hat p} - 1 = \bar X ,

yielding \hat p = n/(n + T), which is close to but not the same as the UMVUE.
Method 2. We begin with a simple unbiased estimate, \delta(X_1) \equiv 1(X_1 = 0). Then
E[\delta(X_1) \mid T = k] = P[X_1 = 0 \mid T = k], where, for general j,

P[X_1 = j \mid T = k] = P\Big[X_1 = j, \sum_{i=2}^n X_i = k - j\Big] \Big/ P\Big[\sum_{i=1}^n X_i = k\Big]
                     = \binom{n + k - j - 2}{n - 2} \Big/ \binom{n + k - 1}{n - 1} .
By Corollary 8.2.2, δ ∗ (k) = P [X1 = 0|T = k] is the UMVU estimate which we already
derived using Method 1.
✷
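Both methods can be checked numerically. The sketch below is my own illustration (the values of n and p are arbitrary): summing \delta^*(k) P[T = k] recovers p, and the UMVUE differs slightly from the MLE.

```python
from math import comb

def pmf_T(k, n, p):
    # T = sum of n i.i.d. geometric(p) counts is negative binomial
    return comb(n + k - 1, n - 1) * (1 - p) ** k * p ** n

def delta_star(k, n):
    return (n - 1) / (n + k - 1)   # the UMVU estimate of p

n, p = 5, 0.4
# Truncated expectation; the tail beyond k = 500 is negligible here.
expect = sum(delta_star(k, n) * pmf_T(k, n, p) for k in range(500))
umvue = delta_star(10, n)          # UMVUE at an observed T = 10
mle = n / (n + 10)                 # MLE at the same T, for comparison
```

The unbiasedness check reproduces the power-series argument of Method 1 term by term.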
Here is a nonparametric example.

Example 8.2.11. Estimation of F. Suppose X is univariate and F is the df of X, which is
assumed completely unknown. We assume P is the family of all distributions with continuous
densities. Here X_{(1)} < \dots < X_{(n)} are complete and sufficient by Example 8.2.3.
Since

\hat F(x) \equiv \frac{1}{n} \sum_{i=1}^n 1(X_i \le x) = \frac{1}{n} \sum_{i=1}^n 1(X_{(i)} \le x)

and \hat F is unbiased, we conclude that \hat F(x) is a UMVU estimate of F(x) for all x. More
generally, every function q(\hat F) of \hat F(\cdot) is a UMVU estimate of its own expectation.
✷
Remark 8.2.6. The basis of our treatment and many more examples may be found in
Lehmann and Casella (1998) and the problems.
Summary. We introduce the concept of completeness for a class of probability distributions
P and show how it can be used in the construction of tests and estimates with desirable
properties. In essence, P is complete if it admits a sufficient statistic T such that the only
function v(T ) that is an unbiased estimate of zero for all P ∈ P is the function v(T ) = 0.
For this case T is also called complete. We call a test \phi(X) similar or distribution-free
with respect to a hypothesized class of probabilities P_0 if E_P \phi(X) = \alpha for all P \in P_0,
and show that if T(X) is a complete, sufficient statistic for P_0, then a randomized test \phi
is similar level \alpha iff E_P[\phi(X) \mid T(X)] = \alpha with P probability one for all P \in P_0. A
test \phi satisfying E_P[\phi(X) \mid T(X)] = \alpha, P \in P_0, for a complete, sufficient statistic T is
said to have Neyman structure. Important examples of complete, sufficient statistics are
the natural sufficient statistic in k-dimensional exponential family models and the vector of
order statistics or the empirical distribution. It is shown that when a complete, sufficient
statistic exists, it can be used to construct similar tests by conditioning on a statistic which is
complete and sufficient for the null hypothesis class of probabilities P_0. Important special
cases are testing problems in exponential family models, Gaussian models, contingency
tables, and the two-sample models where conditioning leads to permutation tests and rank
tests. We define S(X) to be an ancillary statistic with respect to P if \mathcal{L}_P(S(X)) does
not depend on P for P \in P, and show Basu's theorem stating that if S(X) is ancillary
for P and T(X) is complete, sufficient for P, then S(X) and T(X) are independent when
X \sim P \in P. For testing H : \theta \in \Theta_0 vs K : \theta \in \Theta_1, we define a test \phi to be unbiased level
α if Eθ φ(X) ≤ α for all θ ∈ Θ0 and Eθ φ(X) ≥ α for all θ ∈ Θ1 ; that is, φ is no more
likely to accept H when K is true than when H is true. We define φ to be UMP unbiased
level α if it is UMP among all unbiased level α tests and show how such tests can be
found for some models when a simple statistic which is complete, sufficient with respect to
{Pθ : θ ∈ ∂Θ0 ∩ ∂Θ1 } is available. We give the Rao-Blackwell theorem which establishes
that if S is any estimate of q(θ), then the conditional expected value E(S|T ) of S given a
complete, sufficient statistic T has mean squared error at least as small as that of S. The
Lehmann-Scheffé theorem establishes that if S is unbiased, E(S \mid T) is UMVU. We give
two methods for finding UMVU estimates and show, for example, that in the multivariate
normal model \bar X = n^{-1} \sum X_i and (n - 1)^{-1} \sum (X_i - \bar X)(X_i - \bar X)^T are UMVU for \mu
and \Sigma.
8.3 Invariance, Equivariance, and Minimax Procedures
The topics we present in this section are a small subset of the extensive discussion given in
Lehmann's classics, Theory of Point Estimation (TPE) and Testing Statistical Hypotheses
(TSH), now Lehmann and Casella (1998) and Lehmann and Romano (2005).
8.3.1 Group Models
In Section 8.2 we have seen how to reduce composite hypotheses to simple ones by conditioning on a sufficient statistic. Basu’s theorem further enabled us to identify situations
where ancillary statistics could be used to turn conditional tests into unconditional ones. In
this section we will begin by identifying an important structural property which is typical
of such situations and which has further important implications.
We recall the definition of a group(1) G of transformations g of a set L. Recall that all
g \in G map L onto itself in a one-to-one fashion. Group multiplication is composition, and G
is closed under composition and inversion. That is,

i) If g_1, g_2 \in G, so does g_1 \circ g_2, given by (g_1 \circ g_2)(x) \equiv g_1(g_2(x)).

ii) If g \in G, so does g^{-1}, defined by g \circ g^{-1} = g^{-1} \circ g = j where j(x) = x for all x \in L.
Here is an important example of such a group.
The affine group on Rd
Here L = Rd and G can be parametrized by θ ≡ (A, b) where Ad×d ranges over all
nonsingular matrices and b over Rd . Then, g(A,b) (x) ≡ Ax + b. It is easy to see that G is
a group and that the group G is isomorphic to the group G on A × Rd where A is the set of
all nonsingular matrices via the correspondence
g(A1 ,b1 ) ◦ g(A2 ,b2 ) = g(A1 A2 ,A1 b2 +b1 )
−1
g(A,b)
= g(A−1 ,−A−1 b) .
As we know G is not commutative. Most of the groups we consider will be subgroups,
some commutative, of the affine group.
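The composition and inverse rules for the affine group can be verified directly. The sketch below is my own illustration with d = 2 and plain-list matrices, checking both identities at a point:

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_vec(A, x):
    return [sum(A[i][k] * x[k] for k in range(2)) for i in range(2)]

def g(A, b, x):
    # the affine transformation g_{(A,b)}(x) = Ax + b
    return [u + v for u, v in zip(mat_vec(A, x), b)]

A1, b1 = [[2.0, 1.0], [0.0, 1.0]], [1.0, -1.0]
A2, b2 = [[1.0, 0.0], [3.0, 2.0]], [0.5, 2.0]
x = [1.0, 2.0]

# composition law: g_{(A1,b1)} o g_{(A2,b2)} = g_{(A1 A2, A1 b2 + b1)}
lhs = g(A1, b1, g(A2, b2, x))
rhs = g(mat_mul(A1, A2), [u + v for u, v in zip(mat_vec(A1, b2), b1)], x)

# inverse: g_{(A,b)}^{-1} = g_{(A^{-1}, -A^{-1} b)}, here for A1 (det = 2)
A1_inv = [[0.5, -0.5], [0.0, 1.0]]
b1_inv = [-u for u in mat_vec(A1_inv, b1)]
x_back = g(A1_inv, b1_inv, g(A1, b1, x))
```

Note the non-commutativity: swapping (A_1, b_1) and (A_2, b_2) generally changes the composite.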
Groups on the sample space X induce models. The simplest situation is when we start
with a single P0 and define
P = {P0 g −1 : g ∈ G}
(8.3.1)
or, equivalently, if X ∼ P0 , g(X) ∼ P0 g −1 ∈ P. Note that G parametrizes P. If the
parametrization is identifiable, P0 g0−1 = P0 g1−1 ⇒ g0 = g1 , then G induces a group G
on P defined by \bar g(P) \equiv P g^{-1} \in P. More generally, if \theta \to P_\theta, \theta \in \Theta, is an identifiable
parametrization of P, then G induces a group \bar G on \Theta via

P_{\bar g \theta} = P_\theta g^{-1} .    (8.3.2)

It is easy to establish (Problem 8.3.1) that (8.3.2) defines \bar g, a transformation of \Theta, uniquely,
and that \bar G = \{\bar g : g \in G\} is a group isomorphic to G. That is, the map g \to \bar g has the
properties \overline{g_1 \circ g_2} = \bar g_1 \circ \bar g_2, \overline{g^{-1}} = \bar g^{-1}. We shall call models such as (8.3.1) transformation
models induced by P_0, G. Here is a basic example.
Example 8.3.1. The multivariate Gaussian model. Take P0 = Nd (0, J), where J is the
d × d identity matrix. That is, if ε ≡ (ε1 , . . . , εd ) ∼ P0 , the ε1 , . . . , εd are i.i.d. N (0, 1).
The transformation model induced by P0 and the affine group is just P = {Nd (µ, Σ); µ ∈
Rd , Σ nonsingular} (Problem 8.3.2). Note that the parametrization PA,b corresponding to
g(A,b) (ε) = Aε + b is not identifiable—see Example 9.1.5.
✷
More generally, we define a group model P by the property that if P \in P, X \sim P, and
g \in G, then P g^{-1} \in P. Here is a simple example.
Example 8.3.2. The shift model. Fix a density f0 on Rd and let G0 be the group of all
translations on Rd , gc (x) = x + c, and P0 the corresponding transformation model. G0
is clearly a subgroup of the affine group and if f0 is Gaussian P0 ⊂ P of Example 8.3.1.
Note that the parametrization is identifiable. The model can be written structurally,
X=c+ε
(8.3.3)
where ε ∼ f0 . For Example 8.3.1, X = Aε + b, where A is nonsingular and ε ∼ N (0, J).
Now consider the set of all probability distributions corresponding to (8.3.3) when we
take f0 to be arbitrary. This remains a group model under the shift group but is not a
transformation model since we cannot transform one density shape, say Gaussian, into
another, say Cauchy, by simply shifting. Equivalently the parametrization (c, f ) → f (·−c)
is not identifiable.
✷
The major group we consider in addition to the affine group, and some of its subgroups, is
the group of increasing transformations:
Example 8.3.3. The group G of all continuous strictly increasing functions from R onto
R. Let F0 be the cdf of P0 where F0 ∈ G. Then the model P generated by P0 and G is
the set of all P on R with continuous strictly increasing df. If we consider the subgroup
of g which have positive derivatives and P0 has a density which is positive we generate the
submodel of all P having positive densities (Problem 8.3.3).
✷
Remarks 8.3.1.

(a) Any given point x_0 generates an orbit O_{x_0} \equiv \{y : y = g x_0 \text{ for some } g \in G\}. Belonging
to the same orbit is an equivalence relation on X (Problem 8.3.4).

(b) The groups we have defined so far allow only for samples of size 1. However, if a
group G is defined on X and we have a sample (X_1, \dots, X_n) taking values in X^n, then we
naturally define

g^{(n)}((x_1, \dots, x_n)^T) = (g(x_1), \dots, g(x_n))^T ,    (8.3.4)

the tensor group, with the natural inherited group operations and properties.

(c) For n = 1 all the groups we have considered are transitive, that is, there is only one
orbit, X itself. That is, given x_0, x_1 \in X there exists g \in G such that g x_0 = x_1. For
n > 1, following the definition in (8.3.4), this is no longer the case. For instance, for the
shift group with d = 1, orbits are indexed by R^{n-1}. Given (x_{01}, \dots, x_{0,n-1})^T, the orbit is
the line \{(x_{01}, \dots, x_{0,n-1}, 0)^T + c(1, \dots, 1)^T : c \in R\} (Problem 8.3.5). We re-encounter such
quantities subsequently.
8.3.2 Group Models and Decision Theory
The basic connection between group models and inference is made via a group G ∗ on the
action space A and the loss function relation,
l(P g −1 , g ∗ a) = l(P, a)
(8.3.5)
for all P ∈ P, g ∈ G, g ∗ ∈ G ∗ , a ∈ A. In other words, the cost of taking action a when you
observe X ∼ P is the same as that of taking action g ∗ a when you observe gX ∼ P g −1 .
Consistency with (8.3.5) for a decision rule δ is defined by
δ(gx) = g ∗ δ(x)
(8.3.6)
for all x ∈ X . Such rules are called equivariant. In the case of tests they are called
invariant for reasons that will become apparent. In other words again, equivariance means
that if action a is “right” when you see x then g ∗ a is “right” if you see gx. Requiring
decision procedures to have this property has the same nature as requiring unbiasedness.
This time the rules in the class obey certain symmetries. Which groups G∗ appear?
Here are examples showing what happens in estimation and testing.
Example 8.3.4. Testing in the Gaussian two-sample problem. Suppose that as in Section
4.9.3, we observe a sample X1 , . . . , Xn1 of i.i.d. N (µ, σ 2 ) observations and an independent sample Y1 , . . . , Yn2 of N (µ + ∆, σ 2 ) observations. We wish to test H : ∆ = 0, and
postulate the usual 0 − 1 loss function. The model P can be viewed as a group model with
respect to the shift group applied to (X1 , . . . , Xn1 , . . . , Y1 , . . . , Yn2 ). If
Xi → Xi + c, 1 ≤ i ≤ n1 , Yj → Yj + c, 1 ≤ j ≤ n2
under gc , and we parametrize P by θ ≡ (µ, ∆, σ²) in the standard way, then ḡc θ =
(µ + c, ∆, σ²). Note that Θ0 ≡ {(µ, ∆, σ²) : ∆ = 0} is a union of orbits of Ḡ, as is
Θ1 = {(µ, ∆, σ²) : ∆ ≠ 0}.
✷
Following Lehmann and Romano (2005), we shall in general call a testing problem
H : θ ∈ Θ0 vs K : θ ∈ Θ1 invariant if P is a group model and Θ0 and Θ1 are disjoint
Section 8.3
95
Invariance, Equivariance, and Minimax Procedures
unions of Ḡ orbits. Consider now the usual 0 − 1 loss. If Θ0 is an orbit and θ0 ∈ Θ0 , then
l(ḡθ, 0) = 1 − l(ḡθ, 1) for all ḡ, and similarly for Θ1 . It is then natural to take G ∗ =
{Identity}, the trivial group, and see that (8.3.6) defines an invariant test
δ(gx) = δ(x)
(8.3.7)
for all g, x. We take (8.3.7) as our general definition of an invariant test.
Example 8.3.4 (continued). In this example invariance means
δ(x1 + c, . . . , xn1 + c, y1 + c, . . . , yn2 + c) = δ(x1 , . . . , xn1 , y1 , . . . , yn2 ) (8.3.8)
for all x, y, c. For n1 = n2 = 1, this model is invariant under the location-scale group,
g(a,b) (x1 , y1 ) = (ax1 + b, ay1 + b), a ≠ 0, b ∈ R .
Now
ḡ(a,b) (µ, ∆, σ²) = (aµ + b, a∆, a²σ²)
and Θ0 and Θ1 are again orbits. This invariance is preserved for general n1 , n2 if g is
extended to Rn1 × Rn2 as in (8.3.4). Now test invariance becomes more stringent,
δ(ax1 + b, . . . , axn1 + b, ay1 + b, . . . , ayn2 + b) = δ(x, y) for all a ≠ 0, b   (8.3.9)
✷
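As a quick numerical check of (8.3.9) (a sketch with made-up data, not from the text): the pooled two-sample t statistic changes only in sign under a common affine transformation of both samples, so a test that rejects for large |T| is invariant in the sense of (8.3.9).

```python
import math

def t_stat(x, y):
    """Two-sample pooled t statistic, a function of the data only."""
    n1, n2 = len(x), len(y)
    mx, my = sum(x) / n1, sum(y) / n2
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    sp2 = ss / (n1 + n2 - 2)  # pooled variance
    return (my - mx) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical samples, used only to illustrate the invariance.
x = [0.3, 1.2, -0.5, 0.8]
y = [1.1, 2.0, 0.7]
for a, b in [(2.0, 3.0), (-1.5, 0.4)]:
    gx = [a * v + b for v in x]
    gy = [a * v + b for v in y]
    # |T| is unchanged when both samples undergo the same affine map,
    # so a test rejecting for large |T| satisfies (8.3.9).
    assert abs(abs(t_stat(gx, gy)) - abs(t_stat(x, y))) < 1e-9
```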
Example 8.3.2. The shift model (continued). Suppose we observe X1 , . . . , Xn i.i.d. P
from the shift model with fixed f0 . Parametrize the shift model by θ ∈ R, P = {fθ ≡
f0 (· − θ) : θ ∈ R}. Suppose we want to estimate θ with loss l(θ, a) ≡ λ(|θ − a|),
λ : R+ → R+ . Then,
l(gc θ, a) = λ(|θ + c − a|) = λ(|θ − (a − c)|) .
So, for equivariance, we must define gc∗ (a) = a + c and G ∗ is the shift group on R. Thus
a translation equivariant estimate δ satisfies
δ(x1 + c, . . . , xn + c) = δ(x1 , . . . , xn ) + c, all x1 , . . . , xn , c .
(8.3.10)
✷
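Condition (8.3.10) can be checked directly for familiar estimates; the sketch below (with hypothetical data) verifies it for the sample mean and the sample median.

```python
import statistics

# Hypothetical data; both the mean and the median satisfy (8.3.10):
# delta(x1 + c, ..., xn + c) = delta(x1, ..., xn) + c.
x = [0.9, -0.2, 1.4, 0.5, 2.1]
c = 3.7
shifted = [v + c for v in x]
assert abs(statistics.mean(shifted) - (statistics.mean(x) + c)) < 1e-9
assert abs(statistics.median(shifted) - (statistics.median(x) + c)) < 1e-9
```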
Example 8.3.5. The Gaussian one-sample problem. Suppose we observe X1 , . . . , Xn
i.i.d. N (µ, σ 2 ). As in Example 8.3.4 this model is a group model for G = {gab : xi →
axi + b, 1 ≤ i ≤ n, a 6= 0, b ∈ R}. If θ = (µ, σ 2 ), then
ḡ(a,b) (µ, σ²) = (aµ + b, a²σ²) ,
a group operating on R × R+ . Suppose we want to estimate µ with the equivariant Kullback-Leibler loss (Vol. I, page 169), which in this case is equivalent to scaled quadratic loss,
l(θ, d) = (µ − d)2 /σ 2 . Then, as in the previous example,
l(g(a,b) θ, d′ ) = (aµ + b − d′ )²/(a²σ²) = (µ − (d′ − b)/a)²/σ² .
96
Distribution-Free, Unbiased, and Equivariant Procedures
Chapter 8
Thus we must define g∗(a,b) (d) = ad + b and G ∗ is the affine group on R. Correspondingly,
an estimate δ of µ is said to be location-scale equivariant iff
δ(ax1 + b, . . . , axn + b) = aδ(x1 , . . . , xn ) + b, all a ≠ 0, b ∈ R .   (8.3.11)
8.3.3  Characterizing Invariant Tests
Invariant functions
Rather than restrict ourselves to invariant test functions we more generally define an
invariant function ψ : X → τ , where τ is arbitrary, as one such that
ψ(gx) = ψ(x), all g ∈ G, x ∈ X .
(8.3.12)
Invariant functions can equivalently be characterized as functions which are constant
on orbits of G. This leads naturally to the definition of a maximal invariant (function)
M : X → L, where L is general, as a function which is invariant, but further, M (x1 ) ≠
M (x2 ) if x1 , x2 are not in the same orbit. Thus M labels orbits uniquely. Clearly every
invariant ψ is a function of M ; there exists λ : L → τ such that
ψ(x) = λ(M (x)) .
(8.3.13)
From its definition we see that M is far from unique. In fact, for those familiar with measure
theory the appropriate object to consider is the invariant σ field
I = {A measurable ⊂ X : gA = A for all g ∈ G}.
M simply induces I, and (8.3.13) can be interpreted as: “Every invariant function is measurable with respect to I.”
Suppose that testing H : θ ∈ Θ0 vs K : θ ∈ Θ1 is invariant under G as discussed in the
previous section. We define ϕ∗ (x) (possibly randomized) as the (uniformly) most powerful
(UMP) invariant level α test of H vs K if ϕ∗ is invariant, has level α, and is at least as
powerful as any other level α invariant test (defined by ϕ invariant). Our discussion makes
it plain that a UMP invariant (UMPI) test is simply the UMP test based on observing M (X)
only.
Here are three examples of maximal invariants.
Example 8.3.6. The shift group. We consider the group applied to (Rd )n ,
gc : (x1 , . . . , xn ) → (x1 + c, . . . , xn + c).
If, for all c, ψ(x1 + c, . . . , xn + c) = ψ(x1 , . . . , xn ), let c = −x1 . Then,
ψ(x1 , . . . , xn ) = ψ(0, x2 − x1 , . . . , xn − x1 ) .
(8.3.14)
But the right hand side of (8.3.14) is clearly invariant with respect to shift. Evidently
M (x1 , . . . , xn ) ≡ (x2 − x1 , . . . , xn − x1 ) is a maximal invariant since (x⁰2 − x⁰1 , . . . , x⁰n − x⁰1 ) = (x¹2 − x¹1 , . . . , x¹n − x¹1 ) implies that (x⁰1 , . . . , x⁰n ) = (x¹1 + c, . . . , x¹n + c) where
c = x⁰1 − x¹1 , which is just the condition required by maximality. Note that we could just
as well have taken M (x1 , . . . , xn ) = (x1 − x̄, . . . , xn − x̄) where x̄ is the mean of the
xi . Note also that (x1 , M (x1 , . . . , xn )) is an alternative representation of (x1 , . . . , xn ).
Thus, the value of M (x1 , . . . , xn ) tells us which orbit we’re on and x1 parametrizes points
in the orbit, or equivalently G. More generally, if G is parametrized smoothly by θ → gθ
where θ ∈ Θ, an l dimensional manifold, and the dimension of X is p, we expect M (x) to
range over a p − l dimensional manifold. If l ≥ p, then G is transitive.
✷
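The construction above can be sketched in code; `shift_max_invariant` below is a hypothetical helper (not from the text) implementing M(x) = (x2 − x1 , . . . , xn − x1 ), and the assertions check invariance and the representation (x1 , M(x)).

```python
def shift_max_invariant(x):
    """M(x) = (x2 - x1, ..., xn - x1), constant on orbits of the shift group."""
    return tuple(v - x[0] for v in x[1:])

def close(u, v, tol=1e-9):
    return len(u) == len(v) and all(abs(a - b) < tol for a, b in zip(u, v))

x = (1.0, 2.5, -0.3, 4.1)          # hypothetical data point
gx = tuple(v + 2.2 for v in x)     # another point on the same orbit
assert close(shift_max_invariant(gx), shift_max_invariant(x))  # invariance
# (x1, M(x)) is an alternative representation of x: x1 locates the point
# within its orbit and M(x) labels the orbit.
recovered = (x[0],) + tuple(x[0] + m for m in shift_max_invariant(x))
assert close(recovered, x)
```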
Example 8.3.7. The orthogonal group. Let X = Rd and G be the group of orthogonal d × d
matrices, the set of all d × d matrices A such that AAT = AT A = J, the identity. As is well known,
orthogonal transformations preserve Euclidean distances between points, |Ax − Ay|2 =
|A(x − y)|2 = (x − y)T AT A(x − y) = |x − y|2 . This shows that M (x) ≡ |x|2 is
invariant. On the other hand |x|2 = |y|2 implies that there exists A orthogonal such that
Ax = y (Problem 8.3.6). Thus, |x|2 is a maximal invariant.
✷
Example 8.3.3 (continued). The group of monotone transformations. Suppose g is a
continuous strictly monotone increasing function from R onto R and x(1) < . . . < x(n) .
Evidently, g(x(1) ) < . . . < g(x(n) ). Thus if X = {(x1 , . . . , xn ) : x1 ≠ x2 ≠ . . . ≠ xn }, the
ranks Rj (x1 , . . . , xn ) = Σ_{i=1}^n 1(xi ≤ xj ), 1 ≤ j ≤ n, are invariant. On the other hand,
suppose (R1 (x), . . . , Rn (x)) = (R1 (y), . . . , Rn (y)), x, y ∈ X . Then, there exists g ∈ G
such that g(x(i) ) = y(i) , 1 ≤ i ≤ n, where x(1) , . . . , x(n) , y(1) , . . . , y(n) are the ordered
x’s and y’s. Hence (R1 , . . . , Rn ) is a maximal invariant (Problem 8.3.7). Note in this case
a natural representation of (x1 , . . . , xn ) is (R1 (x), . . . , Rn (x), x(1) , . . . , x(n) ).
✷
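The rank computation can be illustrated directly (a sketch; the map g and the data below are made up): the ranks are unchanged by any strictly increasing transformation.

```python
def ranks(x):
    """R_j(x) = #{i : x_i <= x_j}, as in Example 8.3.3 (distinct values assumed)."""
    return [sum(1 for xi in x if xi <= xj) for xj in x]

x = [0.7, -1.2, 3.4, 0.1]          # hypothetical sample
g = lambda t: t ** 3 + t           # a continuous strictly increasing map from R onto R
# Applying g coordinatewise leaves the rank vector unchanged.
assert ranks([g(v) for v in x]) == ranks(x) == [3, 1, 4, 2]
```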
We now construct UMPI tests.
Example 8.3.8. Testing one shift family against another. Let X ∈ Rd be distributed according to Pj,c , j = 0, 1, c ∈ Rd where Pj,c has continuous case density fj (· − c), j = 0, 1,
and f0 , f1 are of different types, that is, f1 does not belong to the location family generated
by f0 . We want to test H : P ∈ P0,c for some c vs K : P ∈ P1,c for some c. The problem
is clearly invariant under the shift group G. By Example 8.3.6, we need only consider tests
based on M (X) = (X2 − X1 , . . . , Xd − X1 ). Note that the distribution of M (X) does not
depend on c and M (X) has density on Rd−1 given by
gj (m) = ∫_{−∞}^{∞} fj (u, m1 + u, . . . , md−1 + u) du .   (8.3.15)
Thus a UMPI level α test exists and is of the form
Reject H if g1 (M (X)) > cg0 (M (X)), Accept H if g1 (M (X)) < cg0 (M (X))
with c determined by g0 and α.
✷
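A numerical sketch of (8.3.15): assuming, for illustration only, d = 2 with f0 a product of N(0, 1) densities and f1 a product of Laplace(0, 1) densities (these choices are not from the text), the densities gj of M = X2 − X1 can be computed by quadrature and the UMPI likelihood ratio formed from them.

```python
import math

def f0(u, v):
    # product of two standard normal densities
    return math.exp(-(u * u + v * v) / 2) / (2 * math.pi)

def f1(u, v):
    # product of two standard Laplace densities
    return math.exp(-abs(u) - abs(v)) / 4

def g(f, m, lo=-30.0, hi=30.0, steps=6000):
    """g(m) = integral of f(u, m + u) du: the density (8.3.15) of M = X2 - X1,
    computed here by midpoint quadrature."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h, m + lo + (k + 0.5) * h)
                   for k in range(steps))

m = 1.3                         # a hypothetical observed value of X2 - X1
ratio = g(f1, m) / g(f0, m)     # the UMPI test rejects H when this exceeds c
# Sanity check: under f0, X2 - X1 is N(0, 2).
assert abs(g(f0, m) - math.exp(-m * m / 4) / math.sqrt(4 * math.pi)) < 1e-6
```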
This example suggests the result:
Proposition 8.3.1. The distribution of any maximal invariant M (X) of a group G leaving a
model P = {Pθ : θ ∈ Θ} invariant depends on θ only through M̄ (θ),
the maximal invariant of the group Ḡ induced on Θ by G.
Proof: See Problem 8.3.8 or Lehmann and Romano (2005).
Example 8.3.9. The Gaussian linear model with known variance. Suppose that as in
Section 6.1, Y n×1 = µn×1 + ε where ε ∼ N (0, σ02 J) and µ ∈ V , a linear subspace
of Rn of dimension d. We want to test µ = 0 vs µ 6= 0. This testing problem is
left invariant by the group G of orthogonal matrices operating on Rn , since Aµ = 0 iff
µ = 0 and Aε ∼ N (0, σ0²AJAT ) = N (0, σ0²J) if A is orthogonal. Reducing by sufficiency to the
coordinate vector X ∈ Rd of the projection of Y onto V (in an orthonormal basis of V ), we may
identify µ with its coordinate vector and take X ∼ Nd (µ, σ0²J). By Example 8.3.7,
M ≡ |X|² = Σ_{j=1}^d Xj²
is maximal invariant. Under H, M has a σ0²χ²d distribution. More generally, M ∼ σ0²χ²d (|µ|²/σ0²),
where χ²d (θ²) is a noncentral χ² distribution with noncentrality parameter θ². It may be shown
(Problem 8.3.9) that the χ²d (θ²) family is Monotone Likelihood Ratio (MLR) in θ² and M . Therefore the test,
Reject H iff M/σ0² ≥ χd (1 − α),
where χd (1−α) is the (1−α) quantile of χ2d , is UMP invariant by Theorem 4.3.1. Extension
of this result to the case σ 2 unknown and linear hypotheses are given in Lehmann and
Romano (2005).
✷
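A Monte Carlo sketch of the test above (the values of d, σ0, and the use of the Wilson-Hilferty quantile approximation are choices made here to stay within the standard library, not from the text): simulating under H, the rejection rate of {M/σ0² ≥ χd (1 − α)} should be close to α.

```python
import math, random
from statistics import NormalDist

random.seed(1)
d, sigma0, alpha = 4, 2.0, 0.05
z = NormalDist().inv_cdf(1 - alpha)
# Wilson-Hilferty approximation to the (1 - alpha) quantile of chi-square_d.
chi_q = d * (1 - 2 / (9 * d) + z * math.sqrt(2 / (9 * d))) ** 3

trials, reject = 20000, 0
for _ in range(trials):
    x = [random.gauss(0.0, sigma0) for _ in range(d)]  # under H: mu = 0
    M = sum(v * v for v in x)                          # maximal invariant |X|^2
    if M / sigma0 ** 2 >= chi_q:
        reject += 1
level = reject / trials
assert abs(level - alpha) < 0.01  # empirical level near the nominal 0.05
```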
Example 8.3.10. The nonparametric two-sample problem. Let X1 , . . . , Xn1 be i.i.d. F ,
Xn1 +1 , . . . , Xn be i.i.d. G, where F, G are continuous but otherwise unknown. All the
X’s are independent. The testing problem, H : F = G vs K : F ≠ G, is invariant under the transformation group of Example 8.3.3, applied to X = (X1 , . . . , Xn )T .
Thus R ≡ (R1 , . . . , Rn )T is a maximal invariant. Under H, we have already seen that
P [(R1 , . . . , Rn ) = (i1 , . . . , in )] = 1/n! for all permutations {i1 , . . . , in } of {1, . . . , n}.
Under the alternative, however, although there is a formula for the distribution of R, there
is no UMP invariant test (Problem 8.3.10). However, as was first noted by Lehmann (1953)
there are semiparametric group submodels P ≡ {Pθ,F : G = q(F, θ), θ ∈ R} for which
(R1 , . . . , Rn ) has a distribution depending on θ only (Problem 8.3.11). For instance, Savage (1956) considers
G = 1 − (1 − F )θ ,
(8.3.16)
θ ≥ 1, a model which includes F (t) = 1 − e−λt , G(t) = 1 − e−λθt , the exponential
lifetime two-sample model. The Lehmann-Savage model is a special case of the Cox
proportional hazard model, Example 9.1.4, with λ being the hazard rate of F and λθ that
of G. See Problems 1.1.12 and 1.1.13. By arguing as in Example 8.3.8, for the model
G = 1 − (1 − F )θ , there is a UMP invariant (wrt F ) test for H : θ = 1 vs K : θ = θ1 > 1,
which in this case can be computed (Problem 8.3.10(c)). Unfortunately, it depends on θ1
and we need a new concept: A test ψ of H : θ = θ0 vs K : θ > θ0 is locally MP if there
exists ε > 0 such that ψ is MP for all θ ∈ (θ0 , θ0 + ε). A formula for computing such tests
is given in Problems 8.3.10-8.3.12. For model (8.3.16) there is a locally most powerful (see
Problems 8.3.12 and 8.3.13) UMP invariant test called the Savage exponential scores test.
Let aE (j) = E(X(j) ) where X(1) < . . . < X(n) are the exponential, E(1), order statistics
(see Problem 8.3.13). Then the Savage exponential scores test is given by
Reject H iff Σ_{i=n1+1}^{n} aE (Ri ) < c
with c determined under F = G by the level α and the uniform distribution on permutations. Hoeffding (1951) considered the optimal rank test for the Gaussian two-sample
problem. Note that all rank tests are permutation tests and are similar, and they have the
attractive feature that when H : F = G holds their distributions do not depend on F = G:
they are distribution-free.
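The scores aE (j) have the standard closed form E(X(j) ) = Σ_{i=n−j+1}^n 1/i for Exp(1) order statistics, which makes the Savage statistic easy to compute; the sketch below (hypothetical helper names) illustrates this.

```python
def savage_scores(n):
    """a_E(j) = E(X_(j)) for Exp(1) order statistics: sum_{i=n-j+1}^{n} 1/i,
    a standard closed form for exponential order-statistic means."""
    return [sum(1.0 / i for i in range(n - j + 1, n + 1)) for j in range(1, n + 1)]

def savage_statistic(second_sample_ranks, n):
    """Sum of exponential scores a_E(R_i) over the second sample's ranks."""
    a = savage_scores(n)
    return sum(a[r - 1] for r in second_sample_ranks)

# The scores sum to n, since the order statistics sum to the whole sample.
assert abs(sum(savage_scores(6)) - 6.0) < 1e-12
# a_E(n) is the harmonic number H_n (the expected exponential maximum).
assert abs(savage_scores(4)[-1] - (1 + 1 / 2 + 1 / 3 + 1 / 4)) < 1e-12
```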
Example 8.3.11. Invariant regression tests. Copula models. As in Section 6.1 consider the
framework where the ith measurement (response) Yi among n independent observations
has a continuous distribution Fi and we want to investigate whether Fi depends on a vector
zi = (zi1 , . . . , zid )T of available constants (predictor values), 1 ≤ i ≤ n, d < n, where
the columns (z1j , . . . , znj )T , 1 ≤ j ≤ d, are not collinear. We consider the testing problem
H : Fi = F , 1 ≤ i ≤ n, vs K : Fi ≠ Fk , some i, k, which is invariant under the group of continuous increasing
transformations of Example 8.3.3 applied to Y = (Y1 , . . . , Yn )T . Thus the rank vector
R = (R1 , . . . , Rn )T of Y is maximal invariant. Consider semiparametric group submodels
of the form
Fi (·) = Ci (F (·)), 1 ≤ i ≤ n ,
where Ci , which is called a copula, is a continuous df on (0, 1) which is known except for
a parameter θi that depends on zi ; and F is an unknown continuous baseline df.
The term “copula,” introduced by Sklar (1959), is a measure of the dependence between
variables X1 , . . . , Xp . It is defined as
C(u1 , . . . , up ) = P (G1 (X1 ) ≤ u1 , . . . , Gp (Xp ) ≤ up ), p ≥ 2 ,
where the marginal df’s G1 , . . . , Gp are assumed to be continuous. Here we extend this
definition to the p = 1 case: the regression copula for Yi ∼ Fi w.r.t. the hypothesis
H : Fi = F , 1 ≤ i ≤ n, for continuous F is
Ci (u) = P (F (Yi ) ≤ u) = Fi (F ⁻¹(u)), 1 ≤ i ≤ n .
The copula Ci (u) is a measure of the dependence of the distribution Fi on zi . The term
“baseline df” refers to the hypothesis distribution F that does not depend on
zi .
Set Vi = F (Yi ), then Ri = Rank(Vi ) because the ranks are invariant under increasing
transformations. The df of Vi is
Ci (F (F ⁻¹(v))) = Ci (v), 0 ≤ v ≤ 1 ,
and the distribution of R only involves Ci (·). Useful choices are the Gaussian copula
Ci (v) = Φ(Φ⁻¹(v) − θi ) with θi = βT zi , in which case Φ⁻¹(Vi ) = Φ⁻¹(F (Yi )) ∼
N (θi , 1) and we may assume Fi = N (θi , 1) when we deal with rank tests. For this copula,
when d = 1 and θi = α + βzi , the locally UMP invariant (w.r.t. α, F ) test of H0 : β = 0
vs H1 : β > 0 is based on the normal scores statistic
TΦ = n^{−1/2} Σ_{i=1}^n aΦ (Ri )(zi − z̄) ,
where aΦ (k) = E(Z(k) ) with Z(1) < . . . < Z(n) the N (0, 1) order statistics (Problem
8.3.14(d)). The normal scores aΦ (k), which are also called the Fisher-Yates scores, can be
closely approximated by Φ⁻¹((k − 3/8)/(n + 1/4)).
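The score approximation and the statistic TΦ can be sketched as follows (Python's `statistics.NormalDist` supplies Φ⁻¹; the helper names are made up, and the scores are the approximation above rather than exact expected order statistics).

```python
from statistics import NormalDist

def approx_normal_scores(n):
    """Fisher-Yates scores via a_Phi(k) ~ Phi^{-1}((k - 3/8)/(n + 1/4)),
    an approximation to the exact expected order statistics E(Z_(k))."""
    inv = NormalDist().inv_cdf
    return [inv((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def t_phi(ranks, z):
    """T_Phi = n^{-1/2} sum_i a_Phi(R_i)(z_i - zbar)."""
    n = len(z)
    a = approx_normal_scores(n)
    zbar = sum(z) / n
    return sum(a[r - 1] * (zi - zbar) for r, zi in zip(ranks, z)) / n ** 0.5

scores = approx_normal_scores(5)
# The scores are antisymmetric about zero: a(k) = -a(n + 1 - k).
assert abs(scores[0] + scores[4]) < 1e-12
assert abs(scores[2]) < 1e-12
```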
The locally UMP invariant test for the logistic copula Ci (u) = L(L⁻¹(u) − θi ) with
L(t) = (1 + exp(−t))⁻¹ and θi = α + βzi is based on the uniform scores statistic
TU ≡ n^{−1/2} Σ_{i=1}^n (Ri /(n + 1))(zi − z̄) .
Critical values for the tests based on TΦ and TU can be obtained from the distribution of
the ranks or from their asymptotic distributions: TΦ /s and √12 TU /s are asymptotically N (0, 1)
under H with
s² = Σ_{i=1}^n (zi − z̄)²/(n − 1),
provided (zi ), 1 ≤ i ≤ n, satisfy the Lindeberg condition
max_i (zi − z̄)² / Σ_{i=1}^n (zi − z̄)² → 0 .
See Section 9.5 and Hajek and Sidak (1967), Section V.1.5.
✷
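Similarly, TU and its asymptotic standardization √12 TU /s can be sketched as below (hypothetical helper names; the data in the check are made up).

```python
def t_u(ranks, z):
    """T_U = n^{-1/2} sum_i (R_i/(n+1))(z_i - zbar)."""
    n = len(z)
    zbar = sum(z) / n
    return sum((r / (n + 1)) * (zi - zbar) for r, zi in zip(ranks, z)) / n ** 0.5

def z_statistic(ranks, z):
    """sqrt(12) T_U / s, approximately N(0, 1) under H."""
    n = len(z)
    zbar = sum(z) / n
    s = (sum((zi - zbar) ** 2 for zi in z) / (n - 1)) ** 0.5
    return (12 ** 0.5) * t_u(ranks, z) / s

z = [0.2, 1.0, 1.5, 2.3, 3.1]   # hypothetical predictor values
r = [2, 1, 4, 3, 5]             # hypothetical ranks of the responses
# Reversing the ranks flips the sign of T_U, since sum_i (z_i - zbar) = 0.
assert abs(z_statistic(r, z) + z_statistic([6 - ri for ri in r], z)) < 1e-9
```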
The parameter β in the model θi = βT zi in the previous example is only identifiable if
d ≤ n and (z1j , . . . , znj )T , 1 ≤ j ≤ d, are not collinear; see Problem 1.1.9. However,
modern technology has made it possible to generate data with d very large, so called “big
data.” In particular, d > n. We turn to this case:
Example 8.3.11 (continued). Invariant high dimensional data analysis. One approach to
the d > n case is to replace βT zi in the preceding copula model with aT zi = Σ_{j=1}^d aj zij ,
where the aj are selected to maximize the spread of {aT zi : 1 ≤ i ≤ n} subject to
aT a = 1. This makes sense if we regard aT zi as predictors because regression analysis
is most effective when the predictors are spread out. Thus we choose a to maximize the
sample variance of {aT zi : 1 ≤ i ≤ n}. That is, we use (see Section B.10.12 and Problem
8.3.24) the first sample eigenvector
â = arg max{aT Σ̂a : aT a = 1}
where Σ̂ is the sample covariance matrix of (zij )n×d . Let
ti = âT zi , 1 ≤ i ≤ n .
Here t1 , . . . , tn , which are called the first principal component sample values, can be computed using available principal component analysis (PCA) software, e.g. “eigensoft.” Now
we use the semiparametric copula regression submodel Fi (·) = Ci (F (·)) with Ci having
parameter ηi = α + βti . This is an example of a sparse model. Several principal components (Problem 8.3.24) can also be used. Typically, ten or fewer are used. The locally UMP
invariant PCA tests of H0 : β = 0 vs H1 : β > 0 for the Gaussian and logistic regression
copulas are the normal scores and uniform scores tests, respectively, with zi replaced by ti .
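The first sample eigenvector â can be computed, for illustration, by power iteration on Σ̂ (a sketch under the assumption of a dominant eigenvalue and with made-up data; real analyses would use PCA software, as the text notes).

```python
import random

def first_pc(Z, iters=500):
    """Leading eigenvector of the sample covariance of the rows of Z, by power
    iteration. A sketch of the PCA step only."""
    n, d = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in Z]  # centered data
    a = [random.random() + 0.1 for _ in range(d)]
    for _ in range(iters):
        Xa = [sum(X[i][j] * a[j] for j in range(d)) for i in range(n)]
        # a <- Sigma_hat a, computed as X^T (X a)/(n - 1) without forming Sigma_hat
        a = [sum(X[i][j] * Xa[i] for i in range(n)) / (n - 1) for j in range(d)]
        norm = sum(v * v for v in a) ** 0.5
        a = [v / norm for v in a]
    return a

random.seed(0)
# Hypothetical predictors lying near the line z2 = 2 z1.
Z = [[x, 2 * x + 0.01 * random.random()] for x in (0.1, 0.5, 0.9, 1.3, 1.7)]
a_hat = first_pc(Z)
t = [sum(zij * aj for zij, aj in zip(row, a_hat)) for row in Z]  # t_i = a^T z_i
assert abs(abs(a_hat[1] / a_hat[0]) - 2.0) < 0.05
```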
Remark 8.3.2. Σ̂ and the eigenvector â are known to be inconsistent estimates of their
population counterparts when d > n, unless the model for the data is restricted to a submodel. One approach is to assume that the predictor is a random vector Z ∈ Rd whose
covariance matrix Σ is restricted; see Bickel and Levina (2008 a, b). A similar approach is
to restrict the ordered population eigenvalues λ1 ≥ λ2 ≥ . . . (as defined in Section B.10.12
and Problem 8.3.24) of Σ. Such approaches are referred to as sparse PCA, e.g. see Patterson et al. (2006), Paul (2007), El Karoui (2008), Amini and Wainwright (2009), Jung and
Marron (2009), Johnstone and Lu (2009), Witten, Tibshirani and Hastie (2009), Shen, Shen
and Marron (2011), Birnbaum, Johnstone, Nadler and Paul (2013), and Hastie, Tibshirani,
and Wainwright (2015). These references focus on the estimation of the covariance matrix
of Z and its eigenvalues and eigenvectors. For papers that focus on the association between
individual predictors Zk and a response Y , see Price et al. (2006), Lin and Zheng (2011),
and Yang, Doksum and Tsui (2014).
✷
Example 8.3.11 (continued). Invariant tests in transformation models. Consider the transformation regression model with
h(Yi ) = θi + εi , 1 ≤ i ≤ n ,
where θi = βT zi , εi has a known continuous distribution G, and h is an increasing continuous unknown function. In this case Yi = h⁻¹(θi + εi ). Thus if G = Φ and we use
rank tests, we are in the realm of the Gaussian copula model and can use the normal scores
rank tests described earlier in this example. The proportional hazard and proportional odds
models are transformation regression models. See Problem 8.3.14.
Example 8.3.12. Invariant tests of independence. Bivariate copulas. Consider the problem of testing the independence of two random variables X and Y with continuous df
F (x, y). The testing problem is invariant under increasing continuous transformations
h(X), g(Y ), and for a sample (X1 , Y1 ), . . . , (Xn , Yn ) i.i.d. as (X, Y ) the maximal invariant
is (R1 , S1 ), . . . , (Rn , Sn ) where Ri = Σ_{k=1}^n 1(Xk ≤ Xi ) and Si = Σ_{k=1}^n 1(Yk ≤
Yi ) are the ranks. Sklar (1959) introduced models of the form
Fθ (x, y) = Cθ (F1 (x), F2 (y))
for some continuous df (copula) Cθ on [0, 1] × [0, 1] which is known except for the parameter θ, where F1 and F2 are the marginal df’s of X and Y .
A useful choice is the bivariate Gaussian copula
Cρ (u, v) = Φρ (Φ1⁻¹(u), Φ2⁻¹(v)), Φ1 = N (µ1 , σ1²),
Φ2 = N (µ2 , σ2²), Φρ = N (µ1 , µ2 , σ1², σ2², ρ) .
The locally UMP invariant test of H0 : ρ = 0 vs H1 : ρ > 0 is based on the bivariate
normal scores statistic Σ_{i=1}^n aΦ (Ri )aΦ (Si ) where aΦ (k), 1 ≤ k ≤ n, are the normal scores (see Problem 8.3.16). The null distribution is obtained by rearranging the pairs
(R1 , S1 ), . . . , (Rn , Sn ) so that R1 < R2 < . . . < Rn and noting that the resulting Y -ranks
S∗ has P (S∗ = s) = 1/n! for each permutation s of (1, . . . , n). Klaassen and Wellner
(1997) show that in the Gaussian copula model the normal scores correlation coefficient
ρ̂Φ = Σ_{i=1}^n aΦ (Ri )aΦ (Si ) / Σ_{i=1}^n aΦ (i)²
is an asymptotically semiparametrically efficient estimate of
ρΦ = Corr(Φ⁻¹(F1 (X)), Φ⁻¹(F2 (Y )))
for each ρ ∈ (−1, 1) uniformly in F1 , F2 over the class of all regular estimates. It can be
shown that for the bivariate normal copula model, |ρΦ | is the maximum monotone correlation coefficient (Problem 8.3.16) in the sense that
|ρΦ | = sup_{a,b} Corr(a(X), b(Y ))
for a and b monotone. Approximate critical values for H : ρ = 0 can be obtained from
n^{1/2} ρ̂Φ being asymptotically N (0, 1) under H.
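The estimate ρ̂Φ is straightforward to compute from the two rank vectors; the sketch below uses the usual score approximation rather than exact expected order statistics, and the helper names are made up.

```python
from statistics import NormalDist

def normal_scores(n):
    # Fisher-Yates scores via the standard approximation
    # Phi^{-1}((k - 3/8)/(n + 1/4)), not the exact expected order statistics.
    inv = NormalDist().inv_cdf
    return [inv((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

def rho_phi(R, S):
    """Normal scores correlation: sum_i a(R_i)a(S_i) / sum_i a(i)^2."""
    a = normal_scores(len(R))
    return (sum(a[r - 1] * a[s - 1] for r, s in zip(R, S))
            / sum(v * v for v in a))

R = [1, 2, 3, 4, 5]
assert abs(rho_phi(R, R) - 1.0) < 1e-9        # perfectly concordant ranks
assert abs(rho_phi(R, R[::-1]) + 1.0) < 1e-9  # perfectly discordant ranks
```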
For more on rank tests see the problems, Hajek and Sidak (1967), Lehmann (2006), and
Lehmann and Romano (2005).
8.3.4  Characterizing Equivariant Estimates
Characterizing equivariant estimates is relatively simple in the case of the shift group. Suppose as in Example 8.3.2 we observe X 1 , . . . , X n ∈ Rd i.i.d. f (· − θ) for f and θ
unknown. We want to estimate θ and our loss function is of the form λ(|θ − a|), with λ
increasing. As we argued for d = 1 an appropriately equivariant estimate obeys
δ(x1 + c, . . . , xn + c) = c + δ(x1 , . . . , xn ) .
(8.3.17)
Examples of equivariant estimates are δ ∗ = X̄, or (medi X1i , . . . , medi Xdi )T , or even
X1 . Note that if δ, δ ∗ are equivariant then δ − δ ∗ is invariant by (8.3.17). Thus, if M (x) is
a maximal invariant under G, any equivariant estimate can be written for some ψ as
δ(x1 , . . . , xn ) = δ ∗ (x1 , . . . , xn ) + ψ(M (x1 , . . . , xn )).
(8.3.18)
Taking δ ∗ (x1 , . . . , xn ) = x1 gives us a particularly simple expression in (8.3.18). From
this formula it is easy to deduce that there is a uniformly best equivariant estimate and
derive its form, at least for λ(t) = t2 and similar loss functions.
Theorem 8.3.1. If there exists an equivariant estimate for the shift group with finite risk,
then there is a unique (Uniform Minimum Risk Equivariant) UMRE estimate, given by
δ ∗ (x1 , . . . , xn ) = x1 − ψ ∗ (M (x))   (8.3.19)
where M is a maximal invariant and ψ ∗ (M ) = arg min_c E0 (λ(|X1 − c|) | M ).
Proof. Write, keeping f = f0 fixed,
R(θ, δ) = R(0, δ) = E0 λ(|X1 + ψ(M (X1 , . . . , Xn ))|) .   (8.3.20)
The reason for the first identity is that X 1 − θ and ψ(M (X 1 , . . . , X n )) have a distribution
not depending on θ. But
E0 λ(|X1 + ψ(M )|) = E0 {E0 (λ(|X1 + ψ(M )|) | M )} .
Since ψ is arbitrary the result follows.
✷
If we specialize to λ(t) = t² we readily obtain, by Theorem 1.4.1 (Problem 8.3.17),
δ ∗ (X1 , . . . , Xn ) = X1 − E0 (X1 | M ) .   (8.3.21)
If X1 has density f0 we can obtain a very suggestive form,
δ ∗ (X1 , . . . , Xn ) = ∫_{Rd} θ Π_{i=1}^n f0 (Xi − θ) dθ / ∫_{Rd} Π_{i=1}^n f0 (Xi − θ) dθ .   (8.3.22)
We derive (8.3.22). Let M (X1 , . . . , Xn ) = (X2 − X1 , . . . , Xn − X1 ). Then the
conditional density of X1 | M under θ = 0 is
f (u|M ) = f0 (u) Π_{i=1}^{n−1} f0 (mi + u) / ∫_{Rd} f0 (v) Π_{i=1}^{n−1} f0 (mi + v) dv .
So,
E0 (X1 |M ) = ∫_{Rd} u f0 (u) Π_{i=1}^{n−1} f0 (mi + u) du / ∫_{Rd} f0 (v) Π_{i=1}^{n−1} f0 (mi + v) dv
and
X1 − E0 (X1 |M ) = ∫_{Rd} (X1 − u) f0 (u) Π_{i=1}^{n−1} f0 (mi + u) du / ∫_{Rd} f0 (v) Π_{i=1}^{n−1} f0 (mi + v) dv .
We obtain (8.3.22) by changing variables, from u to θ = X1 − u in the numerator and from
v to θ = X1 − v in the denominator.
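For d = 1 the estimate (8.3.22) can be evaluated by direct quadrature; the sketch below (with hypothetical data) also checks the Gaussian case, where it reduces to the sample mean as in Example 8.3.13.

```python
import math

def pitman(xs, f0, lo=-20.0, hi=20.0, steps=4000):
    """The estimate (8.3.22) for d = 1 by midpoint quadrature:
    int theta prod_i f0(x_i - theta) dtheta / int prod_i f0(x_i - theta) dtheta."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * h
        w = 1.0
        for x in xs:
            w *= f0(x - theta)
        num += theta * w
        den += w
    return num / den

phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
xs = [0.4, 1.1, -0.3, 0.8]  # hypothetical observations
# For Gaussian f0 the Pitman estimate reduces to the sample mean.
assert abs(pitman(xs, phi) - sum(xs) / len(xs)) < 1e-6
```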
The UMRE estimate (8.3.22), known as Pitman’s estimate, is the (improper) Bayes estimate of θ when θ has prior “density” π(θ) ≡ constant, the “uniform distribution” on Rd.
In particular,
Example 8.3.13. Estimating the mean of a multivariate Gaussian distribution. Let X ∼
N (µ, σ0² Jp ) where Jp is the p × p identity matrix. By sufficiency, because X̄ is also Gaussian,
we can consider n = 1 and suppose X = θ + e where e ∼ Np (0, Jp ). Then we obtain readily that X is the UMRE estimate of θ, just as we have seen it is UMVU. We will
consider this and other estimates of µ in Section 8.3.6.
✷
8.3.5  Minimaxity for Tests: Application to Group Models
Recall Theorems 3.3.2 and 3.3.3 where we saw that minimax procedures δ ∗ were Bayes or
approximately Bayes with respect to prior distributions which concentrated on the set of
parameters that produce the maxima of the risk of δ ∗ . We have already noted that invariant
and equivariant procedures have large sets of constant risk, the orbits of G in Θ. Thus, best
equivariant procedures are natural candidates for minimaxity.
Testing:
We begin with a general study of minimaxity for testing. Here, the asymmetry of the two
types of error leads to an asymmetric formulation of minimaxity not covered in Chapter 3.
Definition 8.3.1. ϕ∗ is minimax level α for testing H : θ ∈ Θ0 vs K : θ ∈ Θ1 iff ϕ∗
is level α and
inf{Eθ ϕ∗ (X) : θ ∈ Θ1 } ≥ inf{Eθ ϕ(X) : θ ∈ Θ1 }
for all other level α tests ϕ.
Thus the minimax test maximizes the minimum power, where for 0 − 1 loss, power = 1 − risk. Such a ϕ∗ is sometimes called a maxmin test.
Suppose we are given a prior distribution Π for θ on Θ0 ∪ Θ1 ≡ Θ such that
i) Π(Θ0 ) = λ = 1 − Π(Θ1 ), 0 < λ < 1,
ii) θ given θ ∈ Θj has distribution Πj , j = 0, 1,
and a loss function lu given for 0 < u < 1 by
lu (θ, 1) = u1(θ ∈ Θ0 ), lu (θ, 0) = 1(θ ∈ Θ1 ) .
From a simple extension of Example 3.2.2 (Problem 8.3.18) we see that Bayes tests are of
the form, for 0 < λ < 1,
ϕc (x) = 1 if ∫_{Θ1} p(x, θ)Π1 (dθ) / ∫_{Θ0} p(x, θ)Π0 (dθ) > c
       = 0 if the inequality is reversed   (8.3.23)
where c = λu/(1 − λ). The argument in Example 3.2.2 makes it clear that (8.3.23) doesn’t
depend on the finiteness of Θ. We can apply Theorem 3.3.2 to obtain
Theorem 8.3.2. Suppose ϕc given by (8.3.23) is size α, and that
Π0 {θ : Eθ ϕc (X) = sup_{θ′∈Θ0} Eθ′ ϕc (X)} = 1
Π1 {θ : Eθ (1 − ϕc (X)) = sup_{θ′∈Θ1} Eθ′ (1 − ϕc (X))} = 1 .   (8.3.24)
Then, ϕc is minimax level α.
Proof. Let β = sup_{θ∈Θ1} Eθ (1 − ϕc (X)) and
u = α/β,  λ/(1 − λ) = c/u .   (8.3.25)
Note that ϕc is Bayes for Π, lu and that, for R(θ, ϕc ) = Eθ lu (θ, ϕc (X)),
sup_{Θ0} R(θ, ϕc ) = sup_{Θ1} R(θ, ϕc ) = sup_{Θ} R(θ, ϕc ) .   (8.3.26)
Now, ϕc Bayes, (8.3.26), and Theorem 3.3.2 yield that ϕc is minimax for lu and that the
Bayes risk equals the maximum risk. Then, given a competing ϕ of level α, by hypothesis,
λα + (1 − λ)uβ = λ ∫ Eθ ϕc (X)Π0 (dθ) + (1 − λ)u ∫ Eθ (1 − ϕc (X))Π1 (dθ)
≤ λ ∫ Eθ ϕ(X)Π0 (dθ) + (1 − λ)u ∫ Eθ (1 − ϕ(X))Π1 (dθ)
≤ λα + (1 − λ)u sup_{Θ1} Eθ (1 − ϕ(X))
and the result β ≤ sup_{Θ1} Eθ (1 − ϕ(X)) follows.
Corollary 8.3.1. If Θ1 = {θ1 }, ϕc is given by (8.3.23) and the first condition of (8.3.24)
holds; then ϕc is UMP level α for H : θ ∈ Θ0 vs K : θ = θ1 .
It may be shown that the conditions of Theorem 8.3.2 and Corollary 8.3.1 are also necessary for minimaxity and UMP properties of test rules under regularity conditions on the
model P and compactness of Θ0 , Θ1 (Blackwell and Girshick (1954, 1979)).
Example 8.3.14. Minimaxity in testing in the Gaussian linear model with σ 2 known.
H : µ = 0 vs K : |µ| = r0 . Consider the test of Example 8.3.9, which we have shown
to be UMP invariant under the orthogonal group. Note that the power function of this test
depends on |µ|/σ0 only and hence (8.3.24) holds automatically. We need to exhibit Π0 and
Π1 . Without loss of generality take σ0 = 1. Evidently Π0 is point mass at 0. For Π1 it is
natural to choose the uniform distribution on the sphere surface {µ : |µ| = r0 }. Then
L(X) ≡ ∫_{Θ1} [p(X, µ)/p(X, 0)] dΠ1 (µ) = e^{|X|²/2} E(e^{−|X−r0 U|²/2} | X)   (8.3.27)
where X, U are independent, X ∼ N (0, J), and U is uniform on the surface of the unit
sphere. Now (8.3.27) becomes
L(X) = e^{−r0²/2} E(e^{r0 (X,U )} | X) = e^{−r0²/2} E(e^{r0 |X|(X/|X|, U )} | X) .
Since the distribution of U is invariant under rotations we can replace X/|X| by, say,
(1, 0, . . . , 0)T without changing the result. Thus L(X) is a monotone increasing function
of |X|. We conclude that the UMP invariant test is of the form (8.3.23) and hence we can
apply Theorem 8.3.2 to conclude minimaxity.
✷
Example 8.3.15. A UMP test for a composite multivariate Gaussian hypothesis. Let
X1 , . . . , Xn be i.i.d. Np ((ϑ, η T )T , Σ0 ). The scalar ϑ and η ≡ (η2 , . . . , ηp )T are unknown; Σ0 is nonsingular but known. We wish to test H : ϑ = ϑ0 vs K : ϑ = ϑ1 > ϑ0 .
Note that both hypothesis and alternative are composite and multivariate so that the UMP
test theory of Chapter 4 doesn’t apply. First reduce by sufficiency to the case n = 1 by considering X̄. Consider the case Σ0 = J. Define Π0 , Π1 as corresponding to point masses at
(ϑ0 , η 0 )T and (ϑ1 , η 0 )T . Then, if X ≡ (X1 , . . . , Xp )T ,
L(X) ≡ ∫_{Θ1} p(X, θ)dΠ1 (θ) / ∫_{Θ0} p(X, θ)dΠ0 (θ)
     = exp{−(1/2)[(X − θ1 )T (X − θ1 ) − (X − θ0 )T (X − θ0 )]}
where θ j = (ϑj , η2 , . . . , ηp )T , j = 0, 1. Then,
log L(X) = (ϑ1 − ϑ0 )X1 − (1/2)(|θ1 |² − |θ0 |²)   (8.3.28)
and thus the family of Bayes tests ϕc for the loss function lu (θ, ϕ) is equivalent in this case
to the family
{Reject H iff X1 > d} .
But under H, X1 ∼ N (ϑ0 , 1). The conditions of Corollary 8.3.1 are clearly satisfied since
the tests have power depending on ϑ only. It follows (Problem 8.3.19(a)) that the size α
test for this hypothesis is, in fact, UMP for H : ϑ ≤ ϑ0 vs K : ϑ > ϑ0 . Now consider the
case of general Σ0 . Make the well-specified linear transformation,
X → U = (X1 , (X[2,p] − Cov(X1 , X[2,p] )(Var X1 )⁻¹X1 )T )T ,   (8.3.29)
where X[2,p] ≡ (X2 , . . . , Xp )T . Note that ΣR ≡ Cov(X1 , X[2,p] ) = (σ12 , . . . , σ1p ) and
Var X1 = σ11 , so that the second term has dimension p − 1. Write U ≡ (U1 , . . . , Up )T ≡ AX and
Σ0 = [ σ11  Σ12
       Σ21  Σ22 ] .
It follows that U ∼ Np (µU , AΣ0 AT ) where µU ≡ Aθ and θ = (ϑ, η T )T . Since the first
coordinate of µU is ϑ, we still test H : ϑ = ϑ0 vs K : ϑ = ϑ1 . Most significantly, AΣ0 AT is of the form
[ σ11  0T
  0    Σ∗22 ]
where Σ∗22 = Var((U2 , . . . , Up )T ). It is easy to see (Problem 8.3.19(b)) that in this case, just
as with the special case Σ0 = J = diag(1, . . . , 1), the Bayes test with the same prior as
before is of the form {Reject H iff X1 > c}. Therefore this test is UMP for H : ϑ ≤ ϑ0
vs K : ϑ > ϑ0 no matter what Σ0 is. This is the likelihood ratio test and is both UMP
invariant and unbiased (Problem 8.3.19(c),(d)).
Remark 8.3.3.
(a) It turns out, see Lehmann and Romano (2005), that some of the classical likelihood
ratio tests for H : σ ≤ σ0 , etc. when we observe X1 , . . . , Xn i.i.d. N (µ, σ 2 ) are
UMP and not just UMP unbiased, as we saw in Section 8.2, but most are not.
(b) The theory of minimax testing and, as we shall see, estimation in group models is
very closely tied to the existence of (invariant) Haar measures on groups—see Nachbin (1965) for a treatment. If G can be smoothly indexed by a Euclidean parameter
θ ∈ K, a compact set in Rd , these can be taken as probability distributions, PL , PR
on G. They have the property that if A (measurable) ⊂ G then
PL (gA) = PL (A)
PR (Ag) = PR (A)
for all A and g ∈ G. Here gA = {h = gb, b ∈ A} and similarly for Ag. The
pair PL , PR are unique, although PL ≠ PR is possible if G is not commutative. If
G = {gθ : θ ∈ Θ} and Ḡ is the group induced on Θ, we can produce a prior distribution on any orbit of Ḡ by considering ḡθ0 for θ0 fixed and g ∼ PL or PR . It turns out
that PR is the Haar measure important in our context. In our Example 8.3.9, where G
is the orthogonal group, indeed PL = PR is the uniform distribution on G and not surprisingly this generates the uniform prior on a sphere (an orbit of G). If G is not compact, improper Haar measures arise. We discuss this further in the next subsection.
A very complete treatment of the theory of UMP and minimax tests is given by Lehmann
and Romano (2005).
8.3.6
Minimax Estimation, Admissibility, and Steinian Shrinkage
We have discussed minimax estimation briefly in Chapter 3. In this section we want to go
further and foreshadow some phenomena discussed in Chapters 10 and 11. We begin by
recalling Section I.6 in simplified form.
Suppose X1 , . . . , Xp are independent Gaussian with common known variance σ02 and
EXi ≡ µi unknown and arbitrary. Equivalently, X ∼ Np (µ, σ02 J) where J = diag(1, . . . ,
1). We want to estimate µ with quadratic loss. That is, if d = (d1 , . . . , dp )T ,

    l(µ, d) = |µ − d|2 = Σ_{i=1}^p (µi − di )2 .

This is a special case of Example 8.3.2 with f0 ↔ Np (0, σ02 J).
It is easy to see that Pitman’s estimator here is X itself since M (X) is degenerate (n = 1)
and so E0 (X|M (X)) = E0 X = 0. We know that X is UMVU and that, by Problem
3.3.20, X is minimax. If we have a multivariate normal sample, X̄ is Gaussian, so the
remarks about X also apply to X̄.
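The constant risk property R(µ, X) = p underlying these remarks is easy to see by simulation. The sketch below is our illustration, not from the text; it assumes numpy and estimates the quadratic risk of X at two values of µ.

```python
import numpy as np

# Illustrative sketch (not from the text): for X ~ N_p(mu, I) the quadratic
# risk E|X - mu|^2 equals p for every mu, i.e., X has constant risk.
rng = np.random.default_rng(0)

def mc_risk(estimator, mu, n_rep=20000):
    """Monte Carlo estimate of E|delta(X) - mu|^2 for X ~ N_p(mu, I)."""
    X = mu + rng.standard_normal((n_rep, len(mu)))
    est = np.apply_along_axis(estimator, 1, X)
    return float(np.mean(np.sum((est - mu) ** 2, axis=1)))

p = 5
for mu in (np.zeros(p), np.full(p, 3.0)):
    print(round(mc_risk(lambda x: x, mu), 2))  # both close to p = 5
```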
Remark 8.3.4. As with testing, minimaxity and equivariance here can be tied to Haar’s
measure on the shift group. X1 , . . . , Xp are the formal Bayes estimates with respect to
the “uniform distribution” on Rp . Since ∫_{−∞}^∞ · · · ∫_{−∞}^∞ dt1 . . . dtp = ∞, the uniform is an
improper prior. It corresponds to a Haar measure which is not a probability distribution. If
λ(A) ≡ ∫_A dt1 . . . dtp , then λ(A + c) = λ(c + A) = λ(A). Corresponding Haar measures
on scale and other groups generate priors, usually improper, which play an important role in
Bayesian inference, being viewed as “uninformative.” Various arguments can be raised for
their use including frequentist efficiency. There is now little controversy in the literature
that these methods are as legitimate as maximum likelihood. There is more controversy
about the nonsubjective interpretation of posterior probabilities calculated under such assumptions. Arguments in favor of the subjective point of view may be found in Bernardo
and Smith (1994), Berger (1985), and Gelman, Carlin, Stern and Rubin (1995).
✷
We introduce a decision theoretic notion now only rarely considered which played an
important role in the discovery of the practical weaknesses of the minimax principle. A decision rule δ ∗ in a given decision problem is said to be admissible iff there exists no other
δ such that R(P, δ) ≤ R(P, δ ∗ ) for all P ∈ P with < for some P . Otherwise δ ∗ is said
to be inadmissible. By refining the methods of Section 3.3 it is possible to obtain usable
criteria for proving admissibility (Blyth (1951)). This is also possible using the information inequality (Hodges and Lehmann (1951)). This work and relations to complete class
theorems such as those mentioned in Section 1.3 are discussed in some detail in Lehmann
and Casella (1998)—see also Schervish (1995).
We do not proceed further into the depths of this topic but make the trivial point that if a
procedure has constant risk and is minimax but inadmissible then any improving procedure
is necessarily also minimax. Stein’s remarkable (1956(b)) discovery was that although for
the risk function E(|X − µ|2 ), X is minimax for all p, it is admissible only for p = 1, 2.
In a sense this can be viewed as an indication that Robbins’ (1964) empirical Bayes view
that, for p large, X is improvable in fact already holds for p > 2.
The basic tool in this analysis which will prove its value more broadly is an identity due
to Stein (1981) giving what has become known as Stein’s unbiased risk estimate.
Theorem 8.3.3. Let X ∼ N (µ, J) and δ : Rp → Rp be an estimate of µ. Define
    ∆(x) = δ(x) − x .   (8.3.30)

Suppose that, for 1 ≤ j ≤ p, the jth component of ∆ is piecewise continuously differentiable with respect to xj and

    (i) Eµ |∆(X)|2 = Σ_{j=1}^p Eµ ∆2j (X) < ∞

and

    (ii) Σ_{j=1}^p Eµ [ |(∂∆j /∂xj )(X)| |Xj − µj | ] < ∞ .

Then, the quadratic loss risk of δ is given by

    Eµ |δ(X) − µ|2 = p + Eµ |∆(X)|2 + 2Eµ Σ_{j=1}^p (∂∆j /∂xj )(X) .   (8.3.31)
Proof. The proof relies on a lemma important not only for this result but, more generally, as
the foundation of a large body of results in probability referred to as the Stein-Chen method
(Stein (1981)).
Lemma 8.3.1. Let X ∼ N (µ, σ 2 ). Let g : R → R be a continuous, piecewise differentiable function such that
(i) E|g ′ (X)| < ∞,
(ii) E|(X − µ)g(X)| < ∞.
Then,

    Eg ′ (X) = E [ ((X − µ)/σ 2 ) g(X) ] .   (8.3.32)
Proof. Use integration by parts to get

    ∫_{−A}^A g ′ (x) (1/σ) ϕ((x − µ)/σ) dx
      = g(x) (1/σ) ϕ((x − µ)/σ) |_{−A}^A − ∫_{−A}^A g(x) (1/σ 2 ) ϕ′ ((x − µ)/σ) dx
      = g(x) (1/σ) ϕ((x − µ)/σ) |_{−A}^A + ∫_{−A}^A g(x) ((x − µ)/σ 2 ) (1/σ) ϕ((x − µ)/σ) dx

and the result follows since g(x)ϕ((x − µ)/σ) |_{−A}^A → 0 as A → ∞.

✷
The reason for the utility of this elementary lemma is that it holds for arbitrary g. A
simple generalization provides a unique characterization of N (µ, r2 ). Condition (ii) may
be dispensed with (Stein (1981)).
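Since the lemma holds for arbitrary g, it is easy to check numerically for a particular choice. The sketch below is ours, not from the text; it takes g(x) = x³ (our choice) and compares the two sides of (8.3.32) by Monte Carlo.

```python
import numpy as np

# Numerical sketch of Stein's identity (8.3.32) with g(x) = x^3 (our choice):
# E g'(X) should equal E[(X - mu) g(X)] / sigma^2 for X ~ N(mu, sigma^2).
rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
X = rng.normal(mu, sigma, size=500_000)

lhs = np.mean(3 * X**2)                    # E g'(X)
rhs = np.mean((X - mu) * X**3) / sigma**2  # E[(X - mu) g(X)] / sigma^2
print(round(lhs, 1), round(rhs, 1))        # both near 3(sigma^2 + mu^2) = 15
```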
Proof of Theorem 8.3.3. By assumption,

    Eµ |δ(X) − µ|2 = Eµ |X − µ|2 + 2Eµ ∆T (X)(X − µ) + Eµ |∆(X)|2
      = p + 2 Σ_{j=1}^p ∫_{−∞}^∞ · · · ∫_{−∞}^∞ ∆j (x)(xj − µj ) Π_{i=1}^p ϕ(xi − µi ) dxi + Eµ |∆(X)|2 .

Write

    ∫_{−∞}^∞ · · · ∫_{−∞}^∞ ∆j (x)(xj − µj ) Π_{i=1}^p ϕ(xi − µi ) dxi
      = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ Π_{i≠j} ϕ(xi − µi ) dxi ∫_{−∞}^∞ ∆j (x)(xj − µj ) ϕ(xj − µj ) dxj .

Apply Stein’s identity (8.3.32) to ∆j (x) viewed as a function of xj with {xi : i ≠ j} fixed.
The theorem follows.

✷
This identity does indeed provide an unbiased estimate of the risk (mean squared error)
of any estimate δ satisfying the conditions of Theorem 8.3.3. It can be reinterpreted as
saying that

    R̂ ≡ p + |∆(X)|2 + 2 Σ_{j=1}^p (∂∆j /∂xj )(X)   (8.3.33)

is an unbiased estimate of R(µ, δ) ≡ Eµ |δ(X) − µ|2 . The application of (8.3.33) to
studying inadmissibility of X with respect to a competitor δ(X) is given by
Corollary 8.3.2. Suppose δ satisfies the conditions of Theorem 8.3.3, and ∆ defined by
(8.3.30) is such that

    2 Σ_{j=1}^p (∂∆j /∂xj )(x) + |∆|2 (x) ≤ 0 ,   (8.3.34)

with strict inequality on a set of positive probability. Then for all µ,

    R(µ, δ(X)) < R(µ, X) = p .   (8.3.35)

Note that still,

    sup_µ R(µ, δ(X)) = p .   (8.3.36)
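As an illustration of (8.3.33), the following sketch (ours, not the text's; p = 6 and µ = (1, . . . , 1) are arbitrary choices) computes R̂ for the shrinkage correction ∆(x) = −(p − 2)x/|x|², which appears below, and compares its average with the realized quadratic loss.

```python
import numpy as np

# Sketch of (8.3.33) (our illustration): for the shrinkage correction
# Delta(x) = -(p - 2) x / |x|^2, the unbiased risk estimate R_hat has the
# same mean as the realized quadratic loss of delta(X) = X + Delta(X).
rng = np.random.default_rng(2)
p, n_rep = 6, 100_000
mu = np.ones(p)
X = mu + rng.standard_normal((n_rep, p))

norm2 = np.sum(X**2, axis=1)
Delta = -(p - 2) * X / norm2[:, None]
# divergence: sum_j dDelta_j/dx_j = -(p - 2)^2 / |x|^2
div = -((p - 2) ** 2) / norm2
R_hat = p + np.sum(Delta**2, axis=1) + 2 * div

loss = np.sum((X + Delta - mu) ** 2, axis=1)
print(round(R_hat.mean(), 2), round(loss.mean(), 2))  # the two agree, both < p
```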
Stein’s original (1956(b)) proposal of an estimate satisfying (8.3.34) for p ≥ 3 was

    δ s (X) = ( 1 − (p − 2)/|X|2 ) X ,   (8.3.37)

for which (8.3.35) was established.
The nature of δ s is interesting. It always lies along the vector X. For large values
of |X| it essentially coincides with X, but as |X|2 decreases to p − 2 it is shrunk more and more
towards 0 and, for 0 < |X|2 < p − 2, it points in the direction opposite to that of X.
The last property is patently unreasonable and, indeed, Stein (1981) showed that the Stein
positive part estimate,

    δ s+ (X) = ( 1 − (p − 2)/|X|2 )+ X ,   (8.3.38)

strictly improves δ s (X) and hence X itself. Here x+ ≡ max(x, 0). δ s+ (X) has a very
natural interpretation. The standard test of µ = 0, “Reject iff |X|2 > p − 2,” is carried out.
If the test accepts, estimate µ by 0; else use [1 − (p − 2)/|X|2 ]X.
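A small simulation (ours, not the text's; p = 10 and µ = 0 are arbitrary choices) illustrates the improvements just described: δ s beats X, and the positive part estimate beats δ s.

```python
import numpy as np

# Simulation sketch (ours) of (8.3.37)-(8.3.38): for X ~ N_p(mu, I), p >= 3,
# delta_s improves on X and the positive part estimate improves on delta_s.
rng = np.random.default_rng(3)

def delta_s(x):
    return (1 - (len(x) - 2) / np.sum(x**2)) * x

def delta_s_plus(x):
    return max(1 - (len(x) - 2) / np.sum(x**2), 0.0) * x

def mc_risk(est, mu, n_rep=40_000):
    X = mu + rng.standard_normal((n_rep, len(mu)))
    return float(np.mean([np.sum((est(x) - mu) ** 2) for x in X]))

mu = np.zeros(10)  # p = 10; the gains are largest near mu = 0
r_x = mc_risk(lambda x: x, mu)
r_s = mc_risk(delta_s, mu)
r_sp = mc_risk(delta_s_plus, mu)
print(round(r_x, 2), round(r_s, 2), round(r_sp, 2))  # decreasing
```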
Next we establish Stein’s result for δ s using his identity. Here,

    ∆s (X) = − ((p − 2)/|X|2 ) X ,
    ∂∆js /∂xj = −(p − 2) ( 1/|X|2 − 2x2j /|X|4 ) .

Thus, if ∆s ≡ (∆1s , . . . , ∆ps )T ,

    |∆s |2 + 2 Σ_{j=1}^p (∂∆js /∂xj )(x)
      = (p − 2)2 /|X|2 − 2p(p − 2)/|X|2 + 4(p − 2)/|X|2
      = (p − 2)(2 − p)/|X|2 < 0 if p ≥ 3   (8.3.39)
and Corollary 8.3.2 shows that δs renders X inadmissible. The same argument establishes
that δs+ renders X inadmissible. Unfortunately showing that δs+ renders δs inadmissible
requires showing that the difference of their risks satisfies

    Eµ [ |∆s |2 − |∆s+ |2 + 2 Σ_{j=1}^p ∂(∆sj − ∆sj+ )/∂xj ] ≥ 0 .   (8.3.40)

Here ∆s+ , ∆sj+ are defined analogously to ∆s , ∆sj with δ s replaced by δ s+ .
An enormous literature has grown up from Stein’s results. One of the major conclusions of the theory is that “shrinking” towards linear submodels, not just 0, is an improvement for relatively small values of p. An important method in applied regression analysis,
“Ridge regression,” invented independently, successfully utilizes shrinkage towards regression models of dimension k < p if there are p regressors. A very extensive discussion of
this literature and many more results may be found in Lehmann and Casella (1998). We
shall return to this topic in Chapters 10 and 11. We close this section, following Efron and
Morris (1973), by showing how δs may be motivated from a parametric empirical Bayes
point of view.
Empirical Bayes and the James-Stein estimates δ s
Suppose X ∼ Np (µ, J) as we have assumed throughout this subsection. Now, following the empirical Bayes point of views of Section I.6, suppose µ1 , . . . , µp are i.i.d. N (0, σ 2 )
with σ 2 unknown. If σ 2 were known the Bayes estimate of µ is seen from Example 3.2.1
to be
    δ B (X) = (δB (X1 ), . . . , δB (Xp ))T

where δB (x) = σ 2 x/(1 + σ 2 ). Can we estimate σ 2 from our data? Note that, unconditionally, the Xi are i.i.d. N (0, 1 + σ 2 ). Thus σ̂ 2 ≡ |X|2 /p − 1 is the UMVU and MLE of σ 2 , and
we are led to

    δ̂ B (X) ≡ ( 1 − p/|X|2 ) X .   (8.3.41)

This is close to but not quite the James-Stein (1961) estimate. It may in fact be shown, see
Theorem 5.1 of Lehmann and Casella (1998), for instance, that δc (X) ≡ ( 1 − c(p − 2)/|X|2 ) X
uniformly improves X provided that p ≥ 3 and 0 < c < 2. Thus, the estimate (8.3.41)
uniformly improves X for p > 4.
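The claim about the family δc can be illustrated numerically. The sketch below is ours (p = 8 and µ are arbitrary choices); the empirical Bayes estimate (8.3.41) is δc with c = p/(p − 2).

```python
import numpy as np

# Sketch (ours) of delta_c(X) = (1 - c(p-2)/|X|^2) X.  The empirical Bayes
# estimate (8.3.41) is delta_c with c = p/(p - 2), which lies in (0, 2)
# exactly when p > 4.
rng = np.random.default_rng(4)

def delta_c(x, c):
    p = len(x)
    return (1 - c * (p - 2) / np.sum(x**2)) * x

def mc_risk(est, mu, n_rep=40_000):
    X = mu + rng.standard_normal((n_rep, len(mu)))
    return float(np.mean([np.sum((est(x) - mu) ** 2) for x in X]))

p = 8
mu = np.full(p, 1.0)
r_x = mc_risk(lambda x: x, mu)
r_eb = mc_risk(lambda x: delta_c(x, p / (p - 2)), mu)  # the estimate (8.3.41)
print(round(r_x, 2), round(r_eb, 2))  # r_eb noticeably below r_x = p
```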
Remark 8.3.5. We have seen that the Stein-type estimators of a vector parameter with
p ≥ 3 have a smaller average mean squared error than that of the usual unbiased estimator.
However, the mean square error of any one of the components in the Stein vector estimate
may be larger than that of the usual unbiased estimate. See Rao and Shinozaki (1978).
Summary: This section explores connections between equivariance, invariance, minimax
decision rules, and group models. We consider models generated by location groups,
location-scale groups, affine groups, and monotone transformation groups. It is found that
when we restrict decision rules to those that have the same equivariance properties as the
groups generating the models, we can in some cases derive uniformly minimum risk equivariant estimates and equivariant minimax tests. Testing problems that are invariant with
respect to the group of increasing transformations lead to rank tests as the invariant tests.
We consider rank tests that are locally UMP invariant for semiparametric group submodels
defined by copulas. We also discuss weaknesses of the “minimax equivariant” principle by presenting results of Stein that show the inadmissibility of such minimax procedures.
8.4
PROBLEMS AND COMPLEMENTS
Problems for Section 8.2
1. Let X ∼ B(n, θ). Show that X(n − X)/n(n − 1) is the UMVU estimate of θ(1 − θ).
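A Monte Carlo check of the unbiasedness asserted in Problem 1 (our sketch, not a solution; the problem asks for the stronger UMVU property):

```python
import numpy as np

# Monte Carlo sketch (ours) of the unbiasedness in Problem 1:
# for X ~ B(n, theta), E[X(n - X)/(n(n - 1))] = theta(1 - theta).
rng = np.random.default_rng(5)
n, theta = 12, 0.3
X = rng.binomial(n, theta, size=500_000)
est = X * (n - X) / (n * (n - 1))
print(round(est.mean(), 3), theta * (1 - theta))  # both near 0.21
```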
2. Variable Selection. Methods for selecting the covariates to be included in a regression
model are often based on unbiased estimates of mean squared estimation error (or equivalently (Problem I.7.5), mean squared prediction error).
(a) In Problem I.7.3, show that R̂(µ, µ̂1 ) − R̂(µ, µ̂0 ) is an UMVU estimate of R(µ, µ̂1 ) −
R(µ, µ̂0 ) when model (I.8.1) holds.
(b) In Problem I.7.4, show that R̂p − R̂q is an UMVU estimate of Rp − Rq when model
(I.8.2) holds.
3. Let X1 , . . . , Xn be a sample from U(θ1 , θ2 ) where θ1 , θ2 are unknown.
(a) Show that T (X) = (min(X1 , . . . , Xn ), max(X1 , . . . , Xn )) is sufficient.
(b) Assuming that T (X) is complete, find a UMVU estimate of (θ1 + θ2 )/2.
(c) Show that T (X) is complete.
4. Let N = (N1 , . . . , Nk ) have a M(n, θ1 , . . . , θk ) distribution, k ≥ 2, θ1 , . . . , θk unknown. Find a UMVU estimate of θ2 − θ1 .
5. Let X1 , . . . , Xn be a sample from a N (µ, σ 2 ) population.
(a) Show that if σ 2 is known and µ is not, then X̄ is an UMVU estimate of µ.
(b) Show that the UMVU estimate of σ 2 if µ = µ0 is known, and σ 2 is not, is

    (1/n) Σ_{i=1}^n (Xi − µ0 )2 .
6. Let X1 , . . . , Xn be as in Problem 4.2.5 and σ = 1. Find the UMVU estimate of
Pµ [X1 ≥ 0] = Φ(µ).
Hint. (X1 , X̄) has a bivariate normal distribution. Apply Theorem B.4.2.
7. Let X1 , . . . , Xn be a sample from a P(θ) distribution. Find the UMVU estimate of
Pθ [X1 = 0] = e−θ .
8. Let X1 , . . . , Xn be a sample from a Γ(p, λ) population where both p and λ are unknown.
Find the UMVU estimate of p/λ.
9. Let X = (X1 , . . . , Xn ) be a N (µ, σ 2 ) sample. Show that if n ≥ 2, X though sufficient
is not complete.
Hint. Consider S(X) = X2 − X1 .
More instructive counterexamples may be found in Stigler (1972). See Problem 18.
10. Let X = (X1 , . . . , Xn ) be a U(0, θ] sample, θ > 0. Show that Mn = X(n) is the MLE
of θ while T ∗ = [(n + 1)/n]Mn is the UMVU estimate of θ. Show that, if R denotes mean
squared error, there exists an estimate T such that R(θ, T ) < min{R(θ, Mn ), R(θ, T ∗ )}.
Thus Mn and T ∗ are inadmissible.
Hint. Consider T = [(n + 2)/(n + 1)]Mn . Note that R(θ, T ) = θ2 /(n + 1)2 .
11. Suppose that T1 and T2 are two UMVU estimates of q(θ) with finite variances. Show
that T1 = T2 .
Hint. If T1 and T2 are unbiased so is (T1 + T2 )/2. Use the correlation inequality.
Problems 12, 13, and 14 give UMVU estimates of genetic trait parameters
12. Consider the genetic example of Problem 2.4.6 where X has the zero truncated binomial distribution B1 (n, θ) with
    p(x, θ) = \binom{n}{x} θ x (1 − θ)n−x / [1 − (1 − θ)n ] ,  x = 1, . . . , n .
(a) Show that X is complete and sufficient for θ.
(b) Show that the expected number of family members with the trait given that at least
one has the disease is Eθ (X) = nθ/[1 − (1 − θ)n ] and thus conclude that (X/n) is
the UMVU estimate of q(θ) = θ/[1 − (1 − θ)n ].
13. For reasons similar to those of Problem 2.4.6 the zero truncated Poisson distribution
    p(x, θ) = [θ x e−θ /x!] / (1 − e−θ ) ,  x = 1, 2, . . . ,
is sometimes used as a model. Note that p(x, θ) = P (Y = x|Y ≥ 1) where Y ∼ P(θ).
Show that the UMVU estimate of q(θ) = Pθ (Y ≥ 1) = 1 − e−θ is given by
T ∗ = 0 if x is odd
= 2 if x is even.
Hint. Show that if Eθ T (X) = q(θ) then T must equal T ∗ .
14. Consider the model of Problems 2.4.6 and 8.2.12. Suppose that among the offspring of
k sisters, exactly one of the children has the disease. Let X denote the number of children
with the trait in the family in which one child has the disease; let Yi denote the number of
children with the trait in the ith of the other families; and let θ denote the true proportion of
children with the trait. A reasonable model is one in which X, Y1 , . . . , Yr are independent,
X has the Br (n, θ) distribution of Problem 8.2.2, and Yi ∼ B(ni , θ), where r = k − 1
and n, n1 , . . . , nr denotes the number of offspring in the families. Show that the UMVU
estimate of θ is
n+N −1
−1
− N
T −1
T −1
n+N
− N
T
T
where
T =X+
r
X
i=1
and N =
r
X
ni .
i=1
Pr
Hint. Use Theorem 8.2.2 to show that T is complete and sufficient. Write i=1 Yi as
PN
i=1 Zi where Z1 , . . . , ZN are the indicators of independent B(1, θ) events. Now compute
E(Z1 |T ).
15. Estimating the Probability of Early Failure. It is sometimes reasonable to think of the
lifetime (time to first repair) of a piece of equipment as a random variable following an
exponential distribution with parameter λ which is unknown. Suppose n “identical” pieces
of equipment are run and the failure times X1 , . . . , Xn are observed. We want to estimate
the probability of early failure, that is, Pλ [X1 ≤ x] = 1 − e−λx for some fixed x. Show
that the UMVU estimate is

    T ∗ = 1 − ( 1 − x / Σ_{i=1}^n Xi )^{n−1}   if Σ_{i=1}^n Xi ≥ x ,
        = 1 otherwise.
Hint. T = Σ_{i=1}^n Xi is complete, sufficient. Set S(X1 ) = 1[X1 ≤ x]; then

    E[S(X1 )|T = t] = P [X1 ≤ x|T = t] = P [ X1 /T ≤ x/T | T = t ] .

By the substitution theorem for conditional expectations, (B.1.16), the right hand side
equals P (X1 /T ≤ x/t | T = t). Now, X1 /T is independent of T and has a beta,
β(1, n − 1), distribution because X1 and (X2 + · · · + Xn ) are independent and the second
of these variables has, by Corollary B.2.2, a Γ(n − 1, λ) distribution; next apply Theorem
B.2.3. Therefore, if b1,n−1 denotes the β(1, n − 1) density,

    E[S(X1 )|T = t] = ∫_0^{x/t} b1,n−1 (u) du .

Since b1,n−1 (u) = (n − 1)(1 − u)n−2 for 0 < u < 1, the result follows.
16. In Problem 15, consider the UMVU estimate T ∗ and the MLE T̃ = 1 − exp[−x/X̄].
Show that as n → ∞, √n (T ∗ − T̃ ) →P 0.
17. Let X1 , . . . , Xn be a sample from a N (µ, σ 2 ) population. We want to estimate the
proportion q(θ) of the population below a specified limit c, where θ = (µ, σ 2 ).
(a) Show that the MLE of q(θ) is Φ((c − X̄)/σ̂ ) where X̄ and σ̂ 2 are the sample mean
and variance.
(b) Show that the UMVU estimate of q(θ) is given by

    T ∗ (X) = 0                     if kV ≤ −1 ,
            = G( 1/2 + (1/2)kV )   if −1 < kV < 1 ,
            = 1                    if kV ≥ 1 ,

where k = √n/(n − 1), V = (c − X̄)/s, and G is the β( 1/2 , (n − 2)/2 ) d.f.
Hint. An unbiased estimate is S(X) = 1[X1 ≤ c]. Now compute

    E[S(X)|X̄, σ̂ 2 ] = P [X1 ≤ c|X̄, σ̂ 2 ] .
18. (Stigler (1972)). Consider the family P of distributions Pθ , θ = 1, 2, . . . , defined by
    Pθ [X = i] = 1/θ if i = 1, . . . , θ ; = 0 otherwise.   (8.4.1)

(a) Show that P is complete. Hint. Use induction.
(b) Show that 2X − 1 is the UMVU estimate of θ.
(c) Show that the family P0 = P − {Pk } is not complete when k is a positive integer.
Hint. Consider E v(X) where

    v(i) = 0 for i ≠ k, k + 1 ,
         = 1 for i = k ,
         = −1 for i = k + 1 .

(d) Show that 2X − 1 is not UMVU if Pθ ∈ P0 .
Hint. Consider

    T (X) = 2X − 1, X ≠ k, k + 1 ,
          = 2k,     X = k, k + 1.
Remark. This model is not regular in the sense of Sections 1.1.3 and I.0.
19. Consider model (8.4.1). For what α ∈ (0, 1) can you find a similar test of H : θ = 10
vs K : θ > 10?
20. Establish (8.2.12).
21. Establish the converse of Theorem 8.2.3.
22. Lehmann and Scheffé (1950, 1955) show that if a complete sufficient statistic exists, it
must be minimal sufficient. Assume this result. Let X1 , . . . , Xn be a sample from Unif(θ −
0.5, θ + 0.5).
(a) Show that (X(1) , X(n) ) is minimal sufficient.
Hint. Use the factorization theorem. See Example 1.5.1 (continued).
(b) Show that (X(1) , X(n) ) is not complete.
Hint. Take v(s, t) = t − s − [(n − 1)/(n + 1)] and use Problem B.2.9.
(c) Show that no complete sufficient statistic exists.
23. Establish (8.2.17).
24. Suppose (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. as N (µ1 , µ2 , σ12 , σ22 , ρ). Consider the two
statistics:

    T = ( ΣXi , ΣYi , ΣXi2 , ΣYi2 )T ,
    ρ̂ = Σ(Yi − Ȳ )(Xi − X̄) / [ Σ(Xi − X̄)2 Σ(Yi − Ȳ )2 ]1/2 .

If ρ = 0, are T and ρ̂ independent? Why? Why not?
25. Let (Xi , Yi )T , i = 1, 2, . . . , n, be independent vectors with joint density
    fXi ,Yi (xi , yi |τ, σ) = (yi2 /τ 3 σ) e−yi /τ −xi yi /στ

for xi , yi > 0 and τ, σ > 0.
(a) Give a complete sufficient statistic (T1 , T2 ) for θ = (τ, σ).
(b) Give the moment generating function for (T1 , T2 ) and give the corresponding joint
distribution.
(c) Derive the maximum likelihood estimators τ̂ , σ̂ .
(d) Calculate E(τ̂ ), E(σ̂ ) and construct UMVU estimators of τ and σ.
26. Let Y1 , . . . , Yn be i.i.d. from the uniform distribution U (0, θ) with an unknown θ ∈
(1, ∞). Suppose that we only observe

    Xi = Yi if Yi ≥ 1 ;  Xi = 1 if Yi < 1 ,   i = 1, . . . , n .
(a) Derive a UMVUE of θ.
(b) Find the MLE of θ and η ≡ P (Y1 > 1).
(c) Derive a UMP test of size α for testing H : θ ≤ θ0 versus K : θ > θ0 , where θ0 is
known and θ0 > (1 − α)−1/n .
Problems for Section 8.3
1. Show that for the affine group on Rd , (8.3.2) defines ḡ uniquely; that Ḡ is a group, and
that Ḡ is isomorphic to G.
2. Suppose X ∼ P0 = Nd (0, J), J = diag(1, . . . , 1)d×d . Show that the transformation
model induced by P0 and the affine group is {Nd (µ, Σ); µ ∈ Rd , Σ ∈ S}, where S =class
of d × d nonsingular matrices.
3. Establish the claim of Example 8.3.2.
4. Establish the claim of Remark 8.3.1(a).
5. Establish the claim of Remark 8.3.1(c).
6. If x, y ∈ Rd and |x|2 = |y|2 , there exists A orthogonal such that Ax = y.
Hint. Construct an orthonormal basis where the first member is x/|x|, mapping x/|x| onto
(1, 0, . . . , 0)T . See Vol. I, page 496.
7. (a) Show that the ranks are indeed maximal invariant in Example 8.3.2.
(b) Suppose X = Rn and G is as before. Exhibit a maximal invariant in this case.
8. Establish Proposition 8.3.1.
9. Show that the χ2d (θ2 ) family is a monotone likelihood ratio family in the parameter θ2 .
Hint. See Problem B.3.12.
10. Suppose (X1 , . . . , Xn ) have joint continuous case density f1 and that f0 (x1 , . . . , xn ) =
Π_{i=1}^n h(xi ) for some univariate continuous case density h that we can choose. Let X(1) <
. . . < X(n) be the order statistics of the Xi and (R1 , . . . , Rn ) the ranks. Assume that
f0 > 0 whenever f1 > 0.
(a) Show that the following Hoeffding’s formula holds:

    P1 [R1 = r1 , . . . , Rn = rn ] = (1/n!) Ef0 [ f1 (X(r1 ) , . . . , X(rn ) ) / ( h(X(r1 ) ) · · · h(X(rn ) ) ) ] .
Hint. Use formula (2.4.8) and note that (X(1) , . . . , X(n) ) and (R1 , . . . , Rn ) are independent under f0 .
Apply Hoeffding’s formula to deduce that when

    f1 (x1 , . . . , xn ) = Π_{i=1}^{n1 } h(xi ) Π_{i=n1 +1}^{n} g(xi ) ,

(b) P1 [R1 = r1 , . . . , Rn = rn ] = (1/n!) E0 [ Π_{i=n1 +1}^{n} (g/h)(X(ri ) ) ] .
(c) Suppose h = H ′ , g = G′ and G = 1 − (1 − H)θ , θ > 0. Show that the UMP
(uniformly in F0 ) rank test of H : θ = 1 vs. K : θ = θ1 > 1 where θ1 is known, is given
by rejecting for large values of

    Tn (θ1 ) = E0 [ Π_{i=n1 +1}^{n} (1 − U(ri ) )θ1 −1 ] ,

where U(1) < . . . < U(n) are the order statistics of a sample from U(0, 1).
(d) Set h0 (u) = 1, g1 (u) = 2u, g2 (u) = 2(1 − u), 0 < u < 1, n1 = n2 = 2, and
α = 1/6. Show that the MP level α rank test for H : h = g = h0 vs K : h = h0 , g = g1
rejects H iff R3 , R4 ∈ {3, 4} while the MP level α rank test for H vs (h0 , g2 ) rejects H iff
R3 , R4 ∈ {1, 2}.
Hint. P (R3 , R4 ∈ {3, 4}) = 4P (X1 < X2 < X3 < X4 ), etc.
(e) Describe the MP level α = k/\binom{6}{3} rank test for testing H vs (h0 , g1 ) in model (d)
when n1 = n2 = 3 and (i) k = 1, (ii) k = 2.
(f) Same as (e) except for testing H vs (h0 , g2 ).
11 (a) Show that the distribution of the rank vector (R1 , . . . , Rn ) in the two-sample problem of Example 8.3.10 depends on (F, G) only through GF −1 . Thus if h is a given continuous and strictly increasing function on [0, 1], then {Pθ,h : G = hθ (F ), θ > 0} is a group
submodel where the distribution of the ranks depends only on θ and h.
Hint. The rank vector is invariant under the transformation Xi → F (Xi ) ≡ Xi′ . Here
Xi′ has the U(0, 1) distribution for i = 1, . . . , n1 , and the distribution GF −1 for i =
n1 + 1, . . . , n.
(b) In Example 8.3.10, show that GF −1 is maximal invariant.
(c) Set F̄ = 1 − F and Ḡ = 1 − G. Show that the Lehmann alternative {Pθ,h : Ḡ = F̄ θ }
includes the exponential two-sample model.
Hint. If U ∼ U(0, 1), then −λ−1 log(1 − U ) ∼ E(λ).
(d) Suppose X ∼ F and Y ∼ G. Y and G are said to be stochastically larger than
X and F iff P (Y ≥ t) ≥ P (X ≥ t) for all t ∈ R, that is, iff G(t) ≤ F (t), t ∈ R.
In the two sample framework of Example 8.2.7, a test ϕ(x, y) of H : F = G vs
K : “G is stochastically larger than F ” is monotone if yj′ ≥ yj , 1 ≤ j ≤ n, implies
ϕ(x, y′ ) ≥ ϕ(x, y). Show that:
(i) All monotone tests are rank tests, that is, functions of the ranks R1 , . . . , Rn of
Y1 , . . . , Yn . Also show that all tests of the form ϕ(X, Y) = 1[ Σ_{j=1}^n a(Rj ) ≥ c ]
are monotone provided a(k ′ ) ≥ a(k) for k ′ ≥ k.
(ii) All monotone tests have monotone power. That is, EF,G ψ(X, Y) ≥ EF,H ψ(X, Y)
if G is stochastically larger than H. Thus monotone tests are unbiased.
Remark. Wang (1996) developed a test of equality of distributions against nonparametric
stochastically ordered alternatives.
12. Hoeffding (1951) and Hajek and Sidak (1967) show that for rank tests of H0 : θ =
θ0 vs K : θ > θ0 , the locally UMP rank test is, under regularity conditions, based on
(∂/∂θ)P (R = r)|θ=θ0 .
(a) Show that for allowable levels α the locally UMP rank test for the model in Problem
8.3.10(c) is the Savage exponential scores test.
(b) Show that the UMP level α permutation test of H : F = G vs. K : F (t) = 1 − e−λt ,
G(t) = 1 − e−λθt , θ > 1, is given by rejecting for small values of Σ_{i=n1 +1}^n xi with critical
value cn (x(1) , . . . , x(n) ).
(c) Show that the test in (b) is uniformly more powerful in this parametric family of
alternatives than the MP rank test given in (a).
13. (a) Show that if X(1) < . . . < X(n) are the exponential, E(1), order statistics, then

    E(X(j) ) = Σ_{i=n+1−j}^{n} 1/i .

Hint. See Problem B.2.14.
(b) In (a), show that if j = [nu + 1], 0 < u < 1, then E(X(j) ) = − log(1 − u) + o(1).
Hint. X(j) = − log(1 − U(j) ), where U(1) < . . . < U(n) are uniform, U(0, 1), order
statistics. Taylor expand around U(j) ∼
= u. See Problem B.2.9.
14. Copulas. Consider Example 8.3.11.
(a) Let Ci (F (·)) be the copula model with d = 1, θi = exp{βzi } and the Lehmann-Savage copula

    Ci (u) = 1 − [1 − u]θi , 0 < u < 1 .

Use Hoeffding’s formula to show that the locally UMP invariant test of H0 : β = 0 vs
H1 : β > 0 is based on the exponential scores statistic

    TE = n−1/2 Σ_{i=1}^n aE (Ri )(zi − z̄)

where aE (k) = E(Z (k) ) with Z (1) < . . . < Z (n) exponential order statistics.
Hint. Set K(t) = 1 − exp(−t). The df of W = K −1 (F (Yi )) is K(w/θi ). (Approximate
critical values for TE can be obtained from TE /s ∼ N (0, 1), s2 = Σ(zi − z̄)2 /(n − 1).)
(b) In the transformation model h(Yi ) = θi + εi , set h(y) = log{− log[1 − F (y)]} for
some unknown continuous df F and suppose εi has df 1 − exp{− exp(t)}. Let Yi ∼ Fi and
assume that Fi and F have continuous case densities fi and f . Show that Fi satisfies the
proportional hazard rate model λ(y|zi ) = θi λ(y), where λ(y|zi ) = fi (y)/[1 − Fi (y)]
and λ(y) = f (y)/[1 − F (y)] are the regression and baseline hazard rates.
(c) Show that the models of (a) and (b) are the same.
(d) Show that for d = 1 and θi = α + βzi in the Gaussian copula model, the normal
scores statistic is locally UMP invariant. Use Hoeffding’s formula.
(e) Same as (d) except logistic copula and uniform scores statistic.
(f) Proportional odds model.
(i) Let Ci (F (·)) be the copula model with d = 1, θi = α + βzi . Bell and Doksum
(1966, Table 8.1) considered

    Ci (u) = u / [u + (1 − u) exp(θi )] .

Use Hoeffding’s formula to show that the uniform scores test is locally UMP invariant
for testing H0 : β = 0 vs H1 : β > 0.
(ii) The odds ratio is defined as ri = Fi /(1 − Fi ). Let r = F/(1 − F ) where F is a
continuous baseline df. Show that for the model in (i),
ri (y) = exp(θi )r(y) .
This is called the proportional odds model.
(iii) In the monotone transformation model h(Yi ) = θi + εi with d = 1 and θi = α + βzi ,
set h(y) = log F (y)/[1 − F (y)] for some unknown continuous df F and assume
that εi has a logistic distribution. Show that this model is equivalent to the model in
(i) and (ii).
(g) Random effect proportional hazard. Bell and Doksum (1966) considered

    Ci (u) = (eθi u − 1)/(eθi − 1) ,  θi ≠ 0 ;
          = u ,                      θi = 0 .

Sibuya (1968) and Nabeya and Miura (1972, unpublished) showed that Ci (F (·)) (called
the SINAMI model) is a random effects proportional hazard model λ(y|zi ) = ∆i λ(y)
with ∆i having a zero truncated Poisson (θi ) distribution. Show that when d = 1 and
θi = α + βzi , the uniform scores test is locally UMP invariant for testing H0 : β = 0 vs
H1 : β > 0.
(h) Same as (f) except
(i) Ci (u) = (1−θi )u+θi u2 . Show that the uniform scores test is locally UMP invariant.
(ii) Ci (u) = (1 − θi )u + θi um , m ≥ 2 an integer. Find the locally UMP invariant test
when m is known.
15. Monotone regression tests. Consider the regression framework where we observe (z1 , Y1 ), . . . ,
(zn , Yn ) with z1 < . . . < zn nonrandom and Y1 , . . . , Yn independent. Let <̃ denote
stochastic inequality. A test ϕ(y) of H : Fi = F , 1 ≤ i ≤ n, vs K : F1 <̃ . . . <̃ Fn is
regression monotone if ϕ(y′ ) ≤ ϕ(y) for all y and y′ for which i < j and yi′ ≤ yj′ imply
yi ≤ yj . Show that
(a) All monotone tests are rank tests.
(b) All monotone tests have monotone power; that is, if Fi−1 (Gi (y)) ≤ Fj−1 (Gj (y)) for
i ≤ j and all y, and ϕ(·) is monotone, then

    EF ϕ(Y) ≥ EG ϕ(Y) .

Note that it follows that monotone tests are unbiased for H vs K : F1 <̃ . . . <̃ Fn .
Hint. Let Yi ∼ Gi and set Yi′ = Fi−1 (Gi (Yi )); then i < j and Yi ≤ Yj =⇒ Yi′ ≤ Yj′ .
(c) Let R1 , . . . , Rn be the ranks of Y1 , . . . , Yn . If a(i) is nondecreasing in i = 1, . . . , n,
then the test ϕ(y) = 1[ Σ_{i=1}^n zi a(Ri ) ≥ c ] is monotone.
16. Consider Example 8.3.12.
(a) Show that if (X, Y ) has the continuous case density f (x, y), then (Hoeffding’s formula for bivariate data)

    P [(R, S) = (r, s)] = (1/n!) E [ Π_{i=1}^n f (V(ri ) , W(si ) ) / ( f1 (V(ri ) ) f2 (W(si ) ) ) ]

where f1 and f2 are the marginal densities of f and {V(k) } and {W(k) } are the order
statistics in independent samples from f1 and f2 .
(b) Show that the bivariate normal scores statistic is locally UMP invariant for testing
H0 : ρ = 0 vs H1 : ρ > 0 in the bivariate Gaussian copula. You may differentiate inside
the expected value in Hoeffding’s formula.
Hint. Because of the invariance of the ranks, assume (X, Y ) ∼ N (0, 0, 1, 1, ρ) ≡ fρ . Use

    (∂/∂ρ) Π_{i=1}^n fρ = ( Σ_{i=1}^n (∂/∂ρ) log fρ ) Π_{i=1}^n fρ .
(c) Consider the model with g(X) = V + θZ, h(Y ) = W + θZ, where V and W
have N (µ1 , σ12 ) and N (µ2 , σ22 ) densities, the df M of Z is arbitrary, V , W , and Z are
independent, and g and h are increasing. Show that the bivariate normal scores statistic is
locally UMP invariant for testing H : θ = 0 vs K : θ > 0.
Hint. For invariant (rank) statistics, we can assume

    f (x, y; θ) = ∫_{−∞}^{∞} ϕ(x − θz) ϕ(y − θz) dM (z) .
(d) Show that for the bivariate Gaussian copula, |ρ| = supa,b corr(a(X), b(Y )) where
the sup is over monotone functions a and b.
17. Establish (8.3.21).
18. Establish (8.3.23).
19. Show in Example 8.3.13 that the size α test that rejects H iff X1 ≥ c is
(a) UMP for testing H : θ ≤ θ0 vs K : θ > θ0 when Σ0 is known.
(b) The Bayes test for the loss function lc (u, φ) and the priors π0 and π1 corresponding
to point masses at (θ0 , ηT0 )T .
(c) UMP unbiased when Σ0 is unknown.
(d) UMP invariant when Σ0 is unknown.
20. Show that X is minimax for quadratic loss when X ∼ Np (µ, σ02 J) where J =
diag(1, . . . 1)p×p .
21. Let N have the binomial distribution B(10, p); given N = n, let X1 , . . . , Xn be i.i.d.
N (µ, 1). The observations consist of (N, X1 , . . . , XN ).
(i) If p has a known value p0 , show that there exists neither a UMP test nor a UMP
unbiased test of H : µ = 0 against µ > 0.
(ii) If p is unknown, show that there does exist a UMP unbiased test of H : µ = 0 against
µ > 0.
22. Let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ). State the MP test of H : σ = σ0 against a simple
alternative (µ1 , σ1 ) for the two cases (i) σ1 > σ0 ; (ii) σ1 < σ0 , and in each case describe
the least favorable distribution for µ over the parameter set {(µ, σ0 ) : −∞ < µ < ∞}.
23. Let X1 , X2 be positive random variables with density

    (1/σ 2 ) f (x1 /σ , x2 /σ ) ,  σ > 0 .
Assuming it exists, let δ(X) be the minimum risk scale equivariant estimate of σ, under the
loss function (d − σ)2 /σ 2 .
(a) Show that

    δ(X) = ∫_0^∞ u2 f (uX1 , uX2 ) du / ∫_0^∞ u3 f (uX1 , uX2 ) du .
(b) Suppose there exists a complete sufficient statistic t(X) for the model. Show that
δ(X) is itself a sufficient statistic.
24. Let C be a d × d matrix. Suppose γ1 , . . . , γd and w1 , . . . , wd satisfy Cwj = γj wj , 1 ≤
j ≤ d. Then γ1 , . . . , γd are called eigenvalues of C and w1 , . . . , wd are called eigenvectors.
See Section B.10. Consider the inner product defined by (x, y) = Σ xi yi and the norm
|x| = (x, x)1/2 . Let

    v1 = arg max{aT Ca : |a| = 1} ,
    v2 = arg max{aT Ca : |a| = 1, a ⊥ v1 } ,
    ...
    vd = arg max{aT Ca : |a| = 1, a ⊥ [v1 , . . . , vd−1 ]}
and λj = vjT C vj, 1 ≤ j ≤ d.
(a) Show that v1, . . . , vd, λ1 ≥ λ2 ≥ . . . ≥ λd are all well defined and that Cvj = λj vj, 1 ≤ j ≤ d. Thus λ1, . . . , λd are eigenvalues and v1, . . . , vd are eigenvectors.
This is the Courant-Fischer theorem.
(b) When C is the covariance matrix Σ of a random vector X ∈ Rd, the linear combination Lj = vjT X is called the jth principal component. Show that Var(Lj) = λj.
The term "principal components" is also used for the empirical v̂j, λ̂j which correspond to Σ̂, the empirical variance covariance matrix of a sample X1, . . . , Xn. Sometimes the correlation matrix is used in place of Σ and Σ̂.
(c) Establish the (a) part for the inner product (x, y) = Σ_{i=1}^d Πi xi yi, where Πi > 0 and Σ_{i=1}^d Πi = 1.
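The assertions of parts (a) and (b) can be checked numerically. The following sketch (illustrative, not part of the text; Python with numpy, and the 3 × 3 covariance is an arbitrary choice) eigendecomposes a covariance matrix C and compares the sample variances of the principal components Lj = vjT X with the eigenvalues λj.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary positive definite "covariance" C (illustrative choice).
B = rng.normal(size=(3, 3))
C = B @ B.T
X = rng.multivariate_normal(np.zeros(3), C, size=200_000)

# Eigendecomposition C = sum_j lambda_j v_j v_j^T, reordered so that
# lambda_1 >= ... >= lambda_d, with orthonormal eigenvector columns.
lam, V = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]

# L_j = v_j^T X; part (b) asserts Var(L_j) = lambda_j.
L = X @ V
print(np.round(L.var(axis=0, ddof=1), 2))   # close to np.round(lam, 2)
```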
25. Let X1, . . . , Xn be i.i.d. discrete random variables with

P(X1 = x) = γ(x)θ^x / c(θ) , x = 0, 1, 2, . . . ,

where γ(x) ≥ 0, θ > 0 is unknown, and c(θ) = Σ_{x=0}^∞ γ(x)θ^x. Let Y1, . . . , Yn be i.i.d. random variables having the beta distribution with density

[Γ(a + b)/(Γ(a)Γ(b))] y^{a−1} (1 − y)^{b−1} , 0 < y < 1 ,

where a > 0 and b > 0 are unknown. Assume that the Xi's and Yi's are independent. Suppose that we observe Zi = Xi + Yi, i = 1, . . . , n.
(a) Show that EX1 = θc′ (θ)/c(θ), where c′ (θ) is the first order derivative of c(θ).
(b) Obtain a complete and sufficient statistic for the unknown parameter (θ, a, b).
(c) Derive a uniformly minimum variance unbiased estimator of the parameter θ.
(d) For a given α ∈ (0, 1/2), derive a uniformly most powerful test of size α for

H0 : θ ≥ θ0 vs H1 : θ < θ0 ,

where θ0 > 0 is a known value.
8.5
Notes
Note for Section 8.2.
(1) The "similar" terminology arose from the notion that if φ was nonrandomized and corresponded to critical region C, then C was, under H, similar to the sample space X (which has the property P(X) = 1) in having probability P(C) = α, P ∈ P0, not depending on P.
Notes for Section 8.3.
(1) For general definitions of abstract groups and their properties we refer to, for instance,
Birkhoff and MacLane (1998).
(2) Lehmann (1953) proposed G = F^θ. Savage (1956, 1980) noted that if F is replaced by S = 1 − F one obtains (8.3.16). He called these Lehmann alternatives.
(3) A formulation of this type is in Wijsman (1990).
Chapter 9
INFERENCE IN SEMIPARAMETRIC
MODELS
As we indicated in Section I.1, our concern in this chapter will be with methods for inference about parameters in non- and semiparametric models whose qualitative large sample behaviour is like that of the corresponding procedures for regular parametric models. In the i.i.d. case, estimates converge at rate √n to Gaussian distributions, and the behaviour of tests is governed by that of Gaussian processes. Section 9.1 introduces theory for estimates based on maximizing modified and empirical likelihoods in important semiparametric models such as linear models with stochastic covariates, biased sampling models, Cox proportional hazard models with censoring, and independent component analysis models. Sections 9.2 and 9.3 deal, respectively, with asymptotic normality and efficiency for estimates. Section 9.4 similarly deals with asymptotic inference for tests in semiparametric models. We need to use the machinery of Chapter 7 to extend the techniques of Chapters 5 and 6. In Section 9.5 we show how the powerful notions of contiguity and local asymptotic normality can clarify power and information bound calculations.
9.1
Estimation in Semiparametric Models
In Sections 9.1, 9.2, and 9.3, we want to primarily address the question of how to optimally estimate regular parameters in semiparametric models. We have defined semiparametric models by example and will continue to do so. Loosely, they are models which are
naturally parametrized by both Euclidean and infinite dimensional parameters; or where
distributions in the model are subject to restrictions. Nonparametric models are loosely
described as ones where every probability distribution is at least approximable by distributions in the model. We will consider parametric and nonparametric models to be special
cases of semiparametric models.
9.1.1
Selected Examples
We have already considered a number of semiparametric models:
Section 3.5: The symmetric location model (3.5.1)
Xi = µ + εi , εi ∼ i.i.d. F, µ ∈ R ,
where the natural parameter is (µ, F ) with F the error distribution function.
The gross error model of Section 3.5.3,

f(x) = F′(x) = (1 − λ)(1/σ)φ(x/σ) + λ h(x) ,

parametrized by (µ, λ, σ, h), where h is a density and µ is unidentifiable without restrictions on h(·) such as h(x) = h(−x).
Sections I.1.1 and 8.3: The two-sample model of Example 8.3.10, where we observe two independent samples with distributions F and G, is naturally parametrized by (F, G), which is fully nonparametric, and the hypothesis model is {(F, G) : F = G}, which is semiparametric.
We pursue an important model which was introduced in Examples 6.2.1 and 6.2.2.
Example 9.1.1. Semiparametric linear models with stochastic covariates. Here, we observe (Z1 , Y1 ), . . . , (Zn , Yn ) i.i.d. as (Z, Y ), Z ∈ Rp , Y ∈ R, with
Y = α + ZT β + σε .
Various semiparametric versions of this model are in use. Here are two:
(a) Z and ε are independent, ε ∼ F, Z ∼ H, F ∈ F , H ∈ H, and F , H are general,
and Σ ≡ VarH (Z) ≡ EH (Z − EH (Z))(Z − EH (Z))T is nonsingular.
(b) The joint distribution Q of (Z, ε) is arbitrary save that Eε2 < ∞, E(ε|Z) = 0, and
Var(Z) is nonsingular.
Thus we can think of model (a) as parametrized by (α, β, σ, F, H) and model (b) by (α, β, σ, Q). It is easy to see that σ is not identifiable in either of these models, nor is α in the first (Problem 9.1.1). However, we have already seen in Example 6.2.2 that, if in addition to assumption (a) above, we assume F with density f symmetric about 0, then β and α are both identifiable, and we exhibited estimating equations for (α, β). In fact, symmetry is only required for identifiability of α, and β is √n consistently estimable in the sense that β̂ = β + OP(n^{−1/2}) for model (a) by using the estimating equations of Example 6.2.2 with, for instance, f0 the logistic density (Problem 9.1.1(d)).
In model (b), the LSE of Example 6.2.1 is √n consistent. To show this, write

β̂ = β + Σ̂^{−1} (1/n) Σ_{i=1}^n (Zi − Z̄)(Yi − ZiT β)   (9.1.1)

where Σ̂ ≡ (1/n) Σ_{i=1}^n (Zi − Z̄)(Zi − Z̄)T. To check √n consistency, note that by consistency of continuous functions of sample moments,

Σ̂^{−1} →P Σ^{−1} .
Further,

(1/n) Σ_{i=1}^n (Zi − Z̄)(Yi − ZiT β) = (1/n) Σ_{i=1}^n (Zi − E(Z))σεi − (Z̄ − E(Z)) (1/n) Σ_{i=1}^n σεi
= OP(n^{−1/2}) + OP(n^{−1})

since E(Z − E(Z))ε = E[(Z − E(Z))E(ε|Z)] = 0 by B.1.20 and E(ε|Z) = 0, so that √n consistency of β̂ holds in model (b). ✷
In these two semiparametric linear models, the construction of √n consistent estimates of the interesting Euclidean parameter β proved fairly simple. However, it is not clear what alternative and possibly better procedures there might be. That is, we need to develop a concept of efficiency for semiparametric models that can be used to determine what a "best" estimate is. This will be done in the context of this example in Section 9.3.
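The √n consistency computation above can be illustrated by simulation. The following sketch (illustrative, not from the text; the intercept 0.5, scale 1.5, t5-distributed errors, and sample sizes are arbitrary choices) generates data from model (b) and computes the centered least squares estimate corresponding to (9.1.1); the estimation error should shrink roughly like n^{−1/2}.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0])

def lse_error(n):
    # Model (b): Y = alpha + Z^T beta + sigma * eps with E(eps | Z) = 0.
    Z = rng.normal(size=(n, 2))
    eps = rng.standard_t(df=5, size=n)      # non-Gaussian, E eps^2 < infinity
    Y = 0.5 + Z @ beta + 1.5 * eps
    # Centered least squares, cf. (9.1.1): Sigma_hat^{-1} times the
    # average of (Z_i - Zbar)(Y_i - Ybar).
    Zc = Z - Z.mean(axis=0)
    Sigma_hat = Zc.T @ Zc / n
    beta_hat = np.linalg.solve(Sigma_hat, Zc.T @ (Y - Y.mean()) / n)
    return np.linalg.norm(beta_hat - beta)

# A 100-fold increase in n should cut the error by about a factor of 10.
print(lse_error(100), lse_error(10_000))
```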
Example 9.1.2. Biased Sampling. Stratification. It is often the case that, while we want to
obtain information about a population df F , or at least features of F such as the mean and
variance, we are only able to observe a sample from P(F,G) , where G is a finite or infinite
dimensional parameter. We begin by considering the essentially nonparametric model of
biased sampling from a single population. Here, if the distribution F of Z has density
(discrete or continuous case) f, observations are not on Z but on X with density

pF(x) = w(x)f(x)/W(F)   (9.1.2)

where W(F) = ∫ w(x) dF(x), and w(·) is assumed known.
An important special case is "length biased" sampling where X ∈ R+, X ∼ F, and w(x) = x. This arises classically if we are interested in, say, the proportion f(2) of two-child families in a city with a known total number of households N. If we could sample households with at least one child, and Z is the number of children in a sampled household, we could estimate f(2) = P[Z = 2] directly. Suppose instead we sample n children at random and consider the proportion pF(2) of children coming from two-child families in this group. If the city is large, f(j) is the proportion of j-child families in the city, j = 0, 1, 2, . . ., and X is the number of children in a sampled household, then

pF(2) = P[X = 2] = 2f(2) / Σ_{j=1}^∞ j f(j) .   (9.1.3)
In this case (9.1.2) holds with w(x) = x.
Based on observations from pF(·) in (9.1.2), is F identifiable? This is true iff w(x) > 0 whenever f(x) > 0. How do we identify F in this case? We claim that (Problem 9.1.2)

f(x) = (pF(x)/w(x)) / ∫ w^{−1}(y) dPF(y) .   (9.1.4)
Next we consider stratified populations where biased sampling may occur within each stratum. A stratified population S is made up of subpopulations S1, . . . , Sk called strata. We are interested in the density f(z) of a random draw Z from S, but sample by first randomly selecting an index I with corresponding stratum SI. This situation arises if S is the set of all patients in a county and S1, . . . , Sk are the patients in the k hospitals and clinics in the county. Length biased sampling may occur within each stratum. In this case we observe X = (I, Y) where I = 1, . . . , k; P[I = j] = λj, 1 ≤ j ≤ k, are assumed known; and given I = j, Y has density

pF(y|j) = wj(y)f(y)/Wj(F)   (9.1.5)

where Wj(F) = ∫ wj(x) dF(x). Again, w1, . . . , wk are assumed known. In stratified sampling λj is the probability that Y will be from the jth stratum Sj and wj(x) = 1(x ∈ Sj). We claim that F is identifiable iff Σ_{j=1}^k λj wj(x)f(x) > 0 (Problem 9.1.2). ✷
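The identification formula (9.1.4) translates into a simple reweighting estimator: weight each observed value x by w^{−1}(x) and renormalize. A minimal simulation sketch (illustrative, not from the text; the three-point family-size distribution is an invented example) for length biased sampling with w(x) = x:

```python
import numpy as np

rng = np.random.default_rng(2)

# True family-size distribution f on {1, 2, 3} (invented for illustration).
support = np.array([1, 2, 3])
f = np.array([0.5, 0.3, 0.2])

# Length-biased sampling, w(x) = x: p_F(x) = x f(x) / sum_j j f(j), cf. (9.1.2).
w = support.astype(float)
pF = w * f / (w * f).sum()
X = rng.choice(support, size=100_000, p=pF)

# Recover f via (9.1.4): f(x) proportional to (empirical p_F(x)) / w(x).
counts = np.array([(X == s).mean() for s in support])
f_hat = (counts / w) / (counts / w).sum()
print(np.round(f_hat, 3))   # close to (0.5, 0.3, 0.2)
```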
The next two examples come from the area of survival analysis, the study of lifetimes
of subjects in clinical trials, experimental animals, pieces of equipment, etc. The models
in these examples apply more generally to experiments that involve “time to event data.” A
description of lifetime distributions more suited to the subject than densities or distribution
functions is the hazard rate.
Definition 9.1.1. The hazard rate λ(·) of a nonnegative, continuous variable T with density f and distribution function F is defined for t ≥ 0 by

λ(t) = lim_{∆→0} {P[t ≤ T ≤ t + ∆ | T ≥ t]/∆}
= lim_{∆→0} [F(t + ∆) − F(t)] / [∆(1 − F(t))]
= f(t)/(1 − F(t)) = (d/dt)(− log S(t))

where S(t) = P[T > t] = 1 − F(t) is defined as the survival function.
It follows that

f(t) = λ(t) S(t) = λ(t) e^{−∫_0^t λ(s) ds} ≡ λ(t) e^{−Λ(t)}   (9.1.6)

where Λ(t) is the cumulative hazard function. Note that

Λ(t) = − log S(t) .   (9.1.7)
The discrete version of the hazard rate is defined for a discrete variable T as

λ(t) ≡ P[T = t | T ≥ t] = P[T = t] / P[T ≥ t] .

By definition, λ(t) is 0 unless P[T = t] > 0. The analogue of (9.1.6) becomes

Lemma 9.1.1. If Σ_{j=1}^∞ P[T = tj] = 1, t1 < t2 < . . ., then

P[T ≥ tk] = Π_{j=1}^{k−1} (1 − λ(tj)) ,   (9.1.8)
P[T = tk] = λ(tk) Π_{j=1}^{k−1} (1 − λ(tj)) .   (9.1.9)
Proof. P[T ≥ tk] = Π_{j=1}^{k−1} {P[T ≥ tj+1] / P[T ≥ tj]}. Next note that

P[T ≥ tj+1] = P[T ≥ tj] − P[T = tj] .

The second equation follows from (9.1.8). ✷
We will call (9.1.6) and (9.1.7) the continuous case approximations to (9.1.9). In particular, exp{−Λ(t)} approximates Π_{j=1}^{k−1} (1 − λ(tj)), and vice versa.
Our next example introduces two different types of biased sampling.
Example 9.1.3. Censoring and truncation. A very common type of data in many contexts
is censored or truncated data. In (right) censored sampling we wish to observe “times”
T1 , . . . , Tn pertaining to n units sampled from a population. Instead, events occurring at
times C1 , . . . , Cn may prevent some of T1 , . . . , Tn from being observed. These censoring
times are such that (T1 , C1 ), . . . , (Tn , Cn ) are independent and we observe Xi = (Yi , δi )
where
Yi = min(Ti , Ci ),
δi = 1(Ti ≤ Ci ) , 1 ≤ i ≤ n .
That is, we observe either the time of interest T or the censoring time C and are told
which one is observed. A classical situation where this occurs is when T is the time an
individual with some disease dies after entering a study of duration L. If T ≤ L, the
survival time is observed; if T > L, the individual is said to be lost to follow up. In this
case C = L. Subtler is the case where an individual enters the study at a random time T0
assumed independent of the survival time from entry. We then observe survival time T if
T ≤ L − T0 and C = L − T0 otherwise, which is assumed independent of T , and indeed
1(T ≤ C) is also observable. An extensive discussion is given for instance in Andersen,
Borgan, Gill and Keiding (1988,1993) and Kalbfleisch and Prentice (2002). If we assume
that T and C are independent with distributions F, G which are arbitrary and if F, G have
densities f, g, then the distribution of (Y, δ) is given by
P[Y ≤ y, δ = 1] = ∫_{−∞}^y f(s) Ḡ(s) ds   (9.1.10)
P[Y ≤ y, δ = 0] = ∫_{−∞}^y g(s) F̄(s) ds

where F̄ = 1 − F, Ḡ = 1 − G are the appropriate survival functions. Thus, the model
is parametrized by (F, G). Can (F, G) be fully identified? The answer is yes, under mild
conditions, as we shall see later in the continuation (Example 9.1.9) of this example.
Another important type of biased sampling is truncation. Here, the distribution F of
T is, as usual, what is wanted, but what we do observe is Y where Y has the conditional
distribution of T given T ≥ M where M , assumed independent of T , is an observed
threshold. Thus, if T has density f and df F and M has density g and df G, then the density of X = (M, Y) is

p(m, y) = g(m) f(y) 1(y ≥ m) / F̄(m) .   (9.1.11)
An important example is truncation in astronomical data where the luminosities L of stars are available only if their (visual) brightness B is above a minimal detection level, since brightness depends both upon the luminosity and the distance to the star. B, which is a function of the distance, is genuinely random and, under a hypothesis of homogeneity
of star types in any given direction, can be assumed independent of L — see Babu and
Feigelson (1996) for instance. The issue of identifiability of F and G arises and again the
answer is affirmative under natural conditions — see Example 9.1.9.
✷
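The first line of (9.1.10), integrated over all y, gives P[δ = 1] = ∫ f(s)Ḡ(s) ds. For exponential T and C this integral has the closed form a/(a + b), which a quick simulation sketch (illustrative; the rates a = 1, b = 0.5 are arbitrary choices) confirms:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, n = 1.0, 0.5, 200_000

# Right censoring: T ~ Exp(rate a), C ~ Exp(rate b) independent.
T = rng.exponential(1 / a, n)
C = rng.exponential(1 / b, n)
Y = np.minimum(T, C)
delta = (T <= C).astype(int)

# (9.1.10) with y -> infinity: P[delta = 1] = int f(s) Gbar(s) ds = a/(a+b).
print(delta.mean(), a / (a + b))
```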
Example 9.1.4. The Cox proportional hazard model. Analogues of the Gaussian linear
model that were meant to model event times T were initially built around the exponential,
E(λ), family of distributions. Specifically, given a vector of covariates Z which might
include factors such as age, weight, and treatment indicator, the response lifetime variable
T was modelled so that the natural parameter, the conditional hazard rate λ(t|Z = z), was
a function of Z and a parameter β. Since λ is positive, a specification such as
λ(t|z) = exp{zT β}
is natural. If, for instance, we have two groups, z = (z1 , z2 )T , z1 ≡ 1, and z2 = 0 or 1
as the observation comes from group 1 or 2, then this is just a model for two samples from
arbitrary exponential distributions. This is a generalized linear model (Section 6.5) with
link function
g(µ) = − log µ ,
since µ = 1/λ is the mean of E(λ). But the E(λ) family tends to be a poor model for lifetimes that experience wear, in view of its memoryless property: the conditional distribution
of T − t given T ≥ t is the same as the marginal distribution of T .
The E(λ) family is characterized by λ(t) ≡ constant. Cox (1972) proposed an important semiparametric generalization of this model. He introduced an unknown continuous,
positive baseline hazard rate λ(t) on [0, ∞) and then postulated the conditional hazard rate
of the distribution of the nonnegative continuous variable T given Z = z to be of the form
λ(t|z) = λ(t) exp{zT β} .
(9.1.12)
This Cox (β, λ) model is in widespread use throughout biostatistics in part because of the interpretation of the coefficients βj: By writing the model as log λ(t|z) = Σ_{j=1}^d βj zj + log λ(t), we see that βj is the change in the hazard on the log scale as zj is perturbed one unit.
More generally, we consider
λ(t|z) = λ(t)r(z, β) ,
(9.1.13)
where r > 0 is a known function. The model as specified by its conditional hazard rates and the marginal distribution of Z is evidently semiparametric. Although λ(·) is arbitrary, just as with the linear model (a) of Example 9.1.1, the model is far from covering all λ(t|z). It, in fact, contains a powerful assumption corresponding to the additive treatment effect assumption of the linear model: the relative effect of Z on the distribution of T is λ(t|z)/λ(t|0) = r(z, β)/r(0, β), which doesn't depend on λ or t. That is, (9.1.13) is a proportional hazard rate model.
The first question we ask is whether β is identifiable. We will give a full answer to this question later. For the time being, we note identifiability in an important special case. Suppose Z is discrete, Z ∈ {z0, . . . , zk}, P[Z = zj] > 0, 0 ≤ j ≤ k, and that (9.1.12) holds. Write rj(β) ≡ r(zj, β), let S(t) ≡ e^{−Λ(t)} be the survival function corresponding to λ, and assume r(z, 0) ≡ 1. Then

P[T > t | Z = zj] = S^{rj(β)}(t) ,   (9.1.14)

a model essentially proposed for the two-sample case by Lehmann. See Example 8.3.10. If Z is discrete as above and we observe (Z1, T1), . . . , (Zn, Tn), let Nj = Σ_{i=1}^n 1(Zi = zj) and let

Ŝj(t) ≡ (1/Nj) Σ_{i=1}^n 1(Ti > t, Zi = zj) , j = 0, . . . , k ,
be the empirical conditional survival functions. By the Glivenko-Cantelli theorem, as n → ∞,

sup_t |Ŝj(t) − S^{rj(β)}(t)| →P 0

and β is identifiable iff β → (r0(β), . . . , rk(β)) is 1–1. In fact, r0(β), . . . , rk(β) and thus β can be estimated √n consistently in many ways — for instance, by picking a value t such that 0 < Ŝ0(t) < 1 and estimating rj(β) by

r̂j,t = log Ŝj(t) / log Ŝ0(t) .   (9.1.15)

Which is the best choice of t? It turns out there is a "best" estimate of β in this model but it is not obtained from (9.1.15). We shall develop it in Section 9.3. ✷
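The estimate (9.1.15) can be tried out in a two-sample simulation under (9.1.14). A sketch (illustrative, with invented parameters: baseline S(t) = e^{−t}, so group 1 with survival S^r(t) = e^{−rt} is exponential with rate r):

```python
import numpy as np

rng = np.random.default_rng(4)
r, n = 2.0, 100_000

# Group 0: baseline S(t) = exp(-t); group 1: survival S^r(t), cf. (9.1.14).
T0 = rng.exponential(1.0, n)
T1 = rng.exponential(1.0 / r, n)     # Exp(rate r) has survival exp(-r t)

def r_hat(t):
    # (9.1.15): pick t with 0 < S0_hat(t) < 1, take the ratio of log survivals.
    S0 = (T0 > t).mean()
    S1 = (T1 > t).mean()
    return np.log(S1) / np.log(S0)

print([round(r_hat(t), 2) for t in (0.5, 1.0, 1.5)])   # each near r = 2
```

Different choices of t give different (all consistent) estimates, which is exactly the point of the closing question in the text.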
Example 9.1.5. The Independent component analysis (ICA) model. Principal components.
As we have seen in Appendix B.6 of Volume I, the multivariate Gaussian N (µ, Σ) model
can be written as
Xd×1 = AZ + µ
where Z is a vector of i.i.d. N (0, 1) variables and A and µ are unknown. We know that
this A may not be identifiable since Var (X) = AAT ≡ Σ and E(X) = µ identify the
distribution and A may have many more parameters than Σ and µ.
We will develop a particular choice of A which is identifiable by using the representation (see Appendix B.10)

Σ = OΛOT = Σ_{j=1}^d λj vj vjT

where O = (v1, . . . , vd) is orthogonal and d × d, and Λ is the diagonal matrix of eigenvalues λ = (λ1, . . . , λd) of Σ, with λ1 ≥ . . . ≥ λd > 0. See Problem 8.3.24.
Here λ1, . . . , λd have corresponding orthonormal eigenvectors v1, . . . , vd with |vj| = 1, vi ⊥ vj if i ≠ j, and

Σvj = λj vj , 1 ≤ j ≤ d .   (9.1.16)

If the vj are chosen so that the first nonzero element of vj is positive, they are uniquely defined if λ1 > . . . > λd > 0. If there are ties between some of the λ's, let λ1^(0) > . . . > λk^(0) be the ordered distinct eigenvalues; then the set of {vj : λj = λi^(0)}, 1 ≤ i ≤ k, is the identifiable parameter. Viewed in this way, O and Λ are identifiable and so is A = OΛ^{1/2}, and we have the ICA model

X = OΛ^{1/2} Z + µ .   (9.1.17)
The linear combinations v1T X, . . . , vdT X of the X's are the principal components of X, with vjT X representing the contribution of vj to the representation

X = µ + Σ_{j=1}^d (vjT (X − µ)) vj .

Here v1T Σ v1 = sup{Var(aT X) : |a| = 1}. That is, the vector a ≡ v1 provides the linear combination Σ ai Xi with the largest possible variance subject to |a| = 1. See Section B.10.1.2, Example 8.3.11 (continued), and Problem 8.3.24. Moreover, v2T Σ v2 has the same property among the set of vectors a orthogonal to v1, and so on.
There are a number of natural semiparametric generalizations of sampling from this
model: X1 , . . . , Xn i.i.d. Nd (µ, Σ). The most important appears to be X = AZ where
A is an unknown parameter and Z = (Z1 , . . . , Zd )T with the Zj independently distributed
and with the distribution Qj of Zj arbitrary. Thus, if Q = (Q1 , . . . , Qd ),
P = {P(A,Q) : P(A,Q) is the distribution of X = AZ with Q “arbitrary”} . (9.1.18)
It is easy to see that the Qj are identified at most up to location and scale. What is interesting, but certainly not obvious, is that if none of the Qj is Gaussian, then, as we shall see later, A is identified up to a permutation and scale change of its columns. This is equivalent to saying that any parameter q(A) such that q(AD) = q(A) and q(AΠ) = q(A) for all diagonal matrices D and permutation matrices Π is identifiable. In the engineering literature, algorithms estimating such parameters have proven very effective in a number of important problems — see Hyvarinen, Karhunen and Oja (2001). This methodology is referred to as independent component analysis (ICA). The simplest problem leading to ICA corresponds to the situation where Z1, . . . , Zd represent the output of independent sources which are superimposed when received by d different observers, X1, . . . , Xd. The task is to separately identify the signals coming from the sources. In practice, the Z1, . . . , Zd are often time series, as are the X1, . . . , Xd, and the Xj are observations taken at different times tj, j = 1, . . . , n. However, the methods developed in the i.i.d. case work in the time series situation as well. The only adjustment needed is to the estimates of the variance of our estimates of A. ✷
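The reason A is not identifiable in the Gaussian case is rotation invariance: replacing A by AR with R orthogonal leaves AAT, and hence the N(µ, AAT) distribution of X, unchanged. A one-line numerical check (illustrative sketch; the matrices are random choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
A = rng.normal(size=(d, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix

# X = AZ and X' = (AQ)Z have the same N(0, A A^T) distribution for
# Gaussian Z, since (AQ)(AQ)^T = A Q Q^T A^T = A A^T.
print(np.allclose((A @ Q) @ (A @ Q).T, A @ A.T))   # True
```

Non-Gaussianity of the Zj is exactly what rules out this family of equivalent mixing matrices, leaving only the permutation and scale ambiguity described in the text.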
We now turn to methods of estimation in semiparametric models.
9.1.2
Regularization. Modified Maximum Likelihood
There is a fundamental difficulty when trying to apply the maximum likelihood method
or the plug-in approach to some non- and semiparametric models. The difficulty occurs
for models P that do not contain the empirical probability Pb and where the parameter of
interest, say ν(P ), is not defined at Pb . In the case of the MLE, this means that the maximum
of the likelihood is not assumed. Here is an example.
Example 9.1.6. We illustrate with X1 , . . . , Xn i.i.d. P ∈ P = {All probabilities P with
continuous densities on R}. If we parametrize P by p, the density of P , then the likelihood
for an observed vector x = (x1 , . . . , xn )T is
Lx(p) = Π_{i=1}^n p(xi) .
Nothing prevents us from making p(xi) arbitrarily large for all i = 1, . . . , n. For instance, consider the densities pk in P with, for k = 1, 2, . . .,

pk(x) = (1/n) Σ_{i=1}^n (1/σk) φ((x − xi)/σk)

where φ is the standard normal density and σk → 0 as k → ∞. Thus,
sup_P Lx(p) = ∞

is not assumed for any p, and this approach does not yield an estimate of the infinite dimensional parameter ν(P) = p, nor of simple one dimensional parameters such as ν(P) = ∫ p^2(x) dx. ✷
In this example, it is intuitively clear that sequences {pk} which make Lx(p) → ∞ converge weakly to P̂ = n^{−1} Σ_{i=1}^n δ_{xi}, the empirical probability distribution, which does not
have a density. From a measure theoretic point of view, the members of P are dominated
by Lebesgue measure, but the “maximum likelihood” Pb is with probability 1 undominated.
Kiefer and Wolfowitz (1956) proposed an approach which they called nonparametric maximum likelihood which leads to choosing the empirical distribution as the estimate of P ,
when we enlarge P to include all discrete distributions. This approach only makes sense in
measure theoretic terms and we do not present it, but we shall discuss modified maximum
likelihood methods related to empirical likelihood (Owen (1988, 2001)) in Section 9.1.3.
Let P be an arbitrary specified semiparametric model and let M0 be the union of the
closure of P with the set of all discrete distributions. Let P̄0 be the closure of M0 under (say) weak convergence. Then, given observations x1 , . . . , xn of i.i.d. X1 , . . . , Xn ,
consider
Px = {P ∈ P̄0 : P{x1, . . . , xn} = 1, P corresponds to P ∈ P} ,

the set of all members of P̄0 with support {x1, . . . , xn} that satisfy the model restrictions of P. Now Px can be parametrized by

p = (p1, . . . , pn)T ∈ {p : pi = P[X = xi], 1 ≤ i ≤ n, some P ∈ Px} ,   (9.1.19)
a subset of the simplex in Rn . We identify Px with this set. In most applications, Px is
a smooth parametric model indexed by p(·; θ), θ ∈ Rd , d ≤ n, and maximum likelihood
over Px is of the usual type. That is,
θ̂ = arg max{ (1/n) Σ_{i=1}^n log p(xi; θ) : Σ_{i=1}^n p(xi; θ) = 1 } .
Here p(·; θ) is determined by the restrictions that define the distributions in P. Sometimes,
we can only require 0 < P {x1 , . . . , xn } < 1 with 1 − P {x1 , . . . , xn } being assigned to
∞. This happens, as we shall see, in censored data problems.
We define the empirical or modified likelihood estimate of P for the model P by

P̂e = arg max{Π_{i=1}^n pi : P ∈ Px} .   (9.1.20)

We will refer to P̂e as the nonparametric MLE (NPMLE). The NPMLE of a parameter ν(P) is defined by ν̂ = ν(P̂e) when ν(P̂e) exists.
Example 9.1.7. We illustrate (9.1.20) with P = {All P with continuous densities on R}.
Then P̄0 = M ≡ {All distributions on R}. When there are no ties among {x1 , . . . , xn },
Px is naturally parametrized by
Θ ≡ {(p1, . . . , pn−1) : pj ≥ 0, 1 ≤ j ≤ n − 1, Σ_{j=1}^{n−1} pj ≤ 1} .

We can write

Lx(p) = Π_{i=1}^n pi

with pn = 1 − Σ_{j=1}^{n−1} pj. Because this is a multinomial likelihood with one observation in each category, it is uniquely maximized over Θ by p1 = . . . = pn = n^{−1}. See Example 2.2.8. The NPMLE of P is indeed P̂, the empirical distribution. ✷
Remark 9.1.1. Ties. Categorical data. Ties occur when X is discrete, when there is
roundoff, and when data are collected in categories (e.g. age groups or geographic strata).
In the case of ties, we let {x1^(0), . . . , xm^(0)} be the distinct xi's or indicators of the categories {1, . . . , m} and let

Px = {P ∈ P̄0 : P{x1^(0), . . . , xm^(0)} = Σ_{j=1}^m pj = 1} .

This Px is parametrized by

p = (p1, . . . , pm)T ∈ {p : pj = P(X = xj^(0)), 1 ≤ j ≤ m, some P ∈ Px} .

The empirical likelihood of X1, . . . , Xn in the case of ties or categorical data is

Lx(p) = Π_{j=1}^m pj^{nj} ,
where nj = Σ_{i=1}^n 1(xi = xj^(0)). In the case of two categories (e.g. treatment and control), m = 2, p2 = 1 − p1, and Lx(p) = p1^{n1}(1 − p1)^{n−n1} is the Bernoulli likelihood.
For tied or categorical data, the estimate of P for the model P is

P̂e = arg max{Π_{j=1}^m pj^{nj} : P ∈ Px} .

When P̄0 is {all distributions on R} the solution (NPMLE) is p̂j = nj/n, 1 ≤ j ≤ m. See Problem 2.2.30. ✷
There are many cases where the NPMLE of a parameter θ can be obtained as the solution of a generalized estimating equation of the form Q(θ, Pb ) = 0, where θ solves
Q(θ, P ) = 0, and Q(θ, P ) is not necessarily linear in P . That is, Q may not satisfy
Q(θ, P) = ∫ v(θ, x) dP(x)
for any function v. Stratification is an example.
Example 9.1.8. Biased sampling. Stratification (Example 9.1.2). Here P is the collection of all P with densities Π_{i=1}^n pF(xi) where pF satisfies (9.1.2), that is,

Px = {(p1, . . . , pn) : pi = w(xi)fi / W(F)}

where W(F) = Σ_{i=1}^n w(xi)fi and the parameter of interest is θ = (f1, . . . , fn)T. Here (p1, . . . , pn) range freely over the n simplex. As we saw in Example 9.1.7, the maximum is attained for p̂i = n^{−1}, 1 ≤ i ≤ n, and, by inspection, a solution to p̂i = w(xi)fi/W(F) is

f̂i = w^{−1}(xi) (Σ_{k=1}^n w^{−1}(xk))^{−1} .
We can write this empirical likelihood estimate f̂e(x) of f(x) in terms of the nonparametric empirical probability P̂ as

dF̂e(x) = w^{−1}(x) dP̂(x) / ∫ w^{−1}(y) dP̂(y)

in agreement with (9.1.4).
We next consider the more general stratified model with X = (I, Y). Here

p(I,Y)(j, y) = λj wj(y)f(y) / Wj(F) , 1 ≤ j ≤ k, y ∈ R.
It turns out that Px is unsatisfactory since the observed x1 , . . . , xn force a coupling of the
values of I and Y which is incompatible with the model distribution. However, we can
define a modified likelihood naturally as follows:
Suppose wj(y) > 0 for all y. Then the possible support of a member of P which puts positive mass on all the observed (Ii, Yi), i = 1, . . . , n, is {1, . . . , k} × {y1, . . . , yn}. The point (a, yb) in the support of P ∈ P is assigned probability

p(a, yb) = λa wa(yb) fb / Wa(F) ;  Wa(F) = Σ_{b=1}^n fb wa(yb) .

The condition wj(y) > 0 for all j is sufficient to make F identifiable. The modified likelihood is

Π_a λa^{Na+} · Π_b fb^{N+b} · Π_{a,b} (wa(Yb)/Wa)^{Nab}

where Nab = 1(Ib = a), Na+ = Σ_{b=1}^n Nab, N+b = Σ_{a=1}^k Nab, Wa ≡ Wa(F).
The modified likelihood equations under the condition Σ_b fb = 1, introduced through a Lagrange multiplier γ, yield
λ̂a = Na+/n for a = 1, . . . , k ,

N+b/f̂b − Σ_{a=1}^k (Na+/Ŵa) wa(Yb) = γ .

Multiplying by f̂b and summing we get

n = γ + Σ_{a=1}^k (Na+/Ŵa) Σ_{b=1}^n wa(Yb) f̂b = γ + n .

Hence a solution is γ = 0 and

f̂b = N+b / Σ_{a=1}^k (Na+/Ŵa) wa(Yb) .   (9.1.21)
Then, for j = 1, . . . , k, summing wj(Yb)f̂b we obtain the following estimate of Wj(F):

Ŵj = Σ_{b=1}^n wj(Yb) N+b (Σ_{a=1}^k (Na+/Ŵa) wa(Yb))^{−1} .   (9.1.22)

If we let P̂1 be the empirical df of I and P̂2 that of Y, we can rewrite (9.1.22) as

Ŵj − ∫ wj(y) ( ∫ (wa(y)/Ŵa) dP̂1(a) )^{−1} dP̂2(y) = 0 .
This is an equation yielding a generalized estimating equation or generalized M estimate Ŵ ≡ (Ŵ1, . . . , Ŵk)T. That is, Ŵ = W(P̂) where P is the distribution of X, P̂ is the empirical probability, and W(P) solves Q(W, P) = 0. Here Q_{k×1} is the function

Q(W, P) ≡ W − ∫ w(y) ( ∫ (wa(y)/Wa) dP1(a) )^{−1} dP2(y)   (9.1.23)
where P is the joint probability distribution of (I, Y)T and P1, P2 are the respective marginals of I and Y. It may be shown, if 0 < P[δ = 0] < 1 (Problem 9.1.7), that Ŵ exists and is unique with probability tending to 1. Once we have computed Ŵ we have our NPMLE P̂e of P:

dP̂e(i, y) = λ̂i wi(y) dF̂e(y)/Ŵi ,
dF̂e(y) = dP̂2(y) ( ∫ (wa(y)/Ŵa) dP̂1(a) )^{−1} ,

and we can, in principle, estimate any parameter ν(P), P ∈ P, by ν(P̂e). ✷

Example 9.1.9. Censoring and truncation (Example 9.1.3). We develop the empirical likelihood theory for censoring. Introduce the hazard rates

λF = f/F̄ ,  λG = g/Ḡ ,

and note that we can write the joint density of (Y, δ) (for measure theory aficionados, with respect to the appropriate measure!) as
p(F,G)(y, δ) = λF^δ(y) λG^{1−δ}(y) exp{−(ΛF(y) + ΛG(y))}   (9.1.24)
To write the likelihood for Px , where Xi = (Yi , δi ), Yi = min(Ti , Ci ), and δi = 1(Ti ≤
Ci ), we take as possible values for Y the ordered Yi denoted by y(1) , . . . , y(n) . The corresponding values of δ are (δ(1) , . . . , δ(n) ), indicating whether Y(i) is censored or not. We
will use the discrete form of the hazard rates of T, C at y(i) calling them λF i , λGi .
Consider y(i) , δ(i) , i = 1, . . . , n, coming from Xi = xi , 1 ≤ i ≤ n, as fixed values,
and λF i , λGi as unknown parameters. Then, for X = (Y, δ) ∼ P ∈ Px , using (9.1.8) and
(9.1.9), and assuming P (Ti = Ci ) = 0, 1 ≤ i ≤ n,
pX(y(j), δ(j)) = P^δ[T = y(j)] P^δ[C ≥ y(j)] · P^{1−δ}[C = y(j)] P^{1−δ}[T ≥ y(j)]
= [λFj Π_{k=1}^{j−1} (1 − λFk)]^δ [Π_{k=1}^{j−1} (1 − λGk)]^δ · [λGj Π_{k=1}^{j−1} (1 − λGk)]^{1−δ} [Π_{k=1}^{j−1} (1 − λFk)]^{1−δ}   (9.1.25)
where we write δ for δ(j) . We make a further modification of the likelihood by noting that
under our assumptions the supports {t1 , . . . , tn } and {c1 , . . . , cn } of the empirical versions
of F and G are disjoint. Then, with εij ≡ 1(Yi = y(j) ) the log empirical likelihood for a
sample X1 , . . . , Xn is
lx(λF, λG) = Σ_{i=1}^n Σ_{j=1}^n εij {(δ(j) log λFj + (1 − δ(j)) log λGj) + Σ_{k=1}^{j−1} [log(1 − λFk) + log(1 − λGk)]} .   (9.1.26)
138
Inference in semiparametric models
Chapter 9
Let Y_(L) be the largest y_(i) with δ_(i) = 1. For i > L, λ_{Fi} does not appear in the sum. We have no information about the distribution of times larger than the largest uncensored time other than an estimate of the survival function at T_(L). For i ≤ L, we obtain after some algebra (Problem 9.1.14), if there are no ties, that is, Σ_{i=1}^n ε_{ij} = 1 for all j, that the maximizer of (9.1.26) is

λ̂_{Fi} = δ_(i) / (n − i + 1) .   (9.1.27)
If δ_(i) = 0, then P̂[T = y_(i)] = 0. Using (9.1.9), we arrive at the classical Kaplan–Meier estimate of the survival function S(y) = P(T > y) for y ≤ Y_(L),

Ŝ_KM(y) = Π{ (1 − δ_(i)/(n − i + 1)) : y_(i) ≤ y } .   (9.1.28)
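As a concrete illustration of (9.1.27)–(9.1.28), here is a minimal sketch assuming no ties; the function and variable names are ours, not the text's:

```python
# Kaplan-Meier sketch per (9.1.28), assuming no ties: the hazard at the i-th
# ordered time is delta_(i) / (n - i + 1), cf. (9.1.27).
def kaplan_meier(times, events):
    """Return [(y_(i), S_hat(y_(i)))] where events[i] = 1 means uncensored."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    surv, out = 1.0, []
    for rank, i in enumerate(order, start=1):      # rank plays the role of i
        surv *= 1.0 - events[i] / (n - rank + 1)
        out.append((times[i], surv))
    return out
```

For times (2, 3, 5, 7) with the time-3 observation censored, Ŝ drops to 0.75 at 2, stays at 0.75 through the censored time, then drops to 0.375 and finally 0.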
Remark 9.1.2.
1) When there are no ties, if P̂ corresponds to Ŝ_KM, then P̂[T > Y_(L)] = 0 iff L = n.
2) The formula (9.1.25) can be extended to the case where P is discrete to begin with.
That is, there are ties but we still suppose the supports of T and C are disjoint or, as Lawless
(1982) views it, apparent ties between censoring times and deaths are broken by moving
censoring times infinitesimally to the right.
3) If we replace (9.1.25) by its continuous approximation, (9.1.24), formally given by

p_X(y_(j), δ_(j)) = λ_{Fj}^δ λ_{Gj}^{1−δ} exp{ − Σ_{k=1}^{j} (λ_{Fk} + λ_{Gk}) } ,   (9.1.29)

we arrive at the same formula for λ̂_{Fj} (Problem 9.1.13), but a new expression,

Ŝ_NA(y_(j)) = exp{ − Σ_{k=1}^{j} λ̂_{Fk} }   (9.1.30)

as an estimate of S. This is the Nelson–Aalen estimate, which we will discuss further later.
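A sketch of (9.1.30) under the same conventions as the Kaplan–Meier illustration (no ties; names ours). Since exp(−λ) ≥ 1 − λ, the Nelson–Aalen survival curve never falls below the Kaplan–Meier curve:

```python
# Nelson-Aalen sketch per (9.1.30): S_NA(y_(j)) = exp(-sum_{k<=j} lambda_hat_Fk),
# with lambda_hat_Fk = delta_(k) / (n - k + 1) as in (9.1.27).
import math

def nelson_aalen(times, events):
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    cum_haz, out = 0.0, []
    for rank, i in enumerate(order, start=1):
        cum_haz += events[i] / (n - rank + 1)
        out.append((times[i], math.exp(-cum_haz)))
    return out
```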
To study identifiability of λ_F, λ_G, F, and G, consider P defined as all P of the form (9.1.24) and note that the unique maximizer over P of (Problem 9.1.13)

K(P_(F,G), P) ≡ ∫ log p_(F,G)(y, δ) dP(y, δ)

for G fixed and 0 < P[δ = 0] < 1 is given by

λ_F(P)(y) ≡ p_1(y) / P[Y ≥ y]   (9.1.31)

where

p_1(y) = (d/dy) P[Y ≤ y, δ = 1] .
An analogous formula holds for G. From (9.1.31) and its analogue for G, we get F, G such
that
P(F,G) = P .
Hence (F, G) given by (9.1.31) and its G analogue is the unique maximizer of K and we
have identifiability if 0 < P [δ = 0] < 1. Here the condition P [δ = 0] < 1 is necessary for
identifiability of F while p1 (y) is not defined if P [δ = 1] = 0. If P [δ = 0] = 0, then G
cannot be identified. Note that these identifiability results are density analogues of Problem
1.1.10, identifiability for the discrete case.
Truncation is easily handled by the same hazard rate parametrization:

p_(M,Y)(m, y) = g(m) (f(y)/F̄(m)) 1(y ≥ m) = g(m) λ_F(y) exp{Λ_F(m) − Λ_F(y)} 1(y ≥ m) .   (9.1.32)

The NPMLE makes λ̂_F(y) = 0 for y ∉ {y_1, ..., y_n}, and if y_(1) < ... < y_(n),

λ̂_F(y_(j)) = (N_j − j)^{−1}

where N_j = Σ_{k=1}^n 1(M_k ≤ y_(j)), provided N_j > j. That is, we can only estimate λ̂_F(y_(j)) for j ≥ J with J = first k such that y_(k) ≠ M_{l_k} where y_{l_k} ≡ y_(j). Equivalently, we can only estimate F(y) for y ≥ y_(J). We deduce that λ̂_F is defined with probability tending to 1 if F̄(y) > 0 for all y. Moreover, f can be identified by

λ_F(y) = p_Y(y) (P[M ≤ y] − P[Y ≤ y])^{−1}

and if G(m) > 0 for all m, g can be similarly identified. We can also obtain

F̄̂(y) = Π{ 1 − λ̂_F(y_(j)) : y_(j) ≤ y }   (9.1.33)

P̂[{y_(j)}] = λ̂_F(y_(j)) F̄̂(y_(j)) .   (9.1.34)

We leave details of these results to the reader (Problem 9.1.8). ✷
Although empirical likelihood gives simple answers for nonparametric censoring and truncation (survival analysis models with no covariates), it is much less satisfactory for more complicated models such as that of Cox because of the awkward formula (9.1.9). Simpler solutions are obtained by a natural approximation to the empirical likelihood, yet another modified likelihood, suggested by (9.1.6) but whose real foundations lie in counting process theory; see Aalen (1978), Andersen et al. (1993), and Kalbfleisch and Prentice (2002)(3). Replace, even in the discrete case, if T is a survival time, p(t) by λ(t) exp{−Λ(t)}, where Λ is the cumulative hazard rate. We illustrate the method in
Example 9.1.10. The Cox model (Example 9.1.4). We enlarge P of the proportional hazard
Cox model to include all distributions of (Z, T ) with T discrete and with hazard rates, given
Z = z,
λ(t|z) = r(z, β)λ(t)
where λ(t) and λ(t|z) are defined as in (9.1.8) and (9.1.13). Here Z has density (discrete
or continuous) h(·). We apply the empirical likelihood approach to this model. Given x1 =
(z1 , t1 ), . . . , xn = (zn , tn ), we can parametrize Px by θ ≡ (h1 , . . . , hn , λ1 , . . . , λn , β)
with
hi = h(zi ), λ(ti |zi ) ≡ λi r(zi , β), 1 ≤ i ≤ n ,
where hi ≥ 0, λi ≥ 0 : 1 ≤ i ≤ n, and β ∈ Rd . Using (9.1.9), we obtain the modified
likelihood,
L_x(θ) = Π_{i=1}^n h_i λ_i r(z_i, β) Π_j { (1 − r(z_i, β)λ_j) : t_j < t_i } .   (9.1.35)
We assume that h = (h_1, ..., h_n)^T is not related to (β, λ), in which case we can treat Π_{i=1}^n h_i as a constant c. See Example 6.2.1. Or we can replace h_j with its NPMLE ĥ_j = n_j/n from Remark 9.1.2, in which case Π ĥ_j is a constant c. Thus we replace Π_{i=1}^n h_i with c and θ with (λ, β) in what follows.
Our strategy will be to first fix β and find

λ̂(β) = arg max_λ L_x(β, λ) .

Next consider l_x(β) = L_x(β, λ̂(β)), which is called the profile empirical likelihood. The profile NPMLEs are now β̂ = arg max l_x(β), and λ̂(β̂). Maximizing L_x(θ) with respect to λ for β fixed, we get, solving ∂ log L_x(θ)/∂λ_k = 0,

1/λ_k = Σ_{i=1}^n [ r(z_i, β) / (1 − r(z_i, β)λ_k) ] 1(t_k < t_i) .
This equation is awkward. It has no explicit solution for λ_k, and computer solutions need to be produced for a large grid of β’s. However, if we replace

Π_j { 1 − r(z_i, β)λ_j : t_j < t_i }

in (9.1.35) by its continuous case approximation exp{−Λ(t_i|z_i)} from (9.1.6), that is,

Λ(t_i|z_i) ≅ r(z_i, β) Σ_{j=1}^n λ_j 1(t_j ≤ t_i) ,

we get the approximate empirical likelihood

L_x(θ) = c Π_{i=1}^n r(z_i, β) λ_i exp{ −r(z_i, β) Σ_{j=1}^n λ_j 1(t_j ≤ t_i) } .   (9.1.36)
From (9.1.36) and ∂ log L_x(θ)/∂λ_k = 0, we find that the maximizers of L_x(θ) are

λ̂_k(β) = ( Σ_{i=1}^n r(z_i, β) 1(t_i ≥ t_k) )^{−1} ,  1 ≤ k ≤ n .   (9.1.37)
By substituting λ̂_k(β) for λ_k in (9.1.36) and noting that

Σ_{i=1}^n r(z_i, β) Σ_{j=1}^n λ̂_j(β) 1(t_j ≥ t_i) = n ,

we find that the approximate profile likelihood is proportional to

L(β) ≡ L_x(β, λ̂(β)) = Π_{i=1}^n [ r(z_i, β) / ( n^{−1} Σ_{j=1}^n r(z_j, β) 1(t_j ≥ t_i) ) ] .   (9.1.38)
The profile empirical likelihood estimate of β based on the continuous case approximation (9.1.6) is now defined by

β̂ = arg max_β L(β) .

It coincides with the Cox (1972, 1975) partial likelihood estimate. See Example 9.1.11. The Cox estimates β̂_j, 1 ≤ j ≤ d, and their approximate standard errors are available in most statistical software packages.
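To make (9.1.38) concrete, a minimal sketch (our own illustration, with r(z, β) = exp{βz}, a scalar covariate, and no censoring or ties) that evaluates log L(β) and maximizes it by a crude grid search:

```python
# Illustrative profile likelihood (9.1.38) with r(z, beta) = exp(beta*z).
# The n^-1 inside the log only shifts log L by n*log n, so it does not
# change the maximizer.
import math

def cox_log_profile_lik(beta, z, t):
    """log L(beta) = sum_i [ beta*z_i - log( n^-1 sum_{j: t_j >= t_i} e^{beta*z_j} ) ]."""
    n = len(t)
    total = 0.0
    for i in range(n):
        risk = sum(math.exp(beta * z[j]) for j in range(n) if t[j] >= t[i])
        total += beta * z[i] - math.log(risk / n)
    return total

def cox_fit(z, t, grid=None):
    """Crude grid-search maximizer; a real routine would use Newton's method."""
    if grid is None:
        grid = [k / 100.0 for k in range(-300, 301)]
    return max(grid, key=lambda b: cox_log_profile_lik(b, z, t))
```

Because −log L(β) is convex for this choice of r (as argued below), the grid maximizer is a reasonable stand-in for the Cox estimate on toy data.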
We next introduce expressions that will be useful for identifiability and asymptotics. Rewriting (9.1.37) with t in place of t_k, we get

dΛ̂(t, β) = ( ∫ r(z, β) 1(s ≥ t) dP̂(z, s) )^{−1} dP̂_2(t)   (9.1.39)

where P̂_2 is the marginal empirical distribution of T and P̂(·, ·) is the empirical distribution of (Z, T). Given (Z, T) ∼ P, define P_1 and P_2 in general as the marginal distributions of Z and T. Let

S_0(t, β, P) = ∫ r(z, β) 1(s ≥ t) dP(z, s) = E_P r(Z, β) 1(T ≥ t) .   (9.1.40)
By taking logs in (9.1.38) and using (9.1.39), we see that maximizing L(β) as a function of β is equivalent to maximizing Γ(β, P̂) where

Γ(β, P) = E_{P_1} log r(Z, β) − E_{P_2} log S_0(T, β, P) .   (9.1.41)

Define

β(P) = arg max Γ(β, P) ;   (9.1.42)

then the estimate β̂ is

β̂ = β(P̂) .

Note that we can regard β̂ as an empirical plug-in estimate based on the generalized estimating equation Γ̇(β, P̂) = 0, where Γ̇ is the gradient with respect to β. This makes it easier to handle identifiability and asymptotics. See Section 9.2.
Identifiability of β means that, under the Cox model, if β_0 is true, −Γ(β, P) has a unique minimum at β = β_0, that is, it is a contrast function as defined in Section 2.1.1. This holds iff the map β → L_P(r(Z, β)) is 1–1 and P[S_0(T, 0, P) = c] < 1 for all c. A proof is sketched in Example 9.2.3 when r(z, β) = exp{β^T z}. We shall show that, under the identifiability conditions, −Γ(β, P) is strictly convex in β when r(z, β) = exp{β^T z} even when the Cox model does not hold. Thus, for this r, β(P) defined by (9.1.42) is a well defined parameter for general P. Similarly, for this r, −Γ(β, P̂) is strictly convex in β with probability tending to 1, so that β̂ is well defined. Note that Γ(β, P) is not linear in P, so that M estimation theory as given in Section 6.2.1 cannot be used directly for asymptotics. We return to this in Section 9.2.2. ✷
We next turn to a widely used modified likelihood.
Example 9.1.11. The Cox partial likelihood. Regression analysis for censored and tied data. Consider Example 9.1.9 with censoring, but now assume we have available a covariate vector Z. As before, let λ(t|z) be the conditional hazard rate given Z = z. We change notation and let t_1 < ... < t_m be the observed distinct failure times, leaving out censoring times.

The Cox partial likelihood Π_{i=1}^m q_i is a conditional empirical “likelihood,” where q_i is the probability that a failure occurs at time t_i, computed conditionally for those patients whose failure or censoring times are at least t_i, that is, for patients with y_j = min{t_j, c_j} ≥ t_i. It follows from (9.1.8) that, with R_i = {j : y_j ≥ t_i},

Π_{i=1}^m q_i = Π_{i=1}^m [ λ(t_i|z_i) / Σ_{j∈R_i} λ(t_i|z_j) ] = Π_{i=1}^n [ λ(y_i|z_i) / Σ_{j: y_j ≥ y_i} λ(y_i|z_j) ]^{δ_i}   (9.1.43)
provided that, given Z_i, 1 ≤ i ≤ n, C_1, ..., C_n are independent of T_1, ..., T_n. As pointed out by Cox (1972, 1975), Π_{i=1}^m q_i is not a “complete” likelihood, so the term “partial” is appropriate. The λ(t_i) term cancels in Π_{i=1}^m q_i under the proportional hazard assumption λ(t_i|z_i) = λ(t_i) r(z_i, β), and we have the Cox partial likelihood for the proportional hazard model:

Π_{i=1}^m q_i = Π_{i=1}^n [ r(z_i, β) / Σ_{j: y_j ≥ y_i} r(z_j, β) ]^{δ_i} .   (9.1.44)

When there is no censoring and there are no ties, this likelihood is equivalent to the modified profile likelihood (9.1.38). ✷
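A sketch of (9.1.44) with censoring, again with the illustrative choice r(z, β) = exp{βz} and scalar z (names ours, not the text's):

```python
# Cox partial log likelihood per (9.1.44): only uncensored times (delta_i = 1)
# contribute a factor r(z_i, b) / sum_{j: y_j >= y_i} r(z_j, b).
import math

def cox_partial_loglik(beta, z, y, events):
    n = len(y)
    total = 0.0
    for i in range(n):
        if events[i]:
            risk = sum(math.exp(beta * z[j]) for j in range(n) if y[j] >= y[i])
            total += beta * z[i] - math.log(risk)
    return total
```

At β = 0 each factor is 1/|risk set|; e.g. with y = (1, 2), both uncensored, the log likelihood is −log 2 − log 1 = −log 2.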
In the next subsection we formalize the approximation behind the modified likelihoods
(9.1.36) and (9.1.38).
9.1.3 Other Modified and Approximate Likelihoods
There are models where it is convenient to use discrete type modified likelihoods that do not require Σ p_i = 1. See, e.g., Murphy (1994), Murphy, Rossini and van der Vaart (1997), van der Vaart (1998), Murphy and van der Vaart (2000), Zeng and Lin (2007), and Kosorok (2008). We consider this approach but start with what we call the delta method likelihood, L_d. It leads to approximate likelihoods such as (9.1.36) and the Cox likelihood (9.1.44) in a natural way. This approach is based on a very close approximation to the ordinary likelihood and does not involve discrete type modified likelihoods.
Here is the justification of the delta method likelihood: Many semiparametric models can be written as models with parameters β ∈ R^d and η(·), and a likelihood that depends on β, η(·), and η′(·), where η′(y) ≥ 0. Then the likelihood of i.i.d. (Y_1, Z_1), ..., (Y_n, Z_n) can be expressed in terms of θ_(i), η(y_(i)), γ_(i) ≡ η′(y_(i)), and z_(i), where the y_(i)’s are the y_i’s ordered, z_(i) is the covariate vector corresponding to y_(i), and θ_(i) = r(z_(i), β) with r(·, ·) known. To simplify the likelihood, rewrite η(y_(i)) as the telescoping sum

η(y_(i)) = Σ_{j=1}^n η_(j) 1(y_(j) ≤ y_(i))

with

η_(j) = η(y_(j)) − η(y_(j−1)),   η(y_(0)) = 0 .
Now the parameters are β, γ, and η where γ = (γ_(1), ..., γ_(n))^T, η = (η_(1), ..., η_(n))^T. To eliminate γ, consider the delta-approximation η(y + d) ≅ η(y) + dη′(y), that is,

γ_(i) = η′(y_(i)) ≅ (η(y_(i)) − η(y_(i−1))) / (y_(i) − y_(i−1)) = η_(i)/d_(i)   (9.1.45)

where d_(i) = y_(i) − y_(i−1), y_(0) = 0. We substitute γ_(i) ≅ η_(i)/d_(i) in the likelihood and get an approximate likelihood L_d(β, η) that depends on β and η only, so that we have reduced the dimension of the parameter space. In the proportional hazard model, with η(y) = Λ(y), the resulting likelihood is equivalent to the approximate empirical likelihood (9.1.36) (Problem 9.1.15) and, with η replaced by λ̂ of (9.1.38), is equivalent to the Cox partial likelihood (9.1.44).
The authors referred to in the first paragraph define a similar approximate likelihood L_a(β, η) by replacing η(y) with the step function η[y] ≡ Σ_{j=1}^n γ_(j) 1(y_(j) ≤ y), with the γ_(j) being the basic parameters. That is, L_a can be obtained from L_d by setting the spacings d_(i) in L_d equal to one. The two approaches based on the delta likelihood and L_a are equivalent for estimation of β when all terms involving d_(i), 1 ≤ i ≤ n, do not contain any parameters. See Problem 9.1.16 for a case where they are not equivalent. L_d has the advantage that it is a very close approximation (Problem 9.1.24) to the actual semiparametric likelihood and it does not use discrete models where continuous models are natural.
Murphy and van der Vaart (2000) and Zeng and Lin (2007), among others, have shown under suitable assumptions that if we use the profile approach where we fix β and set

η̂(β) = arg max_η { L_a(β, η) } ,

the estimate

β̂ = arg max_β { L_a(β, η̂(β)) }

is asymptotically semiparametrically efficient in a sense to be introduced in Section 9.3. We call β̂ and η̂ ≡ η̂(β̂) the maximum approximate profile likelihood estimates (MAPLEs).
Example 9.1.12. Semiparametric transformation models. Consider the family of composite models where the df of Y given Z = z is a composite of a parametric df G(·; θ) and a nondecreasing function η(·). That is, consider the transformation model

F(y; θ, η) = G(η(y); θ),   θ = r(z, β) ∈ Θ ,   (9.1.46)

where β ∈ R^d, y ∈ A ⊂ R, and η(y) ∈ [η̲, η̄] with G(η̲; θ) = 0, G(η̄; θ) = 1, for all θ ∈ Θ. The proportional hazard model is of this form with G(t; θ) = 1 − exp{−θt} and η(y) = Λ(y) = −log(1 − F(y)) for an unknown baseline df F. More generally, interesting cases are G(t; θ) = G_1(θt), η(y) = G_1^{−1}(F(y)); and G(t; θ) = G_0(t − θ), η(y) = G_0^{−1}(F(y)); for specified df’s G_1 and G_0 (see Problem 9.1.17).
If η(·) has a derivative η′(y) > 0, and G has density g(v; θ), the exact likelihood is Π_{i=1}^n q_i with

q_i = η′(y_(i)) g( Σ_{j≤i} η_(j); θ_i )   (9.1.47)

where η_(j) = η(y_(j)) − η(y_(j−1)), η_(0) = 0, and θ_i = r(z_i, β). With the approximation η′(y_(i)) ≅ η_(i)/d_(i), L_d(β, η) is equivalent to Π_{i=1}^n p_i with

p_i = η_(i) g( Σ_{j≤i} η_(j); θ_i ) ,   (9.1.48)

where we do not require Σ_{i=1}^n p_i = 1. In this case, L_d(β, η) is constant in d_(i) = y_(i) − y_(i−1), 1 ≤ i ≤ n, and is equivalent to L_a(β, η). Moreover, L_a only depends on the ranks of y_1, ..., y_n, and thus the estimate of β also only depends on the ranks.
The MAPLEs are formally

β̂ = arg max_β Π_{i=1}^n η̂_i(β) g( Σ_{j≤i} η̂_j(β); r(z_i, β) ),   η̂ = η̂(β̂) ,

where

η̂(β) = arg max_η { Π_{i=1}^n p_i : η̲ < η_j < η̄, 1 ≤ j ≤ n } .

As an example, suppose G(y; θ) = L(y − θ) where L(t) is the logistic distribution [1 + exp(−t)]^{−1} and η(t) = log( F(t)/[1 − F(t)] ) for some baseline F; then the model (9.1.47) is the proportional odds model (Problem 9.1.17) and β̂ is the profile NPMLE of Murphy, Rossini and van der Vaart (1998). Existence, consistency, and asymptotic normality of the MAPLEs for this model are given in that paper. ✷
9.1.4 Sieves and Regularization
Empirical likelihood and its relatives are methods which give attractive answers in a number
of situations, but their scope is, in fact, limited. To see this, suppose we try to apply it to
the symmetric location model of Example 3.5.1. Here
P = {P : P has density p(x) = f (x − θ), f is arbitrary symmetric about 0, θ ∈ R} .
Unfortunately, P_x cannot be defined since, in general, there is no distribution concentrated on x which puts positive mass on all points and is also symmetric about some point.
Sieves
We next consider a conceptually very powerful approach, the method of sieves, introduced into statistics by Grenander (1981). In the method of sieves, the approach is to
approximate the model P by a sequence of regular parametric models {Pm } called a sieve
that in some sense converge to P as m → ∞. The distribution Pbm in the sieve that corresponds to the maximum likelihood estimate for Pm estimates P and provides a plug-in
estimate ν(Pbm ) of a parameter ν(P ) in a semiparametric model. It is sometimes convenient
to take for Pbm not maximum likelihood but minimum distance or some other estimate well
behaved when P ∈ Pm . These methods can be thought of as examples of regularization,
an approach first proposed by Tikhonov (1963) in the context of providing stable solutions
to “ill posed” differential equations. See Section 11.4.1.
The method of sieves is a method which, in principle, can be applied to any statistical
problem where we have available an i.i.d. sample with empirical probability Pb . Suppose,
given a model P for our data, we can determine P1 ⊂ P2 ⊂ . . . such that
(i) P ⊂ ∪_{j=1}^∞ P̄_j, where Ā denotes the closure of A in at least the sense of weak convergence.

(ii) P_j is a regular parametric model

P_j = { P_{θ^(j)} : θ^(j) ∈ Θ^(j) ⊂ R^{d_j} }

where {d_j} is a sequence strictly increasing to ∞ as j → ∞, with P_{θ^(j)} having density function p(·, θ^(j)) and the map θ^(j) → p(·, θ^(j)) smooth.
Let ρ(P, Q) be a measure of the discrepancy between two probabilities P and Q. For a given fitting criterion of the form

Π_j(Q) = arg min { ρ(P, Q) : P ∈ P_j } ,

the idea now is to find parameters η_j(P) ∈ Θ^(j) ⊂ ∪_{j=1}^∞ Θ^(j) such that if

Π_j(P) ≡ P_{η_j(P)} ∈ P_j

then

Π_j(P) → Π(P) = P as j → ∞
for P ∈ P, and

Π_j(P̂) = Π_j(P) + Δ_{jn}

where Δ_{jn} = o_P(n^{−1/2}) for all j. Here Π_j(P) − P can be thought of as contributing to bias and Π_j(P̂) − Π_j(P) as contributing to variance. Typically Δ_{jn} cannot be made uniformly small in j for all n.
In the method of sieves, we choose a model selection rule which chooses P_{J_n} as the “best” model according to some criterion (see Section I.7), and then act as if P_{J_n} were true. That is, with m = J_n and P̂ the empirical probability, we estimate P by

P̂_m ≡ Π_m(P̂) .

The estimate of a parameter ν(P) is then ν̂ ≡ ν(P̂_m).
The three elements of the method of sieves for estimating a parameter ν(P ), P ∈ P
are:
a) The choice of {Pj }.
b) The choice of η_j(P). Frequently the first thing tried is the maximum likelihood (Kullback–Leibler) discrepancy of Section 2.2.2,

η_j(P) = arg max { ∫ log p(x, θ^(j)) dP(x) : θ^(j) ∈ Θ^(j) } .

That is, for P ∈ P, η_j(P) is the parameter θ in the jth model that makes P_θ closest to P in the Kullback–Leibler sense.
c) The choice Jn of j.
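To fix ideas, here is a toy version of a)–c) for density estimation on [0, 1] (entirely our own illustration, not from the text: the sieve P_j is histograms with 2^j equal cells, η_j(P̂) reduces to cell proportions for the Kullback–Leibler fit, and J_n is picked by a crude split-sample log likelihood):

```python
# Hypothetical histogram sieve: P_j = histograms with 2**j equal cells on [0,1].
# The KL-best member given data is the cell-proportion histogram; j is chosen
# by a simple holdout log-likelihood rule (an illustrative choice of J_n).
import math

def histogram_fit(xs, bins):
    """Density heights of the KL-closest histogram with `bins` cells."""
    counts = [0] * bins
    for x in xs:
        counts[min(int(x * bins), bins - 1)] += 1
    n = len(xs)
    return [bins * c / n for c in counts]

def holdout_loglik(heights, xs):
    bins, eps = len(heights), 1e-12
    return sum(math.log(max(heights[min(int(x * bins), bins - 1)], eps))
               for x in xs)

def sieve_select(train, test, j_max=6):
    """Return the j in 1..j_max whose fitted histogram scores best on `test`."""
    return max(range(1, j_max + 1),
               key=lambda j: holdout_loglik(histogram_fit(train, 2 ** j), test))
```

The bias/variance reading above is visible here: small j over-smooths (bias), large j puts zero mass in empty cells (variance).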
Unfortunately, the applications of the method of sieves usually lead to technical difficulties which have to be resolved by delicate empirical process theory arguments; see Bickel, Klaassen, Ritov and Wellner (1993, 1998), van der Vaart and Wellner (1996), and van de Geer (2000) for theory and practice, examples, and references to the literature.
Regularization
Sieves are just a type of regularization. Consider the representation of maximum likelihood in terms of Kullback–Leibler discrepancy given in Section 2.2.2. We start by writing

(1/n) log L_x(p) = ∫ log p(x) dP̂(x) .

We saw in Example 9.1.6 that

arg max { ∫ log p(x) dP̂(x) : p corresponds to P ∈ P }   (9.1.49)
Section 9.1
Estimation in Semiparametric Models
147
is not always well defined. Yet we know by Shannon’s Lemma (2.2.1) that, quite generally, if P_0 ∈ P,

P_0 = arg max { ∫ log p(x) dP_0(x) : p corresponds to P ∈ P }

is well defined. This suggests that we consider approximate solutions of the problem (9.1.49) which are “nice” and so will be apt to be closer to the “nice” true P_0.
Sieves do this by restricting the model over which optimization is carried out. An alternative is to penalize solutions which are not “nice” by changing the quantity to be optimized. For instance, suppose P is the set of all densities p with a derivative p′ such that ∫ [p′(x)]² dx < ∞. It is natural to maximize

∫ log p(x) dP̂(x) − λ_n ∫ [p′(x)]² dx .   (9.1.50)
For λ_n > 0 the maximizer of (9.1.50) may be shown to be unique and necessarily has a density with ∫ [p′(x)]² dx < ∞. On the other hand, if λ_n → 0, it is intuitively clear that the maximizer of the population version of (9.1.50), where P̂ is replaced by P, should approximate the true p_0. This method is studied further in Chapter 11. It is, in fact, closely linked to the idea of using the posterior mode of a Bayes prior on P. It can also be viewed as a special case of the method of sieves where, however, the models {P_m} of the sieve are themselves infinite dimensional, P_m = { p : ∫ [p′(x)]² dx ≤ ε_m }. We discuss this further in Chapter 11 but note here the general principles. A review of regularization with discussion of various examples is given by Bickel and Li (2006).
Examples
We shall give an illustration of how sieves can be used in a, by now, classical example.
In this example we are using a straightforward extension to two sequences {j1 } and {j2 }.
Example 9.1.13. Partial linear model. Suppose X = (U, Z, Y ) and we postulate
Y = βZ + g(U ) + ε ,
(9.1.51)
where (U, Z) have unknown joint distribution H, ε ∼ N (0, σ 2 ) independent of (U, Z) and
g is arbitrary except for regularity conditions. We shall assume (U, Z) ∈ R2 , although
the model and results are readily extendable to vector U and Z and ε ∼ F arbitrary. This
model, introduced by Engle et al. (1986), arises if, for instance, Z takes on only a finite set
of values such as the levels of a treatment, but U is a real valued covariate of less intrinsic
interest, e.g. age. Thus ν(P ) = β is the parameter of interest. Rewrite the model (with a
new g) as
Y = β(Z − E(Z|U )) + g(U ) + ε
P ={P(β,g,σ2 ) : β ∈ R, σ 2 > 0, g arbitrary} .
Then, in analogy with ordinary linear regression, because

Y − E_P(Y|U) = β(Z − E_P(Z|U)) + ε

for P ∈ P, we can formally define a parameter

β(P) = E_P[(Z − E_P(Z|U))(Y − E_P(Y|U))] / E_P(Z − E_P(Z|U))²   (9.1.52)

such that

β(P_(β,g,σ²)) = β .
Now (9.1.52) clearly gives a valid definition of β(P ) if EP (Z 2 ) < ∞, EP g 2 (U ) < ∞,
and EP (Z − EP (Z|U ))2 > 0. Then P is identifiable provided EP (Y 2 ) < ∞, EP (Z 2 ) <
∞, and Z is not a function of U with probability 1. In fact, we shall argue that only
the condition 0 < EP (Z − EP (Z|U ))2 < ∞ is needed (Problem 9.1.12). Note that
we cannot define β(Pb) because EPb (Z|U ) and EPb (Y |U ) are not well defined. However,
b |U = ·)
it is natural to plug-in regularized estimates of these quantities, call them E(Y
b
and E(Z|U
= ·), to obtain a procedure which works. For simplicity, we shall use the
histogram density estimate applied to p(y|u), p(z|u). Assume that (U, Y ) has a density on
Define I_{jh} = (jh, (j + 1)h] for each axis of the square as in Example I.5, so that I^(1)_{j_1,h_1} × I^(2)_{j_2,h_2} form a partition of the square for 0 ≤ j_i ≤ 1/h_i, i = 1, 2. Let

p̂_n(y|u) = P̂[ Y ∈ I^(2)_{j_2,h_2}(y) | U ∈ I^(1)_{j_1,h_1}(u) ]

where I^(1)_{j_1,h_1}(u) is the partition interval that contains u and I^(2)_{j_2,h_2}(y) the one that contains y.
y. This is an application of the method of sieves where Pj1 ,j2 corresponds to the model
with all densities of (U, Y ) and (Z, Y ) constant on the rectangles described above and
{θ(j1 ,j2 ) } are the probabilities of these rectangles. Let h2 → 0. Strictly speaking we no
longer have a continuous case conditional density when h2 = 0, but rather the discrete
density assigning equal mass to all Yi such that Ui ∈ Ij1 ,h1 . However, the procedure turns
out to be satisfactory, yielding,
E_{P̂_n}(Y|U = u) ≡ Ê(Y|U = u) = [ Σ_{i=1}^n Y_i 1(U_i ∈ I_{j_1,h_1}(u)) ] / [ Σ_{i=1}^n 1(U_i ∈ I_{j_1,h_1}(u)) ] .   (9.1.53)
We next let h_1 → 0 and get a similar definition for Ê(Z|U = u). Now β̂ ≡ β(P̂_n) is obtained by plugging these Ê’s into (9.1.52). These are the plug-in estimates corresponding to the regularization given. We shall later study a simplified version of β̂. In this example, choosing η_j(P) amounts to choosing the bandwidths h_1 and h_2. See Chapters 11 and 12. See also Engle et al. (1986), Chen (1988), and Fan et al. (1998). ✷
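A toy numeric sketch of (9.1.52)–(9.1.53) (our own illustration, not the text's code: everything scalar, one fixed bandwidth h, and cell means standing in for the conditional expectations):

```python
# Hypothetical partial-linear sketch: histogram regression (9.1.53) for
# E(Y|U) and E(Z|U) on cells (jh, (j+1)h], then the plug-in slope (9.1.52).
from collections import defaultdict

def bin_means(u, v, h):
    """Cell-average estimate of E(V | U = u_i), returned per observation."""
    sums, counts = defaultdict(float), defaultdict(int)
    for ui, vi in zip(u, v):
        sums[int(ui / h)] += vi
        counts[int(ui / h)] += 1
    return [sums[int(ui / h)] / counts[int(ui / h)] for ui in u]

def partial_linear_beta(u, z, y, h):
    ez, ey = bin_means(u, z, h), bin_means(u, y, h)
    num = sum((zi - ei) * (yi - fi) for zi, ei, yi, fi in zip(z, ez, y, ey))
    den = sum((zi - ei) ** 2 for zi, ei in zip(z, ez))
    return num / den
```

When y_i = 2 z_i + g(u_i) with g constant within cells and no noise, the slope 2 is recovered exactly; with noise and shrinking h it is recovered approximately, which is the point of the regularization.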
Example 9.1.14. ICA (Example 9.1.5). The density of X in the independent component analysis (ICA) model (9.1.18) is

p(x; A, Q) = |A|^{−1} Π_{j=1}^d q_j(a^(j) x)   (9.1.54)

where a^(j) is the jth row vector of A^{−1}.
We begin with the problem of identifiability, following Kagan, Linnik and Rao (1973); see also Comon (1994). Suppose (A_1, Q_1), (A_2, Q_2) lead to the same distribution for X, where all components of Q_l, l = 1, 2, have mean 0 and variance 1. Without loss of generality (Problem 9.1.10), we can take A_2 = J, the d × d identity. For l = 1, 2, let Z_l = (Z_{1l}, ..., Z_{dl})^T be the vector of independent variables distributed as (Q_{1l}, ..., Q_{dl})^T. Consider the characteristic function of X = A_1 Z,

Ψ(t) = E[exp{it^T X}] = E[exp{it^T A_1 Z}] = Π_{k=1}^d ψ_k^(1)(a_k t)   (9.1.55)

where ψ_k^(l) is the characteristic function of Z_{kl}, l = 1, 2, and a_k is the kth row vector of A_1^T. Similarly, since A_2 is the identity,

Ψ(t) = Π_{k=1}^d ψ_k^(2)(t_k) .   (9.1.56)
(j)
(j)
Taking logs in (9.1.55) and (9.1.56) appropriately, we obtain if log ψk ≡ γk ,
d
X
(1)
[γk ](ak tT ) =
k=1
d
X
(2)
[γk ](tk )
k=1
for all t in a neighborhood of 0. Differentiating twice with respect to tb , tc , we obtain
d
X
(1)
[γk ]′′ (ak tT )abk ack = δbc
(9.1.57)
k=1
(1)
where ak = (a1k , . . . , adk ) and δbc is the Kronecker delta. By assumption, [γk ]′′ (0) = 1
for all k. Hence, we see that a1 , . . . , ad are orthonormal. We can deduce from (9.1.57) that
(1)
either [γk ]′′ (t) is constant, i.e. Zk is Gaussian, or ak is plus or minus one of the coordinate
vectors and conclude that, if at most one of the Zk is Gaussian, identifiability of A and γk ,
1 ≤ k ≤ d follows (Problem 9.1.10). This argument suggests that an estimate of A can be
constructed as follows: Let A∗ be the matrix obtained from A by scaling each column to
have norm 1 and drop the requirement that the Zk have variance 1. Let
Z
|t|2
∗
2
b − Πd ψk (a∗ tT , a(k)
b
Q(A ) = |ψ(t)
}dt
∗ )| exp{−
k=1
k
2
(k)
where ψb is the empirical characteristic function EPb [exp{itT X}] of X and ψk (·, a∗ ) is
(k)
(k)
the empirical characteristic function of a∗ X where a∗(k) and a∗ are the kth row vectors
of A∗ and [A∗ ]−1 . Chen (2003) shows that
is a
b∗ = arg min Q(A
b ∗)
A
√
n consistent estimate of A∗ under mild conditions.
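The basic ingredient of Q̂ is the empirical characteristic function; a one-dimensional sketch (our own illustration, with the contrast integral itself omitted):

```python
# Empirical characteristic function psi_hat(t) = n^-1 sum_i exp(i t x_i);
# the d-dimensional analogue of this is what enters the contrast Q_hat above.
import cmath

def ecf(xs, t):
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)
```

It always equals 1 at t = 0, and for the symmetric sample (−1, 1) it is real and equals cos t, which gives a simple sanity check.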
More significantly, we can also apply the method of sieves to obtain what turns out to be an efficient estimate of A*. The idea is to construct suitable estimates of q_k′/q_k(·, A*), the score functions of the densities of Z_1, ..., Z_d, assuming that A* is the truth. A by now standard approach (see Stone, Hansen, Kooperberg and Truong (1997)) is to approximate log q_k(·, A*) by a linear combination of basis functions, e.g. splines. That is, we approximate an arbitrary density q by a member of the exponential family

p(x, θ) = exp{ Σ_{k=1}^d θ_k w_k(x) − A_d(θ) }

where the linear span of the w_k is dense in the space of all functions for suitable metrics. See Section 11.4.2. Given the q̂_k(·, A*) we plug back into (9.1.54) and maximize the resulting likelihood in A*. This is an application of the method of sieves. Appropriate A*_n can be constructed and, despite technical difficulties, this approach works; see Chen (2003).
We next consider a model that is a rival to the Cox model (Reid (1994)).

Example 9.1.15. The accelerated failure time (AFT) model. Consider the model where a failure time T is an accelerated version of a baseline failure time T_0 in the sense that

T = αT_0 ;   T_0 ≥ 0 ,

where α = exp{−β^T Z}, T_0 corresponds to β = 0, and T_0 is independent of the vector Z of predictors. We can rewrite this model as

T e^{β^T Z} = T_0 .
For the case where Z = Z(t) is a time-dependent covariate vector, Cox and Oakes (1984, Section 5.2) and Zeng and Lin (2007) considered the model

∫_0^T e^{β^T Z(t)} dt = T_0 .   (9.1.58)

Let Λ(t|z) and Λ(t) denote the cumulative hazards of T and T_0; then

Λ(t|z) = Λ( ∫_0^t e^{β^T z(s)} ds ) .   (9.1.59)
As before, let C denote a censoring time, set δ = 1(T ≤ C), Y = min(T, C), and assume that C is independent of T given Z and that the distribution of C given Z does not involve β or Λ. Then the log likelihood for data {(Y_i, Z_i(Y_i), δ_i) : 1 ≤ i ≤ n} is proportional to (Problem 9.1.20)

l(β, λ) = n^{−1} Σ_{i=1}^n [ δ_i β^T Z_i(Y_i) + δ_i log λ(θ_i) − Λ(θ_i) ]   (9.1.60)
where λ(·) = Λ′(·) and

θ_i = ∫_0^{Y_i} exp{β^T Z_i(s)} ds .
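A sketch of θ_i for a scalar time-dependent covariate, by trapezoidal quadrature (our illustration; the covariate function z_fn and the step count are our assumptions, not from the text):

```python
# theta_i = int_0^{Y_i} exp(beta * z_i(s)) ds, approximated by the
# trapezoid rule for a scalar time-dependent covariate z_i(.).
import math

def theta(beta, z_fn, y, steps=1000):
    h = y / steps
    vals = [math.exp(beta * z_fn(k * h)) for k in range(steps + 1)]
    return h * (vals[0] / 2.0 + sum(vals[1:-1]) + vals[-1] / 2.0)
```

For a time-fixed covariate z(s) ≡ z the integral is Y e^{βz}, so (9.1.58) reduces to the time-independent relation T e^{β^T Z} = T_0 above.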
In this example, the profile approach where we fix β and maximize l(β, λ) with respect to λ(·) does not produce usable estimates (see Problem 9.1.20). If we formally pass to the limit as n → ∞ in (9.1.60), we obtain θ(Y) ≡ θ(Y, β) = ∫_0^Y exp{β^T Z(s)} ds. Note that T_0 = θ^{−1}(T, β). It is natural to consider θ(·, β) as a nuisance parameter and regularize the profile likelihood by replacing (d/dt) P[δ = 1, Y ≤ θ^{−1}(t, β)] by a kernel smoothed version which is then estimated empirically. See Chapter 11 for such estimates. The resulting estimates of β are, under regularity conditions, semiparametrically efficient in the sense of Section 9.3 (Zeng and Lin (2007)). ✷
Remark 9.1.3 (a). The proportional hazard and the AFT models coincide for the Weibull
model and the periodic hazard rate model. See Problem 9.1.21.
(b). The proportional quantile hazard rate and AFT models are equivalent. See Problem
9.1.22.
✷
Summary. We discuss in Section 9.1.1 several semiparametric models for a variety of common experimental frameworks including the symmetric location model, the linear model
with stochastic covariates and unknown error distribution and biased sampling models. In
the context of survival analysis we introduce the survival function, the hazard function, censoring, truncation, and the Cox proportional hazard model. We also consider the Independent Component Analysis (ICA) model where the basic variables are linear combinations of
independent but not identically distributed variables whose distributions are “arbitrary.” We
consider in Sections 9.1.2 and 9.1.3 maximum likelihood type methods for semiparametric
models that are based on modifications of the maximum likelihood principle. They are
usually based on approximating or replacing the arguments of the objective functions associated with the maximum likelihood (Kullback–Leibler divergence) with arguments that are “well behaved.” Success is achieved when √n consistent (generally efficient) estimates of Euclidean parameters are obtained when the empirical probability P̂ is plugged in for the population probability P.
nonparametric MLEs and the second method considered in Section 9.1.4 is the method of
sieves. We then illustrate these regularized maximum likelihood and sieve procedures in the
semiparametric models introduced in Section 9.1.1. We obtain some classical procedures
including the Kaplan-Meier, Nelson-Aalen, and Cox estimates in survival analysis.
9.2 Asymptotics. Consistency and Asymptotic Normality

In this section we establish a general consistency criterion and use it to prove consistency and asymptotic normality for some of the estimates we have suggested.
9.2.1 A General Consistency Criterion
We state a generalization of Theorem 5.2.3, which turns out to be broadly useful for proving consistency. Let ‖f‖_K ≡ sup{ |f(t)| : t ∈ K }, K ⊂ R^d, and let {Q_n(t) : t ∈ R^d} be a sequence of real valued stochastic processes such that

A1: Q_n(t) is convex as a function of t with probability 1.

A2: ‖Q_n − Q‖_K → 0 in probability as n → ∞ for all compact K ⊂ R^d, where the limit Q : R^d → R is a deterministic function.

A3: Q is strictly convex and continuous with lim{ Q(t) : |t| → ∞ } = ∞.

It may be shown (Rockafellar (1969), p. 74) that A2 follows from

A2′: Q_n(·) is separable and Q_n(t) → Q(t) in probability for all t.
Let
t0 ≡ arg min Q(t),
btn = arg min Qn (t) .
By convention, b
tn is chosen among the set of minimizers if argmin is not unique and b
tn ≡ 0
if the minimum is not assumed.
Proposition 9.2.1. If A1, A2, and A3 hold, then

(i) $t_0$ is uniquely defined;

(ii) $\hat{t}_n \xrightarrow{P} t_0$.

Proof. Condition A3 guarantees (i). For (ii), let $\epsilon > 0$ and note that "$|\hat{t}_n - t_0| > \epsilon$" implies that the minimizer of $Q_n(t)$ is in $\{t : |t - t_0| \ge \epsilon\}$. Thus
$$P(|\hat{t}_n - t_0| > \epsilon) \le P\big(\inf\{Q_n(t) : |t - t_0| \ge \epsilon\} \le Q_n(t_0)\big).$$
Now (ii) follows if we argue that, as $n \to \infty$,
$$P\big(\inf\{Q_n(t) : |t - t_0| \ge \epsilon\} > Q_n(t_0)\big) \to 1. \tag{9.2.1}$$
Note that by A3,
$$\inf\{Q(t) : |t - t_0| = \epsilon\} > Q(t_0).$$
Therefore, by A2, as $n \to \infty$,
$$P\big(\inf\{Q_n(t) : |t - t_0| = \epsilon\} > Q_n(t_0)\big) \to 1. \tag{9.2.2}$$
Next suppose $\hat{t}_n \in \{t : |t - t_0| > \epsilon\}$. Then there exists $\lambda \in (0, 1)$ such that
$$t_1 \equiv \lambda\hat{t}_n + (1 - \lambda)t_0 \in \{t : |t - t_0| = \epsilon\}.$$
By convexity of $Q_n$,
$$Q_n(t_1) \le \lambda Q_n(\hat{t}_n) + (1 - \lambda)Q_n(t_0) \le \lambda Q_n(t_0) + (1 - \lambda)Q_n(t_0) = Q_n(t_0).$$
It follows that $\hat{t}_n \in \{t : |t - t_0| > \epsilon\}$ implies that
$$\inf\{Q_n(t) : |t - t_0| = \epsilon\} \le Q_n(t_0).$$
But by (9.2.2), the probability of this event tends to zero. We have established (9.2.1). ✷
Proposition 9.2.1 is what we generally need for consistency arguments. For asymptotic normality, particularly in relation to modified likelihood, we shall use this consistency argument in conjunction with the generalized delta method of Section 7.1.
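A minimal numerical sketch of Proposition 9.2.1 (with illustrative choices not from the text): take $Q_n(t) = n^{-1}\sum_i (X_i - t)^2$, which is convex (A1), converges to the strictly convex deterministic limit $Q(t) = E(X - t)^2$ (A2', hence A2), and satisfies A3, so the minimizers should converge in probability to $t_0 = EX$.

```python
import numpy as np

rng = np.random.default_rng(0)

def t_hat(n):
    # Q_n(t) = (1/n) sum_i (X_i - t)^2 is convex in t with probability 1 (A1);
    # by the law of large numbers it converges pointwise to the deterministic
    # Q(t) = E(X - t)^2 (A2'), which is strictly convex with Q(t) -> infinity
    # as |t| -> infinity (A3) and argmin t0 = E X.
    x = rng.exponential(scale=2.0, size=n)  # E X = 2
    return x.mean()                         # closed-form argmin of Q_n

# the minimizers settle down at t0 = 2 as n grows, as the proposition predicts
for n in (10, 1000, 100000):
    print(n, t_hat(n))
```

Here the argmin is available in closed form; in less tidy convex problems the same conclusion holds for any measurable selection from the argmin set.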
9.2.2 Asymptotics for Selected Models

We begin with
Example 9.2.1. Biased sampling asymptotics (Examples 9.1.2 and 9.1.8). We consider the case of one population first. Let $\mathcal{X} = \mathbb{R}$ and recall
$$\hat{F}_e(x) = \int_{-\infty}^{x} d\hat{F}_e(y) = \frac{\int_{-\infty}^{x} w^{-1}(y)\,d\hat{P}(y)}{\int_{-\infty}^{\infty} w^{-1}(y)\,d\hat{P}(y)}, \tag{9.2.3}$$
where $\hat{P}$ is the NPMLE of $P$ and $P$ is assumed to have density $p_E(x) = w(x)f(x)/W(F)$. If $E_P(w^{-1}(X))^2 < \infty$, then, using Theorem 7.1.5 (Problem 9.2.1),
$$E_n(x) \equiv \sqrt{n}\,(\hat{F}_e(x) - F(x)) \Longrightarrow Z(x)$$
where
$$Z(x) = \frac{\int_{-\infty}^{x} w^{-1}(y)\,dW_P^0(y)}{E_P w^{-1}(X)} - \frac{\int_{-\infty}^{x} w^{-1}(y)\,dP(y)}{(E_P w^{-1}(X))^2}\int_{-\infty}^{\infty} w^{-1}(y)\,dW_P^0(y)$$
and $W_P^0(\cdot)$ is the Brownian bridge on the support of $f$ with respect to $P$.
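The one-population plug-in estimate (9.2.3) is just a ratio of inverse-weighted empirical sums. A sketch, with an illustrative weight $w(y) = 1 + y$ and target $F = \mathrm{Exp}(1)$ (so the biased density is $p_E(y) = (1+y)e^{-y}/2$, an equal mixture of Exp(1) and Gamma(2,1)); all names and the weight choice are assumptions for the demo:

```python
import numpy as np

def F_hat_e(y_obs, w, x):
    """Plug-in estimate (9.2.3): ratio of inverse-weighted empirical sums."""
    inv_w = 1.0 / w(y_obs)            # w^{-1}(Y_i) for each biased observation
    return inv_w[y_obs <= x].sum() / inv_w.sum()

# biased sample from p_E(y) = (1 + y) e^{-y} / W(F), W(F) = E(1 + Y) = 2
rng = np.random.default_rng(1)
n = 50000
mix = rng.random(n) < 0.5
biased = np.where(mix, rng.exponential(size=n), rng.gamma(2.0, 1.0, size=n))

for x in (0.5, 1.0, 2.0):
    # estimate vs. the true target F(x) = 1 - e^{-x}
    print(x, F_hat_e(biased, lambda y: 1.0 + y, x), 1.0 - np.exp(-x))
```

With this bounded $w^{-1}$ the moment condition $E_P(w^{-1}(X))^2 < \infty$ holds trivially, so the CLT of the example applies as well.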
More generally, for the stratified framework with several populations, it follows from Theorem 7.2.1 that if $E_P w^{-2}(Y, F) < \infty$, where
$$w(y, F) \equiv \sum_{j=1}^{k} \lambda_j \frac{w_j(y)}{W_j(F)},$$
then $\mathcal{W}(\hat{F})$ is well defined with probability tending to 1 and
$$\mathcal{W}(\hat{F}) = \mathcal{W}(F) + \int \Psi(x, F)\,d\hat{P}(x) + o_P(n^{-\frac{1}{2}}) \tag{9.2.4}$$
for $\Psi$ given by (9.1.23). It follows from (9.2.4) that if $\hat{F}_e(y) = \int_{-\infty}^{y} d\hat{F}_e(u)$, where $d\hat{F}_e$ is defined in Example 9.1.8, and
$$E_n(y) = \sqrt{n}\,(\hat{F}_e(y) - F(y)),$$
then
$$E_n(y) = n^{\frac{1}{2}}\Big\{\int_{-\infty}^{y} w^{-1}(u, F)\,d(\hat{F}_e - F)(u) - \int_{-\infty}^{\infty}\Big[\int_{a}^{\infty}\frac{w_a(u)}{W_a}\,dF(u)\,c^{-2}(u, p)\Big]d(\hat{P}_1 - P_1)(a) - \int\!\!\int \frac{w_a(u)}{W_a^2}\,\psi_a(x, P)\,dP_1(a)\,d(\hat{P} - P)(x)\Big\} + o_P(1) \tag{9.2.5}$$
by repeated application of the chain rule, where $x = (i, v)$, $\psi = (\psi_1, \ldots, \psi_k)$, and the second term comes from (9.2.4). These results are sketched in the problems. See also Gill, Vardi and Wellner (1988). ✷
Example 9.2.2. Asymptotics for the Kaplan-Meier and Nelson-Aalen estimates (Examples 9.1.3 and 9.1.9). The Kaplan-Meier estimate of the survival function can be approximated by the simpler Nelson-Aalen estimate, which is easier to analyze and is an instance of the modified empirical likelihood method. For $t < Y_{(L)}$, using $\log(1 - a) = -a + \frac{1}{2}a^2 + o(a^2)$,
$$\log \hat{S}_{KM}(t) = \sum \log\Big(1 - \frac{\delta_{(i)}}{n - i + 1}\Big)1(Y_{(i)} \le t) \tag{9.2.6}$$
$$= -\sum \frac{\delta_{(i)}}{n - i + 1}\,1(Y_{(i)} \le t) + O_P\Big(\sum_{i=1}^{n}\frac{\delta_{(i)}1(Y_{(i)} \le t)}{(n - i + 1)^2}\Big).$$
Recall from Theorem 7.2.3 that
$$\sup\Big\{\Big|Y_{(i)} - H^{-1}\Big(\frac{i}{n+1}\Big)\Big| : \frac{i}{n} \le 1 - \varepsilon\Big\} = o_P(1)$$
where
$$H(x) = 1 - \bar{F}(x)\bar{G}(x)$$
is the distribution function of $Y$. Since $t < Y_{(L)}$, $\bar{H}(t) \equiv 1 - H(t) > 0$, and hence
$$\sum_{i=1}^{n}\frac{\delta_{(i)}1(Y_{(i)} \le t)}{(n - i + 1)^2} = O_P\Big(\frac{1}{n}\Big). \tag{9.2.7}$$
Define
$$\hat{H}_1(t) = \frac{1}{n}\sum_{i=1}^{n}\delta_{(i)}1(Y_{(i)} \le t),$$
the (sub)empirical distribution function of the uncensored observations, let $\hat{H}(t)$ be the empirical distribution function of the $Y_i$, $1 \le i \le n$, and let $\hat{\bar{H}} = 1 - \hat{H}$. Then,
$$\sum_{i=1}^{n}\frac{\delta_{(i)}}{n - i + 1}1(Y_{(i)} \le t) = \int_{-\infty}^{t}\frac{d\hat{H}_1(y)}{\hat{\bar{H}}(y-)} \equiv \hat{\Lambda}_{NA}(t) \tag{9.2.8}$$
where $\hat{\Lambda}_{NA}$ is the Nelson-Aalen estimate of the cumulative hazard function. This leads to the Nelson-Aalen estimate of the survival function of $T$,
$$\hat{S}_{NA}(t) = \exp\{-\hat{\Lambda}_{NA}(t)\}. \tag{9.2.9}$$
From (9.2.6) and (9.2.7), it is clear that if $\bar{H}(\tau) > 0$, then
$$\sup\{|\hat{S}_{KM}(t) - \hat{S}_{NA}(t)| : t \le \tau\} = o_P(n^{-\frac{1}{2}}). \tag{9.2.10}$$
Using (9.2.10) and empirical process arguments from Chapter 7, we will show

Theorem 9.2.1. If $\bar{H}(\tau) > 0$ and $F$ is continuous, then, with a "$*$" subscript denoting either the KM or NA estimate,

(i) $\sup\{|\hat{S}_*(t) - S(t)| : t \le \tau\} = O_P(n^{-\frac{1}{2}})$.

(ii) As a stochastic process on $(-\infty, \tau]$,
$$\hat{S}_*(t) = S(t) + \int_{-\infty}^{t}\bar{H}^{-1}(y-)\,d(\hat{H}_1 - H_1)(y) - \int_{-\infty}^{t}(\hat{\bar{H}}(y-) - \bar{H}(y-))\bar{H}^{-2}(y-)\,dH_1(y) + o_P(n^{-\frac{1}{2}}).$$

(iii) $n^{\frac{1}{2}}(\hat{S}_*(\cdot) - S(\cdot)) \Longrightarrow Z(\cdot)$ as a stochastic process on $[0, \tau]$, where $Z$ is a mean 0 Gaussian process with
$$\mathrm{Cov}(Z(s), Z(t)) = \bar{K}(s)\bar{K}(t)(C(s \wedge t) - C(s)C(t)) \tag{9.2.11}$$
where $K = S/[1 + C]$ and
$$C(t) \equiv \int_{0}^{t}\frac{1}{\bar{H}(y-)}\,d\Lambda_F(y) = \int_{0}^{t}\frac{1}{\bar{H}(y-)\bar{F}(y-)}\,dF(y).$$
Proof. We prove (ii), which will imply (i) and (iii). By (9.2.7), it is enough to consider $\hat{S}_{NA}$. But
$$\hat{S}_{NA} - S = \exp\{-\hat{\Lambda}_{NA}\} - \exp\{-\Lambda_F\} = -(\hat{\Lambda}_{NA} - \Lambda_F) + O_P(\|\hat{\Lambda}_{NA} - \Lambda_F\|_\infty^2) \tag{9.2.12}$$
where $\|g\|_\infty \equiv \sup\{|g(t)| : -\infty < t \le \tau\}$. Further,
$$\hat{\Lambda}_{NA}(t) - \Lambda_F(t) = \int_{-\infty}^{t}\frac{d(\hat{H}_1 - H_1)(y)}{\bar{H}(y-)} - \int_{-\infty}^{t}(\hat{\bar{H}}(y-) - \bar{H}(y-))\bar{H}^{-2}(y-)\,dH_1(y) + \Delta_n(t)$$
where
$$\Delta_n(t) = -\int_{-\infty}^{t}\frac{(\hat{\bar{H}} - \bar{H})}{\hat{\bar{H}}\bar{H}}(y-)\,d(\hat{H}_1 - H_1)(y) + \int_{-\infty}^{t}\frac{(\hat{\bar{H}} - \bar{H})^2}{\hat{\bar{H}}\bar{H}^2}(y)\,dH_1(y).$$
Call the terms of $\Delta_n(\cdot)$ $\mathrm{Rem}_1$ and $\mathrm{Rem}_2$, respectively. Now, given Donsker's theorem for $\hat{H}(\cdot)$, we see that
$$\mathrm{Rem}_2 = O_P\big(\|\hat{\bar{H}} - \bar{H}\|_\infty^2\, H_1(\tau)\sup\{[\hat{\bar{H}}(y)\bar{H}^2(y)]^{-1} : y \le \tau\}\big) = O_P(n^{-1})$$
because $\inf\{\hat{\bar{H}}(y)\bar{H}^2(y) : y \le \tau\} = \hat{\bar{H}}(\tau)\bar{H}^2(\tau)$ and $\bar{H}(\tau) > 0$ by assumption.
For the other remainder term, write, using the probability integral transforms $v = \hat{H}_1(y)$ and $v = H_1(y)$,
$$\mathrm{Rem}_1 = \int_0^{\hat{H}_1(\cdot)}\frac{(\hat{\bar{H}} - \bar{H})(\hat{H}_1^{-1}(v)-)}{\hat{\bar{H}}\bar{H}(\hat{H}_1^{-1}(v)-)}\,dv - \int_0^{H_1(\cdot)}\frac{(\hat{\bar{H}} - \bar{H})(H_1^{-1}(v)-)}{\hat{\bar{H}}\bar{H}(H_1^{-1}(v)-)}\,dv$$
$$= \int_0^{H_1(\cdot)}\frac{\{(\hat{\bar{H}} - \bar{H})(H_1^{-1}(v)-) - (\hat{\bar{H}} - \bar{H})(\hat{H}_1^{-1}(v)-)\}}{\bar{H}^2(H_1^{-1}(v)-)}\,dv + o_P(n^{-\frac{1}{2}}).$$
By Donsker's theorem, $\mathrm{Rem}_1 = o_P(n^{-\frac{1}{2}})$ because $\|\hat{H}_1^{-1} - H_1^{-1}\|_\infty \xrightarrow{P} 0$ (Problem 9.2.4).
This completes (ii). The weak convergence to a mean zero Gaussian process in (iii) is again a consequence of Donsker's theorem after an integration by parts. Finally, (9.2.11) may be obtained after a fairly tedious computation; see Problem 9.1.16. In fact, to obtain (9.2.11) simply, we have to approach the problem from the point of view of counting processes. We refer to Aalen (1978), Andersen et al. (1993) and Shorack and Wellner (1986, p. 308) for such a development. ✷
Example 9.2.3. Cox estimate asymptotics (Examples 9.1.4 and 9.1.10). We would like to argue that under suitable conditions the estimate $\hat{\beta} = \arg\max \Gamma(\beta, \hat{P})$ from (9.1.41) is uniquely defined, consistent, and asymptotically normal. Although this can be done, it is easier to consider a slight modification: in order to avoid technical difficulties in the right tail of the distribution of $T$, we limit ourselves to observations $(Z_i, T_i)$ with $T_i \le \tau$ for some finite fixed constant $\tau > 0$. Here $\tau$ can be interpreted as a preassigned time limit in the study.

Recall that
$$S_0(t, \beta, P) = E_P\, r(Z, \beta)1(T \ge t)$$
and let
$$\Gamma_\tau(\beta, P) = \int_0^\tau -\log S_0(t, \beta, P)\,dP_2(t) + E\,1(T \le \tau)\log r(Z, \beta) \tag{9.2.13}$$
where $P_1$ and $P_2$ are the marginal probability distributions of $Z$ and $T$. Note that $\Gamma(\beta, P)$ as defined earlier in (9.1.14) is just $\Gamma_\infty(\beta, P)$. Suppose $r(z, \beta) = \exp\{\beta^T z\}$ and let
$$\beta(P) = \arg\max\{\Gamma_\tau(\beta, P)\}. \tag{9.2.14}$$
We will show that $\beta(P)$ is identifiable if we use this $r$ and make the assumptions

C: (i) $P[T = c(Z)] < 1$ for every function $c : \mathbb{R}^d \to \mathbb{R}$.
(ii) $E\,ZZ^T$ is nonsingular.
(iii) $T$ is continuous with a positive density on $(0, \infty)$.

We also show consistency and asymptotic normality of $\hat{\beta} \equiv \beta(\hat{P})$ under these assumptions. These results hold even when the Cox model for $X = (Z, T)$ is not satisfied. Here $\beta(P)$ can be thought of as the parameter of the distribution in the Cox model "closest" to $P$; see Section 2.2.2. Let $x = (z, t)$.
Theorem 9.2.2. Assume that $(Z, T)$, $T \ge 0$, has a joint probability distribution $P$ such that C is satisfied. Set $r(z, \beta) = \exp(\beta^T z)$. Then $\beta(P) = \arg\max \Gamma_\tau(\beta, P)$ is identifiable. Moreover, letting $\dot{S}_0$ and $\ddot{\Gamma}_\tau$ denote the gradient and Hessian of $S_0$ and $\Gamma_\tau$ with respect to $\beta$,

(i) $\hat{\beta}$ is uniquely defined with probability tending to 1 and $\hat{\beta} \xrightarrow{P} \beta(P)$ as $n \to \infty$.

(ii) $\hat{\beta} = \beta(P) + \int \psi(x, P)\,d\hat{P}(x) + o_P(n^{-\frac{1}{2}})$, where
$$\psi(x, P) = -\ddot{\Gamma}_\tau^{-1}(\beta, P)\gamma(x, P)$$
with an expression for $\ddot{\Gamma}_\tau(\beta, P)$ given in the proof and, writing $\beta = \beta(P)$,
$$\gamma(x, P) = \Big\{z - E[Z1(T \le \tau)] - \Big[\frac{\dot{S}_0}{S_0}(t, \beta, P) - \int_0^\tau \frac{\dot{S}_0}{S_0}(u, \beta, P)\,dP_2(u)\Big] - e^{\beta^T z}\int_0^t S_0^{-1}(u, \beta, P)\Big[z - \frac{\dot{S}_0}{S_0}(u, \beta, P)\Big]dP_2(u)\Big\}1(t \le \tau). \tag{9.2.15}$$

(iii) $n^{\frac{1}{2}}(\hat{\beta} - \beta) \Longrightarrow N(0, \Sigma(\beta, P))$ where
$$\Sigma(\beta, P) = \int \psi\psi^T(x, P)\,dP(x).$$
Proof. In view of Proposition 9.2.1, for identifiability and (i), it is enough to check that

(a) $\Gamma_\tau(\beta, \hat{P})$ is concave with probability 1.

(b) $\Gamma_\tau(\beta, P)$ is continuous and strictly concave and $\Gamma_\tau(\beta, P) \to -\infty$ as $|\beta| \to \infty$.

(c) $\|\Gamma_\tau(\beta, \hat{P}) - \Gamma_\tau(\beta, P)\| \xrightarrow{P} 0$.

We will establish (a) and (b) by calculating the gradient of $\Gamma_\tau(\beta, P)$,
$$\dot{\Gamma}_\tau(\beta, P) = -\int_0^\tau \frac{\dot{S}_0}{S_0}(t, \beta, P)\,dP_2(t) + E\,Z1(T \le \tau),$$
and the Hessian,
$$\ddot{\Gamma}_\tau(\beta, P) = \int_0^\tau S_0^{-2}\big(\dot{S}_0^2 - S_0\ddot{S}_0\big)(t, \beta, P)\,dP_2(t).$$
It follows that $\Gamma_\tau(\beta, P)$ is strictly concave if $\dot{S}_0^2 - S_0\ddot{S}_0$ is negative definite (Section B.9). We show this for the case $Z \in \mathbb{R}$. Recall that $S_0(t, \beta, P) = E_P[e^{\beta Z}1(T \ge t)]$. Let
$$U = [e^{\beta Z}1(T \ge t)]^{\frac{1}{2}}, \quad V = [Ze^{\beta Z}1(T \ge t)]^{\frac{1}{2}}, \quad W = Z[e^{\beta Z}1(T \ge t)]^{\frac{1}{2}};$$
then
$$\dot{S}_0^2 - S_0\ddot{S}_0 = (EV^2)^2 - (EU^2)(EW^2).$$
By Cauchy-Schwarz,
$$(EU^2)E(W^2) \ge \big(E(UW)\big)^2 = \big(E(V^2)\big)^2,$$
with equality iff $W^2 = a + bU^2$ for some $a$ and $b \ne 0$. Equality leads to the equation
$$(Z^2 - b)e^{\beta Z}1(T \ge t) - a = 0.$$
Taking expectation with respect to $\mathcal{L}(T \mid z)$ gives
$$P(T \ge t \mid z) = ae^{-\beta z}/(z^2 - b).$$
This is impossible unless $T = c(Z)$ for some function $c$, which is ruled out by C(i). Thus $\Gamma_\tau(\beta, P)$ is strictly concave. The condition $\Gamma_\tau(\beta, P) \to -\infty$ as $|\beta| \to \infty$ follows from strict concavity. The preceding Cauchy-Schwarz argument shows that $\Gamma_\tau(\beta, \hat{P})$ is concave in $\beta$ with probability one. So (a) and (b) hold. Note that these concavity results hold for all $\alpha \in [0, 1)$, where $\alpha = P(T \le \tau)$.
Finally, showing condition (c) is an exercise in empirical process theory. First check that
$$\sup\{|S_0(t, \beta, \hat{P}) - S_0(t, \beta, P)| : t \le \tau + \varepsilon\} \xrightarrow{P} 0$$
for $\varepsilon > 0$ sufficiently small and then apply the Glivenko-Cantelli theorem.

To obtain (ii), first note that the maximizer $\hat{\beta}$ of $\Gamma_\tau(\beta, \hat{P})$ satisfies $\dot{\Gamma}_\tau(\hat{\beta}, \hat{P}) = 0$. Since $\hat{\beta}$ is consistent, we can expand $\dot{\Gamma}_\tau(\hat{\beta}, \hat{P})$ around $\beta$, set the expansion equal to zero, and use the mean value theorem to obtain
$$\sqrt{n}(\hat{\beta} - \beta) = -\ddot{\Gamma}_\tau^{-1}(\beta^*, \hat{P})\sqrt{n}\,\dot{\Gamma}_\tau(\beta, \hat{P})$$
where $|\beta^* - \beta| \xrightarrow{P} 0$. In addition, we can show
$$\dot{\Gamma}_\tau(\beta, \hat{P}) = \dot{\Gamma}_\tau(\beta, P) + \int \gamma(x, P)\,d\hat{P}(x) + o_P(n^{-\frac{1}{2}}). \tag{9.2.16}$$
We establish (9.2.16) by using the infinite dimensional delta method of Section 7.2. That is, we compute the Gâteaux derivative of $\nu(P) \equiv \dot{\Gamma}_\tau(\beta, P)$ and show that it equals $\gamma(x, P)$. Then we show that $-\ddot{\Gamma}_\tau^{-1}(\beta, P)\gamma(x, P)$ satisfies condition (7.2.2) for being an influence function for $\hat{\beta}$. Some of the details are in Appendix D4.
Remark 9.2.1. The formula for $\gamma(x, P)$ is valid when $\tau = \infty$ (Tsiatis (1981), Andersen and Gill (1982)). ✷
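In practice $\hat{\beta}$ is computed by Newton iteration on the empirical criterion, which is concave under C. A sketch for a scalar covariate with $r(z, b) = e^{bz}$, dropping the $\tau$-truncation for simplicity; the function names and data-generating choices are illustrative, not from the text:

```python
import numpy as np

def cox_beta(y, z, delta, iters=15):
    """Newton iterations maximizing the empirical criterion of Example 9.2.3
    with r(z, b) = e^{bz}: l(b) = sum_{i: delta_i=1} [b z_i - log S0(y_i, b)],
    where S0(t, b) = sum_j e^{b z_j} 1(y_j >= t)."""
    b = 0.0
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in np.flatnonzero(delta):
            zr = z[y >= y[i]]             # covariates of subjects still at risk
            w = np.exp(b * zr)
            s0, s1, s2 = w.sum(), (w * zr).sum(), (w * zr**2).sum()
            score += z[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2   # > 0 under condition C
        b += score / info                 # Newton step on a concave criterion
    return b

# illustrative data from the Cox (beta0, lambda) model with lambda = 1, beta0 = 1
rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=n)
T = rng.exponential(np.exp(-z))           # hazard e^{beta0 z}, beta0 = 1
C = rng.exponential(2.0, size=n)
y, d = np.minimum(T, C), (T <= C)

print(cox_beta(y, z, d))                  # should be near beta0 = 1
```

The inner `info` accumulator is exactly the empirical analogue of $S_0^{-2}(\dot S_0^2 - S_0\ddot S_0)$ terms whose positivity the proof establishes via Cauchy-Schwarz.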
In the special case of the Cox model we have

Proposition 9.2.2. Suppose $(Z, T)$ has the Cox $(\beta_0, \lambda)$ distribution and that C holds. Then $\beta(P) = \beta_0$.

Proof. We have shown that $\dot{\Gamma}_\tau(\beta, P) = 0$ has a unique solution, so it remains to show that $\dot{\Gamma}_\tau(\beta_0, P) = 0$. We give the proof when $d = 1$; the general case is only notationally more difficult. Let $P_0$, $P_{10}$, and $P_{20}$ denote the joint and marginal probability distributions of $(Z, T)$ under the Cox $(\beta_0, \lambda)$ model (9.1.12), respectively. We need to show that $\beta_0$ satisfies the equation
$$\int_0^\tau \frac{\dot{S}_0}{S_0}(t, \beta_0, P_0)\,dP_{20}(t) = E_0\,Z1(T \le \tau),$$
where $E_0$ is expectation under $P_0$. By conditioning on $Z$ and using the iterated expectation theorem (B.1.20), we find
$$dP_{20}(t) = dE_0 P_0(T \le t \mid Z) = dE_0\big[1 - \exp\{-\Lambda(t)e^{\beta_0 Z}\}\big] = E_0\big[\lambda(t)e^{\beta_0 Z}\exp\{-\Lambda(t)e^{\beta_0 Z}\}\big]dt = S_0(t, \beta_0, P_0)\,d\Lambda(t).$$
By also conditioning on $Z$, we find
$$\dot{S}_0(t, \beta_0, P_0) = E_0\,Ze^{\beta_0 Z}1(T \ge t) = E_0\,Ze^{\beta_0 Z}P(T \ge t \mid Z) = E_0\,Ze^{\beta_0 Z}\exp\{-\Lambda(t)e^{\beta_0 Z}\}.$$
It follows that
$$\int_0^\tau \frac{\dot{S}_0}{S_0}(t, \beta_0, P_0)\,dP_{20}(t) = \int_{-\infty}^{\infty}\int_0^\tau ze^{\beta_0 z}\exp\{-\Lambda(t)e^{\beta_0 z}\}\,d\Lambda(t)\,dP_{10}(z) = \int_{-\infty}^{\infty} z\Big(\int_0^{v_\tau(z)} e^{-v}\,dv\Big)dP_{10}(z),$$
by making in the integral the change of variable $v = \Lambda(t)e^{\beta_0 z}$ and setting $v_\tau(z) = \Lambda(\tau)e^{\beta_0 z}$. Because $V = \Lambda(T)e^{\beta_0 z}$ given $Z = z$ has a standard exponential distribution (Problem 9.2.14), the right hand side is $\int_{-\infty}^{\infty} zP(T \le \tau \mid z)\,dP_{10}(z) = E(Z1(T \le \tau))$. ✷
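The key fact used at the end of the proof is easy to check by simulation. A sketch of Problem 9.2.14 with the illustrative choice $\lambda \equiv 1$, so $\Lambda(t) = t$:

```python
import numpy as np

# If T given Z = z has hazard lambda(t) e^{beta0 z}, then V = Lambda(T) e^{beta0 z}
# is standard exponential.  With lambda = 1 we have Lambda(t) = t, so we can
# check the first two moments of V directly by simulation.
rng = np.random.default_rng(4)
beta0, zval = 0.7, 1.3
T = rng.exponential(np.exp(-beta0 * zval), size=200_000)  # rate e^{beta0 z}
V = T * np.exp(beta0 * zval)                              # Lambda(T) e^{beta0 z}
print(V.mean(), V.var())                                  # both should be near 1
```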
Example 9.2.4. Partial linear model (Example 9.1.13). Write
$$\hat{\beta} - \beta(P) = \Big\{\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))[Y_i - \hat{E}(Y_i \mid U_i)] - \beta\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))^2\Big\}\Big\{\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))^2\Big\}^{-1}.$$
Since
$$\varepsilon_i = Y_i - E(Y_i \mid U_i) - \beta(Z_i - E(Z_i \mid U_i)),$$
we obtain
$$\hat{\beta} - \beta(P) = \frac{\sum_{i=1}^{n}[Z_i - \hat{E}(Z_i \mid U_i)]\varepsilon_i}{\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))^2} + \Big\{\sum (Z_i - \hat{E}(Z_i \mid U_i))(E(Y_i \mid U_i) - \hat{E}(Y_i \mid U_i)) + \beta\sum (Z_i - \hat{E}(Z_i \mid U_i))\{\hat{E}(Z_i \mid U_i) - E(Z_i \mid U_i)\}\Big\}\Big(\sum (Z_i - \hat{E}(Z_i \mid U_i))^2\Big)^{-1}. \tag{9.2.17}$$
Moreover,
$$\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))^2 = \sum_{i=1}^{n}(Z_i - E(Z_i \mid U_i))^2 + 2\sum_{i=1}^{n}(Z_i - E(Z_i \mid U_i))(E(Z_i \mid U_i) - \hat{E}(Z_i \mid U_i)) + \sum_{i=1}^{n}(\hat{E}(Z_i \mid U_i) - E(Z_i \mid U_i))^2. \tag{9.2.18}$$
It follows that
$$\hat{\beta} = \beta(P) + \frac{1}{n}\sum_{i=1}^{n}\frac{[Z_i - E(Z_i \mid U_i)]\varepsilon_i}{E(\mathrm{Var}(Z_i \mid U_i))} + o_P(n^{-\frac{1}{2}}) \tag{9.2.19}$$
provided that we can show
$$\frac{1}{n}\sum_{i=1}^{n}(\hat{E}(Z_i \mid U_i) - E(Z_i \mid U_i))^2 = o_P(n^{-\frac{1}{2}}) \tag{9.2.20}$$
$$\frac{1}{n}\sum_{i=1}^{n}(Z_i - E(Z_i \mid U_i))(E(Z_i \mid U_i) - \hat{E}(Z_i \mid U_i)) = o_P(n^{-\frac{1}{2}}) \tag{9.2.21}$$
$$\frac{1}{n}\sum_{i=1}^{n}(Z_i - \hat{E}(Z_i \mid U_i))(E(Y_i \mid U_i) - \hat{E}(Y_i \mid U_i)) = o_P(n^{-\frac{1}{2}}). \tag{9.2.22}$$
It's relatively easy to show such results under an artificial modification of the definition of $\hat{E}$, which is nevertheless useful conceptually. Suppose we divide the sample into $\{X_1, \ldots, X_{n-m}\}$ and $\{X_{n-m+1}, \ldots, X_n\}$, where $m \asymp n^\alpha$ with $\alpha < 1$ to be chosen later. Use $\{X_{n-m+1}, \ldots, X_n\}$ only to form the histogram estimates of $E(Y \mid U = u)$ and $E(Z \mid U = u)$. Call these $\hat{E}^{(2)}(\cdot \mid u)$. Then, define
$$\hat{\beta}^* = \frac{\sum_{i=1}^{n-m}(Z_i - \hat{E}^{(2)}(Z_i \mid U_i))(Y_i - \hat{E}^{(2)}(Y_i \mid U_i))}{\sum_{i=1}^{n-m}(Z_i - \hat{E}^{(2)}(Z_i \mid U_i))^2}. \tag{9.2.23}$$
The advantage is that, by the independence of $\hat{E}^{(2)}(\cdot \mid u)$ and $X_1, \ldots, X_{n-m}$, the analogues of (9.2.20)-(9.2.22) become fairly easy. Results, which we shall postpone to Chapter 11, enable us to conclude that then
$$E(\hat{E}^{(2)}(Y_1 \mid U_1) - E(Y_1 \mid U_1))^2 = O(m^{-\frac{2}{3}}) = O(n^{-\frac{2}{3}\alpha}) \tag{9.2.24}$$
and a similar result holds for $\hat{E}^{(2)}(Z_1 \mid U_1)$ under moment assumptions on $Z$ and $Y$, the smoothness of the densities of $(U, Z)$ and $(U, Y)$, and tail assumptions on the density of $U$. It is then easy to verify (9.2.20)-(9.2.22) and hence that (9.2.19) holds for $\hat{\beta}^*$. We leave the details to the problems and the validation of (9.2.19) without sample splitting to Chapter 11. The validity of (9.2.19) under different regularization estimates $\hat{E}$ is well known; see Chen (1988) and Fan et al. (1998), for instance. ✷
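The split-sample estimator (9.2.23) can be sketched directly, with the histogram (binned-average) conditional-mean estimates fit on the held-out piece. The data-generating model, bin counts, and helper names below are illustrative assumptions, not the text's:

```python
import numpy as np

def cond_mean_hist(u_train, v_train, bins):
    """Histogram estimate u -> E(V | U = u): average of v within each U-bin."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(u_train, edges) - 1, 0, bins - 1)
    sums = np.bincount(idx, weights=v_train, minlength=bins)
    cnts = np.bincount(idx, minlength=bins).clip(min=1)
    means = sums / cnts
    return lambda u: means[np.clip(np.digitize(u, edges) - 1, 0, bins - 1)]

# sketch of (9.2.23): beta0 = 2, g(u) = sin(2 pi u), U ~ Uniform(0, 1)
rng = np.random.default_rng(5)
n, beta0 = 20000, 2.0
U = rng.random(n)
Z = U + rng.normal(scale=0.5, size=n)          # Z depends on U
Y = beta0 * Z + np.sin(2 * np.pi * U) + rng.normal(size=n)

m = int(n ** 0.7)                               # m ≍ n^alpha split, alpha = 0.7
Ez = cond_mean_hist(U[n - m:], Z[n - m:], bins=int(m ** (1 / 3)))
Ey = cond_mean_hist(U[n - m:], Y[n - m:], bins=int(m ** (1 / 3)))

rz = Z[: n - m] - Ez(U[: n - m])                # residualize on the main sample
ry = Y[: n - m] - Ey(U[: n - m])
beta_star = (rz * ry).sum() / (rz * rz).sum()
print(beta_star)                                # close to beta0 = 2
```

The independence of the fitted conditional means from the main sample is exactly what makes the cross terms (9.2.20)-(9.2.22) easy in this modified scheme.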
9.3 Efficiency in Semiparametric Models

As we noted in Sections 5.4.3 and 6.2.2, in regular parametric models there is an asymptotic notion of a "best" (efficient) estimate of a parameter $\nu(P) \in \mathbb{R}$ (or $\mathbb{R}^d$) among broad classes of estimates of $\nu$ which are asymptotically Gaussian in some uniform way. For instance, we showed in Theorem 6.2.1 that, in a model $\mathcal{P} = \{P_\theta : \theta \in \Theta \subset \mathbb{R}^d\}$ which is "regular," among M estimates of $\theta$ which are "regular," the "unique" asymptotically best procedure is one which is equivalent to maximum likelihood to order $o_P(n^{-\frac{1}{2}})$. "Regular" in these cases corresponds to assumptions A0-A6 of Section 6.2.2 holding for the candidates' influence function $\Psi$ and for the optimal influence function $\Psi_{\mathrm{OPT}} = \dot{l}(\cdot, \theta)$, which corresponds to the MLE. Note that the conditions on $\Psi_{\mathrm{OPT}}$ correspond to smoothness conditions on the model. We also demonstrated through Example 5.4.2 (Hodges' example) that some conditions were always necessary.
The models considered in this subsection will include the important case in which the parameter $\nu$ is defined implicitly. That is, $\mathcal{P} = \{P_{(\nu,\eta)} : \nu \in \mathbb{R}^d, \eta \in \mathcal{H}\}$ where $\eta$ ranges over a function space and $\nu$ is defined by $\nu(P_{(\nu,\eta)}) = \nu$. For example, in the symmetric location model (3.5.1), $X \sim P_{(\nu,\eta)}$ with density
$$p(x, \nu, \eta) = \eta(x - \nu), \quad x, \nu \in \mathbb{R},$$
where $\eta$ is an arbitrary unknown symmetric density. Here $\nu(P_{(\nu,\eta)})$ is uniquely defined as, for instance, $\nu(P) = \int x\,dP(x)$, if $P$ has a first moment. Similarly, in the Cox proportional hazard model (9.1.12), $\nu$ is identified with $\beta$ and $\eta(\cdot)$ with $\lambda(\cdot)$. The partial linear model (9.1.51) has $\nu = \beta$ and $\eta(\cdot) = g(\cdot)$.
In this section we shall try to separate out assumptions on the model and on the candidate procedures and, using Hilbert space geometry, show how to extend results from Sections 5.4.3 and 6.2.2 to semiparametric models.
Regular Models
Suppose P is a general model for the distribution P of a random vector X. Consider
a parameter ν : P → R defined on P. Let Q be a one dimensional regular parametric
submodel of P containing a point P0 in its interior. That is,
(i) Q ⊂ P.
(ii) P0 ∈ Q.
(iii) Q = {Pt : |t| < 1} where Pt have continuous or discrete densities p(·, t).
Let $l(x, t) = \log p(x, t)$. We define a model $\mathcal{Q}$ to be regular if

(a) The map $t \to p(x, t)$ is continuously differentiable on $(-1, 1)$ for (almost) all $x$.

(b) $\dot{l}(x, t) \equiv \frac{1}{p(x,t)}\frac{\partial p}{\partial t}(x, t)$ is defined for (almost) all $x$, and for all $|t| < 1$,
$$0 < I(t) \equiv E_{P_t}[\dot{l}(X, t)]^2 = \int \dot{l}^2(x, t)p(x, t)\,d\mu(x) < \infty$$
where $d\mu(x)$ indicates sum or integral as defined in Section I.1; or, more generally, $\mu$ is a measure.

(c) The map $t \to I(t)$ is continuous.
Remark 9.3.1. We only consider submodels of the form (iii) that satisfy the conditions (a), (b), and (c). These conditions are stronger than the Hellinger differentiability conditions of Le Cam given in Bickel, Klaassen, Ritov and Wellner (1993, 1998) (abbreviated to BKRW), who show in Proposition 1, p. 13, that they imply the Le Cam conditions. They are easily seen to be implied by A0-A6 of Section 5.4.6 applied to $\psi(x, t) \equiv \dot{l}(x, t)$. ✷

Note that $l(x, t)$ and its derivatives depend on $\mathcal{Q}$ and its parametrization. That is, if $f(x, \tau) = p(x, t(\tau))$, where $\tau(\cdot)$ is one to one with inverse $t(\cdot)$, then $\nabla \log f$ at $\tau(t)$ is related to $\nabla \log p$ at $t$ by the chain rule. However, we will take advantage of the equivariance of the Fisher information bound. See Problems 3.4.3 and 9.3.2.
Regular Estimates

We shall say $\hat{\nu}$ is a regular estimate of $\nu(P)$ on $\mathcal{P}$ if

(i) $\hat{\nu}$ is consistent on $\mathcal{P}$.

(ii) $\hat{\nu}$ is uniformly locally asymptotically linear and Gaussian on all regular parametric submodels containing $P_0$, for all $P_0 \in \mathcal{P}$.

What we mean by (ii) is the following: Fix $P_0 \in \mathcal{P}$. Let $\mathcal{Q}$ be any regular one dimensional submodel of $\mathcal{P}$ containing $P_0$ and let
$$L_2^0(P_0) \equiv \Big\{h : \int h^2\,dP_0 < \infty \text{ and } \int h\,dP_0 = 0\Big\}.$$
Then, there exists $\psi(\cdot, P_0) \in L_2^0(P_0)$ such that for $X_1, \ldots, X_n$ i.i.d. as $X \sim P_0$,

(a) $\hat{\nu} = \nu(P_0) + \frac{1}{n}\sum_{i=1}^{n}\psi(X_i, P_0) + o_{P_0}(n^{-\frac{1}{2}})$;

(b) for any sequence $t_n \to 0$ such that $t_n = O(n^{-\frac{1}{2}})$,
$$\mathcal{L}_{P_{t_n}}\big(\sqrt{n}(\hat{\nu} - \nu(P_{t_n}))\big) \to N(0, \sigma^2(\psi, P_0)) \tag{9.3.1}$$
where $\sigma^2(\psi, P_0) = \int \psi^2(x, P_0)\,dP_0(x)$.

We call $\psi(\cdot, P_0)$ the influence function of $\hat{\nu}$ at $P_0$, in agreement with our previous convention in Section 7.2.1. Note that (b) is a weak uniformity condition: $\sqrt{n}(\hat{\nu} - \nu(P_t))$ converges to the same limit in law under $P_t$ as under $P_0$ as long as $t \to 0$ at rate $n^{-\frac{1}{2}}$.
Remark 9.3.2. Although conditions A0-A6 in Section 5.4.2 appear to pertain purely to $P_0$ and not to be uniform, they in fact imply that (9.3.1) holds via Le Cam's theory of contiguity; see Section 9.5. However, the definition of regularity given here is stronger than that of BKRW. ✷
Efficient Estimates

The following discussion sketches the derivation of a lower bound on the asymptotic variance of regular estimates, given in Lemma 9.3.1 below.

Consider any regular estimate $\hat{\nu}$ with corresponding influence function $\psi$ and any regular one dimensional submodel $\mathcal{Q}$ containing $P_0$. Note that $q(t) = \nu(P_t)$ is a parameter on $\mathcal{Q}$. Hence, we know that the asymptotic variance of $\sqrt{n}\hat{\nu}$ at $P_0$ is no smaller than the best we could do if we knew that the true $P$ belonged to $\mathcal{Q}$ rather than to $\mathcal{P}$. If $\hat{\nu}$ were an M estimate of $q(t)$, $P_t \in \mathcal{Q}$, and if $I(t)$ denotes the Fisher information for $P_t$, then this would mean that
$$\sigma^2(\psi, P_0) \ge [q'(0)]^2 I^{-1}(0) \equiv I^{-1}(P_0 : \nu, \mathcal{Q}) \tag{9.3.2}$$
by an extension of Theorem 5.4.3 to estimation of $q(t)$ rather than just to estimation of $t$ as in Proposition 3.4.2. Note that the definition of $I^{-1}(P_0 : \nu, \mathcal{Q})$ given at the end of (9.3.2) indicates that the bound does not depend on the parametrization of $\mathcal{Q}$ (Problem 9.3.2).

Suppose
$$q(t) = \nu(P_t) = \nu(P_0) + \int \psi(x, P_0)\,dP_t(x) + o(t) \tag{9.3.3}$$
and
$$\frac{\partial}{\partial t}\int \psi(x, P_0)\,dP_t(x)\Big|_{t=0} = \int \psi(x, P_0)\dot{l}(x, 0)\,dP_0(x). \tag{9.3.4}$$
Expression (9.3.3) says that the same influence function approximation is valid for the $\{P_t\}$ as for the empirical probability $\hat{P}$, and (9.3.4) that integration and differentiation can be interchanged. Together, (9.3.3) and (9.3.4) imply that
$$q'(0) = \frac{\partial\nu}{\partial t}(P_t)\Big|_{t=0} = \mathrm{Cov}_{P_0}(\psi(X, P_0), \dot{l}(X, 0)). \tag{9.3.5}$$
It may be shown (see Section 9.5) that, by Le Cam's contiguity theory, (9.3.5) is implied by regularity of $\mathcal{Q}$ and $\hat{\nu}$ as we have defined these. It follows that (9.3.2) can be extended to all regular $\hat{\nu}$, $\mathcal{Q}$. That is,

Lemma 9.3.1. For all $P_0 \in \mathcal{P}$ and all $\psi$ corresponding to regular $\hat{\nu}$ on $\mathcal{P}$,
$$\sigma^2(\psi, P_0) \ge \sup_{\mathcal{Q}}\{I^{-1}(P_0 : \nu, \mathcal{Q}) : \mathcal{Q} \subset \mathcal{P},\ \mathcal{Q} \text{ regular 1 dimensional containing } P_0\} \equiv I^{-1}(P_0 : \nu, \mathcal{P}). \tag{9.3.6}$$

Proof. By (9.3.5) and (9.3.2),
$$I^{-1}(P_0 : \nu, \mathcal{Q}) = \big[\mathrm{Cov}_{P_0}\big(\psi(X, P_0), \dot{l}(X, 0)/I(0)\big)\big]^2 I(0),$$
where $I(0) = E_{P_0}[\dot{l}(X, 0)]^2$ by (iii)(b). The result follows by Cauchy-Schwarz.

Note that $I^{-1}(P_0 : \nu, \mathcal{P})$ doesn't depend on any parametrization of $\mathcal{P}$. ✷
Definition 9.3.1. We call an influence function $\psi^*(\cdot, P_0)$, corresponding to a regular estimate $\hat{\nu}^*$ of $\nu(P)$, $P \in \mathcal{P}$, achieving the bound in (9.3.6), the efficient influence function for $\nu$ in $\mathcal{P}$ at $P_0$, and we call $\hat{\nu}^*$ an efficient estimate at $P_0$. This $\psi^*$ is often denoted by $\tilde{l}(\cdot, P_0 : \nu, \mathcal{P})$.

Adopting the terminology of Section 3.3, $\hat{\nu}^*$ is called asymptotically minimax over the class of regular one dimensional submodels $\mathcal{Q}$ of $\mathcal{P}$. We will find such $\hat{\nu}^*$ for the class of regular estimates, but it is true generally. See BKRW.
Next we sketch the derivation of a method for finding such $\hat{\nu}^*$, given in Proposition 9.3.1 below. Evidently, for $\psi^*$ as in Definition 9.3.1 and all $\psi$ corresponding to regular $\hat{\nu}$ on $\mathcal{P}$, we need to show that
$$\sigma^2(\psi^*, P_0) \le \sigma^2(\psi, P_0).$$
On any $\mathcal{Q}$ as above, by Theorems 5.3.3 and 5.4.3, any $\hat{\nu}^*$ regular on $\mathcal{Q}$, achieving $I^{-1}(P_0 : \nu, \mathcal{Q})$, is uniquely characterized by its influence function,
$$\psi^*(x, P_0) = \frac{q'(0)}{I(0)}\dot{l}(x, 0). \tag{9.3.7}$$
Note that $\psi^*(\cdot, P_0)$ satisfying Definition 9.3.1 cannot depend on the parametrization of $\mathcal{Q}$. We can show (Problem 9.3.2) that if $f(x, \tau) = p(x, t(\tau))$ as above with $t(0) = 0$ and $t'(\tau) > 0$, then $\dot{l}_f = t'(0)\dot{l}_p$. Thus if we let
$$\dot{\mathcal{Q}}(P_0) \equiv [\dot{l}(\cdot, 0)] = \{c\,\dot{l}(\cdot, 0) : c \in \mathbb{R}\}, \tag{9.3.8}$$
then, although $\psi^*$ and $\dot{l}(\cdot, 0)$ depend on the parametrization of $\mathcal{Q}$, $\dot{\mathcal{Q}}(P_0)$ does not. For a parameter $\nu$ corresponding to a regular estimate $\hat{\nu}$, we may rephrase (9.3.7) as:

"$\psi^*(\cdot, P_0)$ is the unique member of $\dot{\mathcal{Q}}(P_0)$ such that $E_{P_0}[\psi^*(X, P_0)]^2 = I^{-1}(P_0 : \nu, \mathcal{Q})$." (9.3.9)
Note that $\hat{\nu}^*$ corresponding to $\psi^*$ in (9.3.9) may not be regular or even consistent on $\mathcal{P}$. Suppose, for instance, that $\mathcal{P} = \{N(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}$. Let $P_0 = N(0, 1)$ and $\mathcal{Q} = \{N(0, \sigma^2) : \sigma^2 > 0\}$. Then $n^{-1}\sum_{i=1}^{n} X_i^2$ is an efficient estimate of $\sigma^2$ on $\mathcal{Q}$, but is inconsistent for all $P \in \mathcal{P}$ with mean $\ne 0$. However, if we can assume that $\hat{\nu}^*$ is regular on all of $\mathcal{P}$, we can conclude
Proposition 9.3.1. If $\hat{\nu}^*$ is a regular estimate of $\nu(P)$, $P \in \mathcal{P}$, and its influence function $\psi^*(\cdot, P_0)$ is in $\dot{\mathcal{Q}}(P_0)$ for some regular 1 dimensional $\mathcal{Q}$, then
$$\sigma^2(\psi^*, P_0) = I^{-1}(P_0 : \nu, \mathcal{P}),$$
so that $\hat{\nu}^*$ is efficient at $P_0$.

Proof. $\psi^*$ necessarily satisfies (9.3.9) for $\mathcal{Q}$ and is efficient for $\mathcal{Q}$. Thus,
$$\sigma^2(\psi^*, P_0) = I^{-1}(P_0 : \nu, \mathcal{Q}).$$
By definition,
$$I^{-1}(P_0 : \nu, \mathcal{P}) \ge I^{-1}(P_0 : \nu, \mathcal{Q}) = \sigma^2(\psi^*, P_0).$$
On the other hand, by Lemma 9.3.1, because $\hat{\nu}^*$ is regular on $\mathcal{P}$,
$$\sigma^2(\psi^*, P_0) \ge I^{-1}(P_0 : \nu, \mathcal{P}),$$
and the proposition follows. ✷
Adopting the terminology of Section 3.3, Q is asymptotically the least favorable regular
one dimensional submodel for estimating ν(P ) using regular estimates.
We next give a simple example to help interpret the new concepts and to show connections to concepts in Volume I.
Example 9.3.1. Suppose $X = (Z, Y)$ with $Z \in \mathbb{R}^d$, $Y \in \mathbb{R}$. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim P$ where
$$Y = Z^T\beta + \varepsilon.$$
Here $\varepsilon \sim N(0, \sigma_0^2)$, $\sigma_0^2$ is known, $Z$ and $\varepsilon$ are independent, and $E(ZZ^T)$ is nonsingular (cf. Examples 6.2.1 and 9.1.1). Let $\mathcal{P}$ be the class of distributions of $X$, $\beta \in \mathbb{R}^d$, and let $\mathcal{Q}_a$ be the class of distributions $P_{t,a}$ of $X$ with $\beta_t = \beta_0 + ta$, where $\beta_0 \in \mathbb{R}^d$ and $a \in \mathbb{R}^d$ are fixed and $t \in \mathbb{R}$ is the parameter. If $l_a$ is the log likelihood for $\mathcal{Q}_a$, then, as a function of $\beta_t$,
$$l_a(x, t) = -\frac{1}{2}\sum_{i=1}^{n}\big(y_i - z_i^T\beta_t\big)^2 + \text{constant},$$
$$\dot{l}_a(x, 0) = \sum_{i=1}^{n}(z_i^T a)\big(y_i - z_i^T\beta_t\big) \equiv \sum_{i=1}^{n}(z_i^T a)e_i,$$
$$I_a(0) = nE(a^T Z)^2.$$
If we let $c_j$ be the $d$-vector with 1 in the $j$th entry and 0 elsewhere, then the equations $\dot{l}_{c_j}(x, 0) = 0$, $j = 1, \ldots, d$, are the normal equations as in Example 2.1.1, and yield the MLE of $\beta$ as in Examples 6.1.2 and 6.2.1.

As in Example 6.2.1, consider the parameter
$$\nu = \nu(P) = \beta_1.$$
Let $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_d)$ be the MLE and set $\hat{\nu} = \hat{\beta}_1$. Then $\hat{\nu}$ is regular and, by (6.2.22), the influence function is $\psi_1(x, \beta)$ where $\psi_1$ is the first entry in $\psi = (\psi_1, \ldots, \psi_d)^T = E(ZZ^T)^{-1}Z^T e$. For this $\mathcal{Q}_a$ and this $\nu$, $q(t) = \nu(P_{t,a}) = \beta_{01} + a_1 t$; and
$$\psi^*(x, P_0) = \frac{a_1\sum_{i=1}^{n}(z_i^T a)e_i}{nE(a^T Z)^2}, \qquad I^{-1}(P_0 : \nu, \mathcal{Q}_a) = \frac{a_1^2}{nE(a^T Z)^2}.$$
Moreover, independently of $\nu$,
$$\dot{\mathcal{Q}}_a(P_0) = \Big\{c\sum_{i=1}^{n}(z_i^T a)e_i : c \in \mathbb{R}\Big\}.$$
Finally, using calculus (Problem 9.3.3), we find
$$\sup\{I^{-1}(P_0 : \nu, \mathcal{Q}_a) : a \in \mathbb{R}^d\} = \big[nE(ZZ^T)\big]^{-1}_{(1,1)}.$$
The right hand side coincides with the information bound in Example 6.2.1 and we can conclude that $\hat{\beta}_1$ is efficient and
$$I^{-1}(P_0 : \nu, \mathcal{P}) = \big[nE(ZZ^T)\big]^{-1}_{(1,1)}.$$
✷
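The calculus step of the example, that $\sup_a a_1^2/(a^T M a) = (M^{-1})_{11}$ for positive definite $M$ (here $M$ stands in for $nE(ZZ^T)$), can be checked numerically. A sketch with an arbitrary positive definite matrix (all names are illustrative):

```python
import numpy as np

# sup_a a_1^2 / (a^T M a) is attained at a = M^{-1} e_1 and equals (M^{-1})_{11}
rng = np.random.default_rng(6)
A = rng.normal(size=(3, 3))
M = A @ A.T + 3.0 * np.eye(3)              # a generic positive definite M

target = np.linalg.inv(M)[0, 0]
a_star = np.linalg.solve(M, np.eye(3)[0])  # the maximizing direction M^{-1} e_1
print(a_star[0] ** 2 / (a_star @ M @ a_star), target)   # equal

# a crude random search over directions never beats the bound and approaches it
a = rng.normal(size=(200_000, 3))
vals = a[:, 0] ** 2 / np.einsum('ij,jk,ik->i', a, M, a)
print(vals.max())
```

The attaining direction $a = M^{-1}e_1$ is the Cauchy-Schwarz equality case in the $M$-inner product, mirroring the least favorable one dimensional submodel interpretation.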
Remark 9.3.3. The $d$ dimensional case. These concepts and results extend readily from $\nu(P) \in \mathbb{R}$ to $\nu(P) \in \mathbb{R}^d$ by using Section 6.2. In this case, (9.3.6) becomes a matrix inequality, where $A \ge B$ for $d \times d$ matrices $A$ and $B$ means that $(A - B)$ is nonnegative definite. See Section B.10.2. Note that if we compute the influence function $\psi_j^*$ of each $\hat{\beta}_j$ in Example 9.3.1, then the influence function of $\hat{\beta}$ is the vector of influence functions $(\psi_1^*, \ldots, \psi_d^*)^T$. See Theorem 6.2.2 for the parametric case. ✷
Remark 9.3.4. It may be shown that, in parametric models $\mathcal{P}$ satisfying the conditions of Theorem 5.4.3 or 6.2.2, the MLE is efficient in the sense of Definition 9.3.1. For the 1 dimensional case this is clear, but for $\Theta \subset \mathbb{R}^d$, $d > 1$, it has to be shown that being best on all 1 dimensional submodels is equivalent to optimality in the full model (Problem 9.3.4). ✷
Remark 9.3.5. This method of approaching efficiency in the semi- and nonparametric
context is due to Stein (1956(a)). It is encompassed in a much more general context by
Le Cam (1986) and Le Cam and Yang (1990). For more background, see BKRW (Bickel,
Klaassen, Ritov and Wellner (1998)).
✷
In Subsection 9.1.2 we examined modified likelihood approaches to constructing estimates in semiparametric models. In this subsection, we check whether such estimates are efficient, and determine potential efficient influence functions in advance of constructing estimates we hope to be efficient.
Definition 9.3.2. The tangent set $\dot{\mathcal{P}}^0(P_0)$ of $\mathcal{P}$ at $P_0$ is the union of the set of all $\dot{\mathcal{Q}}(P_0)$ obtained by varying $\mathcal{Q}$ over all regular 1 dimensional submodels of $\mathcal{P}$ containing $P_0$. By definition, $\dot{\mathcal{P}}^0(P_0)$ is a subset of the set $L_2^0(P_0)$ of mean zero $L_2(P_0)$ functions.

Definition 9.3.3. The tangent space $\dot{\mathcal{P}}(P_0)$ of $\mathcal{P}$ at $P_0$ is the linear closure of $\dot{\mathcal{P}}^0(P_0)$ in $L_2^0(P_0)$, that is, the smallest closed linear subspace of $L_2^0(P_0)$ which contains $\dot{\mathcal{P}}^0(P_0)$. Thus $\dot{\mathcal{P}}(P_0)$ is the closure of the set of functions of the form $\sum_{j=1}^{m} c_j\psi_j$, $\psi_j \in \dot{\mathcal{P}}^0(P_0)$.

We use this notation generically, i.e. $\mathcal{P}$ can be any model and, in particular, any regular 1 dimensional $\mathcal{Q}$.
Note that $\dot{\mathcal{P}}(P_0)$ does not depend on any $\nu$. It is computed from $\dot{l}(\cdot, 0)$ over regular 1 dimensional parametric models. $\dot{\mathcal{P}}(P_0)$ will be used to find efficient estimates for semiparametric models containing both $d$ dimensional and function parameters.

Remark 9.3.6. The parameter set for $P_t$ was defined in (iii) as $(-1, 1)$. By writing $t = (t/\epsilon)\epsilon$, we see that we may use $(-\epsilon, \epsilon)$ as the parameter set for $t$. See the proof of Proposition 9.3.2 for a case where this is useful.

Here is a result that will help interpret $\dot{\mathcal{P}}(P_0)$:
Proposition 9.3.2. Suppose $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a $d$ dimensional parametric model with $l_0(\cdot, \theta) = \log f(\cdot, \theta)$, which satisfies A0-A6 of Theorem 6.2.2, and suppose $\Theta \subset \mathbb{R}^d$ is open. Then, if $P_0 \equiv P_{\theta_0}$, $\theta_0 \in \Theta$,
$$\dot{\mathcal{P}}(P_0) = \Big[\frac{\partial l_0}{\partial\theta_1}(x, \theta_0), \ldots, \frac{\partial l_0}{\partial\theta_d}(x, \theta_0)\Big],$$
where $[S]$ is the linear span of $S$, that is, the set of all linear combinations of elements in $S$.

Proof. The assumptions A0-A6 imply that for all $a \ne 0$, and $\epsilon$ sufficiently small, $t \to P_{\theta_0 + ta}$, $|t| < \epsilon$, is a regular one dimensional parametric submodel $\mathcal{Q}$ with $\dot{l}(\cdot, 0) = \sum_{j=1}^{d} a_j\frac{\partial l_0}{\partial\theta_j}(\cdot, \theta_0)$. Thus, $\big[\frac{\partial l_0}{\partial\theta_1}(\cdot, \theta_0), \ldots, \frac{\partial l_0}{\partial\theta_d}(\cdot, \theta_0)\big] \subset \dot{\mathcal{P}}(P_0)$. On the other hand, if $p(x, t) = f(x, \theta(t))$ is a one dimensional submodel, then
$$\dot{l}(x, 0) = \frac{\partial l_0(x, \theta(t))}{\partial t}\Big|_{t=0} = \sum_{j=1}^{d}\frac{\partial\theta_j}{\partial t}(0)\frac{\partial l_0}{\partial\theta_j}(x, \theta_0)$$
belongs to $\big[\frac{\partial l_0}{\partial\theta_1}, \ldots, \frac{\partial l_0}{\partial\theta_d}\big]$. ✷
This proposition makes the rationale for calling Ṗ the tangent space clear. If we think of
P as a smooth d dimensional manifold on the set of all probability distributions viewed as
a much higher dimensional space, then the tangent space at P0 is, as in the Euclidean case,
the linear space spanned by all tangents of smooth curves (1 dimensional regular models)
through P0 .
Let $\Pi(h \mid L)$ denote the projection in $L_2(P_0)$ of any $h$ in $L_2(P_0)$ onto the closed linear space $L \subset L_2(P_0)$, with inner product $(h_1, h_2) = \mathrm{Cov}_{P_0}(h_1(X), h_2(X))$ and norm $\|h\| = (h, h)^{\frac{1}{2}}$. That is,
$$\Pi(h \mid L) = \arg\min\{E_{P_0}[h(X) - g(X)]^2 : g \in L\}.$$
Such projections are exemplified in Section 1.4 and discussed in terms of Hilbert spaces in Appendix B.10.3. In particular, if $L$ is the space of functions of the form $a + bX$, Theorem 1.4.3 states that for $X$, $Y$ with $E(X) = E(Y) = 0$,
$$\Pi(Y \mid L) = \big[\mathrm{Cov}(X, Y)/EX^2\big]X = \big[(X, Y)/\|X\|^2\big]X.$$
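This projection formula can be checked by Monte Carlo least squares, since $\arg\min E(Y - a - bX)^2$ over the affine class is exactly the projection. A sketch (the data-generating choices are illustrative):

```python
import numpy as np

# For centered X and Y, the L2(P0) projection of Y on {a + bX} has slope
# Cov(X, Y)/E X^2 and intercept 0, even when Y depends on X nonlinearly.
rng = np.random.default_rng(7)
n = 400_000
X = rng.normal(size=n)
Y = 1.5 * X + 0.3 * (X**2 - 1) + 0.5 * rng.normal(size=n)  # E X = E Y = 0

# least squares over (a, b) approximates arg min E (Y - a - bX)^2
a_hat, b_hat = np.linalg.lstsq(np.c_[np.ones(n), X], Y, rcond=None)[0]
b_formula = np.cov(X, Y, bias=True)[0, 1] / np.mean(X**2)
print(a_hat, b_hat, b_formula)   # a near 0, b near b_formula = 1.5
```

Note the quadratic component of $Y$ is uncorrelated with $X$ here, so it lands entirely in the residual, which is the geometric content of the projection.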
The natural candidate for efficient influence function is given by
Proposition 9.3.3. Suppose Ṗ 0 (P0 ) is linear and ψ(·, P0 ) is the influence function of any
d dimensional regular estimate. Then, the efficient influence function is given by,
ψ ∗ (·, P0 ) ≡ e
l(·, P0 : ν, P) = Π(ψ(·, P0 )|Ṗ(P0 )) .
(9.3.10)
Proof. We claim that the result holds if P = Q, a regular 1-dimensional submodel, since by Theorem 1.4.3, (9.3.5), and (9.3.7),

Π(ψ(·, P0) | Q̇(P0)) = [(ψ(·, P0), l̇(·, 0)) / ‖l̇(·, 0)‖²] l̇(·, 0) = [q′(0)/I(0)] l̇(·, 0),

which we have shown is an efficient influence function for Q. We now write Q̇, Ṗ0 for Q̇(P0), Ṗ0(P0). Since l̃ is itself an influence function,

Π(l̃|Q̇) = Π(ψ|Q̇)

for all Q̇ ∈ Ṗ0. Since Ṗ0 is linear, it follows that

Π(l̃|Ṗ) = Π(ψ|Ṗ).
Section 9.3
169
Efficiency in Semiparametric Models
But, by assumption and what we have demonstrated,

‖l̃‖² = I⁻¹(P0 : ν, P) = sup_Q { ‖Π(l̃|Q̇)‖² : Q ⊂ P, Q regular 1-dimensional containing P0 }.

Thus, there exist Q̇m, m ≥ 1, such that

‖l̃‖² = lim ‖Π(l̃|Q̇m)‖²,

which by Pythagoras' theorem B.10.3.1 gives

‖l̃ − Π(l̃|Q̇m)‖² → 0.

Thus, l̃ ∈ Ṗ(P0) and the result follows.
✷
Remark 9.3.7. We assume that there exists a regular estimate with influence function
ψ(·, P0 ) for the parameter ν. Such estimates exist under mild conditions. See BKRW.
It is not hard to derive Theorem 3.4.3 from Proposition 9.3.3 and the standard formula
for projection onto a finite dimensional linear space (B.10.20) (Problem 9.3.5). This proposition and some others we are about to state are used in a semi-rigorous fashion. As we
shall see in the applications (called examples) to important models which follow, we calculate the structure of tangent spaces formally, usually ending up with linear spaces which
we can only be sure are subspaces of the tangent space. However, if we can show that the influence function of a regular estimate ν̂* belongs to the subspace we have identified, or that the projection of an arbitrary influence function on such a subspace is the influence function of a regular estimate ν̂*, we can conclude without further ado that ν̂* is efficient.
The following results given in BKRW are to be taken in that formal spirit also. That
is, equalities as stated are really true only under regularity conditions which are model
dependent.
Proposition 9.3.4. Additive property of the efficient score function. Suppose P = {P(g,h) :
g ∈ G, h ∈ H}, where G, H may be Euclidean, function spaces, or abstract. Let P0 =
P(g0 ,h0 ) and let
P1 (P0 ) ≡ {P(g,h0 ) : g ∈ G} ,
P2 (P0 ) ≡ {P(g0 ,h) : h ∈ H} .
Then,
Ṗ(P0 ) = Ṗ1 (P0 ) + Ṗ2 (P0 ) .
We have encountered parameters ν that are defined implicitly by equations of the form
ν(Pθ,h ) = θ. For instance, in linear regression or Cox regression the regression parameter
vector β is defined this way. Similarly, so is the location parameter in the location problem.
For such parameters there is a geometric way of obtaining efficient influence functions l̃ via orthogonal projections of the score function for θ on tangent spaces corresponding to nuisance parameters.
Proposition 9.3.5. The efficient influence function for implicitly defined parameters. Suppose G ≡ Θ open ⊂ R, and let lθ ≡ l(·, P(θ,h0)), θ ∈ Θ, be the log likelihood for θ. Suppose ν(P(θ,h)) = θ for all h ∈ H. Let P0 ≡ P(θ0,h0) and l̇ ≡ (∂lθ/∂θ)(·)|θ=θ0. Then,

l̃(·, P0 : ν, P) = [l̇ − Π(l̇ | Ṗ2(P0))] / ‖l̇ − Π(l̇ | Ṗ2(P0))‖²,   (9.3.11)

I⁻¹(P0 : ν, P) = ‖l̇ − Π(l̇ | Ṗ2(P0))‖⁻².   (9.3.12)

✷
Note that l̃ is a normalized version of the projection of l̇ onto the orthocomplement of the nuisance parameter tangent space. The normalization ensures that (l̃, l̇) = 1, which is a requirement of influence functions for ν; see BKRW (1998) for details.
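As a toy numerical illustration of (9.3.11) — an illustrative model, not one of the text's examples — take P(θ,σ) = N(θ, σ²) with σ the nuisance parameter. The θ-score and the σ-score are orthogonal, so projecting and normalizing should return l̃(x) = x − θ, the influence function of X̄.

```python
import numpy as np

# Toy check of (9.3.11) in N(theta, sigma^2) with sigma as nuisance.
rng = np.random.default_rng(7)
theta, sigma = 0.0, 1.5
x = rng.normal(theta, sigma, 1_000_000)

score_theta = (x - theta) / sigma ** 2                   # l-dot for theta
score_sigma = (x - theta) ** 2 / sigma ** 3 - 1 / sigma  # nuisance score

# project l-dot on the (here one-dimensional) nuisance direction
proj_coef = np.mean(score_theta * score_sigma) / np.mean(score_sigma ** 2)
resid = score_theta - proj_coef * score_sigma            # l-dot minus projection
l_tilde = resid / np.mean(resid ** 2)                    # normalize as in (9.3.11)

# the scores are orthogonal here, so l_tilde should reduce to x - theta
print(np.max(np.abs(l_tilde[:5] - (x[:5] - theta))) < 0.05)
```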
The tangent space of a nonparametric model
Proposition 9.3.6. Let M0 be a model that contains the collection of all probabilities on
R with continuous bounded densities. Then, Ṁ0 (P0 ) = L02 (P0 ).
Before proving this proposition, we note its satisfying implication.
Corollary 9.3.1. If ν : M0 → R is a parameter for which a regular estimate ν̂ exists, then ν̂ is efficient.

Proof: The result is immediate from Propositions 9.3.3 and 9.3.6, since every influence function ψ(·, P0) ∈ L⁰₂(P0) and is its own projection.
✷
For instance, if ν = g(µ), µ = E(X), then ν̂ = g(X̄) is nonparametrically efficient for the model M0 provided g is differentiable at µ. See Theorem 5.3.3.
Proof of Proposition 9.3.6. It is enough to show that Ṁ0(P0) contains all bounded h such that ∫ h(x) dP0(x) = 0. Given such an h, let p(x, t) = exp{t h(x) − A(t, h)} p(x, 0), where A(t, h) = log ∫ e^{t h(x)} dP0(x). Then p(·, t) is a canonical exponential family and a regular submodel with

l̇(x, 0) = (∂/∂t) log p(x, t)|_{t=0} = h(x) − (∂A/∂t)(0, h).

But by Corollary 1.6.1,

(∂A/∂t)(0, h) = ∫ h(x) dP0(x) = 0.

Thus l̇(x, 0) = h(x) for every bounded h with ∫ h(x) dP0(x) = 0, and the result follows.
✷
Examples
We finish this section by obtaining candidate efficient influence functions for some of
our examples, seeing whether we have obtained estimates efficient at particular P0 or on all
Section 9.3
Efficiency in Semiparametric Models
171
of P. As we have noted, we usually do not check that we have computed the full tangent space but rather identify a subspace that gives the structure of what elements of the tangent space should look like. Then we check whether an estimate ν̂ obtained by some criterion such as likelihood or modified likelihood has an influence function that belongs to the subspace we have identified. The subspace is computed using the functional delta methods of Section 7.2.
Example 9.3.2. The semiparametric linear model (Example 9.1.1). Tangent spaces and
influence functions. As we have noted we usually do not check that we have computed the
full tangent space but rather identify the structure of what elements of the tangent space
should look like. We consider two models that correspond to assumptions and notation
given earlier in Example 9.1.1 as (a) and (b) (with σ ≡ 1, variability in σ is absorbed into
the parameter f below).
Model (a). Here,
p(z, y; α, β, h, f ) = h(z)f (y − α − β T z) .
Suppose P0 corresponds to the parameter value (α0, β0, h0, f0) and consider formally the 1-dimensional models αt = α0 + tΔα, βt = β0 + tΔβ, log ht = log h0(z) + tΔh(z), log ft = log f0(y) + tΔf(y). Formally, if Pt corresponds to the parameter value (αt, βt, ht, ft) and has density pt(x) with x = (zᵀ, y)ᵀ, then

(∂/∂t) log pt(x)|_{t=0} = Δh(z) + Δf(ε) − (Δα + Δβᵀ z)(f0′/f0)(ε),   (9.3.13)

where ∫ Δh(z) h0(z) dz = ∫ Δf(ε) f0(ε) dε = 0 and ε ≡ y − α0 − β0ᵀ z. Thus, we expect that

Ṗ(P0) = { a(Z) + b(ε) + [Z (f0′/f0)(ε)] + [(f0′/f0)(ε)] },

where a, b are arbitrary functions in L2(P0). Now the LSE β̂ has influence function

ψ(x, P0) = [Var(Z)]⁻¹ (Z − E(Z)) ε   (9.3.14)

which belongs to Ṗ(P0) if (f0′/f0)(ε) = cε for some c ≠ 0, that is, if f0 corresponds to a mean 0 Gaussian distribution.
It follows from our analysis in Example 6.2.2 that, if f0 is true, the efficient influence
function corresponds to the MLE obtained by assuming f0 is true and known. Since we
don’t know f0, where does this leave us? It turns out that this is a situation where we can adapt in the sense of Example 6.2.1. That is, it is possible to construct estimates β̂ such that

β̂ = β + (1/n) Σ_{i=1}^{n} [Var(Z)]⁻¹ (Zi − E(Z)) (−(f0′/f0)(εi)) + oP(n^{−1/2})   (9.3.15)
by estimating f0 properly. Since (9.3.15) is how the best estimate of β behaves if we knew
f0 (but not α), this is the best we can hope to do. We refer to Chapters 11 and 12 of this
volume as well as BKRW and van der Vaart (1998) for such constructions.
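The adaptation idea behind (9.3.15) can be sketched as follows: start from the LSE, estimate the error score −f0′/f0 from residuals with a Gaussian kernel, and take one Newton-type step in the direction of the estimated efficient score. This is a hedged sketch only — the bandwidth, kernel, and crude information estimate are illustrative choices, not the constructions referred to in Chapters 11 and 12.

```python
import numpy as np

# One-step "adaptive" slope estimate in y = alpha + beta*z + eps (sketch only).
rng = np.random.default_rng(1)
n = 2000
z = rng.standard_normal(n)
eps = rng.laplace(size=n)                 # non-Gaussian errors: LSE inefficient
y = 1.0 + 2.0 * z + eps

beta_ls = np.cov(z, y)[0, 1] / np.var(z)              # least squares slope
res = y - y.mean() - beta_ls * (z - z.mean())         # centered residuals

h = 1.06 * res.std() * n ** (-0.2)                    # rule-of-thumb bandwidth
def score_hat(e):                                     # kernel estimate of -f0'/f0
    d = (e[:, None] - res[None, :]) / h
    k = np.exp(-0.5 * d * d)
    return (d * k).sum(axis=1) / (h * k.sum(axis=1))

s = score_hat(res)
info_hat = np.mean(s ** 2)                            # crude estimate of I(f0)
zc = z - z.mean()
beta_one_step = beta_ls + np.mean(zc * s) / (np.var(z) * info_hat)
print(abs(beta_one_step - 2.0) < 0.3)
```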
Model (b). Here, for simplicity, take z ∈ R. Then,
p(z, y, α, β, f ) = f (z, y − α − βz)
(9.3.16)
subject to
Z
(y − α − βz)f (z, y − α − βz)dy = 0
(9.3.17)
for all z but f is otherwise an arbitrary density. Fix α, β at their true values and consider
the one-dimensional submodel

log ft = log f0 + tΔ(z, ε)

with ε ≡ y − α − βz. Then

(∂/∂t) log ft = Δ(z, ε)

and (9.3.17) is, for all z,

∫ ε Δ(z, ε) f0(z, ε) dε = 0,

which is equivalent to

E(ε Δ(Z, ε) | Z) = 0.   (9.3.18)
We see from this and (9.3.14) that the influence function of the LSE β̂ does not belong to Ṗ(P0) unless E(ε²|Z) = 0. The efficient estimate which can, under some conditions, be constructed has influence function

[(Z − EZ)/I(P0, β)] [ (f0′/f0)(Z, ε) − (E((f0′/f0)(Z, ε) ε | Z)/E(ε²|Z)) ε ],   (9.3.19)

where

I(P0, β) = E{ (Z − EZ)² [ (f0′/f0)(Z, ε) − (E((f0′/f0)(Z, ε) ε | Z)/E(ε²|Z)) ε ]² }

and

(f0′/f0)(Z, ε) = (∂/∂v) log f0(Z, v)|_{v=ε}.
For these claims see Problem 9.3.8.
Example 9.3.3. Tangent spaces for biased sampling (Examples 9.1.2, 9.1.8, and 9.2.1). If
k = 1, Ṗ(P0 ) = L2 (P0 ) by (9.1.4). For k > 1, proceed formally as usual. Write, since
(w1 , . . . , wk ), λ are known,
pt(j, y) = λj wj(y) ft(y) / Wj(Ft)

and, given h such that ∫ h(y) dF0(y) = 0,

ft(y) = exp{t h(y) − A(t, f0)} f0(y),
Wj(Ft) = ∫ wj(y) exp{t h(y) − A(t, f0)} dF0(y).

Then,

l̇(j, y; 0) = (∂/∂t) log pt(j, y)|_{t=0} = h(y) − A′(0, f0) − ∫ h(y) wj(y) dF0(y) / Wj(F0).
Thus, at least for any bounded h, if X = (I, Y),

l̇(X, 0) = h(Y) − E0(h(Y)|I).   (9.3.20)

It is, in fact, true that the tangent space is

Ṗ(P0) = { a(I) + b(Y) ∈ L⁰₂(P0) : E0(b(Y)|I) = −a(I) }.   (9.3.21)

It is not hard to check that the influence function of the estimate θ(F̂) of θ(F) = F(y), or more generally of

θ(F) = ∫ v(y) dF(y),

is of the form specified in Ṗ(P0), and efficiency of θ(F̂) follows provided regularity can be shown.
✷
Example 9.3.4. Censoring and truncation tangent spaces (Examples 9.1.3, 9.1.9, and
9.2.2). It turns out that, in these cases,
Ṗ(P0 ) = L02 (P0 )
so that any regular estimate of a parameter is efficient for that parameter. This follows,
since given any distribution P of (Y, δ), there is a unique (F, G) such that P = P(F,G) .
This is a consequence of
f(y) = λF(y) e^{−ΛF(y)}

and

λF(y) = h1(y)/H̄(y),

where

h1(y) = (d/dy) H1(y),  H1(y) = P[Y ≤ y, δ = 1],  H(y) = P[Y ≤ y]

are determined by P and determine P (as functions) — see Problem 9.3.9.
✷
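The identity λF(y) = h1(y)/H̄(y) can be verified in closed form for an illustrative pair (F, G) not taken from the text: T ∼ Exp(1) censored by an independent C ∼ Exp(µ), so that h1(y) = f(y)Ḡ(y) and H̄(y) = F̄(y)Ḡ(y).

```python
import math

# Numerical check of lambda_F(y) = h1(y) / Hbar(y) for Y = min(T, C),
# delta = 1(T <= C), with T ~ Exp(1) and independent C ~ Exp(mu).
mu = 0.7
for y in (0.3, 1.0, 2.5):
    lam_F = 1.0                                  # hazard of Exp(1) is constant
    h1 = math.exp(-y) * math.exp(-mu * y)        # h1(y) = f(y) * Gbar(y)
    Hbar = math.exp(-(1.0 + mu) * y)             # P(Y > y) = Fbar(y) * Gbar(y)
    assert abs(lam_F - h1 / Hbar) < 1e-12
print("identity verified")
```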
Example 9.3.5. The Cox model tangent space (Examples 9.1.4, 9.1.10, and 9.2.3). Here
the model is
p(z, y; β, λ, h) = h(z)λ(y)r(z, β) exp{−r(z, β)Λ(y)} .
Let P0 correspond to the parameter values β0 , λ0 , h0 . We will use the additive property of
efficient score functions (Proposition 9.3.4) and compute tangent spaces for each parameter
with the other parameters fixed, then combine these tangent spaces. First, with λ = λ0 and h = h0 fixed, βt = β0 + ta, a ∈ Rᵈ, p1t the density p corresponding to the parameters (βt, λ0, h0), and Λ0(y) = ∫₀ʸ λ0(s) ds, we find

Ṗ1(P0):  (∂/∂t) log p1t|_{t=0} = aᵀ[(ṙ/r)(z, β0) − ṙ(z, β0) Λ0(y)]   (9.3.22)
where ṙ is the gradient ∇r(·, β) with respect to β. Similarly, let β = β0, h = h0 be fixed, let b(y) be a function with E[b(Y)] = 0, and set λt(y) = λ0(y) exp{t b(y)}. If p2t is the density for these parameters, we find

Ṗ2(P0):  (∂/∂t) log p2t|_{t=0} = b(y) − r(z, β0) ∫₀ʸ b(s) dΛ0(s).

Next, with β = β0, λ = λ0 fixed and log ht(z) = log h0(z) + t c(z) where E[c(Z)] = 0, and with p3t the corresponding density, we find

Ṗ3(P0):  (∂/∂t) log p3t|_{t=0} = c(z).

It follows formally that the tangent set Ṗ(P0) = Ṗ1(P0) + Ṗ2(P0) + Ṗ3(P0) is the class of functions of the form

aᵀ[(ṙ/r)(z, β0) − ṙ(z, β0) Λ0(y)] + b(y) − r(z, β0) ∫₀ʸ b(s) dΛ0(s) + c(z)   (9.3.23)

for functions b(·) and c(·) with E b(Y) = E c(Z) = 0.
The influence function of the Cox estimate as given by (9.2.15) can be shown to be of this form for the Cox model when r(z, β) = exp{βᵀz}. In this case ṙ = z exp(β0ᵀz) and (ṙ/r) = z, and with Y = failure time,

S0(y, β, P) = EP[e^{βᵀZ} 1(Y ≥ y)].

Let Ṡ0 stand for the gradient with respect to β. We proceed with β ∈ R; then Ṡ0(y, β, P) = EP[Z e^{βZ} 1(Y ≥ y)]. Now, in the case where 1(y ≤ τ) = 1, differentiate (9.2.15) with respect to y and show that it may be written in the form (9.3.23). We can show that the derivatives are equal; see Problem 9.3.10. Efficiency follows.
✷
Example 9.3.6: Partial linear model tangent space (Examples 9.1.13 and 9.2.4). Assuming that ε ∼ N(0, σ²), we obtain as members of Ṗ(P0): Zᵀε and (ε²/σ0²) − 1. The nonparametric g, appearing as gt in a 1-dimensional submodel, yields

(∂/∂t) [ (1/(2σ0²)) (Y − β0ᵀZ − gt(U))² ]|_{t=0} = −(ε/σ0²) (∂gt/∂t)(U)|_{t=0}.

This gives us

Ṗ(P0) = { (ε²/σ0²) − 1, h(U)ε, w(Z), Zε }

for all h, w such that E0 h²(U) < ∞, E0 w²(Z) < ∞. The influence function of β̂ may be shown to be

[E Var0(Z|U)]⁻¹ (Z − E0(Z|U)) ε,

whose components do indeed belong to Ṗ(P0).
✷
We leave further consideration of these examples and other models to the problems.
9.4
Tests and Empirical Process Theory
It is natural, when entertaining a parametric model P = {Pθ : θ ∈ Θ}, to consider its consistency with the data. If we have no strong a priori evidence of departure from the model in any particular direction, we naturally begin by considering H : P ∈ P versus K : P ∉ P. As usual, with a formulation as general as this one, nothing can be done. But begin with a blanket hypothesis such as our usual one: X = (X1, …, Xn), Xj i.i.d. F. Then we can identify P with F = {Fθ : θ ∈ Θ}, Θ ⊂ Rᵈ Euclidean, a regular parametric model, and test H : F ∈ F vs K : F ∉ F, using measures of distance between F̂ and F as we discussed in Section 4.1 and Example I.2. Here failure to reject the hypothesized model only suggests that the model may be an adequate approximation. Recall from Section 1.1.1 that “Models, of course, are never true but fortunately it is only necessary that they be useful” (Box, 1979). We pursue model testing in more detail here. Strictly speaking, we are dealing with parametric hypotheses in nonparametric models.
Example 9.4.1. Testing goodness-of-fit to the Gaussian distribution (Example 4.1.6 and
Section I.2 continued).
Consider the test statistic Tn introduced in Example 4.1.6 for testing H : F(·) = Φ((· − µ)/σ) for some µ, σ vs K : F not Gaussian,

Tn ≡ sup_x |F̂(X̄ + σ̂x) − Φ(x)| = sup_x |F̂(x) − Φ((x − X̄)/σ̂)|.
We claim that, under H,

L(Tn) → L(sup_x |Z(x)|)   (9.4.1)

where Z(·) is a Gaussian process with mean 0 and, if x ≤ y,

Cov(Z(x), Z(y)) = Φ(x)(1 − Φ(y)) − φ(x)φ(y) − (xy/2) φ(x)φ(y).   (9.4.2)
As we have noted, to obtain critical values for this statistic and some others we study
below, this result is unnecessary since we can always employ the Monte Carlo method
described in Examples 4.1.6 and 10.1.1. However, the proof gives us a sense of how the behaviour of this statistic differs from the Kolmogorov statistic of Example 4.1.5 and suggests
a general approach for constructing and analyzing goodness-of-fit statistics for composite
hypotheses.
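The Monte Carlo route is easy to carry out. The sketch below (the sample size, replication count, and use of NumPy are illustrative choices) simulates the null distribution of Tn with estimated (X̄, σ̂) next to the classical Kolmogorov statistic with known parameters; the 95% point of Tn is visibly smaller, which is the concentration phenomenon the proof explains.

```python
import numpy as np
from math import erf, sqrt

# Null Monte Carlo for Tn = sup_x |F_hat(x) - Phi((x - Xbar)/sigma_hat)|
# versus the Kolmogorov statistic sup_x |F_hat(x) - Phi(x)| (mu, sigma known).
def Phi(v):
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

rng = np.random.default_rng(2)
n, B = 100, 2000
t_est, t_known = [], []
lo, hi = np.arange(0, n) / n, np.arange(1, n + 1) / n   # ecdf just below/at x
for _ in range(B):
    x = np.sort(rng.standard_normal(n))
    u_est = np.array([Phi(v) for v in (x - x.mean()) / x.std()])
    u0 = np.array([Phi(v) for v in x])
    t_est.append(max((hi - u_est).max(), (u_est - lo).max()))
    t_known.append(max((hi - u0).max(), (u0 - lo).max()))

print(np.quantile(t_est, 0.95) < np.quantile(t_known, 0.95))
```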
To establish (9.4.1), assume without loss of generality that the true member of H under
which we compute is N (0, 1). Then,
Zn(x) = √n(F̂(x) − Φ(x)) − √n[Φ((x − X̄)/σ̂) − Φ(x)] ≡ En(x) − Δn(x),

where En(·) is the empirical process. Expand Δn(x) as a function of X̄ and σ̂² around 0 and 1 to get (Problem 9.4.1)

Δn(x) = −φ(x)√n X̄ − (x/2)φ(x)√n(σ̂² − 1) + Rn(x)   (9.4.3)

where supx |Rn(x)| = OP(n⁻¹). Let 1x(y) = 1(y ≤ x) and

fx(y) = 1x(y) + φ(x)y + (xφ(x)/2)(y² − 1).

By definition,

En(fx) = ∫ fx(y) dEn(y) = √n(F̂(x) − Φ(x)) + φ(x)√n X̄ + (xφ(x)/2) En(y² − 1).

So if we define Z̃n(x) ≡ En(fx), (9.4.3) yields ‖Z̃n − Zn‖∞ = oP(1). It follows that Zn ⇒ Z0, where Z0 is the limit of Z̃n. It is easy to show that {fx(·) : x ∈ R} satisfies the conditions of Theorem 7.1.5 and hence that Z0(x) = Z(x) = W⁰Φ(fx(·)), where W⁰Φ is the Brownian bridge indexed by functions in L2(Φ); in the notation of Section I.1, W⁰Φ(1x) = W⁰(Φ(x)) with W⁰(·) the Brownian bridge on [0, 1].
That Cov(Z(x), Z(y)) is as specified in (9.4.2) can be calculated directly (Problem 9.4.1).
Formula (9.4.2) shows that the limit distribution of Tn is different from that (see 7.1.24)
of the Kolmogorov-Smirnov statistic corresponding to H : F = Φ but this gives little
further insight. However, here is another derivation of (9.4.2). We introduce some notation;
see Appendix B.10. If H is a Hilbert space and G ⊂ H, let [G] be the smallest closed linear
subspace of H containing G. If G = {h1 , . . . , hk }, then [G] = {c1 h1 + . . . + ck hk : c =
(c1 , . . . , ck ) ∈ Rk }. Take H = L2 (Φ), L = [X, X 2 − 1] where X ∼ N (0, 1). Then
[X] ⊥ [X 2 − 1] and for any g(X) ∈ L2 (Φ), if Π(·|L) denotes projection on the closed
linear space L,
Π(g|L) = Π(g|[X]) + Π(g|[X² − 1])

and hence

Π(g|L)(y) = E(g(X)X) y + [E g(X)(X² − 1)/2] (y² − 1).
If g(y) = 1(y ≤ x), which we abbreviate as 1x(y),

E g(X)X = ∫_{−∞}^{x} yφ(y) dy = −φ(x),

E g(X)(X² − 1) = ∫_{−∞}^{x} y²φ(y) dy − Φ(x) = ∫_{−∞}^{x} (y² − 1)φ(y) dy = −xφ(x).
Then fx(·) = 1x(·) − Π(1x|L)(·) and, by the usual properties of projections,

Cov(W⁰Φ(fx), W⁰Φ(fy)) = CovΦ(fx, fy) = CovΦ(1x(X), 1y(X)) − CovΦ(Π(1x|L), Π(1y|L)) = CovΦ(Π⊥(1x), Π⊥(1y))   (9.4.4)

where Π⊥ denotes projection on the orthocomplement of L in L⁰₂(Φ) and we write CovΦ(g, h) for Cov(g(X), h(X)) with X ∼ N(0, 1). Thus, as we might expect, in the limit the process Zn(·) is more concentrated than En(·) in the sense that every linear functional of Zn(·) has a smaller limiting variance than the same functional of En(·).
implications of this calculation for power are intuitively correct. The tests based on Zn (·)
are less powerful against the same alternative than the corresponding tests based on En (·).
See Problem 9.4.5. We generalize this argument in the next example.
✷
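The covariance formula (9.4.2) can also be verified by simulation. Writing out fx = 1x − Π(1x|L) explicitly with the projection coefficients computed above gives fx(t) = 1(t ≤ x) + φ(x)t + (xφ(x)/2)(t² − 1); the sample covariance of fx(X) and fy(X) under X ∼ N(0, 1) should then match (9.4.2). The sample size and the points x, y below are arbitrary choices.

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Monte Carlo check of (9.4.2) with f_x = 1_x - Pi(1_x | L) written out.
def Phi(v): return 0.5 * (1.0 + erf(v / sqrt(2.0)))
def phi(v): return exp(-0.5 * v * v) / sqrt(2.0 * pi)

def f(x, t):
    return (t <= x).astype(float) + phi(x) * t + 0.5 * x * phi(x) * (t * t - 1.0)

rng = np.random.default_rng(3)
t = rng.standard_normal(2_000_000)
x, y = -0.5, 1.0                                   # x <= y as in (9.4.2)
fx, fy = f(x, t), f(y, t)
emp = np.mean(fx * fy) - np.mean(fx) * np.mean(fy)
formula = Phi(x) * (1.0 - Phi(y)) - phi(x) * phi(y) * (1.0 + x * y / 2.0)
print(abs(emp - formula) < 0.01)
```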
Other test statistics’ limit distributions can be similarly derived. For instance, the Cramér–von Mises type statistic for this situation is given by (see Problem 4.1.1 and Example 7.1.6)

Sn ≡ n ∫_{−∞}^{∞} (F̂(x) − Φ((x − X̄)/σ̂))² dΦ(x).   (9.4.5)

Clearly, h → ∫_{−∞}^{∞} h² dΦ is a continuous map from L∞(R) to R. Hence,

LΦ(Sn) → L( ∫_{−∞}^{∞} [W⁰Φ(fx(·))]² dΦ(x) ).
This limit law has an explicit form — see Shorack and Wellner (1986), for instance.
Example 9.4.2. Goodness of fit to a general smooth parametric model. We assume that
P ≡ {Pθ : θ ∈ Θ ⊂ Rd } is a smooth parametric model satisfying the assumptions of
Theorem 6.2.2. Let T be a set of functions on X satisfying the assumptions of Theorem
7.1.5. Let, for f ∈ T ,
n
1 X
(f (Xi ) − E b (f (X))
Zn (f ) = √
θ
n i=1
(9.4.6)
b is the MLE of θ. Here we compute Eθ f (X) then substitute θ = θb to get
where θ
Eθb f (X) . It is reasonable that suitable goodness-of-fit statistics for H : P ∈ P should
178
Inference in semiparametric models
Chapter 9
be based on the magnitude of |Zn (·)|. For instance, if X = R, f (x) = 1x , the natural
generalization of the Kolmogorov statistic is
1
Tn = sup |Fb(x) − F b (x)| = √ sup |Zn (1x )|.
θ
n x
x
We claim that, under H with the data X generated by P0 ≡ Pθ0,

Zn(·) ⇒ Zθ0(·)   (9.4.7)

where Zθ0(f) = W⁰θ0(f − Πθ0(f)). Here W⁰θ0 = W⁰P0, Πθ0(f) ≡ Π(f | [∇l(X, θ0)]) denotes projection in L⁰₂(P0), and ∇l(X, θ0) = ((∂l/∂θ1)(X, θ0), …, (∂l/∂θd)(X, θ0))ᵀ. Then,

Cov(Zθ0(f), Zθ0(g)) = Covθ0(f(X), g(X)) − Covθ0(Πθ0(f)(X), Πθ0(g)(X)) = Covθ0(Π⊥θ0(f), Π⊥θ0(g)).   (9.4.8)
We have discussed the consequences of this formula in terms of power above. The proof of (9.4.7) is a straightforward extension of (9.4.4). Write

Zn(f) = En(f) − √n(E_θ̂ f(X) − Eθ0 f(X)).

Write the last term as

√n ∫ f(x)(p(x, θ̂) − p(x, θ0)) dµ(x).

We can justify a Taylor expansion to get

√n(E_θ̂ f(X) − Eθ0 f(X)) = [∫ f(x) ∇l(x, θ0) p(x, θ0) dµ(x)]ᵀ √n(θ̂ − θ0) + oP(1).   (9.4.9)

Now by (6.2.10), (9.4.9) is equal to

[Eθ0 f(X)∇l(X, θ0)]ᵀ (1/√n) Σ_{i=1}^{n} I⁻¹(θ0) ∇l(Xi, θ0) + oP(1) = En([Eθ0 f(X)∇l(X, θ0)]ᵀ I⁻¹(θ0) ∇l(·, θ0)) + oP(1).   (9.4.10)

But the argument of En is just Π(f | [∇l(X, θ0)]) by (B.10.20), and (9.4.7) follows. The validity of (9.4.8) comes from the general property (B.10.14) applied to H = L2(P0) and L = [∇l(X, θ0)]:

CovP0(f(X) − Π(f | L)(X), g(X)) = 0  for all g ∈ L.
✷
The result of this example is implicit in the work of Darling (1955). See also Durbin
(1973). An excellent and very complete treatment of goodness-of-fit tests such as these is
Chapter 5 of Shorack and Wellner (1986).
The parametric bootstrap
How do we use the result (9.4.7)? Suppose that all the constants implicit in Theorem
7.1.5 and the conditions of Theorem 6.2.2 do not depend on Pθ for θ in small open sets.
This implies that all stochastic convergence and approximations that we claim are uniform
in θ on these open sets. We want to set an approximate critical value for a test statistic, Tn =
q(Zn (·)) where q is continuous on l∞ (T ). It is natural to use the parametric bootstrap test
which we define as follows.
(i) Compute θ̂, Tn.

(ii) Generate B samples of size n, (X*b1, …, X*bn), b = 1, …, B, from P_θ̂.

(iii) Compute T*bn, b = 1, …, B, the test statistic Tn(X*b1, …, X*bn) for each of the B samples. For instance, if Tn = sup_x |F̂(x) − F_θ̂(x)|, then T*bn = sup_x |F̂*b(x) − F_θ̂*b(x)|, where F̂*b is the empirical df of (X*b1, …, X*bn) and θ̂*b is the MLE computed on the basis of X*bj, j = 1, …, n.

(iv) Reject H at level α iff Tn > T*([B(1−α)]+1), where T*(1) ≤ … ≤ T*(B) are the ordered T*nb, 1 ≤ b ≤ B.
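Steps (i)–(iv) can be sketched in a few lines. In the hedged version below, the exponential family {1 − e^{−λu} : λ > 0} stands in for a model with a closed-form c.d.f. and a one-line MLE (λ̂ = 1/X̄); the sample sizes, B, and the two illustrative data sets are arbitrary choices, and for the gamma family only the c.d.f. evaluation would change.

```python
import numpy as np

# Parametric bootstrap goodness-of-fit test, steps (i)-(iv), sketched for the
# exponential family (illustrative stand-in for the gamma family).
rng = np.random.default_rng(4)

def ks_exp(x):
    """T_n = sup_u |F_hat(u) - F_lamhat(u)| with lam_hat the MLE 1/xbar."""
    x = np.sort(x)
    n = len(x)
    u = 1.0 - np.exp(-x / x.mean())              # fitted cdf at the data points
    return max((np.arange(1, n + 1) / n - u).max(),
               (u - np.arange(0, n) / n).max())

def bootstrap_pvalue(x, B=500):
    t_obs = ks_exp(x)                            # (i) fit and compute T_n
    mean_hat = x.mean()
    t_star = [ks_exp(rng.exponential(mean_hat, len(x)))   # (ii)-(iii)
              for _ in range(B)]
    return float(np.mean(np.array(t_star) >= t_obs))      # (iv) bootstrap law

p_null = bootstrap_pvalue(rng.exponential(2.0, 150))   # H true
p_alt = bootstrap_pvalue(rng.uniform(0.0, 1.0, 150))   # H false
print(p_alt < 0.1, 0.0 <= p_null <= 1.0)
```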
The result (9.4.7) suggests that this procedure provides an approximate size α test if the true θ0 belongs to the interior of Θ. Here is a sketch proof under the assumption that the limit law Lθ0(q(Zθ0)) of Tn = q(Zn) is continuous. Note first that

(1/B) Σ_{b=1}^{B} 1(T*nb > Tn) − P_θ̂[T*n > Tn] → 0 in probability   (9.4.11)

as B → ∞, by Chebyshev's inequality. Both expressions in (9.4.11) are to be interpreted as conditional on X1, …, Xn; that is, T*n = Tn(X*11, …, X*1n) where the X*1i are i.i.d. P_θ̂(X1,…,Xn), given X1, …, Xn. But under our assumptions,

sup{ |Pθ[Tn > x] − Pθ[q(Zθ) > x]| : x ∈ R, |θ − θ0| ≤ ε } → 0.   (9.4.12)

Since θ̂ → θ0 in Pθ0 probability, (9.4.11) and (9.4.12) imply that if Tn and T′n have the same distribution but are independent, then P_θ̂[T*n > Tn] − Pθ0[T′n > Tn] → 0 in Pθ0 probability. This is what we needed to argue.
This technique is the first example of bootstrap techniques, Monte Carlo in the service
of inference, that we will investigate further in Chapter 10.
To make matters concrete, we specialize to
Example 9.4.3. Goodness of fit for the gamma family. Suppose P is the Γ(p, λ) family, a possible model for lifetime distributions (see Lawless (1982)). We specialize to T = {1(0, u) : u ∈ R}, or equivalently to statistics based on

Zn(u) = √n(F̂(u) − G_{p̂,λ̂}(u))

where (p̂, λ̂) is the MLE of (p, λ) and Gp,λ(u) is the c.d.f. of Γ(p, λ). In particular, consider Tn = sup_u |Zn(u)|, a Kolmogorov–Smirnov type statistic. Since Gp,λ(u) = Gp,1(λu),

Tn = √n sup_u |F̂(u/λ̂) − G_{p̂,1}(u)|

so that we can simulate under λ = 1. However, no such simplification is possible for the parameter p. Thus, to get critical values for Tn we need to apply the parametric bootstrap and simulate from G_{p̂,1}(·) as we discussed.
✷
Remark 9.4.1.
(a). Goodness of fit tests are often used as diagnostics. Thus, for instance, in the Gaussian goodness-of-fit problem we might want to assess where the deviations from Gaussianity are significant by taking into account that the standard deviation of the empirical process at a point depends on that point. Under normality, F̂(µ + σu) has variance Φ(u)(1 − Φ(u))/n. Thus it is natural to consider

Sn = n^{1/2} sup{ |F̂(X̄ + σ̂u) − Φ(u)| [Φ(u)(1 − Φ(u))]^{−1/2} : |u| ≤ M }.   (9.4.13)

Then, not only is a large value of Sn indicative of departure from Gaussianity, but where that departure occurs is weighted appropriately as in Example 4.4.3. We cannot take M = ∞ here (Problem 9.4.2). However, it is possible to consider statistics such as

n ∫_{−∞}^{∞} [F̂(x) − Φ((x − X̄)/σ̂)]² / [Φ(x)(1 − Φ(x))] dΦ(x),

which are asymptotically equivalent to the well-known Shapiro–Wilk statistics — see Mecklin and Mundfrom (2004) for a review.
It is possible, as in Section 6.3, to make power calculations for such tests. Consider H : F = F0. For a sequence of alternatives {Fn} with Fn(x) = F0(x) + n^{−1/2}(Δ(x) + o(1)) uniformly in x, it is possible to obtain a limit distribution for Zn(·) in terms of a Gaussian process. Unfortunately, the family of Δ we need to consider is too large and the resulting expressions are generally not analytically tractable. For some calculations of this type see Hájek and Šidák (1967). These results may also be put in the general framework of contiguity, which we discuss further in Section 9.5. It is possible to extend the goodness-of-fit tests we have described to hypotheses such as that of independence of U and V where, in general, X = (U, V) has an arbitrary distribution. Such a hypothesis is semiparametric. This extension is discussed in the problems. A general approach to constructing tests of goodness-of-fit to semiparametric hypotheses is given in Bickel, Ritov and Stoker (2003).
9.5
Asymptotic Properties of Likelihoods. Contiguity
As we have seen in Chapters 5 and 6 the asymptotic behavior of maximum likelihood
estimates and tests and related confidence regions corresponds to the exact behavior of
the same procedures specialized to the Gaussian linear model with known variance. In this
section we will show how these results follow heuristically from properties of the likelihood
which are by no means limited to i.i.d. observations. The theory originated in early work
of Wald (1943) but was brought to full generality and understanding largely by Le Cam in a series of papers starting in 1956 and culminating in his (1986) treatise. Hájek (1972) also made important contributions. We shall derive results only under the strong conditions of Chapters 5 and 6 rather than under the elegant necessary and sufficient conditions of Le Cam. But we will state the general results in the appropriate Hilbert space context with references to their appearance and proofs in Le Cam and Yang (1990), Le Cam (1986), and Hájek and Šidák (1967). In this way we shall connect this theory with the semiparametric estimation theory of Sections 9.1–9.3.
We begin by motivating the abstract definitions which follow by a closer examination of
the i.i.d. case. We need to take the point of view of Section 9.3 in which we introduced local
parametric models. That is, we fix Pθ 0 and then consider regular parametric submodels of
the form Pε ≡ {Pθ : |θ − θ0 | ≤ ε}, where ε is arbitrarily small and θ belongs to Rd .
We formalize this by reparameterizing the model using the sample size dependent scale of
order n−1/2 in the i.i.d. case. That is, we consider
PM,n ≡ {P_{θ0 + t/√n} : |t| ≤ M}   (9.5.1)
for M < ∞ arbitrary. Note that Pθ 0 could be any P residing in a semiparametric model.
What matters is that we construct a sequence of shrinking regular parametric models centered at Pθ 0 .
This enables us, following Le Cam, to think about local approximations to models and
all related decision theoretic problems rather than just to distributions of estimates or tests.
In particular we can formulate all the optimality properties of Chapters 5 (Sections 5.4.3 and 5.4.4), 6 (6.2.2), and Section 9.3 in terms of the local parameter t. If θ*n is a regular estimate of θ, then, by the plug-in principle, t*n ≡ √n(θ*n − θ0) is the corresponding estimate of t and we can state properties in terms of t*n. For instance, optimality of the MLE θ̂n as defined in (9.3.1) and (9.3.2) now reads: Under regularity conditions, if t*n is asymptotically Gaussian with mean t, uniformly in t on all PM,n, that is,

L_{θ0+t/√n}(t*n) → Nd(t, Σ*)   (9.5.2)

uniformly for |t| ≤ M, all M < ∞, then

Σ* ≥ I⁻¹(θ0)

and, if t̂n is the MLE of t,

L_{θ0+t/√n}(t̂n) → Nd(t, I⁻¹(θ0))
uniformly for |t| ≤ M, all M < ∞. We also note that for the model

Nd ≡ {Nd(t, I⁻¹(θ0)) : t ∈ Rᵈ}

the various asymptotic optimality statements become exact for fixed n. We can read this as saying that Nd is an approximation in a strong sense to the sequence of local models {PM,n : M < ∞} as n → ∞.
We begin by introducing the fundamental notion of local asymptotic normality (LAN) of general likelihoods. For Θn ⊂ Rᵈ, let En ≡ {Pθ(n) : θ ∈ Θn}, n ≥ 1, be a sequence of experiments (models). By this notation we mean that, for the nth experiment, we observe X(n) ∈ 𝒳(n), X(n) ∼ Pθ(n), θ ∈ Θn. We can think of X(n) as being a vector (X1, …, Xn)ᵀ belonging to Rⁿ. In the i.i.d. case, where the Xi are i.i.d. with common density f(x, θ), Pθ(n) has density

pn(x, θ) = ∏_{i=1}^{n} f(xi, θ).
The parametrization of Θn we consider usually depends on n. It contains open balls around
a fixed point θ 0 so that we can talk about θn = (θ 0 + tγn ) ∈ Θn , where γn ↓ 0, and
|t| ≤ M for some M > 0 and all t if n is large enough. Having γn ↓ 0 is what makes our
calculations local. In all “regular” i.i.d. cases, γn = n^{−1/2}. Note that we have already made
local calculations like this in Section 5.4.4, in connection with testing.
The local likelihood ratio is defined as

Ln(t) = pn(X(n), θn) / pn(X(n), θ0),   θn ≡ θ0 + tγn.   (9.5.3)

We suppress θ0 in the notation, and we use the conventions that if pn(·, θ0) = 0 and pn(·, θn) > 0 then Ln(t) = ∞, and that 0/0 = 0. Finally, let

Λn(t) ≡ log Ln(t).
Definition 9.5.1. En is uniformly locally asymptotically normal (ULAN) at θ0 iff there exist d × 1 statistics {Tn(X(n), θ0)}, n ≥ 1, such that we can write

Λn(t) = tᵀ Tn(X(n), θ0) − (1/2) tᵀ Σ0 t + Rn(t)

where

(i) sup{|Rn(t)| : |t| ≤ M} → 0 in Pθ0 probability, for all M < ∞;

(ii) Lθ0(Tn(X(n), θ0)) → Nd(0, Σ0).
Remark 9.5.1. If X1, …, Xn are i.i.d. Nd(θ, Σ0⁻¹) and θ = 0, then

Λn(t) = tᵀ Σ0 (n^{−1/2} Σ_{i=1}^{n} Xi) − (1/2) tᵀ Σ0 t

and n^{−1/2} Σ_{i=1}^{n} Σ0 Xi ∼ Nd(0, Σ0).
We have already encountered the ULAN property in Examples 7.1.1 and 7.1.2. We show there that if {Pθ : θ ∈ Θ} is a model satisfying the regularity conditions of these examples, then En ≡ {Pθ(n) : θ ∈ Θ}, where Pθ(n) is the (product) joint distribution of X(n) ≡ (X1, …, Xn) i.i.d. Pθ, is ULAN at all θ0 with

Tn(X(n), θ0) = n^{−1/2} Σ_{i=1}^{n} l̇(Xi, θ0)

and Σ0 = I(θ0), the information matrix.
We shall also have use for the following general property. Let 𝒳(n), n ≥ 1, be a sequence of probability spaces; typically X(n) ∈ 𝒳(n) with 𝒳(n) a subset of Rⁿ.
Definition 9.5.2. Given two sequences of probabilities {P (n) }, {Q(n) }, n ≥ 1, on X (n)
we say that {Q(n) } is contiguous to {P (n) } iff for any sequence of events {An }, n ≥ 1,
such that P (n) (An ) → 0, Q(n) (An ) → 0 also.(2)
Contiguity has a very simple statistical interpretation. Any test of H : P = P (n) vs
K : Q = Q(n) which asymptotically has probability of type I error 0 must asymptotically
have probability of type II error 1.
Contiguity can be used to turn asymptotic results for a relatively simple P (n) into
asymptotic results for a more complicated Q(n) because if an approximation Sn to a statistic Tn satisfies Tn − Sn = oP (1) under P (n) , then Tn − Sn = oP (1) under Q(n) also. For
instance:
Example 9.5.1. Consider the uniform score statistic

TU = n^{−1/2} Σ (Ri/(n + 1)) (zi − z̄)

of Example 8.3.11. Under the hypothesis H, an approximation to TU is

SU = n^{−1/2} Σ Ui (zi − z̄),   Ui = F(Yi).

Here Ui ∼ U(0, 1) under H. By ordering the Ui we can write

SU − TU = n^{−1/2} Σ (U(i) − i/(n + 1)) (z(i) − z̄)

where z(i) is the predictor value of the sample point with response Y(i). Because E(U(i)) = i/(n + 1), we have, under H,

EH(TU − SU)² = Var(TU − SU).

Set s² = n⁻¹ Σ (zi − z̄)²; then we can use Problem B.2.9 to show that

E[(TU − SU)/s]² → 0

provided zi, 1 ≤ i ≤ n, satisfy Lindeberg's condition. See Problem 9.5.3. For contiguous alternatives Qn, (TU − SU)/s = oP(1) also. By Slutsky's theorem we can find the asymptotic power of TU for contiguous alternatives by computing lim_{n→∞} P(√12 SU/s ≥ z_{1−α}). See Problem 9.5.4.
✷
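The approximation in Example 9.5.1 is easy to see in simulation. In the sketch below (sample size and seed are arbitrary choices) we take Yi i.i.d. uniform under H, so that F(Yi) = Yi, and compare TU with SU relative to s:

```python
import numpy as np

# Under H (Y independent of z, F uniform so U_i = Y_i), T_U - S_U is
# negligible relative to s, illustrating the approximation of Example 9.5.1.
rng = np.random.default_rng(6)
n = 5000
z = rng.standard_normal(n)
y = rng.uniform(size=n)
ranks = y.argsort().argsort() + 1          # R_i: rank of Y_i among the sample
zc = z - z.mean()

T_U = (ranks / (n + 1) * zc).sum() / np.sqrt(n)
S_U = (y * zc).sum() / np.sqrt(n)          # uses U_i = F(Y_i) = Y_i
s = np.sqrt(np.mean(zc ** 2))
print(abs(T_U - S_U) / s < 0.1)            # (T_U - S_U)/s is small
```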
Define
L(n) (X (n) ) ≡ q (n) (X (n) )/p(n) (X (n) )
where p(n) and q (n) are densities of P (n) and Q(n) and 0/0 = 0, a/0 = ∞, for a > 0.
A simple and powerful sufficient (and necessary) condition is embodied in
Proposition 9.5.1. (Le Cam’s First Lemma). Suppose
(i) LP (n) L(n) (X (n) ) ⇒ L(L) with L finite valued, and
(ii) E(L) = 1.
Then, Q(n) is contiguous to P (n) .
We refer to Hájek and Šidák (1967), BKRW, and van der Vaart (2000) for a general
proof of this lemma. We sketch a proof in Problem 9.5.1.
We next consider L(n) (X (n) ) = Ln (t) as defined in (9.5.3) and, as a corollary, we
obtain
Proposition 9.5.2. Suppose {Pθ^(n) : θ ∈ Θₙ}, n ≥ 1, is ULAN at θ0. Let P(n) ≡ Pθ0^(n) and Q(n) ≡ Pθn^(n), where θₙ ≡ θ0 + tγₙ, γₙ = n^{−1/2}, and |t| ≤ M < ∞ for all n. Then Q(n) is contiguous to P(n).
Proof. By Examples 7.1.1 and 7.1.2, Lₙ ⇒ e^Z where Z ∼ N(−σ²/2, σ²) and σ² ≡ tᵀΣ0t. But, by (A.13.20),
$$Ee^{sZ} = \exp\{-(\sigma^2 s)/2 + (\sigma^2 s^2)/2\},$$
which equals one if s = 1, and the result follows from Proposition 9.5.1 with L = e^Z.
✷
The importance of this proposition is that the likelihood ratio is asymptotically determined by T(n) whatever θ is in the local model. This suggests that T(n) is, in a suitable sense, asymptotically sufficient, so that any decision procedure has risk equivalent, in the local asymptotic sense, to its risk in the experiment in which one observes only T(n)(X(n), θ0). To see that this last experiment is very simple we state and prove the fundamental
Proposition 9.5.3. (Le Cam's Third Lemma). Suppose ℰₙ is a sequence of ULAN experiments at θ0, Tₙ ≡ Tₙ(X(n), θ0) is as in Definition 9.5.1, and Sₙ ≡ {Sₙ(X(n))} is a sequence of p-dimensional statistics such that
$$\mathcal{L}_{\theta_0}\big((T_n^T, S_n^T)^T\big) \to N_{d+p}\left(\begin{pmatrix} 0_{d\times 1} \\ \mu_{p\times 1} \end{pmatrix},\ \begin{pmatrix} \Sigma_{d\times d} & C_{d\times p} \\ C_{d\times p}^T & K_{p\times p} \end{pmatrix}\right).$$
Then, if θₙ = θ0 + tγₙ, with γₙ = n^{−1/2},
$$\mathcal{L}_{\theta_n}\big(S_n(X^{(n)})\big) \to N_p(\mu + C^T t,\ K). \tag{9.5.4}$$
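For a concrete check of (9.5.4), take the N(θ, 1) model with Sₙ = √n × (sample median) — an illustrative choice of ours, not the text's. Under H, (Tₙ, Sₙ) is asymptotically bivariate normal with Σ = 1, K = π/2 and C = E|X|/(2φ(0)) = 1, so the lemma predicts Sₙ ⇒ N(t, π/2) under θₙ = t/√n:

```python
import math, random, statistics

def median_shift_mc(n=201, reps=4000, t=1.0, seed=7):
    # X_1..X_n i.i.d. N(t/sqrt(n), 1); S_n = sqrt(n) * sample median.
    # Le Cam's Third Lemma predicts S_n => N(C^T t, K) = N(t, pi/2) here.
    rng = random.Random(seed)
    theta = t / math.sqrt(n)
    s_vals = []
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        s_vals.append(math.sqrt(n) * statistics.median(x))
    mean = sum(s_vals) / reps
    var = sum((s - mean) ** 2 for s in s_vals) / (reps - 1)
    return mean, var
```

The simulated mean is close to t and the variance close to π/2 ≈ 1.57, even though nothing about the median was computed under the alternative analytically.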
Section 9.5
Asymptotic Properties of Likelihoods. Contiguity
185
An immediate consequence is
Proposition 9.5.4. Suppose ℰₙ is ULAN at θ0 with Tₙ(X(n), θ0) and Σ0 as in Definition 9.5.1. Then, if θₙ = θ0 + tn^{−1/2},
$$\mathcal{L}_{\theta_n}\big(\Sigma_0^{-1} T_n(X^{(n)}, \theta_0)\big) \to N_d(t,\ \Sigma_0^{-1}).$$
Remark 9.5.2. We discuss the implications of Proposition 9.5.4 before giving the proofs of Propositions 9.5.3 and 9.5.4. Since observing Σ0⁻¹Tₙ and Tₙ are equivalent, this proposition says that the experiment of observing Σ0⁻¹Tₙ for the model 𝒫ₙ = {P(n)_{θ0+tγₙ} : t ∈ R^d} is equivalent, in some approximate sense, to observing W = t + Z, t ∈ R^d, where Z ∼ N_d(0, Σ0⁻¹); or equivalently, Σ0^{1/2}W = μ + Z, μ ∈ R^d, for Z ∼ N_d(0, J), J = d × d identity. Coming back to our i.i.d. example, observing (X₁, ..., Xₙ) for i.i.d. P_{θ0+tn^{-1/2}} is asymptotically equivalent, in the rough sense we have discussed, to observing I⁻¹(θ0)n^{−1/2}Σⁿᵢ₌₁ l̇(Xᵢ, θ0), which, under regularity conditions, is asymptotically equivalent to observing t̂ₙ ≡ n^{1/2}(θ̂ₙ − θ0). Unfortunately, having o_P(1) approximations for the likelihood ratio statistic in the local model is far from enough to justify the asymptotic sufficiency properties they suggest. We refer to Le Cam (1956, 1972, 1986) for the difficult arguments involved.
Proof of Proposition 9.5.3. The characteristic function of Sₙ is
$$E_{\theta_n}\exp\{iw^T S_n\} = E_{\theta_0}\exp\{iw^T S_n + \Lambda_n(t)\} + O\big(P_{\theta_0}[p^{(n)}(X^{(n)}, \theta_0) = 0]\big). \tag{9.5.5}$$
The second term in (9.5.5) tends to 0 by contiguity. By Definition 9.5.1,
$$iw^T S_n + \Lambda_n(t) = iw^T S_n + t^T T_n - \tfrac12 t^T \Sigma_0 t + o_{P_{\theta_0}}(1). \tag{9.5.6}$$
Let (U, V) denote the limit in law of (Tₙ, Sₙ). The right hand side of (9.5.6) converges in law to iwᵀV + tᵀU − (1/2)tᵀΣ0t. It can be shown (Problem 9.5.2) that
$$E\exp\{iw^T V + t^T U\} = \exp\{iw^T\mu - \tfrac12(w^T K w - t^T \Sigma_0 t) + it^T C w\}. \tag{9.5.7}$$
By Hammersley's theorem (B.7.3) we can find (Λₙ*(t), [Sₙ*]ᵀ) with the same distribution as (Λₙ(t), Sₙᵀ) which converge in probability to (tᵀU* − ½tᵀΣ0t, V*) with (U*, V*) having the same Gaussian distribution as (U, V). Then,
$$\exp\{iw^T S_n^* + \Lambda_n^*(t)\} \xrightarrow{P} \exp\{iw^T V^* + t^T U^* - \tfrac12 t^T \Sigma_0 t\}.$$
The limit variable has, by (9.5.7), expectation exp{iwᵀ(μ + Cᵀt) − ½wᵀKw}, which is the characteristic function of the postulated Gaussian limit distribution.
Finally, we can conclude that the limit of expectations is the expectation of the limit in (9.5.5) by Remark B.7.1.
✷
Proof of Proposition 9.5.4. Let Sₙ = Σ0⁻¹Tₙ. Then the conditions of Proposition 9.5.3 are satisfied with μ = 0, K = Σ0⁻¹Σ0Σ0⁻¹ = Σ0⁻¹, and C = J_{d×d}, where J is the identity. The result follows.
✷
We give an application of Le Cam’s Third Lemma which also serves to illustrate the
asymptotic optimality properties of procedures that we associate with LAN.
Example 9.5.2. Asymptotic efficiency of the Rao score test. Recall the Rao test in a one-dimensional regular model given by (5.4.55):
$$\text{``Reject } H: \theta = \theta_0 \text{ iff } S_n \equiv [nI(\theta_0)]^{-1/2}\sum_{i=1}^n \frac{\partial l}{\partial\theta}(X_i, \theta_0) \ge z_{1-\alpha}.\text{''} \tag{9.5.8}$$
This test was shown to be asymptotically optimal by a direct calculation in Problem 5.4.8. We reprove this result under weaker assumptions. By Example 7.1.1, for a regular one-dimensional model, Tₙ(X(n), θ0) = n^{−1/2}Σⁿᵢ₌₁ (∂l/∂θ)(Xᵢ, θ0). By the central limit theorem, under H,
$$(T_n, S_n) \Rightarrow N(0, 0, I(\theta_0), 1, 1).$$
By Le Cam's Third Lemma, if θₙ = θ0 + t/√n, then
$$\mathcal{L}_{\theta_n}(S_n) \to N\big(t\sqrt{I(\theta_0)},\ 1\big).$$
Thus the asymptotic power function of the Rao test is
$$\beta(t) = 1 - \Phi\big(z_{1-\alpha} - t\sqrt{I(\theta_0)}\big),$$
which by Theorem 5.4.5 is optimal.
Note that the result is valid under the conditions of Example 7.1.1, which are weaker than those of Theorem 5.4.5; in particular, nothing is assumed about the MLE. To see this we need only note that, by Example 7.1.1, the most powerful test of H : θ = θ0 vs. K : θ = θₙ, "reject H iff Λₙ(θ0, θₙ) ≥ cₙ(α, θ0)," is asymptotically equivalent to the Rao test both under H and under K by ULAN and contiguity (Problem 9.5.9).
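In the N(θ, 1) model (where I(θ) = 1) the Rao statistic is simply Sₙ = n^{−1/2}ΣXᵢ, and the asymptotic power formula can be checked directly. A Monte Carlo sketch (the sample size and local parameter t are arbitrary illustrative choices):

```python
import math, random

def rao_power_mc(t=2.0, n=100, reps=20000, seed=3):
    # Local alternative theta_n = t/sqrt(n); reject when S_n >= z_{0.95}.
    rng = random.Random(seed)
    z = 1.6448536269514722  # z_{0.95} for alpha = 0.05
    theta = t / math.sqrt(n)
    rej = sum(1 for _ in range(reps)
              if sum(rng.gauss(theta, 1.0) for _ in range(n)) / math.sqrt(n) >= z)
    mc_power = rej / reps
    # beta(t) = 1 - Phi(z_{1-alpha} - t*sqrt(I(theta_0))) with I = 1
    asym_power = 1 - 0.5 * (1 + math.erf((z - t) / math.sqrt(2)))
    return mc_power, asym_power
```

For t = 2 both numbers come out near 0.64, the value of 1 − Φ(z₀.₉₅ − 2).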
We next turn to an extension of the Rao score test to models with nuisance parameters.
Example 9.5.3. Neyman's Cα test. Consider H : ν = ν0 vs K : ν > ν0 for p(·, θ), θ ∈ R^d, a regular model with θ0 = (ν0, ηᵀ)ᵀ satisfying the conditions of the multivariate part of Example 7.1.1. Let
$$S_n(\nu_0, \eta) = n^{-1/2}\sum_{i=1}^n l^*(X_i, \nu_0, \eta)$$
where l* is the efficient score function, l* = l̃ with l̃ = l̇ − Π(l̇ | 𝒫̇₂(P0)), 𝒫̇₂(P0) ≡ {P_{(ν0,η)} : η ∈ R^{d−1}}, and l̃, Π defined as in (9.3.10). Here Π(l̇ | 𝒫̇₂(P0)) is the orthogonal projection of the ν score function l̇ on the tangent space generated by the nuisance parameter η. As shown in Section 1.4, such projections can often be computed as conditional expectations. See Remark 1.4.6.
Let us assume that we can use the data to find out that we are in a neighborhood of the hypothesis, if it holds. That is, there exist η̂ such that for any θ0,
$$n^{-1/2}\sum_{i=1}^n l^*(X_i, \nu_0, \eta_0) = n^{-1/2}\sum_{i=1}^n l^*(X_i, \nu_0, \hat\eta) + o_{P_{\theta_0}}(1), \tag{9.5.9}$$
$$n^{-1}\sum_{i=1}^n [l^*]^2(X_i, \nu_0, \eta_0) = n^{-1}\sum_{i=1}^n [l^*]^2(X_i, \nu_0, \hat\eta) + o_{P_{\theta_0}}(1).$$
Maximum likelihood estimates of η under H satisfy (9.5.9) under A0–A6 of Theorem 6.2.2. See Problem 9.5.10 for a suitable semiparametric example where (9.5.9) holds. If (9.5.9) holds, we can construct the following Neyman Cα test of H:
$$\text{Reject } H \text{ iff } S_n(\nu_0, \hat\eta) \ge z_{1-\alpha}\,\sigma(\hat\eta)$$
where
$$\sigma^2(\eta) \equiv n^{-1}\sum_{i=1}^n \big[l^*(X_i, \nu_0, \eta) - \bar l^*(\nu_0, \eta)\big]^2 \quad\text{and}\quad \bar l^*(\nu, \eta) = \frac1n\sum_{i=1}^n l^*(X_i, \nu, \eta).$$
Using (9.5.9) and Slutsky's theorem we can show (Problem 9.5.12) that for all θ0 satisfying H,
$$S_n(\nu_0, \hat\eta) \Rightarrow N\big(0,\ [I^{11}(\theta_0)]^{-1}\big) \tag{9.5.10}$$
where I¹¹(θ0) is the upper left entry of I⁻¹(θ0). Thus the test based on Sₙ is asymptotically level α for each θ0. By contiguity it is asymptotically of level α uniformly for θ that satisfy H and are in radius n^{−1/2} balls of θ0.
What about the power function of the Cα test? To apply the third lemma, set
$$T_n = n^{-1/2}\sum_i \dot l(X_i, \theta_0) \in R^d, \qquad S_n = [I^{11}(\theta_0)]^{1/2}\, S_n(\nu_0, \eta) \in R.$$
We know that
$$(T_n^T, S_n)^T \xrightarrow{\mathcal{L}} N\left(\begin{pmatrix} 0_{d\times 1} \\ 0 \end{pmatrix},\ \begin{pmatrix} I(\theta_0) & C_{d\times 1} \\ C_{d\times 1}^T & 1 \end{pmatrix}\right)$$
where C_{d×1} = ([I¹¹(θ0)]^{−1/2}, 0ᵀ)ᵀ is obtained from the identities (see Problem 9.5.13)
$$E\,l^*(X_1, \theta_0)\frac{\partial l}{\partial\nu}(X_1, \theta_0) = E[l^*]^2(X_1, \theta_0) = [I^{11}(\theta_0)]^{-1}, \tag{9.5.11}$$
$$E\,l^*(X_1, \theta_0)\frac{\partial l}{\partial\eta_j}(X_1, \theta_0) = 0, \qquad 1 \le j \le d-1.$$
We conclude from Le Cam's Third Lemma that if θₙ = θ0 + tn^{−1/2}, θ0 ∈ H, t = (t₁, ..., t_d)ᵀ,
$$\mathcal{L}_{\theta_n}\big([I^{11}(\theta_0)]^{1/2} S_n(\nu_0, \hat\eta)\big) \to N\big(t_1 [I^{11}(\theta_0)]^{-1/2},\ 1\big).$$
Thus the asymptotic power function of the Cα test is
$$\beta^*(\theta_n) = \lim_n P_{\theta_n}\big[S_n(\nu_0, \hat\eta) \ge z_{1-\alpha}\,\sigma(\hat\eta)\big] = 1 - \Phi\big(z_{1-\alpha} - t_1 [I^{11}(\theta_0)]^{-1/2}\big)$$
for θₙ as given before. The convergence is uniform on n^{−1/2} neighborhoods of points θ0 in H. The details of this argument are given in the hint to Problem 9.5.14. The basic point is that, again, the generalization (9.5.10) of the Rao test (the Neyman Cα test), allowing for nuisance parameters, has, in the presence of ULAN, by Example 7.1.1, the same asymptotic optimality property as the standard test in the multivariate shift Gaussian model with known variance-covariance matrix — see Example 8.3.10. See Problem 9.5.15 for an example.
✷
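As a toy illustration (our construction, not the text's): in the N(ν, η²) model the ν-score (x − ν)/η² is already orthogonal to the η-score, so the efficient score l* is the plain ν-score, and the Cα test with η̂ the MLE under H amounts to a one-sided z-test with estimated variance. A sketch checking the asymptotic level:

```python
import math, random

def c_alpha_level_mc(n=100, reps=20000, nu0=0.0, eta=2.0, seed=11):
    # Rejection rate of "S_n(nu0, eta_hat) >= z_{1-alpha} * sigma(eta_hat)" under H.
    rng = random.Random(seed)
    z = 1.6448536269514722  # z_{0.95}
    rej = 0
    for _ in range(reps):
        x = [rng.gauss(nu0, eta) for _ in range(n)]
        eta2_hat = sum((xi - nu0) ** 2 for xi in x) / n   # MLE of eta^2 under H
        scores = [(xi - nu0) / eta2_hat for xi in x]      # l*(X_i, nu0, eta_hat)
        s_n = sum(scores) / math.sqrt(n)
        mean_sc = sum(scores) / n
        sigma = math.sqrt(sum((sc - mean_sc) ** 2 for sc in scores) / n)
        if s_n >= z * sigma:
            rej += 1
    return rej / reps
```

The empirical rejection rate lands near α = 0.05 even though the nuisance parameter η was estimated, as (9.5.9)–(9.5.10) predict.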
We conclude with two final applications of Le Cam’s Third Lemma pointing to its
generality.
Example 9.5.4. Testing for constant hazard rate. Recall that the exponential distribution is characterized by constant hazard rate and plays the role of default distribution in reliability theory and survival analysis. We will, following Bickel and Doksum (1969), analyze the power of a class of tests which are expected to have power against the plausible alternative of increasing failure rate r(x) = f(x)/[1 − F(x)]. Proschan and Pyke (1966) proposed the use of a statistic which is a U-statistic in the normalized spacings of the data, defined as follows. Let X₍₁₎ < ... < X₍ₙ₎ be the order statistics of a sample X₁, ..., Xₙ i.i.d. P concentrating on R⁺ with P continuous. Define the normalized spacings as
$$D_i = (n - i + 1)(X_{(i)} - X_{(i-1)})$$
with X₍₀₎ ≡ 0. It may be shown that if P = ℰ(λ) then D₁, ..., Dₙ are themselves i.i.d. ℰ(λ). (See Problem B.2.14.) On the other hand, intuitively, if P has an increasing failure rate we expect the Dᵢ to decrease in i stochastically — see Problems 1.1.14, 3.5.13, and 9.5.5. Proschan and Pyke proposed T = Σ_{i<j} 1(D_{ni} ≥ D_{nj}) as a test statistic. Barlow and Proschan (1996), Bickel and Doksum (1969), and Bickel (1970) investigated statistics of the form
$$U = \sum_{i=1}^n c_i\Big(\frac{D_i}{S} - 1\Big) \tag{9.5.12}$$
where S = Σⁿᵢ₌₁ Dᵢ = Σⁿᵢ₌₁ Xᵢ, and the cᵢ are decreasing in i. The motivation is that these statistics are scale-free measures of the covariance between the normalized spacings Dᵢ/S and the cᵢ. We expect the covariance to be positive if the failure rate is increasing. Moreover, as we shall see below, Bickel (1970) shows that for suitable cᵢ the tests based on (9.5.12) are efficient against specific alternative families.
Because U is scale invariant its distribution under H doesn’t depend on λ and we can
set suitable critical values for U by Monte Carlo simulation under E(1). See Example 4.1.6.
As we shall see, a simple Gaussian approximation is also valid.
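The Monte Carlo recipe is immediate: under H the Dᵢ can be simulated directly as i.i.d. ℰ(1). In the sketch below we take c(u) = 1 − 2u (decreasing, with ∫c = 0 and ∫c² = 1/3) and, as a scaling assumption of ours, use the normalized form U = n^{−1/2}Σcᵢ(Dᵢ/D̄ − 1), for which the Gaussian approximation N(0, ∫c²) should hold:

```python
import math, random

def u_stat(d, c):
    # normalized version of (9.5.12): n^{-1/2} sum c_i (D_i/Dbar - 1)
    n = len(d)
    dbar = sum(d) / n
    return sum(c[i] * (d[i] / dbar - 1.0) for i in range(n)) / math.sqrt(n)

def mc_critical_value(n=200, reps=10000, alpha=0.05, seed=5):
    # Under H, D_1..D_n are i.i.d. E(1): simulate U, take its (1-alpha)-quantile.
    rng = random.Random(seed)
    c = [1.0 - 2.0 * i / (n + 1) for i in range(1, n + 1)]  # c(u) = 1 - 2u
    sims = sorted(u_stat([rng.expovariate(1.0) for _ in range(n)], c)
                  for _ in range(reps))
    return sims[int((1 - alpha) * reps)]
```

The simulated critical value agrees closely with the normal approximation z₀.₉₅√(1/3) ≈ 0.95.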
Suppose that we have a regular parametric model 𝒫 ≡ {P_{(λ,η)} : λ ∈ R⁺, η ∈ R} with P_{(λ,0)} = ℰ(λ) for all λ — see Problem 9.5.6. We want to compute the power of the size α test for H : η = 0 vs K : η > 0 based on rejecting for large values of U, and compare it to the asymptotically MP test. Let
$$\dot l(\cdot) = \Big(\frac{\partial l}{\partial\lambda}(\cdot, \lambda_0, 0),\ \frac{\partial l}{\partial\eta}(\cdot, \lambda_0, 0)\Big)^T.$$
To apply Le Cam's Third Lemma we need the asymptotic joint distribution of U and Tₙ ≡ n^{−1/2}Σⁿᵢ₌₁ l̇(Xᵢ, θ0) for θ0 = (λ0, 0)ᵀ. We start by approximating Tₙ with a linear function of the Dᵢ. Without loss of generality set λ0 = 1; then, expanding l̇ about G⁻¹(i/(n + 1)),
$$T \equiv \sum_{i=1}^n \dot l(X_i, \theta_0) = \sum_{i=1}^n \dot l(X_{(i)}, \theta_0) = \sum_{i=1}^n \dot l\Big(G^{-1}\Big(\frac{i}{n+1}\Big), \theta_0\Big) + \sum_{i=1}^n a\Big(\frac{i}{n+1}\Big)\Big[X_{(i)} - G^{-1}\Big(\frac{i}{n+1}\Big)\Big] + \frac12\sum_{i=1}^n b_n\Big(\frac{i}{n+1}\Big)\Big[X_{(i)} - G^{-1}\Big(\frac{i}{n+1}\Big)\Big]^2 \tag{9.5.13}$$
where G is the df of Xᵢ under Pθ0, so that G⁻¹(t) = −log(1 − t), and
$$a\Big(\frac{i}{n+1}\Big) = \frac{\partial \dot l}{\partial x}\Big(G^{-1}\Big(\frac{i}{n+1}\Big), \theta_0\Big), \qquad b_n\Big(\frac{i}{n+1}\Big) = \frac{\partial^2 \dot l}{\partial x^2}\Big(\mu_n\Big(\frac{i}{n+1}\Big), \theta_0\Big)$$
where μₙ(i/(n + 1)) lies between X₍ᵢ₎ and G⁻¹(i/(n + 1)).
Write
$$X_{(i)} - G^{-1}\Big(\frac{i}{n+1}\Big) = \sum_{j=1}^i \frac{D_j - 1}{n - j + 1} + \Big[\sum_{j=1}^i (n - j + 1)^{-1} + \log\Big(1 - \frac{i}{n+1}\Big)\Big] = \sum_{j=1}^i \frac{D_j - 1}{n - j + 1} + O_P\Big(\frac1n\Big). \tag{9.5.14}$$
The last statement follows from the well-known approximation
$$\sum_{j=1}^n \frac1j = \log n + \gamma_0 + O\Big(\frac1n\Big),$$
where γ0 is Euler's constant. Substituting (9.5.14) into (9.5.13) we get
$$n^{-1/2}\sum_{i=1}^n \dot l(X_i, 1, 0) = n^{-1/2}\sum_{i=1}^n d\Big(\frac{i}{n+1}\Big)(D_i - 1) + n^{-1/2}\sum_{i=1}^n O\Big(\frac1n\Big) + \frac12\, n^{-1/2}\sum_{i=1}^n b_n\Big(\frac{i}{n+1}\Big)\Big[X_{(i)} - G^{-1}\Big(\frac{i}{n+1}\Big)\Big]^2 \tag{9.5.15}$$
where
$$d\Big(\frac{i}{n+1}\Big) = \frac{1}{n+1}\sum_{k=i+1}^n a\Big(\frac{k}{n+1}\Big)\Big/\Big(1 - \frac{k}{n+1}\Big).$$
We can show (Problem 9.5.7) that [X₍ᵢ₎ − G⁻¹(i/(n + 1))]² = O_P(n⁻¹) and thus, if bₙ is bounded, then
$$n^{-1/2}\sum_{i=1}^n \dot l(X_i, \theta_0) = n^{-1/2}\sum_{i=1}^n d\Big(\frac{i}{n+1}\Big)(D_i - 1) + O_P(n^{-1/2}). \tag{9.5.16}$$
Suppose cᵢ ≡ c(i/(n + 1)) with the function c(·) continuous, ∫₀¹ c(u)du = 0, and
$$\int_0^1 |d(t)|^2\,dt + \int_0^1 c^2(t)\,dt < \infty.$$
Then, since the Dᵢ are i.i.d. ℰ(1) under H, we can apply the Lindeberg–Feller theorem (Appendix D.1) and the delta method to U. We obtain after some calculation (Problem 9.5.8) that
$$(T, U) \Rightarrow N_3\Big(0,\ 0,\ I(\theta_0),\ \int_0^1 c^2(t)\,dt,\ \int_0^1 d^T(t)c(t)\,dt\Big). \tag{9.5.17}$$
From this we may use Proposition 9.5.3 to deduce the power function of U and relate it to that of the asymptotically most powerful test of H : P = ℰ(λ) vs. K : P = P_{(λ,η)}, η ≥ 0 (Problem 9.5.8). Note that approximate critical values can be obtained because U ⇒ N(0, ∫c²(u)du). For more discussion and the relations of tests of type U to the Proschan–Pyke and other rank tests see Bickel and Doksum (1969) and Bickel (1970).
✷
Our final example needs central limit theorems for dependent random variables which
we do not pursue in this book. We include it to illustrate the generality of LAN.
Example 9.5.5. Autocorrelation. The most common model for temporal dependence of a
sequence of observations is the autoregressive one — see Example 1.1.5, described by
$$X_i = \beta X_{i-1} + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where the εᵢ are i.i.d. with common density f. This only specifies the conditional distribution of X₂, X₃, ... given X₁. If f is N(0, σ²) and |β| < 1 we can, by specifying
$$X_1 \sim N(0, \sigma^2(1 - \beta^2)^{-1}),$$
make the sequence X1 , . . . , Xn a stationary, homogeneous, Gaussian Markov Chain. That
is, (X1 , . . . , Xk ) and (X1+m , . . . , Xk+m ) are identically distributed and the conditional
distribution of Xi given X1 , . . . , Xi−1 depends only on Xi−1 and is the same for all i. In
fact, if we add to the model specification on X1 a non-zero mean as in Example 1.1.5 we
obtain the only examples of stationary, homogeneous Gaussian Markov Chains. We restrict
ourselves to this simple case for most of this example.
Suppose we are interested in testing H : β = 0 vs. K : β ≠ 0; that is, independence vs. autoregression. It is easy to see, if p(x₁, ..., xₙ, β) is as given in Example 1.1.5, for μ = 0 and σ² known (we take σ² = 1 without loss of generality), that
$$\frac{\partial}{\partial\beta}\log p(X_1, \ldots, X_n, \beta)\Big|_{\beta=0} = \frac{1}{\sigma^2}\sum_{i=2}^n X_{i-1}X_i = \sum_{i=2}^n \varepsilon_{i-1}\varepsilon_i. \tag{9.5.18}$$
Thus, (9.5.18) is the Rao score statistic. By letting X₁ ∼ N(0, σ²/(1 − β²)) we make the process stationary. We examine the asymptotic distribution of
$$\Lambda(tn^{-1/2}) \equiv \log\frac{p(X_1, \ldots, X_n, tn^{-1/2})}{p(X_1, \ldots, X_n, 0)}. \tag{9.5.19}$$
The usual Taylor expansion gives
$$\Lambda(tn^{-1/2}) = tn^{-1/2}\sum_{i=2}^n \varepsilon_{i-1}\varepsilon_i - \frac{t^2}{2n}\sum_{i=2}^n \varepsilon_{i-1}^2 + o_P(1). \tag{9.5.20}$$
We show the model is ULAN. To establish LAN we need to show that
$$n^{-1/2}\sum_{i=2}^n \varepsilon_{i-1}\varepsilon_i \xrightarrow{\mathcal{L}} N(0, 1) \tag{9.5.21}$$
since we know, by the WLLN, that
$$\frac1n\sum_{i=2}^n \varepsilon_i^2 \xrightarrow{P} E\varepsilon_1^2 = 1.$$
It is clear that $E\big(n^{-1/2}\sum_{i=2}^n \varepsilon_{i-1}\varepsilon_i\big) = 0$ and
$$\mathrm{Var}\Big(n^{-1/2}\sum_{i=2}^n \varepsilon_{i-1}\varepsilon_i\Big) = \mathrm{Var}(\varepsilon_1\varepsilon_2) + 2\sum_{i\ne j}\mathrm{Cov}(\varepsilon_{i-1}\varepsilon_i,\ \varepsilon_{j-1}\varepsilon_j) = \mathrm{Var}(\varepsilon_1\varepsilon_2) = 1,$$
since Cov(εᵢ₋₁εᵢ, εⱼ₋₁εⱼ) = Eεᵢ₋₁εᵢεⱼ₋₁εⱼ = 0 for i ≠ j by the independence of the εᵢ.
That (9.5.21) holds under the hypothesis H of independence is a consequence of any one of the three following approaches:
(i) Martingale central limit theorems (Hall and Heyde (1980))
(ii) Central limit theorems for functions of Markov chains (Prakasa Rao (1983))
(iii) Strongly mixing random variables (Ibragimov and Linnik (1971))
All of these essentially imply asymptotic normality of Λ(tn^{−1/2}) with natural parameters (mean and variance). Unfortunately, none of the tools we have developed in the independence case can be applied here directly.
To perform tests, we can use statistics which are more robust, for instance,
$$U^* \equiv n^{-1/2}\sum_{i=2}^n \psi(X_{i-1})\psi(X_i) \tag{9.5.22}$$
with a function ψ which downweights large Xᵢ. For instance, under the assumption that ε has a double exponential distribution, the Rao score statistic takes the form (9.5.22) with ψ(t) ≡ sgn(t). Le Cam's Third Lemma and a multivariate central limit theorem for sums of dependent random variables yield the asymptotic distribution of U* under contiguous alternatives. Using (i), (ii), or (iii) again yields
$$\mathcal{L}(T, U^*) \to N\big(0,\ 0,\ 1,\ 1,\ E(\mathrm{sgn}(\varepsilon_i\varepsilon_{i-1})\varepsilon_i\varepsilon_{i-1})/E\varepsilon_1^2\big) = N\big(0,\ 0,\ 1,\ 1,\ (E|\varepsilon_1|)^2/E\varepsilon_1^2\big).$$
Thus, as in the case of the ordinary one sample problem, if ε ∼ N(0, 1), the asymptotic distribution of U* is N(0, π²/4).
The comparison in power of T and this U* leads to the same conclusions as the asymptotic comparison of the sign test to the t test, or of the median to the mean, in the one sample problem. For more on inference under dependence we refer to Prakasa Rao (1983) and Ibragimov and Has'minskii (1981).
✷
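Under H the joint limit can be checked by simulation. The sketch below (with ε ∼ N(0, 1); the sizes are arbitrary choices of ours) computes the Gaussian score T = n^{−1/2}Σεᵢ₋₁εᵢ and the sign statistic U* of (9.5.22) with ψ = sgn, whose limiting correlation should be (E|ε₁|)²/Eε₁² = 2/π ≈ 0.64:

```python
import math, random

def ar1_stats_mc(n=300, reps=4000, seed=13):
    # Under H: beta = 0, so X_i = eps_i i.i.d. N(0,1).
    rng = random.Random(seed)
    ts, us = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t = sum(x[i - 1] * x[i] for i in range(1, n)) / math.sqrt(n)
        u = sum(math.copysign(1.0, x[i - 1]) * math.copysign(1.0, x[i])
                for i in range(1, n)) / math.sqrt(n)
        ts.append(t); us.append(u)
    mt, mu = sum(ts) / reps, sum(us) / reps
    vt = sum((a - mt) ** 2 for a in ts) / reps
    vu = sum((a - mu) ** 2 for a in us) / reps
    cov = sum((ts[i] - mt) * (us[i] - mu) for i in range(reps)) / reps
    return vt, vu, cov / math.sqrt(vt * vu)
```

Both marginal variances come out near 1 and the correlation near 2/π, matching the displayed joint law.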
Summary. This section treats local approximations to models that can be used to find approximations to the properties of statistical decision procedures. For the regular model {Pθ : θ ∈ Θ}, the local model is {Pθₙ : θₙ = θ0 + t/√n; |t| ≤ M} for some constant M > 0. More generally we consider sequences of models, called experiments, of the form ℰₙ = {Pθ^(n) : θ ∈ Θₙ}, and define ℰₙ to be uniformly locally asymptotically normal (ULAN) at θ0 if the log likelihood ratio for testing θ0 vs θ0 + tγₙ, where |γₙ| ↓ 0 and |t| ≤ M, can be closely approximated by an asymptotically normal sequence of statistics. We relate the ULAN concept to the concept of contiguity: the sequence of probabilities {Q(n)} is contiguous to the sequence of probabilities {P(n)} iff for any sequence of events Aₙ such that P(n)(Aₙ) → 0, Q(n)(Aₙ) → 0 also. This concept is useful because if we can show that an approximation Sₙ to a vector Tₙ satisfies Tₙ − Sₙ → 0 in probability under a hypothesis H, then Tₙ − Sₙ → 0 in probability also for contiguous alternatives. In the case where P(n) = Pθ0^(n) is ULAN and Q(n) = Pθₙ^(n) with θₙ = θ0 + tγₙ, |t| ≤ M, Q(n) can be shown to be contiguous to P(n). We establish Le Cam's Third Lemma, which states that if Tₙ denotes
the ULAN approximation to the local log likelihood ratio and if the joint distribution of
Tₙ and some statistic Sₙ tends to a multivariate normal distribution under P(n), then Sₙ is asymptotically multivariate normal under Q(n). We illustrate
the usefulness of these ideas by establishing the asymptotic efficiency of the Rao score
and Neyman Cα tests and by developing asymptotic theory for a class of unbiased tests of
exponentiality vs increasing failure rate based on normalized spacings.
9.6 PROBLEMS AND COMPLEMENTS
Problems for Section 9.1
1. In Example 9.1.1,
(a) Show that β is identifiable in model (a).
(b) Show that α is unidentifiable in model (a) and σ is unidentifiable in both (a) and (b).
(c) Show that both α and β are identifiable in model (b).
(d) Show that β̂ as defined in Example 6.2.2 with f₀ the logistic density is √n consistent for β in model (a).
(e) Give the influence functions of the estimates β̂ and α̂ = Ȳ − Z̄β̂ in model (b).
2. In Example 9.1.2,
(a) Establish (9.1.4).
(b) Prove that F is identifiable in both (9.1.2) and (9.1.5). What conditions are needed?
3. Parametric biased sampling. Simpler models result in more efficient inference when the
sample size is small; see Section I.7 (although if seriously wrong the biases they bring can
be overwhelming). Thus comparisons of procedures derived from parametric and semiparametric models are of interest. In the length bias part of Example 9.1.2 let V be the
number of strata with v elements and suppose that V has the Poisson(μ) density
$$h(v) = h_\mu(v) \equiv \frac{e^{-\mu}\mu^v}{v!}, \qquad v = 0, 1, \ldots.$$
An element is drawn at random from the union of all strata S₀, S₁, ..., where Sⱼ has j elements, and the size X of the stratum is observed, where now X ≥ 1. Let Z be a variable with ℒ(Z) = ℒ(V | V ≥ 1).
(a) Show that Z has density
$$f(z) = f_\mu(z) \equiv \frac{e^{-\mu}\mu^z}{(1 - e^{-\mu})z!}, \qquad z = 1, 2, \ldots.$$
(b) Show that W(F) = E(Z) = μ/[1 − e^{−μ}] and thus
$$p_F(x) = \frac{e^{-\mu}\mu^{x-1}}{(x-1)!}, \qquad x = 1, 2, \ldots.$$
That is, ℒ(X) = ℒ(V + 1).
194
Inference in semiparametric models
Chapter 9
(c) Recall that V and Z are not observed; instead X is. Show that the MLE of μ based on a sample X₁, ..., Xₙ from p_F(x) is μ̂ = Σⁿᵢ₌₁(Xᵢ − 1)/n and thus, by the equivariance property of MLEs, the MLEs of h(v) and f(z) are ĥ(v) = h_{μ̂}(v) and f̂(z) = f_{μ̂}(z).
Remark 1. The Poisson model includes households with no children and µ = E(V ) is the
expected number of children over all households.
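A quick numerical check of (c) (the sampler and the parameter value are our choices): since p_F says X − 1 ∼ Poisson(μ), the MLE is just the mean of the Xᵢ − 1.

```python
import math, random

def sample_x(rng, mu):
    # X ~ p_F(x) = e^{-mu} mu^{x-1}/(x-1)!, i.e. L(X) = L(V + 1) with
    # V ~ Poisson(mu); V drawn by Knuth's product-of-uniforms method
    # (fine for moderate mu).
    limit, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k + 1
        k += 1

def mle_mu(xs):
    # Problem 3(c): mu_hat = sum(X_i - 1)/n
    return sum(x - 1 for x in xs) / len(xs)
```

On a simulated sample the estimate recovers μ, and every observation satisfies X ≥ 1 as required.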
4. Parametric stratified sampling. In the stratified sampling part of Example 9.1.2, suppose (I₁, Y₁), ..., (Iₙ, Yₙ) is a sample from p_F(j, y). Based on this sample, find the MLE of μ when Z is Poisson(μ), k = 2, 𝒳₁ = {0, 1, 2, ...}, 𝒳₂ = {1, 2, 3, ...}, and (a) λ₁ = λ₂ = 1/2, (b) λ₁ and λ₂ general but known; where λⱼ = P(I = j), j = 1, 2.
5. Size biased class size. Suppose we are interested in the mean size of undergraduate classes at a certain university. We sample n students and ask which of the following categories their current classes fall in: [1, 20), [20, 40), [40, 60), ..., [800, ∞). These categories are converted to the class sizes K = {10, 30, 50, ..., 900}, and we define the mean from the perspective of the students as μ_S = Σ_{k∈K} k p_k, where p_k is the proportion of students in classes of size k. Suppose we are interested in the mean class size μ_P from the perspective of the professors at the university. Explain why μ_P is smaller than μ_S and describe how to estimate μ_P.
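The point of the problem is classic size bias: sampling students weights a class of size k proportionally to k, so the student-view distribution is the size-biased one, and μ_P can be recovered by reweighting each report by 1/k (a harmonic mean). A sketch with a made-up university (all numbers hypothetical):

```python
import random

def mu_professor(sizes):
    # Each report of class size k arrives w.p. proportional to k, so
    # reweighting by 1/k recovers the class-level mean: n / sum(1/x_i).
    return len(sizes) / sum(1.0 / s for s in sizes)

def demo(seed=4):
    rng = random.Random(seed)
    classes = [10] * 80 + [100] * 20             # hypothetical: mu_P = 28
    mu_p_true = sum(classes) / len(classes)
    # students sit in classes with probability proportional to class size
    reports = rng.choices(classes, weights=classes, k=50000)
    mu_s = sum(reports) / len(reports)           # student-perspective mean (~74)
    return mu_s, mu_professor(reports), mu_p_true
```

The naive student-side average badly overstates the professors' mean, while the harmonic reweighting recovers it.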
6. (a) In Example 9.1.10, show that β → ℒ_P(r(Z, β)) being 1–1 and P[S₀(T, 0, P) = c] < 1 for all c are necessary for identifiability of β in the Cox model.
(b) Show that under these conditions, if β(P) is defined by (9.1.42), then β(P_{β₀}) = β₀.
7. In Example 9.1.8, show that Ŵ exists and is unique with probability tending to one provided 0 < P(δ = 0) < 1.
Hint. Use exponential family theory.
8. In Example 9.1.9, prove (9.1.32) to (9.1.34) following the approach for censoring.
9. For a semiparametric example where maximum likelihood fails, consider the model where Y ∈ {0, 1}, Z ∈ B ⊂ R^d, B bounded, U ∈ [0, 1], and
$$E(Y \mid Z, U) = P(Y = 1 \mid Z, U) = L\big(\beta^T Z + \eta(U)\big), \qquad \beta \in R^d,$$
with L(t) = 1/[1 + e^{−t}]. This is a partially linear logistic regression model. Assume that η(·) is in the class of functions 𝓕ₖ with J^{(k)}(η) < ∞, where J^{(k)}(η) = ∫[η^{(k)}(t)]²dt, k ≥ 1. Let L(β, η) denote the likelihood based on (Yᵢ, Zᵢ, Uᵢ) i.i.d. as (Y, Z, U). Write p(y, z, u) = p(y | z, u)p(u, z) and assume that p(u, z) does not involve β. Show that
$$\sup\{L(\beta, \eta) : \eta \in \mathcal{F}_k\} = 1.$$
Thus any b ∈ B is an MLE of β. In this model efficient estimates can be obtained by maximizing the penalized likelihood L(β, η) − λ²J^{(k)}(η), where J^{(k)}(η) is a roughness penalty and λ² is a tuning parameter. See Mammen and van de Geer (1997).
10. In Example 9.1.14,
(a) Show that without loss of generality we can take J to be the d × d identity matrix.
(b) Fill in the details in the proof that if
Var Zk = 1, EZk = 0, 1 ≤ k ≤ d
and at most one of the Zk is Gaussian, then A is identifiable. (Assume that if EZk2 <
∞, γk (t) can be defined and is twice differentiable in a neighbourhood of 0.)
Hint. If γₖ″ is not identically constant, then it takes on an infinite number of distinct values.
11. In Example 9.1.4, the estimate r̂_{j,t} depends on selecting t ≡ t̂ so that 0 < Ŝ₀(t̂) < 1. Thus t = t̂ is random. Find conditions under which r̂_{j,t̂} → r_j(β) in probability.
12. Show that in Example 9.1.13, if EP |Y | < ∞, 0 < EP (Z − E(Z|U ))2 < ∞, then
(9.1.52) is valid and β is identifiable for the given parametrization.
Hint. Y − EP (Y |U ) = β(Z − E(Z|U )) + ε.
13. Show in Example 9.1.9 that
(a) If (9.1.25) is replaced by its continuous approximation (9.1.29), we arrive at the same estimating equation for λ_{Fj} but with the expression (9.1.30) for S.
(b) The unique maximizer of K(P(F,G) , P ) is given by (9.1.31). Thus the parametrization sending (F, G) into P according to (9.1.24) is identifiable if 0 < P [δ = 0] < 1
by using (9.1.31).
Hint for (b). Use Shannon’s inequality.
14. In the censored data framework of Example 9.1.9, show that the empirical likelihood estimate of the hazard rate λ_{Fi} at y₍ᵢ₎, based on maximizing (9.1.26) or (9.1.29), is λ̂_{Fi} = δ₍ᵢ₎/(n + 1 − i) when there are no ties.
15. Show that in the proportional hazard model, Ld of Example 9.1.12 is proportional to
the approximate empirical likelihood (9.1.36).
16. In Example 9.1.12, show that the likelihoods Ld and La are different for the model
with
Λ(y|z) = aθΛ(y) + (1 − a)Λ(θy) ,
where θ = exp{βᵀz}, a ∈ (0, 1) is fixed, and Λ(y) = −log[1 − F(y)] for some baseline df F on [0, ∞).
17. The odds ratio is defined as Γ(t|z) = F(t|z)/[1 − F(t|z)] and the proportional odds model is defined by
$$\Gamma(t|z) = \theta\Gamma(t), \qquad \theta > 0,\ t > 0,$$
where θ = r(z, β) and Γ(t) = F(t)/[1 − F(t)] for some baseline distribution F on [0, ∞).
(a) Show that if G(u; θ) = u/[u + θ(1 − u)], 0 ≤ u ≤ 1, then G(F(y); θ) follows a proportional odds model.
(b) Show that if G(y; θ) = L(y − θ), where L(t) = [1 + e^{−t}]⁻¹ is the logistic df, and if F(y|z) = G(η(y); θ), η(y) = L⁻¹(F(y)), then F(y|z) follows a proportional odds model.
(c) (i) Write an expression for the likelihood L_d of Example 9.1.12 for this model.
(ii) When θ = exp{βz}, z ∈ R, show that this likelihood is concave in η₁, ..., ηₙ, where we write ηᵢ for η(y₍ᵢ₎).
Remark 2. Bell and Doksum (1966, Table 8.1, last row) considered optimal tests of H : θ = 1 in the proportional odds model (a). This paper gives several more examples of models of the form G(η(y); θ). See Problems 8.3.14 and 9.1.25.
18. We noted that L_d and L_a give the same estimates of θ and β in model (9.1.46). However, they give different estimates of η′(t). For instance, in the Cox proportional hazards model, the L_a estimate of λ = Λ′ is given in (9.1.37). It is increasing in y ∈ {y₍₁₎, ..., y₍ₙ₎} and converges to zero, and thus is not consistent. On the other hand, the Breslow estimate
$$\hat\Lambda_B(y) = \sum_{k=1}^n \hat\lambda(y_{(k)})\,1(y_{(k)} \le y),$$
which is the counting measure integral of λ̂(·), is known to be consistent.
(a) Show that the L_d estimate is
$$\hat\lambda_d(y_{(i)}) = \frac{1}{(y_{(i)} - y_{(i-1)})\sum_{j \ge i}\exp\{\hat\beta^T z_{(j)}\}}, \qquad y_{(0)} = 0,\ 1 \le i \le n.$$
If we define λ̂_d(y) to be λ̂_d(y₍ᵢ₎) on [y₍ᵢ₋₁₎, y₍ᵢ₎), 1 ≤ i ≤ n, and zero elsewhere, then
$$\int_0^y \hat\lambda_d(s)\,ds = \hat\Lambda_B(y)$$
is again the Breslow estimate. Thus λ̂_d(·) is the ordinary right derivative of the Breslow estimate.
(b) Show, using an example, that λ̂_d(·) is not a consistent estimate of λ(·). For consistent estimates of λ(·), see Chapter 11.
Hint. If Y₍ᵢ₎ is an exponential, ℰ(λ), order statistic, then (n + 1 − i)(Y₍ᵢ₎ − Y₍ᵢ₋₁₎) has an ℰ(λ) distribution (Problem B.2.14).
19. In the accelerated failure time model, show that the log likelihood is proportional to
(9.1.60).
20. For the AFT model (9.1.58) the approximate log likelihood l_a is obtained by replacing Λ(·) by a step function with jump size λᵢ at each v = θᵢ, 1 ≤ i ≤ n. Show that
(a) $l_a(\beta, \lambda) = n^{-1}\sum_{i=1}^n \big[\delta_i\beta^T Z_i(Y_i) + \delta_i\log\lambda_i - \sum_{j=1}^n \lambda_j 1(\theta_j \le \theta_i)\big]$.
(b) For fixed β, the maximizer is
$$\hat\lambda_k(\beta) = \frac{\delta_k}{\sum_{j=1}^n 1(\theta_j \ge \theta_k)}, \qquad 1 \le k \le n.$$
(c) The approximate profile log likelihood is proportional to
$$l_a(\beta) = n^{-1}\sum_{i=1}^n \Big\{\delta_i\beta^T Z_i(Y_i) - \delta_i\log\Big[\sum_{j=1}^n 1(\theta_j \ge \theta_i)\Big]\Big\}.$$
Note that this objective function has an infinite maximum and does not achieve its maximum for finite β. Thus the approximate likelihood L_a without the Σpᵢ = 1 condition does not work for this model.
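The profile likelihood in (c) is a one-liner to code, and coding it makes the closing remark concrete: the objective grows without bound in β. In this sketch z is a scalar covariate and we simply take θᵢ = βzᵢ as the ordering variable (an assumption of ours standing in for the model's θᵢ):

```python
import math

def profile_loglik(beta, z, delta):
    # l_a(beta) = n^{-1} sum_i delta_i [ beta*z_i - log #{j : theta_j >= theta_i} ]
    n = len(z)
    theta = [beta * zi for zi in z]
    total = 0.0
    for i in range(n):
        if delta[i]:
            risk = sum(1 for t in theta if t >= theta[i])  # size of the risk set
            total += beta * z[i] - math.log(risk)
    return total / n
```

With uncensored data and any non-constant z, the value keeps increasing in β: the risk-set terms are bounded while the linear term is not, which is exactly why L_a fails here.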
21. Models where the proportional hazards (PH) and accelerated failure time (AFT) models are equivalent. Let Tᵢ ≥ 0 be a failure time whose continuous df Fᵢ depends on a vector zᵢ of fixed covariates. Let λᵢ(t) = fᵢ(t)/[1 − Fᵢ(t)], where fᵢ(t) is the continuous case density of Tᵢ. Consider the following models for independent failure times T₁, ..., Tₙ:
(i) PH: λᵢ(t) = Δᵢλ(t), some unknown baseline λ(t); Δᵢ = r(zᵢ, β) > 0, some known r(·, ·), unknown β.
(ii) AFT: Fᵢ(t) = F(τᵢt), some unknown baseline F; τᵢ = s(zᵢ, α) > 0, some known s(·, ·), unknown α.
(iii) Periodic hazard rate (PHR): λᵢ(t) = hᵢ(log t)t^{γᵢ−1}, some known non-negative periodic function hᵢ(·) with period cᵢ > 0; γᵢ = q(zᵢ, θ) > −1, some known q(·, ·), unknown θ.
(iv) Weibull: λᵢ(t) = γᵢt^{γᵢ−1}, γᵢ = q(zᵢ, θ) > −1, some known q(·, ·), unknown θ. In this case, Fᵢ is said to be a Weibull distribution.
(v) Trigonometric PHR: λᵢ(t) = exp{sin(2π log t / log τᵢ)}t^{γᵢ−1}, τᵢ = s(zᵢ, α) > 0 and γᵢ = q(zᵢ, θ) > −1, where s(·, ·) and q(·, ·) are known, α and θ are unknown.
(a) Show that (i) and (ii) both hold iff
(vi) λ(τᵢt) = Δᵢλ(t)/τᵢ, all t > 0, some τᵢ ≠ 1, τᵢ > 0, some Δᵢ > 0.
(b) Suppose lim_{t↓0}[λ(t)/tᵃ] exists and is positive for some a > −1. Show that (i) and (ii) hold iff Fᵢ is a Weibull distribution.
(c) Show that if (iii) holds with cᵢ = |log τᵢ| and γᵢ = log Δᵢ/log τᵢ, then (vi) holds, so that the PHR model is both PH and AFT.
(d) Show that (iv) and (v) are examples of (iii).
Remark. Doksum and Nabeya (1984) show that a model is both PH and AFT iff it is PHR.
22. The quantile hazard rate (QHR) of a random variable T ≥ 0 with hazard rate λ(·) is defined by q(α) = λ(t_α), 0 < α < 1, where t_α = F⁻¹(α) is the αth quantile of the distribution F of T. Let q(α|z) denote the quantile hazard rate of (T|z). The proportional quantile hazard (PQH) model is defined by
$$q(\alpha|z) = \theta q(\alpha),$$
where q(·) is a baseline QHR and θ = r(z, β) for some known function r(·, ·). Show that (T|z) follows the PQH model iff (T|z) follows the AFT model in which (T|z) is distributed as θ⁻¹T₀ for some baseline variable T₀.
23. The Hoeffding Rank Likelihood.
(a) Consider Example 9.2.12 where Fᵢ(y) = G(η(y); θᵢ) for some parametric df G(v; θ) and increasing differentiable function η(·), where θᵢ = r(zᵢ, β) with zᵢ a nonrandom covariate vector. Show that the Hoeffding rank likelihood of Problem 8.3.10(a) reduces to
$$L_R(\beta) = E_G\left(\prod_{i=1}^n g(V_{(r_i)}; \theta_i)\Big/\prod_{i=1}^n g(V_{(r_i)})\right)\Big/\,n!,$$
where g(v; θ) is the continuous case density of G(v; θ) and g(v) = g(v; θ₀) with θ₀ corresponding to a null hypothesis (typically β = 0). Assume that g(v) > 0 whenever ∏ⁿᵢ₌₁ g(v; θᵢ) > 0.
(b) Show that if G(y; θ) = 1 − exp{−θy} and η(y) = −log[1 − F₀(y)], y > 0, for some continuous df F₀, then L_R(β) is equivalent to the Cox likelihood (9.1.44). See Kalbfleisch and Prentice (1973, 2002), who called L_R(β) a marginal likelihood.
Hint for (b): Note that if h(·) is decreasing, Rank(h(Yᵢ)) = n + 1 − Rᵢ. For the proportional hazards model, transform Yᵢ by Uᵢ = 1 − F₀(Yᵢ); then f_{Uᵢ}(u) = θᵢu^{θᵢ−1}, 0 < u < 1. Hoeffding's formula with f(u) = 1(0 < u < 1) shows
$$L_R(\beta) \propto \prod_{i=1}^n \theta_i \int\cdots\int_{0 < u_1 < u_2 < \cdots < u_n < 1}\ \prod_{i=1}^n u_i^{\delta_i - 1}\,du_1\cdots du_n,$$
where δᵢ = θ_{bᵢ} and bᵢ = index of the Y with rank n + 1 − i. Use this to show that
$$L_R(\beta) \propto \prod_{i=1}^n \frac{\theta_i}{\sum_{k: Y_k \ge Y_{(i)}} \theta_k}.$$
Remark 3. Asymptotic properties of maximum rank likelihood estimates were obtained by Bickel and Ritov (1997), who show that in a general class of transformation models, estimates based on L_R are asymptotically efficient in the sense of Section 9.3.
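The final display is just the Cox partial likelihood, which is simple to evaluate. A sketch with a scalar covariate (θᵢ = exp(βzᵢ); the function name and setup are ours):

```python
import math

def log_rank_likelihood(beta, z, y):
    # log L_R(beta) = sum_i [ log theta_(i) - log sum_{k: Y_k >= Y_(i)} theta_k ],
    # with theta_i = exp(beta * z_i): the Cox partial likelihood of Problem 23(b).
    n = len(y)
    theta = [math.exp(beta * zi) for zi in z]
    ll = 0.0
    for i in sorted(range(n), key=lambda j: y[j]):  # walk through Y_(1) < ... < Y_(n)
        risk = sum(theta[k] for k in range(n) if y[k] >= y[i])
        ll += math.log(theta[i]) - math.log(risk)
    return ll
```

At β = 0 every θᵢ = 1 and the expression collapses to −log n!, the uniform probability of a single rank vector, as the null-hypothesis normalization requires.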
24. (a) Show that n−1 times the log likelihood for the Cox model in the uncensored case is
l(ξ) = n^{−1} Σ_{i=1}^{n} y_i log p_i ,   ξ = (β, λ(·)),

where

p_i = θ_i λ(t_i) exp{−θ_i Λ(t_i)},   θ_i = exp(β^T z_i) .
(b) Set λ_i = Λ(t_i) − Λ(t_{i−1}), Λ(t_0) = 0; then Λ(t_i) = Σ_{j≤i} λ_j. Let l_d be l with λ(t_i)
replaced by λ_i/d_i, where d_i = t_i − t_{i−1}. Let F and f = F′ be the df and density of T.
Assume that log λ(·) is uniformly continuous and that λ(·) is nondecreasing. Show that
l(ξ) − l_d(ξ) = O_P(n^{−1} log λ(T_n)) ,
where Tn is the largest F order statistic, T ∼ F .
Hint. By the mean value theorem, λ_i = d_i λ(T_i^*) where T_{i−1} ≤ T_i^* ≤ T_i; here T_{i−1} and
T_i are order statistics. Next show that

l(ξ) − l_d(ξ) ≤ n^{−1} Σ [log λ(T_i) − log λ(T_{i−1})] = n^{−1} log λ(T_n) .
Remark. It is known that λ(T_n) ≅ T_n f(T_n), in the sense that the ratio tends to one with
probability one. See de Haan and Ferreira (2006), Theorem 5.4.1. Thus if tf (t) is bounded,
then l(ξ) − ld (ξ) = O(n−1 ) with probability one.
(c) Let la be l with λ(ti ) replaced by λi , that is, let Λ(t) be a step function with jumps
of size λi . Show that in the exponential case where λ(t) = 1/µ is constant, E[l(ξ) − la (ξ)]
does not tend to zero as n → ∞.
Hint. See Problem B.2.14 and recall that if T ∼ E(µ) then E(T_{(i)}) = Σ_{j=n+1−i}^{n} (1/j).
Remark. In this model ld and la are equivalent functions of ξ. However, ld is more
convenient for showing closeness to the semiparametric likelihood l.
25. Two sample plug-in semiparametrics. Suppose X_1, …, X_n are i.i.d. as X ∼ F,
Y_1, …, Y_n are i.i.d. as Y ∼ H; the X's and Y's are independent. Consider the model
where H(y) = G(F(y); θ) with F unknown for some parametric model G(u; θ), 0 ≤ u ≤ 1.
Assume temporarily that F is known; then if we set U = F(Y), U has distribution
G(u; θ) and we can find the MLE θ̂(F) of θ based on U_1, …, U_n, U_i = F(Y_i). If F
is unknown, set θ̂ = θ̂(F̂), where F̂ is the empirical df of the X's. Because sup_x |F̂(x) −
F(x)| = O_P(n^{−1/2}), θ̂ is consistent when θ̂(F) is a consistent estimate of θ(F) and θ(F)
is a continuous function of F in the sup norm. For the following models, find
(a) The influence function of θ̂(F) assuming F is known.
(b) The influence function of θ̂ = θ̂(F̂).
Models:
(i) Proportional hazards. G(u; θ) = 1 − [1 − u]^θ, θ > 0.
200
Inference in semiparametric models
Chapter 9
(ii) Lehmann. G(u; θ) = u^θ, θ > 0.
(iii) Proportional odds. G(u; θ) = u/[u + θ(1 − u)], θ > 0.
(iv) Normal transformation model. G(u; θ) = Φ(Φ^{−1}(u) − θ), θ ∈ R.
(v) SINAMI. G(u; θ) = (e^{θu} − 1)(e^{θ} − 1)^{−1}, θ ≠ 0; = u when θ = 0.
(vi) Self-contamination. G(u; θ) = (1 − θ)u + θu^c, 0 ≤ θ ≤ 1, c > 1 is a constant.
Remark 4. These models were considered in Problem 8.3.14.
Problems for Section 9.2
1. In Example 9.2.1, show that En (·) =⇒ Z(·).
Hint. Apply the delta method and show that the remainder is bounded uniformly.
2. Derive (9.2.5) using Theorem 6.2.1 and Theorem 7.1.4.
3. Derive (9.2.7).
4. Establish that Rem_1 = o_P(n^{−1/2}) in the proof of Theorem 9.2.1, using the equicontinuity
of the empirical process and the consistency of the empirical quantile function.
5. Establish (9.2.11) by computing the covariance of the limiting process in part (ii) of
Theorem 9.2.1.
Hint. Use integration by parts.
6. Verify in detail that the conditions of Theorem 9.1.3 imply statements (a) and (b) of the
proof of Theorem 9.2.2 showing how Theorem 1.6.3 applies.
7. Verify the details of the proof that c) of Proposition 9.1.1 applies in the proof of (iii) of
Theorem 9.2.2.
Hint. Use the fact that n^{1/2}(S_0(·, β, P̂) − S_0(·, β, P)) can be thought of as an empirical
process and then apply the delta method to Λ̇_α.
8. Derive a result similar to Theorem 9.2.1 for the truncation estimates (9.1.34).
9. Give the details of the derivation of (9.2.17).
10. Show that (9.2.17) and (9.2.20)–(9.2.22) imply (9.2.19).
11. Verify (9.2.20)–(9.2.22) assuming the validity of (9.2.24).
12. Consider the Cox model with z fixed and T having hazard rate λ(t) exp{β^T z}. Let
A = {t : log S0 (t, β, P ) is strictly convex in β} .
Using the proof of Theorem 1.6.3, show that log S0 (t, β, P ) is always strictly convex and
analytic and P [T ∈ A] > 0 unless
P [T = c(z)] = 1 ,
which is ruled out by our assumptions.
13. Suppose Z ∼ Uniform(0, 1) and that (T |z) has the exponential model constant failure
rate λ exp{βz}, λ > 0, β ∈ R.
(a) Find the asymptotic distribution of the MLE β̂_MLE of β in this model.
(b) Find the asymptotic distribution of the Cox estimate β̂_COX using Theorem 9.2.2.
(c) Give the asymptotic relative efficiency (ARE) of β̂_COX with respect to β̂_MLE in the
exponential model (see Problem 5.4.1(f) for the definition of ARE). Give the value
of the ARE when τ = ∞.
(You may use MATLAB or any other software in this problem.)
14. Suppose that (Z, T) ∼ Cox(β_0, λ) and that condition C holds. Show that, given
Z = z, V = Λ(T) exp(β_0^T z) has a standard exponential distribution.
Hint. Set θ = exp(β_0^T z). Note that

P[θΛ(T) ≤ s | z] = P[T ≤ Λ^{−1}(s/θ) | z] .
15. Suppose (Z1 , T1 ), . . . , (Zn , Tn ) are i.i.d. as (Z, T ), where T > 0 is a survival time
and Z ∈ R, with 0 < E[Z 2 ] < ∞ and P (Z = 0) = 0, is a predictor random variable.
Let H(z, t; β), β ∈ R, denote the distribution of (Z, T ) and assume that the marginal
distribution of Z does not involve β. Consider the estimating equation
n^{−1} Σ_{i=1}^{n} ψ(z_i, t_i; β) = 0 ,

where ψ(z, t; β) = ṙ(z; β)/r(z; β) − ṙ(z; β) t, and r(· ; β) > 0 is a 1-1 known function of β with
derivative ṙ(z; β) = ∂r(z; β)/∂β.
(a) Suppose (Z, T ) has distribution H1 where H1 is such that T given Z has the Weibull
distribution with df
1 − exp{−r(z, β_0) t^α},   α > 0 .
For what values of α does EH1 [ψ(Z, T ; b)] = 0 have the solution b = β0 ?
(b) Let r(z; β) = exp(−βz) and let α be the answer(s) to part (a). Is the solution b = β0
in part (a) unique?
Hint. You may use the fact that if W has the Weibull distribution with df
1 − exp{−w^α/θ}, then E[W] = θ^{1/α} Γ(α^{−1} + 1).
(c) Let D(b) = n^{−1} Σ_{i=1}^{n} ψ(Z_i, T_i; b). Suppose r(z, β) = exp(−βz). Show that D(b) = 0
has a unique solution.
(d) Let β_0 be the solution to E_H[ψ(Z, T; b)] = 0 when r(z; β) = exp(−βz) and let β̂
be the solution to D(b) = 0 as detailed in (c). Derive the asymptotic distribution of
√n(β̂ − β_0).
Hint. Note that P(√n(β̂ − β_0) ≤ s) = P[D(b_n) ≤ 0], where b_n = β_0 + s/√n.
Problems for Section 9.3
1. Show that (9.3.5) is implied by (9.3.3) and (9.3.4).
2. (a) Show that I −1 (P0 : ν, Q) given in (9.3.2) is valid and independent of parametrization.
Hint. Let t → τ(t) be a reparametrization of Q. Let f*(·, τ) = p(·, t(τ)) where t(τ) has
inverse τ(t), t(0) = 0, and t′(τ) > 0. Let P_{f*,τ} denote the probability corresponding to
f*(·, τ) and compute

∂/∂τ log f*(·, τ)|_{τ=0} ,   ∂/∂τ ν(P_{f*,τ})|_{τ=0} .
(b) Show that ψ ∗ and Q̇ as defined in Definition 9.3.1 and (9.3.8) do not depend on the
parametrization of Q.
Hint. Note that l̇_{f*} = t′(0) l̇_p.
3. In Example 9.3.1, show that sup{I^{−1}(P_0 : ν, Q_a) : a ∈ R^d} = [nE(ZZ^T)]^{−1}_{(1,1)}.
4. Show that in parametric models P satisfying the conditions of Theorem 6.2.2, the MLE
is efficient in the sense of Proposition 9.3.1.
Hint. Take any a ∈ R^d, θ_0 and consider the one-dimensional model Q ≡ {P_{θ_0 + λa} : |λ| ≤ ε}.
5. Derive Theorem 3.4.3 from Proposition 9.3.2.
6. (a) Give the tangent spaces for the {G(p, λ) : p > 0, λ > 0} and {β(r, s) : r > 0, s > 0}
models at a generic point of the parameter space.
(b) Suppose p(x, η) = exp{Σ_{j=1}^{d} η_j T_j(x) − A(η)} h(x), η ∈ E. Compute the tangent
space of this model at η_0.
7. Suppose X ∼ P(θ,η) , θ ∈ R, η ∈ Rd , a regular parametric model satisfying A0 – A6
of Theorem 6.2.2 for l(X, θ, η). Let θb be the MLE of θ. Show that θb has influence function
at (θ0 , η 0 ) given by
where
e
l(X, P0 : θ, P) =
l˙1 − Π(l˙1 |[l˙21 , . . . , l˙2d ])
kl˙1 − Π(l˙1 |[l˙21 , . . . , l̇2d ])k2
∂l
∂l
l˙1 (X) =
(X, θ0 , η 0 ),
(X, θ0 , η 0 ), l2j (X) =
∂θ
∂ηj
1≤j ≤d.
Hint. e
l is the first coordinate of I −1 (θ0 , η 0 )(l˙1 , l̇21 . . . . , l˙2d )T . Use B.10.20.
8. Suppose X = (Z, Y ), Y = βZ + ε, E(ε|Z) = 0 with the joint distribution of (Z, ε)
otherwise arbitrary and Eε2 < ∞, 0 < EZ 2 < ∞.
(a) Show formally that the tangent space of this model at (α0 , β0 , L0 ) where L0 is the
joint distribution of (Z, ε) is given by
{ h(Z, ε) + c (f_0′/f_0)(Z, ε) : f_0 the joint density of (Z, ε), Eh(Z, ε) = 0,
Eh²(Z, ε) < ∞, E(εh(Z, ε) | Z) = 0 and c ∈ R } ,
and that the influence function of the LSE does not belong to the tangent space in general.
(b) Suppose that Z is two valued, P [Z = 0] > 0, P [Z = 1] > 0. Show that the
estimate which is the MLE for β in the model
Yi = α + βZi + σ(Zi )εi
where the εi are i.i.d. N (0, 1) and independent of the Zi , the Zi are treated as fixed, and
σ(·) as known is efficient for β at P0 with the εi as above.
(c) Suppose σ 2 (Z) is unknown but the εi are still Gaussian N (0, 1) independent of Zi .
How would you construct an efficient estimate?
Hint. (a) If p_t(z, ε) = exp{th(z, ε) − Δ(p_0, h)} p_0(z, ε) and ∫ ε p_t(z, ε) dε = 0 for all t, z,
then E(εh(Z, ε) | Z) = 0.
(c) The LSE is a √n consistent estimate of β.
9. In the censoring example (Example 9.3.4), let ν(P ) = S(t0 ) for some fixed t0 .
(a) Show (formally) that the tangent space at (F, G) in this example is
{ h(Y, δ) ≡ δΔ_1(Y) − ∫_{−∞}^{Y} Δ_1(s) dΛ_F(s) + (1 − δ)Δ_2(Y) − ∫_{−∞}^{Y} Δ_2(s) dΛ_G(s) :
Eh(Y, δ) = 0, Eh²(Y, δ) < ∞ } .
(b) Show that the influence function of the empirical MLE of ν, exp{−∫_0^Y dĤ_1(s)/ H̄̂(s)},
is of this form.
10. Show that in Example 9.3.5 the influence function of the Cox estimate (9.2.15) is of the
form (9.3.23).
Hint. Set up identity and differentiate to solve.
11. (a) Show that the influence function of F̃(x) for F̃ as given in Example 9.1.2 belongs
to Ṗ_2(P_0) as given in Example 9.3.2.
(b) Show that θ(F̂) ≡ F̂(x) is regular. You may assume w_j ≥ ε > 0 for 1 ≤ j ≤ k if
necessary.
12. In Example 9.2.4, show that β̂* given by (9.2.23) is efficient.
13. Consider the nonparametric regression model Y = µ(X) + e, where µ(·) is unknown,
E(e) = 0, Var(e) = σ², X and e are independent with continuous case densities
f and g. The nonparametric correlation is η² ≡ Corr²(µ(X), Y). Assume g′ and η²
exist. Recall (Problem 7.2.25) that η² = [E µ²(X) − µ_Y²]/Var(Y). Formally find the
tangent space Ṗ(P0 ) and show that it is of the form “(function of x) + (function of x times
a function of e).”
Remark. By Problem 7.2.25 it follows that the influence function µ²(x) + 2µ(x)e of
ν(P) = E µ²(X) is in Ṗ(P_0). Thus ν̂ ≡ n^{−1} Σ_{i=1}^{n} µ̂²(X_i) will be an efficient estimate
of ν(P) provided µ̂(·) is an "accurate" estimate of µ(·). See Doksum and Samarov (1995)
and Aït-Sahalia, Bickel and Stoker (2001) for such estimates of E µ²(X) for multivariate
X.
14. Suppose (X, Y) ∼ P ∈ P where P is the bivariate normal copula model with
(Φ^{−1}(F(X)), Φ^{−1}(G(Y))) ∼ N(0, 0, 1, 1, ρ).
(a) Show that if F and G are known, then the tangent space at (0, 0, 1, 1, ρ_0) is

Ṗ_1(P_0) = [ (ρ_0 + (1 + ρ_0²)zw − ρ_0(z² + w²)) / (1 − ρ_0²)² ] ,

where z = Φ^{−1}(F(x)) and w = Φ^{−1}(G(y)).
Hint. P (X ≤ x, Y ≤ y) = H(z, w) where H = N (0, 0, 1, 1, ρ).
(b) Suppose F and G are unknown, and that f = F′ and g = G′ exist. Show that the
tangent space Ṗ(P_0) at (P_0, F_0, G_0) is Ṗ_1(P_0) + Ṗ_2(P_0) + Ṗ_3(P_0) where

Ṗ_2(P_0) = { b(x)/f_0(x) + [b(x)f_0(x)/φ(z)] ( z + (ρ_0 w − z)/(1 − ρ_0²) ) }

Ṗ_3(P_0) = { c(y)/g_0(y) + [c(y)g_0(y)/φ(w)] ( w + (ρ_0 z − w)/(1 − ρ_0²) ) }

for some b(x) and c(y) in L_2(P_0). Here φ, f_0, and g_0 are the continuous case densities of
Φ, F_0, and G_0.
Remark. Klaassen and Wellner (1997) show that the normal scores correlation coefficient
of Example 8.3.12 is efficient for the bivariate normal copula model. That is, they show that
the influence function of ν(P̂) is in Ṗ(P_0), where ν(P) = Corr(Φ^{−1}(F(X)), Φ^{−1}(G(Y))).
Problems for Section 9.4
1. Validate (9.4.3).
2. (a) Show that Sn of (9.4.13) is well defined, i.e. finite even if M = ∞.
(b) Show that weak convergence theory cannot yield anything here since

sup_x |Z(x)| [Φ(1 − Φ)]^{−1/2}(x) = ∞ ,

where Z is given in Example 9.2.1.
Hint. Use the Law of the Iterated Logarithm:

lim_{t→0} |W(t)| t^{−1/2} |log_2 t|^{−1/2} > 0 ,

where log_2 ≡ log log.
3. Suppose that F_n(x) = F_0(x) + Δ_n(x)/n^{1/2} where |Δ_n − Δ|_∞ → 0 and let P_n be the
distribution under which X_1, …, X_n are i.i.d. F_n.
(a) Show that, under P_n,

E_n^0(·) ≡ √n(F̂(·) − F_0(·)) ⇒ W^0(F_0(·)) + Δ(·) .
(b) Let F_0 be the U(0, 1) df and let φ_1, φ_2, … be the eigenfunctions of the kernel
h → Kh, h ∈ L_2(0, 1), where [Kh](s) = ∫_0^1 h(u)(s ∧ u − su) du, i.e.

∫_0^1 φ_a(u)φ_b(u) du = δ_{ab} ,   Kφ_a = λ_a φ_a .

The φ and λ are well known to be

φ_a(u) = √2 sin(aπu),   0 ≤ u ≤ 1,   λ_a = (aπ)^{−2} .
Show that if Δ ∈ L_2(0, 1), Δ = Σ_{k=1}^{∞} (Δ, φ_k)φ_k where (Δ, φ_k) ≡ ∫_0^1 Δ(u)φ_k(u) du,
and Ẑ_{an} ≡ ∫ φ_a(u) dE_n^0(u). Then, for all k,

(Ẑ_{1n}, …, Ẑ_{kn}) ⇒_{FIDI} (Z_1, …, Z_k) ,

where the Z_j are independent N(µ_j, λ_j), 1 ≤ j ≤ k, and µ_j ≡ (Δ, φ_j).
(c) Show that ∫_0^1 E_n^{0 2}(u) du ⇒ Σ_{k=1}^{∞} Z_k² = Σ_{j=1}^{∞} λ_j (Z_j′ + Δ_j λ_j^{−1/2})²
(with Δ_j ≡ (Δ, φ_j)), where Z_j′, j ≥ 1, are i.i.d. N(0, 1).
Hint. (c) E ∫_0^1 (E_n^0(u) − Σ_{k=1}^{m} Ẑ_{kn} φ_k(u))² du = Σ_{k=m+1}^{∞} (λ_k + (Δ, φ_k)²).
4. Extend the result of Problem 3 to the situation of Example 9.4.2. Suppose T =
{1(−∞, s] : s ∈ R}, let P_n correspond to F_{θ_0}(x) + Δ_n(x)/n^{1/2}, and |Δ_n − Δ|_∞ → 0.
Show that if Ẑ_n(·) = √n(F̂ − F_{θ̂})(·), then, under P_n,
(a) Ẑ_n(·) ⇒ Z^0(·) + (Δ − Π_{θ_0}(Δ))(·).
(b) In particular, show that if Δ ∈ [∂l/∂θ_j (X, θ_0) : 1 ≤ j ≤ d], then any test based on
Ẑ_n(·) has no power for the alternative P_n. Why is this qualitatively reasonable?
5. Show, extending the result of Problem 9.2.4, that for any fixed s, T_{n1} ≡ √n(F̂ − F_{θ̂})(s)
has no greater, and usually less, power as a test statistic for H : F = F_{θ_0} vs K : F_n as in
Problem 9.2.4 than does T_{n2} ≡ √n(F̂ − F_{θ_0})(s).
Hint. Compute critical values under H using Problem 9.2.4 and then show as in Chapter 5
that the power for T_{nj} is governed by a ratio of signal to noise, µ_j/σ_j. Show that µ_2²/σ_2²
can be written as (a, b)²/‖b‖² for suitable a, b while

µ_1²/σ_1² = (a, b − Π_0 b)²/‖b − Π_0 b‖² = (a, Π_0^⊥(b))²/‖Π_0^⊥(b)‖² ,

where Π_0, Π_0^⊥ are projection operators. Use the identity

Π(a | [b]) = Π(a | [Π_0^⊥ b]) + Π(a | [Π_0 b])

and hence ‖Π(a | [b])‖² ≥ ‖Π(a | [Π_0^⊥ b])‖².
6. Suppose ν̂ is an efficient estimate of ν(P) for P ∈ P. Let l̃(X, P_0 : ν, P) be its
influence function. Suppose that, for all P_0 ∈ P, l̃(·, P̂ : ν, P) is well defined and

E_0 [ n^{−1} Σ_{i=1}^{n} ( l̃(X_i, P̂ : ν, P) − l̃(X_i, P_0 : ν, P) ) ]² → 0 .
(a) Show how to construct asymptotic 1 − α confidence bounds for ν(P), P ∈ P, based on ν̂.
(b) Suppose that

n Var_*(ν̂*) →^{P_0} I^{−1}(P_0 : ν, P) ,

where the * as usual indicates computation under the bootstrap distribution given P̂ as
defined in Section 9.4. Show how to construct the asymptotic 1 − α confidence bounds in
this case.
Problems for Section 9.5
1. Establish Proposition 9.5.1.
Hint. Given An , by the Neyman–Pearson lemma, there exists cn < ∞ such that P (n) [Ln >
cn ] ≤ P (n) (An ) ≤ P (n) [Ln ≥ cn ] and Q(n) [Ln ≥ cn ] ≥ Q(n) (An ). If P (n) (An ) → 0,
suppose cn ↑ c ≤ ∞. Since P (n) [Ln > cn ] → 0, P [Λ ≥ c] = 0. For suitable d < c,
1 − Q(n) [Ln ≥ d] = Q(n) [Ln < d] = EP (n) (Λn 1(Ln < d)) → EΛ1(Λ < d). Since
EΛ = 1, Q(n) [Ln ≥ d] → EΛ1(Λ ≥ d). But E(Λ1(Λ ≥ d)) → 0 as d ↑ c by the
dominated convergence theorem.
2. Establish (9.5.7).
3. In Example 9.5.1, show that E[(T_U − S_U)/s]² → 0 under Lindeberg's condition.
Hint. Put bounds on Var(U_{(i)}) and Cov(U_{(i)}, U_{(j)}) using Problem B.2.9. Note that the
covariance can be obtained from
Var(Y − X) = VarX + VarY − 2Cov(X, Y ) .
4. Let X_1, …, X_m be i.i.d. with continuous df F and density f = F′; and let Y_1, …, Y_n
be i.i.d. with df G(y) = F(y − θ). This is Example 9.5.1 with z_i = 0, 1 ≤ i ≤ m, and z_i = 1,
m + 1 ≤ i ≤ N, where N = m + n. In this case T_U of Example 9.5.1 is the two-sample
Wilcoxon statistic. Show that for θ = t/√n, t > 0, and λ = lim_{N→∞}(n/N), if λ ∈ (0, 1), the
asymptotic power of T_U is
lim_{N→∞} P(√12 T_U ≥ z_{1−α}) = 1 − Φ( z_{1−α} − √(12λ(1 − λ)) t ∫ f²(x) dx ) .
N →∞
Hint. By contiguity it is enough to compute lim P
N →∞
defined in Example 9.5.1.
√
12SU ≥ z1−α where SU is
5. Let G(t) = 1 − exp{−t}, t ≥ 0, and r(t) = f (t)/[1 − F (t)] where {t : f (t) > 0} =
[0, ∞). Show that
(a) If r(t) is increasing on [0, ∞) and i < j, then P_F(D_i ≤ D_j) ≥ P_G(D_i ≤ D_j) = 1/2.
(b) Define F to have more increasing failure rate (IFR) than F_1, written F >_c F_1, if F_1^{−1}F
is convex (cf. van Zwet (1964)). Show that if F_1 and F are increasing on [0, ∞), 0
elsewhere, F >_c F_1 and i < j, then

P_F(D_i ≤ D_j) ≥ P_{F_1}(D_i ≤ D_j) .
(c) A test φ(·) is said to have monotone power for IFR testing if

F >_c F_1 ⇒ E_F φ(X) ≥ E_{F_1} φ(X) .
In Example 9.5.4, let D = (D1 , . . . , Dn ) and D′ = (D1′ , . . . , Dn′ ) be two vectors of
normalized spacings and let δ(·) be a test function based on normalized spacings. Then
δ(·) is said to be monotone wrt IFR testing if δ(D′ ) ≤ δ(D) for all D and D′ with Di′ /Di
nondecreasing in i. Show that the tests δU,v (·) that reject exponentiality in favor of K :“F
is IFR” when U ≥ v, v > 0, have monotone power and are unbiased. (U is given by
(9.5.12).)
Hint (a). Use (b) with F1 = G for the first inequality.
Hint (b). Since F_1^{−1}F is increasing, X′_{(i)} = F_1^{−1}F(X_{(i)}) is the ith order statistic in a
random sample from a population with distribution F_1. Let D′_i = (n − i + 1)(X′_{(i)} −
X′_{(i−1)}), i = 1, …, n. Since F_1^{−1}F is convex, i < j and D′_i ≥ D′_j implies D_i ≥ D_j.
Hint (c). Let Z be a discrete random variable taking on the values 1, …, n and let P_0 and
P_1 be probabilities such that P_0(Z = k) = D_k/Σ D_i and P_1(Z = k) = D′_k/Σ D′_i.
Let D and D′ be such that D′_k/D_k is nondecreasing in k; then P_0 and P_1 have monotone
likelihood ratio in k. Set c_n(i) = c_i. It follows from Theorem 4.3.1 (first established by
Lehmann (1955)) that

E_{P_0}(−c_n(Z)) = −Σ c_n(i)D_i / Σ D_i ≤ −Σ c_n(i)D′_i / Σ D′_i = E_{P_1}(−c_n(Z)) ;

thus the tests δ_{U,v}(·) are monotone. Next use the hint to (b).
6. Let G(t) = 1 − e^{−t}, t ≥ 0. Consider the models P^j_{η,λ} with densities λ f_j(λx; η),
j = 1, 2, 3, x > 0, λ > 0, η ≥ 0, where, for t ≥ 0,

f_1(t; η) = (1 + η)t^η exp{−t^{(1+η)}}            (Weibull)
f_2(t; η) = (1 + ηt) exp{−(t + (1/2)ηt²)}         (Linear FR)
f_3(t; η) = (1 + ηG(t)) exp{−(t + η(t − G(t)))}   (Exponential FR)

(a) Show that the FRs for f_j(t; η), j = 1, 2, 3, are (1 + η)t^η, 1 + ηt, and 1 + ηG(t).
(b) Show that for the models P^j_{η,λ}, j = 1, 2, 3, the statistic Σ_{i=1}^{n} d_1(i/(n+1))(D_i − 1) as
defined by (9.5.15) and (9.5.16) has d_1(u) given by −log(−log(1 − u)), log(1 − u),
and −u, respectively (recall that λ = 1 in (9.5.16)).
7. Show that if X_{(i)} is an E(1) order statistic, and if G(t) = 1 − exp(−t), t ≥ 0, then

( X_{(i)} − G^{−1}(i/(n+1)) )² = O_P(n^{−1}) .

Hint. Write X_{(i)} = G^{−1}(U_{(i)}) where U_{(i)} is a uniform (0, 1) order statistic. Expand
G^{−1}(U_{(i)}) around E(U_{(i)}) = i/(n + 1) and use Problem B.2.9.
8. In Example 9.5.4, let c_i = c(i/(n+1)) with ∫ c(u) du = 0 and ∫_0^1 c²(u) du < ∞.
(a) Establish (9.5.17).
(b) Set U* = Σ_{i=1}^{n} c_i(n^{−1}λD_i − 1). Show that n^{−1/2}(U − U*) →^P 0 as n → ∞ under
the exponential hypothesis. It follows that U and U* have the same asymptotic distributions
for contiguous alternatives.
(c) Use Le Cam's third lemma and (9.5.17) to give the asymptotic power function for
contiguous alternatives of the test based on U .
(d) Use (c) above to find c(·) that maximizes the asymptotic power function of U .
9. In Example 9.5.2 assume the model satisfies the ULAN condition. Show that Λ_n =
S_n + o_P(1) under K : θ_n = θ_0 + t/√n.
10. Consider the partial linear model Y = βZ + g(U) + ε, ε ∼ N(0, σ²), of Examples
9.1.13, 9.2.4, and 9.3.5.
(a) Establish formally analogues of (9.5.9). Identify ν_0 with β and η with (µ_1, µ_2)^T
where µ_1(u) = E(Z | U = u) and µ_2(u) = E(Y | U = u) = βµ_1(u) + g(u). Recall that
we observe X_1, …, X_n i.i.d. as X = (Z, U, Y). Assume n even, set k = n/2, and use
X_1, …, X_k to estimate µ_1(·), µ_2(·) and X_{k+1}, …, X_n to estimate the influence function.
Use (11.6.1) to estimate µ_1(·) and µ_2(·).
(b) Determine sufficient conditions for your estimates of (a) to satisfy (9.5.9).
(c) Find the asymptotic power function of the Neyman cα test (9.5.10) for H : β = 0
vs K : β > 0.
(d) Suppose (U, Z) has a N (0, 0, 1, 1, ρ) distribution. Discuss the behaviour of the
power function in (c) as a function of ρ.
11. Consider the logistic copula regression model of Example 8.3.11 with θ_i = β_0 +
β_1 z_i. Assume that the z_i, 1 ≤ i ≤ n, satisfy Lindeberg's condition and that the baseline
distribution F is known. We test H : β_1 = 0 vs K : β_1 > 0 using S_U = n^{−1/2} Σ_{i=1}^{n} F(X_i)(z_i − z̄).
Find the asymptotic power of this test for contiguous alternatives β_1 = t/√n.
Remark. By Problem 9.5.3, the uniform scores rank statistic TU has the same asymptotic
power as SU .
12. Establish (9.5.10).
13. Establish (9.5.11).
Hint. l* = ∂l/∂ν_0 − π_0(∂l/∂ν_0 | Ṗ_1(P_0)) is orthogonal to Ṗ_1(P_0).
14. Convergence of the power function β_n(θ) of the Neyman c_α test is uniform on n^{−1/2}
neighbourhoods of θ_0.
Hint. Apply Le Cam's third lemma to S_n(ν_0, η̂) for sequences ν_0 + s/√n, η_0 + t/√n.
15. The Neyman-Scott model. Suppose Xij = Yij + σZij , i = 1, . . . , n, j = 1, . . . , K,
K ≥ 2 where Yij ≡ Yi , independent of Zij and each other and distributed according to F
unknown, and Zij are i.i.d. N (0, 1). Only Xij are observed.
(a) Construct a test for H : σ 2 = 1 vs σ 2 > 1, F unknown, which is size α and has
asymptotic power > α for σ 2 > 1 independent of F .
Hint. Consider U_i ≡ (K − 1)^{−1} Σ_{j=1}^{K} (X_{ij} − X̄_{i·})², which are i.i.d. as σ²χ²_{K−1}/(K − 1).
(b) Show that this test is asymptotically most powerful among all asymptotically level
α tests when F = N(µ, τ²) for any µ, τ².
Hint. Compute l∗ .
(c) Give a heuristic argument for or against the test being asymptotically most powerful at other F.
(d) Suppose you treat the Y_{ij} as unknown constants µ_i. Show that you arrive at the
same test.
(e) However, show that under the assumption of (b) the MLE of σ 2 is not consistent.
16. Suppose that P_n(x, θ), θ ∈ R, satisfy the ULAN condition and that T_n(X) is asymptotically
N(0, I) under P_n(x, θ). S_n is asymptotically ancillary if L_{θ + tn^{−1/2}}(S_n) → N(µ, σ²)
independent of t. Show that T_n and S_n are then asymptotically independent.
9.7 Notes
Notes for Section 9.1.
(1) Owen (2001) makes his extension of empirical likelihood to (multiple) biased sampling
by considering k independent samples of size n_j, with n_j > 0, 1 ≤ j ≤ k, Σ_{j=1}^{k} n_j = n,
and permitting an empirical likelihood model for each.
(2) The Cox model was introduced by Cox (1972, 1975) in connection with yet another
modified likelihood, partial likelihood. See Example 9.1.11. Other modifications, conditional and marginal likelihoods, are discussed in Kalbfleisch and Sprott (1970) and Kalbfleisch and Prentice (1973). See Problem 9.1.23.
Notes for Section 9.2.
(1) For an alternate proof of the concavity of Γτ (β, P ), see Problem 9.2.12.
Note for Section 9.3.
(1) It may turn out that even though we have a candidate efficient influence function, no
regular estimate exists — see BKRW.
Notes for Section 9.5.
(1) Le Cam's notion of LAN is uniform in a much weaker sense. He requires only R_n(t_n)
→^{P_{θ_0}} 0 if t_n → t. The notion we use is due to Hájek (1970). It simplifies proofs and
applies to all common situations of interest in which Le Cam's weaker property applies.
(2) The notion of contiguity is an asymptotic analogue of the familiar measure theoretic,
finite n property that P (n) dominate Q(n) .
Chapter 10
MONTE CARLO METHODS
10.1 The Nature of Monte Carlo Methods
The importance of Monte Carlo methods as we have discussed before is that they enable us
to construct estimates of features of probability distributions which are impossible or very
difficult to compute analytically. Here “estimate” is not used in the sense we have before of
a quantity depending on data (which is random through the data being used) to estimate a
feature of an unknown probability distribution. Rather it is we who are generating random
quantities which can be used to estimate features of a probability distribution specified in a
completely known way.
We can couple the plug-in principle (Section 2.1.2) with Monte Carlo methods to obtain
statistical estimates as well. That is, if we have data from an unknown probability distribution belonging to a model we can estimate that distribution in some way and then, if we
have a feature (parameter) of the true distribution we want to estimate, use Monte Carlo to
estimate the feature by random quantities generated using the estimated distribution. We
shall discuss both Monte Carlo and statistical applications of Monte Carlo in this chapter.
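The plug-in idea just described can be sketched concretely. The following is an illustrative sketch (the function names are ours, not the text's): the unknown distribution is estimated by the empirical distribution of the data, and Monte Carlo draws from that estimate approximate the sampling distribution of a feature.

```python
import random

def plugin_monte_carlo(data, feature, B, seed=0):
    """Plug-in principle + Monte Carlo: approximate the sampling distribution
    of `feature` under the empirical distribution of `data` by drawing B
    samples of the same size from it (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(data)
    return [feature([rng.choice(data) for _ in range(n)]) for _ in range(B)]
```

For instance, `plugin_monte_carlo(data, lambda xs: sum(xs)/len(xs), B=1000)` returns 1000 Monte Carlo realizations of the sample mean under the estimated distribution, whose empirical spread estimates the estimator's variability.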
Here are some typical examples of statistical situations where Monte Carlo is natural.
Example 10.1.1. Goodness-of-fit tests. Consider testing H : F = F0 using a statistic
Tn (X1 , . . . , Xn ) where X1 , . . . , Xn are i.i.d. with df F . We need an α critical value cn
such that PF0 [Tn ≥ cn ] = α.
Monte Carlo Solution: Generate, on the computer(1), B i.i.d. sets of n i.i.d. observations (X_{11}, …, X_{1n}), …, (X_{B1}, …, X_{Bn}), from F_0 and form T_{nb} ≡ T_n(X_{b1}, …, X_{bn}),
1 ≤ b ≤ B. Then, for large B, the empirical distribution of the Tnb , 1 ≤ b ≤ B, is arbitrarily close to the distribution of Tn (X1 , . . . , Xn ) under F0 by the law of large numbers. Thus
if T(1) ≤ · · · ≤ T(B) are the ordered Tnb and [ ] is the greatest integer function, the natural
estimate T([B(1−α)]+1) of cn tends to cn as B → ∞. (We assume cn uniquely defined
here.) See also Example 4.1.6. Recall again that B can be made as large as we please. Here
are two applications:
(a) A simple application is to permutation tests. Consider the two sample problem
where X1 , . . . , Xn are independent with X1 , . . . , Xm i.i.d. G1 , Xm+1 , . . . , Xn i.i.d. G2
and H : G1 = G2 . This hypothesis is not simple. However, as we have noted in Chapter 8,
under H, if X_{(1)} ≤ ··· ≤ X_{(n)} are the ordered X_1, …, X_n, the conditional distribution of
(X1 , . . . , Xn ) given X(1) = x(1) , . . . , X(n) = x(n) , x(1) ≤ · · · ≤ x(n) , the permutation
distribution is simply the uniform distribution on {(x(i1 ) , . . . , x(in ) ) : (i1 , . . . , in ), a permutation of {1, . . . , n}}. This is our F0 . Sampling from this distribution is easy. One
simply selects i1 uniformly from {1, . . . , n}, i2 uniformly from {1, . . . , n} − {i1 }, and so
on. A routine for doing this is implemented in R for instance as an option of the “sample”
function. The two-sample t-statistic

T = √( m(n − m)/n ) · (Ȳ − X̄)/s
of Example 8.2.7 is an example of a statistic Tn . The permutation distribution of T is
appropriate when G1 and G2 are not Gaussian and min(m, n − m) is not large.
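The Monte Carlo permutation scheme just described can be put in a few lines. This is an illustrative sketch (the function names are ours), using uniform random shuffles of the pooled sample in place of R's "sample" option:

```python
import random
import statistics

def two_sample_t(x, y):
    # Two-sample t-type statistic of Example 8.2.7:
    # sqrt(m(n-m)/n) * (Ybar - Xbar) / s, with s the pooled standard deviation.
    m, k = len(x), len(y)
    n = m + k
    pooled_var = ((m - 1) * statistics.variance(x)
                  + (k - 1) * statistics.variance(y)) / (n - 2)
    return ((m * k / n) ** 0.5) * (statistics.mean(y) - statistics.mean(x)) \
        / pooled_var ** 0.5

def permutation_pvalue(x, y, B=999, seed=0):
    """Monte Carlo approximation to the permutation p-value: shuffle the
    pooled sample B times (each shuffle is a uniform random permutation)
    and count how often the resampled statistic reaches the observed one."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    t_obs = two_sample_t(x, y)
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)
        if two_sample_t(pooled[:m], pooled[m:]) >= t_obs:
            count += 1
    return (count + 1) / (B + 1)
```

By the law of large numbers this p-value converges, as B grows, to the exact permutation p-value of H : G_1 = G_2.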
(b) A second application is to testing for normality. As in Example 4.1.6 and Section
I.2, consider testing the goodness-of-fit hypothesis H : F = Φ((· − µ)/σ) for some µ, σ, where
Φ is the N (0, 1) distribution and say the statistic is an unorthodox one such as that of
Shapiro–Wilk type, the residual sum of squares of a normal quantile plot for the data,
T_n(X_1, …, X_n) = Σ_{i=1}^{n} ( Z_{(i)} − Φ^{−1}(i/(n+1)) )² ,   Z_{(i)} = (X_{(i)} − X̄)/σ̂ .
For this Tn , we can take F0 = N (0, 1) because Tn is invariant under Xi → (Xi − µ)/σ.
Approximations to the distribution of Tn under F0 are known (Venter and de Wet (1972))
but the exact distribution is only sparsely tabulated. As in Example 4.1.6, to do Monte
Carlo we need a set of B simple random samples of size n from N (0, 1). Most statistical
packages including R can generate these. But for some F0 , it is not clear a priori how this
is done, if we view as most primitive that one can generate U(0, 1) observations easily.(1)
We shall discuss some methods in Section 10.2.
Example 10.1.2. Estimating risks functions. Given a parametric model for X ∈ Rn , {Pθ :
θ ∈ Θ}, a decision theoretic formulation with loss function l(θ, δ), and a decision rule
δ(X), we want to compute the risk R(θ, δ) = Eθ l(θ, δ(X)) for selected θ. As discussed in
Section 5.1, this can typically not be done analytically since if X, say, has density p(x, θ),

R(θ, δ) ≡ E_θ l(θ, δ(X)) = ∫ l(θ, δ(x)) p(x, θ) dx
The Monte Carlo approach is just to generate X1 , . . . , XB i.i.d. Pθ and approximate
R(θ, δ) by

B^{−1} Σ_{b=1}^{B} l(θ, δ(X_b))
and again appeal to the law of large numbers. For instance take X = (X1 , . . . , Xn ) i.i.d.
with df F_0(x − θ) where F_0 is symmetric with center of symmetry 0. Let, for 0 ≤ α < 1/2,

X̄_α = (n − 2[nα])^{−1} Σ_{j=[nα]+1}^{n−[nα]} X_{(j)}
be the α-trimmed mean. Then, if θ̂ = X̄_α and l(θ, d) = (θ − d)²,

E_θ(θ̂ − θ)² = E_0(θ̂²) ,
and we can approximate the constant risk function arbitrarily closely by taking B samples
of size n, from F0 , and computing the average of l(θ, δ(X1 )), . . . , l(θ, δ(XB )) over these
samples. We evaluate the performance of X̄α and other estimates by repeating this for
several F0 of various shapes such as the Gaussian, Laplace, and Cauchy df’s. See Problem
3.5.9 and Andrews et al. (1972).
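A sketch of this risk computation (the function names are ours; `sampler` draws one observation from a candidate F_0 centered at 0):

```python
import math
import random

def trimmed_mean(xs, alpha):
    # alpha-trimmed mean: average the order statistics after discarding
    # the [n*alpha] smallest and [n*alpha] largest observations.
    n = len(xs)
    k = int(n * alpha)
    xs = sorted(xs)
    return sum(xs[k:n - k]) / (n - 2 * k)

def mc_risk(sampler, n, alpha, B=2000, seed=0):
    """Approximate the constant squared-error risk E_0[thetahat^2] of the
    alpha-trimmed mean under a symmetric F_0 centered at 0 by averaging
    over B Monte Carlo samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        est = trimmed_mean([sampler(rng) for _ in range(n)], alpha)
        total += est * est
    return total / B

# Distributional shapes to compare, as in Andrews et al. (1972):
gauss = lambda rng: rng.gauss(0, 1)
cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
```

Repeating `mc_risk` for several shapes (Gaussian, Laplace, Cauchy, …) and several α gives the comparative performance study described above.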
In the case α = 1/2, when X̄_α is interpretable as the median, it is possible to use numerical integration in one dimension to evaluate its MSE by formula (5.1.2). For X̄_0 ≡ X̄ one
can compute the characteristic function numerically and then use Fourier inversion. Even
for X̄_α, by conditioning on X_{([nα]+1)} and X_{(n−[nα])} it is possible to use one- and two-dimensional numerical integrations (Problem 10.1.2). However, these methods become
more and more elaborate and fail entirely, for instance, for general linear combinations of
order statistics. Thus, Monte Carlo is appropriate.
✷
Both of these examples should make it clear that, so far, the use of Monte Carlo is an
arbitrarily sharp approximation to a high-dimensional integral or sum. If the dimension
(sample size) were 1 or 2 we could use more accurate numerical integration procedures.
But even for modest sample sizes numerical integration doesn’t work. The great virtue of
(simple random sampling) Monte Carlo methods is that their MSE

E[ B^{−1} Σ_{b=1}^{B} l(θ, δ(X_b)) − E_θ l(θ, δ(X)) ]² = Var l(θ, δ(X)) / B

is not dependent on the dimension n, and the numerator is bounded uniformly if l is.
We shall see that simple random sampling is not always applicable, but important variants can be used.
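The point that the Monte Carlo error is Var/B whatever the dimension can be illustrated directly. In this sketch (the names are ours), each replication draws an entire n-dimensional sample, yet the accuracy of the average is governed by B alone:

```python
import random

def mc_mean(f, B):
    """Plain Monte Carlo: average B independent replications of f();
    the MSE of this average is Var(f())/B, whatever f does internally."""
    return sum(f() for _ in range(B)) / B

def abs_mean_of_sample(n, rng):
    """|Xbar_n| for an i.i.d. N(0,1) sample of size n: one random draw
    whose expectation is an n-dimensional integral."""
    return abs(sum(rng.gauss(0, 1) for _ in range(n)) / n)

rng = random.Random(7)
# E|Xbar_100| = sqrt(2/(100*pi)) ~ 0.0798.  B = 2000 replications give a
# standard error of order sqrt(Var/B), unaffected by the fact that each
# replication is a 100-dimensional draw.
est = mc_mean(lambda: abs_mean_of_sample(100, rng), 2000)
```

Doubling n changes the target integral but not the B^{−1/2} rate of the Monte Carlo error.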
Example 10.1.3. Joint posterior distributions. Consider a Bayesian framework where θ
has prior density π on Rk and X given θ = θ has density p(x|θ). Here both π and p(x|θ)
are given analytically. By Bayes’ rule, the posterior density is
π(θ | X = x) = p(x | θ)π(θ) / ∫_{R^k} p(x | t)π(t) dt .   (10.1.1)
The denominator of (10.1.1) presents a difficult high-dimensional integration problem if k
is at all large (unless π is an analytically explicit conjugate prior as in Section 1.6.5). The
problem becomes even more acute if we need, say, the marginal posterior density of θ1
which involves a (k − 1)-dimensional integration for each θ_1.
It is often very useful to represent the posterior distribution of θ by the empirical distribution of i.i.d. observations θ1 , . . . , θB from the posterior distribution. In this case, obtaining an approximation to the (marginal) posterior distribution of, say, q(θ) is easy. We
just use its representation by the empirical distribution of q(θ1 ), . . . , q(θ B ). Unfortunately
it turns out that, if k is at all large, in general, it is very difficult to obtain a genuine simple
214
Monte Carlo Methods
Chapter 10
random sample, θ1 , . . . , θB , as above. For instance, consider a generalization of Example
3.2.1. Let θ = (µ, σ −2 ) and let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ) given θ. Put on θ the prior
distribution making µ and σ independent with µ ∼ N (ν0 , τ02 ) and σ −2 ∼ Γ(p0 , λ0 ). It is, a
priori, quite unclear how to generate independent observations from the resulting posterior
distribution.
In such situations, methods of a type we shall discuss in Section 10.4 known as Markov
chain Monte Carlo (MCMC) are used to generate θ∗1 , . . . , θ∗B which are approximately
independent and individually have approximately the correct posterior distribution.
✷
Summary. We illustrated the usefulness of Monte Carlo methods in three situations. The
first is where for some completely specified distribution F0 , the distribution LF (Tn ) of
a test statistic equals LF0 (Tn ) for all distributions F in a null-hypothesis class F0 . Suppose LF0 (Tn ) cannot be computed analytically. We can then generate B independent samples from F0 , compute the values Tn1 , . . . , TnB for these B Monte Carlo samples, and
then approximate the null-hypothesis distribution of Tn by the empirical distribution of
Tn1 , . . . , TnB . This case illustrates the need for algorithms for generating samples from
a given F0 . The second situation is where we need to compute the risk Eθ,τ l(θ, δ(X))
of a decision procedure δ for a model with principal parameter θ and nuisance parameter
τ . For instance, if we use a location equivariant estimate (Problem 3.5.6) to estimate the
center of symmetry θ of a distribution Fθ (x) = F0 (x − θ), with squared error loss, the risk
is not a function of θ, but it is a function of τ = F0 that typically cannot be computed
analytically. The risk can be estimated by generating B independent samples from F0 , and
computing the average of l(θ, δ(X1 )), . . . , l(θ, δ(XB )) over these samples. Thus, to evaluate the performance of decision procedures δ, we need to generate Monte Carlo samples
from various interesting distributional shapes F0 . The final example is where in a Bayesian
framework, we want the posterior distribution of a random parameter θ given the data, but
this distribution cannot be computed analytically. One fruitful approach is to generate i.i.d.
observations θ1 , . . . , θB from the posterior distribution and to use the empirical distribution
of q(θ1 ), . . . , q(θB ) to approximate the distribution of a quantity q(θ) of interest. In Section 10.4 we will introduce Markov chain Monte Carlo (MCMC) methods which are the
most commonly used procedures for approximately generating such Bayesian Monte Carlo
samples.
10.2 Three Basic Monte Carlo Methods
Having seen some examples of the potential applicability of Monte Carlo methods we now
turn to the basic problem of generating simple random samples from a distribution on Rk
which is specified either by its density p0 or some other feature. We also consider the
problem of using samples from p0 to estimate integrals involving p ≠ p0 , or to obtain samples from p ≠ p0 .
The fundamental methods discussed in this section and a number of other topics are
treated much more extensively in the classical book of Hammersley and Handscomb (1965)
and the more recent books by Ripley (1987) and Liu (2001).
10.2.1 Simple Monte Carlo
We begin by considering the problem of generating X ∼ F , for X ∈ X . If X = R then
the first basic approach is to use the probability integral transform. If F is continuous,
F −1 (u) = inf{x : F (x) ≥ u}, and U ∼ U(0, 1), then F −1 (U ) ∼ F because
$$P[F^{-1}(U) \le x] = P[U \le F(x)] = F(x)\,.$$
If F is specified analytically, this approach is natural since F −1 can always be computed,
say, by bisection. Of course, it’s best if F −1 is itself analytically described.
Example 10.2.1. Exponential and Cauchy variables. E(λ) variables can be generated as
−λ−1 log U since F (x) = 1 − exp(−λx) and 1 − U ∼ U(0, 1). Cauchy variables (see
Problem B.2.1) can be generated as tan π(U − 1/2).
✷
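A minimal Python sketch of these two inverse-CDF generators (standard library only; seed and sample sizes are illustrative):

```python
import math
import random

def exp_sample(lam, u):
    # F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -log(1-u)/lam; since
    # 1 - U ~ U(0,1) when U ~ U(0,1), -log(u)/lam works equally well.
    return -math.log(u) / lam

def cauchy_sample(u):
    # Standard Cauchy: F^{-1}(u) = tan(pi*(u - 1/2)).
    return math.tan(math.pi * (u - 0.5))

rng = random.Random(0)
exp_draws = [exp_sample(2.0, rng.random()) for _ in range(100_000)]
cauchy_draws = [cauchy_sample(rng.random()) for _ in range(100_000)]
# The E(2) mean is 1/2; the Cauchy has median 0 (its mean does not exist).
```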
This method becomes more powerful by using known distribution results.
Example 10.2.2. Box–Muller–Knuth method. We want to generate independent pairs of
N (0, 1) variables (X1 , X2 ). This may be done as follows. Let X1 = R cos A, X2 = R sin A, where (R, A) are the polar coordinates of (X1 , X2 ). Then by Theorem B.3.1, R2 has an E( 1/2 ) distribution and is independent of A, which is U(0, 2π). Hence
$$\Big(\sqrt{-2\log U_1}\,\cos(2\pi U_2)\,,\ \sqrt{-2\log U_1}\,\sin(2\pi U_2)\Big)$$
have the desired joint distribution if (U1 , U2 ) are independent U(0, 1).
✷
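A sketch of the transformation in Python (illustrative seed and sample size; standard library only):

```python
import math
import random

def box_muller(u1, u2):
    # R^2 = -2 log U1 is E(1/2); A = 2*pi*U2 is U(0, 2*pi).
    r = math.sqrt(-2.0 * math.log(u1))
    a = 2.0 * math.pi * u2
    return r * math.cos(a), r * math.sin(a)

rng = random.Random(1)
zs = []
for _ in range(50_000):
    z1, z2 = box_muller(rng.random(), rng.random())
    zs.append(z1)
    zs.append(z2)
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
# mean should be near 0 and var near 1 for N(0,1) draws.
```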
The Box–Muller–Knuth method is an example of a general principle. When a distribution F corresponds to a function of independent variables which can be generated already,
e.g., U(0, 1) or E(λ), then observations from F can be generated.
Example 10.2.3. Gamma and beta. By Theorem B.3.1, the gamma, Γ(p, λ), distribution, for p = m/2 where m is an integer, can be represented as $(2\lambda)^{-1}\sum_{j=1}^{m} Z_j^2$ where the Zj are i.i.d. N (0, 1). Thus by generating a block of m i.i.d. N (0, 1) variables we can generate a Γ(p, λ) variable. If m is even, m = 2p, we can generate the variable more simply from a block of p i.i.d. E(λ) variables using the explicit form of F −1 for exponential variables.
For positive integers m and n, set r = m/2 and s = n/2. Then we can go further
since V = X1 /(X1 + X2 ) will have a beta, β(r, s), distribution if X1 , X2 are independent
Γ(r, 1), Γ(s, 1). Thus using independent gamma variables we can generate β(r, s) variables
(Theorem B.2.3). To obtain general gamma and beta variables, see Example 10.2.6.
✷
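A Python sketch of the integer-parameter constructions (the parameter values below are illustrative; standard library only):

```python
import math
import random

rng = random.Random(2)

def gamma_int(p, lam):
    # Gamma(p, lam) for integer p: sum of p independent E(lam) variables,
    # each generated by the inverse CDF -log(U)/lam.
    return sum(-math.log(rng.random()) / lam for _ in range(p))

def beta_int(r, s):
    # beta(r, s) for integer r, s: X1/(X1 + X2) with independent
    # X1 ~ Gamma(r, 1), X2 ~ Gamma(s, 1).
    x1 = gamma_int(r, 1.0)
    x2 = gamma_int(s, 1.0)
    return x1 / (x1 + x2)

gamma_draws = [gamma_int(3, 2.0) for _ in range(50_000)]   # mean p/lam = 1.5
beta_draws = [beta_int(2, 3) for _ in range(50_000)]       # mean r/(r+s) = 0.4
```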
Example 10.2.4 Sampling from a distribution with finite support. Suppose X is finite
{x1 , . . . , xk } and we want to simulate observations of an X valued random variable X
such that P [X = xj ] ≡ pj , $\sum_{j=1}^{k} p_j = 1$. Even though the xj need not be real, we can
do this as follows. Form the partition of [0, 1] given by I1 ≡ [0, p1 ), I2 ≡ [p1 , p1 + p2 ),
. . ., Ik ≡ [p1 + . . . + pk−1 , 1]. Suppose that U has a U(0, 1) distribution. Then, it is clear
that if we define
g(u) = xj iff u ∈ Ij , j = 1, . . . , k
then g(U ) ∼ X. Since any distribution may be approximated by finite discrete distributions, this seems easy but, in fact, for selected x1 , . . . , xk , the P [X = xj ] are usually not
given analytically. This method is basic in the more difficult problem of generating a sample S ≡ {xi1 , . . ., xin } of size n from a finite population x1 , . . . , xN with probabilities of
inclusion π1 , . . ., πN , that is, as in Example 3.4.1,
P [xi ∈ S] = πi , i = 1, . . . , N.
✷

10.2.2 Importance Sampling
Suppose we wish to approximate $I(h) = \int h(x)\,p(x)\,dx$ for a specific h, when we are given, for X ∈ Rk , an analytically presented density p(x). Thus if
$$h(x) = 1(x_1 \le t_1, \ldots, x_k \le t_k)\,,$$
I(h) would be the df F (t1 , . . . , tk ), and if $h(x) = x_1^{r_1} \cdots x_k^{r_k}$, I(h) would give us moments. There is a way of generating an unbiased Monte Carlo approximation to $\int h(x)\,p(x)\,dx$ without in fact generating a sample from p, by using the method of importance sampling.
Theorem 10.2.1. Let p0 be any density from which X ∈ Rk can be easily generated.
Suppose that P is absolutely continuous with respect to P0 , that is, p0 (x) = 0 =⇒ p(x) =
0. Then, if X1 , . . . , XB are i.i.d. according to p0 ,
$$\hat{I} \equiv \frac{1}{B}\sum_{b=1}^{B} h(X_b)\,\frac{p(X_b)}{p_0(X_b)} \qquad (10.2.1)$$
is an unbiased estimate of $I \equiv \int p(x)\,h(x)\,dx$. Moreover, $\hat{I}$ has variance
$$V \equiv \frac{1}{B}\left(\int \frac{p^2(x)}{p_0(x)}\,h^2(x)\,dx - I^2\right). \qquad (10.2.2)$$
Proof. Unbiasedness follows from
$$E\left(\frac{p(X_b)}{p_0(X_b)}\,h(X_b)\right) = \int_{[p_0 > 0]} \frac{p(x)}{p_0(x)}\,h(x)\,p_0(x)\,dx = I\,.$$
The variance argument is similar.
✷
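A Python sketch of (10.2.1), with illustrative choices not taken from the text: target p = N(0, 1), proposal p0 = Laplace(0, 1) (easy to sample by inversion), and h(x) = x², so the integral being estimated is E X² = 1:

```python
import math
import random

rng = random.Random(3)

def p(x):   # target density: N(0, 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def p0(x):  # proposal density: Laplace(0, 1)
    return 0.5 * math.exp(-abs(x))

def laplace_draw():
    u = rng.uniform(-1.0, 1.0)
    return math.copysign(-math.log(abs(u)), u)

def importance_estimate(h, B):
    # I_hat = (1/B) * sum h(X_b) p(X_b)/p0(X_b), X_b i.i.d. from p0  -- (10.2.1)
    total = 0.0
    for _ in range(B):
        x = laplace_draw()
        total += h(x) * p(x) / p0(x)
    return total / B

est = importance_estimate(lambda x: x * x, 200_000)
# Unbiased for E_{N(0,1)} X^2 = 1; here the importance weights p/p0 are bounded.
```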
The idea here is that one finds a density p0 which qualitatively behaves like p but from
which simulation is easy. The extent to which the estimate is good is precisely gauged by
the size of p/p0 , since this ratio, a natural measure of qualitative similarity, governs the
variance.
The following example is given to illustrate simply the qualitative features of importance sampling. It is evidently not a situation where importance sampling would be used.
Example 10.2.5 Gaussian distributions. Suppose p0 is N (0, 1) and p is N (0, τ 2 ). Qualitatively both are symmetric densities but as τ 2 moves from 1, p becomes more and more
peaked or diffuse. Take h(x) = x for simplicity. Then we get, for B = 1,
$$V = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2\,\lambda^2 \exp\Big(-\frac{x^2}{2}\,(2\lambda^2 - 1)\Big)\,dx$$
for λ = τ −1 . Thus V = ∞ if λ2 ≤ 1/2, and, by recognizing V as a multiple of the variance of a Gaussian distribution, V = λ2 (2λ2 − 1)−3/2 if λ2 > 1/2. So if λ = 1, V = 1 as expected, while if λ → ∞, V ∼ 2−3/2 λ−1 → 0. This last result is, in retrospect, not surprising, since Varλ (X) → 0 as λ → ∞.
Note that it is not true that the best choice of p0 for all h is p0 = p but rather p0
proportional to |h|p (Problem 10.2.2). This incidentally leads to the important observation that importance sampling is directed at integrating a specific function with respect to
p rather than the general purpose of “representing” P by the empirical distribution of B
observations identically and independently distributed (approximately) as P . Some methods related to importance sampling such as antithetic variables will be considered in the
problems.
✷
10.2.3 Rejective Sampling
The rejective sampling method due to von Neumann (1951) enables us, knowing the density
p up to a constant, say p(x) = aq(x), to generate X with density p given that one can
generate observations from a density p0 with the property that
$$\sup_x \frac{p}{p_0}(x) = c < \infty\,.$$
Necessarily c > 1 unless p ≡ p0 . Let
$$\pi(x) \equiv c^{-1}\,\frac{p}{p_0}(x) = \frac{q(x)/p_0(x)}{\sup_x\,[q(x)/p_0(x)]}\,,\qquad 0 \le \pi(x) \le 1\,. \qquad (10.2.3)$$
The method is to generate i.i.d. X1 , X2 , . . . consecutively from p0 . Then consecutively
generate independent Bernoulli variables I1 , I2 , . . . such that
Prob[Ij = 1 | Xj ] = π(Xj ) .
Let τ be the first j such that Ij = 1. “Reject” all Xj , j < τ ; “accept” Xτ as an observation
from p. Moreover, let Pr denote the probability corresponding to this sampling scheme.
Then
Theorem 10.2.2. Under the given conditions,
(a) τ has a geometric marginal distribution with parameter 1/c, that is,
$$Pr[\tau = j] = \frac{1}{c}\Big(1 - \frac{1}{c}\Big)^{j-1}\,,\qquad j \ge 1\,, \qquad (10.2.4)$$
and hence E(τ ) = c.
(b) Pr[τ < ∞] = 1.
(c) Xτ has density p.
Proof. (a) and (b) hold because the Ij are independent and
$$Pr[I_j = 1] = \int \frac{p}{p_0}(x)\,\frac{1}{c}\,p_0(x)\,dx = \frac{1}{c}\,.$$
To establish (c), note that
$$\begin{aligned}
Pr(X_\tau \in A) &= \sum_{j=1}^{\infty} Pr(\tau = j)\,Pr(X_j \in A \mid \tau = j) \\
&= \sum_{j=1}^{\infty} Pr(\tau = j)\,Pr(X_j \in A \mid I_j = 1) \\
&= \sum_{j=1}^{\infty} \Big(1 - \frac{1}{c}\Big)^{j-1} \frac{1}{c} \int_A \frac{p(x)}{p_0(x)}\,p_0(x)\,dx \\
&= \int_A p(x)\,dx\,.
\end{aligned}$$
✷
Remark 10.2.1. For rejective sampling, the loss one incurs by using p0 rather than p is
naturally measured by how long it takes to generate Xτ , for instance, by the expected time
to acceptance. Using (10.2.4) we can write
$$E(\tau) = c(p, p_0) \equiv \sup_x \frac{p(x)}{p_0(x)}\,. \qquad (10.2.5)$$
We illustrate the properties of this method:
Example 10.2.6. Gaussian, gamma, and beta distributions by rejective sampling. By arguing as in Example 10.2.1, if U ∼ U(−1, 1) then V ≡ sgn(U )(− log |U |) has a Laplace (double exponential) density, p0 (x) = (1/2) e−|x| . Suppose we want to generate X ∼ N (0, 1) by rejective sampling from the Laplace distribution. Then,
$$c(p, p_0) = \sup_x \frac{p}{p_0}(x) = \sup_x \sqrt{\frac{2}{\pi}}\,\exp\Big(-\frac{x^2}{2} + |x|\Big) = \sqrt{\frac{2e}{\pi}} \approx 1.32\,.$$
Thus one obtains a Gaussian variable typically after one or two steps (E(τ ) ≈ 1.32). This is nevertheless inefficient
compared to the Box–Muller–Knuth method.
As a second example suppose we wish to generate beta(r, s) variates. Suppose r, s > 1.
If we take p0 ≡ 1 on (0, 1), we see that
$$c(p, p_0) = \frac{\Gamma(r+s)}{\Gamma(r)\,\Gamma(s)} \left(\frac{r}{r+s}\right)^{r-1} \left(\frac{s}{r+s}\right)^{s-1}.$$
It may be shown that (Problem 10.2.3), as r, s → ∞,
$$\frac{r}{r+s} \to t\,,\ 0 < t < 1\,,\qquad c(p, p_0) \sim \sqrt{\frac{r+s}{2\pi t(1-t)}}\,. \qquad (10.2.6)$$
Thus using the uniform distribution as p0 is a poor method for r and s large and doesn’t
work for r, s < 1. This is not surprising since p(0) = p(1) = 0. On the other hand, using
rejective sampling and p0 given by beta([2r]/2, [2s]/2) is reasonable (Problem 10.2.4).
An alternative is to first generate independent Γ(r, 1) and Γ(s, 1) variables and use the
method of Example 10.2.3. For the gamma distribution Γ(r, 1), 0 < r < 1, we can take
$$p_0(x) = \frac{1}{2}\big(r x^{r-1} 1(0 < x < 1) + e^{-x}\big)\,,\qquad x > 0\,, \qquad (10.2.7)$$
from which it is easy to simulate (Problem 10.2.5) and then note that by Theorem B.2.3
$$\sup_x \frac{p}{p_0}(x) \le \frac{2}{r\,\Gamma(r)} \le 2.26\,.$$
✷
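A Python sketch of rejective sampling of N(0, 1) from the Laplace proposal just described (illustrative seed and sample size; standard library only):

```python
import math
import random

rng = random.Random(4)

C = math.sqrt(2.0 * math.e / math.pi)   # c = sup p/p0, about 1.32

def laplace_draw():
    # V = sgn(U) * (-log|U|) with U ~ U(-1, 1) is Laplace(0, 1).
    u = rng.uniform(-1.0, 1.0)
    return math.copysign(-math.log(abs(u)), u)

def accept_prob(x):
    # pi(x) = (1/c) * p(x)/p0(x), which lies in [0, 1] by the choice of c.
    p = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    p0 = 0.5 * math.exp(-abs(x))
    return p / (C * p0)

def normal_by_rejection():
    trials = 0
    while True:                 # tau = number of proposals until acceptance
        trials += 1
        x = laplace_draw()
        if rng.random() < accept_prob(x):
            return x, trials

pairs = [normal_by_rejection() for _ in range(50_000)]
samples = [x for x, _ in pairs]
waits = [t for _, t in pairs]
# E(tau) = c, so the average wait should be close to 1.32.
```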
This method is dependent on being able to find a p0 from which it is easy to generate
and for which c(p, p0 ) is close to one. This issue becomes acute when we need to generate
a large number of i.i.d. variables having the desired density p, as we do, for instance, in
Bayesian inference. The most general way of attacking this problem is through the Markov
Chain Monte Carlo (MCMC) methods discussed in Section 10.4. In the next section we
study a statistically very important example of the use of Monte Carlo methods where
generation of the variables is simple.
Summary. We considered first the simple Monte Carlo method where desired samples
from a given density p0 can be obtained by using transformations of basic variables. Thus
normal variables can be obtained as transformations of uniform variables, and gamma, beta,
t, F , and multivariate normal variables can be obtained from independent normal variables using Sections B.3, B.4, and B.6. Next we considered importance sampling where
a sample from P0 can be used to estimate integrals of the form EP h(X) when P is absolutely continuous with respect to P0 . Finally, we considered rejective sampling where
variables with density p0 can be used to generate a variable with density p ≠ p0 , provided c = supx [p(x)/p0 (x)] < ∞. Rejective sampling is useful when p is of the form
p(x) = aq(x) with q known and a unknown; and when c is close to one.
10.3 The Bootstrap
The “bootstrap,” formally introduced by Efron (1979) (see also Efron and Tibshirani (1993)
and Shao and Tu (1995)), is the first broad application of Monte Carlo methods to inference.
The three classical problems to which he applied the method remain excellent illustrations
of the idea. The context throughout this section is simple random sampling, X1 , . . . , Xn
i.i.d. as X ∼ P . To avoid confusion, in this section we shall write P̂n for the empirical probability P̂ to indicate its dependence on n.
10.3.1 Bootstrap Samples and Bias Corrections
We are given a parameter µ(P ) defined for all P ∈ P and its plug-in estimate µ̂n ≡ µ(P̂n ), where we assume that P is large enough so that all P with finite support, such as P̂n , are included. We wish to estimate the bias of µ̂n ,
$$\mathrm{BIAS}_n(P) \equiv E_P\,\hat{\mu}_n - \mu(P)\,.$$
If we assume for simplicity that µ(P ) is bounded, BIASn (P ) is itself a parameter defined
for all P ∈ P. However, it has three special features. In general,
a) It depends on n.
b) It is computable from knowledge of P only through a multiple integral. If we write µ̂n as µn (X1 , . . . , Xn ) for some known function µn , then
$$E_P(\hat{\mu}_n) = \int \cdots \int \mu_n(x_1, \ldots, x_n)\,dP(x_1) \cdots dP(x_n)\,.$$
c) We expect BIASn (P ) to tend to 0 as n → ∞, but we expect n BIASn (P ) to tend to some limit B(P ) ≠ 0.
We have encountered this type of estimation problem for the special case
$$\mu(P) = h\Big(\int g_1(x)\,dP(x), \ldots, \int g_d(x)\,dP(x)\Big) \qquad (10.3.1)$$
in Theorem 5.3.2 and, in the general case, in Chapter 7. We applied the delta method for
Euclidean and infinite dimensional parameters, respectively, to obtain approximations to
BIASn (P ) of the form B(P )/n. We can then reasonably estimate BIASn (P ) by B(P̂n )/n.
The difficulty with this program as we have discussed earlier is that even for d = 1 and
certainly for d > 1, the approximation B(P )/n is both tedious to compute analytically and
may give little insight.
The alternative proposed by Efron is to estimate the parameter BIASn (P ) by the empirical plug-in method to obtain BIASn (P̂n ). Save for the dependence of the parameter on n, this is the plug-in idea discussed in Chapter 2. Unfortunately, computing the first term in
$$\mathrm{BIAS}_n(\hat{P}_n) = \int \mu_n(x_1, \ldots, x_n)\,d\hat{P}_n(x_1) \cdots d\hat{P}_n(x_n) - \mu(\hat{P}_n) = \frac{1}{n^n} \sum_{1 \le i_1, \ldots, i_n \le n} \mu_n(X_{i_1}, \ldots, X_{i_n}) - \mu(\hat{P}_n)$$
is, in general, infeasible in practice. However, we can apply Monte Carlo “resampling” as follows.
Bootstrap Estimation of Bias
a) Monte Carlo Step. Generate B i.i.d. samples of size n from P̂n ; call them
$$\mathbf{X}^*_b = (X^*_{b1}, \ldots, X^*_{bn})\,,\qquad b = 1, \ldots, B\,.$$
That is, given X1 , . . . , Xn , the X*b1 , . . . , X*bn are i.i.d. as X* with distribution P* = P̂n ,
$$P^*[X^* = X_j] = \frac{1}{n}\,,\qquad j = 1, \ldots, n\,.$$
Here, the superscript * is used to denote conditioning on the data X = (X1 , . . . , Xn ). More informally, for fixed b, each X*bj is obtained by sampling with replacement from the population {X1 , . . . , Xn }. Let P̂*nb denote the empirical distribution of the bth bootstrap sample X*b = (X*b1 , . . . , X*bn ).
b) Estimation Step. Estimate BIASn (P̂n ) by
$$\mathrm{BIAS}^{(B)}_n(\hat{P}_n) \equiv \frac{1}{B}\sum_{b=1}^{B} \mu(\hat{P}^*_{nb}) - \mu(\hat{P}_n) = \frac{1}{B}\sum_{b=1}^{B} \mu_n(X^*_{b1}, \ldots, X^*_{bn}) - \mu(\hat{P}_n)\,.$$
Remark 10.3.1
a) Note that B is at our disposal and may be chosen as large as we wish. In particular, formally, B = ∞ yields BIASn (P̂n ) by the law of large numbers applied to the i.i.d. variables µ(P̂*nb ), b = 1, . . . , B.
b) This procedure is nothing else than Monte Carlo approximation of the integral of µn (x1 , . . . , xn ) with respect to the product measure P̂n × · · · × P̂n .
c) Even if B = ∞, we have no guarantee that the plug-in estimate BIASn (P̂n ) is good. ✷
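The two steps can be sketched in Python; the statistic below (the plug-in variance, whose exact B = ∞ bootstrap bias is −σ²(P̂n)/n) and the sample sizes are illustrative choices:

```python
import random

rng = random.Random(5)

def plugin_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bootstrap_bias(xs, stat, B):
    # Monte Carlo step: B samples of size n drawn with replacement from the data;
    # estimation step: average of stat over bootstrap samples minus stat(data).
    n = len(xs)
    boot = [stat([xs[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    return sum(boot) / B - stat(xs)

data = [rng.gauss(0.0, 1.0) for _ in range(50)]
bias_hat = bootstrap_bias(data, plugin_var, 8000)
exact = -plugin_var(data) / len(data)   # the B = infinity value for this statistic
```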
Given a biased estimate θ̂n = θ(P̂n ) of a parameter θ(P ), we may wish to try to eliminate or reduce its bias (see Section 3.4.1). A natural approach is to use the empirically bias corrected estimate
$$\tilde{\theta}^*_n \equiv \hat{\theta}_n - \mathrm{BIAS}_n(\hat{P}_n)$$
or the bootstrap bias corrected estimate
$$\hat{\theta}_{nB} = \hat{\theta}_n - \mathrm{BIAS}^{(B)}_n(\hat{P}_n)\,.$$
We illustrate these issues with an example, in which exact calculation is clearly superior
in one special case but, in another special case of much more common type, the bootstrap
is much simpler to apply.
Example 10.3.1. Estimating the bias of sample moments. To avoid technicalities, suppose P ∈ M = {all probabilities on [0, 1] with 4 finite moments}. Suppose µ(P ) = ∫ x dP , the population mean. Then µ̂n = µ(P̂n ) = X̄. Now
$$E_P\,\hat{\mu}_n = E_P\,\bar{X} = E_P\,X = \mu(P)\,.$$
Therefore BIASn (P ) = 0 = BIASn (P̂n ), but $\mathrm{BIAS}^{(B)}_n(\hat{P}_n) = B^{-1}\sum_{b=1}^{B} \bar{X}^*_b - \bar{X}$. Then,
$$E^*\,\mathrm{BIAS}^{(B)}_n(\hat{P}_n) = 0\,,\qquad \mathrm{Var}^*\,\mathrm{BIAS}^{(B)}_n(\hat{P}_n) = (nB)^{-1}\,n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = O_P\big((nB)^{-1}\big)$$
where E* and Var* denote expected value and variance for the conditional probability P* = L(X*|X) defined in the Monte Carlo step. Note that P*, E*, and Var* integrate out the Monte Carlo randomness that has been introduced.
Turn now to estimating the bias of σ 2 (P̂n ), where σ 2 (P ) ≡ VarP (X). Here,
$$\sigma^2(\hat{P}_n) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2\,.$$
From (3.4.2),
$$E_P\,\sigma^2(\hat{P}_n) = \frac{n-1}{n}\,\sigma^2(P)\,,\qquad \mathrm{BIAS}_n(P) = -\frac{\sigma^2(P)}{n}\,.$$
Now,
$$\mathrm{BIAS}_n(\hat{P}_n) = -\frac{1}{n^2}\sum_{i=1}^{n}(X_i - \bar{X})^2\,.$$
Thus, the empirically bias corrected estimate is
$$\sigma^2(\hat{P}_n) - \mathrm{BIAS}_n(\hat{P}_n) = \frac{n+1}{n^2}\sum_{i=1}^{n}(X_i - \bar{X})^2\,, \qquad (10.3.2)$$
whose bias is −σ 2 (P )/n2 . The bootstrap bias corrected estimate
$$\hat{\sigma}^2_{nB}(\hat{P}_n) = \sigma^2(\hat{P}_n) - \mathrm{BIAS}^{(B)}_n(\hat{P}_n) = \frac{n-1}{n^2}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
agrees to order n−1 with the usual unbiased estimate $(n-1)^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ (Problem 10.3.3).
Going further, let
$$K_3(P) = E_P\,(X - \mu(P))^3$$
be estimated by $K_3(\hat{P}_n) = n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})^3$. Here
$$E_P\,K_3(\hat{P}_n) = K_3(P)\Big(1 - \frac{1}{n^2}\Big)\,.$$
Hence BIASn (P ) = −n−2 K3 (P ) and again K3 (P̂n ) − BIASn (P̂n ) is unbiased. However, consider the numerator of the kurtosis (see A.11.10),
$$K_4(P) = E_P\,X^4 - 3\big(E_P\,X^2\big)^2\,. \qquad (10.3.3)$$
Now (see Problem 10.3.4)
$$E_P\,K_4(\hat{P}_n) = K_4(P) + \Big(\frac{a_1}{n} + \frac{a_2}{n^2} + \frac{a_3}{n^3}\Big)K_4(P) + \Big(\frac{b_1}{n} + \frac{b_2}{n^2} + \frac{b_3}{n^3}\Big)\sigma^4(P) \qquad (10.3.4)$$
for appropriate constants aj , bj , j = 1, 2, 3. Here, K4 (P̂n ) − BIASn (P̂n ) is biased. However, while
$$E\,K_4(\hat{P}_n) = K_4(P) + O\Big(\frac{1}{n}\Big)\,,\qquad E\big(K_4(\hat{P}_n) - \mathrm{BIAS}_n(\hat{P}_n)\big) = K_4(P) + O_P\Big(\frac{1}{n^2}\Big)\,. \qquad (10.3.5)$$
Note that the example illustrates that using BIAS(B)n is easier than going through the exact calculation of BIASn (P̂n ). Moreover, computing B(P )/n by the delta method for this last example is almost as tedious as computing EP K4 (P̂n ), and using the approximation BIAS(B)n (P̂n ) is easier. ✷
A natural question to ask is how large does B have to be so that (10.3.5) holds when BIASn (P̂n ) is replaced by BIAS(B)n (P̂n ) or, more generally, when is
$$E\big(\theta(\hat{P}_n) - \mathrm{BIAS}^{(B)}_n(\hat{P}_n)\big) = \theta(P) + O_P\Big(\frac{1}{n^2}\Big)\,?$$
Calculations that answer these questions are in Hall (1986). When θ(P ) is a smooth function of means as in Theorem 5.3.2,
$$E^*\,\mathrm{BIAS}^{(B)}_n(\hat{P}_n) = \mathrm{BIAS}_n(\hat{P}_n) \qquad (10.3.6)$$
$$\mathrm{Var}^*\,\mathrm{BIAS}^{(B)}_n(\hat{P}_n) = O_P(n^{-1}B^{-1})\,. \qquad (10.3.7)$$
Here we again use Efron's convenient * notation for the conditional distribution L(X*|X) of observations resampled from the data X = (X1 , . . . , Xn ). Recall that P*, E*, etc. integrate out the Monte Carlo randomness we have introduced.
Since, under the conditions of Corollary 5.3.1,
$$\mathrm{BIAS}_n(\hat{P}_n) = \mathrm{BIAS}_n(P) + O_P(n^{-3/2})\,,$$
Hall's and our conclusion is that for BIAS(B)n (P̂n ) to be asymptotically equivalent to BIASn (P ) to order OP (n−3/2 ), we need to take B ≍ n2 , where an ≍ bn as usual denotes an = O(bn ) and bn = O(an ). The proofs of these results are sketched in Problem 10.3.1.
In bias correction the bootstrap makes an asymptotically appropriate higher order correction. Of greater importance is the use of the bootstrap in estimation of variances of
complex estimates and in setting confidence bands and regions, which we turn to in the
next section.
10.3.2 Bootstrap Variance and Confidence Bounds
The standard formula for an asymptotic level 1 − α upper confidence bound (UCB) for a parameter θ on the basis of the MLE θ̂n in a regular one-dimensional parametric model with Fisher information I(θ) is given in Section 5.4.5 and is of the form
$$\hat{\theta}_n + \frac{z_{1-\alpha}}{\sqrt{n\,I(\hat{\theta}_n)}}\,.$$
More generally, if we have a regular (as defined in Section 9.3) asymptotically normal estimate µ̂n of a parameter µ(P ) such that
$$\mathcal{L}_P\big(\sqrt{n}\,(\hat{\mu}_n - \mu(P))\big) \to N(0, \sigma^2(P))$$
for all P ∈ P and σ̂n is a (locally uniformly) consistent estimate of σ(P ) in P, then µ̂n + z1−α σ̂n /√n is a natural asymptotic level (1 − α) UCB. Thus, for instance, if P = {Pθ : θ ∈ Rd } and the MLE θ̂n = (θ̂n1 , . . . , θ̂nd )T behaves regularly, then, letting kI ij k denote the inverse of the information matrix I of Section 6.2.2,
$$\hat{\theta}_{n1} + z_{1-\alpha}\,\sqrt{\frac{I^{11}(\hat{\theta}_n)}{n}} \qquad (10.3.8)$$
is an asymptotic level 1 − α UCB in P. As we have noted in the past, construction of σ̂n and computation of I 11 (θ̂n ) is usually far from trivial. The bootstrap gives what appears to be an “automatic” solution.
Suppose that the standard deviation of µ̂n , SDn (P ), exists and normalizes µ̂n properly,
$$\mathcal{L}_P\left(\frac{\hat{\mu}_n - \mu(P)}{\mathrm{SD}_n(P)}\right) \to N(0, 1) \qquad (10.3.9)$$
and
$$\sqrt{n}\,\mathrm{SD}_n(P) \to \sigma(P)\,. \qquad (10.3.10)$$
Then it is natural to try SDn (P̂n ) as a consistent estimate of SDn (P ), in the expectation that
$$\mathrm{SD}_n(\hat{P}_n) - \mathrm{SD}_n(P) = O_P(n^{-1})\,. \qquad (10.3.11)$$
The resulting proposed confidence bound would, by the usual inversion argument, be
$$\hat{\mu}_n + z_{1-\alpha}\,\mathrm{SD}_n(\hat{P}_n)\,.$$
Calculation of SDn is again a problem because
$$\mathrm{SD}^2_n(P) = \int \cdots \int \hat{\mu}^2_n(x_1, \ldots, x_n)\,dP(x_1) \cdots dP(x_n) - \left(\int \cdots \int \hat{\mu}_n(x_1, \ldots, x_n)\,dP(x_1) \cdots dP(x_n)\right)^2.$$
However, getting a bootstrap approximation is simple.
Bootstrap Estimation of Variance.
(i) Let {X*bj : 1 ≤ b ≤ B, 1 ≤ j ≤ n} be bootstrap samples generated as in the Monte Carlo step of the estimation of BIASn (P ) and compute µ̂n (X*b1 , . . . , X*bn ) ≡ µ̂*nb and $\hat{\mu}^*_{n\cdot} = B^{-1}\sum_{b=1}^{B}\hat{\mu}^*_{nb}$.
(ii) Estimate SDn (P ) by
$$\widehat{\mathrm{SD}}^{(B)}_n \equiv \left(\frac{1}{B}\sum_{b=1}^{B}\big(\hat{\mu}^*_{nb} - \hat{\mu}^*_{n\cdot}\big)^2\right)^{1/2}.$$
Again, if B = ∞, $\widehat{\mathrm{SD}}^{(B)}_n = \mathrm{SD}_n(\hat{P}_n)$.
Example 10.3.2. Estimation of the variance of the median and trimmed means. Let Mn ≡ Median(X1 , . . . , Xn ). We have seen (Problem 5.4.1) that if X has density f = F ′ and if θ(F ) ≡ F −1 ( 1/2 ) with f (θ) > 0, then
$$\mathcal{L}_F\Big(\sqrt{n}\,\big(M_n - F^{-1}(\tfrac{1}{2})\big)\Big) \to N\Big(0,\ \frac{1}{4 f^2(\theta)}\Big)\,.$$
Consistent estimation of f , and thus of σ(P ) = 1/(2f (θ)), is possible in this case; see Chapter 11. So also is estimation of σ(P ) by plugging P̂n into the formulae (5.1.2) and (5.1.3). Moreover, application of the bootstrap is easy. Similarly for the trimmed means X̄α given by (3.5.3), we have in Example 7.2.6 the result
$$\sqrt{n}\,\big(\bar{X}_\alpha - \mu_\alpha(P)\big) \to N\big(0, \sigma^2_\alpha(P)\big)$$
where X̄α = µα (P̂n ),
$$\mu_\alpha(P) = \frac{1}{1-2\alpha}\int_{x_\alpha}^{x_{1-\alpha}} x\,dP(x)\,,\qquad \sigma^2_\alpha(P) = \mathrm{Var}_P\,\psi_\alpha(X)$$
with ψα given by (7.2.42). Again σα2 (P̂n ) is easily computed. However, while there is no particularly attractive formula to plug into for SD2n (P̂n ), the bootstrap is easy to apply in the estimation of variance step above. ✷
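A Python sketch of the variance-estimation step for the median (seed, sample size, and the N(0, 1) data-generating distribution are illustrative):

```python
import random

rng = random.Random(6)

def median(xs):
    ys = sorted(xs)
    n = len(ys)
    return ys[n // 2] if n % 2 else 0.5 * (ys[n // 2 - 1] + ys[n // 2])

def bootstrap_sd(xs, stat, B):
    # Step (i): B bootstrap samples; step (ii): root mean squared deviation
    # of the statistic around its bootstrap average.
    n = len(xs)
    vals = [stat([xs[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    mean = sum(vals) / B
    return (sum((v - mean) ** 2 for v in vals) / B) ** 0.5

data = [rng.gauss(0.0, 1.0) for _ in range(200)]
sd_hat = bootstrap_sd(data, median, 2000)
# The asymptotic SD of the median here is roughly 1/(2 f(0) sqrt(n)) ~ 0.089.
```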
There are a number of questions we need to ask:
(i) For estimation of variances per se, when does plugging in P̂n yield a consistent estimate?
(ii) When is SDn (P ) an appropriate normalization to yield asymptotic normality for (µ̂n − µ(P ))?
(iii) How big does B need to be for SD̂(B)n to approximate SDn (P ) to the appropriate order oP (n−1/2 )?
(iv) What is gained by using the bootstrap in cases where an alternative approximation exists?
It turns out that for the median and trimmed means, the answers to (i) and (ii) are affirmative under the natural conditions considered in Example 7.2.6. We summarize the affirmative answers to the first three questions for the simple case where µ̂n is a smooth function of a vector mean. Its proof is left to Problem 10.3.1.
Theorem 10.3.1. Under the conditions of Corollary 5.3.2 (a) and (b),
$$\widehat{\mathrm{SD}}_n = \mathrm{SD}_n(\hat{P}_n) + O_P(n^{-1/2} B^{-1/2})\,.$$
Since, under the same conditions,
$$\mathrm{SD}_n(\hat{P}_n) = \mathrm{SD}_n(P) + O_P(n^{-1})\,,$$
B ≍ n will give total errors of the same order as the error in SDn (P̂n ).
The answer to question (iv), for bias and variance estimation, is unclear in terms of statistical performance. However, the computational simplification can be striking.
The Jackknife
The jackknife due to Quenouille (1949) and Tukey (1958) preceded the bootstrap as a
“sampling” approach to estimating biases and variances of complex statistics. It is closely
related to the sensitivity curve discussed in Section 3.5.3 (Volume I).
Given an i.i.d. sample X(n) ≡ {X1 , . . . , Xn } with X ∼ F , a parameter θ(F ) and a
sequence of estimates θ̂(m) (X1 , . . . , Xm ), 1 ≤ m ≤ n, of θ(F ), we want to estimate the bias
$$B_n(F) \equiv E_F\,\hat{\theta}_{(n)} - \theta(F)$$
and the variance
$$\sigma^2_n(F) \equiv \mathrm{Var}_F\,\hat{\theta}_{(n)}\,.$$
The idea is to use the estimates θ̂(n−1) (X−i ) ≡ θ̂(i) obtained by evaluating θ̂(n−1) at the
subsamples
X−i ≡ {X1 , . . . , Xi−1 , Xi+1 , . . . , Xn }
of size n − 1 obtained by deleting one Xi at a time, 1 ≤ i ≤ n. The estimates of bias and variance are
$$\hat{B}_n = (n-1)\big(\hat{\theta}_{(\cdot)} - \hat{\theta}\big)\,,\qquad \hat{\sigma}^2_n = \frac{n-1}{n}\sum_{i=1}^{n}\big(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\big)^2\,,\qquad \hat{\theta}_{(\cdot)} = n^{-1}\sum_{i=1}^{n}\hat{\theta}_{(i)}\,.$$
The relation of the jackknife idea to the sensitivity curve defined in Section 3.5 and Problem 7.2.2 can be seen from
$$SC(\mathbf{X}_{(n)}, \hat{\theta}_{(n)}) = n\big(\hat{\theta}_{(n)} - \hat{\theta}_{(n-1)}\big)\,.$$
Since, if θ(F ) is smooth in F , the sensitivity curve is a crude approximation to the influence function (Problem 7.2.2), we can write
$$\hat{\sigma}^2_n \approx n^{-2}\sum_{i=1}^{n}\big(\psi(X_i, F) - \bar{\psi}\big)^2 \approx n^{-1}\int \psi^2(x, F)\,dF(x) \qquad (10.3.12)$$
where ψ is the influence function of θ̂(n) as defined in Section 7.2 and
$$\bar{\psi} = n^{-1}\sum_{i=1}^{n}\psi(X_i, F) \approx 0\,.$$
This approximation to σ̂n2 is exactly correct if θ(F ) is linear in F so that θ̂(n) is an average and, as we have seen in Problem 7.2.2, approximately correct generally. B̂n can also be justified using higher order asymptotics. For more on the jackknife, see Efron and Tibshirani (1993), Chapter 11.
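A Python sketch of the leave-one-out recipe (the statistic and data below are illustrative; the variance uses the standard (n − 1)/n jackknife factor, under which the jackknife variance of the sample mean reduces exactly to s²/n):

```python
import random

rng = random.Random(7)

def jackknife(xs, stat):
    # Leave-one-out estimates theta_(i), their average theta_(.), and then
    # B_hat = (n-1)(theta_(.) - theta_hat),
    # var_hat = ((n-1)/n) * sum_i (theta_(i) - theta_(.))^2.
    n = len(xs)
    theta_hat = stat(xs)
    loo = [stat(xs[:i] + xs[i + 1:]) for i in range(n)]
    theta_bar = sum(loo) / n
    bias_hat = (n - 1) * (theta_bar - theta_hat)
    var_hat = (n - 1) / n * sum((t - theta_bar) ** 2 for t in loo)
    return bias_hat, var_hat

data = [rng.gauss(0.0, 1.0) for _ in range(30)]
bias_hat, var_hat = jackknife(data, lambda xs: sum(xs) / len(xs))
# For the sample mean: bias_hat is exactly 0 and var_hat equals s^2/n.
```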
10.3.3 The General i.i.d. Nonparametric Bootstrap
Efron (1979)(1) proposed a general bootstrap principle along the following lines: Let P
denote a nonparametric class of probabilities on a set X . We assume that P contains all
discrete distributions on X . Let X = (X1 , . . . , Xn ) denote i.i.d. observations from P ∈ P.
We are interested in a functional Tn (X, P ) which is invariant under permutations π(X) of
the elements of X, that is Tn (π(X), P ) = Tn (X, P ) for each π(X) = (Xi1 , . . . , Xin )
with ij ≠ ik for j ≠ k. By sufficiency of the empirical distribution, this restriction has
no force as long as we are only considering i.i.d. observations. In this case we also write Tn (P̂n , P ) for Tn (X, P ), where we have identified P̂n with X. The parameter λn (P ) we are interested in is calculable from the distribution of Tn (P̂n , P ). That is, for some map θ from {LP (Tn (P̂n , P )) : P ∈ P} to some set Θ, we can write
$$\lambda_n(P) = \theta\big(\mathcal{L}_P(T_n(\hat{P}_n, P))\big)\,. \qquad (10.3.13)$$
Then, the plug-in principle leads to the estimate
$$\lambda_n(\hat{P}_n) = \theta\big(\mathcal{L}^*(T_n(\hat{P}^*_n, \hat{P}_n))\big)$$
of λn (P ), where
$$T_n(\hat{P}^*_n, \hat{P}_n) = T_n(X^*_1, \ldots, X^*_n, \hat{P}_n)$$
and * as usual denotes that we are dealing with an i.i.d. sample X*1 , . . . , X*n from P̂n . We will use bootstrap samples as follows to approximate L* and λn (P ):
a) Approximating L*(Tn (P̂*n , P̂n )). Generate bootstrap samples X*b = (X*b1 , . . . , X*bn ), b = 1, . . . , B, as in the Monte Carlo Step of the bootstrap bias and variance approximation. Then compute {T*nb , 1 ≤ b ≤ B} with T*nb = Tn (X*b1 , . . . , X*bn , P̂n ). Let δx be pointmass at x. Approximate L*(Tn (P̂*n , P̂n )) by $\mathcal{L}^*_B \equiv B^{-1}\sum_{b=1}^{B}\delta_{T^*_{nb}}$, the empirical distribution of {T*nb , 1 ≤ b ≤ B}.
b) Bootstrap Estimation of λn (P ). The estimate λ(B)n (P̂n ) of λn (P ) is given by θ(L*B ).
It is easy to see that estimation of the bias BIASn (P ) corresponds to taking
$$T_n(\hat{P}_n, P) = \hat{\mu}_n - \mu(P)\,,\qquad \theta(F_T) = \int z\,dF_T(z)\,,$$
where FT is the df of Tn (P̂n , P ). This gives
$$\theta\big(\mathcal{L}_P(T_n(\hat{P}_n, P))\big) = E_P(\hat{\mu}_n) - \mu(P)\,.$$
With the same choice of Tn (P̂n , P ), estimation of SDn (P ) corresponds to
$$\theta(F_T) = \left(\int z^2\,dF_T(z) - \Big(\int z\,dF_T(z)\Big)^2\right)^{1/2}.$$
This formulation suggests an entirely different approach to setting confidence bounds
and confidence regions.
Consider again setting a confidence bound on µ(P ) using µ̂n . In Examples 4.4.1 and 4.4.2, to obtain confidence bounds for the mean and variance of a normal distribution, we used the existence of pivots, functions T (µ̂n , µ(P )) whose distribution did not depend on P , with appropriate monotonicity properties. In fact, our general construction of asymptotic UCBs for µ(P ) is based on an asymptotic pivot of the form $\sqrt{n}\,(\hat{\mu}_n - \mu(P))/\hat{\sigma}_n$, which tends to N (0, 1) for all P ∈ P.
Consider µ̂n − µ(P ) as a potential pivot. Essentially only if µ(P ) is a translation parameter of a translation parameter family {F (x − µ) : µ ∈ R} is µ̂n − µ(P ) in fact a pivot. However, note that if cnα (P ) is the unknown α quantile of the distribution of µ̂n − µ(P ), then
$$P[\hat{\mu}_n - \mu(P) \ge c_{n\alpha}(P)] = P[\mu(P) \le \hat{\mu}_n - c_{n\alpha}(P)] = 1 - \alpha\,.$$
Thus estimating the parameter cnα (P ) to appropriate accuracy by ĉn should yield µ̂n − ĉn as an asymptotic (1 − α) UCB. The bootstrap yields an estimate ĉ(B)n as follows: Let Tn (P̂n , P ) = µ̂n − µ(P ), and let λn (P ) be the α quantile of L(µ̂n − µ(P )). The concrete application of the bootstrap is, having generated T*nb = µ̂n (X*b ) − µ(P̂n ), 1 ≤ b ≤ B, then the bootstrap approximation to cnα (P ) is the αth quantile of the empirical distribution of {T*nb ; 1 ≤ b ≤ B}, that is,
$$\hat{c}^{(B)}_n = T^*_{n([B\alpha])}$$
where T*n(1) ≤ . . . ≤ T*n(B) are the ordered {T*nb } and [ ] denotes the greatest integer function. If we specialize to µ̂n = µn (P̂n ), then
$$\hat{c}^{(B)}_n = \hat{\mu}^*_{n([B\alpha])} - \hat{\mu}_n$$
where µ̂*n(1) ≤ . . . ≤ µ̂*n(B) are the ordered {µ̂*nb }. Thus, µ̂*n([Bα]) is the Monte Carlo estimate of the α quantile of the bootstrap distribution of µ̂n . We thus obtain as a potential asymptotic (1 − α) UCB,
$$\hat{\mu}_n - \big(\hat{\mu}^*_{n([B\alpha])} - \hat{\mu}_n\big) = 2\hat{\mu}_n - \hat{\mu}^*_{n([B\alpha])}\,. \qquad (10.3.14)$$
This is not Efron's so-called percentile method. Efron proposed the UCB
$$\hat{\mu}^*_{n([B(1-\alpha)]+1)}.$$
It may be shown (Problem 10.3.6) that if $\sqrt{n}(\hat{\mu}_n - \mu(P))$ tends to a Gaussian distribution and bootstrap quantiles converge to population quantiles at the $n^{-\frac{1}{2}}$ rate, then Efron's method with $B = \infty$ yields an asymptotic $(1-\alpha)$ UCB. Efron's motivation for his proposal here and for subsequent refinements is that $\hat{\mu}^*_{n([B(1-\alpha)]+1)}$ is equivariant under monotone transformations. That is, if we use it as an asymptotic level $(1-\alpha)$ UCB for $\mu(P)$, then $g(\hat{\mu}^*_{n([B(1-\alpha)]+1)})$ is the corresponding asymptotic level $(1-\alpha)$ UCB for $g(\mu(P))$ if $g$ is increasing. For more on these topics, we refer to Efron and Tibshirani (1993) and Hall (1997).
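The two proposals can be put side by side in a short simulation. The following is an illustrative sketch, not from the text: the population (Exponential(1), so $\mu(P) = 1$), the sample size, $B$, and the seed are arbitrary choices, and the 0-based array indexing follows the $[B\alpha]$ order-statistic convention of the displays above.

```python
import random

random.seed(3)

# Hypothetical setup: n i.i.d. Exponential(1) observations, mu(P) = E_P X = 1,
# and mu_hat = X-bar, the plug-in estimate.
n, B, alpha = 200, 2000, 0.05
x = [random.expovariate(1.0) for _ in range(n)]
mu_hat = sum(x) / n

# B bootstrap sample means mu*_nb, each based on n draws with replacement.
boot_means = sorted(
    sum(random.choice(x) for _ in range(n)) / n for _ in range(B)
)

# UCB (10.3.14): mu_hat - (mu*_([B alpha]) - mu_hat) = 2 mu_hat - mu*_([B alpha]).
k_low = max(int(B * alpha) - 1, 0)          # 0-based index of the [B*alpha]th order statistic
ucb_hybrid = 2 * mu_hat - boot_means[k_low]

# Efron's percentile UCB: the ([B(1-alpha)] + 1)th order statistic of the mu*'s.
k_high = min(int(B * (1 - alpha)), B - 1)
ucb_percentile = boot_means[k_high]
```

For roughly symmetric bootstrap distributions the two bounds nearly coincide; they differ when the distribution of $\hat{\mu}^*_n$ is skewed about $\hat{\mu}_n$.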
There is nothing special about the proposed pivot $\hat{\mu}_n - \mu(P)$. If we have an estimate of the asymptotic variance of $\hat{\mu}_n$, call it $\hat{\sigma}_n^2$, and
$$\mathcal{L}\Big(\sqrt{n}\,\frac{\hat{\mu}_n - \mu(P)}{\hat{\sigma}_n}\Big) \to N(0, 1),$$
then it is natural to apply the above process to
$$T_n(\hat{P}_n, P) = \sqrt{n}\,\frac{\hat{\mu}_n - \mu(P)}{\hat{\sigma}_n}$$
230  Monte Carlo Methods  Chapter 10
to obtain a bootstrap estimate, say $\hat{d}^{(B)}_{n\alpha}$, using $T^*_{nb} = \sqrt{n}(\hat{\mu}^*_{nb} - \hat{\mu}_n)/\hat{\sigma}^*_{nb}$, where $\hat{\sigma}^*_{nb}$ is $\hat{\sigma}_n(\mathbf{X}^*_b)$, and invert to get $\hat{\mu}_n - \hat{\sigma}_n \hat{d}^{(B)}_{n\alpha}/\sqrt{n}$ as an asymptotic UCB. We illustrate this process concretely in a classical example.
Example 10.3.3. Bootstrap t. Suppose $\mathcal{P} = \{P : 0 < E_P X^2 < \infty\}$ and we want to put confidence bounds on $\mu(P) = E_P X$. If we restrict $\mathcal{P}$ to Gaussian distributions, we are naturally (see Example 4.4.1 and Section 4.9.2) led to the $t$ statistic defined, for $s$ given in (10.3.2), by
$$T_n = \sqrt{n}\,\frac{\bar{X} - \mu}{s}$$
and the level $(1-\alpha)$ UCB of Example 4.4.1, $\bar{X} + t_{n-1}(1-\alpha)s/\sqrt{n}$, where $t_{n-1}(1-\alpha)$ is the $1-\alpha$ quantile of the $\mathcal{T}_{n-1}$ distribution. If $P$ is not Gaussian but is unknown, the distribution of $T_n$ is also unknown and it is natural to use the bootstrap. Form the bootstrap samples $(X^*_{b1}, \ldots, X^*_{bn})$, the corresponding $\bar{X}^*_b$ and $s^*_b$, $T^*_{nb} = \sqrt{n}(\bar{X}^*_b - \bar{X})/s^*_b$, $b = 1, \ldots, B$, and the lower $\alpha$ quantile of the bootstrap distribution, $T^*_{n([B\alpha])} \equiv \hat{d}^{(B)}_{n\alpha}$. The level $(1-\alpha)$ bootstrap UCB is
$$\bar{X} - \hat{d}^{(B)}_{n\alpha}\frac{s}{\sqrt{n}}.$$
Applying the same principle to the pivot $\sqrt{n}(\bar{X} - \mu)$, we arrive at the UCB
$$2\bar{X} - \bar{X}^*_{([B\alpha])},$$
where $\bar{X}^*_{(1)} \le \ldots \le \bar{X}^*_{(B)}$ are the ordered bootstrap sample means. Does either of these work in the sense of giving asymptotically correct coverage probability? Is one preferable to the other on theoretical grounds? We address questions such as these now. ✷
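The two bounds of the example can be sketched in a few lines of simulation code. This is illustrative rather than the text's algorithm verbatim: the skewed population, $n$, $B$, and the seed are arbitrary choices, and $s$ is taken here to be the usual sample standard deviation (an assumption about (10.3.2)).

```python
import random

random.seed(7)

n, B, alpha = 100, 1999, 0.05
x = [random.expovariate(1.0) for _ in range(n)]  # hypothetical skewed data, mu(P) = 1

def mean_and_s(v):
    m = sum(v) / len(v)
    s2 = sum((vi - m) ** 2 for vi in v) / (len(v) - 1)  # sample sd (an assumption)
    return m, s2 ** 0.5

xbar, s = mean_and_s(x)

# Bootstrap t statistics T*_nb = sqrt(n)(Xbar*_b - Xbar)/s*_b, b = 1, ..., B.
t_star = []
for _ in range(B):
    xb = [random.choice(x) for _ in range(n)]
    mb, sb = mean_and_s(xb)
    t_star.append(n ** 0.5 * (mb - xbar) / sb)
t_star.sort()

d_hat = t_star[int(B * alpha)]               # approximately the lower alpha quantile
ucb_boot_t = xbar - d_hat * s / n ** 0.5     # the bootstrap t UCB

# The unstudentized pivot instead gives 2 Xbar - Xbar*_([B alpha]):
boot_means = sorted(sum(random.choice(x) for _ in range(n)) / n for _ in range(B))
ucb_basic = 2 * xbar - boot_means[int(B * alpha)]
```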
10.3.4  Asymptotic Theory for the Bootstrap
Since asymptotic theory for statistics eventually rests on the law of large numbers and the central limit theorem, we start there as well. Suppose $X_1, \ldots, X_n$ are i.i.d. $P$ with $\int x^2\,dP(x) < \infty$. Let $(X^*_1, \ldots, X^*_n)$ be a nonparametric bootstrap sample. By B.1.3 the joint distribution of $(X_1, \ldots, X_n, X^*_1, \ldots, X^*_n)$ is completely determined by the marginal distribution of $(X_1, \ldots, X_n)$ and the conditional distribution of $X^*_1, \ldots, X^*_n$ given $X_i = x_i$, $1 \le i \le n$, specified as $X^*_i$ i.i.d. with common distribution $\hat{P}_n \equiv n^{-1}\sum_{i=1}^n \delta_{x_i}$.

As we have noted, for any statistic $T_n(X_1, \ldots, X_n)$, $\mathcal{L}^*(T_n(X^*_1, \ldots, X^*_n))$, the conditional distribution of $T_n(X^*_1, \ldots, X^*_n)$ given $X_1, \ldots, X_n$, is a random quantity. Suppose $\mathcal{L}_P(T_n(X_1, \ldots, X_n)) \Rightarrow \mathcal{L}_0$. It makes sense to ask if
$$P[\mathcal{L}^*(T_n(X^*_1, \ldots, X^*_n)) \Rightarrow \mathcal{L}_0] = 1,$$
which we define as almost sure (a.s.) convergence in law of the bootstrap distribution of $T_n$, or the weaker statement which is the one needed for statistics: For all $\varepsilon > 0$, as $n \to \infty$,
$$P[\rho(\mathcal{L}^*(T_n(X^*_1, \ldots, X^*_n)), \mathcal{L}_0) \ge \varepsilon] \to 0 \tag{10.3.15}$$
where $\rho$ is a metric for weak convergence, i.e., $\mathcal{L}_n \Rightarrow \mathcal{L}_0$ iff $\rho(\mathcal{L}_n, \mathcal{L}_0) \to 0$. See Problem 10.3.8 for Mallows' metric, which is an example of such a $\rho$. We call (10.3.15) convergence in law in probability of the bootstrap distribution of $T_n$. An elegant proof of (10.3.15) for certain $T_n$ using Mallows' metric can be found in Bickel and Freedman (1981). See Problem 10.3.8.
The most basic result is
Theorem 10.3.2. Suppose $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are i.i.d. as $\mathbf{X} \in \mathbb{R}^d$. Let $P$ denote the joint distribution of $(\mathbf{X}_1, \ldots, \mathbf{X}_n, \mathbf{X}^*_1, \ldots, \mathbf{X}^*_n)$.

(a) If $E|\mathbf{X}| < \infty$ then, as $n \to \infty$,
$$P\Big[\frac{1}{n}\sum_{i=1}^n \mathbf{X}^*_i \to E\mathbf{X}\Big] = 1. \tag{10.3.16}$$

(b) If $E|\mathbf{X}|^2 < \infty$ and $\mathbf{X}$ has positive definite covariance matrix $\Sigma$, then
$$P\Big[\mathcal{L}^*\Big(n^{-\frac{1}{2}}\sum_{i=1}^n (\mathbf{X}^*_i - \bar{\mathbf{X}})\Big) \Rightarrow N(0, \Sigma)\Big] = 1. \tag{10.3.17}$$
Proof. (a) Note that $X^*_1, \ldots, X^*_n$ is a double array because the empirical probability $\hat{P}_n$ depends on $n$. Thus we need to use the uniform strong law of large numbers (SLLN) given in Appendix D.1. That is, we need to show that, with probability one,
$$E^*|X^*_1|1(|X^*_1| \ge M) = n^{-1}\sum_{i=1}^n |X_i|1(|X_i| \ge M)$$
can be made uniformly small in $n$ by taking $M$ large. The right-hand side converges a.s. to $E|X|1(|X| \ge M)$ by the SLLN, and this expression tends to zero as $M \to \infty$ by the dominated convergence theorem (Theorem B.7.5). Thus $P[\bar{X}^* \to \bar{X}] = 1$ by D.6 and D.7. Because $P[\bar{X} \to E(X)] = 1$ by the SLLN, we can for each $\varepsilon > 0$ select $N$ such that, a.s., for $n \ge N$, $|\bar{X}^* - \bar{X}| \le \varepsilon/2$ and $|\bar{X} - E(X)| \le \varepsilon/2$; then, a.s., $|\bar{X}^* - E(X)| \le \varepsilon$ for $n \ge N$, and we have shown (a).

To establish (b), we refer to the Lindeberg–Feller theorem for double arrays of independent variables as stated in Appendix D.1. According to this result (Theorem D.3), to establish (10.3.17) for $d = 1$, we need only check that $E^*(X^*_i) = \bar{X}$ and that for all $\varepsilon > 0$, as $n \to \infty$,
$$\frac{1}{n}\sum_{i=1}^n E^*\big[(X^*_i)^2\, 1(|X^*_i| \ge \varepsilon n^{\frac{1}{2}})\big] \xrightarrow{a.s.} 0 \tag{10.3.18}$$
$$E^*(X^*_i - \bar{X})^2 \xrightarrow{a.s.} \mathrm{Var}(X). \tag{10.3.19}$$
The left-hand side of (10.3.18) is
$$\Delta_n(\varepsilon) \equiv \frac{1}{n}\sum_{i=1}^n X_i^2\, 1(|X_i| \ge \varepsilon n^{\frac{1}{2}}). \tag{10.3.20}$$
Let $M > 0$ be arbitrary and select $n$ so that $\varepsilon n^{\frac{1}{2}} > M$. Then $\Delta_n(\varepsilon) \le U_n$, where
$$U_n \equiv \frac{1}{n}\sum_{i=1}^n X_i^2\, 1(|X_i| > M). \tag{10.3.21}$$
By the SLLN, $U_n \xrightarrow{a.s.} E_P(U_n) = E_P X^2 1(|X| > M)$. By the dominated convergence theorem, we can make $E_P(U_n)$ arbitrarily small by selecting $M$ sufficiently large; thus (10.3.18) follows. Since
$$E^*(X^*_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2,$$
(10.3.19) follows from the SLLN, and the result follows for $d = 1$. The general result is argued in the same way (Problem 10.3.9). ✷
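Conclusion (b) is easy to check numerically. In the sketch below (distribution, $n$, $B$, and seed are all illustrative choices), the conditional variance of $\sqrt{n}(\bar{X}^* - \bar{X})$ given the data is exactly $n^{-1}\sum(X_i - \bar{X})^2$, so Monte Carlo draws from $\mathcal{L}^*$ should have mean near $0$ and variance near that quantity:

```python
import random

random.seed(11)

n, B = 300, 2000
x = [random.random() for _ in range(n)]              # one observed sample from a hypothetical P
xbar = sum(x) / n
sigma2_hat = sum((xi - xbar) ** 2 for xi in x) / n   # variance under P_hat_n

# Monte Carlo draws from L*(sqrt(n)(Xbar* - Xbar))
draws = []
for _ in range(B):
    xb_mean = sum(random.choice(x) for _ in range(n)) / n
    draws.append(n ** 0.5 * (xb_mean - xbar))

m = sum(draws) / B
v = sum((d - m) ** 2 for d in draws) / B             # should be close to sigma2_hat
```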
With this theorem, we can easily establish the conclusion we hoped for in Example
10.3.3 and the preceding discussion.
Example 10.3.3. Bootstrap t (Continued). By Theorem 10.3.2(b), if $\sigma^2(P) = \mathrm{Var}_P(X)$,
$$P[\sqrt{n}(\bar{X}^* - \bar{X}) \Rightarrow N(0, \sigma^2(P))] = 1.$$
Thus the bootstrap approximation to the $(1-\alpha)$ quantile of the distribution of $\sqrt{n}(\bar{X} - \mu)$ converges to $z_{1-\alpha}$ with probability one, which, by the argument leading to (10.3.14), establishes $2\bar{X} - \bar{X}^*_{[\infty\alpha]}$ as an asymptotic $(1-\alpha)$ UCB.

Next consider $\sqrt{n}(\bar{X}^* - \bar{X})/s^*$. By Theorem 10.3.2(a), if $[s^*]^2 \equiv n^{-1}\sum_{i=1}^n (X^*_i - \bar{X}^*)^2$, then
$$s^* \xrightarrow{P} \sigma(P). \tag{10.3.22}$$
Here (10.3.22) follows from $n^{-1}\sum (X^*_i - \bar{X}^*)^2 = n^{-1}\sum (X^*_i - \bar{X})^2 - (\bar{X}^* - \bar{X})^2$,
$$\frac{1}{n}\sum_{i=1}^n (X^*_i - \bar{X})^2 \xrightarrow{P} \sigma^2(P),$$
and
$$(\bar{X}^* - \bar{X})^2 \xrightarrow{P} 0.$$
It follows from (10.3.22) and Slutsky's theorem that
$$\sqrt{n}\,\frac{\bar{X}^* - \bar{X}}{s^*} \Rightarrow N(0, 1)$$
in probability, which yields that $\hat{d}^{(\infty)}_{n\alpha} \xrightarrow{P} z_\alpha$, and hence establishes the asymptotic validity of
$$\bar{X} - \hat{d}^{(\infty)}_{n\alpha}\, s/\sqrt{n} \tag{10.3.23}$$
as a 1 − α UCB.
Next let $n \to \infty$ and let $B$ temporarily be fixed. For each $\delta > 0$,
$$\lim_{n\to\infty} P^*\Big[\sup_t \Big|B^{-\frac{1}{2}}\sum_{b=1}^B \big(1(T^*_{nb} \le t) - P^*[T^*_n \le t]\big)\Big| \ge \delta\Big] \le P[\|\mathcal{E}_B\|_\infty \ge \delta] \tag{10.3.24}$$
where $T^*_n = T_n(X^*_1, \ldots, X^*_n)$ and $\mathcal{E}_B$ is the empirical process for $U_1, \ldots, U_B$ i.i.d. $\mathcal{U}(0,1)$. To establish (10.3.24), let $G_n(t)$ denote the distribution of $T^*_n$ and let $U, U_1, \ldots, U_B$ be i.i.d. $\mathcal{U}(0,1)$. Then the expression inside the sup in (10.3.24) has the same distribution as
$$B^{-\frac{1}{2}}\sum_{b=1}^B \big(1(U_b \le G_n(t)) - P(U \le G_n(t))\big).$$
The sup over $t$ of this expression is bounded above by $\|\mathcal{E}_B\|_\infty$. To show equality, let $H_n$ denote the distribution of $T_n$. Then
$$|G_n(t) - \Phi(t)| \le |G_n(t) - H_n(t)| + |H_n(t) - \Phi(t)|$$
and the central limit theorem together with Polya's theorem (Theorem B.7.7) implies that $G_n(t)$ converges uniformly to $\Phi(t)$. The continuity of $\Phi$ shows that
$$\sup_t \Big|B^{-\frac{1}{2}}\sum_{b=1}^B \big(1(T^*_{nb} \le t) - P^*(T^*_n \le t)\big)\Big|$$
converges weakly to $\|\mathcal{E}_B\|_\infty$. By Donsker's theorem,
$$\hat{d}^{(B_n)}_{n\alpha} \xrightarrow{P} z_\alpha$$
where $B = B_n \to \infty$ as $n \to \infty$. We conclude that for such $B$ the bootstrap $t$ UCB
$$\bar{X} - \hat{d}^{(B)}_{n\alpha}\frac{s}{\sqrt{n}}$$
is asymptotically valid in the sense that $P[\mu(P) \le \bar{X} - \hat{d}^{(B)}_{n\alpha}\, s/\sqrt{n}] \to 1 - \alpha$ (Problem 10.3.11).
Which of these three is better, and is either to be preferred to using
$$\bar{X} - z(\alpha)\frac{s}{\sqrt{n}} = \bar{X} + z(1-\alpha)\frac{s}{\sqrt{n}}$$
as an approximate $(1-\alpha)$ UCB? It turns out (see Hall (1997)) that
$$P\Big[\bar{X} + z(1-\alpha)\frac{s}{\sqrt{n}} \ge \mu(P)\Big] = (1-\alpha) + \Omega(n^{-\frac{1}{2}})$$
where $c_n = \Omega(a_n)$ means $c_n = O(a_n)$ and $a_n = O(c_n)$. Moreover,
$$P[\mu(P) \le \bar{X}^*_{[\infty(1-\alpha)+1]}] = (1-\alpha) + \Omega(n^{-\frac{1}{2}}),$$
if $E_P|X_1|^3 < \infty$, while
$$P\Big[\mu(P) \le \bar{X} - \hat{d}^{(\infty)}_{n\alpha}\frac{s}{\sqrt{n}}\Big] = (1-\alpha) + \Omega(n^{-1})$$
if $E_P X_1^4 < \infty$. This establishes that this bootstrap $t$ UCB method (10.3.23) is better than the bootstrap percentile method if only probability of coverage is considered. These results are based on Edgeworth expansions and are beyond the scope of this book. We refer to Hall (1986) and (1997), where the effect of the choice of $B$ is also discussed. ✷
The empirical process theory of Section 7.1.1 may also be generalized. The most general result, due to Giné and Zinn (1990), is the following:
Theorem 10.3.3. Let $(X_1, \ldots, X_n, X^*_1, \ldots, X^*_n)$ be defined as before,
$$E_n(f) = n^{\frac{1}{2}}\int f\,d(\hat{P}_n - P),$$
and
$$E^*_n(f) = n^{\frac{1}{2}}\int f\,d(\hat{P}^*_n - \hat{P}_n) = n^{-\frac{1}{2}}\Big(\sum_{i=1}^n f(X^*_i) - \sum_{j=1}^n f(X_j)\Big).$$
Then, if $E_n \Rightarrow W^0_P$,
$$P[E^*_n \Rightarrow W^0_P] = 1,$$
and conversely.
Note that since finite dimensional convergence has been established by Theorem 10.3.2,
it is only tightness that is at issue. Some maximal inequalities also generalize to bootstrap
samples. We refer to van der Vaart and Wellner (1996) for further discussion.
Here is an example of an application of the Giné-Zinn result which also follows from
an earlier result of Bickel and Freedman (1981).
Example 10.3.4. Confidence bands for a distribution function. Suppose that $X \in \mathbb{R}$ with distribution function $F$ which is completely unknown. We want to set a simultaneous confidence band for $F$. By the equivalence of tests and confidence regions, we consider the Kolmogorov statistic (see Example 4.4.6)
$$T_n(F_0) \equiv \sqrt{n}\sup_x |\hat{F}(x) - F_0(x)|.$$
If $c_n(F_0)$ is the $1-\alpha$ quantile of the distribution of $T_n(F_0)$ under $F_0$, we can have as a $1-\alpha$ confidence region
$$C_\alpha \equiv \{F : \sqrt{n}\sup_x |\hat{F}(x) - F(x)| \le c_n(F)\}. \tag{10.3.25}$$
If we assume $F_0$ is continuous, $c_n(F) \equiv c_n$ doesn't depend on $F$, and thus
$$C'_\alpha = \hat{F}(x) \pm c_n/\sqrt{n}$$
is a simultaneous $1-\alpha$ confidence band for $F$. It turns out (Problem 10.3.12) that $C'_\alpha$ is, in fact, level $1-\alpha$ for all $F$. However, if $F$ is discrete, $C'_\alpha$ is too big, while $C_\alpha$ is difficult to compute and may not be a band. To remedy this, we can apply the bootstrap principle and estimate $c_n(F)$ by $c_n(\hat{F})$, or rather the usual approximation, the $([B(1-\alpha)]+1)$th order statistic $c_{nB}(\hat{F})$ of
$$T^*_{nb} \equiv \sqrt{n}\sup_x |\hat{F}^*(x) - \hat{F}(x)|, \quad 1 \le b \le B.$$
It follows from our discussion and the Giné–Zinn theorem that, indeed,
$$\hat{C}_\alpha \equiv \hat{F} \pm \frac{c_{nB}(\hat{F})}{\sqrt{n}}$$
is an asymptotic size $1-\alpha$ confidence band for $F$ arbitrary. For further discussion and some Monte Carlo simulations, see Bickel and Krieger (1989).
✷
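A sketch of the band in code (sample, $n$, $B$, and seed are illustrative choices; the supremum is computed exactly over the pooled jump points of the two empirical dfs):

```python
import bisect
import random

random.seed(5)

n, B, alpha = 200, 1000, 0.05
x = sorted(random.gauss(0.0, 1.0) for _ in range(n))   # hypothetical sample; F unknown

def sup_diff(boot):
    # sup_t |F_hat*(t) - F_hat(t)|: both are right-continuous step functions,
    # so the sup is attained at one of the pooled jump points.
    d = 0.0
    for t in boot + x:
        fb = bisect.bisect_right(boot, t) / n          # F_hat*(t)
        f = bisect.bisect_right(x, t) / n              # F_hat(t)
        d = max(d, abs(fb - f))
    return d

t_star = sorted(
    n ** 0.5 * sup_diff(sorted(random.choice(x) for _ in range(n)))
    for _ in range(B)
)
c_nB = t_star[int(B * (1 - alpha))]   # the ([B(1-alpha)] + 1)th order statistic
half_width = c_nB / n ** 0.5          # band: F_hat(t) +/- half_width, clipped to [0, 1]
```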
10.3.5  Examples Where Efron's Bootstrap Fails. The m out of n Bootstraps
Our results so far have not established whether, for instance, Efron's bootstrap, which is drawn $n$ times from $\{X_1, \ldots, X_n\}$, will consistently estimate features of more irregular $T_n(\hat{P}_n, P)$, such as $n\mathrm{Var}_P\{\mathrm{median}(X_1, \ldots, X_n)\}$. In fact, the bootstrap is consistent in approximating the distribution of
$$\sqrt{n}\Big(\mathrm{med}(X_1, \ldots, X_n) - F^{-1}\big(\tfrac{1}{2}\big)\Big)$$
and its variance. However, it is fairly easy to generate examples of inconsistency of the bootstrap, as we next illustrate.

Example 10.3.5. The bootstrap distribution of the maximum. Suppose $X_1, \ldots, X_n$ are i.i.d. real valued with df $F$ such that $F(x) = 1$, $x \ge c$. Assume that $X_1$ has a density $f$ with $f(c) > 0$. Then (Problem 10.3.13),
$$n(c - \max(X_1, \ldots, X_n)) \Rightarrow \mathcal{E}(f(c)), \tag{10.3.26}$$
where $\mathcal{E}(\lambda)$ denotes the exponential distribution with mean $\lambda^{-1}$. But
$$n(\max(X_1, \ldots, X_n) - \max(X^*_1, \ldots, X^*_n))$$
does not converge in law in probability; see Problem 10.3.14. For this and many other examples of bootstrap failures, see Bickel, Götze and van Zwet (1997). ✷
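The failure is visible in a direct simulation. Given the data, $P^*(\max_i X^*_i = \max_i X_i) = 1 - (1 - 1/n)^n \to 1 - e^{-1}$, so the bootstrap law of $n(\max X_i - \max X^*_i)$ retains an atom of mass about $0.632$ at $0$, whereas the limit in (10.3.26) is continuous. A sketch under illustrative assumptions (Uniform(0,1) data, so $c = 1$ and $f(c) = 1$; sizes and seed arbitrary):

```python
import random

random.seed(2)

n, B = 500, 4000
x = [random.random() for _ in range(n)]   # Uniform(0,1): c = 1, f(c) = 1
mx = max(x)

# Bootstrap distribution of n(max X - max X*) given the data: count the atom at 0.
atom = 0
for _ in range(B):
    if max(random.choice(x) for _ in range(n)) == mx:
        atom += 1
p_atom = atom / B                          # should be near 1 - (1 - 1/n)**n ~ 0.632
```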
There is a fix for the bootstrap failures found independently by Götze (1993) and Politis and Romano (1994). For $m \le n$, let $\mathbf{X}^{**} \equiv (X^{**}_1, \ldots, X^{**}_m)$ be a sample from $\mathbf{X} \equiv \{X_1, \ldots, X_n\}$ taken without replacement. Suppose
$$\mathcal{L}_P(T_n(X_1, \ldots, X_n)) \Rightarrow \mathcal{L}_0.$$
Then, if $\mathcal{L}^{**}$ indicates the conditional distribution $\mathcal{L}(\mathbf{X}^{**}|\mathbf{X})$ of $(X^{**}_1, \ldots, X^{**}_m)$ given $X_1, \ldots, X_n$,
$$\mathcal{L}^{**}(T_m(X^{**}_1, \ldots, X^{**}_m)) \Rightarrow \mathcal{L}_0$$
in probability, provided only that $m \to \infty$, $m/n \to 0$. This $m$ out of $n$ without replacement bootstrap, also referred to as subsampling, is studied in Politis, Romano and Wolf (1999). The analogous with replacement $m$ out of $n$ bootstrap, which includes that of Efron, is studied in Bickel, Götze and van Zwet (1997).

The $m$ out of $n$ bootstraps can be seen as methods of regularization. Difficulties with the Efron bootstrap occur when a feature $\lambda_n(Q)$ of $\mathcal{L}_Q(T_n(\hat{P}_n, Q))$, which includes $Q = \hat{P}_n$ as a possible argument, is such that
$$\lambda_n(P) \to \lambda(P),$$
where $\lambda(P)$ is an irregular parameter. Then, while the "bias" $\lambda_n(P) - \lambda(P) \to 0$, the "variance" $\lambda_n(\hat{P}_n) - \lambda_n(P)$ can blow up. Taking $m(n)/n \to 0$ and $m(n) \to \infty$ still makes the "bias" $\lambda_{m(n)}(P) - \lambda(P) \to 0$ but reduces the "variance" arbitrarily, depending on how slowly $m(n) \to \infty$. We illustrate this phenomenon in Problem 10.3.15.
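The maximum example illustrates the fix. In this sketch (Uniform(0,1) data, $m = \lfloor\sqrt{n}\rfloor$, arbitrary sizes and seed, and the unknown $c$ replaced by the full-sample maximum, whose error is negligible since $m/n \to 0$), the subsampling law of $m(\max_i X_i - \max_j X^{**}_j)$ should approximate the Exponential(1) limit of (10.3.26):

```python
import random

random.seed(9)

n, B = 4000, 2000
x = [random.random() for _ in range(n)]     # c = 1, f(c) = 1, target limit Exponential(1)
mx = max(x)
m = int(n ** 0.5)                           # m -> infinity, m/n -> 0

# m out of n WITHOUT replacement (subsampling) approximation
draws = [m * (mx - max(random.sample(x, m))) for _ in range(B)]

p_le_1 = sum(1 for d in draws if d <= 1.0) / B   # target: P(Exp(1) <= 1) = 1 - 1/e
p_atom = sum(1 for d in draws if d == 0.0) / B   # atom at 0 now has mass only ~ m/n
```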
Summary. We introduced the bootstrap, which is a method for approximating distributions, estimates, critical values, and confidence procedures by drawing i.i.d. Monte Carlo samples from the empirical probability distribution $\hat{P}_n$. Let $\mathcal{P}$ be a set of probabilities on a set $\mathcal{X}$, and suppose $\mathcal{P}$ contains all discrete distributions. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be i.i.d. observations from $P \in \mathcal{P}$ and let $T_n(\mathbf{X}, P)$ be a functional which is invariant under permutations of $\mathbf{X}$. We also write $T_n \equiv T_n(\hat{P}_n, P)$, where we have identified $\hat{P}_n$ with $\mathbf{X}$. We considered parameters that are features of $\mathcal{L}(T_n)$ such as the distribution $F_{T_n}(t) = P(T_n \le t)$, quantiles $F^{-1}_{T_n}(\alpha)$, $0 < \alpha < 1$, the expected value $E_P T_n$, and the variance $\mathrm{Var}_P T_n$. More generally, we considered
$$\lambda_n(P) = \theta(\mathcal{L}_P(T_n(\hat{P}_n, P)))$$
where $\theta$ is a map from $\{\mathcal{L}_P(T_n(\hat{P}_n, P)) : P \in \mathcal{P}\}$ to a set $\Theta$. The bootstrap estimate of $\lambda_n(P)$ is the empirical plug-in estimate $\lambda_n(\hat{P}_n)$. We can write $\lambda_n(\hat{P}_n)$ as $\theta(\mathcal{L}^*(T_n(\mathbf{X}^*, \hat{P}_n)))$ where $\mathbf{X}^* = (X^*_1, \ldots, X^*_n)$ is an i.i.d. sample from $\hat{P}_n$. This Monte Carlo sample, which we can think of as a sample drawn with replacement from $\{X_1, \ldots, X_n\}$, is called a bootstrap sample. When $\mathcal{L}^*$ and $\lambda_n(\hat{P}_n)$ are difficult to compute, they can be approximated by the following procedure called the (Efron) bootstrap: Generate $B$ independent bootstrap samples $\mathbf{X}^*_1, \ldots, \mathbf{X}^*_B$ and compute $T^*_{nj} = T_n(\mathbf{X}^*_j, \hat{P}_n)$, $j = 1, \ldots, B$. Now use the empirical distribution $\mathcal{L}^*_B$ of $T^*_{n1}, \ldots, T^*_{nB}$ to approximate $\mathcal{L}^*(T_n(\mathbf{X}^*, \hat{P}_n))$ and use $\theta(\mathcal{L}^*_B)$ to approximate $\lambda_n(\hat{P}_n)$. We showed how the bootstrap can be used to estimate the bias and variance of estimates, and how it can be used to find approximate critical values for tests, as well as approximate confidence bounds, intervals, and regions. We developed and discussed some asymptotic results that imply that in "regular" situations, bootstrap approximations are asymptotically valid. We also gave an example of an "irregular" case where the bootstrap fails, and we introduced alternative bootstraps, the $m$ out of $n$ with and without replacement bootstraps. The latter can be used in all of the situations where the Efron bootstrap does not work.
10.4  Markov Chain Monte Carlo (MCMC)

10.4.1  The Basic MCMC Framework
We want to generate random variables $X_1, X_2, \ldots$ on a sample space $\mathcal{X}$ distributed according to $p$, but direct generation is not feasible. Rejective sampling procedures generate, from a sequence of i.i.d. variables $Y_1, Y_2, Y_3, \ldots$ distributed according to $p_0$, a variable $X$ exactly distributed according to $p$. Evidently we obtain $B$ variables $X_1, \ldots, X_B$ i.i.d. according to $p$ by accepting $X_1 = Y_{\tau_1}$ and then continuing to sample $Y$'s till we get $X_2 = Y_{\tau_2}$, and so on. Rejective sampling is not always practicable because there may not be a clear choice of $p_0$ such that $\sup_x p(x)/p_0(x)$ is not too large. It turns out to be useful to simultaneously enlarge our field of action and our use of the term "sampling" in two directions:

(i) We permit $(Y_1, Y_2, \ldots)$ to be generated according to a Markov sequence, a generalization of independent sampling. The required Markov chain theory is reviewed in Appendix D.5.

(ii) We do not insist that the $X_1, \ldots, X_B$ we obtain be either exactly marginally distributed according to $p$ or be exactly independent.

This generalization of rejective sampling, introduced by Metropolis et al. (1953) and developed by Hastings (1970), has become very important.
The basic idea of MCMC is to construct a homogeneous positive recurrent Markov chain with transition kernel $K(x_1, x_2)$, the conditional density $p(x_2|x_1)$ of $X_2$, evaluated at $x_2$, given $X_1 = x_1$, such that $p$ is the unique stationary density of $K$. Here the term "conditional density" refers to both the discrete and continuous case. Thus in the discrete case, $K(x_1, x_2) = P(X_2 = x_2|X_1 = x_1)$, and in the continuous case moving from $x_1$ to $x_2$ means drawing $x_2$ according to the distribution whose density is $p(x_2|x_1) = K(x_1, x_2)$.

In what follows we will use the continuous case notation and write integrals. In the discrete case, the integrals should be read as sums. For functions $g$ on the sample space $\mathcal{X}$, define the total variation norm by
$$\|g\|_1 \equiv \int |g(x)|\,dx$$
for continuous models and
$$\|g\|_1 = \sum_{i=1}^n |g(x_i)|$$
for discrete models. Specifically, we seek $K(x, y)$ from which, given $Y_1 = x$, we can generate consecutive $Y_2, Y_3, \ldots$ simply, such that the following holds.

(a) For all $y$, $\int p(x)K(x, y)\,dx = p(y)$, that is, stationarity of $p$ with respect to $K$.

(b) Let $Y_1$ be distributed according to a selected $p_0$. Let $Y_2, \ldots, Y_m, \ldots$ be the Markov sequence obtained by using $Y_1$ as an initial value and $K$ as the transition kernel. Then, if $p_m(y)$ is the density of $Y_m$, we require that there exist $c > 0$ and $0 < \rho < 1$ such that
$$\|p_m - p\|_1 \le c\rho^m. \tag{10.4.1}$$

The idea is to begin with a relatively simple $Y_1$ and, for $M$ large, run the chain to $Y_M$ as in (b) above (burn in). Continue the chain and define $X_j = Y_{M+j}$, $j = 1, \ldots, n, \ldots$. Then $X_1, \ldots, X_n, \ldots$ will be distributed approximately according to $p$ by (a) and (b) (Problem 10.4.1). There are many variants, for instance, repeating the (b) step $B$ times using independent $Y_1$'s and using the $B$ resulting approximately-i.i.d.-as-$p$ "$X_1$'s".

Why should we use this extension of rejective sampling, rather than just stick to the former? The main reason is that, as we stated and as we shall see in our examples, it is often the case that a $p_0$ with a reasonably small $c(p, p_0)$ is not readily available. The "uniform" $p_0$ almost invariably is very poor. MCMC in some sense allows us to adjust $p_0$ as we move, so as to get closer to $p$ on the first iterate we choose to retain.
10.4.2  Metropolis Sampling Algorithms
A general class of kernels producing a desired $p$ was proposed by Metropolis et al. (1953) and then generalized by Hastings (1970) and others. All have the following form:

(1) Choose a proposal kernel $K_0(x_1, x_2)$ from which it is relatively easy to generate $x_2$ for a given $x_1$. Given $x_1$, select a candidate new value $x_2$ according to $K_0(x_1, x_2)$.

(2) For a given function $r : \mathbb{R}^2 \to [0, 1]$ such that $r(x, x) \equiv 1$, move to $x_2$ with probability $r(x_1, x_2)$; otherwise stay at $x_1$. Call the resulting value $y_1$.

(3) Repeat with $y_1$ in place of $x_1$ to obtain $y_2$, etc.

This leads to the new kernel
$$K(x_1, x_2) = r(x_1, x_2)K_0(x_1, x_2), \quad \text{if } x_1 \ne x_2$$
$$K(x_1, x_1) = K_0(x_1, x_1) + \sum_{y \ne x_1}(1 - r(x_1, y))K_0(x_1, y) = 1 - \sum_{y \ne x_1} r(x_1, y)K_0(x_1, y). \tag{10.4.2}$$
The original Metropolis algorithm has $K_0(\cdot, \cdot)$ symmetric and
$$r(x_1, x_2) \equiv \min\Big(1, \frac{p(x_2)}{p(x_1)}\Big).$$
Intuitively, a symmetric $K_0$ has the uniform distribution as stationary distribution (Problem 10.4.2). Then this $r(x_1, x_2)$ nudges the generated $x$'s towards values more probable under $p$, so that, in the end, the relative frequency with which $x$ is visited over a long time is approximately $p(x)$. Also note that, as for rejective sampling, to compute $r$ we need only know $p(\cdot)$ up to a constant.
It is evident that if $x_1 \ne x_2$, for the Metropolis original proposed $K$,
$$p(x_1)K(x_1, x_2) = \min(p(x_1), p(x_2))K_0(x_1, x_2) = \min(p(x_1), p(x_2))K_0(x_2, x_1) = p(x_2)K(x_2, x_1). \tag{10.4.3}$$
The condition
$$p(x_1)K(x_1, x_2) = p(x_2)K(x_2, x_1), \quad \text{for all } x_1, x_2 \tag{10.4.4}$$
is called detailed balance and implies that $p$ is the stationary distribution of $K(\cdot, \cdot)$; see Appendix D.5.
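Detailed balance makes the construction easy to verify on a toy discrete target. In the following sketch (the state space, weights, and chain length are arbitrary illustrative choices), the proposal is a symmetric nearest-neighbor walk on a cycle; note that the acceptance ratio needs the target only up to its normalizing constant:

```python
import random

random.seed(1)

w = [1.0, 2.0, 4.0, 2.0, 1.0]              # unnormalized target p (hypothetical)
S = len(w)
p = [wi / sum(w) for wi in w]

def step(i):
    j = (i + random.choice([-1, 1])) % S    # symmetric proposal K0
    r = min(1.0, w[j] / w[i])               # Metropolis r(x1, x2) = min(1, p(x2)/p(x1))
    return j if random.random() < r else i

# Detailed balance holds by construction for neighbors i, j:
# p(i) * (1/2) * min(1, p(j)/p(i)) = (1/2) * min(p(i), p(j)), symmetric in (i, j).

counts = [0] * S
i = 0
for t in range(200000):
    i = step(i)
    if t >= 1000:                           # crude burn-in
        counts[i] += 1
total = sum(counts)
freq = [c / total for c in counts]          # long-run frequencies should approximate p
```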
The following result gives a general prescription for constructing $K$.

Proposition 10.4.1 (Stein). Suppose $K_0$ is a Markov kernel and
$$K(x_1, x_2) = r(x_1, x_2)K_0(x_1, x_2), \quad x_1 \ne x_2. \tag{10.4.5}$$
Then $K$ is itself a Markov kernel satisfying detailed balance if and only if $r(x_1, x_2)$ can be written as
$$r(x_1, x_2) = \frac{\delta(x_1, x_2)}{p(x_1)K_0(x_1, x_2)} \tag{10.4.6}$$
where $\delta(x_1, x_2)$ is symmetric and, for all $x_1$,
$$\sum_{x_2 \ne x_1} \delta(x_1, x_2) \le p(x_1). \tag{10.4.7}$$

Proof. Detailed balance holds by definition iff $p(x_1)K(x_1, x_2)$ is symmetric. To check symmetry note that, by (10.4.5) and (10.4.6),
$$\delta(x_1, x_2) = p(x_1)r(x_1, x_2)K_0(x_1, x_2) = p(x_1)K(x_1, x_2).$$
Thus detailed balance holds for $K$ if and only if $\delta$ is symmetric. To establish (10.4.7), note that because we need
$$0 \le K(x_1, x_1) = 1 - \sum_{x_2 \ne x_1} r(x_1, x_2)K_0(x_1, x_2),$$
we must have
$$\sum_{x_2 \ne x_1} r(x_1, x_2)K_0(x_1, x_2) = \sum_{x_2 \ne x_1} \frac{\delta(x_1, x_2)}{p(x_1)} \le 1,$$
and conversely. The proposition follows.
✷
Remark 10.4.1. (a) The inequality (10.4.7) is implied by $r(x_1, x_2) \le 1$ for all $x_1, x_2$. This is necessary for our interpretation of $r$ and for the implementation of step (2) of the algorithm, but not for (10.4.4) and (10.4.5).

(b) The general Metropolis proposal is to use
$$r(x_1, x_2) \equiv \min\Big(1, \frac{p(x_2)K_0(x_2, x_1)}{p(x_1)K_0(x_1, x_2)}\Big) \tag{10.4.8}$$
for which the corresponding $\delta(x_1, x_2) = \min(p(x_1)K_0(x_1, x_2), p(x_2)K_0(x_2, x_1))$ is evidently symmetric.

(c) If $K_0$ satisfies detailed balance with respect to $p$, then we may choose $K = K_0$.

The Metropolis condition, and a fortiori (10.4.7), give little guidance as to how $K_0$ is to be chosen; indeed, this is usually dictated by the structure of the particular situation we want to deal with and by considerations which are both numerical and statistical.

To illustrate the Metropolis algorithms, we consider an interesting application. We shall pursue this somewhat complicated example further in Section 10.5.
Example 10.4.1. A prediction problem. Let
$$Y_i = Z_i + \varepsilon_i, \quad i = 1, \ldots, n$$
where we assume that $(Z_1, \ldots, Z_n)$ are observed and, for initial simplicity, that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. with known $N(0,1)$ distribution. We do not, however, observe $Y_1, \ldots, Y_n$ but rather the order statistics $(Y_{(1)}, \ldots, Y_{(n)})$, where $Y_{(1)} < \ldots < Y_{(n)}$. Our goal is to predict $Y_i$, $i = 1, \ldots, n$, on the basis of this data. An example (Petty et al. (2002)) where this question arises (with the distribution of the $\varepsilon_i$ unknown) is observation of a set of $n$ cars on a highway passing the first of two checkpoints at times $Z_1 < \ldots < Z_n$, which label the cars, and again at a second checkpoint at times $Y_{(1)} < \ldots < Y_{(n)}$. Observation is carried out by a device (loop detector) which can record the time of passing of a vehicle but not its identity. The model specifies that $\varepsilon_1, \ldots, \varepsilon_n$, the travel times of the vehicles, are i.i.d. given $(Z_1, \ldots, Z_n)$, and we wish to predict $\varepsilon_i = Y_i - Z_i$ given this data. But, since the $Z_i$ are known, this is equivalent to predicting $Y_i$. Let $(R_1, \ldots, R_n)$ be the (unobserved) ranks of $Y_1, \ldots, Y_n$ defined by $Y_i = Y_{(R_i)}$. In what follows, all quantities depend on $(Z_1, \ldots, Z_n)$, but we suppress this dependence. The best squared error predictor of $Y_i$ is, by Theorem 1.4.1,
$$E(Y_i|Y_{(1)}, \ldots, Y_{(n)}) = \sum_{j=1}^n Y_{(j)} P[R_i = j|Y_{(1)}, \ldots, Y_{(n)}] \tag{10.4.9}$$
and the $P[R_i = j|Y_{(1)}, \ldots, Y_{(n)}]$ are known in principle. Unfortunately, the natural formula is computationally hopeless since it appears to involve $n!$ multiplications of $n$ terms each and $n!$ additions:
$$P[R_i = j|Y_{(1)}, \ldots, Y_{(n)}] = \frac{\sum\{A(Y_{(k_1)}, \ldots, Y_{(k_n)}, \mathbf{Z}) : \text{all permutations } (k_1, \ldots, k_n) \text{ of } \{1, \ldots, n\} \text{ with } k_i = j\}}{\sum\{A(Y_{(k_1)}, \ldots, Y_{(k_n)}, \mathbf{Z}) : \text{all permutations } (k_1, \ldots, k_n)\}} \tag{10.4.10}$$
where
$$A(Y_{(k_1)}, \ldots, Y_{(k_n)}, \mathbf{Z}) = \prod_{j=1}^n \varphi(Y_{(k_j)} - Z_j).$$
However, we claim that the expression (10.4.10) can be approximated using MCMC. The idea is to generate $B$ (approximately) independent "observations" $(R_{1b}, \ldots, R_{nb})$, $1 \le b \le B$, from the conditional distribution of $(R_1, \ldots, R_n)$ given $Y_{(1)}, \ldots, Y_{(n)}$, using the Metropolis algorithm. We can then compute the natural estimate of the marginal distribution of $R_i$, $i = 1, \ldots, n$, by
$$\hat{P}[R_i = j|Y_{(1)}, \ldots, Y_{(n)}] = \frac{\#\{b : R_{ib} = j\}}{B}$$
and the natural predictor of $Y_i$,
$$\hat{Y}_i \equiv \hat{E}(Y_i|Y_{(1)}, \ldots, Y_{(n)}) = \sum_{j=1}^n Y_{(j)}\hat{P}[R_i = j|Y_{(1)}, \ldots, Y_{(n)}].$$
To see that we can apply the Metropolis sampler, note that
$$P[R_1 = k_1, \ldots, R_n = k_n|Y_{(1)}, \ldots, Y_{(n)}] \propto \prod_{j=1}^n \varphi(Y_{(k_j)} - Z_j)$$
where the "unknown" proportionality constant is the marginal density of $(Y_{(1)}, \ldots, Y_{(n)})$, $\sum\{A(Y_{(k_1)}, \ldots, Y_{(k_n)}, \mathbf{Z}) : \text{all permutations } (k_1, \ldots, k_n)\}$. The state space for our Markov chain is thus the set of all permutations $(k_1, \ldots, k_n)$ of $\{1, \ldots, n\}$.

What proposal distribution shall we use? One possibility is to take $K(x_1, x_2) = \frac{1}{n!}$ for all $x_1, x_2$, that is, pick the proposal permutation uniformly at random. This seems unreasonable in the car example since, given that cars have approximately the same speed and $Z_1 < \ldots < Z_n$, the identity permutation $(1, \ldots, n)$ should be favored. A possible type of proposal, suggested by Ritov in Pasula et al. (1999) and studied extensively by Ostland (1999), is to make only interchanges of a pair of cars possible on each step; that is,
$$K_0\{(k_1, \ldots, k_n), (k'_1, \ldots, k'_n)\} > 0$$
iff, for one and only one $j$, $k'_j = k_{j+1}$, $k'_{j+1} = k_j$, and $k'_i = k_i$ for $i \ne j, j+1$. Call such moves interchanges. Further, it is assumed that all interchanges have equal probability $(n-1)^{-1}$. Since $K_0$ is symmetric, the Metropolis algorithm can be described by
(1) Given $\mathbf{k} = (k_1, \ldots, k_n)$, propose the interchange $(k_j, k_{j+1}) \to (k_{j+1}, k_j) = (k'_j, k'_{j+1})$.

(2) Accept the interchange with probability
$$r(\mathbf{k}, \mathbf{k}') = \min\{\exp\{(Z_{j+1} - Z_j)(Y_{(k_j)} - Y_{(k_{j+1})})\}, 1\}.$$

Note that since, in the car example, $Z_{j+1} - Z_j > 0$, this procedure reasonably always accepts interchanges such that cars which follow each other at the beginning but do not follow each other at the end in the initial permutation do so after interchange. But, with probability depending (inversely) on the discrepancy between the arrival times at the beginning and at the end, we can also invert the natural order as given by the initial permutation. It may be shown that $K$ is geometrically ergodic, that is, satisfies (10.4.1). Hints to proofs of the assertions of this example are given in the problems. A discussion and simulations of this and related models may be found in Ostland (1999). ✷
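A compact sketch of the interchange sampler on simulated data; everything here ($n$, the entry times, chain lengths, thinning, seed) is an illustrative choice, not the cited studies' setup:

```python
import math
import random

random.seed(4)

n, burn, B, thin = 8, 2000, 400, 20
Z = sorted(random.uniform(0.0, 10.0) for _ in range(n))   # observed entry times
Y = [z + random.gauss(0.0, 1.0) for z in Z]               # latent exit times
Ysort = sorted(Y)                                         # observed order statistics

k = list(range(n))   # current permutation: car j is matched with Y_(k[j]) (0-based)

def sweep():
    j = random.randrange(n - 1)   # propose interchanging positions j, j+1
    r = min(1.0, math.exp((Z[j + 1] - Z[j]) * (Ysort[k[j]] - Ysort[k[j + 1]])))
    if random.random() < r:
        k[j], k[j + 1] = k[j + 1], k[j]

for _ in range(burn):
    sweep()

counts = [[0] * n for _ in range(n)]   # counts[i][j] ~ B * P_hat[R_{i+1} = j+1 | data]
for _ in range(B):
    for _ in range(thin):
        sweep()
    for i in range(n):
        counts[i][k[i]] += 1

# Predictor Y_hat_i = sum_j Y_(j) * P_hat[R_i = j | data]
Yhat = [sum(Ysort[j] * counts[i][j] for j in range(n)) / B for i in range(n)]
```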
10.4.3  The Gibbs Samplers
An important and well-analyzed class of kernels $K$ dictated by problem structure is that of the Gibbs samplers. The simplest case of these can be characterized as special cases of Metropolis samplers with $K = K_0$, since this simple Gibbs sampler satisfies detailed balance.

Suppose that $X = (U, V)$ has distribution with density $p(\cdot, \cdot)$. $U$ and $V$ can be vectors, $U \in \mathbb{R}^{d_1}$, $V \in \mathbb{R}^{d_2}$, or abstract. We assume that we can easily generate observations from the conditional distributions $\mathcal{L}(V|U = u)$ for all $u$ and $\mathcal{L}(U|V = v)$ for all $v$, and want to generate observations (approximately) from the marginal distribution of $U$, respectively $V$, or (see Remark 10.4.2) generate observations from the joint distribution of $(U, V)$. The Gibbs sampler, introduced by Geman and Geman (1984), prescribes a Markov chain on $\mathbb{R}^{d_1}$ whose stationary distribution is the marginal distribution of $U$ as follows:

(1) Draw $U_1 = u_1$ from some distribution on $\mathbb{R}^{d_1}$.

(2) Generate $V_1 = v_1$ according to the conditional distribution of $V$ given $U = u_1$.

(3) Draw $U_2 = u_2$ from the conditional distribution of $U$ given $V = v_1$, and repeat (2) and (3) indefinitely, using the output $u$ generated in (3) as the input $u$ for (2).

We claim that

Proposition 10.4.2. Let $U_1, U_2, \ldots$ be generated as in (2) and (3) above. Then

(a) The desired marginal density of $U$, call it $p(\cdot)$, is known up to a constant.

(b) The chain has $p$ as stationary density.

(c) The transition kernel defined by (2) and (3) satisfies detailed balance.
Proof. To establish (a), note that if $p(v|u)$ and $q(u|v)$ denote the conditional densities of $V$ given $U$ and of $U$ given $V$, and $q$ is the marginal density of $V$, then
$$p(v|u) = \frac{q(u|v)q(v)}{p(u)} \tag{10.4.11}$$
by Bayes' theorem. Hence
$$p(u) = \frac{q(u|v_0)q(v_0)}{p(v_0|u)}$$
for any fixed $v_0$ with $p(v_0|u) > 0$, and (a) holds with unknown constant $q(v_0)$.

To establish (b), note that the Markov kernel $K(u_1, u_2)$ we have described is
$$K(u_1, u_2) = \int p(v|u_1)q(u_2|v)\,dv \tag{10.4.12}$$
where we assume that all variables considered have continuous type densities, but this is immaterial. Essentially, the procedure is valid whenever the formalism makes sense. The general condition for stationarity is easily checked:
$$\int p(u_1)\int p(v|u_1)q(u_2|v)\,dv\,du_1 = \int q(u_2|v)\int p(u_1)p(v|u_1)\,du_1\,dv = \int q(u_2|v)q(v)\,dv = p(u_2). \tag{10.4.13}$$
The proof of detailed balance is left to the problems (Problem 10.4.3). ✷

Remark 10.4.2. Note that, similarly, we can generate $V_2, V_3, \ldots$ satisfying Proposition 10.4.2. For the Gibbs sampler, $U \sim p(\cdot)$ and $V \sim q(\cdot)$ are, by (2) and (3), equivalent to $(U, V) \sim p(\cdot, \cdot)$ because $p(u, v) = p(u)p(v|u) = q(v)q(u|v)$. Thus $(U_M, V_M), (U_{M+1}, V_{M+1}), \ldots$ are approximately from $\mathcal{L}(U, V)$.
Here is an important example of the use of the Gibbs sampler.

Example 10.4.2. Posterior distributions in the Gaussian model. We consider an extension of Example 1.6.12 where we observe $X_1, \ldots, X_n$ i.i.d. $N(\mu, \sigma^2)$ and then prescribe a conjugate prior distribution for $\theta = (\mu, \sigma^2)$ by specifying that $\mu$ and $\sigma^2$ are independent with $\mu$ having a $N(\mu_0, \tau_0^2)$ distribution and $\sigma^{-2}$ having a $\Gamma(p_0, \lambda_0)$ distribution. This model falls under the framework of Example 10.1.3. As we have seen in Example 1.6.12, the posterior distribution of $\mu$ given $X_1, \ldots, X_n$ and $\sigma^2$ is
$$N\left(\frac{\dfrac{n\bar{X}}{\sigma^2} + \dfrac{\mu_0}{\tau_0^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\tau_0^2}},\; \frac{1}{\dfrac{n}{\sigma^2} + \dfrac{1}{\tau_0^2}}\right).$$
This is $\mathcal{L}(U|V = v)$ with $U = \mu$ and $V = \sigma^{-2}$ in Gibbs sampler notation. The posterior distribution of $\sigma^{-2}$ given $\mu$ and the data can be computed (Problem 10.4.5) and is the gamma distribution $\Gamma(p_0 + \frac{1}{2}n, \lambda(\mathbf{X}, \mu))$, where
$$\lambda(\mathbf{X}, \mu) = \lambda_0 + \frac{1}{2}\sum (X_i - \mu)^2. \tag{10.4.14}$$
Thus $\mathcal{L}(V|U = u)$ is also available. So we can implement the Gibbs sampler once we know how to draw from the normal and gamma distributions, as we have learned in Section 10.2. In this way we get, for instance, (approximate) observations from the posterior distribution of $\sigma^2$ given $\mathbf{X}$. As we noted, we can use the Gibbs sampler to generate a bivariate Markov chain $(U_m, V_m)$, with $U_m$ corresponding to $\mu$ and $V_m$ to $\sigma^{-2}$, yielding approximate observations from the posterior of $(\mu, \sigma^{-2})$ given $\mathbf{X}$. We shall see how to use these methods in inference in Section 10.5. ✷
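A sketch of this two-block sampler on simulated data. The data, the hyperparameters $(\mu_0, \tau_0^2, p_0, \lambda_0)$, the chain length, and the burn-in are all illustrative choices; note that Python's `random.gammavariate` is parameterized by shape and scale, so the rate $\lambda(\mathbf{X}, \mu)$ enters as $1/\lambda$:

```python
import random

random.seed(8)

n = 100
data = [random.gauss(3.0, 2.0) for _ in range(n)]   # hypothetical X_1, ..., X_n
xbar = sum(data) / n
mu0, tau0sq, p0, lam0 = 0.0, 100.0, 2.0, 2.0        # illustrative hyperparameters

mu, prec = 0.0, 1.0                                  # initialize (mu, sigma^{-2})
mus, sig2s = [], []
for t in range(3000):
    # mu | sigma^2, X: the normal posterior displayed above
    a = n * prec + 1.0 / tau0sq
    mu = random.gauss((n * xbar * prec + mu0 / tau0sq) / a, (1.0 / a) ** 0.5)
    # sigma^{-2} | mu, X ~ Gamma(p0 + n/2, lambda(X, mu)), lambda as in (10.4.14)
    lam = lam0 + 0.5 * sum((xi - mu) ** 2 for xi in data)
    prec = random.gammavariate(p0 + n / 2.0, 1.0 / lam)
    if t >= 500:                                     # burn-in
        mus.append(mu)
        sig2s.append(1.0 / prec)

post_mu = sum(mus) / len(mus)        # posterior mean of mu, near the data mean
post_sig2 = sum(sig2s) / len(sig2s)  # posterior mean of sigma^2, near the data variance
```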
Example 10.4.3. A bivariate normal example. Let

L(Y |X = x) = N(ρx, 1 − ρ²) = L(X|Y = x)

for |ρ| < 1. By Theorem B.4.2 in Volume I, these are the conditional distributions of the bivariate normal distribution (X, Y ) ∼ N₂(0, 0, 1, 1, ρ), and hence X ∼ Y ∼ N(0, 1). We will observe the workings of the Gibbs sampler. Suppose we initialize

X^(0) ∼ N(0, σ²) .

Then we can write

Y^(0) = ρX^(0) + (1 − ρ²)^{1/2} Z ,

where Z ∼ N(0, 1) is independent of X^(0). Therefore,

Y^(0) ∼ N(0, ρ²σ² + (1 − ρ²)) .

At the next round, X^(1) = ρY^(0) + (1 − ρ²)^{1/2} Z′, with Z′ ∼ N(0, 1) independent of Y^(0), so that

X^(1) ∼ N(0, ρ⁴σ² + ρ²(1 − ρ²) + (1 − ρ²)) = N(0, ρ⁴σ² + (1 − ρ⁴)) .

Continuing, if at the kth stage

X^(k) ∼ N(0, σ_k²) ,

then (Problem 10.4.8)

X^(k+1) ∼ N(0, ρ⁴σ_k² + (1 − ρ⁴)) ,

and thus

σ_{k+1}² − 1 = ρ⁴(σ_k² − 1) .

Then σ_k² is monotone decreasing or increasing according as σ₀² ≷ 1 and converges to the limit σ_∞² satisfying

σ_∞² − 1 = ρ⁴(σ_∞² − 1) .

So σ_∞² = 1. Note that the convergence is exponentially fast.
✷
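The variance recursion of this example is deterministic and can be iterated directly, making the geometric convergence σ_k² − 1 = ρ^{4k}(σ₀² − 1) visible in a few lines (a sketch of ours):

```python
def gibbs_variance(sigma0_sq, rho, k):
    """Iterate the marginal-variance recursion of Example 10.4.3:
    sigma_{k+1}^2 = rho^4 * sigma_k^2 + (1 - rho^4),
    whose unique fixed point is sigma^2 = 1."""
    s = sigma0_sq
    for _ in range(k):
        s = rho ** 4 * s + (1.0 - rho ** 4)
    return s
```

For instance, with ρ = 0.8 and σ₀² = 4, twenty-five cycles already put σ_k² within roughly 10⁻⁹ of 1.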
Example 10.4.4. The random effects model. We follow Example 3.2.4 and have
Xij = µ + ∆i + εij , 1 ≤ i ≤ I, 1 ≤ j ≤ J ,
with the ∆ᵢ and ε_{ij} i.i.d. and independent of each other, N(0, σ_∆²) and N(0, σ_e²), respectively. Then, by (3.2.17), if τ ≡ (µ, σ_e², σ_∆²)ᵀ and ∆ ≡ (∆₁, . . . , ∆_I)ᵀ,

p(x|∆) = Π_{i,j} ϕ_{σ_e}(x_{ij} − µ − ∆ᵢ) ,

which depends on τ . Here the ∆ᵢ are latent (unobserved) variables with joint density

p(∆|µ, σ_e², σ_∆²) = Πᵢ ϕ_{σ_∆}(∆ᵢ) .
In the Bayesian framework we put a prior π(·) on τ and include ∆ in our list of parameters.
The full posterior distribution is then
π(τ, ∆|x) ∝ p(x|∆)p(∆|τ )π(τ ) ,
where the proportionality constant c(x) is such that

∫ π(τ, ∆|x) dτ d∆ = 1 .

The value of the constant is irrelevant if we want the posterior mode but matters for credible Bayesian regions of τ . It is absolutely necessary if, say, we want the posterior density of σ_∆² given x.
We can, in principle, obtain this constant by the Metropolis–Hastings method, finding a homogeneous Markov kernel K̃(τ₁, τ₂) which has the posterior distribution of τ |x as its stationary distribution in ways we have discussed. Note that since we are generating an empirical probability distribution, say that of τ_{b₀}, τ_{b₀+t}, . . . , τ_{b₀+nt}, observations from the Markov chain after "burn-in" spaced t apart (where t is "large"), this empirical distribution yields estimates of every plausible function of π(τ |x).
This approach also works for a non-Bayesian formulation of the model. Suppose µ, σ_∆², σ_e² are simply unknown parameters. Then we are interested in good estimates µ̂, σ̂_e², σ̂_∆² and confidence regions for ∆ based just on the conditional distribution of ∆ given X with parameter values µ̂, σ̂_e², σ̂_∆². We still need to calculate the conditional distributions. These are necessarily Gaussian in this case since (∆, X) has a joint Gaussian distribution, which is easy to compute. This is not the case, however, if, for instance, we consider a robust alternative to model (3.2.16) and assume the ∆ᵢ have, say, logistic distributions. Neglecting what is now the difficulty of estimating σ_∆², σ_e², and µ, we are still left with a formidable integration problem in finding the conditional density of ∆ᵢ given X. MCMC can come to the rescue as before.
✷
The natural generalization of the simple Gibbs samplers appropriate to Bayesian problems of this structure is to the case where we want to simulate from p, the density of
U = (U₁, . . . , U_k), where the Uⱼ may themselves be vectors. We are given that the conditional distribution of Uᵢ given {Uⱼ : j ≠ i}, call it L(Uᵢ|U₋ᵢ), is easy to sample from.
Then the idea is to cycle, initializing by picking U = u^0 arbitrarily, then drawing U₁ = u_1^1 from L(U₁|U₋₁ = u_{-1}^0), then U₂ from L(U₂|U₋₂ = (u_1^1, u_3^0, . . . , u_k^0)), and continuing till the kth coordinate, when we have u^1 = (u_1^1, . . . , u_k^1) complete, and then repeating. Although this chain does not obey detailed balance, we can show that its stationary distribution is p by arguing as we did for the case k = 2. An example is the random effects model (Example 10.4.4), where Uᵢ = ∆ᵢ.
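The cycling scheme just described can be sketched generically; the cond_samplers interface below is our own device for supplying the conditionals L(Uᵢ | U₋ᵢ), illustrated on the bivariate normal of Example 10.4.3.

```python
import numpy as np

def systematic_gibbs(cond_samplers, u0, n_iter, seed=0):
    """Systematic-scan Gibbs: cond_samplers[i](rng, u) returns a draw of
    coordinate i from L(U_i | U_{-i} = u_{-i}); coordinates are updated in
    order 1, ..., k using the freshest values of the others, and the state
    after each full cycle is recorded."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    out = np.empty((n_iter, len(u)))
    for t in range(n_iter):
        for i, draw in enumerate(cond_samplers):
            u[i] = draw(rng, u)
        out[t] = u
    return out

# Demo: bivariate normal, each conditional is N(rho * other, 1 - rho^2)
rho = 0.6
samplers = [lambda rng, u: rng.normal(rho * u[1], np.sqrt(1 - rho ** 2)),
            lambda rng, u: rng.normal(rho * u[0], np.sqrt(1 - rho ** 2))]
chain = systematic_gibbs(samplers, [3.0, -3.0], 20000, seed=2)
```

After a burn-in the recorded states behave like draws from the N₂(0, 0, 1, 1, ρ) target.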
10.4.4 Speed of Convergence and Efficiency of MCMC
As we have noted, a reasonable measure of the desirability of a Metropolis or Gibbs sampler
{X1 , . . . , Xn } once simplicity has been taken into account is governed by the speed of
convergence of the distribution of Xₙ to the desired stationary distribution with discrete or continuous case density p. But how is convergence to be measured in general? If the state
space X = {x1 , . . . , xN } is finite and the transition matrix is K = kK( · , · )kN ×N then,
as we note in Appendix D5, if a homogeneous Markov chain is irreducible, aperiodic, and
satisfies detailed balance, then, as required in (10.4.1), for some ρ ∈ (0, 1),

∆ₙ(K) ≡ Σ_{k=1}^N |P[Xₙ = x_k] − p(x_k)| ≍ ρⁿ
where “≍ ρn ” means that, as n → ∞, n−1 log ∆n (K) → log ρ.
We next relate convergence to eigenanalysis. Consider an aperiodic irreducible reversible Markov Chain on the finite state space {1, 2, . . . , N }. Under these conditions it
can be shown (Grimmett and Stirzaker (2001), Sections 6.6 and 6.14) that the transition
matrix K has real eigenvalues λ1 , . . . , λN such that λ1 = 1 and |λj | < 1 for j = 2, . . . , N .
Moreover the entries vrj in the corresponding eigenvectors vr are real. Define the inner
product

(x, y) = Σ_{j=1}^N xⱼ yⱼ pⱼ ,
where pj is the probability of observing j under the stationary distribution. We can take
the eigenvectors v₁, . . . , v_N to be an orthonormal basis with respect to (·, ·). It follows
that v1 = (1, . . . , 1)T ≡ 1. We can now give an exact expression for the n-step transition
probability.
Proposition 10.4.3. Set pij (n) = P (Xn+1 = j|X1 = i). Under the conditions in the
previous paragraph,
p_{ij}(n) − pⱼ = pⱼ Σ_{r=2}^N λᵣⁿ v_{rj} v_{ri} .
Proof. We can write the unit vector eⱼ = (0, . . . , 1, . . . , 0)ᵀ as

eⱼ = Σ_{r=1}^N (eⱼ, vᵣ) vᵣ = Σ_{r=1}^N v_{rj} pⱼ vᵣ .     (10.4.15)
Next note that Kⁿ eⱼ = (p_{1j}(n), . . . , p_{Nj}(n))ᵀ and Kⁿ vᵣ = λᵣⁿ vᵣ. Multiply (10.4.15) on the left by Kⁿ and find that

p_{ij}(n) = pⱼ Σ_{r=1}^N λᵣⁿ v_{rj} v_{ri} .
The result follows because λ1 = 1 and v1 = 1; so the first term in the sum is pj .
✷
Now we can establish convergence rates:

Corollary 10.4.1. Under the conditions of Proposition 10.4.3,

|p_{ij}(n) − pⱼ| ≤ pⱼ (|λ|₂)ⁿ Σ_{r=2}^N |v_{rj} v_{ri}| ,

where |λ|₂ = max_{2≤j≤N} |λⱼ| is less than one. Moreover, the total variation norm satisfies

Σ_{j=1}^N |p_{ij}(n) − pⱼ| ≤ |vᵢ|₂ (N − 1)(|λ|₂)ⁿ ,

where |vᵢ|₂ = max_{2≤r≤N} |v_{ri}|.
Proof. The first part follows from the definition of |λ|₂. The second part follows because Σⱼ pⱼ |v_{rj}| ≤ 1 by Cauchy–Schwarz and because Σⱼ pⱼ v_{rj}² = 1 by the definition of the inner product for the eigenvectors. ✷
Remark. For more on the convergence of MCMC, see Diaconis and Stroock (1991), Grimmett and Stirzaker (2001), and Diaconis (2009).
Efficiency of MCMC
We will measure the effectiveness of MCMC by comparing it to the approximations we
would have if we could obtain an i.i.d. sample from p. First we need the following result which shows that the second largest eigenvalue λ2 can be interpreted as a maximal
correlation.
Theorem 10.4.1. Suppose the state space of a homogeneous chain is finite, and that
K( · , · ) satisfies detailed balance. Then, if the eigenvalues of the transition matrix K
are ordered so that 1 = λ1 ≥ λ2 ≥ . . . ≥ λN ,
λ2 = ρ+ ≡ max{Corr(g(X1 ), g(X2 )) : g arbitrary}.
Proof. The pair (X₁, X₂) has discrete case density K(xᵢ, xⱼ)pᵢ, where pᵢ = P(X₁ = xᵢ). The optimization problem whose solution we claim to be λ₂ is to maximize, over vectors g = (g₁, . . . , g_N)ᵀ,

Σ_{i,j} gᵢ gⱼ K(xᵢ, xⱼ) pᵢ     (10.4.16)

subject to

(i) Σᵢ gᵢ pᵢ = 0 ,  (ii) Σᵢ gᵢ² pᵢ = 1 .

Regard (10.4.16) as a sum over j. Fix j and maximize each term gⱼ Σᵢ gᵢ K(xᵢ, xⱼ)pᵢ. The solution g⁰, which necessarily exists, satisfies, for all j and some γ₁, γ₂,

Σᵢ gᵢ K(xᵢ, xⱼ)pᵢ = γ₂ pⱼ gⱼ + γ₁ pⱼ ,     (10.4.17)

since the method of Lagrange multipliers applies (Apostol (1957), Theorem 7.10, p. 153; see Problem 10.4.9). The sum over j of the left side of (10.4.17) reduces to Σᵢ gᵢ pᵢ = 0. Thus the sum over j of the right side is also 0. This implies that γ₁ = 0 by (i). By detailed balance,

(pᵢ/pⱼ) K(xᵢ, xⱼ) = K(xⱼ, xᵢ) ,

which implies that γ₂ is a right eigenvalue of K = ‖K(xᵢ, xⱼ)‖_{N×N} and g⁰ is a right eigenvector. Since all eigenvectors are vectors of real numbers and g ≡ 1 is not a candidate (by (i)), we conclude by the Courant–Fischer theorem (Problem 8.3.24) that γ₂ is the second largest eigenvalue, λ₂.

Moreover, multiplying (10.4.17) by gⱼ, summing over j, and using (ii), we see that

γ₂ = max{Corr(g(X₁), g(X₂))} .
✷
Proposition 10.4.4. Under the assumptions of Theorem 10.4.1,
max{Corr(g(X1 ), g(Xk+1 )) : g arbitrary} = (ρ+ )k .
Proof. See Problem 10.4.10.
Remark. Corr(X₁, X_{k+1}) is called the autocorrelation of the chain, and

max_g {Corr(g(X₁), g(X_{k+1}))}

the maximal autocorrelation.
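The identification ρ₊ = λ₂ of Theorem 10.4.1 can also be verified numerically on a small reversible kernel (our toy construction): the correlation attained by g = v₂ equals λ₂, and no other g does better.

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
N = len(p)
K = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if i != j:
            K[i, j] = min(1.0, p[j] / p[i]) / N   # Metropolized uniform proposal
    K[i, i] = 1.0 - K[i].sum()

d = np.sqrt(p)
lam, W = np.linalg.eigh(d[:, None] * K / d[None, :])
lam2, g2 = lam[-2], W[:, -2] / d        # second-largest eigenvalue and v_2

def stationary_corr(g):
    """Corr(g(X1), g(X2)) when X1 ~ p and X2 | X1 ~ K(X1, .)."""
    m = np.sum(p * g)
    s = np.sqrt(np.sum(p * (g - m) ** 2))
    gc = (g - m) / s
    return gc @ (p[:, None] * K) @ gc
```

Here stationary_corr(g2) equals lam2, and arbitrary g's give no larger a correlation, in line with the theorem.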
Next we show how ρ₊ can be used to assess the effectiveness of MCMC.

Example 10.4.5. Consider the problem of estimating E_p g(X₁) = Σ_{j=1}^N g(xⱼ)pⱼ by ḡ ≡ B⁻¹ Σ_{b=1}^B g(X_b), where X₁ is initialized at some distribution and X₂, . . . , X_B are obtained via K(·, ·). Then

E(ḡ − E_p g(X₁))² = [ B⁻¹ Σ_{b=1}^B (E g(X_b) − E_p g(X₁)) ]²
    + B⁻¹ [ B⁻¹ Σ_{b=1}^B Var g(X_b) + 2B⁻¹ Σ_{b=1}^{B−1} Σ_{a=b+1}^B Cov(g(X_a), g(X_b)) ] .     (10.4.18)
On the other hand, suppose that X₁′, . . . , X_B′ are an i.i.d. sample from p and define ḡ′ ≡ B⁻¹ Σ_{b=1}^B g(X_b′). It is reasonable to define the effectiveness or "efficiency" of the Markov chain sampling scheme by

e(K) ≡ lim inf_{B→∞} { E(ḡ′ − E_p g(X₁))² / E(ḡ − E_p g(X₁))² : all g not constant } .
We claim that:

Theorem 10.4.2. Under the assumptions of Theorem 10.4.1,

e(K) = (1 − ρ₊) / (1 + ρ₊) .
The proof is sketched in Problems 10.4.10 and 10.4.11. It is quite possible that ρ+ < 0
so that the Markov chain is more efficient than independent sampling — see Problem
10.4.12. Rosenthal (2003) shows, in the context of general chains, how to construct
Metropolis samplers with good speed of convergence from samplers with ρ+ close to −1.
Speed of convergence results are also available for countable or, more importantly, continuous state spaces. However, more conditions are required and computation is even more
challenging. An introductory treatment is in Tierney’s article in Gilks et al (1995) and a
thorough treatment is given in Meyn and Tweedie (1993). See also Rosenthal (2003). Unfortunately, computing or even bounding λ2 in cases of interest is difficult. A key question
that remains is how to determine the length of the “burn in” period from initial simulations.
We refer to Liu (2001), Diaconis (2009), and Brooks, Gelman, Jones, and Meng (2011),
for instance, for more literature on MCMC.
Summary. We introduced Markov Chain Monte Carlo methods which generate X1 , X2 , . . .
that are approximately i.i.d. from p, where p is a density known up to a constant, and from
which generation of random variables is difficult. The method entails generating observations from a homogeneous Markov chain with p as stationary distribution. In particular,
we presented the Metropolis–Hastings method for constructing such Markov chains. We
introduced the Gibbs sampler which is an algorithm where knowledge of L(V |U = u)
and L(U |V = v) is used to generate variables U and V from the marginal distributions
L(U ) and L(V ) as well as from the joint distribution L(U, V ). We showed, using an example, how in a Bayesian framework, the Gibbs sampler can be used to generate data from
L(θ1 |X), L(θ2 |X), and L(θ1 , θ2 |X) when L(θ1 |X, θ2 ) and L(θ2 |X, θ1 ) are easy to generate from. Since variables which approximately have the desired p are only produced after
the chain is run for some time — the so called “burn in” period, it is important to know
how good the approximation is. We gave conditions under which the discrete case density pk of Xk is close to p. In particular, when X has a finite state space X , then under
conditions on the Markov kernel K, the distance between p and pk is of order ρk , where
ρ ∈ (0, 1). We finally discussed how MCMC can be used to approximate integrals of the form ∫ g(x)p(x) dx and measured the effectiveness of such approximations by comparing them to Monte Carlo approximation based on (unavailable) i.i.d. samples from p.
10.5 Applications of MCMC to Bayesian and Frequentist Inference
Geyer makes the argument in his paper in Gilks et al (1996) that MCMC is as applicable
in frequentist inference as it is in Bayesian inference, given that one is basically concerned
with computing high dimensional integrals. We agree with this point of view but go even
further. We argued in Section 5.5 that, at least from an asymptotic point of view, if the
model we consider is regular parametric, the parameter θ belongs to Rd and the prior
distribution put on θ is smooth, then optimal Bayes procedures coincide asymptotically
with efficient frequentist procedures (Maximum Likelihood), from a frequentist point of
view. This point of view runs into trouble when d is of the order of n or larger — see, for
instance, Diaconis and Freedman (1986). These questions, as well as more basic issues of
consistency of Bayes procedures continue to be vigorously investigated — see Ghosh and
Ramamoorthi (2003).
We begin with a continuation of Example 10.4.2.
Example 10.5.1. The Gaussian model. We saw in Example 10.4.2 how we could use the
Gibbs sampler to approximately generate observations θ∗1 , . . . , θ∗B from the posterior distribution of θ = (µ, σ −2 ) when the prior distribution has µ and σ −2 independent Gaussian
and Gamma, respectively. Assume for simplicity that the observations were obtained by
running a sampler from independent starting points so that the Xi∗ generated by the sampler are independent and that, although this is true only approximately, their distribution is
correct.
We have seen earlier that the Bayes estimate of µ for quadratic loss, or other symmetric loss functions, if σ is assumed known, is just the posterior mean

(n/σ² + 1/τ₀²)⁻¹ (nX̄/σ² + µ₀/τ₀²) .

However, if σ² is unknown and has a prior distribution as in Example 10.4.2, to get an estimate of µ we have to compute the marginal posterior mean of µ,

E{ (n/σ² + 1/τ₀²)⁻¹ (nX̄/σ² + µ₀/τ₀²) | X } ,

for which we need

∫₀^∞ (nw + 1/τ₀²)⁻¹ (nX̄w + µ₀/τ₀²) q(w|X) dw ,
where q(w|X) is the posterior density of σ⁻², a quantity only determinable through numerical integration. On the other hand, we have, using θ₁*, . . . , θ_B*, the simple approximation B⁻¹ Σ_{b=1}^B θ*_{1b} of the marginal posterior mean of µ, where θ_b* = (θ*_{1b}, θ*_{2b})ᵀ. Similarly, B⁻¹ Σ_{b=1}^B θ*_{2b} provides an estimate of the marginal posterior mean of σ⁻².
Next, suppose we want the shortest possible Bayes level 1 − α credible interval for µ. We need to find [µ, µ̄] such that

(i) Σ_{b=1}^B 1(θ*_{1b} ∈ [µ, µ̄]) = [B(1 − α)],

(ii) µ̄ − µ minimizes µ₂ − µ₁ among all pairs (µ₁, µ₂), µ₁ < µ₂, satisfying (i).
Finding a minimum volume Bayes credible region for θ, a generalization of the problem stated above, can be done as follows. Since the posterior density of θ is proportional to the joint density of (X, θ), which is completely known, the regions which need to be considered are just all

S_c = {θ : p(X|θ) q(θ) ≥ c} ,

where q is the prior density and p(X|θ) is the conditional density of X given θ. Evidently we now choose c so that

(1/B) Σ_{b=1}^B 1(θ_b* ∈ S_c) ≅ 1 − α .
✷
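Given posterior draws, conditions (i)–(ii) amount to sliding a window of about [B(1 − α)] order statistics across the sorted sample and taking the narrowest window; a sketch (the function name and rounding convention are ours):

```python
import numpy as np

def shortest_credible_interval(draws, alpha=0.05):
    """Among all intervals containing a fraction (1 - alpha) of the posterior
    draws, return the one of minimal length (conditions (i)-(ii))."""
    s = np.sort(np.asarray(draws, dtype=float))
    B = len(s)
    k = int(np.ceil((1.0 - alpha) * B))   # number of draws the interval must cover
    widths = s[k - 1:] - s[:B - k + 1]    # width of each window of k consecutive draws
    j = int(np.argmin(widths))
    return s[j], s[j + k - 1]
```

For a markedly skewed posterior this interval is shorter than the equal-tailed one; for a symmetric unimodal posterior the two nearly coincide.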
We pursue these ideas more generally. We observe X ∼ p(·|θ), θ ∈ Θ, where p(x|θ) is a density function, θ has a prior density π(θ), and (θ, X) has joint density π(θ)p(x|θ); we want to generate observations from the posterior density of θ given X.
advantage of special features of Example 10.4.2 and we used the Gibbs sampler. How do we generate the θ_b* in general? The all-purpose answer is a Metropolis algorithm, since specification of p(x|θ) and q(θ) is necessary to define the model. That is, we obtain the θ_b* by the usual algorithm: Choose a positive recurrent aperiodic Markov kernel K(·, ·) on Θ × Θ with stationary density π(θ|x) and generate observations from it after a suitable burn-in from a number of starting points. As usual, one picks K₀(·, ·) positive recurrent, aperiodic Markov on Θ × Θ and then lets, for θ₁ ≠ θ₂,

K(θ₁, θ₂) = r(θ₁, θ₂) K₀(θ₁, θ₂)  with  r(θ₁, θ₂) = min{ 1, [p(x|θ₂)π(θ₂)K₀(θ₂, θ₁)] / [p(x|θ₁)π(θ₁)K₀(θ₁, θ₂)] } .     (10.5.1)
All we are using here is that the posterior density of θ, π(θ|X), is proportional (up to a
data dependent constant) to π(θ)p(x|θ).
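A minimal random-walk sketch of (10.5.1): with a symmetric K₀ (here a Gaussian step) the Hastings ratio reduces to the ratio of unnormalized posterior densities, so only log_post, which returns log p(x|θ)π(θ) up to a constant, is needed (the names are ours).

```python
import numpy as np

def metropolis(log_post, theta0, scale=1.0, n_iter=10000, seed=0):
    """Random-walk Metropolis: propose theta' = theta + scale * Z, Z ~ N(0, 1).
    With a symmetric proposal, accept with probability
    min(1, post(theta') / post(theta)), computed on the log scale."""
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    lp = log_post(theta)
    out = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + scale * rng.normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        out[t] = theta
    return out
```

For example, log_post(θ) = −θ²/2 (an unnormalized N(0, 1)) yields draws with mean near 0 and variance near 1 after burn-in.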
Let A be an action space and l : Θ × A → R+ be a loss function. The Bayes procedure
for any such problem is, according to Section 3.2, given by
δ*(x) = arg min_a E(l(θ, a)|x) ,  where  E(l(θ, a)|x) = ∫ l(θ, a) dΠ(θ|x)

and Π(θ|x) is the posterior probability distribution with density π(θ|x). Suppose we are able to generate θ₁*, . . . , θ_B* independent, each having (approximately) the posterior distribution of θ given X = x. Then it is natural to use the approximation

δ*(B)(x) = arg min_a (1/B) Σ_{b=1}^B l(θ_b*, a) .     (10.5.2)
Thus, for instance, in Example 10.5.1, in general, without our particular choice of π(µ, σ⁻²), the conditional distributions of µ and σ⁻², respectively, given each other and X, would not have a closed form. But the Metropolis sampler is still implementable, although, of course, the usual questions of speed of convergence need to be asked.
The Gibbs and Metropolis sampler can be applied in frequentist inference in a number
of ways. One we have already seen in the prediction context in Example 10.4.1. We pursue
that in
Example 10.5.2. Maximum likelihood and travel time prediction. EM–MCMC. We assumed, in the prediction problem, that the travel time density was known to be N(0, 1). Suppose, more realistically, that the density is in fact unknown. We can approximate the density by a parametric model f(·, θ), such as the gamma family or, more generally, finite mixtures of gammas. How do we carry out maximum likelihood estimation of θ? One possible approach is via the EM algorithm and MCMC. Represent Y₁, . . . , Yₙ as (Y₍₁₎, . . . , Y₍ₙ₎, R₁, . . . , Rₙ) with density

f(Y, θ|Z) = Π_{i=1}^n f(Y_{(Rᵢ)} − Zᵢ, θ) .
Viewing R₁, . . . , Rₙ as missing, we see that we can alternate:

E step: Compute

Λ(θ, θ̂_OLD) ≡ Σ_{i=1}^n E_{θ̂_OLD}{ log f(Y_{(Rᵢ)} − Zᵢ, θ) | Y₍₁₎, . . . , Y₍ₙ₎ } .

M step: Maximize Λ(θ, θ̂_OLD) to get θ̂_NEW by solving

∇Λ(θ, θ̂_OLD) = Σ_{i=1}^n E_{θ̂_OLD}{ ∇ log f(Y_{(Rᵢ)} − Zᵢ, θ) | Y₍₁₎, . . . , Y₍ₙ₎ } = 0 .     (10.5.3)
The first computation can be done, in principle, by MCMC since

Λ(θ, θ̂_OLD) = Σ_{i=1}^n Σ_{j=1}^n log f(Y₍ⱼ₎ − Zᵢ, θ) P_{θ̂_OLD}[Rᵢ = j | Y₍₁₎, . . . , Y₍ₙ₎] ,     (10.5.4)

which is a problem of the same type we considered in Example 10.4.2. (Here, everything is conditional on Z.) Thus, using the same notation as before, on the M step we maximize

Λ̂(θ, θ̂_OLD) ≡ Σ_{i=1}^n Σ_{j=1}^n log f(Y₍ⱼ₎ − Zᵢ, θ) P̂_{θ̂_OLD}[Rᵢ = j | Y₍₁₎, . . . , Y₍ₙ₎] ,

where P̂_{θ̂_OLD} denotes the MCMC estimate of the conditional rank probabilities. That is, we alternate: for fixed θ̂_OLD, run MCMC given θ̂_OLD, then maximize Λ̂(θ, θ̂_OLD) to get θ̂_NEW. Implementation of this kind of algorithm is discussed in Ostland (1999).
✷
We can think of this implementation of EM, and a more flexible approach, in a general context, following Geyer — see Chapter 14 in Gilks et al (1995). We are given f(x, v, θ), the joint density of (X, V) under θ, in an analytically tractable form. We observe X, so that the likelihood is

g(X, θ) = ∫ f(X, v, θ) dv ,

and our goal is to maximize g(X, θ) with respect to θ. According to EM, we need, for the E step, given θ̂_OLD = θ₀,

Λ(θ, θ₀) ≡ E_{θ₀}( log f(X, V, θ) | X ) .

For θ₀ fixed, we construct a Metropolis or other sampler for the conditional distribution of V given X under θ₀ and get V₁, . . . , V_B. We estimate

Λ̂(θ, θ₀) = B⁻¹ Σ_{b=1}^B log f(X, V_b, θ)

and maximize Λ̂ in the M step.
A more basic alternative is to use formula (2.4.8), which gives

L(θ, θ₀) ≡ g(X, θ)/g(X, θ₀) = E_{θ₀}( f(X, V, θ)/f(X, V, θ₀) | X ) .

Again, we generate V₁, . . . , V_B (approximately) i.i.d. from the conditional distribution of V given X under θ₀ and estimate L(θ, θ₀) by

L̂(θ, θ₀) = B⁻¹ Σ_{b=1}^B f(X, V_b, θ)/f(X, V_b, θ₀) .     (10.5.5)

Naively, we can view L̂ as the object to be maximized, and thus we apparently do not need to run a new sampler for each θ. This is, however, illusory since the approximation of L by L̂ is poor if θ is not close to θ₀, as we illustrate in the next example.
But this approach does make it clear that we can vary strategies, for instance by replacing EM steps by other methods of optimizing L̂(θ, θ̂_OLD), such as the Newton–Raphson method. An important feature of these approaches is that, since the sampler, and hence the objective function, change at each iteration, we no longer have the guaranteed increase in the objective function that the classical EM and other good algorithms give us.
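The estimate (10.5.5) can be tried on a toy latent-variable model of our own: V ∼ N(θ, 1) latent and X|V ∼ N(V, 1), so g(x, θ) is the N(θ, 2) density, V given X = x under θ₀ is exactly N((x + θ₀)/2, 1/2), and L̂ can be compared with the closed-form ratio.

```python
import numpy as np

def norm_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

def L_hat(theta, theta0, x, B=20000, seed=0):
    """Estimate L(theta, theta0) = g(x, theta)/g(x, theta0) via (10.5.5),
    with V_1, ..., V_B drawn from the conditional of V given X = x under
    theta0 (here exactly N((x + theta0)/2, 1/2))."""
    rng = np.random.default_rng(seed)
    v = rng.normal((x + theta0) / 2.0, np.sqrt(0.5), size=B)
    # f(x, v, theta) = phi(v - theta) phi(x - v); the phi(x - v) factors cancel
    return np.mean(norm_pdf(v, theta, 1.0) / norm_pdf(v, theta0, 1.0))

def L_true(theta, theta0, x):
    """Closed-form ratio g(x, theta)/g(x, theta0), g the N(theta, 2) density."""
    return norm_pdf(x, theta, 2.0) / norm_pdf(x, theta0, 2.0)
```

For θ near θ₀ the estimate is accurate with modest B; pushing θ far from θ₀ makes the weights heavy-tailed and the estimate unstable, the phenomenon warned about above.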
Here is an application which illustrates the risks of naive application of this method and
indicates the correct way of proceeding.
Example 10.5.3. Transformation model. Many models, including the Cox proportional hazard model, so-called frailty models, and a semiparametric extension of the Box–Cox (1964) transformation model, can be put in the following framework: We observe

(Zᵢ, Yᵢ), 1 ≤ i ≤ n, i.i.d. as (Z, Y ), Z ∈ R^d, Y ∈ R .
We postulate that there is an unknown monotone strictly increasing transformation a : R → R such that for θ ∈ R^d,

a(Yᵢ) = θᵀZᵢ + εᵢ ,  1 ≤ i ≤ n ,     (10.5.6)

where ε is independent of Z and has known distribution F₀ with density f₀. We can interpret this by saying that, on an unknown scale a, the responses Y follow an ordinary linear model. Thus, we have a semiparametric model {P_{(θ,a)} : θ ∈ R^d, a increasing}. It may be shown (Problem 10.5.3) that if

f₀(t) = e^{−t} e^{−e^{−t}} ,  t ∈ R ,

a Gumbel density, then the model is a reparametrization of the Cox proportional hazard
model. Other distributions correspond to frailty models (Vaupel, Manton, and Stallard (1979)), where a latent variable ξ is postulated for each individual and a conditional hazard rate for Y is given by

λ(y, θ|z, ξ) = ξ r(θ, z) λ(y) .     (10.5.7)

If we assume ξ comes from a fixed known distribution, then this is again a transformation model. See Problem 10.5.6.
From the structure of these models we see that, to test θ = θ₀, we can invoke invariance of the model under the group of monotone transformations, as in Chapter 8, to conclude that it is reasonable to base tests on the likelihood p_θ(r, z) of (Zᵢ, Rᵢ), 1 ≤ i ≤ n, where R₁, . . . , Rₙ are the ranks of Y₁, . . . , Yₙ. Because Rank(a(Yᵢ)) = Rank(Yᵢ), we can without loss of generality set a(y) = y in this likelihood. Moreover, because p_θ(r, z) = p_θ(r|z)h(z) and h(z) does not depend on θ, it is enough to consider the conditional density function of R given Zᵢ = zᵢ, 1 ≤ i ≤ n, which, for general f_θ(y|z), is

L(θ) ≡ P_θ[R = r|z₁, . . . , zₙ] = E_{θ′}( Π_{i=1}^n [f_θ(Yᵢ|zᵢ)/f_{θ′}(Yᵢ|zᵢ)] | R = r ) P_{θ′}[R = r|z₁, . . . , zₙ]     (10.5.8)

where θ′ is a reference value of θ which is selected in what follows, and f_θ(y|z) is the conditional density of Y given Z = z.
In the linear model (10.5.6) a natural choice for θ′ is 0, in which case

L(θ) = P_θ[R = r|z₁, . . . , zₙ] = (1/n!) E₀( Π_{i=1}^n [f_θ(Y_{(rᵢ)}|zᵢ)/f₀(Y_{(rᵢ)})] ) ,     (10.5.9)

which is Hoeffding's formula. See Problem 9.1.23. Note that under θ = 0 the Yᵢ are i.i.d., and (R₁, . . . , Rₙ) and (Z₁, . . . , Zₙ) are independent with P₀[R = r] ≡ 1/n!. Example 10.5.2 also has this structure, save that (10.5.6) is replaced by a parametric model and we concentrated on (Y₍₁₎, . . . , Y₍ₙ₎) rather than (R₁, . . . , Rₙ). Now it seems reasonable to estimate θ by maximizing a Monte Carlo approximation to L(θ). From (10.5.9), this appears possible even without MCMC, and was proposed by Doksum (1987) for θ satisfying |θ|/σ₀ = O(n^{−1/2}), where σ₀² is the variance of ε. In this approach:
Generate sets of n independent observations Y*_{1b}, . . . , Y*_{nb}, 1 ≤ b ≤ B, with Y*_{jb} having density f₀(·), 1 ≤ j ≤ n, and then, given R = r, Z = z as data, estimate L(θ) by

L̂(θ) ≡ (1/(B n!)) Σ_{b=1}^B { Π_{i=1}^n [f_θ(Y*_{(rᵢ)b}|zᵢ)/f₀(Y*_{(rᵢ)b})] }     (10.5.10)

where Y*_{(1)b} ≤ . . . ≤ Y*_{(n)b} are the ordered Y*_{ib}.
This can be viewed as importance sampling or as an implementation of (10.5.5). Here

Var L̂(θ) = (1/(B(n!)²)) Var₀( Π_{i=1}^n [f_θ(Y_{(rᵢ)}|zᵢ)/f₀(Y_{(rᵢ)})] )
which for fixed n tends to zero as B → ∞. However, we face the usual importance
sampling issue. For the linear model (10.5.6),
Var₀( Π_{i=1}^n [f_θ(Yᵢ|zᵢ)/f₀(Yᵢ)] ) = Π_{i=1}^n E₀( f_θ²(Yᵢ|zᵢ)/f₀²(Yᵢ) ) − 1     (10.5.11)
    = Π_{i=1}^n ( ∫ [f₀²(y − θᵀzᵢ)/f₀(y)] dy ) − 1 .
Here, as n → ∞, for any fixed θ ≠ 0 the variance goes to ∞ exponentially (Problem 10.5.1), although it is O(1) for the |θ|/σ₀ = O(n^{−1/2}) case considered by Doksum (1987) (Problem 10.5.2).
There are various alternative ways to proceed, taking advantage of special features of particular models — see Murphy (1995) for gamma frailty, for instance. However, for linear regression transformation models, a general way is to use (10.5.8) and MCMC ideas as follows:

0. Let θ̂₀ denote the MLE for Y₁, . . . , Yₙ distributed as Π_{i=1}^n f₀(yᵢ − θᵀzᵢ).

1. At the mth step, m ≥ 0, generate B i.i.d. vectors (Y_{1b}, . . . , Y_{nb}), 1 ≤ b ≤ B, where Y_{1b}, . . . , Y_{nb} are distributed as Π_{i=1}^n f₀(yᵢ − θ̂ₘᵀzᵢ). Let Y*_{(1)b}, . . . , Y*_{(n)b} denote Y_{1b}, . . . , Y_{nb} ordered.
2. Estimate L(θ) by

L*ₘ(θ) = (1/B) Σ_{b=1}^B Π_{i=1}^n [f_θ(Y*_{(rᵢ)b}|zᵢ)/f_{θ̂_OLD}(Y*_{(rᵢ)b}|zᵢ)] ,     (10.5.12)

where θ̂_OLD = θ̂ₘ. Maximize L*ₘ(θ) to obtain θ̂ₘ₊₁ and proceed to step m + 1.
We call this algorithm the likelihood sampler. It produces a semiparametric likelihood estimate of θ for the transformation model where (a(Y)|z) ∼ f_θ(y|z) for unknown increasing a(·). When f₀ is N(0, 1), this is an extension of the Box–Cox model, which has a(t) = (t^λ − 1)/λ, λ ∈ R.
✷
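The Monte Carlo rank likelihood (10.5.10) is easy to sketch for the normal-errors case with scalar z (our implementation); at θ = 0 every product term equals 1, so L̂(0) = 1/n! exactly, which gives a handy sanity check.

```python
import math
import numpy as np

def std_norm_pdf(y, m=0.0):
    return np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)

def rank_likelihood(theta, r, z, B=2000, seed=0):
    """Monte Carlo estimate (10.5.10) of the rank likelihood L(theta) with
    f0 = N(0, 1): average over B samples of ordered N(0, 1) draws of
    prod_i f_theta(Y*_{(r_i)b} | z_i) / f0(Y*_{(r_i)b}), divided by n!."""
    rng = np.random.default_rng(seed)
    n = len(r)
    ys = np.sort(rng.normal(size=(B, n)), axis=1)   # Y*_{(1)b} <= ... <= Y*_{(n)b}
    yr = ys[:, np.asarray(r) - 1]                   # Y*_{(r_i)b}; ranks are 1-based
    terms = np.prod(std_norm_pdf(yr, theta * np.asarray(z)) / std_norm_pdf(yr), axis=1)
    return terms.mean() / math.factorial(n)
```

Rank vectors concordant with z give larger estimated likelihood at positive θ than discordant ones, although, as noted above, the variance of the estimate grows quickly with |θ|.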
Summary. We demonstrated how MCMC and Gibbs sampling algorithms can be used in
Bayesian analysis to compute approximations to estimates and credible intervals by generating data approximately distributed as L(θ1 , θ2 |X) when L(θ1 |X, θ2 ) and L(θ2 |X, θ1 )
are known. We next showed, in examples, how to use MCMC to approximate maximum
likelihood estimates in difficult situations. In particular we showed how a combination of
the EM algorithm and MCMC can be used to compute maximum likelihood estimates in
the travel time model; and we showed how MCMC can be used to compute maximum
likelihood estimates in a transformation model with unknown monotone transformation.
10.6 Problems and Complements
Problems for Section 10.1
1. In Example 10.1.1 (a), suppose both G1 and G2 are χ22 under H. Let m = n − m = 15.
Use R or other software to generate the Monte Carlo and permutation distributions of the
two-sample t-statistic. Use M = 1000 Monte Carlo trials and 1000 random permutations.
Summarize the results in histograms and compare the results to the Student t₂₈ density of
Section B.3.1.
2. In Example 10.1.2, suppose that F0 is known.
(a) Show that by conditioning on X̄([nα]+1) and X̄(n−[nα]) it is possible to express
E0 (X̄α2 ) in terms of one and two dimensional integrals.
(b) Use (a) and MATLAB or some other software to evaluate E0 (X̄α2 ) when n = 10,
α = 0.25, 0.50, 0.75 and (i) F₀ is N(0, 1), (ii) F₀ is Laplace, f₀(x) = (1/2) exp{−|x|}.
Problems for Section 10.2
1. Let r = m/2 and s = n/2 for positive integers m and n. Describe how Theorem B.2.3
of Vol. I can be used to generate a random variable with a β(r, s) distribution.
2. Show that in Theorem 10.2.1, V is minimized by choosing p0 proportional to |h|p.
3. (a) Establish (10.2.6) by showing that for this example, as s, r → ∞,

c(p, p₀)/√(r + s) → 1/√(2πt(1 − t)) .
(b) Show that rejective sampling is not useful if p0 is uniform and r or s < 1.
4. Let p denote the β(r, s) density with r, s > 1. Set m = [2r] and n = [2s] where [·] is the
greatest integer function. Let r0 = m/2 and s0 = n/2 and let p0 be the β(r0 , s0 ) density.
Discuss the behavior of c(p, p₀) as s and r vary, including the case where s, r → ∞.
5. Show how one can simulate a random variable with the density (10.2.7).
Hint. Let Z be Bernoulli(1/2). Generate a value of Z. If Z = 0, let X be generated by
the density rx^{r−1}, 0 < x < 1 (show how). If Z = 1, let X be generated by the exponential,
E(1) density.
6. The logistic distribution has df F_L(x) = [1 + exp{−x}]⁻¹ and density

f_L(x) = e⁻ˣ/(1 + e⁻ˣ)² ,  −∞ < x < ∞ .
(a) Describe how the simple Monte Carlo technique can be used to generate X1 , . . . , Xn
with density fL .
Hint. Find FL−1 (·).
(b) Describe how rejective sampling and (a) preceding can be used to generate X₁, . . . , Xₙ with density

f(x) = a f_L(x)/(1 + e^{−|x|}) ,  −∞ < x < ∞, a > 0 .
(c) Use MATLAB or other software to find a in (b) preceding. Then find E(τ ) where τ
is as in Theorem 10.2.2.
(d) Carry out rejective sampling as in (b) preceding to obtain an observed sample x1 , . . . ,
x100 from f . You may use existing software. How long did the procedure take? Draw
a histogram of your results.
7. Suppose X has density
f(x) = a e^{−|x|}/(1 + e^{−|x|}) ,  a > 0, x ∈ R .
(a) Describe how rejective sampling can be used to generate X.
(b) Use MATLAB or other software to find the constant a.
8. Let f and g be continuous case density functions. Assume there exists a finite M such
that f (y) ≤ M g(y) for all y. To obtain a random sample of size n from the density function
f (x) using the Rejection Method Algorithm, simulate Y1 , . . . , YN from the density g(x).
Then record the accepted sample {X1 , . . . , Xn } and the rejected sample {Z1 , . . . , ZN −n }.
Let Ef denote expectation with respect to f .
(a) Are δ₁ = n⁻¹ Σ_{i=1}^n h(Xᵢ) and δ₂ = N⁻¹ Σ_{i=1}^N h(Yᵢ)f(Yᵢ)/g(Yᵢ) unbiased estimators of E_f h(X)?
(b) Find the marginal density of Zi .
(c) Suppose N > n. Is the estimator

δ₃ = (1/(N − n)) Σ_{j=1}^{N−n} (M − 1)f(Zⱼ)h(Zⱼ) / (M g(Zⱼ) − f(Zⱼ))

an unbiased estimator of E_f h(X)? Explain.
9. Suppose X₁, . . . , Xₙ are i.i.d. P, where P ∈ P = {P_θ : θ ∈ R}.

(a) Assume P is regular in the sense that ψ(x, θ) ≡ (∂/∂θ) log p(x, θ) satisfies conditions A1–A4, A6 in Section 5.4.2 of Volume I. Let θₙ = t/√n + θ₀ and L(X₁, . . . , Xₙ) ≡ Σ_{i=1}^n (log p(Xᵢ, θₙ) − log p(Xᵢ, θ₀)). Show that for X₁, . . . , Xₙ i.i.d. P_{θ₀},

L_{θ₀}(L(X₁, . . . , Xₙ)) ⟹ N( −(t²/2) I(θ₀), t² I(θ₀) ) ,

where I(·) denotes Fisher information.

Hint. See Examples 7.1.1 and 7.1.2.

(b) Suppose that |(∂ʲ/∂θʲ) ψ(x, θ)| are uniformly bounded for θ in a compact set and 1 ≤ j ≤ 3. Given a bounded function gₙ : X^n → R, |gₙ(X₁, . . . , Xₙ)| ≤ M < ∞, we wish to estimate Eₙ ≡ E_{θₙ} gₙ(X₁, . . . , Xₙ). Suppose it is easy to generate i.i.d. observations from P_{θ₀} and we use the importance sampling estimate

Êₙ ≡ (1/B) Σ_{b=1}^B gₙ(X*_{1b}, . . . , X*_{nb}) exp{L(X*_{1b}, . . . , X*_{nb})} ,

where {X*_{ib} : 1 ≤ i ≤ n} are i.i.d. from P_{θ₀}, 1 ≤ b ≤ B.

Show that Êₙ − Eₙ → 0 in probability as n and B tend to infinity.

Hint. Var(Êₙ) ≤ C/B for some constant C independent of n. (In (b), if necessary, you may assume that the given boundedness condition implies the conditions for part (a).)
Problems for Section 10.3
1. Establish (10.3.6), (10.3.7), and Theorem 10.3.1 for the parameter θ(P) = g(∫ x dP(x)), where we assume g′′′ exists and |g′′′| ≤ M, E|X₁|⁴ ≤ M for some constant M > 0. Formally show that if B ≍ n², then

BIAS_n^{(B)}(P̂ₙ) = BIASₙ(P̂ₙ) + O_P((Bn)^{−1/2}) ,
BIASₙ(P̂ₙ) = BIASₙ(P) + O_P(n^{−3/2}) ,
ŜD_n^{(B)} = SDₙ(P̂ₙ) + O_P(n⁻¹B⁻¹) ,
SDₙ(P̂ₙ) = SDₙ(P) + O_P(n^{−3/2}) .

Hint. For (10.3.6), use the definition of the bootstrap based on B replications. For (10.3.7), use the expansion

θ(P*) = θ(P̂ₙ) + g′(X̄)(X̄* − X̄) + (g′′(X̄)/2)(X̄* − X̄)² + (g′′′(X̄)/6)(X̄* − X̄)³ + O_P(n^{−3/2}) ,

BIAS_n^{(B)}(P̂ₙ) = E* θ(P*) − θ(P̂ₙ) + σ̂²/n + O_P(n^{−3/2}) .
2. Suppose g : R^d → R^p has total differential g⁽¹⁾(x) = ‖(∂gᵢ/∂xⱼ)(x)‖_{p×d} at x = µ ≡ EX₁. Generalize Theorem 10.3.2(b) to

L*( √n(g(X̄*) − g(X̄)) ) → N( 0, g⁽¹⁾(µ) Var(X₁) [g⁽¹⁾(µ)]ᵀ ) .

Hint. See Theorem 5.3.4.
3. Assume X1 , . . . , Xn i.i.d. P . In Example 10.3.1, find the bias of σ̂²_B = σ²(P̂n ) − BIAS_n^{(B)}(P̂n ). Assume that 0 < σ²(P ) < ∞. Compare your result to the bias −σ²(P )/n of the empirically bias corrected estimate.
4. Establish (10.3.4).
5. Establish (10.3.5).
6. In Section 10.3.3, show that if √n[µ̂n − µ(P )] tends to a Gaussian distribution and the difference between the bootstrap quantiles and population quantiles is of order OP (n^{−1/2}), then Efron's percentile bootstrap µ∗_{n[B(1−α)+1]} with B = ∞ is an asymptotic (1 − α) UCB.
7. Let K4 (P ) = EP X⁴ − 3(EP X²)². Show that EK4 (P̂n ) = K4 (P ) + O(n⁻¹).
8. The Mallows metric. Let F be the class of df's with ∫ x dF (x) = 0 and 0 < ∫ x² dF (x) < ∞. The Mallows distance ρ between F , G ∈ F is defined by

ρ²(F, G) = inf_P EP (X − Z)²,

where the inf is over bivariate probabilities P for (X, Z) with marginals F and G. A relevant result is Fréchet's theorem:

ρ²(F, G) = ∫₀¹ [F⁻¹(u) − G⁻¹(u)]² du.

Let →w denote weak convergence.
Mallows' (1972) lemmas:
1. ρ(Fk , G) → 0 iff Fk →w G and ∫ x² dFk (x) → ∫ x² dG(x).
2. If Fk = L(Σ_{i=1}^k ai Xi ) where X1 , . . . , Xk are independent with L(Xi ) = Fi ∈ F and Σ ai² = 1, then Fk ∈ F and ρ²(Fk , Φ) ≤ Σ ai² ρ²(Fi , Φ), where Φ is the N (0, 1) df.
(a) Show that ρ is a metric.
(b) Use Mallows' Lemma 2 above to show that if X1 , . . . , Xk and Z1 , . . . , Zk are i.i.d. from F and G, respectively, and if Fk , Gk are the df's of

Sk = k^{−1/2} Σ_{i=1}^k [Xi − E(Xi )],   Tk = k^{−1/2} Σ_{i=1}^k [Zi − E(Zi )],

then ρ(Fk , Gk ) ≤ ρ(F, G).
(c) Suppose X1 , . . . , Xn are i.i.d. as X ∼ F , where σ² = Var(X) < ∞. Show that n^{1/2}(X̄ ∗ − X̄) converges in law in probability to N (0, σ²).
260
Monte Carlo Methods
Chapter 10
Hint: By (b), the distance between L∗ (n^{1/2}(X̄ ∗ − X̄)) and L(n^{1/2}(X̄ − µ)) is bounded by ρ(F, F̂n ). We know that L(n^{1/2}(X̄ − µ)) −→ N (0, σ²) by the central limit theorem.
Remark. Let | · | denote the Euclidean metric. Bickel and Freedman (1981) define the Mallows metric on the class P of probabilities P on R^d with EP |X|² < ∞ as follows:

ρα (P, Q) = inf E{|X − Z|^α }^{1/α},

where 1 ≤ α < ∞ and the inf is over all joint distributions of (X, Z) with marginals P and Q. They show that ρ2 (Pk , P ) −→ 0 is equivalent to “Pk →w P and EPk |X|² −→ EP |X|²” and they show that if d = 1 and (X, Z) has marginals (F, G), then ρ1 (F, G) = ∫ |F (t) − G(t)| dt.
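By Fréchet's theorem, the Mallows distance between two empirical df's of the same sample size reduces to matching order statistics; the following is a minimal sketch of that computation (function name and test values are ours).

```python
import numpy as np

# Mallows distance rho_alpha between two equal-size empirical df's: Frechet's
# theorem turns the quantile integral into a sorted (order-statistic) matching.
def mallows(x, z, alpha=2.0):
    x, z = np.sort(np.asarray(x, float)), np.sort(np.asarray(z, float))
    assert len(x) == len(z)
    return (np.mean(np.abs(x - z) ** alpha)) ** (1.0 / alpha)

# F puts mass 1/2 on {0, 1} and G on {0, 2}:
# rho_2^2 = (|0-0|^2 + |1-2|^2)/2 = 1/2
print(mallows([0, 1], [0, 2]))     # sqrt(1/2) ≈ 0.7071
```

Taking alpha = 1 recovers the ρ1(F, G) = ∫|F(t) − G(t)| dt identity of the remark above.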
9. Establish Theorem 10.3.2 for d > 1.
Hint: Use the uniform SLLN and the multivariate Lindeberg-Feller theorem. See Appendix D.1.
10. Suppose X1 , . . . , Xn are i.i.d. N (µ, σ²). Set T = Σ_{i=1}^n (Xi − X̄)². Use available software to plot the exact and bootstrap distribution of T when n = 10 and σ = 1. Use B = 100 and B = 500.
11. Suppose Bn → ∞ as n → ∞. In Example 10.3.3, show that X̄ − d̂^{(Bn)}_{nα} is an asymptotic (1 − α) UCB.
12. In Example 10.3.4, show that the simultaneous confidence region F̂ (x) ± cn /√n has level (1 − α) for all distributions F .
Hint. By Example 7.1.6, √n|F̂ (x) − F (x)| converges in law to |W⁰(F (x))|. Note that sup_x |W⁰(F (x))| ≤ sup_u |W⁰(u)|.
13. Establish (10.3.26).
14. For the model of Example 10.3.5, show that n[max(X1 , . . . , Xn )−max(X1∗ , . . . , Xn∗ )]
does not converge in law in probability.
15. In Example 10.3.5 show how an appropriate choice of m(n) with m(n)/n → 0 and m(n) → ∞ in the m out of n bootstrap makes λ_{m(n)}(P ) = λ(P ) + o(1) and the “variance” λn (Pn ) − λn (P ) arbitrarily small.
16. Let X1 , . . . , Xn be i.i.d. as F and let Fbn denote the empirical distribution. Let Xn∗ =
(X1∗ , . . . , Xn∗ ) be a bootstrap sample drawn with replacement from Fbn .
(a) Suppose F has mean µ, unit variance, and finite third absolute moment. Define

Gn (x, F ) = P [√n(X̄ − µ) ≤ x].

Similarly, define the bootstrap counterpart G∗n (x, F̂n ) = P ∗ [√n(X̄ ∗ − X̄) ≤ x], where P ∗ denotes bootstrap probability given F̂n . Prove that as n → ∞,

P [ sup_x |G∗n (x, F̂n ) − Gn (x, F )| → 0 ] = 1.
Hint: You may use the Berry-Esséen theorem. See Section A.15.
17. Show that when θ̂ = X̄, the jackknife estimate of Var(θ̂) is unbiased.
18. Show that when θ̂ = n⁻¹ Σ_{i=1}^n (Xi − X̄)², the expected value of the jackknife bias estimate is E(BIAS_n^{(J)} ) = −(n − 1)⁻¹ σ².
19. Show that for linear statistics of the form

T (X) = a + n⁻¹ Σ_{i=1}^n h(Xi ),

the jackknife values T (x−i ), 1 ≤ i ≤ n, can be used to determine the value of T (x∗ ) for any bootstrap sample x∗ .
Hint: Set ti = T (x−i ) and solve the equations

ti = a + n⁻¹ Σ_{j≠i} hj , 1 ≤ i ≤ n,

for h1 , . . . , hn . Use this to find the value of T (x∗ ) for any bootstrap sample x∗ = (x∗1 , . . . , x∗n )ᵀ.
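The reconstruction in Problem 19 can be carried out directly. The sketch below assumes the natural leave-one-out convention t_i = a + (n−1)⁻¹ Σ_{j≠i} h(x_j) (the statistic recomputed on n − 1 points, with a known); under it, S = Σ_j h(x_j) satisfies S = Σ_i (t_i − a) and h(x_i) = S − (n−1)(t_i − a).

```python
import numpy as np

# Recover the kernel values h(x_i) from the jackknife values of a linear
# statistic, then evaluate T at an arbitrary bootstrap sample.
rng = np.random.default_rng(4)
n, a = 8, 2.0
x = rng.normal(size=n)
h = lambda u: u ** 2                                   # any h; treated as unknown

def T(sample):
    return a + np.mean(h(np.asarray(sample)))

t = np.array([T(np.delete(x, i)) for i in range(n)])   # jackknife values
S = (t - a).sum()                                      # S = sum_j h(x_j)
h_rec = S - (n - 1) * (t - a)                          # recovered h(x_i)

star = rng.integers(0, n, n)                           # a bootstrap sample x*
T_star_direct = T(x[star])
T_star_from_jack = a + h_rec[star].mean()
print(T_star_direct, T_star_from_jack)                 # identical
```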
20. Establish (10.3.12) formally.
Problems for Section 10.4
1. Assume the conditions of Appendix D.5. Show that X1 , . . . , Xn of Section 10.4.1 will be approximately distributed as p for M → ∞ provided (a) and (b) of Section 10.4.1 hold.
Hint: See Appendix D.5.
2. Show that a symmetric kernel K0 has the uniform distribution as its stationary distribution.
3. Under the conditions of Proposition 10.4.2, show that the joint distribution of (Um , Vm )
in a step of Gibbs sampler obeys detailed balance.
4. Consider the Markov chain with state space {1, 2} and kernel

K = ( a    1 − a
      1 − b    b ) .
Compute the upper bounds in Corollary 10.4.1.
5. In Example 10.4.2, show that the conditional distribution of σ⁻² given µ and X1 , . . . , Xn is Γ(p + n/2, λ), where λ is given by (10.4.14).
6. (a) Assume the conditions of Theorem 10.4.1. Suppose (X1 , X2 ) is drawn from the
distribution with joint probability K(x1 , x2 )p(x1 ). Let λ1 , . . . , λN denote the eigenvalues
of K ordered so that λ1 ≥ λ2 ≥ . . . ≥ λN . Show that

min{Corr(g(X1 ), g(X2 )) : g arbitrary} = min_{j≥2} λj = λN ≡ ρ− .
(b) Deduce that
max{Corr (g(X1 ), h(X2 )) : g, h arbitrary} = max{ρ+ , −ρ− } .
Hint: E[g(X1 )h(X2 )] = E[g(X1 )E(h(X2 )|X1 )].
7. Show that if the state space is finite and the transition matrix K satisfies detailed balance,
then K has equal right and left eigenvalues, all being real.
8. In Example 10.4.3, show that X^{(k+1)} ∼ N (0, ρ⁴ σk² + (1 − ρ⁴)).
9. Establish (10.4.17) using Lagrange multipliers.
10. Assume the conditions of Theorem 10.4.1. Show that

max_g corr(g(X1 ), g(Xk+1 )) = (ρ+ )^k .

Hint. Note that by Theorem B.1.1,

E[g(X1 )g(Xk+1 )] = E[g(X1 ) E(g(Xk+1 )|X1 )].   (10.6.1)

Let K be the transition matrix of a chain with state space {1, 2, . . . , N } and let pij (n) be the (i, j) element of K^n . Then use (10.6.1) to show that the eigenvalues of K^n are the nth powers of the eigenvalues of K.
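The hint's spectral fact is easy to check numerically (the particular 3-state matrix here is our example, chosen symmetric so its spectrum is real):

```python
import numpy as np

# The eigenvalues of K^n are the n-th powers of the eigenvalues of K.
K = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])      # symmetric, hence reversible w.r.t. uniform
n = 4
ev_Kn = np.linalg.eigvalsh(np.linalg.matrix_power(K, n))
ev_pow = np.sort(np.linalg.eigvalsh(K) ** n)
print(ev_Kn, ev_pow)                 # identical up to rounding
```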
11. Prove Theorem 10.4.2.
Hint. Using Problem 10.4.10,

Var(Σ_{b=1}^B g(Xb )) = Σ_{b=1}^B Var g(Xb ) + 2 Σ_{a=1}^B Σ_{b<a} Cov(g(Xa ), g(Xb ))
= B Var g(X1 ) + 2 Var g(X1 ) Σ_{a=1}^B Σ_{b=1}^{a−1} ρ+^{a−b} ≍ B Var g(X1 ) (1 + ρ+ )/(1 − ρ+ ).

12. Show that the Markov chain transition matrix

K = ( a    1 − a
      1 − b    b )

has second largest eigenvalue λ2 < 0 iff a + b < 1.
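For the 2-state kernel the spectrum is available in closed form: the eigenvalues are 1 and a + b − 1, so λ2 < 0 exactly when a + b < 1. A numerical confirmation:

```python
import numpy as np

# Second eigenvalue of K = [[a, 1-a], [1-b, b]]: equals a + b - 1.
def lambda2(a, b):
    K = np.array([[a, 1 - a], [1 - b, b]])
    return np.sort(np.linalg.eigvals(K).real)[0]

for a, b in [(0.2, 0.3), (0.7, 0.8), (0.5, 0.5)]:
    print(a, b, lambda2(a, b), a + b - 1)   # last two columns agree
```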
Problems for Section 10.5
1. Show that the expression (10.5.11) tends to infinity at an exponential rate as n → ∞.
That is, let ∆n denote the right hand side of (10.5.11). Then show that n−1 log ∆n → c for
some c > 0.
2. In the transformation model (10.5.6) with f0 the N (0, σ0²) density, set µi = θᵀ zi , σ0² = Var(ε). Assume that max |µi |/σ0 → 0 as n → ∞ and that σ0⁻² Σ_{i=1}^n µi² ≤ M for some fixed M > 0. Show that the variance (10.5.11) is O(1).
3. Show that the model (10.5.6) with f0 a Gumbel density is equivalent to the Cox proportional hazard model.
Hint. See Problem 1.1.13.
4. Implement the likelihood sampler (10.5.12) of Example 10.5.3 on the computer for the
transformation model (10.5.6) when f0 is the N (0, 1) density. Describe the algorithm.
Generate Y1 , . . . , Y50 from the model where a(t) = log t, E(a(Y )) = 1 + Zθ, θ = 1,
Z ∼ U (0, 1). Compute L1 (θ) and L2 (θ) for θ = .5 + δ, δ = 0, 0.1, 0.2, . . . , 1. Plot L1 (θ)
and L2 (θ) at these points.
5. Use the EM-MCMC approach of Example 10.5.2 to estimate θ in the transformation
model of Example 10.5.3. Describe an algorithm.
6. Suppose that Y and ξ are independent continuous random variables, and that (10.5.7)
holds. Show that if ξ ∼ G, then the model for Y can be written as (10.5.6) for some F0 .
Hint. Λ(y|z, ξ) = ξr(θ, z)Λ(y) where Λ(y|z, ξ) = − log[1 − P (Y ≤ y|z, ξ)] and
Λ(y) = − log[1 − H(y)] for some continuous df H. Solve for P (Y ≤ y|z, ξ).
10.7 Notes
Note for Section 10.1
(1) We speak of generating independent observations by machine. This is not possible since
computers are deterministic. But many methods have been devised for generating “pseudo
random” numbers (U(0, 1) observations) which behave as if they were random for most
usual purposes. For more on this topic see, for instance, Ripley (1987).
Note for Section 10.3
(1) Efron’s presentation in Efron (1979) is much less abstract and focussed on the estimation of bias and variance, while this presentation follows more that of Bickel and Freedman
(1981). But the basic idea is in Efron’s paper. See also Efron and Tibshirani (1993) and
Bickel, Götze and van Zwet (1997).
Note for Section 10.4
Another generalization of rejective sampling, called perfect sampling, uses (Y1 , Y2 , . . .) to generate X1 , X2 , . . . , XB which, though not necessarily independent, do have distribution p; it is also coming into use.
Chapter 11
NONPARAMETRIC INFERENCE FOR FUNCTIONS OF ONE VARIABLE
11.1 Introduction
We saw in Chapter 9 that to construct estimates of low dimensional Euclidean parameters in semiparametric models we need to estimate infinite dimensional parameters such as
curves. Examples are densities and derivatives of densities on the one hand and conditional
expectations on the other. These quantities, which will be considered in this chapter and
the next, are not defined for discrete distributions and, so, as we saw in Section I.5 we need
to consider regularized approximations to these parameters before we can construct empirical plug-in estimates. This is not a consequence of our needing to estimate functions but
rather of the irregular nature of these parameters. For instance, we can estimate distribution
functions using plug-in estimates without regularizing.
To indicate the usefulness of the estimates in this chapter, we cite Example 9.3.1 where
efficiency (and even consistency for arbitrary underlying distributions) requires us to estimate the derivatives of the logs of the densities. Similarly, in Example 9.1.10 we need
estimates of the regression functions u → E(Y |U = u) and u → E(Z|U = u). A very
important class of situations where we are led to function estimation is nonparametric classification or, more generally, prediction. This class of problems is also loosely categorized
under the topic “machine learning.” Such topics, including the estimation of the nonparametric regression µ(z) = E(Y |Z = z) and some of its semiparametric versions when z is
d-dimensional, will be covered in Chapter 12.
We introduce and analyze in this chapter nonparametric estimates of a continuous
case density f (x), x ∈ R, and estimates of the nonparametric regression curve µ(x) =
E(Y |X = x) for a continuous bivariate pair (X, Y ) ∈ R2 . Key methods and asymptotic
expressions are introduced. We assume all distributions are continuous since discrete cases
are “easy.”
11.2 Convolution Kernel Estimates on R
There are many uses for density estimates in visualization and, more generally, exploratory
data analysis. For instance suppose we are contemplating using a gamma distribution to
model the distribution of the failure times of a new type of computer chip. Comparing a
nonparametric density estimate visually with a fitted gamma density is one way of exploring
possible models. For more details, see Silverman (1986), Scott (1992), Wand and Jones
(1995), and Loader (1999). The methods of this chapter can also be used to estimate the
hazard rate discussed in Section 9.1.
We turn to the simplest methods of density estimation for real valued random variables
and more generally vector valued variables: histogram and kernel density estimation.
Consider first the problem we discussed in Example 7.1.5, estimation of the density (in
the continuous sense) f of a real valued random variable X on the basis of X1 , . . . , Xn ,
which are i.i.d. as X. The models P we shall consider are nonparametric, in the sense that
any probability distribution may be approximated weakly by members of P, but members
of P have densities which range over a set F which is restricted in some way. For instance,
we may consider Fr = {f : ‖f ⁽ʳ⁾‖∞ < ∞} for r ≥ 1. More generally we can consider Sobolev spaces,

F = {f : ∫ |f ⁽ʲ⁾(t)| dt < ∞, 0 ≤ j ≤ r},
and, more generally still, spaces which permit discontinuities (Besov spaces; see Daubechies
(1993) for instance). When f is not continuous it is not unambiguously defined. For simplicity we shall assume in this section that f is continuous and positive on intervals of the
form (−∞, ∞), (−∞, b], [a, ∞) or a closed interval [a, b] with f (x) = 0 for x ∉ [a, b].
This interval S(f ) is called the support of f and the upper and lower bounds are called
boundary points. It is relatively difficult to find good estimates of f (x) when x is at or near
one of the boundary points.
Within this context we consider local problems such as estimating f (x) at points x and
global ones such as estimating f (·) as a function. We will occasionally use the notation
f (P ; x) for f (x) and f (P ; ·) for the function to remind ourselves that these are parameters
defined on P. Recall that the histogram estimator of Section I.5 is obtained by approximating f (P ; x) by h−1 P (I(x)) where for integers j, Ij = (jh, (j + 1)h], −∞ < j < ∞,
and I(x) = Ij(x) is the unique interval containing x; and then plugging in the empirical
probability Pb for P . Now fh (Pb ; ·) is a discontinuous function and intuitively should be
improvable if we assume f is smooth. To demonstrate this we introduce a famous easily
analyzed family of estimates called the kernel estimates. These can be approached by taking histogram estimates and recentering them at each x. Specifically, approximate f (x) by
the continuous function
fh (x) ≡ fh (P ; x) ≡ (1/(2h)) P (x − h, x + h]   (11.2.1)

and plug in P̂ to get the density estimate f̂h (x) = fh (P̂ ; x).
Let F denote the distribution function of X and set Jh (u) = h⁻¹ J(u/h) with J(u) = (1/2) 1[−1 ≤ u ≤ 1]. Then it is easy to see that, if S(f ) = R (Problem 11.2.1),

fh (P ; x) = ∫ Jh (x − z) dF (z),   f̂h (x) = fh (P̂ ; x) = ∫ Jh (x − z) dF̂ (z).

That is, fh (·) is the convolution of a uniform density Jh (·) on [−h, h] with f (·) and f̂h (x) is the empirical plug-in estimate of fh (x).
Convolving f (·) with a density K gives a function with at least the smoothness of K, and the resulting estimate can be expected to be better if f itself is as smooth as K. More generally, for any function K(·) with ∫ K(u)du = 1, we consider regularizations of f of the form

fh (x) = fh (F ; x) = ∫ Kh (x − z) dF (z) = E Kh (x − Z)   (11.2.2)

where Z ∼ F , Kh (u) = h⁻¹ K(u/h) and we identify P with the distribution function F . If K(·) is a density and S(f ) = R, then (11.2.2) is the density of X + hV where V ∼ K(·) and V is independent of X. See Problem 11.2.2. The plug-in estimate of f is then

f̂h (x) = fh (F̂ ; x) = n⁻¹ Σ_{i=1}^n Kh (x − Xi )   (11.2.3)
where F̂ is the empirical df based on X1 , . . . , Xn i.i.d. from F .
A function K(u) with ∫ K(u)du = 1 is called a kernel function. Thus all densities are
kernels, but kernels need not be non-negative. We shall sometimes refer to the estimates
fh (Fb , x) with fh (F ; x) of the form (11.2.2) as convolution kernel estimates to distinguish
them from generalized kernel estimates of the form

f (P̂ , x) = ∫ K(x, z) dP̂ (z).   (11.2.4)

The condition ∫ K(x, z) dx = 1 and K(x, z) ≥ 0 for all z is required for f (P̂ , ·) to be a density.
In addition to the uniform kernel J(u), common kernels are the N (0, 1) density ϕ(u), the Epanechnikov kernel,

KE (u) = (3/4)(1 − u²) 1[−1 ≤ u ≤ 1],

and the quartic kernel,

KQ (u) = (15/16)(1 − u²)² 1[−1 ≤ u ≤ 1].
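The estimate (11.2.3) is a few lines of code; a minimal sketch with the Epanechnikov and Gaussian kernels (data, sample size, and the bandwidth rule, taken from Section 11.2.3, are our choices for illustration):

```python
import numpy as np

# Convolution kernel density estimate (11.2.3):
# f_hat_h(x) = (nh)^{-1} sum_i K((x - X_i)/h).
def K_epan(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) <= 1)

def K_gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def f_hat(x, data, h, K=K_epan):
    x = np.atleast_1d(np.asarray(x, float))
    u = (x[:, None] - data[None, :]) / h
    return K(u).mean(axis=1) / h

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, 2000)
h = 1.06 * data.std() * len(data) ** (-0.2)   # Gaussian reference rule
print(f_hat([0.0], data, h))                  # close to phi(0) ≈ 0.399
```

Because KE is a density, f̂h integrates to one and is itself a density for every h > 0.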
This can be put in the general regularization framework of Chapter 9 by considering
the sequence of parameters fh (P ; x) which tend to f (x) as h ↓ 0 but which for fixed h are
estimated by the “plug-in” method where P is replaced by the empirical probability Pb.
Let the support of K be the closure of the set {u : K(u) ≠ 0}; then
Lemma 11.2.1. Let fh (·) be as defined in (11.2.2). If either of the following holds:
(a) f is continuous at x, x is in interior of S(f ), and K has compact support,
(b) f is continuous at x and bounded on R,
then, as h → 0,
fh (x) → f (x).
Proof. Note that ∫ K(u)du = 1 permits us to write

fh (x) − f (x) = ∫ (1/h) K((x − z)/h) [f (z) − f (x)] dz
= ∫ K(u)[f (x − hu) − f (x)] du   (11.2.5)
by changing variable to u = (x − z)/h. Since, for all M ,
sup{|f (x − hu) − f (x)| : |u| ≤ M } → 0
as h → 0 and because, under (a), K(u) = 0, |u| > M , for some M > 0, our claim follows
for (a). For (b), see Problem 11.2.3.
✷
We proceed to study the plug-in estimate fbh (x) given by
f̂h (x) ≡ ∫ Kh (x − z) dF̂ (z) = (1/(nh)) Σ_{i=1}^n K((x − Xi )/h).   (11.2.6)
Lemma 11.2.2. If K is bounded and has compact support, then if h = hn → 0 and nhn → ∞ as n → ∞,

f̂h (x) →P f (x)

at all x which are points of continuity of f .
Proof. We can show that E fbh (x) → f (x) and Varfbh (x) → 0 as n → ∞. Consistency
follows. See Problem 11.2.4.
11.2.1 Uniform Local Behavior of Kernel Density Estimates
We analyze the bias and variance of fbh (x) at a point x under the assumption of a bounded
third derivative f ′′′ . This strong assumption gives uniformity of our convergence statements
in f ∈ F0 ⊂ F and in x restricted to some finite interval [c, d] contained in the support of f .
Without such uniformity our asymptotic analyses comparing the performance of estimates
are suspect since depending on f , the n for which an approximation to precision ε is valid
might be arbitrarily great. Let
mj (K) = ∫ u^j K(u) du,   (11.2.7)
and for some constant M ∈ (0, ∞), let
F3 = {f : S(f ) = [a, b], f three times differentiable on (a, b), ‖f ‴‖∞ ≤ M }.
Proposition 11.2.1. Let K be a kernel with m1 (K) = 0, mj (K) finite, 1 ≤ j ≤ 3. Then,
as h → 0, uniformly in x ∈ [c, d] ⊂ (a, b), f ∈ F3 , and n,
E f̂h (x) = fh (x) = f (x) + (1/2) m2 (K) f ″(x) h² + O(h³).   (11.2.8)
Proof. Evidently, E f̂h (x) = fh (x). For x ∈ [c, d], Taylor expand f (x − hu) about h = 0 to get, if |u| ≤ M1 for some constant M1 ∈ (0, ∞),

f (x − hu) = f (x) − huf ′(x) + (1/2) h²u² f ″(x) − (1/6) h³u³ f ‴(x∗ )

for |x − x∗ | ≤ |hu| ≤ hM1 . Next substitute this expression into (11.2.5) to obtain the result. ✷
Note that mj (K) is finite for all j if K has compact support, and that m1 (K) = 0 if K
is symmetric about 0. Also note that the bias does not depend on n, but the result holds if,
in particular, we let h = hn → 0 as n → ∞. As with histogram estimates, the bias tends
to 0 as h → 0, but we now have a faster rate and uniformity of convergence of the bias
based on the assumption of a bounded third derivative. If we assume more derivatives we
can control the bias further if we do not restrict to nonnegative K. See Problem 11.2.5.
Next we turn to the variance of f̂h . Let

νj (K) = ∫ u^j K²(u) du,   ν(K) = ν0 (K),   (11.2.9)

and for M > 0, let F1 = {f : S(f ) = [a, b], f is differentiable on (a, b) and ‖f ′‖∞ ≤ M }.
Proposition 11.2.2. Assume that νj (K) < ∞, j = 0, 1. Then, as n → ∞, uniformly for
x ∈ [c, d] ⊂ (a, b), f ∈ F1 ,
Var f̂h (x) = (nh)⁻¹ ν(K) f (x) + O(n⁻¹).   (11.2.10)
Proof. By (11.2.6),

Var f̂h (x) = (nh)⁻² n Var K((x − X)/h)
= n⁻¹ h⁻² [ ∫ K²((x − z)/h) f (z) dz − h² fh²(x) ].

Change variable to u = (x − z)/h and rewrite the first term as

(nh)⁻¹ ∫ K²(u) f (x − hu) du = (nh)⁻¹ f (x) ν(K) + (nh)⁻¹ ∫ K²(u)(f (x − hu) − f (x)) du.   (11.2.11)

Argue as in Proposition 11.2.1 to complete the proof (Problem 11.2.6). ✷
As with histogram estimates, Varfbh (x) is proportional to f (x) and is small if the density
is small. Further, the variance is proportional to (nh)−1 and tends to 0 as n → ∞ for fixed
h and to ∞ as h → 0 for fixed n.
Since, by Proposition 1.3.1, MSE = VAR + (BIAS)2 , we have established
Corollary 11.2.1. Under the conditions of Propositions 11.2.1 and 11.2.2, uniformly for
x ∈ [c, d] ⊂ (a, b), f ∈ F3 , as n → ∞ and h → 0
MSE[f̂h (x)] = (nh)⁻¹ ν(K) f (x) + (1/4) h⁴ m2²(K) [f ″(x)]² + O(n⁻¹ + h⁵).   (11.2.12)
The bias-variance tradeoff is clear: Small h leads to small bias and large variance while
large h has the opposite effect. By choosing h = hn such that hn → 0 and nhn → ∞,
we have consistency of fbh (x). The sum of the first two terms is sometimes called the
asymptotic MSE[f̂h (x)] (AMSE[f̂h (x)]). Set A(h) = AMSE[f̂h (x)]. Then by solving A′(h) = 0 for h we find that when f (x) > 0 and 0 < [f ″(x)]² < ∞, the minimizer of A(h) is of the form hopt = c n^{−1/5} for some c(f ) > 0 (Problem 11.2.10). Substituting hopt in (11.2.12) gives

inf_h MSE[f̂h (x)] ≍ n^{−4/5},

a rate that is slower than the rate n⁻¹ for estimation of regular parameters.
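The AMSE minimization is elementary calculus: A(h) = c1/(nh) + c2 h⁴ is minimized at h = (c1/(4 c2 n))^{1/5}. A quick grid search confirms the closed form (the constants c1, c2 here are arbitrary illustrative values):

```python
import numpy as np

# Minimize A(h) = c1/(n h) + c2 h^4; closed form h* = (c1 / (4 c2 n))^{1/5}.
c1, c2, n = 0.6, 0.9, 1000.0
A = lambda h: c1 / (n * h) + c2 * h ** 4
h_closed = (c1 / (4 * c2 * n)) ** 0.2

hs = np.linspace(1e-3, 0.5, 200000)
h_grid = hs[np.argmin(A(hs))]
print(h_closed, h_grid)    # agree to grid resolution; both of order n^{-1/5}
```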
11.2.2 Global Behavior of Convolution Kernel Estimates
We can measure the distance ρ(f, g) between two curves f and g in many ways, e.g. ‖f − g‖2 ≡ (∫ (f − g)²)^{1/2}, ‖f − g‖p ≡ (∫ |f − g|^p )^{1/p}, p ≥ 1, and ‖f − g‖∞ ≡ sup |f (x) − g(x)|. For most types of density estimates, all of these, other than ‖f − g‖∞ , behave in the same way qualitatively in terms of rate of convergence to 0. We focus on ‖f − g‖2². Then the loss, using f̂ , is the Integrated Squared Error ∫_{−∞}^{∞} (f̂ − f )²(x) dx and the risk is the Integrated MSE,

IMSE(f̂ , f ) = ∫_{−∞}^{∞} E(f̂h (x) − f (x))² dx   (11.2.13)
= ∫_{−∞}^{∞} [E f̂h (x) − f (x)]² dx + ∫_{−∞}^{∞} Var f̂h (x) dx.
For IMSE we invoke slightly stronger conditions than for MSE. For some constant M > 0, let

F3^I = { f ∈ F3 : ∫_a^b [f ″]²(x) dx < ∞, ∫_a^b |f ‴|²(x) dx ≤ M }.
Theorem 11.2.1. Suppose the conditions of Corollary 11.2.1 hold save that F3 is replaced by F3^I . Then, uniformly for f ∈ F3^I ,

IMSE(f̂h , f ) = (nh)⁻¹ ν(K) + (h⁴/4) m2²(K) ∫_{−∞}^{∞} [f ″]²(x) dx + O(n⁻¹ + h⁵).   (11.2.14)
Proof. We use Laplace’s form of the remainder in the Taylor expansion of f (x − hu):
Z
(hu)3 1
(hu)2 ′′
′
f (x) −
(1 − λ)2 f ′′′ (x − λhu)dλ.
f (x − hu) = f (x) − huf (x) +
2
2
0
R
Then, by (11.2.5) and uK(u) = 0,
E fbh (x) − f (x) =
1 2
h m2 (K)f ′′ (x)
(11.2.15)
2
Z
Z
h3 ∞ 1 3
−
u K(u)(1 − λ)2 f ′′′ (x − λhu)dλdu.
2 −∞ 0
The square of the integral term in (11.2.15) is bounded by

{ ∫_{−∞}^{∞} u³K(u) ∫₀¹ (1 − λ)² |f ‴(x − λhu)| dλ du }²

which, by the inequality EU ≤ E^{1/2}(U²) applied to U = |f ‴(x − Λhu)|, with Λ having density 3(1 − λ)² on (0, 1), is bounded by

(1/3) { ∫_{−∞}^{∞} |u³K(u)| [ ∫₀¹ 3(1 − λ)² [f ‴(x − λhu)]² dλ ]^{1/2} du }²   (11.2.16)

which, because (1/3) ∫₀¹ 3(1 − λ)² [f ‴(x − λhu)]² dλ ≤ ∫_{−∞}^{∞} [f ‴(t)]² dt, in turn is bounded by

(1/3) ( ∫_{−∞}^{∞} |u³K(u)| du )² ∫_{−∞}^{∞} |f ‴(x)|² dx.   (11.2.17)

It follows from (11.2.15) that the integrated squared bias is (1/4) h⁴ m2² ∫ [f ″(x)]² dx + O(h⁵). Next, using (11.2.11) (Problem 11.2.11),

∫_{−∞}^{∞} Var f̂h (x) dx = (nh)⁻¹ ∫_{−∞}^{∞} K²(u) du + O(n⁻¹)   (11.2.18)

and the result follows. ✷
Note that the bias-variance tradeoff for IMSE is the same as for MSE and that the first
two terms of (11.2.14) are of the form A(h) = c1 (nh)−1 + c2 h4 . To minimize A(h), we
set the derivative equal to zero and find (Problem 11.2.12)
hopt = {ν(K)/[m2²(K) ν(f ″)]}^{1/5} n^{−1/5}   (11.2.19)

provided 0 < ν(f ″) = ∫ [f ″(x)]² dx < ∞. Substituting this in (11.2.14) gives

inf_{h>0} IMSE[f̂ (·)] = (5/4) {m2²(K) ν⁴(K) ν(f ″)}^{1/5} n^{−4/5} + o(n^{−4/5}).   (11.2.20)

That is, as with MSE, assuming a well behaved f ‴, the order of uniform convergence over the family F3 is n^{−4/5}.
11.2.3 Performance and Bandwidth Choice
As we have noted, an immediate observation from the preceding results is that the order
of convergence of MSE and IMSE is strictly worse than the n−1 we have seen in the case
of regular parameters. This is not a feature of a local vs. global focus nor of curve vs.
Euclidean parameter estimation since the distribution function is a regular parameter.
In fact, performance depends quite specifically on the type of f we consider. We have
already seen (Section I.5) that with histogram estimates if we only assume 0 < |f ′ (z)| ≤
M < ∞ for z in a neighborhood of x the fastest pointwise convergence locally is n−2/3 and
this continues to be true locally with kernel estimates. Moreover the optimal bandwidth is
of the form cn−1/3 . On the other hand, if we specify greater smoothness for f , for instance,
0 < |f (2k+1) (z)| ≤ M < ∞, k > 2, for z in a neighborhood of x we can use K which
are sometimes negative to achieve the rate n−2k/(2k+1) using bandwidth cn−1/(2k+1) . See
Problem 11.2.9.
The major difficulty that has to be faced is the choice of the bandwidth hn . The trouble
is that even if we are willing to accept the order of magnitude n−1/5 (see (11.2.19)) dictated by the amount of smoothness for f that we are willing to assume, the optimal c in
hopt = c n^{−1/5} for both MSE and IMSE depends on f ″. For instance, from (11.2.19), hopt in this case depends on ∫ [f ″(x)]² dx which, unfortunately, is harder to estimate than f and requires more smoothness assumptions. We are caught in a no win situation. If we believe that more than two derivatives of f are available we should use bandwidth of larger order than n^{−1/5} and obtain a final rate better than n^{−4/5}, but that requires the estimation of ∫ [f ⁽⁴⁾(x)]² dx. However, if f ⁽⁴⁾(·) exists, we should use an even larger order bandwidth,
and so on. There are a number of ways out. The one we consider soundest is cross validation which does not require a smoothness assumption. We discuss this method further in
Section 12.5.
A simple bandwidth choice, if we are willing to assume a bounded f ″, is to employ the hopt appropriate for a reference distribution (Bickel and Doksum (1977)), such as the Gaussian distribution, replacing σ by the sample standard deviation s. For F = N (µ, σ²), we find ν(f ″) = 3/(8√π σ⁵) and hopt is c(K) n^{−1/5} σ with c(K) equal to 1.84, 2.34, and 1.06 for the U[−1, 1], Epanechnikov, and N (0, 1) kernels, respectively. With this choice f̂h (x) will achieve the optimal rate of convergence for f ∈ F3 provided 0 < [f ″(x)]² < ∞ and 0 < σ² < ∞.
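The reference constants c(K) = {ν(K)/[m2²(K) · 3/(8√π)]}^{1/5} can be recomputed directly; the sketch below does so by numerical integration (the tabulated values ≈ 1.84 and 2.34 come out of this formula, with the Gaussian kernel case reducing to (4/3)^{1/5} ≈ 1.06):

```python
import numpy as np

# c(K) = { nu(K) / (m2^2(K) * 3/(8 sqrt(pi))) }^{1/5}, with
# nu(K) = int K^2(u) du and m2(K) = int u^2 K(u) du on [-1, 1].
u = np.linspace(-1, 1, 200001)

def c_of_K(Ku):
    du = u[1] - u[0]
    nu = np.sum(Ku ** 2) * du
    m2 = np.sum(u ** 2 * Ku) * du
    return (nu / (m2 ** 2 * 3 / (8 * np.sqrt(np.pi)))) ** 0.2

uniform = np.full_like(u, 0.5)          # U[-1, 1] kernel
epan = 0.75 * (1 - u ** 2)              # Epanechnikov kernel
print(c_of_K(uniform), c_of_K(epan))    # ≈ 1.84 and ≈ 2.34
# Gaussian kernel: nu = 1/(2 sqrt(pi)), m2 = 1, so c = (4/3)^{1/5}
print((4.0 / 3.0) ** 0.2)               # ≈ 1.06
```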
11.2.4 Discussion of Convolution Kernel Estimates
Perhaps the strongest point in favor of convolution kernel estimates is that they are easy to
analyze theoretically and that they can easily be extended to the multivariate case. However,
the convolution kernel estimate (11.2.6) has a weakness if f has compact support [a, b]
because for x near the boundary points a and b, (11.2.6) with a symmetric kernel will lead
to biases by trying to account for impossible observations. A skew kernel which reduces the
bias near a will worsen the bias near b, and vice versa. A possible solution is to permit the
shape of the kernel to vary with x, that is, use a generalized kernel K(x, z) as introduced in
(11.2.4). Put another way, in replacing the density f of X with that of X + hV , we will not
limit ourselves to X, V independent. A simple method leading to such solutions is based
on using a local parametric approximation to f (x) and using the plug-in method to estimate
the parameters in this approximation. We consider this approach in Section 11.3.1. Other
methods such as penalization can also lead to estimates insensitive to the support.
Summary. We consider estimation of a continuous case density f (·) and because this is
an irregular parameter where empirical plug-in estimates are not directly available, we first
approximate f (·) by a regular parameter fh (·), h > 0, that converges to f (·) as h ↓ 0.
This regular parameter is the convolution of f (·) with a function Kh (u) = h⁻¹ K(u/h) where the kernel K satisfies ∫ K(t)dt = 1. The constant h is called the bandwidth and is
an example of a tuning parameter. By applying the empirical plug-in principle to fh (·) we
obtain the convolution kernel estimate fbh (·). We develop uniform asymptotic properties of
the bias, variance, and mean squared error (MSE) of fbh (x) first for fixed x and next for the
integrated mean squared error (IMSE). We find that the MSE and IMSE tend to zero and
fbh (·) is consistent provided nh → ∞ as n → ∞. Assuming a bounded third derivative for
MSE and a bounded integrated absolute third derivative for IMSE, we find that the fastest
possible rate for uniform convergence of MSE and IMSE to zero is n^{−4/5}, which is obtained when h = O(n^{−1/5}). In this section we discuss a simple approach which asks for optimality
at a reference density. We postpone an effective data based method for selecting h called
cross validation until Section 12.5.
11.3 Minimum Contrast Estimates: Reducing Boundary Bias
In this section we extend the minimum contrast approach of Volume I to density estimation.
This extended contrast approach also applies to other curves and to a variety of classes of
regular approximating functions to these curves. Here we show that this approach produces
density estimates with desirable properties.
Let F be a nonparametric class of densities (e.g. densities that satisfy the conditions
of Corollary 11.2.1), and let Fs denote a class of regular approximating functions fs (·, β),
β ∈ Rp , which contains the constant functions. Here s ∈ S is a tuning constant such as h
whose choice was discussed above and we will return to later. We introduce a discrepancy
Dt,x (f (·), fs (· ; β)) > 0
between f (x) and fs (x, β) where t ∈ T is another tuning constant. Typically, either S or
T is empty. For fixed s and t, we assume that there is a unique minimizer βs,t (F, x) of D,
and we assume that for each x in the interior (a, b) of the support S(f ) of f
inf{Dt,x (f (·), fs (· ; βs,t (F, x))) : s ∈ S, t ∈ T } = 0.
We also assume that βs,t (F, x) depends on F only, that is, not on f . Our estimate of β is β̂ = βs,t (F̂ , x) and the estimate of f (x) is fs (x, β̂). Note that we will get the same solution if we use the global discrepancy

Dt (f (·), fs (· ; β)) = EF Dt,X (f (·), fs (· ; β))
because local minimization implies global minimization, cf. Proposition 3.2.1.
Example 11.3.1. Linear kernel estimates as minimum contrast estimates. We first consider
x fixed and assuming f is continuous we want an approximation f (z; β) that is close to
f (z) for z in a neighborhood of x, where β = β(x) depends on x. One way is to select a
kernel K ≥ 0 and to find β(x) that minimizes the local discrepancy
Dh,x (f (·), f (· ; β)) = (1/W (x)) ∫_{S(f )} [f (z) − f (z; β)]² Kh (z − x) dz   (11.3.1)
= EQh {[f (Z) − f (Z; β(X))]² | X = x}

where (X, Z) has joint df Qh with density qh (x, z) = f (x) qh (z|x), for x, z ∈ S(f ), and

qh (z|x) = Kh (z − x) 1[a ≤ z ≤ b] / W (x)

with

W (x) = ∫_a^b Kh (z − x) dz = ∫_{(a−x)/h}^{(b−x)/h} K(u) du.
That is, qh (z|x) is the conditional density of Z = X + hV given that X = x and that
X +hV is in [a, b], where V ∼ K. Note that the global Dh measures how well f (Z; β(X))
predicts f (Z).
By equating the derivative of Dh,x with respect to β to zero, we formally obtain a minimizer βh (F, x) and then the estimators β̂h (x) = βh (F̂ , x) and f̂h (x) = f (x; β̂h (x)). We can use the empirical plug-in principle because minimizing Dh,x is equivalent to minimizing

R(F ; β) ≡ −2 ∫_{S(f )} f (z; β) Kh (z − x) dF (z) + ∫_{S(f )} f ²(z; β) Kh (z − x) dz.
We call R(Fb ; β) a contrast function and fbh (x) a minimum contrast estimate, cf. Section
2.1.1.
As a simple illustration, take {f (z; β), β ∈ Rp } to be the class of constants {β, β ∈ R}.
In this case we find, provided EF Kh (X − x) > 0 and x ± h ∈ S(f ), that the unique
minimizer of D is
\[
\beta_h(F, x) = E_F K_h(X - x), \qquad x \pm h \in S(f).
\]
When K is symmetric, this coincides with the convolution kernel approach (11.2.2) when
x ± h ∈ S(f ), but when x ± h is not in S(f ), that is, x is in the boundary regions of S(f ),
then the minimum contrast estimate is consistent (Problem 11.3.2) and the convolution
estimate is not (Problem 11.3.1). A key feature of the minimum discrepancy approach is
that it automatically leads to boundary corrections. We next elaborate on this remark. ✷
Boundary points
Suppose S(f ) = [a, b] with a and b finite and 2h < b−a. We say that x is in the interior
region of S(f ) if x − h and x + h both are in [a, b]. In the asymptotic theory of Section
11.2, we could argue that for any x in (a, b) we can choose h = hn small enough so that
this condition holds. However, for finite n as well as when n → ∞ the kernel estimator
(11.2.6) may be severely biased for x in the boundary regions [a, a + h) and (b − h, b].
In fact for xh of the form a + λh or b − λh, 0 < λ < 1, the kernel estimator (11.2.6) is
asymptotically biased when K has compact support (Problem 11.3.1).
However, if we use the minimizer of (11.3.1) with f (z; β) a constant, and we assume
that K is a density with support [−1, 1], then this asymptotic bias of the plug-in estimate
disappears. We can write the unique solution to dD/dβ = 0 for x ∈ (a, b) as (Problem
11.3.3)
\[
f_h(x) = E_Q[f(Z) \mid X = x] = \int_a^b K_h(z - x)\,dF(z) \Big/ \int_{(a-x)/h}^{(b-x)/h} K(u)\,du, \tag{11.3.2}
\]
276 nonparametric inference for functions of one variable Chapter 11
which leads to a generalized kernel estimate of the form (11.2.4). For symmetric K, it corresponds to the density of a random variable of the form X + hV, V ∼ K, where V depends on X, i.e., we no longer have a convolution as in (11.2.2).
Let $\hat f_h(x)$ be the plug-in estimate using (11.3.2). It can be shown (Problem 11.3.2) that for $x_h$ in the boundary region, $\mathrm{Bias}\, \hat f_h(x_h) = O(h)$. The rate O(h) at which the bias tends
to zero is slower than the rate O(h2 ) obtained for interior points. A remedy is to use a
locally linear approximation.
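The boundary correction in (11.3.2) can be sketched numerically. The following is an illustrative sketch, not the authors' code: the function names, the Epanechnikov kernel, and the uniform data are our own choices. At a boundary point of a uniform density the plain convolution estimate loses roughly half the kernel mass, while renormalizing by W(x) restores consistency.

```python
import numpy as np

def epan(u):
    # Epanechnikov kernel: a symmetric density supported on [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def epan_cdf(t):
    # closed form of \int_{-1}^{t} K(u) du for the Epanechnikov kernel
    t = np.clip(t, -1.0, 1.0)
    return 0.75 * (t - t**3 / 3) + 0.5

def fhat_convolution(x, data, h):
    # plain convolution kernel estimate; biased near the boundary
    return np.mean(epan((x - data) / h)) / h

def fhat_corrected(x, data, h, a, b):
    # boundary-corrected estimate (11.3.2): divide by the kernel mass
    # W(x) that actually falls inside the support [a, b]
    W = epan_cdf((b - x) / h) - epan_cdf((a - x) / h)
    return fhat_convolution(x, data, h) / W

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, 20000)       # true density f = 1 on [0, 1]
h = 0.1
plain = fhat_convolution(0.0, data, h)    # roughly 1/2 at the left boundary
corrected = fhat_corrected(0.0, data, h, 0.0, 1.0)   # close to 1
```

At x = 0 only half the kernel overlaps S(f), so W(0) = 1/2 and the correction exactly doubles the raw estimate, in line with the remark that the minimum discrepancy approach automatically produces boundary corrections.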
Example 11.3.1 continued: Locally linear and polynomial estimates. Suppose S(f ) =
[a, b] with a and b finite. Let K(u) be a nonnegative symmetric kernel with support [−1, 1].
We find α = α(x, F ) and β = β(x, F ) that minimize the local discrepancy
\[
\int_a^b \{f(z) - [\alpha + \beta(z - x)]\}^2 K_h(z - x)\,dz \tag{11.3.3}
\]
from the locally linear fit g(z − x) = α + β(z − x) to f(z). The locally linear approximation to f(x) is
\[
f_h(x) = g(0) = \alpha(x, F) + \beta(x, F)(x - x) = \alpha(x, F).
\]
Setting the derivative of the quadratic in (11.3.3) with respect to α and β equal to zero
and simplifying, α(x, F ) and β(x, F ) will, for x ∈ (a, b), be the solutions to the normal
equations
\[
\int_a^b K_h(z - x)\,dF(z) = \alpha m_0(x, h) + \beta h m_1(x, h) \tag{11.3.4}
\]
\[
\int_a^b (z - x) K_h(z - x)\,dF(z) = \alpha h m_1(x, h) + \beta h^2 m_2(x, h)
\]
where
\[
m_j(x, h) \equiv \int_{(a-x)/h}^{(b-x)/h} u^j K(u)\,du, \qquad j = 0, 1, 2.
\]
By the argument above $f_h(x, \hat F) \equiv \alpha(x, \hat F)$ is an estimate of f(x). If a + h ≤ x ≤ b − h, or, equivalently, a − x ≤ −h, b − x ≥ h, then $m_1(x, h) = 0$ by the symmetry of K and K vanishing outside [−1, 1]. Thus, on this range,
\[
f_h(x, \hat F) = \int K_h(x - z)\,d\hat F(z), \tag{11.3.5}
\]
the usual convolution kernel estimate. However, for a ≤ x < a + h and b − h < x ≤ b we obtain, solving (11.3.4) (see Problem 11.3.5),
\[
f_h(x, \hat F) = \int_a^b K_h(x, z)\,d\hat F(z) \tag{11.3.6}
\]
where
\[
K_h(x, z) = \frac{m_2(x, h) - m_1(x, h)(z - x)/h}{m_0(x, h)\, m_2(x, h) - m_1^2(x, h)}\, K_h(z - x). \tag{11.3.7}
\]
Note that $K_h(x, z)$ is not a convolution kernel, and is sometimes negative.
For M > 0, let $f \in \mathcal F_2 \equiv \{f \text{ on } [a, b] : |f''(x)| \le M \text{ for all } x \in [a, b]\}$, where f″ is one sided at a and b.
Proposition 11.3.1. Suppose that the conditions of Propositions 11.2.1 and 11.2.2 hold.
Then, uniformly for f ∈ F2 and x ∈ [c, d] ⊂ (a, b),
\[
E f_h(x, \hat F) = f(x) + O(h^2), \quad \text{as } h \to 0, \tag{11.3.8}
\]
\[
\operatorname{Var} f_h(x, \hat F) = O((nh)^{-1}), \quad \text{as } n \to \infty. \tag{11.3.9}
\]
The proof of these assertions and the asymptotic MSE of $f_h(x, \hat F)$ are left to Problem 11.3.5. Note also that since β(x) = f′(x), $\beta(x, \hat F)$ gives us an estimate of f′(x), which is given in Problem 11.3.5 as well. We give extensions to estimates based on local polynomial approximations in Problem 11.3.6.
Remark 11.3.1. The conditional expectation version of (11.3.1), and Theorem 1.4.3 with (f(Z), Z − X) in place of (Y, Z) and with L(f(Z), Z − X | X = x) in place of L(Y, Z), give
\[
\alpha(x, F) = E[f(Z) \mid X = x] - \beta(x, F)\, E(Z - X \mid X = x),
\]
\[
\beta(x, F) = \frac{\operatorname{Cov}(f(Z), Z - X \mid X = x)}{\operatorname{Var}(Z - X \mid X = x)}.
\]
That is, α(x; F) is the value of the optimal MSPE linear predictor of f(Z) at z = x when L(Z | X = x) is $q_h(z \mid x)$. This representation gives a different derivation of $f_h(x; \hat F)$ (Problem 11.3.7) and shows the connection to prediction.
Remark 11.3.2. The locally linear estimate of Example 11.3.1 can be computed as follows: Let $\{x^{(1)}, \ldots, x^{(g)}\}$ denote a set of grid points in $[\hat a, \hat b]$ where $\hat a = x_{(1)}$ and $\hat b = x_{(n)}$. For a given $x^{(k)}$ determine whether it is in the left or right boundary region or the interior region of $[\hat a, \hat b]$. Thus if $x^{(k)}$ is in the left boundary region, $x^{(k)} = \hat a + \lambda h$, 0 < λ < 1, determine $\lambda^{(k)} = (x^{(k)} - \hat a)/h$, and use $K_h(x, z)$ with
\[
m_j(x, h) = \int_{-1}^{\lambda^{(k)}} u^j K(u)\,du.
\]
A point $x^{(k)}$ in the central part uses the kernel K(u), and the right boundary point $x^{(k)}$ uses $K_h(x, z)$ with $m_j(x, h) = \int_{-\gamma^{(k)}}^{1} u^j K(u)\,du$ and $\gamma^{(k)} = (\hat b - x^{(k)})/h$. This procedure
yields $\{(x^{(k)}, \hat f(x^{(k)})) : k = 1, \ldots, g\}$. These points are then connected using a smoothing routine and the final output is a smooth curve.
Example 11.3.2. Series expansions. Suppose f is in $L_2(a, b)$, for −∞ ≤ a ≤ b ≤ ∞. Thus, $\int_a^b f^2(x)\,dx < \infty$. Let $\{B_j(\cdot) : j \ge 1\}$ be a complete orthonormal basis of $L_2$,
\[
\int_a^b B_j(x) B_k(x)\,dx = 1(j = k),
\]
and let
\[
\beta_j(F) = \int_{-\infty}^{\infty} f(x) B_j(x)\,dx.
\]
Then,
\[
f(x) = \sum_{j=1}^{\infty} \beta_j(F) B_j(x) \tag{11.3.10}
\]
in the sense that $\int_a^b \big[f(x) - \sum_{j=1}^{m} \beta_j(F) B_j(x)\big]^2 dx \to 0$ as m → ∞ ($L_2$ convergence).
For example, if a = −π, b = π, we can consider the Fourier basis,
\[
B_k(x) = \pi^{-1/2} \cos kx, \quad k > 0; \qquad B_k(x) = (2\pi)^{-1/2}, \quad k = 0; \qquad B_k(x) = \pi^{-1/2} \sin kx, \quad k < 0,
\]
since $\int_{-\pi}^{\pi} \cos^2 kx\,dx = \pi$. Another example is the Legendre polynomials. If a = −∞, b = ∞, we can consider the Hermite functions defined by
\[
B_k(x) = \varphi(x) H_k(x) = \frac{(-1)^k}{c_k} \frac{d^k}{dx^k} \varphi(x).
\]
Here φ is the N(0, 1) density, $H_k$ are the Hermite polynomials, and $c_k$ makes
\[
\int H_k^2(x) \varphi^2(x)\,dx = 1
\]
(see also Section 5.3). Other important bases are the wavelet bases — see Donoho and Johnstone (1995) for instance. For all these cases, we can estimate $\beta_k(F)$ by empirical plug-in,
\[
\hat\beta_k = \beta_k(\hat F) = \frac{1}{n} \sum_{i=1}^{n} B_k(X_i). \tag{11.3.11}
\]
Unfortunately, while, by Plancherel's theorem,
\[
\int f^2(x)\,dx = \sum_{k=1}^{\infty} \beta_k^2(F) < \infty,
\]
$\beta_k(\hat F) \not\to 0$ as k → ∞ in general. Thus, the direct plug-in approach for f(x) as in (11.3.10) is impossible. On the other hand, minimizing the discrepancy
\[
D(f(\cdot), f_m(\cdot, \beta)) = \int \Big[f(x) - \sum_{j=0}^{m} \beta_j B_j(x)\Big]^2 dx
\]
yields $\beta_k(\hat F)$ as in (11.3.11) and the preceding plug-in estimates $\hat\beta_1, \hat\beta_2, \ldots$, as well as the density estimate
\[
\hat f_m(x) = \sum_{j=0}^{m} \hat\beta_j B_j(x) = \int K_m(x, z)\,d\hat F(z)
\]
where
\[
K_m(x, z) = \sum_{j=0}^{m} B_j(x) B_j(z).
\]
It may be shown that if [a, b] is compact and $B_0$ is constant, then $\int K_m(x, z)\,dx = 1$ (Problem 11.3.8). Analysis of such estimates is not as simple as that of convolution kernels and they are necessarily sometimes negative. We refer to Silverman (1986) and Viollaz (1989) for further discussion. This idea becomes much more important and attractive in connection with regression. Note that, as with convolution kernel estimates, a crucial "tuning parameter," m in this instance, needs to be chosen. ✷
Kernel estimates, despite their formal simplicity, have been computationally unattractive since a separate summation has to be carried out for each x in a grid of x-values. However, advances in computer technology and the availability of software (Loader, 1999, Chapter 5) have reduced this disadvantage. Orthogonal series estimates do not share this problem since the $B_j(\cdot)$ can be stored once and for all and m ≪ n but, as we have indicated, are hard to analyze and can be negative. There are many nonlinear (nongeneralized kernel) methods of density estimation which are competitive with kernel density estimates. We consider these in a general discussion of regularization in the next section.
Summary. We consider a systematic approach to constructing regular approximations to f(·) which consists of minimizing a discrepancy between f(·) and a parametric approximation f(·; β), β ∈ $R^p$, to f(·). The minimizer will be regular if a local quadratic discrepancy is used. The empirical version of the minimizer is called a minimum contrast estimate. We illustrate this approach using f(·; β) ≡ β, β ∈ R, and linear approximations f(·; β) to f(·) over local neighborhoods of x. We show that this approach yields estimates of f with reduced bias for boundary points $x_h$ of the form a + λh and b − λh where 0 < λ < 1 and a and b are the left and right boundaries of the support S(f) of f. In particular, by plugging the empirical distribution into the linear minimizer of a local quadratic discrepancy based on bandwidth h, we obtain an estimate with bias of order O(h²) for all x ∈ (a, b) including boundary points. The variance of this estimate is of order $O((nh)^{-1})$. We also
briefly consider estimates based on approximating f (·) by a truncated series expansion and
show how minimizing a quadratic discrepancy leads to a natural series expansion estimate
of f (·).
11.4 Regularization and Nonlinear Density Estimates

11.4.1 Regularization and Roughness Penalties
What we call regularization is related in spirit to the notion of regularization introduced by Tikhonov (1963) in applied mathematics. His problem was, essentially, given an integral equation of the form Kf = g with g ∈ K(F₀) and F₀ a "nice" set of functions (not necessarily densities), to find numerical approximations $f_N \in F_0$ to f which converged to f as N → ∞. F₀ being "nice" corresponded to smoothness, for instance f twice differentiable on (0, 1) with $J(f) \equiv \int_0^1 [f''(x)]^2\,dx < \infty$. Simply discretizing K to a matrix $K_N$ and g to a vector $g_N$, defining $f_N = K_N^{-1} g_N$, and extrapolating $f_N$ to $f_N \in F_0$ would not work because the matrix $K_N$ was too ill conditioned. Tikhonov proposed minimizing $|K_N f - g_N|^2 + \lambda J_N(f)$ for λ = λ_N → 0 as N → ∞, where $J_N(f)$ penalizes f for an extrapolation with J(f) too large. In particular, Tikhonov considered the natural numerical approximation to J(f),
\[
J_N(f) = N^{-1} \sum_{i=1}^{N} \Big[ f\Big(\frac{i+2}{N}\Big) - 2 f\Big(\frac{i+1}{N}\Big) + f\Big(\frac{i}{N}\Big) \Big]^2
\]
. Passing to the limit, as N → ∞, for fixed λ > 0 we see that we have defined $f_\lambda$ as the solution of the problem: Minimize
\[
\int_0^1 (Kf - g)^2 + \lambda \int_0^1 [f''(x)]^2\,dx.
\]
Under weak conditions on K, discretizations $f_{\lambda N}$ of this problem converge to $f_\lambda$ in strong senses for fixed λ > 0, and $f_\lambda \to f$ as λ → 0. By letting λ_N → 0 sufficiently slowly, we have $f_{\lambda_N} \to f$.
As we indicated in Sections I.5 and 9.1, we think of regularization as follows: We are given a parameter θ(P) such as θ(P) ≡ f(·), the continuous case density, which is defined on a model P₀, for instance, all P with twice differentiable densities f such that $\int |f''(x)|^2\,dx < \infty$, but not on M ≡ all distributions. Since we cannot plug the empirical distribution $\hat P$ into θ(P), we construct a sequence $\theta_s(P)$, s ∈ R, such that
(i) $\theta_s$ is defined on M.
(ii) $\theta_s(P) \to \theta(P)$ on P₀ as s → ∞.
Then use $\theta_{\hat s}(\hat P)$ with the tuning parameter $\hat s$ chosen using the data.
We have already noted that if θ(P) = f(·), then the parameter corresponding to convolution kernel estimation is
\[
f_s(x) \equiv s \int K(s[z - x])\,dP(z)
\]
where s ≡ 1/h, h → 0. As we also noted, the log likelihood $\int \log f(x)\,d\hat P(x)$ is not maximized for P ∈ P₀ since the supremum is infinite and is formally achieved as f(x) looks more and more like $n^{-1} \sum_{i=1}^{n} \delta(x - X_i)$, where δ is the Dirac function. What Tikhonov regularization, as he introduced it, naturally corresponds to in this case is penalized maximum likelihood introduced by Good and Gaskins (1978) and Tapia and Thompson (1978).
Penalized maximum likelihood (PML)
Let F be a space of densities and let J : F → R⁺ be a "roughness penalty." In analogy to Tikhonov we define a population parameter f(·, F) as the minimizer $f_\lambda$ of
\[
-\int \log f_\lambda(x)\,dF(x) + \lambda J(f_\lambda). \tag{11.4.1}
\]
The corresponding estimate is just $f_\lambda(\cdot, \hat F)$. Good and Gaskins (1978) took, for f on R, $J(f) = \int \{[f'(x)]^2 / f(x)\}\,dx$. A computationally simpler proposal, to take $J(f) = \int [f''(x)]^2\,dx$, consistent with the Tikhonov penalty and with proposals made by Wahba (1969) for nonparametric regression, was made by de Montricher, Tapia and Thompson (1975). See also Wahba (1990). The solution to (11.4.1) for $J(f) = \int [f''(x)]^2\,dx$, if we restrict f to densities on [a, b] with $f^{(j)}(a) = f^{(j)}(b) = 0$, j = 0, 1, is a quadratic spline with one continuous derivative and knots at a, $x_{(1)} < \cdots < x_{(n)}$, b. Here a spline is a continuous piecewise polynomial function with the knots $t_1 < \cdots < t_m$ being the points where the polynomial changes. That is, the spline f is a different polynomial of the same degree over each $(t_k, t_{k+1}]$, and $f^{(j)}(t_k^-) = f^{(j)}(t_k^+)$, k = 1, . . . , m, for some j ≥ 0. See de Boor (1978) for an extensive discussion. For proofs of these claims see Tapia and Thompson (1978), p. 106. We do not pursue this subject further here.
11.4.2 Sieves. Machine Learning. Log Density Estimation
A general method of regularization, introduced in Section 9.1, is the following: Let F be a general (NP) class of functions. Then,
(i) Find a sequence of parametric models (usually but not necessarily nested), $\mathcal F_k \equiv \{f_k(\cdot, \theta_k) : \theta_k \in R^{d_k}\}$, such that every f ∈ F is the limit as k → ∞ in some norm of $f_k(\cdot, \theta_k(f))$ (for some $\theta_k(f) \in R^{d_k}$), where $f_k \in \mathcal F_k$.
(ii) Select a method of estimating θ k from the data, under the assumption that fk ∈
Fk holds. Typically this method is maximum likelihood. If specifying fk does not
completely specify a model, such as a regression framework where the distribution
of the error is not assumed to be of specified parametric form, least squares or some
other contrast function criteria may be used.
(iii) Determine, using a data based criterion or otherwise, a suitable sequence kn which
tends to ∞ as n → ∞.
(iv) Act as if $\mathcal F_{k_n}$ is true. That is, estimate f(·) by $f_k(\cdot, \hat\theta_k)$, where $\hat\theta_k$ is the estimate from (ii).
When $f_k$ determines a model and the estimation method is maximum likelihood then this method, for fixed x, reduces to replacing f(x) by the regular parameter $f_k(x, \theta_k(f))$ where
\[
\theta_k(f) = \arg\max_{\theta} \int \log f_k(x, \theta_k)\,dF(x). \tag{11.4.2}
\]
Recall Section 2.2.2 where we showed that in the i.i.d. case the MLE is the plug-in estimate of the θ that minimizes the Kullback-Leibler divergence between f(·) and $f_k(\cdot, \theta)$. In our new terminology, $f_k(x, \hat\theta_k)$ is a minimum discrepancy estimate in the sense of Section 11.3.
There are two critical choices involved, that of {Fk } and that of kn . We shall discuss
the latter choice further in Section 12.5. The choice of {Fk } is clearly problem dependent.
We want Fk to be a good approximation for the type of f we expect. The method of
estimation is also dependent on the problem, and two other considerations.
(a) Ease of computation
Methods which are in closed form or involve optimization of convex functions over convex
sets are preferred for obvious reasons.
(b) Efficiency for the family $\mathcal F_k$.
The effectiveness of this criterion depends on how closely f can be approximated by a member of $\mathcal F_k$ since estimates may have bias of order much greater than $1/\sqrt{n}$ for $f \notin \mathcal F_k$. Since no density of interest typically belongs to any $\mathcal F_k$, this criterion is secondary.
Vapnik (1999) has argued for a third consideration.
(c) Empirical optimization of a loss function: Machine Learning
Vapnik (1998, 1999), as did Wald (1950) before him, has argued that one should start with a loss function, attached to some feature of P, such as classification with 0-1 loss or estimation of E(Y|X) with $L_2$ loss. Vapnik then requires us to specify a class of decision procedures. Given this framework, he insists one should minimize a natural empirical estimate of the risk. This point of view can be viewed as a modification of the method of sieves with model selection and fitting method combined. It differs from the classical Wald (1950) framework in two ways. First no model is specified but only a class of decision procedures. Second, empirical risk minimization has to be followed. For instance, having our loss function for choosing g when f is true as the Kullback-Leibler divergence and specifying $\mathcal F_k$ would, to Vapnik, be equivalent to choosing the class of procedures in which f is estimated by $f(\cdot, \theta_k)$, $\theta_k \in R^k$. The risk would be just $-\int \log\big(f(x, \theta_k)/f(x)\big) f(x)\,dx$. The empirical risk which is to be minimized is $-\int \log\big(f(x, \theta_k)/f(x)\big)\,d\hat P(x)$, which is equivalent to maximum likelihood since $\int \log f(x)\,d\hat P(x)$ is fixed.
If, however, we specify loss as
\[
\int \big[f(x) - f(x, \theta_k)\big]^2 dx = \int f^2(x)\,dx - 2 \int f(x, \theta_k) f(x)\,dx + \int f^2(x, \theta_k)\,dx
\]
we would be led to minimize $-2\int f(x, \theta_k)\,d\hat P(x) + \int f^2(x, \theta_k)\,dx$, a method suggested by Rudemo (1982). Note that Vapnik's criterion contradicts (b) unless Kullback-Leibler loss is used because the theory of Section 6.2 tells us that if $P \in \mathcal F_k$ (or presumably if it is close), then maximum likelihood should be strictly better than the Rudemo method even using Rudemo's loss! See Problem 11.4.1.
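To make the contrast concrete, here is a small sketch of our own (the family {N(θ, 1)} and the grid search are assumptions for illustration): both the empirical Kullback-Leibler risk (maximum likelihood) and Rudemo's empirical $L_2$ risk are minimized over θ. For this family $\int f^2(x, \theta)\,dx = 1/(2\sqrt\pi)$, free of θ, so the $L_2$ criterion reduces to maximizing the average fitted density at the data.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(1.0, 1.0, 2000)          # the model {N(theta, 1)} holds

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def kl_empirical_risk(theta):
    # empirical version of -\int log f(x, theta) dP(x): minimized by the MLE
    return -np.mean(np.log(phi(data - theta)))

def l2_empirical_risk(theta):
    # Rudemo's criterion: -2 \int f dP-hat + \int f^2 dx;
    # for N(theta, 1) the second term equals 1/(2*sqrt(pi)), free of theta
    return -2 * np.mean(phi(data - theta)) + 1 / (2 * np.sqrt(np.pi))

grid = np.linspace(0.0, 2.0, 401)
theta_ml = grid[np.argmin([kl_empirical_risk(t) for t in grid])]
theta_l2 = grid[np.argmin([l2_empirical_risk(t) for t in grid])]
```

Both minimizers are consistent for θ = 1 here, but, as (b) predicts when the model holds, the maximum likelihood solution is the more efficient of the two.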
Log density estimation
One sieve $\{\mathcal F_k\}$ that satisfies both efficiency and relative computational ease is based on expanding the log of the density. Let B₁, B₂, . . . be a sequence of functions on (a, b) ⊂ R. Suppose that the $B_j$ are such that finite linear combinations can approximate an "arbitrary" function. Examples of B₁, B₂, . . . include orthogonal polynomials with respect to different measures such as the Hermite, Laguerre, or Legendre systems, splines, trigonometric polynomials, and wavelets.
Define
\[
f_k(x, \eta) = \exp\Big\{\sum_{j=1}^{k} \eta_j B_j(x) - A(\eta)\Big\} h(x) \tag{11.4.3}
\]
where $\eta \in E_k$, the natural parameter space of this canonical exponential family, which we assume open. Assume that we can approximate any f ∈ F, a "nonparametric" set of densities on (a, b), as follows,
\[
\inf\{\rho(f(\cdot, \eta), f) : \eta \in E_k\} \to 0 \tag{11.4.4}
\]
as k → ∞, where ρ is a "distance" such as the Kullback-Leibler discrepancy.
P
Let X1 , . . . , Xn be i.i.d. as X ∼ F , f = F ′ , and set B̄j (X) = n−1 nj=1 Bj (Xi ),
pk (x, η) =
n
Y
fk (xi , η).
i=1
By Corollary 1.6.2 and Theorem 2.3.1, we have
Proposition 11.4.1. If Ek is open and 1, B1 (x), . . . , Bk (x) are linearly independent, then
b which maximizes log pk (x, η) exists and satislog pk (x, η) is strictly concave. A unique η
fies Ȧ(b
η ) = B̄(x), where B̄(x) = (B̄1 (x), . . . , B̄k (x))T .
Recall from Section 2.2.2 that this $\hat\eta$ is the plug-in estimate of the minimizer of the Kullback-Leibler divergence between f(·) and $f_k(\cdot, \eta)$. The estimate of f(x) is now
\[
\hat f_k(x) = f_k(x, \hat\eta).
\]
A key property of $\hat f_k$ not satisfied by many of the methods we have considered so far is that $\hat f_k \ge 0$ by definition. Here $\hat\eta$ can be computed using standard optimization algorithms; see Section 2.6, for example. Also note that $[\log \hat f_k(x)]'$, which has a simple expression, is an estimate of $[\log f_k(x)]'$ as required in Example 9.3.1.
If the $B_j$ are a spline basis, Stone and Koo (1986) and Stone, Hansen, Kooperberg and Truong (2007) have established further properties of $\hat f_k(x)$. We are left here too with a choice of tuning parameters, in this case the number and location of the knots. We turn to this in Section 12.5.
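A minimal numerical sketch of (11.4.3) follows (ours, not the authors': the monomial basis B₁(x) = x, B₂(x) = x², h(x) ≡ 1 on (0, 1), the quadrature grid, and the gradient-ascent loop are all illustrative assumptions). The strictly concave log likelihood is maximized by driving the gradient $\bar B - \dot A(\eta)$ to zero, i.e. solving $\dot A(\hat\eta) = \bar B$.

```python
import numpy as np

def Bmat(x):
    # basis on (0, 1): B_1(x) = x, B_2(x) = x^2 (an illustrative choice)
    return np.column_stack([x, x**2])

xs = np.linspace(0, 1, 2001)[1:-1]        # quadrature grid for A(eta)
dx = xs[1] - xs[0]

def A(eta):
    # log normalizer of the exponential family (11.4.3), with h(x) = 1
    return np.log(np.sum(np.exp(Bmat(xs) @ eta)) * dx)

def A_dot(eta):
    # A-dot(eta) = E_eta B(X), computed by quadrature
    w = np.exp(Bmat(xs) @ eta - A(eta)) * dx
    return Bmat(xs).T @ w

rng = np.random.default_rng(5)
data = rng.beta(2, 1, 4000)               # true density f(x) = 2x on (0, 1)
Bbar = Bmat(data).mean(axis=0)

eta = np.zeros(2)
for _ in range(20000):                    # gradient ascent on the concave likelihood
    eta = eta + 0.5 * (Bbar - A_dot(eta))

def fk_hat(x):
    # the log density estimate: automatically nonnegative and normalized
    return np.exp(Bmat(np.atleast_1d(x)) @ eta - A(eta))
```

The fitted density is positive and integrates to one by construction, which is exactly the key property of $\hat f_k$ noted above; the fixed point matches the sample basis means $\bar B$.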
11.4.3 Nearest Neighbor Density Estimates
We close with a simple nonlinear method that plays an important role in classification.
Consider the convolution kernel estimate with K corresponding to U(−1, 1). Then
\[
\hat f_h(x) = \frac{\hat F(x + h) - \hat F(x - h)}{2h}. \tag{11.4.5}
\]
It seems reasonable to use larger h if f(x) is small. One way of achieving this is to note that for h close to 0 and n large,
\[
\hat F(x + h) - \hat F(x - h) \simeq 2 f(x) h,
\]
and then choose h so that $\hat F(x + h) - \hat F((x - h)^-)$ is the same for each x. This notion corresponds to finding h such that
\[
\hat F(x + h) - \hat F((x - h)^-) = k/n
\]
and choosing $\hat h = \hat h(k, x)$ as the smallest such h. Let $|x - X|_{(1)} < \cdots < |x - X|_{(n)}$ denote $|x - X_i|$, 1 ≤ i ≤ n, ordered; then
\[
\hat h(k, x) = |x - X_{i_k}|
\]
where $i_k$ is defined by $|x - X_{i_k}| = |x - X|_{(k)}$.
The kth-nearest neighbour (kNN) density estimate is
\[
\tilde f_k(x) = \frac{\hat F\big(x + \hat h(k, x)\big) - \hat F\big((x - \hat h(k, x))^-\big)}{2 \hat h(k, x)} = \frac{k/n}{2 \hat h(k, x)}. \tag{11.4.6}
\]
Evidently, $\{\hat h(k, x)\}$ can also be used as a family of bandwidth choices for the other kernels.
Unfortunately, it is easy to see that if S(f) = R, $p\lim_{|x| \to \infty} \tilde f_k(x) \ne 0$ and thus kNN estimates are properly defined and renormalizable as densities only when S(f) is a finite interval (Problems 11.4.2 and 11.4.3). Using the arguments used to establish Corollary 11.2.1, we can show (see Problem 11.4.4)
Theorem 11.4.1. If f has support [a, b] for finite a and b, if $k = k_n \to \infty$, $n^{-1} k_n \to 0$ as n → ∞, if x ∈ (a, b), and if
\[
\sup\{|f''(x)| : x \in [a, b]\} \le M
\]
for some M > 0, then, uniformly in f, $MSE\, \tilde f_k(x) = O(k^{-1})$. Thus $\tilde f_k$ is consistent.
Summary. We compare our approach to regularization with regularization based on roughness penalties. Our approach involves approximating a parameter θ(P), where $\theta(\hat P)$ is undefined, with a parameter $\theta_s(P)$, where $\theta_s(\hat P)$ is defined, and using $\theta_s(\hat P)$ to estimate θ(P).
This approach applies in particular when θ(P) is a smooth function such as f(·). In statistics, the roughness penalty approach to estimating a smooth function f(·) that does not allow a plug-in estimate involves finding the minimizer $f_\lambda(\cdot) = f_\lambda(\cdot\,; \lambda)$ of an expression of the form
\[
\int \rho\big(f_\lambda(x), f(x)\big)\,dx + \lambda J(f_\lambda)
\]
where the first term measures the discrepancy between $f_\lambda$ and f, $J(f_\lambda)$ is a roughness penalty such as $\int |f_\lambda''(x)|^2\,dx$ that increases with the "roughness" of $f_\lambda$, and λ > 0 is a tuning parameter. The estimate $f_\lambda(\cdot\,; \hat F)$ of f(·) is typically a piecewise polynomial called a spline. This approach includes the penalized maximum likelihood approach obtained by setting $\rho(f_\lambda(x), f(x)) = -f(x) \log f_\lambda(x)$. Next we discuss the application of the method of sieves to the regularization and estimation of an irregular function f(·) and discuss its properties in terms of Wald decision theory and Vapnik learning theory. We apply the sieve approach to log density estimation and find that by using results from Section 1.6 on convexity in exponential family models, we obtain simple estimates of the log density and its derivative. Finally we introduce kth nearest neighbor estimates that are obtained from convolution kernel estimates with $K(u) = \frac12 1[|u| \le 1]$ by choosing h so that there are k observations in the interval [x − h, x + h].
11.5 Confidence Regions
We have until now focussed on estimation. For testing the hypothesis that f is a certain parametric shape it is, in some ways (see Section 9.5), better to use goodness-of-fit tests based on the empirical df $\hat F$, rather than on $\hat f_h$. However, by plotting confidence regions for f based on $\hat f_h$, we get insights into the possible shapes for f provided the sample size is large. We will indicate how to find such regions using an asymptotic result of Bickel and Rosenblatt (1973) and the bootstrap. We motivate this approach by developing the regions from student t-type intervals.
First we consider confidence intervals for the density f(x) with x a fixed point of interest, then we show how to widen these intervals to obtain confidence regions valid for an interval of x's. We use ideas from Section 5.3. Set
\[
Z_i = K_h(x - X_i), \quad \theta = E(Z_i), \quad \tau = \operatorname{Var}(Z_i).
\]
Then $\hat f_h(x) = \bar Z$, $\theta = f_h(x)$, and to get a confidence interval for $f_h(x)$ we can use the pivot
\[
\frac{\sqrt{n}(\bar Z - \theta)}{s_z} \tag{11.5.1}
\]
where $s_z^2$ is the sample variance of Z₁, . . . , Z_n. The asymptotic distribution of this pivot is N(0, 1) (see (5.3.18)). Moreover, Examples 4.4.1 and 5.3.3 suggest that its distribution can be closely approximated by the $T_{n-1}$ distribution. Thus
\[
\hat f_h(x) \pm |t_{n-1}|_{1-\alpha}\, s_z / \sqrt{n} \tag{11.5.2}
\]
defines an approximate level (1 − α) confidence interval for $f_h(x)$. If $h = cn^{-a}$ with $\frac15 < a < 1$, then (11.2.8) implies that the interval is asymptotically valid for f(x) as well under the conditions of Corollary 11.2.1, because
\[
\sqrt{nh}\,[\hat f_h(x) - f(x)] = \sqrt{nh}\,\{[\hat f_h(x) - f_h(x)] + [f_h(x) - f(x)]\}
\]
and $\sqrt{nh}\,[f_h(x) - f(x)] \to 0$ by (11.2.8) as n → ∞. However, $h = cn^{-a}$, $\frac15 < a < 1$, excludes the asymptotically optimal choice $h = cn^{-1/5}$.
If instead of estimating τ we approximate it using (11.2.10), we obtain the pivot
\[
T(f_h(x), \hat f_h(x)) = \frac{\sqrt{nh}\,[\hat f_h(x) - f_h(x)]}{\sqrt{\nu(K) f_h(x)}} \tag{11.5.3}
\]
which by the central limit theorem and Slutsky's theorem has an asymptotic N(0, 1) distribution (Problem 11.5.1). Solving
\[
|T(f_h(x), \hat f_h(x))| \le |z|_{1-\alpha}
\]
for $f_h(x)$ is equivalent to solving $\theta^2 - (2\hat\theta + c_\alpha^2)\theta + \hat\theta^2 \le 0$ for θ where $\theta = f_h(x)$, $\hat\theta = \hat f_h(x)$, and $c_\alpha = |z|_{1-\alpha} \sqrt{\nu(K)/nh}$. The solution (see Example 4.4.3) gives the approximate level (1 − α) confidence interval $\hat f_-(x) \le f_h(x) \le \hat f_+(x)$ where
\[
\hat f_\pm(x) = \hat f_h(x) + c_\alpha^2/2 \pm c_\alpha \sqrt{\hat f_h(x) + c_\alpha^2/4}. \tag{11.5.4}
\]
Under the conditions of Proposition 11.2.2, this band is valid asymptotically for $f_h(x)$, and for f(x) if $h = cn^{-a}$, $\frac15 < a < 1$.
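The interval (11.5.4) is straightforward to compute. The sketch below is ours: the Gaussian kernel with $\nu(K) = \int K^2(u)\,du = 1/(2\sqrt\pi)$, the standard normal data, and the choice $h = n^{-1/4}$ (an admissible $cn^{-a}$ with $\frac15 < a < 1$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
data = rng.normal(0, 1, n)
x = 0.0
h = n ** (-0.25)                    # h = c n^{-a} with 1/5 < a < 1

K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
nu_K = 1 / (2 * np.sqrt(np.pi))    # nu(K) for the Gaussian kernel

f_hat = np.mean(K((x - data) / h)) / h
z = 1.959964                        # |z|_{1-alpha} for alpha = 0.05
c = z * np.sqrt(nu_K / (n * h))
lower = f_hat + c**2 / 2 - c * np.sqrt(f_hat + c**2 / 4)    # (11.5.4)
upper = f_hat + c**2 / 2 + c * np.sqrt(f_hat + c**2 / 4)
```

A pleasant feature of solving the quadratic rather than symmetrizing: the lower endpoint is never negative, since it equals $\big(\sqrt{\hat f_h(x) + c_\alpha^2/4} - c_\alpha/2\big)^2$.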
Asymptotic regions
To modify the above intervals to have probability (1 − α) of containing f(x) for all x in an interval I = [r, s], one approach is to widen the preceding intervals to attain asymptotic level (1 − α). Thus we could use the α quantile $k_\alpha$ of the asymptotic distribution (Bickel and Rosenblatt (1973)) of
\[
T = \sup_{x \in I} |T(f_h(x), \hat f_h(x))|.
\]
This would lead to a simultaneous confidence region $\hat f_\pm(x)$, x ∈ [r, s], that covers $f_h(x)$ (and, under certain conditions, f(x)) with probability tending to (1 − α) as n → ∞ when $c_\alpha$ in (11.5.4) is replaced by $\tilde c_\alpha = \tilde k_\alpha \sqrt{\nu(K)/nh}$, where $\tilde k_\alpha$ is the asymptotic Bickel–Rosenblatt (1973) approximation to $k_\alpha$. Here $\tilde k_\alpha$ can be a poor approximation to $k_\alpha$ for small and moderate n. Unfortunately, the ordinary bootstrap is, we believe, not an alternative but the m out of n bootstrap may be — see Bickel and Sakov (2008).
Summary. Confidence regions for a curve provide a visualization of the possible shapes of that curve provided the sample size is large. We first give approximate confidence intervals for $f_h(x)$ and f(x) for a fixed x by using pivots based on treating $\hat f_h(x)$ as a sample average. Then we show how to construct (1 − α) confidence regions for the curves $f_h(\cdot)$ and f(·) by taking the maximum over x of the fixed x pivot and using the asymptotic distribution of this new pivot to construct the boundaries of the (1 − α) confidence region.
11.6 Nonparametric Regression for One Covariate

11.6.1 Estimation Principles
This book has considered many experiments where the task is to model and analyze statistically the relationship between a response variable Y and a vector of covariates X. In this section we consider nonparametric regression in the case of a one-dimensional covariate X. The one-dimensional procedures are often used as building blocks for the d-dimensional case, and some of the one-dimensional methods we consider will have straightforward d-dimensional extensions. Moreover in some applications there is one variable X of principal interest while the others are confounding variables. In this case Y will be the response after the effect of the other variables has been subtracted out or controlled for.
Throughout the book we have considered various models connecting Y with X. In Section 2.2.1 we argued that for x restricted to a narrow range, a model linear in x may provide a good fit to μ(x) = E(Y|X = x). We now will allow μ(x) to take a general shape over the range X of X. One very effective technique is the one suggested in Section 2.2.1: Do linear fits for X restricted to local neighborhoods of multiple x's, then piece together the results from the multiple fits. The result is local linear regression.
Principles for the construction of nonparametric regression estimates include local averaging, local parametric modelling such as local linear regression, global modelling, and penalized modelling. Or we can make a somewhat different breakdown:
(i) Methods based on plugging the empirical probability $\hat P$ into a regular approximation to μ(x) in the formula
\[
\mu(x) = E(Y \mid X = x) = \int y\, f(y \mid x)\,dy
\]
where f(· | ·) is the density of (Y | X = x).
(ii) Methods based on viewing E(Y|X) as the minimizer of E(Y − g(X))² over g(·) (see Section 1.4) and obtaining a regular approximation to μ(·) by restricting g to a locally parametric class of functions and replacing quadratic loss with locally weighted quadratic loss.
Under each of these, there are local methods and global methods.
We begin with two examples of paradigm (i) approaches. We assume that we observe $(X_i, Y_i)$, 1 ≤ i ≤ n, i.i.d. as (X, Y) with (X, Y) having a continuous case density f ∈ F. We let $\tilde P_X(\cdot)$ and $\hat P(\cdot, \cdot)$ denote the empirical distributions of X₁, . . . , X_n and $(X_i, Y_i)$, 1 ≤ i ≤ n.
Example 11.6.1. The regressogram and Nadaraya-Watson estimates. If we divide the x axis into intervals $I_j = (jh, (j + 1)h]$, −∞ < j < ∞, the natural approximation to the parameter f(·|x) is the conditional density $f_h(\cdot|x)$ of Y given X ∈ I(x), the $I_j$ to which x belongs. This leads to approximating the parameter μ(x) by
\[
E_h(Y \mid X = x) = \sum_{j=-\infty}^{\infty} \int y\, f_h(y \mid x)\, 1(x \in I_j)\,dy.
\]
The empirical plug-in estimate is
\[
\hat\mu_h^R(x) \equiv \sum_{j=-\infty}^{\infty} \bar Y_{jh}\, 1(x \in I_j)
\]
where $\bar Y_{jh}$ is the average of the $Y_i$ with $X_i \in I_j$. This is called the regressogram and like the histogram density estimate doesn't take enough advantage of possible smoothness of μ(x).
Let $K_h(t) = h^{-1} K(t/h)$ where K is a kernel as in Section 11.2. A smooth weighted average is
\[
\hat\mu_{NW}(x) = \frac{\sum_{i=1}^{n} Y_i K_h(X_i - x)}{\sum_{i=1}^{n} K_h(X_i - x)}, \tag{11.6.1}
\]
the Nadaraya-Watson (NW) estimate. By selecting a kernel with enough derivatives we can intuitively obtain a version of NW which is as smooth as we please. ✷
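A minimal sketch of (11.6.1) in code (ours; the Gaussian kernel, the regression function sin x, and the noise level are illustrative assumptions):

```python
import numpy as np

def nw(x, X, Y, h):
    # Nadaraya-Watson estimate (11.6.1): a kernel-weighted average of the Y_i
    w = np.exp(-((X - x) / h) ** 2 / 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(8)
n = 5000
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + rng.normal(0, 0.2, n)    # mu(x) = sin x
est = nw(1.0, X, Y, h=0.1)               # close to sin(1)
```

This is a local averaging estimate in the sense above: the weights $w_n(x, X_i)$ are proportional to $K_h(X_i - x)$ and decrease as $|X_i - x|$ grows.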
We can obtain the NW estimate by viewing it from the optimization viewpoint as well and, in that way, also generalize it to estimate smoother μ(x). Both are examples of local averaging estimates, $\hat\mu(x) = \sum_{i=1}^{n} Y_i w_n(x, X_i)$, where the data-determined weights $w_n(x, X_i)$ decrease as $|X_i - x|$ increases.
Example 11.6.2. Minimum contrast and local polynomial estimates. Here is a class of procedures which falls under paradigm (ii). Let x be fixed and let μ(z, β) be a parametric family of functions such that for suitable β(x), μ(x, β(x)) = μ(x) and μ(z, β(x)) approximates μ(x) well locally for z close to x. Mainly we consider μ(z, β(x)) a polynomial in z − x. Suppose K ≥ 0. We obtain a regular parameter by replacing minimization of E(Y − g(X))² by that of the discrepancy
\[
D(\beta(\cdot)) = \int \big[y - \mu(z, \beta(x))\big]^2 K_h(z - x) f(x, y)\,dz\,dx\,dy \tag{11.6.2}
\]
with respect to β = β(x), where β(x) is in $R^p$ for some p ≥ 0, and then approximating μ(x) by μ(x, β(x)). As h → 0 we approach paradigm (ii). To simplify our notation, we introduce (X′, Y, Z) with density proportional to $K_h(z - x) f(x, y) 1(x \in S(f))$. Conditioning on X′ = x, we obtain (Problem 11.6.1)
\[
\beta(x) = \arg\min_\beta \int \big[y - \mu(z, \beta)\big]^2 q(z, y \mid x)\,dz\,dy
\]
Section 11.6 Nonparametric Regression for One Covariate
where $q(z, y \mid x)$ is the conditional density of $(Z, Y) \mid X' = x$, that is,
$$q(z, y \mid x) = \frac{f(z, y)\, K_h(z - x)\, 1\big(x \in S(f)\big)}{f_h(x)}$$
with $f_h(\cdot)$ the marginal density of $X'$ given by
$$f_h(x) = \int K_h(z - x) f(z, y)\, dz\, dy\; 1\big(x \in S(f)\big) = \int K_h(z - x) f_X(z)\, dz\; 1\big(x \in S(f)\big)\,.$$
Note that
$$\beta(x) = \beta(P; x) \equiv \arg\min_\beta \frac{\int \big[y - \mu(z, \beta)\big]^2 K_h(z - x)\, 1\big(x \in S(f)\big)\, dP(z, y)}{\int K_h(z - x)\, 1\big(x \in S(f)\big)\, dP_X(z)}\,. \qquad (11.6.3)$$
The empirical plug-in estimate β(Pb ; x) is a minimum contrast and local modelling estimate
in the same spirit as (11.3.1); we approximate µ(x) locally by µ(z, β), where β can be
expressed as β(P ; x).
If we specialize to µ(z, β) = β ∈ R we obtain the Nadaraya-Watson estimate (Problem
11.6.2). If we take $\mu(z, \beta) = \beta_0 + \beta_1(z - x)$ we get
$$\beta_1(x, P) = \frac{E_Q\big(Y(Z - E_Q Z)\big)}{\operatorname{Var}_Q Z}\,, \qquad \beta_0(x, P) = E_Q(Y) - \beta_1(x, P)\big(E_Q(Z) - x\big) \qquad (11.6.4)$$
where $Q$ is the distribution corresponding to $q(z, y \mid x)$. See Remark 11.3.1.
We can use empirical plug-in because the dependence on $P$ is purely through $dP(z, y)$ and $dP_X(z)$. Specifically,
$$\beta_1(x, \hat P) = \frac{\sum_{i=1}^{n} Y_i (X_i - \bar X(x)) K_h(X_i - x)}{\sum_{i=1}^{n} (X_i - \bar X(x))^2 K_h(X_i - x)} \qquad (11.6.5)$$
where
$$\bar X(x) = \frac{\sum_{i=1}^{n} X_i K_h(X_i - x)}{\sum_{i=1}^{n} K_h(X_i - x)}\,. \qquad (11.6.6)$$
Set $\bar Y(x) = \hat\mu_{NW}(x)$; then the estimate of $\mu(x)$ is the locally linear estimate
$$\hat\mu_h(x) = \beta_0(x, \hat P) = \bar Y(x) - \beta_1(x, \hat P)\big(\bar X(x) - x\big)\,. \qquad (11.6.7)$$
Evidently, we may replace β0 + β1 (z − x) by a higher order polynomial in (z − x) and use
Theorem 1.4.4 to obtain locally polynomial estimates.
✷
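The closed form (11.6.5)-(11.6.7) can be coded directly. A hedged Python sketch; the Gaussian kernel, the exponential test curve, and the boundary evaluation point are illustrative assumptions:

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Locally linear estimate via (11.6.5)-(11.6.7): fit a kernel-weighted
    line through the data near x and read off its value at x."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)              # K_h(X_i - x), constants cancel
    Xbar = np.sum(X * w) / np.sum(w)                   # X-bar(x), eq. (11.6.6)
    Ybar = np.sum(Y * w) / np.sum(w)                   # Y-bar(x), the NW estimate
    b1 = np.sum(Y * (X - Xbar) * w) / np.sum((X - Xbar) ** 2 * w)  # (11.6.5)
    return Ybar - b1 * (Xbar - x)                      # (11.6.7)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 1500)
Y = np.exp(X) + rng.normal(0, 0.1, 1500)
est = local_linear(0.02, X, Y, h=0.1)   # near the left boundary, where p = 1 helps
```

Evaluating near the boundary illustrates the point made later in this section: the locally linear fit keeps its interior bias order there, while a locally constant fit does not.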
Remark 11.6.1. We obtain a robust minimum contrast estimate if we replace the squared error $e^2$, $e = y - u$, in (11.6.2) by the Huber function $\rho(e) = e^2$ for $|e| \le k$, $\rho(e) = k|e|$ for $|e| > k$. With this alteration, our estimates resemble the popular LOESS estimate of Cleveland (1979) and Cleveland and Devlin (1988).
Chapter 11: Nonparametric Inference for Functions of One Variable

11.6.2 Asymptotic Bias and Variance Calculations
Our calculations are kept simple by first computing the bias and variance conditional on
X = (X1 , . . . , Xn )T and then computing limits in probability as n → ∞. Thus, because
E(Yi |X) = µ(Xi ),
$$E\{\hat\mu_{NW}(x) \mid \mathbf X\} = \frac{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)\, \mu(X_i)}{n^{-1}\sum_{i=1}^{n} K_h(X_i - x)}\,. \qquad (11.6.8)$$
Moreover, if $\|\mu'''\|_\infty \le M < \infty$, then uniformly in $x$,
$$\mu(X_i) = \mu(x) + \mu'(x)(X_i - x) + \tfrac12 \mu''(x)(X_i - x)^2 + O_P\big((X_i - x)^3\big)\,, \qquad (11.6.9)$$
which makes $E\{\hat\mu_{NW}(x) \mid \mathbf X\}$ a simple function of the i.i.d. sums
$$S_{nj} = \sum_{i=1}^{n} (X_i - x)^j K_h(X_i - x)\,, \qquad j = 0, 1, 2.$$
Recall that $[a, b]$ is the support of the density $f(x)$ of $X$ and recall the notation
$$m_j = m_j(K) = \int u^j K(u)\, du\,, \qquad \nu = \nu(K) = \nu_0(K)\,, \qquad \nu_j = \nu_j(K) = \int u^j K^2(u)\, du\,.$$
Theorem 11.6.1. Suppose that $[c, d] \subset (a, b)$, $\inf\{f(x) : x \in [c, d]\} > 0$, and that $\mu'''(\cdot)$ and $f''(\cdot)$ are bounded on $[c, d]$. If $K$ is symmetric, $m_2(K) < \infty$, and $\nu_5(K)$ exists, then, for the estimate (11.6.1), uniformly for $x \in [c, d]$ as $n \to \infty$, $h = h_n \to 0$,
$$\operatorname{Bias}[\hat\mu_{NW}(x) \mid \mathbf X] = \frac12 \left[\mu''(x) + \frac{2\mu'(x) f'(x)}{f(x)}\right] m_2(K)\, h^2 + O_P(h^3)\,. \qquad (11.6.10)$$
Proof. Using (11.6.8) and (11.6.9), we have
$$\operatorname{Bias}[\hat\mu_{NW}(x) \mid \mathbf X] = \frac{\mu'(x) S_{n1} + \frac12 \mu''(x) S_{n2} + O_P(S_{n3})}{S_{n0}}\,. \qquad (11.6.11)$$
The rest of the proof uses Chebychev's inequality, the law of large numbers, and Taylor expansions. See Appendix D.6.
Remark 11.6.2. We see from the expansion used to derive (11.6.10) that under the conditions of Theorem 11.6.1, if $m_2(K) = 0$ and $m_4(K) < \infty$, then the conditional bias of $\hat\mu_{NW}$ is of the order $O_P(h^4)$. In general, if $r$ is even, $f''$ and $\mu^{(r+3)}$ are bounded, $K$ is symmetric, $m_j(K) = 0$, $j = 1, \ldots, r$, and $m_{r+3}(K) < \infty$, then the conditional bias of $\hat\mu_{NW}$ is of order $O_P(h^{r+2})$. Kernels with $m_j(K) = 0$, $j = 1, \ldots, r$, are called higher order kernels of order $r$. See Problem 11.2.8 for an example. If $r$ is odd, the preceding with $r$ replaced by $r - 1$ applies. ✷
We turn to
$$\operatorname{Var}(\hat\mu_{NW}(x) \mid \mathbf X) = \sum_{i=1}^{n} w_i^2(x)\, \sigma^2(X_i) \qquad (11.6.12)$$
where
$$\sigma^2(x) = \operatorname{Var}(Y \mid X = x)\,, \qquad w_i(x) = K_h(X_i - x) \Big/ \sum_{j=1}^{n} K_h(X_j - x)\,. \qquad (11.6.13)$$
Theorem 11.6.2. Suppose that for $[c, d] \subset (a, b)$, $\inf\{f(x) : x \in [c, d]\} > 0$ and that $f'(x)$ and $[\sigma^2(x)]'$ are bounded on $[c, d]$. If $\nu_j(K) < \infty$, $j = 0, 1$, then, uniformly for $x \in [c, d]$, as $n \to \infty$, $h \to 0$, $nh \to \infty$,
$$\operatorname{Var}(\hat\mu_{NW}(x) \mid \mathbf X) = \frac{\nu(K)\, \sigma^2(x)}{n h\, f(x)} + o_P\!\left(\frac{1}{nh}\right)\,. \qquad (11.6.14)$$
Proof. See Appendix D.6.
Remark 11.6.3. The bias-variance tradeoff is again clear. Let $B(x)h^4$ and $V(x)(nh)^{-1}$ denote the leading terms of the squared conditional bias (11.6.10) and the variance (11.6.14); then the asymptotic conditional MSE is
$$A_h(x) = B(x)\, h^4 + V(x)\, (nh)^{-1}\,.$$
By solving $(d/dh) A_h(x) = 0$ for $h$, we find the minimizer
$$h = h_{\text{optimal}} = \left(\frac{V(x)}{4 B(x)}\right)^{1/5} n^{-1/5}\,.$$
It follows that the conditional MSE of $\hat\mu_{NW}(x)$ satisfies
$$\inf_{h > 0} \operatorname{MSE}(\hat\mu_{NW}(x) \mid \mathbf X) = 5 \cdot 4^{-4/5}\, B^{1/5}(x)\, V^{4/5}(x)\, n^{-4/5} + o_P(n^{-4/5})\,.$$
Thus with $h$ of the form $c n^{-1/5}$, the conditional MSE of $\hat\mu_{NW}(x)$ tends to zero at the rate $O_P(n^{-4/5})$. ✷
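The algebra in this remark is easy to confirm numerically. In the sketch below, B, V, and n are arbitrary test values (my assumptions, chosen only to exercise the formulas):

```python
import numpy as np

# Check that h_opt = (V/(4B))^{1/5} n^{-1/5} minimizes A_h(x) = B h^4 + V/(n h),
# and that the minimum equals 5 * 4^{-4/5} * B^{1/5} * V^{4/5} * n^{-4/5}.
B, V, n = 2.0, 3.0, 1000
A = lambda h: B * h**4 + V / (n * h)

h_formula = (V / (4 * B)) ** 0.2 * n ** -0.2
grid = np.linspace(0.01, 1.0, 100_000)
h_grid = grid[np.argmin(A(grid))]                 # brute-force minimizer
min_formula = 5 * 4 ** -0.8 * B ** 0.2 * V ** 0.8 * n ** -0.8
```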
If $S(f) = [a, b]$ and $x_\ell = a + \lambda h$, $x_r = b - \lambda h$ are left and right boundary points, then $\hat\mu_{NW}(x_\ell)$ is still consistent, but the bias at these points is of order $O_P(h)$ rather than the $O_P(h^2)$ rate of Theorem 11.6.1. To improve on this rate we turn to other estimates.
Local polynomial estimates

In Example 11.6.2 consider the polynomial
$$\mu(z; \beta) = \sum_{k=0}^{p} \beta_k (z - x)^k\,.$$
The partial derivative with respect to $\beta_j$ is $\nabla_j \mu(z - x, \beta) = (z - x)^j$ and the minimizer of the discrepancy (11.6.2) satisfies the set of linear equations ($j = 0, \ldots, p$)
$$E_P\{Y (X - x)^j K_h(X - x)\} = \sum_{k=0}^{p} E_P\{(X - x)^j K_h(X - x) (X - x)^k\}\, \beta_k\,. \qquad (11.6.15)$$
Let $L(X - x) = (1, X - x, \ldots, (X - x)^p)^T$; then the solution to (11.6.15) is
$$\beta = \beta(x; P) = \big[E_P\{L(X - x) K_h(X - x) L^T(X - x)\}\big]^{-1} E_P\{Y K_h(X - x) L(X - x)\}$$
provided the indicated matrix inverse exists. Thus the empirical plug-in estimate of $\beta$ is
$$\hat\beta(x) = (Z^T W Z)^{-1} Z^T W Y \qquad (11.6.16)$$
where $W = \operatorname{Diag}\big(K_h(X_1 - x), \ldots, K_h(X_n - x)\big)$, $Y = (Y_1, \ldots, Y_n)^T$, and $Z = \big((X_i - x)^j\big)$, $i = 1, \ldots, n$, $j = 0, \ldots, p$. Note that (11.6.16) is also the solution to the weighted least squares problem with weight matrix $W$; see (2.2.20). Using the notation
$$\hat\beta(x) = \big(\hat\beta_0(x), \hat\beta_1(x), \ldots, \hat\beta_p(x)\big)^T\,,$$
the estimate of $\mu(x)$ is
$$\hat\mu(x) = \mu\big(z; \hat\beta(x)\big)\big|_{z = x} = \hat\beta_0(x)\,.$$
For the local linear case where $p = 1$, this reduces to (11.6.7).
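Formula (11.6.16) maps directly onto code. A minimal sketch; the Gaussian kernel, the sine test curve, and the bandwidth are my assumptions:

```python
import numpy as np

def local_poly(x, X, Y, h, p=1):
    """Local polynomial fit (11.6.16): beta_hat = (Z'WZ)^{-1} Z'W Y with
    Z[i, j] = (X_i - x)^j and W = diag(K_h(X_i - x)).
    beta_hat[0] estimates mu(x); beta_hat[1] estimates mu'(x)."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)             # kernel weights
    Z = np.vander(X - x, N=p + 1, increasing=True)    # columns 1, X_i - x, ...
    ZW = Z.T * w                                      # Z^T W without forming diag(w)
    return np.linalg.solve(ZW @ Z, ZW @ Y)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 2000)
Y = np.sin(X) + rng.normal(0, 0.05, 2000)
beta = local_poly(0.3, X, Y, h=0.2, p=2)
# beta[0] should be near mu(0.3) = sin(0.3), beta[1] near mu'(0.3) = cos(0.3)
```

Solving the weighted normal equations with `np.linalg.solve` avoids forming the explicit inverse in (11.6.16), which is the numerically preferable route.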
We next show that when $\mu$ is smooth the rate at which the bias tends to zero as $h \to 0$ can be made arbitrarily fast by selecting $p$ large. Note that (11.6.16) and $E(Y_i \mid X_i) = \mu(X_i)$ yield
$$E\big(\hat\beta(x) \mid \mathbf X\big) = (Z^T W Z)^{-1} Z^T W \mu$$
where $\mu = (\mu(X_1), \ldots, \mu(X_n))^T$. Set $\beta = \big(\mu(x), \mu'(x), \ldots, \mu^{(p)}(x)/p!\big)^T$ and $r = \mu - Z\beta$; then the conditional bias of $\hat\beta(x)$ as an estimate of $\beta$ is
$$\operatorname{Bias}\big(\hat\beta(x) \mid \mathbf X\big) = (Z^T W Z)^{-1} Z^T W r\,. \qquad (11.6.17)$$
Here is a key lemma.

Lemma 11.6.1. Assume that $\inf\{f(x) : x \in [c, d]\} > 0$ for $[c, d] \subset (a, b)$. Also assume that $f'(\cdot)$ and $\mu^{(p+2)}(\cdot)$ are bounded on $[c, d]$. Then, uniformly for $x \in [c, d]$,
$$r = \Big(\beta_{p+1} (X_i - x)^{p+1} + o_P(X_i - x)^{p+1}\Big)_{1 \le i \le n}\,. \qquad (11.6.18)$$
Proof. The result follows because by Taylor expansion
$$\mu(X_i) = \sum_{j=0}^{p} \beta_j (X_i - x)^j + \beta_{p+1} (X_i - x)^{p+1} + o_P(X_i - x)^{p+1}\,; \qquad (11.6.19)$$
that is, the first $(p + 1)$ terms in the Taylor expansion of $r$ are identically zero! ✷

It follows that
Proposition 11.6.1. If $\mu(x)$ is a polynomial of order $p$, then $\hat\beta(x)$ is unbiased. In particular, $\hat\mu(x) = \hat\beta_0(x)$ is unbiased.
Lemma 11.6.1 also yields the asymptotic conditional bias of $\hat\mu(x)$ in general. For the proof and the conditional bias of the vector $\hat\beta(x)$, see Appendix D.6.

Theorem 11.6.3. Suppose that $K$ is symmetric with $m_{2p+1}(K) < \infty$. Under the conditions of Lemma 11.6.1, uniformly for $x \in [c, d]$,
$$\operatorname{Bias}[\hat\mu(x) \mid \mathbf X] = \frac{\eta_{p+1}(K)}{(p + 1)!}\, \mu^{(p+1)}(x)\, h^{p+1} + o_P(h^{p+1}) \qquad (11.6.20)$$
where $\eta_{p+1}(K)$ is a constant defined in Appendix D.6.
When $p$ is even, $\eta_{p+1}(K) = 0$ (Appendix D.6), so the leading term in (11.6.20) is zero. In the case $p = 0$, we have shown that the next term in the expansion gives (11.6.10) as the leading term, which is seen to be $O_P(h^2)$. For the leading term when $p \ge 2$ is even, see Fan and Gijbels (1996, Theorem 3.1). When $p = 1$, we show in Appendix D.6 that $\eta_2(K) = m_2(K)$; thus
$$\operatorname{Bias}\big[\hat\mu(x) \mid \mathbf X\big] = \frac12 m_2(K)\, \mu''(x)\, h^2 + o_P(h^2)\,, \qquad p = 1\,.$$
This bias is of the same order, $O_P(h^2)$, as the bias (11.6.10) of the local constant estimate. However, at boundary points we can show that the local linear estimate retains the order $O_P(h^2)$ while the bias of the local constant estimator converges to zero at the slower rate $O_P(h)$ (Problem D.6.5). Let $m_{\ell\lambda}(K) = \int_{-\lambda}^{1} u^\ell K(u)\, du$ and
$$Q_\lambda(K) = \frac{m_{2\lambda}^2(K) - m_{1\lambda}(K)\, m_{3\lambda}(K)}{m_{2\lambda}(K)\, m_{0\lambda}(K) - m_{1\lambda}^2(K)}\,.$$
Proposition 11.6.2. Suppose the support of $f$ is $S(f) = (a, b)$ with $a$ and $b$ finite, $K$ is symmetric with $K(u) = 0$ for $|u| > 1$, $\mu^{(p+1)}(a^+)$ exists, $f(a^+) > 0$, and $f(\cdot)$ and $\mu^{(p+1)}(\cdot)$ are right continuous at $a$; then for $x = a + \lambda h$,

(a) for $p = 0$, $\operatorname{Bias}[\hat\mu(x) \mid \mathbf X] = m_{2\lambda}(K)\, \mu'(a^+)\, h + o_P(h)$; \hfill (11.6.21)

(b) for $p = 1$, $\operatorname{Bias}[\hat\mu(x) \mid \mathbf X] = \frac12 Q_\lambda(K)\, \mu''(a^+)\, h^2 + o_P(h^2)$. \hfill (11.6.22)

The variance
Theorem 11.6.4. Suppose $K$ is a symmetric density with $\nu_{2p+1}(K) < \infty$, $\inf\{f(x) : x \in [c, d]\} > 0$ for $[c, d] \subset (a, b)$, and that $f'(x)$, $\mu^{(p+1)}(x)$, and $[\sigma^2(x)]'$ exist and are bounded on $[c, d]$. If $h \to 0$ and $nh \to \infty$ as $n \to \infty$, then, uniformly for $x \in [c, d]$,
$$\operatorname{Var}(\hat\mu(x) \mid \mathbf X) = \nu_p(K)\, \frac{\sigma^2(x)}{f(x)}\, (nh)^{-1}\, \{1 + o_P(1)\}\,,$$
where $\nu_p(K)$ is a constant given in Appendix D.6.
It follows from Appendix D.6 that $\nu_1(K) = \nu_0(K)$, so the asymptotic variances for the $p = 0$ and $p = 1$ cases coincide. Also note that the order $O_P\{(nh)^{-1}\}$ of $\operatorname{Var}(\hat\mu(x) \mid \mathbf X)$ does not depend on $p$. Thus by increasing $p$, assuming smoothness, we can make the asymptotic conditional bias arbitrarily small without increasing the order of the asymptotic conditional variance.
We next turn to the conditional MSE, its asymptotic and integrated versions, and their minimizers.

MSE and IMSE

Corollary 11.6.1. Under the conditions of Theorems 11.6.3 and 11.6.4, uniformly for $x \in [c, d]$,
$$\operatorname{MSE}(\hat\mu(x) \mid \mathbf X) = \left[\frac{\eta_{p+1}(K)}{(p + 1)!}\, \mu^{(p+1)}(x)\, h^{p+1}\right]^2 + \nu_p(K)\, \frac{\sigma^2(x)}{f(x)}\, (nh)^{-1} + o_P\big\{h^{2(p+1)} + (nh)^{-1}\big\}\,. \qquad (11.6.23)$$
We denote the sum of the first two terms in (11.6.23) by $A(x, h)$. Note that $A(x, h)$ is of the form $B(x) h^{2(p+1)} + V(x) n^{-1} h^{-1}$, where $B(x)$ and $V(x)$ are given in (11.6.23). By solving $(d/dh) A(x, h) = 0$ for $h$ we find the minimizer of $A(x, h)$,
$$h_0(x) = \arg\min_{h > 0} A(x, h) = \left(\frac{V(x)}{2(p + 1) B(x)}\right)^{\frac{1}{2p+3}} n^{-\frac{1}{2p+3}}\,.$$
By substituting $h_0(x)$ in $A(x, h)$ we find
$$\min_{h > 0} A(x, h) = c'\, n^{-\frac{2p+2}{2p+3}} \qquad (11.6.24)$$
where $c'$ does not depend on $n$. The same order of convergence holds for the IMSE.

Remark 11.6.4. Note that when $\mu(x)$ is infinitely differentiable the rate of convergence of the MSE and IMSE to zero can be made arbitrarily close to the regular parametric rate $n^{-1}$. ✷
The (asymptotic) integrated mean squared error (IMSE) is
$$\operatorname{IMSE}(\hat\mu) = \int A(x, h)\, w(x)\, dx$$
for an appropriate weight function $w(x)$. If $f(\cdot)$ has finite support $[a, b]$, then $w(x) = f(x)$
is a natural choice. In this case, if $\sigma^2(x) \equiv \sigma^2$ is constant and $E\big[\mu^{(p+1)}(X)\big]^2 < \infty$, the optimal $h$ is (Problem 11.6.4)
$$h_0 = C_p \left\{\frac{(b - a)\, \sigma^2}{E\big[\mu^{(p+1)}(X)\big]^2}\right\}^{\frac{1}{2p+3}} n^{-\frac{1}{2p+3}}\,. \qquad (11.6.25)$$
With this $h_0$, the IMSE tends to zero at the rate $n^{-\frac{2p+2}{2p+3}}$.
Regression curves

So far our main focus has been on estimating the regression level $\mu(x) = E(Y \mid X = x)$. However, in parametric analysis the emphasis is on estimating regression slopes (often called regression coefficients) because they give the expected change in the response $Y$ as the covariate $x$ is increased (or decreased). In a nonparametric framework the regression curve $b(x) \equiv \mu'(x)$ has a similar interpretation, and is also of interest. Recall that our local polynomial estimate $\hat\beta = \hat\beta(x) = \big(\hat\beta_0(x), \ldots, \hat\beta_p(x)\big)^T$ provides an estimate $\hat\mu(z) = \sum_{j=0}^{p} \hat\beta_j(x)(z - x)^j$ of $\mu(z)$ for $z$ close to $x$. It follows that
$$\hat b = \hat b(x) \equiv \hat\mu'(z)\big|_{z = x} = \hat\beta_1(x)$$
is an estimate of $b(x)$. The asymptotic properties of $\hat b$ are contained in the results in Appendix D.6. We state these results when $p = 2$, which is a good choice of $p$ because it gives conditional bias of order $h^2$ at boundary as well as interior points.

Theorem 11.6.5. Suppose the conditions of Theorems 11.6.3 and 11.6.4 are satisfied with $p = 2$. Then
$$\operatorname{MSE}(\hat b(x) \mid \mathbf X) = \left\{\frac16\, \frac{m_4(K)}{m_2(K)}\, \mu^{(3)}(x)\, h^2\right\}^2 + \frac{\nu_2(K)}{m_2(K)}\, \frac{\sigma^2(x)}{f(x)}\, n^{-1} h^{-3} + o_P(h^4 + n^{-1} h^{-3}) = B(x) h^4 + V(x) n^{-1} h^{-3} + R_n\,, \text{ say.}$$
The asymptotically optimal bandwidth $h$ is obtained by setting the derivative of $B h^4 + V n^{-1} h^{-3}$ equal to zero, which with $C(K) = \big\{27\, m_2(K)\, \nu_2(K)/m_4^2(K)\big\}^{1/7}$ (Problem 11.6.7) yields
$$h_0 = \left(\frac{3V}{4B}\right)^{1/7} n^{-1/7} = C(K) \left\{\frac{\sigma^2(x)}{[\mu^{(3)}(x)]^2 f(x)}\right\}^{1/7} n^{-1/7}\,. \qquad (11.6.26)$$
Smoothing, variance estimation, and confidence regions

Our estimates of $\mu(\cdot)$ are often referred to as linear smoothers of the data $(X_i, Y_i)$, $1 \le i \le n$. More precisely, let $\mu = \big(\mu(X_1), \ldots, \mu(X_n)\big)^T$ and let $\hat\mu = \big(\hat\mu(X_1), \ldots, \hat\mu(X_n)\big)^T$ be an estimator (predictor) of $\mu$. Then we can write
$$\hat\mu = S Y \qquad (11.6.27)$$
for some $n \times n$ smoother matrix $S$. In the case of the NW estimate (11.6.1), $S = \big(w_i(X_j)\big)_{n \times n}$ (see (11.6.13)). Estimates of the form $SY$ include regression splines; see e.g. Hastie and Tibshirani (1990).
For variance estimation, we consider the model
$$Y_i = \mu(X_i) + \varepsilon_i\,, \qquad i = 1, \ldots, n\,,$$
where $X_i$ and $\varepsilon_i$ are independent and $v(X_i) = \operatorname{Var}(Y_i \mid X_i) = \operatorname{Var}(\varepsilon_i)$. An estimator of $v = \big(v(X_1), \ldots, v(X_n)\big)^T$ is obtained by considering $T \mathbf R^2$ for some $n \times n$ smoother matrix $T$, where $\mathbf R = (I - S) Y$ is the vector of residuals and $\mathbf R^2$ is the vector of squared residuals. To obtain a bias correction for $T \mathbf R^2$, we note that when $v(x) \equiv \sigma^2$,
$$E(\mathbf R^2 \mid \mathbf X) = \big[E(\hat\mu - \mu \mid \mathbf X)\big]^2 + (\mathbf 1 + A)\, \sigma^2$$
where $A = \operatorname{diag}(S S^T - 2S)$ and $\mathbf 1 = (1, \ldots, 1)^T_{n \times 1}$. To reduce bias, we write $A = (a_{ij})$ and set
$$r_i = \frac{R_i}{\sqrt{1 + a_{ii}}}\,, \qquad r^2 = (r_1^2, \ldots, r_n^2)^T\,.$$
Our estimate of $v$ is (see Ruppert, Wand, Holst and Hössjer (1997))
$$\hat v = T r^2\,.$$
When $\hat\mu$ is conditionally unbiased (see Proposition 11.6.1) and $v(x) \equiv \sigma^2$, $\hat v$ is conditionally unbiased. Moreover, for local polynomial estimates, by Theorem 11.6.3, the conditional bias is of order $o_P(h^{2(p+1)})$. In the homoscedastic case when $v(x) \equiv \sigma^2$, we use
$$\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} \hat v(X_i)\,. \qquad (11.6.28)$$
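The smoother-matrix computations above can be sketched for the NW estimate, taking the smoother $T$ in the variance step to be a simple average. This is a hedged illustration; the homoscedastic model, Gaussian kernel, and test values are assumptions:

```python
import numpy as np

def nw_smoother(X, h):
    """Smoother matrix S with mu_hat = S Y for the NW estimate:
    row i holds the weights w_j(X_i) of (11.6.13), Gaussian kernel."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, sigma = 400, 0.3
X = np.sort(rng.uniform(0, 1, n))
Y = np.sin(2 * np.pi * X) + rng.normal(0, sigma, n)

S = nw_smoother(X, h=0.05)
R = (np.eye(n) - S) @ Y                  # residual vector R = (I - S) Y
a = np.diag(S @ S.T - 2 * S)             # diagonal entries of A = diag(SS^T - 2S)
r2 = R**2 / (1 + a)                      # bias-corrected squared residuals r_i^2
sigma2_hat = r2.mean()                   # (11.6.28) with T = averaging
```

With these test values, `sigma2_hat` should land close to the true error variance sigma squared, up to the smoothing bias discussed in the text.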
We now turn to the estimation of the variances of our estimates. The estimates of $\mu^{(j)}(\cdot)$, $j = 0, \ldots, p$, can be written in the form $\sum_{i=1}^{n} w_{ij}(x) Y_i$ for appropriate $w_{ij}(x)$; see (11.6.1), (11.6.5), and (11.6.7). It follows that in the homoscedastic case
$$\operatorname{Var}\big(\hat\mu^{(j)}(x) \mid \mathbf X\big) = \sigma^2 \sum_{i=1}^{n} w_{ij}^2(x)\,,$$
where we use $\hat\sigma^2$ from (11.6.28) to estimate $\sigma^2$. In this case, under conditions on $h$ where $\hat\mu^{(j)}(x)$ has bias of lower order than the order of its standard deviation, the pivot
$$Z_j(x) = \frac{\hat\mu^{(j)}(x) - \mu^{(j)}(x)}{\hat\sigma \big\{\sum_{i=1}^{n} w_{ij}^2(x)\big\}^{1/2}}$$
will approximately have a $N(0, 1)$ distribution and yield approximate $(1 - \alpha)$ confidence intervals for $\mu^{(j)}(x)$ with $x$ in a grid $G = \{x_{(1)}, \ldots, x_{(g)}\}$. We expect $100(1 - \alpha)\%$ of these intervals to contain the true values of $\mu^{(j)}(x)$, $x \in G$. Simultaneous level $(1 - \alpha)$ intervals can be obtained from the asymptotic distribution of $\sup_x |Z_j(x)|$ (Nadaraya (1989)).
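A pointwise interval from the pivot $Z_0(x)$ can be sketched as follows. Note the hedges: the Gaussian kernel and test curve are assumptions, and instead of (11.6.28) this sketch estimates sigma with Rice's first-difference estimator, a different but standard device:

```python
import numpy as np

def nw_ci(x, X, Y, h):
    """Approximate 95% CI for mu(x) from the pivot Z_0(x):
    mu_hat(x) +/- z * sigma_hat * sqrt(sum_i w_i(x)^2).
    sigma_hat is Rice's difference-based estimator (an assumption here;
    the text uses (11.6.28)).  Gaussian kernel, homoscedastic errors."""
    order = np.argsort(X)
    d = np.diff(Y[order])
    sigma_hat = np.sqrt(np.mean(d**2) / 2)     # difference-based sigma estimate
    w = np.exp(-0.5 * ((X - x) / h) ** 2)
    w = w / w.sum()                            # NW weights w_i(x)
    mu_hat = np.sum(w * Y)
    z = 1.959963984540054                      # N(0,1) quantile for 95%
    half = z * sigma_hat * np.sqrt(np.sum(w**2))
    return mu_hat - half, mu_hat + half

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 2000)
Y = np.cos(3 * X) + rng.normal(0, 0.2, 2000)
lo, hi = nw_ci(0.5, X, Y, h=0.05)
```

As the text warns, the interval is only trustworthy when the bias of the estimate is of lower order than its standard deviation, which constrains the choice of h.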
Multiple testing in matched pair experiments
In Section 4.9.2 we considered experiments where subjects serve as their own control
and the difference Y between a treatment response and a control response is recorded. Now
in addition to Y we include a covariate such as age or BMI (body mass index). In this case
we may want to investigate for which covariate values the treatment has an effect and turn
to multiple testing where we test Hj : µ(xj ) = 0, j = 1, . . . , g, vs one-sided or two-sided
alternatives using Zj (x) or |Zj (x)|. The simultaneous confidence intervals discussed above
provide possible solutions to the multiple testing problem. If we test $H_0 : \mu(x) = 0$ for all $x$, then $\hat\mu^{(j)}(x)$ will not have a bias problem under $H_0$.
Bandwidth choice
The discussion of Section 11.2.3 also applies to the local polynomial estimates of µ(·).
Again the reference distribution approach to bandwidth selection yields the optimal rate of convergence but not the optimal constant when $p = 1$ if we assume a well behaved $\mu''$ but are not willing to estimate $\int [\mu''(x)]^2 f(x)\, dx$ nonparametrically. For instance, we can use the parametric model where $(Y \mid X = x)$ is $N(\alpha_0 + \alpha_1 x + \alpha_2 x^2, \sigma^2)$ and substitute the parametric estimates of $\sigma^2$ and $\mu''(x) = 2\alpha_2$ in (11.6.25) with $(b - a)$ replaced by $(X_{(n)} - X_{(1)})$. However, the soundest approach is cross validation, which adapts to the degree of smoothness present. See Section 12.5.
Summary. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. as $(X, Y) \sim P$ where $(X, Y)$ is continuous. We consider kernel estimates $\hat\mu_h(x)$ of the curve $\mu(x) = E_P(Y \mid X = x)$ based on weighted local fits to $\mu(x)$ over intervals $[x - h, x + h]$. The Nadaraya-Watson (NW) estimate $\hat\mu_{NW}(x)$ is a kernel estimate of the form $\sum_i w_i(x) Y_i$, where
$$w_i(x) = K_h(X_i - x) \Big/ \sum_{j=1}^{n} K_h(X_j - x)\,,$$
and can be viewed as giving a locally constant fit to $\mu(x)$ over $[x - h, x + h]$. We also consider locally linear and locally polynomial fits and give their bias and variance properties for $n \to \infty$ and $h \to 0$. The NW estimate has conditional bias of order $O_P(h^2)$ for interior points $x$ and order $O_P(h)$ for boundary points, while the locally linear estimate has conditional bias of order $O_P(h^2)$ for all points $x$ in the support of $X$. These estimates have the same asymptotic variance.
11.7 Problems and Complements

Problems for Section 11.2

1. Let $X_1, \ldots, X_n$ be i.i.d. with density $f$, where $f$ has support $S(f)$.

(a) Suppose $S(f) = R$. Show that $(2h)^{-1} \hat P(x - h, x + h] = \int J_h(x - z)\, d\hat P(z)$, where $J_h(u) = h^{-1} J(u/h)$ and $J(t) = \frac12\, 1[-1 \le t \le 1]$.
(b) Suppose that $S(f) = [a, b]$ and $0 < 2h < b - a$. Specify $K_h(x, z)$ so that
$$\frac{\hat P(x - h, x + h]}{2h} = \int K_h(x, z)\, d\hat P(z)\,.$$
Consider (i) $x \in [a, a + h)$, (ii) $x \in [a + h, b - h]$, and (iii) $x \in (b - h, b]$.
2. Suppose $K$ is a kernel with $K \ge 0$ and support $S(K) = [-c, c]$, $0 < c \le \infty$. Let $X_h = Z + hV$ where $Z \sim f$, $V \sim K$, $Z$ and $V$ are independent, $S(f) = [a, b]$, $-\infty \le a < b \le \infty$, $2h < b - a$.

(a) Show that $X_h$ has density
$$f_h(x) = \int_a^b K_h(x - z) f(z)\, dz\,, \quad x \in [a - ch, b + ch]\,; \qquad f_h(x) = 0 \text{ otherwise.}$$
Hint. $Z$ and $V$ have joint density $f(z) K(v)$, $z \in [a, b]$, $v \in [-c, c]$. By the Jacobian transformation theorem (Theorem B.2.2), $X_h$ and $Z$ have joint density
$$f(z)\, h^{-1} K\big((x - z)/h\big)\,, \quad z \in [a, b]\,, \; x \in [a - ch, b + ch]\,.$$

(b) It follows from (a) that $\int_{a - ch}^{b + ch} f_h(x)\, dx = 1$. Thus $\int_a^b f_h(x)\, dx$ may not be one and $f_h(x)$ may not be a density on $[a, b]$. Find $\int_a^b f_h(x)\, dx$ when $V = U \sim U[-1, 1]$ and (i) $f$ is $N(0, 1)$, (ii) $f$ is $U[0, 1]$.

(c) We can think of $f_h(x) = \int_a^b K_h(x - z) f(z)\, dz$ as a weighted average of $f(z)$ with weights $K_h(x - z)$, $z \in [a, b]$. However, the weights may not integrate to one. Let $W_h(x) = \int_a^b K_h(x - z)\, dz$. Show that $W_h(x) = L\big((x - a)/h\big) - L\big((x - b)/h\big)$, where $L(t) = \int_{-\infty}^{t} K(u)\, du$.

(d) Suppose $-\infty < a < b < \infty$. For $x = a + \lambda h$ and $x = b - \lambda h$, $0 < \lambda < 1$, find $W_h(x)$ as defined in (c) when (i) $K$ is $U[-1, 1]$ and (ii) $K = K_E$.

(e) We can get a closer approximation to $f(x)$ than $f_h(x)$ by rescaling the weights $K_h(x - z)$; that is, use $f(x; h) = \int K_h(x, z) f(z)\, dz$ with $K_h(x, z) = K_h(x - z)/W_h(x)$ and $W_h(x)$ as in (c). Show that $\int_a^b f(x; h)\, dx = 1$.
3. Show that part (b) of Lemma 11.2.1 holds using the dominated convergence theorem.

4. Establish Lemma 11.2.2 using Lemma 11.2.1.
Hint. By (A.15.2),
$$P\big(|\hat f_h(x) - f(x)| \ge \varepsilon\big)\, \varepsilon^2 \le \operatorname{MSE}\big(\hat f_h(x)\big) = \big[\operatorname{Bias} \hat f_h(x)\big]^2 + \operatorname{Var} \hat f_h(x)\,.$$

5. Show that a symmetric kernel with $m_2(K) = 0$ will yield a bias term of uniform order $h^4$ for $\hat f_h(x)$ if $\|f^{(5)}\|_\infty < \infty$ and $m_5(K) < \infty$. Show that $K(t) = \frac12(3 - t^2)\varphi(t)$ is such a kernel when $\varphi$ is the $N(0, 1)$ density. Do the same for $K(t) = (15/32)(7t^4 - 10t^2 + 3)\, 1(|t| \le 1)$.
6. Complete the proof of Proposition 11.2.2.
7. Show that the conclusions of Proposition 11.2.2 and Corollary 11.2.1 remain valid if $K$ does not have compact support.
8. Kernels $K$ such that
$$\int_{-\infty}^{\infty} x^j K(x)\, dx = 0 \quad \text{for } 2 \le j \le r \qquad (11.7.1)$$
are called higher order kernels of order $r$. Show that $K(x) = H_j(x) \exp(-x^2/2)$, where $H_j$ is a Hermite polynomial (Section 5.3), satisfies (11.7.1).
9. By repeating the arguments of Propositions 11.2.1 and 11.2.2, show that if we assume that $\|f^{(r+3)}\|_\infty < \infty$, $m_{r+3}(K) < \infty$, and $r$ is even, then, by using $K$ as in (11.3.7) and bandwidth $c n^{-1/(2r+1)}$, we achieve the uniform rate of convergence $n^{-\frac{2r}{2r+1}}$ of $\operatorname{MSE}\big(\hat f(x)\big)$ to zero. Show that if $K$ is symmetric and $r$ is odd, the result holds with $r$ replaced by $r - 1$.
10. Assume the conditions of Corollary 11.2.1, that $f(x) > 0$ and $0 < [f''(x)]^2 < \infty$.

(a) Show that the sum $\operatorname{AMSE}(\hat f_h(x))$ of the first two terms in the expansion (11.2.12) of $\operatorname{MSE}[\hat f(x)]$ is minimized when $h = c n^{-1/5}$ for some $c \ne 0$. Find the expression for $c$.
Hint. Note that the expression is of the form $a(nh)^{-1} + b h^4$. Set the derivative equal to zero and solve for $h$. Check that this gives the minimum.

(b) Show that using $h = c n^{-1/5}$ leads to the convergence rate $n^{-4/5}$ for $\operatorname{MSE}[\hat f_h(x)]$.
11. Establish (11.2.18).

12. Establish (11.2.19) and (11.2.20). Verify that when $F = N(\mu, \sigma^2)$ and $K = U[-1, 1]$, then $h_{\text{opt}} = 1.95\, \sigma\, n^{-1/5}$. See also Chapter 12.
13. Generate a sample of size $n = 100$ from a chi-square distribution with 5 degrees of freedom. Plot the density $f$ and the estimate $\hat f_h(\cdot)$ given in (11.2.6) based on

(a) the $N(0, 1)$ kernel and bandwidth $h = 1.06\, n^{-1/5} s$ based on a $N(0, 1)$ reference distribution;

(b) the $U[-1, 1]$ kernel and $h = 1.95\, n^{-1/5} s$ based on a $N(0, 1)$ reference distribution.
14. Smooth distribution function estimates. When $F$ is a continuous distribution function, we may want a continuous estimate of $F$. Consider the two estimates (see Problem 11.2.1)
$$\hat F_h(x) = \int_{-\infty}^{x} \hat f_h(t)\, dt = \int_{-\infty}^{x} \int_a^b K_h(t - z)\, d\hat F(z)\, dt\,,$$
$$\hat F(x; h) = \int_{-\infty}^{x} \hat f(t; h)\, dt = \int_{-\infty}^{x} \int_a^b K(t - z; h)\, d\hat F(z)\, dt\,,$$
where $K(t - z; h) = K_h(t - z)/w_h(t)$ and $w_h(t) = \int_a^b K_h(t - z)\, dz$. Suppose $S(f) = [a, b]$.

(a) Let $L(x) = \int_{-\infty}^{x} K(t)\, dt$. Show that $\hat F_h(x) = n^{-1} \sum_{i=1}^{n} \big[L\big((x - X_i)/h\big) - L\big((a - X_i)/h\big)\big]$.

(b) When $K = U[-1, 1]$ and $a$ and $b$ are finite, find $\hat F(x; h)$ for $x \in [a + h, b - h]$, $x = a + \lambda h$, and $x = b - \lambda h$, $0 < \lambda < 1$.

(c) Show that
$$E \hat F_h(x) = L\big((x - b)/h\big) + \int_{(x - b)/h}^{(x - a)/h} F(x - uh) K(u)\, du - \Big[L\big((a - b)/h\big) + \int_{(a - b)/h}^{0} F(a - uh) K(u)\, du\Big]\,.$$
Hint. Use integration by parts: $\int L\, dF = LF - \int F\, dL$.

(d) When $K = U[-1, 1]$ and $a$ and $b$ are finite, find $E \hat F(x; h)$ for $x$ as in (b) preceding. Compare your answer to $E \hat F_h(x)$ with $K = U[-1, 1]$.

(e) Find $\operatorname{Var} \hat F_h(x)$. Show that we can choose $h = h_n$ so that $\operatorname{MSE}\, \hat F_h(x) = O(n^{-1})$. What smoothness conditions are needed on $F$? What conditions are needed on $K$?
Hint. Use (c) preceding.

(f) A good choice for $L$ is the logistic df $L_0$. For $K = L_0$, write a program that computes $\hat F_h(x)$, $\hat F(x; h)$, and $\hat F(x)$ and plot the graphs for a sample of size $n = 20$ from $F(x) = 1 - e^{-x}$, $x > 0$. Use $h = 0.5s$. Repeat with $h = 0.25s$ and $h = s$. Include the true $F(x)$ on the figures.
Hint. See Remark 11.2.2.
Problems for Section 11.3

1. Suppose that $K$ has support $[-1, 1]$. Under the conditions of Proposition 11.2.1, show that for $\hat f_h(x)$ defined by (11.2.6), if $S(f) = [a, b]$ with $a$ and $b$ finite and $x = a + \lambda h$, $0 < \lambda < 1$, then
$$E \hat f_h(x) = \mu_{0,\lambda}(K) f(x) - h\, \mu_{1,\lambda}(K) f'(x) + \frac12 h^2 f''(x)\, \mu_{2,\lambda}(K) + O(h^3)$$
where $\mu_{j,\lambda}(K) = \int_{-1}^{\lambda} u^j K(u)\, du$.
2. Suppose $K(\cdot)$ is a symmetric density on $[-1, 1]$. Assume the conditions of Problem 11.3.1. Show that the estimate $\hat f_h(x)$ based on (11.3.2) has bias $O(h)$ for the boundary points $a + \lambda h$ and $b - \lambda h$, $0 < \lambda < 1$.

3. Establish (11.3.2).

4. Let $\hat f_h(x)$ be defined by (11.2.6). Suppose $f$ is the $U[0, 1]$ density, $0 < h < \frac12$, and $h \le x \le 1 - h$. When $K = U[-1, 1]$ find

(a) $\operatorname{MSE}[\hat f_h(x)]$, and give the rate at which $\operatorname{MSE}[\hat f_h(x)]$ tends to zero as $n \to \infty$;

(b) $h_{\text{opt}}$, which minimizes $\operatorname{MSE}[\hat f_h(x)]$;

(c) the solutions to (a) and (b) above when $K(u) = \frac34(1 - u^2)\, 1(|u| \le 1)$;

(d) the solution to (c) when $x = \lambda h$, $0 < \lambda < 1$.
5. In Example 11.3.1 (see (11.3.4)), set $m_j = m_j(x, h)$, $j = 0, 1, 2$, and
$$s_j = \int_a^b (z - x)^j K_h(z - x)\, dF(z)\,, \qquad j = 0, 1\,.$$

(a) Show that the solution to (11.3.4) is
$$(\alpha, \beta)^T = \frac{1}{h^2(m_0 m_2 - m_1^2)} \begin{pmatrix} s_0 h^2 m_2 - s_1 h m_1 \\ s_1 m_0 - s_0 h m_1 \end{pmatrix}\,.$$

(b) Use (a) preceding to show that an estimate of $f'(x)$ is
$$f'(x, \hat F) \equiv \beta(x, \hat F) = \int_a^b K_h(x, z)\, d\hat F(z)$$
where
$$K_h(x, z) = \frac{m_0(x, h)(z - x)/h - m_1(x, h)}{h\,\big[m_0(x, h)\, m_2(x, h) - m_1^2(x, h)\big]}\, K_h(z - x)\,.$$

(c) Establish (11.3.8) and (11.3.9).
Hint: For $x \in [a + h, b - h]$, by (11.3.5), this follows from Propositions 11.2.1 and 11.2.2. For $x = a + \lambda h$, $0 < \lambda < 1$, follow the proofs of these propositions.
6. Local polynomial estimates. Suppose that for $z$ close to $x$, in Example 11.2.1 continued, we use a local polynomial approximation
$$f(z - x; \beta) \cong \beta_0 + \sum_{j=1}^{p} \beta_j (z - x)^j\,.$$
That is, we approximate $f(x)$ by $f_h(x) = f(0; \beta(x)) = \beta_0(x)$ where
$$\beta(x) = \beta(F; x) = \arg\min_\beta \int [f(z) - f(z - x; \beta)]^2 K_h(z - x)\, dz\,. \qquad (11.7.2)$$

(a) Show that $\beta(x) = \beta(F; x)$ is the functional
$$\beta(F; x) = H^{-1} C^{-1} a(F)$$
where $H = \operatorname{diag}(1, h, \ldots, h^p)$, $C = (s_{jk})$, $a(F) = (a_0(F), \ldots, a_p(F))^T$,
$$s_{jk} = h^{[1 - (j + k)]} \int_{(a - x)/h}^{(b - x)/h} u^{j + k} K(u)\, du\,, \qquad j, k = 0, 1, \ldots, p\,,$$
$$a_j(F) = \int \Big(\frac{z - x}{h}\Big)^j K_h(z - x)\, dF(z)\,, \qquad j = 0, \ldots, p\,.$$
Hint. This $\beta(x)$ satisfies the linear equations, for $k = 0, \ldots, p$,
$$\int (z - x)^k K_h(z - x)\, dF(z) = \sum_{j=0}^{p} \beta_j \int (z - x)^{j + k} K_h(z - x)\, dz\,.$$
Our approximation to $f(x)$ is the functional $\beta_0(F; x)$ with plug-in estimate $\hat f_h(x) = \beta_0(\hat F; x)$.

(b) Show that if $f(x)$ is a polynomial of order $p$ and if $\hat f_h(x)$ is the empirical plug-in estimate of $f(x)$ obtained from the local polynomial approximation to $f(x)$ of order $p$, then $\hat f_h(x)$ is an unbiased estimate of $f(x)$.

If $f(x)$ has a continuous $(p + 2)$th derivative at the point $x$ in its support, if $p$ is odd, and if $K$ is continuous and symmetric with support $[-1, 1]$, then it can be shown (Jiang and Doksum (2003)) that this $\hat f_h(x)$ converges in probability to $f(x)$ at the rate $O_P(n^{-a})$ with $a = (p + 1)/(2p + 3)$, provided the optimal $h$ (the one minimizing asymptotic MSE), $h \asymp n^{-1/(2p + 3)}$, is employed. This holds for boundary as well as interior points. Note that if $f(x)$ is infinitely differentiable, the rate of convergence of $\hat f_h(x)$ to $f(x)$ can be made arbitrarily close to the regular rate $n^{-1/2}$ by choosing a polynomial approximation with $p$ large.

Suppose we assume that $f(x)$ has a continuous $(p + 2)$th derivative. Are there estimates that achieve better rates of convergence than $O_P\big(n^{-(p+1)/(2p+3)}\big)$? The answer is no; see e.g. van der Vaart (1998).
7. Show that $\alpha(x; F)$ as given in Remark 11.3.1 reduces to $f_h(x; F) = \int K_h(x, z)\, dF(z)$ where $K_h(\cdot, \cdot)$ is defined by (11.3.7).
Hint.
$$E\big[(Z - x) f(Z) \mid X = x\big] = \frac{\int (z - x) K_h(z - x)\, dF(z)}{m_0(x; h)}\,, \qquad E(Z - x \mid X = x) = \frac{\int (z - x) K_h(z - x)\, dz}{m_0(x; h)} = \frac{h\, m_1(x, h)}{m_0(x, h)}\,, \text{ etc.}$$
8. In Example 11.3.2, show that $\int K_m(x, z)\, dx = 1$ when $B_0$ is a constant and $a$ and $b$ are finite.

9. Let $X_1, \ldots, X_{200}$ be a sample from a beta(2, 3) distribution. Consider the grid $\{0, 0.02, \ldots, 0.98, 1\}$.
(a) Plot the locally linear density estimate (11.3.6), using the algorithm in Remark 11.3.2 with $h = 0.1$. Also plot the beta(2, 3) density on the same figure.

(b) For comparison, repeat (a) with (11.3.6) replaced by (11.2.6).
Problems for Section 11.4

1. Consider a sample $X_1, \ldots, X_n$ from $f_2(x, \eta)$, $x \in R$, as given in (11.4.3) with $h(x) \equiv 1$, $B_1(x) = x$ and $B_2(x) = x^2 - 1$, the first two Hermite polynomials.

(a) Give the asymptotic covariance matrix of the MLE $\hat\eta$ of $\eta$.

(b) Suppose we estimate $\eta$ using the Rudemo criterion. We obtain
$$\hat\eta_R = \arg\min_\eta \Big[-2 \int f_2(x, \eta)\, d\hat P(x) + \int f_2^2(x, \eta)\, dx\Big]\,.$$
Give the asymptotic covariance matrix of $\hat\eta_R$.

(c) Compare your answers to (a) and (b). Because the MLE of a function of a parameter is the function of the MLE of this parameter, this comparison has implications for the comparison of $f_2(x, \hat\eta)$ and $f_2(x, \hat\eta_R)$. Give this implication.
Hint. $f_2(x, \eta)$ is a normal density.
Remark. If $X_1, \ldots, X_n$ is from $f(\cdot) \ne f_2(\cdot, \eta)$, it could be that $f(x, \hat\eta_R)$ has smaller IMSE than $f(x, \hat\eta)$.
2. Show that if $f(x) > 0$, $x \in R$, then for the nearest neighbor estimate (11.4.6), $p\lim_{|x| \to \infty} \tilde f_k(x) > 0$.
Hint. $2\hat h(k, x)$ is the difference $X_{(l)} - X_{(j)}$ of two order statistics where $(X_{(j)}, X_{(l)}]$ contains $k$ order statistics. Let $r = [n F(x)]$; then $l \le r + 2k$, $j \ge r - 2k$, $(l/n) \to F(x)$, and $(j/n) \to F(x)$.

3. Suppose $S(f) = [a, b]$ with $a$ and $b$ finite. Show that $\lim_{|x| \to \infty} \tilde f_k(x) = 0$.
Hint. See Problem 11.4.2. To contain $k$ order statistics, the length of the interval $(X_{(j)}, X_{(l)}]$ must tend to $\infty$ as $|x| \to \infty$.
4. Let $\tilde f_k(x)$ be the $k$th nearest neighbor density estimate. Assume the conditions of Theorem 11.4.3.1.

(a) Derive an expression of the form (11.2.12) for $\operatorname{MSE}[\tilde f_k(x)]$.

(b) Compare the solution to (a) preceding to $\operatorname{MSE}[\hat f_h(x)]$ as given by (11.2.12) for the case $K(u) = 1[|u| \le 1]/2$.

(c) Derive an expression of the form (11.2.14) for $\operatorname{IMSE}(\tilde f_k, f)$.

(d) Compare the solution to (c) preceding to $\operatorname{IMSE}(\hat f_h, f)$ as given by (11.2.14) for $K(u) = 1[|u| \le 1]/2$.
(e) Justify informally the approximation $k \cong 2 n h f(x)$ for $n$ large.

(f) Use (e) to informally find the $k$ that minimizes the IMSE (see (11.2.19)).

(g) Use (f) to informally find the $k$ that minimizes the IMSE for the $N(\mu, \sigma^2)$ reference distribution (see Section 11.2.3).
Problems for Section 11.5

1. Argue that $T$ as defined by (11.5.3) has an asymptotic $N(0, 1)$ distribution.

2. Let $X_1, \ldots, X_{1000}$ be a sample from a $\chi^2_5$ distribution. Plot $\hat f_h$ and the confidence intervals (11.5.2) at the grid points $\{0.5, 1.0, 1.5, 2, \ldots, 8\}$ using the $U(-1, 1)$ kernel and $h = 0.2$.
Problems for Section 11.6

1. Let
$$D_x\big(\mu(\cdot), \mu(\cdot; \beta)\big) = \frac{1}{f_h(x)} \int [\mu(z) - \mu(z; \beta)]^2 K_h(z - x)\, dF(z) = E\big\{[\mu(Z) - \mu(Z; \beta(X))]^2 \mid X = x\big\} \qquad (11.7.3)$$
where $(X, Z)$ has joint density $f(x)\, q_h(z \mid x)$, $x, z \in [a, b]$, with
$$q_h(z \mid x) = \frac{f(z)\, K_h(z - x)\, 1(x \in [a, b])}{f_h(x)}\,.$$
Show that minimizing (11.7.3) is equivalent to minimizing (11.6.2).
2. Show that if we select $\mu(z, \beta) \equiv \beta \in R$ in (11.6.3), then (11.6.3) yields $\hat\mu_{NW}(\cdot)$.
3. Establish (11.6.4).
4. Establish (11.6.25).
5. In Remark 11.6.1, give a type (11.6.10) expansion of $\operatorname{Bias}(\hat\mu_{NW}(x) \mid \mathbf X)$ when $K$ is a kernel of order 3.

6. In Remark 11.6.3, justify the formula for (a) $h_{\text{optimal}}$; (b) the conditional MSE of $\hat\mu_{NW}(x)$; (c) the optimal convergence rate $n^{-4/5}$ for the MSE in (b).
7. Establish (11.6.26).
8. Consider the model $Y_{ni} = \mu(x_{ni}) + \varepsilon_{ni}$, where $x_{n1} \le \cdots \le x_{nn}$, $E(\varepsilon_{ni}) = 0$, $\operatorname{Var}(\varepsilon_{ni}) = \sigma^2(x_{ni})$. Let $I_{nk}(x)$ denote the indices of the $k$ values of $x_{n1}, \ldots, x_{nn}$ closest to $x$, where ties are broken by including only the smallest index. Define the nearest neighbor estimate
$$\hat\mu(x) = \sum_{i \in I_{nk}(x)} Y_i / k\,.$$

(a) Show that
$$P\Big(\sup_x |\hat\mu(x) - E\hat\mu(x)| \le t\Big) \ge 1 - 8(kt)^{-2}\Big[\sum_{i=1}^{n-k} \sigma^2(x_{ni}) + \sum_{i=1}^{n} \sigma^2(x_{ni})\Big]\,.$$
Hint: Use Kolmogorov's inequality.

(b) Find bounds on $\sup_x |E\hat\mu(x) - \mu(x)|$. (Bjerve, Doksum and Yandell (1985) bound $\sup_x |\hat\mu(x) - \mu(x)|$ in probability.)
9. Nonparametric binary regression. Suppose we want to study the effect of a certain
treatment. Consider the following approach. Dosages x1 < . . . < xn are fixed. Dosage
xi is administered to the ith member of this sample and a binary response Yi is recorded;
e.g., Yi = 0 if the treatment does not have a beneficial effect, Yi = 1 if the treatment
has a beneficial effect. Assume that the Yi are independent and that Yi has the Bernoulli
distribution with success probability θ(xi ), i = 1, . . . , n. Let x be some fixed dosage level.
Consider the following nearest neighbor estimate of θ(x):

θ̂(x) = (1/k) Σ_{j=i+1}^{i+k} Y_j ,   where x ∈ I_i , i = 0, . . . , n − k ,

I_0 = ( −∞, (1/2)(x_1 + x_{1+k}) ]
I_i = ( (1/2)(x_i + x_{i+k}), (1/2)(x_{i+1} + x_{i+k+1}) ] , i = 1, . . . , n − k − 1
I_{n−k} = ( (1/2)(x_{n−k} + x_n), ∞ ) .

Show that

Pr( sup_{−∞<x<∞} |θ̂(x) − Eθ̂(x)| ≤ ε ) ≥ 1 − n/(ε²k²) .

Hint: Use Kolmogorov's inequality.
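In code, the estimate of Problem 9 can be sketched as follows; the dosages and responses below are illustrative values, not from the text.

```python
# Sketch of the nearest neighbor estimate theta_hat(x) of Problem 9.
# xs holds the sorted dosages x_1 < ... < x_n; ys the binary responses.
def theta_hat(x, xs, ys, k):
    n = len(xs)
    # The midpoints (x_i + x_{i+k})/2 separate the intervals I_0, ..., I_{n-k};
    # the interval containing x is found by counting midpoints below x.
    mids = [(xs[i] + xs[i + k]) / 2 for i in range(n - k)]
    i = sum(1 for m in mids if m < x)
    return sum(ys[i:i + k]) / k   # average of Y_{i+1}, ..., Y_{i+k}

xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # illustrative dosages
ys = [0, 0, 1, 1, 1]              # illustrative binary responses
print(theta_hat(0.0, xs, ys, 2))  # x in I_0 -> 0.0
print(theta_hat(6.0, xs, ys, 2))  # x in I_{n-k} -> 1.0
```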
Chapter 12

PREDICTION AND MACHINE LEARNING

12.1 Introduction
In this chapter we face the major challenge of modern statistics: devising and analyzing methods of inference for high dimensional complex data. The issues we discuss in this chapter are:
1. Prediction with high dimensional covariates
Under this heading we will consider classification, regression and prediction problems now associated with machine learning and statistical learning. The methods we shall briefly discuss include ones easily related to statistical methods, such as the univariate curve estimation of Chapter 11 and the logistic regression of Sections 6.4.3 and 6.5, and ones more closely related to machine learning, such as support vector machines and boosting.
2. Regularization
A central theme, as in the last chapter, is regularization: how do we control overcomplexification of models and procedures, which both theoretically and practically leads to poor performance? We will discuss major semiparametric and parametric methods such as those based on sieves (Section 9.1.4), shrinkage (Section I.6), and penalized least squares.
3. Dimension reduction and model selection
While prediction and classification are the primary goals in many engineering and marketing applications, the goal in science is usually to try to isolate key factors which can
be interpreted. Factors with subject matter interpretations, if available, can lead to the
construction of better predictors and scientific insights. Isolating key factors is usually
identified in statistics with model selection and variable selection — see Section I.7.
4. Nonparametric prediction and classification. The Bayes classifier.
We were presented in Section 1.4 with the classical prediction problem: Given a d
dimensional vector Z = (Z1 , . . . , Zd )T of covariates, predict the value of a response Y . In
that section we showed that for quadratic loss, the best predictor is
µ(Z) = E(Y |Z).
In the "known P" case, the joint distribution of (Z, Y) is known and µ can be computed. In practice, when P is unknown, we can construct an estimate µ̂_n(·) based on a sample (Z_1, Y_1), . . . , (Z_n, Y_n) i.i.d. as (Z, Y) ∼ P. When k > n and Z_k is known but Y_k is unknown, we can use µ̂(Z_k) to predict Y_k from Z_k, where (Z_k, Y_k) ∼ P. In this context, (Z_1, Y_1), . . . , (Z_n, Y_n) is referred to as a training sample.

We may try to estimate µ(z) ≡ E(Y|Z = z) by µ̂(z) ≡ Ê(Y|Z = z), where Ê is the expected value under the empirical probability P̂, which assigns probability n^{−1} to each observed data vector (z_i, y_i) and 0 elsewhere. If Z is discrete with a small number of values, this is a parametric problem and our choice of µ̂(·) is reasonable if we observe all possible values of Z. However, if, as is typically the case, Z is continuous, then the estimate µ̂(z) is undefined for unobserved z. To go further we need to assume some special properties of µ(·), most naturally, smoothness.
In some parts of this chapter we will consider generalizations of the problem of estimating a nonparametric regression that we discussed in Section 11.6. The major differences
are:
i) The Y to be predicted can be categorical so that 0 − 1 loss as discussed in Section 1.3
is appropriate. That is, we have a problem of classification where Y can take on values
1, . . . , C corresponding to C distinct categories. Throughout we consider the loss function
l(j, d) = 0 if d = j
= 1 otherwise
where d is the predicted value (class) and j is the actual value. Then, if p_j(z) is the conditional density of Z given Y = j, and the "prior" probability is P[Y = j] = π_j, the conditional Bayes risk of a classifier d given Z = z is

E[l(Y, d)|z] = 1 − P(Y = d|z) = 1 − π_d p_d(z)/q(z)

where q(z) is the density of Z. Thus the conditional Bayes risk is minimized by

δ_B(Z) = arg max_j P[Y = j|Z] = arg max_j π_j p_j(Z) .   (12.1.1)

This δ_B is called the Bayes classifier. We also see that the minimum Bayes risk is

R_B ≡ 1 − E[ max_j {π_j p_j(Z)}/q(Z) ] .   (12.1.2)
Since P[Y = j|Z] = E[1(Y = j)|Z], we see that, in some ways, the classification problem can be viewed as the same as estimating the regression of 1(Y = j) on Z from a sample, where the sample (Z_1, Y_1), . . . , (Z_n, Y_n) is i.i.d. as (Z, Y) ∼ P with Y ∈ {1, . . . , C}. The sample is used to construct a method that can be used to classify Y_k from Z_k for k > n, where (Z_k, Y_k) ∼ P. This version of classification as estimation is somewhat misleading in some frameworks, as we shall see in Section 12.2.4 when we consider boosting.
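As a small illustration in code, the Bayes classifier (12.1.1) can be computed directly when the π_j and p_j are known; the normal class densities and priors below are assumptions for the example, not part of the text.

```python
import math

# The Bayes classifier delta_B(z) = argmax_j pi_j p_j(z) in the "known P" case.
# Two classes with illustrative N(mu_j, 1) densities and priors pi_j.
def normal_pdf(z, mu):
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2 * math.pi)

def bayes_classifier(z, priors, mus):
    scores = [pi * normal_pdf(z, mu) for pi, mu in zip(priors, mus)]
    return 1 + scores.index(max(scores))   # classes labeled 1, ..., C

priors, mus = [0.5, 0.5], [-1.0, 1.0]
print(bayes_classifier(-2.0, priors, mus))  # -> 1
print(bayes_classifier(0.5, priors, mus))   # -> 2
```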
ii) The predictor Z is potentially very high dimensional. The coordinates Z1 , . . . , Zd can
be categorical or ordinal as well as real but what matters is that there are a lot of them. The
high dimensional case is important because of new technologies that generate vast amounts
of data.
It is worth noting that high dimensionality is implicit in the case of one dimensional regression as well. As is well known in prediction, covariates can always be added to a regression. Thus, if Z ∈ [0, 1], say, then E(Y|Z) = E(Y|Z, Z², . . . , Z^p), and estimation of the second form leads naturally to linear regression of Y on the (p + 1)-vector (1, Z, . . . , Z^p)^T, a method we considered in Section 11.6.2, for smooth µ(Z). Similarly, if Z takes on a large number of known categories, the categories can, for instance, be coded by e_1, . . . , e_d, the standard basis vectors of R^d. That is, e_1 = (1, 0, 0, . . . , 0)^T, e_2 = (0, 1, 0, . . . , 0)^T, . . . , e_d = (0, 0, . . . , 0, 1)^T. Now if C_j denotes the jth category, E(Y|Z ∈ C_j) = E(Y|e_j).
iii) In addition to estimation of the regression E(Y |Z), the prediction of Y is of major
concern. In particular, prediction error plays a key role in model and tuning parameter
selection based on cross validation.
✷
Remark 12.1.1. Prediction versus estimation. This chapter examines ways of constructing good predictors; however, the discussion is sometimes in terms of estimation. This is because the problem of selecting a good predictor is sometimes equivalent to selecting a good estimator. For squared error loss this is clear because if µ̂(·) is based on X = {(Z_1, Y_1), . . . , (Z_n, Y_n)} and we are trying to predict Y_{n+1} from Z_{n+1}, where (Z_{n+1}, Y_{n+1}) is independent of X, then, for µ(Z) = E(Y|Z),

Prediction MSE = E[Y_{n+1} − µ̂(Z_{n+1})]²
               = E[Y_{n+1} − µ(Z_{n+1})]² + E[µ(Z_{n+1}) − µ̂(Z_{n+1})]²
               = constant + estimation IMSE

because the cross product term is zero by the iterated expectation theorem. Thus optimal prediction is equivalent to optimal estimation in this case. (Here the constant does not depend on the procedure µ̂, but it contributes a substantial amount to the prediction MSE of any predictor. Thus while estimation IMSE typically tends to zero, prediction MSE does not.) ✷
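The decomposition can be checked by simulation; the model µ(z) = 2z and the fixed, deliberately biased predictor µ̂(z) = 2z + 0.5 below are illustrative assumptions, not from the text.

```python
import random

# Monte Carlo check of Prediction MSE = constant + estimation error, with
# mu(z) = 2z known and a hypothetical fixed predictor muhat(z) = 2z + 0.5.
random.seed(0)
mu = lambda z: 2 * z
muhat = lambda z: 2 * z + 0.5

n = 200_000
pred_mse = noise = est = 0.0
for _ in range(n):
    z = random.random()
    y = mu(z) + random.gauss(0, 1)      # Y = mu(Z) + N(0, 1) noise
    pred_mse += (y - muhat(z)) ** 2
    noise += (y - mu(z)) ** 2           # the irreducible "constant" term
    est += (mu(z) - muhat(z)) ** 2      # the estimation error term
print(round(pred_mse / n, 2), round(noise / n + est / n, 2))  # both near 1.25
```

The cross product term averages to zero, so the two printed quantities agree up to Monte Carlo error.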
The formulation given so far is the center of our discussion in this chapter, which examines most of the major approaches to regularization, dimension reduction, model selection, and the construction of prediction rules, regression procedures, and classifiers. We next add more introductory details.
12.1.1 Statistical Approaches to Modeling and Analyzing Multidimensional Data. Sieves
As in Section 12.1, let (Z1 , Y1 ), . . . , (Zn , Yn ) be i.i.d. as (Z, Y ) ∼ P with Z = (Z1 , . . .,
Zd )T and Y ∈ R. Consider the problems of estimating µ(z) = E(Y |Z = z), predicting Y
from Z, or using Z to classify Y . Two methods are:
a) Generalizations of kernel regression methods. These methods for the estimation of
E(Y |z) use product kernels to replace the univariate ones of Section 11.6.2. See Section
12.2.1.
b) Regularization methods based on sieves. We pursue the regularization ideas introduced in Section 9.1.4. Thus for P in a general class of models, we approximate µ(Z) = E_P(Y|Z) by postulating a sequence of parametric models {P_j} for (Z, Y) with members P_j of P_j identified by a k_j dimensional parameter θ^{(j)}(P) which is defined for all P. We assume that each P_j is regular as defined in Section 1.1.3. Define µ_j(Z) = E_{P_j}(Y|Z). We require that for any P in the nonparametric class we consider, as j → ∞,

µ_j(Z) → µ(Z) ,

where "→" denotes convergence in an appropriate metric. The parameter θ^{(j)}(P) is defined by minimum contrast; e.g., if p_j denotes the density of P_j, we could use the Kullback–Leibler contrast and set

θ^{(j)}(P) = arg min_{θ^{(j)}} − ∫ log p_j(z, y) dP(z, y) .
We can then generate a sequence {µ̂_j(·)} of estimates or predictors with µ̂_j(·) ≡ E_{P̂_j}(Y|Z) and P̂_j ∈ P_j corresponding to θ^{(j)}(P̂), where P̂ denotes the empirical distribution function based on (Z_1, Y_1), . . . , (Z_n, Y_n). Here regularization is achieved by using a sieve made up of regular parametric models. The index j is called a regularization or tuning parameter and is chosen to minimize an estimate of risk. We postpone the issue of choosing ĵ ≡ j(P̂) to obtain our final estimate µ̂_ĵ(·); see Section 12.5. There is a huge literature on regularization; see Bickel and Li (2006) for discussion and a bibliography. We now turn to several special cases of sieves.
(1) Gaussian Sieves

A natural special case, if, say, Z ∈ I^d where I ≡ [0, 1], is to choose an orthonormal basis {f_k(·)}_{k≥1} with f_1 ≡ 1,

∫_{I^d} f_a(z)f_b(z)dz = 1(a = b) ,

and postulate as model P_j with dimension k_j

Y = Σ_{k=1}^{k_j} θ_k^{(j)} f_k(Z) + ε   (12.1.3)

where ε ∼ N(0, σ²) independent of Z, and θ^{(j)} = (θ_1^{(j)}, . . . , θ_{k_j}^{(j)})^T is defined as the minimizer of mean squared prediction error as in Theorem 1.4.4.
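A minimal sketch of fitting (12.1.3) in code, using the cosine basis on [0, 1], which is orthonormal with f_1 ≡ 1; for an orthonormal basis the coefficients can be estimated by empirical inner products, sidestepping a least squares solve. The data-generating values are illustrative assumptions.

```python
import math, random

# Sieve model (12.1.3) with the cosine basis on [0, 1]:
# f_1 = 1, f_{k+1}(z) = sqrt(2) cos(k pi z), orthonormal in L2[0, 1].
# For an orthonormal basis, theta_k is estimated by the empirical inner
# product (1/n) sum_i Y_i f_k(Z_i).
random.seed(1)
def f(k, z):
    return 1.0 if k == 0 else math.sqrt(2) * math.cos(k * math.pi * z)

n, kj = 5000, 4
zs = [random.random() for _ in range(n)]
# illustrative truth: Y = f_1(Z) + f_2(Z) + small Gaussian noise
ys = [1.0 + math.sqrt(2) * math.cos(math.pi * z) + random.gauss(0, 0.1) for z in zs]
theta = [sum(y * f(k, z) for z, y in zip(zs, ys)) / n for k in range(kj)]
print([round(t, 1) for t in theta])  # approximately [1.0, 1.0, 0.0, 0.0]
```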
(2) Penalized Least Squares and Ridge Regression
Having chosen {f_k} as in (12.1.3), it is natural to consider a Lagrangian version of the method of sieves which simplifies computation and makes the need for regularization explicit. Choose a function J(θ^{(j)}), J : R^{k_j} → R^+, and λ > 0; and in model P_j minimize, over θ^{(j)},

(1/n) Σ_{i=1}^n ( Y_i − Σ_{k=1}^{k_j} θ_k^{(j)} f_k(Z_i) )² + λJ(θ^{(j)}) .

The population parameter θ_λ^{(j)}(P) in model P_j is defined by

θ_λ^{(j)}(P) = arg min { ∫ (y − [θ^{(j)}]^T f^{(j)}(z))² dP(z, y) + λJ(θ^{(j)}) : θ^{(j)} } ,

where f^{(j)}(·) = (f_1(·), . . . , f_{k_j}(·))^T and P(·, ·) is the probability distribution of (Z, Y). The empirical plug-in estimate is now θ̂_λ^{(j)} = θ_λ^{(j)}(P̂). Note that this approach still falls under our general formulation of empirical plug-in estimation.

A choice we pursue further is ridge regression, which corresponds to

J(θ^{(j)}) = (1/2) Σ_{k=1}^{k_j} [θ_k^{(j)}]² = (1/2) |θ^{(j)}|² .

Another interesting choice is

J(θ^{(j)}) = Σ_{k=1}^{k_j} |θ_k^{(j)}| ,

which corresponds to the Lasso.
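For a single basis function the penalized criterion can be minimized in closed form, which makes the shrinkage effect of the ridge penalty visible; the data below are illustrative.

```python
# One-dimensional sketch of penalized least squares with the ridge penalty
# J(theta) = theta^2/2: minimize (1/n) sum_i (Y_i - theta*f(Z_i))^2 + lam*J.
# Setting the derivative to zero gives the closed form below; the penalty
# enters the denominator, shrinking the estimate toward 0 as lam grows.
def ridge_1d(fs, ys, lam):
    n = len(ys)
    num = sum(y * f for y, f in zip(ys, fs)) / n
    den = sum(f * f for f in fs) / n + lam / 2
    return num / den

fs = [1.0, 2.0, 3.0]   # illustrative values f(Z_i)
ys = [2.0, 4.0, 6.0]   # exact fit theta = 2 when lam = 0
print(ridge_1d(fs, ys, 0.0))        # 2.0, up to float rounding
print(ridge_1d(fs, ys, 1.0) < 2.0)  # shrunk toward 0 -> True
```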
(3) Bayes and Penalized Least Squares
Ridge regression has a natural Bayesian interpretation. Consider the sequence of models {P_j} as a dense subset of the grand model,

Y = Σ_{k=1}^∞ θ_k f_k(Z) + ε   (12.1.4)

where ε ∼ N(0, σ²). Now assume that θ_1, . . . , θ_j are i.i.d. N(0, 1/λ) and θ_k = 0, k > j. Then, for fixed j and λ, the ridge regression estimate θ̂_λ^{(j)} is just the mode of the posterior distribution of θ^{(j)} given the data.
This formulation suggests that other choices of prior distribution might also give interesting results. A class of possible penalties corresponding to priors in this way is

J_r(θ) = Σ_{k=1}^j |θ_k|^r ,   0 < r < ∞ ,

with

J_0(θ) = Σ_{k=1}^j 1(|θ_k| ≠ 0)

defined by a limiting process as r → 0 and not corresponding to a proper prior. Among J_r(θ), r > 0, the penalty J_1(θ), which corresponds to θ_1, . . . , θ_j i.i.d. with prior π(θ) = (1/2)e^{−|θ|}, the Laplace or double exponential distribution, has particularly interesting properties. See Section 12.2.3.
(4) Logistic Sieves. Linear Classifiers
Another important sieve suitable for classification problems is based on logistic regression. Here Y takes on the values 1, . . . , C. The functions to be estimated in the Bayes classifier (12.1.1) are π_a p_a(Z), a = 1, . . . , C, where π_a = P(Y = a) and p_a(Z) is the density of Z given Y = a. Of these, if we have a training sample, π_a is estimated by n^{−1} Σ_{i=1}^n 1(Y_i = a), the fraction of class a in the training sample, while estimating p_a(Z) is the problem of estimating a general multivariate density from {Z_i : Y_i = a}.
Alternatively we can think of, from the beginning, estimating (π(1|Z), . . . , π(C|Z)), where

π(a|Z) ≡ P[Y = a|Z] = π_a p_a(Z) / Σ_{k=1}^C π_k p_k(Z) ,

using a generalized linear logistic regression model (see Sections 1.6.2, 6.4.3, 6.5 and Hastie, Tibshirani and Friedman (2001, 2009)). That is, given functions f_1(Z), . . . , f_{k_j}(Z) and 1 ≤ a ≤ C, specify conditionally, with π ≡ (π_1, . . . , π_C), θ ≡ {θ_{a,k}^{(j)} : 1 ≤ k ≤ k_j, 1 ≤ a ≤ C}, the jth model as

π(a|Z, θ, π) = π_a exp{ Σ_{k=1}^{k_j} θ_{a,k}^{(j)} f_k(Z) } / Σ_{l=1}^C π_l exp{ Σ_{k=1}^{k_j} θ_{l,k}^{(j)} f_k(Z) } ,   (12.1.5)

the logistic regression model.
In the jth model P_j, the classifier using training data based on (12.1.5) classifies an unknown Y corresponding to covariate vector Z by

δ(Z, θ̂) = a if Σ_{k=1}^{k_j} ( θ̂_{a,k}^{(j)} − θ̂_{l,k}^{(j)} ) f_k(Z) ≥ log( π̂_l / π̂_a )   (12.1.6)

for all l, with ties broken arbitrarily. Here θ̂ and π̂ are obtained by maximum likelihood from

L(θ, π) = Π_{i=1}^n Π_{a=1}^C π(a|Z_i, θ, π)^{ε_{ia}} ,

where ε_{ia} = 1(Y_i = a). Thus π̂_a = Σ_i ε_{ia}/n and θ̂ is the MLE for a canonical exponential family — see Example 1.6.7 and Section 2.3, for instance.
The classifier (12.1.6) is an example of a linear classifier, that is, one such that log( π̂(a|Z)/π̂(l|Z) ) is a linear combination of functions f_k(Z), 1 ≤ k ≤ k_j. Geometrically, we can think of such classifiers as classifying a new point Y with given Z in class l if (f_1(Z), . . . , f_k(Z))^T is on one side of a hyperplane in R^k and "not l" if it is on the other side. We have already seen one example of such a classifier, Fisher's linear discriminant function in Section 4.2.
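In code, the rule (12.1.6) amounts to maximizing log π̂_a + Σ_k θ̂_{a,k} f_k(Z) over classes a; the "fitted" values below are illustrative stand-ins, not maximum likelihood estimates.

```python
import math

# Linear classifier of the form (12.1.6): assign the class a maximizing
# log(pihat_a) + sum_k thetahat_{a,k} f_k(Z). Coefficients are illustrative.
def linear_classify(fz, theta_hat, pi_hat):
    # fz = (f_1(Z), ..., f_kj(Z)); theta_hat[a] holds class a's coefficients
    scores = [math.log(p) + sum(t * v for t, v in zip(th, fz))
              for p, th in zip(pi_hat, theta_hat)]
    return 1 + scores.index(max(scores))   # classes labeled 1, ..., C

theta_hat = [[1.0, 0.0], [-1.0, 0.5]]      # C = 2 classes, kj = 2 basis functions
pi_hat = [0.5, 0.5]
print(linear_classify([2.0, 0.0], theta_hat, pi_hat))   # -> 1
print(linear_classify([-2.0, 1.0], theta_hat, pi_hat))  # -> 2
```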
12.1.2 Machine Learning Approaches

The machine learning point of view, as discussed in Section 11.4.2, is to specify a parametric family D of possible decision procedures, based on predictors in our case, and select a member by some optimization criterion which may or may not include a penalty for complexity (overfitting). The focus is, in particular, on classification.
a) Neural Nets
The motivation behind neural nets is a primitive notion of how human stimulus-response learning proceeds. Abstracted, it is relatively simple. If Z_{d×1} is the vector of predictors, let the classes that a corresponding Y can fall in be {1, . . . , C}. By definition, a neural net with one hidden layer of M interior nodes, based on {(Z_1, Y_1), . . . , (Z_n, Y_n)}, outputs a vector {0 ≤ π̂(j|Z) : 1 ≤ j ≤ C, Σ_{k=1}^C π̂(k|Z) = 1} of estimated posterior probabilities satisfying

log( π̂(j|Z)/π̂(1|Z) ) = ( β̂_j − β̂_1 )^T σ( ÂZ* ) .   (12.1.7)

Here the β̂_j are (M + 1) × 1, Â is (M + 1) × (d + 1), Z* = (1, Z^T)^T, and σ(ω) = (σ(ω_1), . . . , σ(ω_{M+1}))^T, where σ(ω) = (1 + e^{−ω})^{−1} and ω^T = (ω_1, . . . , ω_{M+1}). Essentially the classifier is obtained by a fitting process from all classifiers based on

arg max_j π(j|Z, A, B)

where A is (M + 1) × (d + 1), B = (β_1, . . . , β_C) is (M + 1) × C, and π(j|Z, A, B) is given by (12.1.7) with the "hats" removed.
Neural nets essentially use logistic regression applied to a nonlinear coordinatewise transformation of an affine transformation of Z into R^{M+1}. The second optimization problem is not convex. The tuning parameter here is M, the number of nodes in the hidden layer parametrized by A. It may be shown that as M → ∞ the functions log( π(j|Z, A, B)/π(1|Z, A, B) ) can approximate any log( π(j|Z)/π(1|Z) ). Neural nets may also be applied to the estimation of the regression mean µ(z) = E(Y|Z = z); see Problem 2.3.41. We do not pursue this method. A full discussion may be found, for instance, in Ripley (1996) and Hastie et al. (2001, 2009).
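A forward pass of the one-hidden-layer net behind (12.1.7) can be sketched as follows; the weights A and B are illustrative constants, not fitted values.

```python
import math

# Forward pass of a one-hidden-layer net: pi(j|Z, A, B) is a softmax over
# logits beta_j . sigma(A z*), which reproduces the log-odds form (12.1.7).
def net_probs(z, A, B):
    sigma = lambda w: 1.0 / (1.0 + math.exp(-w))         # logistic squashing
    zstar = [1.0] + list(z)                              # Z* = (1, Z^T)^T
    hidden = [sigma(sum(a * x for a, x in zip(row, zstar))) for row in A]
    logits = [sum(b * h for b, h in zip(beta, hidden)) for beta in B]
    ws = [math.exp(l - max(logits)) for l in logits]     # stable softmax
    return [w / sum(ws) for w in ws]

A = [[0.0, 1.0], [1.0, -1.0]]   # (M+1) x (d+1), here M = 1 and d = 1
B = [[1.0, 0.0], [0.0, 1.0]]    # one beta_j per class, C = 2
p = net_probs([2.0], A, B)
print(round(sum(p), 6))         # posterior probabilities sum to 1 -> 1.0
```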
b) Support Vector Machines
This method, introduced by Vapnik (1996), is limited to classification. We discuss it initially for the case of C = 2 classes, here denoted by "−1" and "1". These are, in fact, linear classifiers, but as with neural nets, follow a nonlinear transformation of Z from R^d to a typically much higher dimensional space R^M. As we remarked earlier, given any Z, we can always choose an orthonormal basis in L_2(P), {f_k(Z), k ≥ 1}, approximating any L_2(P) function g of Z in the L_2 sense. That is, we can formally write

g(Z) = Σ_{k=1}^∞ a_k(g) f_k(Z)   (12.1.8)

where a_k(g) = E g(Z)f_k(Z).

Now consider a two class {−1, 1} classification problem, where the Bayes rule is δ(Z) = 2·1( π(Z) ≥ 1/2 ) − 1 with π(Z) ≡ P[Y = 1|Z]. Then, if we represent log π(·) as in (12.1.8), the Bayes rule is a linear classifier in R^∞ determined by the hyperplane {(a_1, a_2, . . .) : Σ_k a_k f_k(Z) = log(1/2)}. All points on one side of the hyperplane are classified as Y = 1 and the others as Y = −1. In fact, it is possible to show (Problem 12.1.2) that if we map Z_1, . . . , Z_n, Z_j ∈ R^d, in a 1–1 fashion to R^M, M >> max(d, n), and if Z*_i are the R^M images of the Z_i, then {Z*_i : Y_i = 1} and {Z*_i : Y_i = −1} can be separated by a hyperplane. That is, we can construct a linear classifier that classifies the Y's in a training sample (Z_1, Y_1), . . . , (Z_n, Y_n) perfectly. There are evidently many such hyperplanes if one exists. Vapnik (1996) proposed using the separating hyperplane which maximizes the minimum distance between points of the training sample and the hyperplane. This problem, it turns out, can be cast as one in quadratic programming and is then rapidly and uniquely solvable.
The original form of support vector machines, for C = 2, is valid only in a context where there is perfect separation of the two classes in the training sample. However, the introduction of suitable slack variables in the program permits regularization, i.e., leads to the choice of a hyperplane which permits a specified minimal number of misclassifications of the training sample. We shall go further into the theory of this linear classifier in Section 12.2.4.
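In code, classification by a hyperplane and the margin being maximized can be sketched as follows; the weights below are illustrative, not the quadratic programming solution.

```python
import math

# Linear classification in feature space: sign of w . z + b. The margin is
# the minimum distance |w . z_i + b| / |w| over the training sample, which
# the support vector machine maximizes; w and b here are illustrative.
def classify(w, b, z):
    return 1 if sum(wj * zj for wj, zj in zip(w, z)) + b >= 0 else -1

def margin(w, b, zs):
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(abs(sum(wj * zj for wj, zj in zip(w, z)) + b) for z in zs) / norm

zs = [[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]]   # illustrative sample
ys = [-1, -1, 1, 1]
w, b = [1.0, 1.0], -2.5
print(all(classify(w, b, z) == y for z, y in zip(zs, ys)))  # separates -> True
print(round(margin(w, b, zs), 3))                           # -> 1.061
```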
c) Boosting
Schapire (1990), Freund and Schapire (1997), and others proposed a method of classifying using larger and larger linear combinations of so-called weak classifiers. This method
has since been identified (Friedman, Hastie and Tibshirani (2000)) as an algorithm of a
known type (Gauss-Southwell) for minimizing a data-based convex objective function. See
Section 12.2.4.
d) Tree-based Methods
These are hybrid methods introduced independently by Breiman, Friedman, Olshen and Stone (1984), Quinlan (1987), and earlier by Morgan and Sonquist (1963). They may be viewed
as constructing iteratively a partition of the covariate space Rd into blocks and then, for
regression, estimating conditional expectations of Y in each block or, for classification,
estimating the conditional probabilities of belonging to class j ∈ {1, . . . , C} given that Z
belongs in a block. Again we shall discuss this method in Section 12.2.4.
12.1.3 Outline
Here is the rest of our outline for this chapter. In Section 12.2 we describe methods mentioned in Section 12.1 in more detail and start to give their properties. In Sections 12.3 and 12.4 we shall focus on asymptotic theory, both from a statistical (minimax theorems) and a machine learning (oracle inequalities) point of view, and focus on the critical implication that "sparsity" in some sense is needed for any statistical success when the covariate space R^d is high dimensional. Sparsity in this context means roughly that either the population regression function depends on a small subset {Z_j : j ∈ S} of covariates, or that the function takes on a simple form, e.g., is a sum of functions of one variable Z_j only. In other words, the regression function is much less complex than a general function on R^d.

In Section 12.5 we introduce cross validation as a tool for tuning parameter choice, and in Section 12.6 we discuss model selection and dimension reduction. In this context we shall discuss Bayesian model selection and principal component analysis. Finally, Section 12.7 discusses multiple testing and has pointers to work not covered in this book.
Summary. In this section, we have introduced the prediction framework both in the context
of classification and regression, and
i) Statistical approaches to these problems, in particular, sieves, penalization, and empirical Bayes methods.
ii) Machine learning approaches to these problems, in particular, support vector machines, boosting, and tree-based methods.
12.2 Classification and Prediction
In this section, we give more detailed descriptions and properties of the approaches outlined in Section 12.1. However, first we introduce, in Section 12.2.1, tools to be used for
classification and prediction. These include a multivariate density estimate that is used
in classification procedures and a nonparametric statistical method based on multivariate
kernel regression that can be used to predict a response Y from a vector x of observed covariate values. In Sections 12.2.2, 12.2.3, and 12.2.4 we describe classification procedures
based on nonparametric methods, sieves, and machine learning approaches.
12.2.1 Multivariate Density and Regression Estimation
The statistical methods in Chapter 11 for univariate nonparametric estimation will be generalized straightforwardly in this section to the multivariate case with continuous X.
(a) Density Estimates
Let X1 , . . . , Xn be i.i.d. as X ∼ F , where X ∈ Rd . We assume that X has a density
f (x) with convex support S(f ) equal to the closure of {x : f (x) > 0}. All the methods
we have considered for estimating f when d = 1 generalize naturally. Here are two of the
generalizations.
(1) Convolution kernel estimates

Let K(u) be a multivariate kernel, that is, a map R^d → R with

∫_{R^d} K(u)du = 1 .

A kernel is said to be of order r if

∫_{R^d} Π_{j=1}^d u_j^{i_j} K(u)du = 0 , for all i_j with 1 ≤ i_1 + · · · + i_d ≤ r.
For simplicity, we consider a product kernel

K(u) = Π_{j=1}^d K_j(u_j) ,

where each K_j is a 1 dimensional kernel. A product kernel is of order r iff each of its components is. Since scale may vary by coordinate, we specify a vector of bandwidths h = (h_1, . . . , h_d)^T. Thus, as for d = 1, we define

K_h(u) = Π_{j=1}^d h_j^{−1} K_j(u_j/h_j) .

Often K_j(u) = K_1(u), j = 1, . . . , d, and K_1(u) is chosen as the normal density ϕ, or, if compact support is desired, K_1(u) = (1/2)1(|u| ≤ 1) or K_1(u) = (3/4)(1 − u²)1(|u| ≤ 1).
The empirical distribution is

F̂(x) = n^{−1} Σ_{i=1}^n 1[X_i ≤ x]

where X_i ≤ x is componentwise inequality, x ∈ R^d. We define

f_h(x, F) = ∫ K_h(x − z)dF(z)   (12.2.1)

and the (plug-in) convolution kernel density estimate,

f̂_h(x) ≡ f_h(x, F̂) = ∫ K_h(x − z)dF̂(z) = (1/n) Σ_{i=1}^n K_h(x − X_i) .   (12.2.2)
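A sketch of (12.2.2) in code, with the Epanechnikov kernel K_1(u) = (3/4)(1 − u²)1(|u| ≤ 1) in each coordinate; the data are illustrative.

```python
# Product-kernel density estimate (12.2.2) with Epanechnikov components.
def K1(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def fhat(x, data, h):
    # x and h are length-d vectors; data is a list of length-d observations
    total = 0.0
    for xi in data:
        prod = 1.0
        for j in range(len(x)):
            prod *= K1((x[j] - xi[j]) / h[j]) / h[j]
        total += prod
    return total / len(data)

data = [[0.0, 0.0], [0.1, -0.1], [2.0, 2.0]]   # illustrative sample in R^2
h = [1.0, 1.0]
print(fhat([0.0, 0.0], data, h) > fhat([2.0, 2.0], data, h))  # -> True
```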
Bias, variance, MSE, and IMSE results are completely analogous to the d = 1 case; see Silverman (1986), Scott (1992), and the problems. Briefly, let

D²f(x) = ( ∂²f(x)/∂x_i∂x_j )_{d×d} ,   F = {f : |D²f(x)| ≤ M for all x ∈ R^d}.
Then, if K has compact support and is of order 1,

E_F f̂_h(x) = f(x) + O(|h|²)   (12.2.3)

Var_F f̂_h(x) = (n|h|^d)^{−1} f(x) ∫ K²(u)du (1 + o(1))   (12.2.4)

as n → ∞, h → 0, uniformly for f ∈ F. Note that the dimension d inflates the variance by the factor |h|^{−d}. If K_j ≡ K_1, j = 1, . . . , d, with K_1 of order 1, if 0 < ∫ |D²f(x)|² dx < ∞, and we select h = h(1, . . . , 1)^T to minimize the approximate IMSE

AIMSE f̂_h(x) ≅ (1/4) ( ∫ |t|²K(t)dt )² ( ∫ |D²f(x)|²dx ) h⁴ + ( ∫ K²(u)du ) (nh^d)^{−1}   (12.2.5)

then we obtain the optimal asymptotic rate

h ≍ n^{−1/(d+4)} .

For this h, the minimum IMSE is of order n^{−4/(d+4)} (Problem 12.2.2).
Remark 12.2.1. For d large, say d ≥ 4, the rate of convergence of f̂_h(x) to f(x) is no better than n^{−1/4}. Thus if it takes 100 observations to achieve desired precision in a parametric setting, this suggests it will take at least 10,000 in this high dimensional nonparametric setting. We can do no better with any other method (van der Vaart (1998), Section 24.3) unless we use stronger regularity conditions. As in the one dimensional case, the optimal rates become better if one assumes f has bounded derivatives of order p > 2, but there is no data-dependent way of checking such assumptions. This gloomy picture is relieved by the observation that data distributions tend to stay close to low dimensional manifolds and thus the worst case analyses (nonparametric for d large) are much too conservative.
Cross validation methods to estimate h are given in Section 12.5 and the reference
distribution approach is given in Problem 12.2.2.
As in the one dimensional case, we can eliminate difficulties in estimation at the boundary when S(f) is a convex set by using local polynomial estimates or, more generally, by plugging into the minimizer of D_{h,x}(f(·), f(·, β)) defined as in Example 11.3.1, with the univariate integral over (a, b) replaced by an integral over S(f) ⊂ R^d, and with K_h now a multivariate kernel. Details are given in Problem 12.2.3.
(2) Multivariate nearest neighbor estimates

Let K(u) = 1(|u| ≤ 1)/V(d), where V(d) is the volume of the unit sphere in R^d (not a product kernel), and set

K_h(u) = h^{−d} K(u/h) .

Then, if f̂_h(x) is defined by (12.2.2),

f̂_h(x) = F̂({z : |z − x| ≤ h}) / V(d)h^d .

If we take h = ĥ(k, x) to be the smallest h such that there are k sample points X_i in the set {z : |z − x| ≤ h}, then we obtain the kth nearest neighbour estimate

f̃_k(x) = k / ( nV(d)[ĥ(k, x)]^d ) = k / ( nV(d)[|x − X|_{(k)}]^d )   (12.2.6)

with |x − X|_{(1)} < · · · < |x − X|_{(n)} denoting the ordered values of |x − X_i|, 1 ≤ i ≤ n.
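In code, for d = 1 (where the "unit sphere" is [−1, 1] and V(1) = 2) the kth nearest neighbour estimate is a one-liner; the data are illustrative.

```python
# kth nearest neighbour density estimate (12.2.6) for d = 1:
# ftilde_k(x) = k / (n * 2 * r_k(x)), r_k(x) the distance to the kth
# nearest sample point (V(1) = 2 is the length of [-1, 1]).
def knn_density(x, data, k):
    n = len(data)
    r = sorted(abs(x - z) for z in data)[k - 1]   # r = |x - X|_(k)
    return k / (n * 2.0 * r)

data = [0.0, 0.5, 1.0, 5.0]        # illustrative sample
print(knn_density(0.25, data, 2))  # r_2 = 0.25, so 2/(4*2*0.25) -> 1.0
```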
(b) Kernel Regression Estimates

We introduce a nonparametric product kernel estimate of E(Y|X = x) and give mean squared error properties of this estimate. Suppose the observations (X_i, Y_i), i = 1, . . . , n, are i.i.d. as (X, Y), where X_i ≡ (X_{i1}, . . . , X_{id})^T, Y_i ∈ R, with the X_i having continuous case density f, and loss is mean square error. We assume that S⁰ ≡ {x : f(x) > 0}, the interior of the support S_f of f(·), is an ellipse or a rectangle with possibly one or more infinite boundaries. We want to estimate µ(x) = E(Y|X = x), where L(Y|X = x) is defined to be point mass at zero for x ∉ S⁰, so that µ(x) = 0, x ∉ S⁰. Suppose K : R^d → R is a kernel, that is, a function such that ∫_{R^d} K(t)dt = 1. Then define

K_h(t) ≡ (h_1 h_2 · · · h_d)^{−1} K( t_1/h_1, . . . , t_d/h_d ) ,

with the t_j, h_j being the coordinates of t, h, where h_j > 0. Evidently K_h is also a kernel. Assume that K is symmetric, that is, K(−t) = K(t). Then

f̂_h(x) = n^{−1} Σ_{i=1}^n K_h(X_i − x)

is the multivariate convolution kernel density estimate of f(x) of part (a) of this section. The generalized multivariate Nadaraya–Watson estimate of the nonparametric regression function µ(x) ≡ E(Y|X = x) is ĝ_h(x)/f̂_h(x), where ĝ_h(x) = n^{−1} Σ Y_i K_h(X_i − x); that is,

µ̂_NW(x) = Σ_{i=1}^n Y_i K_h(X_i − x) / Σ_{i=1}^n K_h(X_i − x) ,   (12.2.7)

where (0/0) = 0. In the asymptotic analysis, we assume the product kernel

K(t) ≡ Π_{j=1}^d K_0(t_j) ,   h_j = h , 1 ≤ j ≤ d ,

where K_0 is a univariate, non-negative, symmetric kernel.
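A sketch of (12.2.7) in code, with a product of Epanechnikov kernels and a common bandwidth h; the data are illustrative.

```python
# Multivariate Nadaraya-Watson estimate (12.2.7) with a product of
# Epanechnikov kernels K0(u) = 0.75*(1 - u^2) on |u| <= 1 and common h.
def K0(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def nw(x, xs, ys, h):
    num = den = 0.0
    for xi, yi in zip(xs, ys):
        w = 1.0
        for j in range(len(x)):
            w *= K0((xi[j] - x[j]) / h)
        num += yi * w
        den += w
    return num / den if den > 0 else 0.0   # the convention (0/0) = 0

xs = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]]   # illustrative covariates in R^2
ys = [1.0, 1.0, 5.0]
print(nw([0.05, 0.05], xs, ys, 0.5))  # only the two nearby Y = 1 points -> 1.0
```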
We will develop asymptotic results for the multivariate µ̂_NW. Let S be a compact set contained in S⁰ = {x : f(x) > 0}. We select S so that it has a "simple" and "smooth" boundary, which we for simplicity take to be an ellipse. Let |·| be the Euclidean norm, let G denote the distribution of (X, Y), and let σ²(x) = Var(Y|X = x). Moreover, let M > 0 and ε > 0 be generic constants. We will use the assumptions:

A1: µ(x) = E_G(Y|x) is Lipschitz; |µ(x) − µ(z)| ≤ M|x − z| for all x, z ∈ S, and µ(x) ≤ M < ∞ for all x ∈ S.

A2: f(x) is Lipschitz; |f(x) − f(z)| ≤ M|x − z| for all x, z ∈ S, and f(x) ≤ M for all x ∈ S.

A3: K_0 ≥ 0 has compact support, is continuous, symmetric, and bounded above; K_0(t) ≤ M for all t ∈ R.

A4: f(x) is bounded below on S; f(x) ≥ ε for all x ∈ S.

A5: The conditional distribution of Y|X = x has a bounded moment generating function; E_G{e^{tY}|X = x} ≤ M for |t| ≤ δ < ∞, some δ > 0, all x ∈ S.
Let G be the class of G where A1, . . . , A5 hold uniformly in G and x. Note that by A5, σ²(x) ≤ M < ∞ for all x. Let IMSE(µ̂_NW) = E{ MSE( µ̂_NW(X) ) 1(X ∈ S) }; then rates of convergence of the Nadaraya–Watson estimates are given by:

Theorem 12.2.1. Under assumptions A1, . . . , A5,

(i) inf_{h>0} MSE( µ̂_NW(x) ) = O( n^{−2/(2+d)} ) uniformly for x ∈ S and G ∈ G.

(ii) inf_{h>0} IMSE( µ̂_NW ) = O( n^{−2/(2+d)} ) uniformly for G ∈ G, for S with smooth boundaries.

(iii) The minimizers in (i) and (ii) are of the form h = cn^{−1/(d+2)}.

Proof. The proof is in Appendix D.7.
Remark 12.2.2. (1) Assumptions A1, A2, and A3 can be considerably weakened. Moreover, A4 can be replaced by a condition on how quickly f(x) → 0 as x → ∂S, and A5 can be replaced by moment conditions. The technical cost of weakening the assumptions does not seem worthwhile.

(2) If G is defined by Lipschitz assumptions on partial derivatives of µ or, more generally, µ(·) lies in a Sobolev ball, the rate n^{−2/(2+d)} can be replaced by n^{−2s/(2s+d)}, where s is some measure of smoothness determined by G, provided that we use local polynomial regression estimates of a sufficiently high order (linear, quadratic, etc.).

(3) For other results of this type see Stone (1984), Ruppert and Wand (1994), Fan and Gijbels (1996), Jiang and Doksum (2003), and Problem 12.2.4. ✷
Remark 12.2.3. (1) As in the case of density estimation (Remark 12.2.1), we face slow convergence rates unless the distribution of X concentrates near a low dimensional manifold (sparsity).

(2) We also face the choice of bandwidth h, or of the number of nearest neighbours k if we use a kernel with support [−1, 1]^d and the h that yields a fixed number k of x_i's in [x − h, x + h]. The asymptotically optimal choice of h given in Problem 12.2.4 depends on an unknown constant c. Prediction suggests the use of cross validatory choice, that is, fitting the rule using a fraction of the sample and selecting the regularization parameter h by optimizing prediction or classification performance on the rest of the sample. We shall discuss these issues further in Section 12.5.
12.2.2 Bayes Rule and Nonparametric Classification
We begin with a result relating the Bayes classification rule to nonparametric density estimation. We observe a training sample (X_1, I_1), . . . , (X_n, I_n) i.i.d. as (X, I) ∼ P, where I ∈ {1, . . . , C}, X ∈ X, P ∈ P. Let (X_{n+1}, I_{n+1}) ∼ (X, I) be independent of the other (X_i, I_i). On the basis of X_{n+1} we are to predict I_{n+1} using 0-1 loss: l(I, a) = 0 if a = I, 1 otherwise.
Suppose p̂_j(·), 1 ≤ j ≤ C, are estimates of p_j(·), the conditional densities with respect to ν(x) of X given I = j, and π̂_{jn} are estimates of π_j, 1 ≤ j ≤ C. The Bayes rule is, from (12.1.1),

δ_B(x) = j iff π_j p_j(x) = max{π_k p_k(x) : 1 ≤ k ≤ C} ,

with ties broken arbitrarily if more than one j achieves the max. Let

δ̂_n(x : X_1, . . . , X_n) = j iff π̂_j p̂_j(x) = max{π̂_k p̂_k(x) : 1 ≤ k ≤ C}   (12.2.8)

with the same rule for ties. We show that δ̂_n has risk close to the minimum Bayes risk. This minimum Bayes risk is, from (12.1.2),

R_B = 1 − ∫ max{π_j p_j(x) : 1 ≤ j ≤ C} dν(x) .

Let P be a general class of P's satisfying the conditions specified in what follows.
Theorem 12.2.2. Let δ̂_n be as defined in (12.2.8). Suppose that as n → ∞,

∫ |p̂_j(x) − p_j(x)| dν(x) →_P 0 and π̂_j →_P π_j , 1 ≤ j ≤ C .   (12.2.9)

Then, as n → ∞, the risk of δ̂_n for 0 − 1 loss satisfies

R(P, δ̂_n) ≡ P[δ̂_n(X_{n+1} : X_1, . . . , X_n) ≠ I_{n+1}] = R_B + o(1) .   (12.2.10)

If (12.2.9) holds uniformly for P ∈ P, then the o(1) in (12.2.10) is also uniform.
Proof. For $\delta(\cdot) : \mathcal{X} \to \{1, \ldots, C\}$, we note the following useful formula,

$$R(P, \delta) - R_B = \int \Big[\max_j \pi_j p_j(x) - \pi_{\delta(x)} p_{\delta(x)}(x)\Big]\, d\nu(x).$$

Therefore, for $\hat{a} = \arg\max_a \hat{\pi}_a \hat{p}_a(x)$,

$$R(P, \hat{\delta}_n) - R_B = E \int \Big[\max_j \pi_j p_j(x) - \pi_{\hat{a}} p_{\hat{a}}(x)\Big]\, d\nu(x) \le E \int \Big[\max_j \pi_j p_j(x) - \min_j \pi_j p_j(x)\Big]\, 1\Big(\bigcup_{a=1}^C A_a\Big)\, d\nu(x), \tag{12.2.11}$$

where $A_a = \{x : \hat{\pi}_{\hat{a}} \hat{p}_{\hat{a}}(x) > \hat{\pi}_a \hat{p}_a(x),\ \pi_{\hat{a}} p_{\hat{a}}(x) < \pi_a p_a(x)\}$. By hypothesis, $P[A_a] \to 0$, since $\max_a |\hat{\pi}_a \hat{p}_a(X) - \pi_a p_a(X)| \xrightarrow{P} 0$. The result follows from the dominated convergence theorem (Theorem B.7.5). ✷
Definition 12.2.1. A classifier $\hat{\delta}_n$ that satisfies (12.2.10) is called Bayes consistent.
We next give examples of rules that satisfy the conditions of Theorem 12.2.2.
Example 12.2.1. Kernel and nearest neighbour rules.

(a) Kernels. It is evident that we can plug in kernel estimates $\hat{p}_j(\cdot)$ for $p_j(\cdot)$ when $X$ is continuous. For instance, suppose $\mathcal{X} = R$, $K(u) = \frac{1}{2}1(-1 < u < 1)$, we use the same bandwidth $h$ for all $\hat{p}_j$, and the $\hat{\pi}_j = n^{-1}\sum_{i=1}^n 1(I_i = j)$, which are unbiased and consistent estimates of the population $\pi_j$. Then a reasonable classification rule is

$$\hat{\delta}_n(X_{n+1} : X_1, \ldots, X_n) = j \quad\text{if}\quad \hat{p}_j(X_{n+1}) > \frac{\hat{\pi}_m}{\hat{\pi}_j}\,\hat{p}_m(X_{n+1})\ \text{for all } m \ne j,$$

where $\hat{p}_j(\cdot)$ is the kernel density estimate based on $\{X_i : I_i = j\}$. This rule is equivalent to preferring $j$ over $m$ if the ratio of the number $N_j(h)$ of $X$ observations from category $j$ in the training set at distance $\le h$ from $X_{n+1}$, to the corresponding number $N_m(h)$ from category $m$, is larger than $\hat{\pi}_m/\hat{\pi}_j$.
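For the boxcar kernel, the rule in (a) reduces to comparing the within-bandwidth class counts $N_j(h)$. A minimal sketch follows; the function name and the simulated data are illustrative assumptions, not from the text.

```python
import numpy as np

def kernel_plugin_classify(x_new, X, I, h, n_classes):
    """Plug-in kernel rule of Example 12.2.1(a) with the boxcar kernel
    K(u) = 0.5 * 1(|u| <= 1).  For this kernel, pi_hat_j * p_hat_j(x)
    equals N_j(h) / (2 h n), where N_j(h) counts class-j training points
    within distance h of x_new; the common factor 1/(2 h n) cancels, so
    it suffices to compare the raw counts."""
    counts = np.array([np.sum(np.abs(X[I == j] - x_new) <= h)
                       for j in range(n_classes)])
    return int(np.argmax(counts))  # ties broken by the smallest index

# Illustrative data: two well-separated classes on the line.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.5, 200), rng.normal(5.0, 0.5, 200)])
I = np.repeat([0, 1], 200)
```

With the classes centered at 0 and 5, points near either center are assigned to the corresponding class.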
(b) Nearest Neighbours. If we use the $k$th nearest neighbour density estimate of Sections 11.4.3 and 12.2.1(a), and if $\pi_j$ is uniform, $\pi_j = C^{-1}$, we order the $|X_{n+1} - X_j|$ to get

$$|X_{n+1} - X|_{(1)} < \ldots < |X_{n+1} - X|_{(n)},$$

the ordered distances between the $X_j$, $1 \le j \le n$, and $X_{n+1}$. The $\arg\max_j\{\hat{p}_j(x_{n+1})\}$ rule leads to (Problem 12.2.5): Predict $I_{n+1} = I_{\hat{j}_k}$, where $\hat{j}_k$ is defined by

$$|X_{n+1} - X_{\hat{j}_k}| = |X_{n+1} - X|_{(k)}.$$

That is, $\hat{j}_k$ is the subscript of the $k$th nearest neighbour to $X_{n+1}$. If $k = 1$, this leads to the highly plausible rule: Predict $I_{n+1} = I_{\hat{i}}$, where

$$|X_{n+1} - X_{\hat{i}}| = \min\{|X_{n+1} - X_m| : 1 \le m \le n\}.$$

That is, $\hat{i}$ is the subscript of the $X_i$, $1 \le i \le n$, closest to $X_{n+1}$. We call this the nearest neighbour classifier.
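The nearest neighbour classifier and its $k$th nearest neighbour extension can be sketched directly. This is a hypothetical minimal implementation; the majority vote stands in for the $\arg\max_j \hat{p}_j$ rule in the uniform-$\pi_j$ case.

```python
import numpy as np

def nearest_neighbour_classify(x_new, X, I):
    """1-NN rule: predict the label of the X_i closest to x_new."""
    return I[np.argmin(np.abs(X - x_new))]

def knn_classify(x_new, X, I, k):
    """k-NN rule: majority label among the k closest X_i, which is the
    argmax_j p_hat_j rule when the priors pi_j are uniform."""
    nearest = I[np.argsort(np.abs(X - x_new))[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```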
Corollary 12.2.1. Assume the framework of Example 12.2.1(a). If $h_n \to 0$ and $nh_n \to \infty$, if $C = 2$, if $p_j(\cdot)$ concentrates on $(0,1)$, if $\|p_j\|_\infty \le M$ for $j = 0, 1$, and if $\delta_B$ is the Bayes rule, then

$$P[\hat{\delta}_n(X_{n+1} : X_1, \ldots, X_n) \ne I_{n+1}] = P[\delta_B(X_{n+1}) \ne I_{n+1}] + o(1). \tag{12.2.12}$$

Moreover, (12.2.12) holds uniformly over $\mathcal{F} = \{(p_0, p_1) : \|p_j''\|_\infty \le M\}$.

Proof. Note that, using Theorem 11.2.1,

$$\Big(E \int_0^1 |\hat{p}_j(x) - p_j(x)|\, dx\Big)^2 \le E \int_0^1 \big(\hat{p}_j(x) - p_j(x)\big)^2 dx = O\big((nh)^{-1}\big) + O(h^4)$$

uniformly on $\mathcal{F}$. The result follows from Theorem 12.2.2. ✷
Remark 12.2.4. (1) The conditions of Corollary 12.2.1 are much too strong. If no uniformity is required, then (12.2.12) holds for kernel density estimates for any $f$ — see Devroye and Lugosi (1996), for instance.

(2) If $d > 1$, the condition $nh_n \to \infty$ is replaced by $nh_n^d \to \infty$.

(3) The $k$th nearest neighbour classifier is consistent under regularity conditions when $d = 1$ if $k = k_n \to \infty$ and $n^{-1}k_n \to 0$ as $n \to \infty$. See Theorem 11.4.1 and Problem 12.2.6.

(4) Define a (poor) estimate, based on the training sample, of the misclassification probability by

$$\hat{R} = n^{-1}\sum_{j=1}^C \sum_{i=1}^n 1\big(I_i = j,\ \hat{\delta}_n(X_i : X_1, \ldots, X_n) \ne j\big).$$

Note that the nearest neighbour classifier has $\hat{R} = 0$, a clear underestimate, because the nearest neighbour classifier, which corresponds to a kernel estimate with bandwidth $h = O_P(n^{-1})$, has a bandwidth that is too small to give a good density estimate. To obtain Bayes consistency we have seen that we have to take $k = k_n \to \infty$, $k_n/n \to 0$. On the other hand, if $\hat{\delta}_n(\cdot)$ is the 1-nearest neighbour rule and $R(P, \hat{\delta}_n)$ is its classification risk corresponding to the 0-1 loss function, then

$$\overline{\lim}_n \frac{R(P, \hat{\delta}_n)}{R_B(P)} \le 2,$$

where $R_B(P)$ is the minimum Bayes risk (Cover and Hart (1967)). Thus, a terrible density estimate can lead to a perfectly reasonable, though in general suboptimal, classification rule.

(5) Kernel and $k$th nearest neighbour methods were first advocated by Fix and Hodges (1951). The latter in particular are still in wide use.

(6) Properties of classifiers will be discussed in more detail in Section 12.2.4.
12.2.3 Sieve Methods

Logistic regression (LR)
As we have discussed, one of the most widely used classification methods is based on logistic regression. Fitting is particularly easy since it involves just the calculation of the MLE for a canonical exponential family. The issue of regularization appears in this context through the choice of the number of functions $f_i(Z)$ appearing in the logistic regression. Of even greater importance is the nature of the sieve determined by the class $\mathcal{F} = \{f_1, f_2, \ldots\}$. If $f_j(Z) = Z_j \in R$, then logistic regression may be viewed as a competitor to the rule based on Fisher's linear discriminant analysis (LDA), which is constructed on the assumption that $Z$ has an $N_d(\mu_j, \Sigma)$ distribution for $j = 1, \ldots, C$. It can be shown (Problems 12.2.8 and 12.2.9) that LDA can never do better than LR asymptotically, at least to 0th order, that is, in converging to the Bayes rule.
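Since fitting is MLE in a canonical exponential family, Newton-Raphson (iteratively reweighted least squares) converges quickly. A sketch, assuming a well-conditioned design and non-separable data; all names are illustrative.

```python
import numpy as np

def logistic_mle(F, y, n_iter=25):
    """MLE for logistic regression by Newton-Raphson (IRLS).
    F: n x k matrix of sieve functions f_j(Z_i); y in {0, 1}."""
    theta = np.zeros(F.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-F @ theta))  # fitted P(Y = 1 | Z_i)
        W = p * (1.0 - p)                     # IRLS weights
        # Newton step: theta += (F' W F)^{-1} F' (y - p)
        theta = theta + np.linalg.solve(F.T @ (F * W[:, None]),
                                        F.T @ (y - p))
    return theta
```

On simulated data with a true slope, the fitted coefficient should land near the truth for moderate sample sizes.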
Penalized least squares

Recall the framework of Section 12.1.1. Again the choice of sieve is of first importance. Consider $\{f_j(Z) : 1 \le j \le k\}$ as the $k$th member of a sieve, that is,

$$Y = F^T\theta + e, \tag{12.2.13}$$

where $F = (f_1(Z), \ldots, f_k(Z))^T$, $\theta = (\theta_1, \ldots, \theta_k)^T \in R^k$. Let the $e_j$ be i.i.d. $N(0, \sigma^2)$, so that the model for $n$ observations is

$$Y_{n\times 1} = F_{n\times k}\,\theta + e, \tag{12.2.14}$$

where $F \equiv (f_j(Z_i))$, $1 \le j \le k$, $1 \le i \le n$, $e \equiv (e_1, \ldots, e_n)^T$, $Y \equiv (Y_1, \ldots, Y_n)^T$. Then penalizing by $J(\theta) = \lambda|\theta|^2$ is equivalent to minimizing

$$|Y - F\theta|^2 + \lambda|\theta|^2 = Y^T Y - 2\theta^T F^T Y + \theta^T(F^T F + \lambda I)\theta, \tag{12.2.15}$$

where $I$ is the $k \times k$ identity matrix. The minimizer is easily seen to be

$$\hat{\theta}_R = (F^T F + \lambda I)^{-1} F^T Y. \tag{12.2.16}$$

The minimizer is always uniquely defined if $\lambda > 0$, even if $\mathrm{rank}(F) < k$. The ridge regression estimator $\hat{\theta}_R$ leads to the estimate of the regression $\mu(Z)$,

$$\hat{\mu}(Z) = F\hat{\theta}_R. \tag{12.2.17}$$

Ridge regression was first advanced (Hoerl and Kennard (1970)) for numerical stability purposes in situations where $F$ was ill-conditioned, but it has important statistical properties as well. As noted in Section 12.1.1, $\hat{\theta}_R$ can be interpreted as a Bayes estimate. As such, it shrinks the (unpenalized) least squares predictor $\hat{\mu}_{LS}(Z) = F[F^T F]^{-1}F^T Y$ towards the mean 0 of the prior distribution $N_n(0, FF^T/\lambda)$ for $F\theta$.
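The ridge formula (12.2.16) is a single linear solve; when $F^TF = I$ it reduces to coordinatewise shrinkage $\hat{\theta}/(1+\lambda)$. A sketch (the function name is illustrative):

```python
import numpy as np

def ridge(F, Y, lam):
    """Ridge estimator theta_R = (F'F + lam I)^{-1} F'Y, as in (12.2.16).
    Uniquely defined for lam > 0 even when rank(F) < k."""
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y)
```

For a design with orthonormal columns ($F^TF = I$), the result equals $(F^TY)/(1+\lambda)$, the shrinkage that underlies the Stein connection.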
The choice of regularization parameter $\lambda$ will be discussed further in Section 12.5. However, it is straightforward to make a connection to Stein shrinkage as in Section I.6. Suppose $F^TF = I$, so that the vectors $\{f_j(Z_i) : i = 1, \ldots, n\}$ are orthonormal. Then the MLE of $\theta$ is

$$\hat{\theta} \equiv F^T Y \sim \theta + e_{k\times 1},$$

where $e_{k\times 1} \sim N(0, I)$ and $\theta \sim N(0, I/\lambda)$ are independent under our (empirical) Bayesian model. Thus, marginally, $\hat{\theta} \sim N(0, (\lambda+1)I/\lambda)$, and the MLE of

$$(1+\lambda)^{-1} = 1 - \lambda(\lambda+1)^{-1}$$

is

$$\widehat{(1+\lambda)^{-1}} = \Big(1 - \frac{d}{|\hat{\theta}|^2}\Big)_+, \tag{12.2.18}$$

which is almost the Stein positive part estimate, in which $|\hat{\theta}|^2/d$ is replaced by $|\hat{\theta}|^2/(d-2)$. It can also be interpreted as the estimate which, for some $c(\lambda)$, minimizes $|Y - F\theta|^2$ subject to $|\theta|^2 \le c(\lambda)$.
Recall the notation $J_r(\theta) = \sum_{l=1}^k |\theta_l|^r$. The penalty $J_1(\theta)$ is the closest convex penalty of the form $J_r$ to $J_0$. It is also the only convex member of the $J_r$ family which permits making some coordinates of the optimizing $\theta$ be 0 (Problem 12.2.10). Thus if $\hat{\theta}_j$ is a regression coefficient and is set to zero, this corresponds to leaving the $j$th predictor $Z_j$ out of the model. As mentioned earlier, this method is known as the "Lasso" (least absolute shrinkage and selection operator) (Tibshirani (1996)). The properties of the Lasso are discussed extensively by Bühlmann and van de Geer (2011) and Hastie, Tibshirani and Wainwright (2015). See also Problem 12.2.10. Properties of extensions of the Lasso to grouped variables (the grouped Lasso) can be found in Yuan and Lin (2007) and Zhao, Rocha and Yu (2009).
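For an orthonormal design the $J_1$-penalized problem separates across coordinates and is solved by soft-thresholding, which sets small coefficients exactly to 0, the variable-selection property noted above. A sketch under the assumption $F^TF = I$; names are illustrative.

```python
import numpy as np

def soft_threshold(u, t):
    """Soft-thresholding operator: sgn(u) * (|u| - t)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_orthonormal(F, Y, lam):
    """Minimize |Y - F theta|^2 + lam * J_1(theta) when F'F = I:
    each coordinate of the least squares estimate F'Y is
    soft-thresholded at lam/2, so some coordinates become exactly 0."""
    return soft_threshold(F.T @ Y, lam / 2.0)
```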
12.2.4 Machine Learning Approaches

We return to the methods discussed in Section 12.1.2, but now with $Y \in \{-1, 1\}$. Recall that $(Z_i, Y_i)$, $1 \le i \le n$, are i.i.d. as $(Z, Y)$, where $(Z, Y)$ is independent of the $(Z_i, Y_i)$, $1 \le i \le n$. The goal is to build a classifier $\delta_n(Z)$, based on a training sample $(Z_i, Y_i)$, $1 \le i \le n$, that can be used to classify an individual or object with unknown $Y$ and known $Z \equiv (Z_1, \ldots, Z_d)^T$ as being either in the "$Y = -1$" or the "$Y = 1$" category.
a) Neural Nets
We do not further discuss these methods but refer to Hastie et al (2001, 2009) and Ripley
(1996).
b) Support Vector Machines

As we indicated, support vector machines were proposed by Vapnik as a classification algorithm for 2 classes which finds the hyperplane of "maximum margin" separating the members of the training sample perfectly according to class. Of course, there may be no hyperplane which can separate at all — for instance, imagine one class distributed uniformly on a spherical shell and the other limited to the interior of the sphere. But if separation is possible then we can define the margin for any separating hyperplane $H$ by

$$M(H) = \min_i |Z_i - \Pi(Z_i | H)|,$$

where $|\cdot|$ is the Euclidean norm and $\Pi$ is projection on $H$. This is just the minimum distance of the training set from $H$. The hyperplane sought is the one which maximizes $M(H)$. Formally, for the hyperplane

$$H = \{z : \beta^T z + \beta_0 = 0\}, \quad |\beta| = 1,$$

the problem may be stated as (Problem 12.2.13)

$$\max_{\beta, \beta_0}\Big\{\min_i |\beta^T Z_i + \beta_0| : |\beta| = 1,\ Y_i(\beta^T Z_i + \beta_0) > 0,\ 1 \le i \le n\Big\}. \tag{12.2.19}$$

We refer to Hastie et al (2009), Section 12.2, for a discussion showing how, upon dropping the restriction that $|\beta| = 1$, this convex optimization problem may be transformed into

$$\min_{\beta, \beta_0} |\beta| \tag{12.2.20}$$

subject to

$$Y_i(Z_i^T\beta + \beta_0) \ge 1 \tag{12.2.21}$$

for all $i$. Here the maximum margin is $1/|\beta^*|$, where $\beta^*$ is the minimizer in (12.2.20).
If separation is impossible we can modify (12.2.21) to

$$Y_i(Z_i^T\beta + \beta_0) \ge 1 - \xi_i \tag{12.2.22}$$

for all $i$, where $\xi_i \ge 0$. Note that if $\xi_{i_0} > 1$, then any hyperplane misclassifying $Y_{i_0}$ but correctly classifying all $Y_i$ such that $\xi_i = 0$ becomes a candidate for separating all $\{Z_i : \xi_i = 0\}$. The reason is that we can write (12.2.22) as

$$Y_i\Big(\frac{Z_i^T\beta}{|\beta|} + \frac{\beta_0}{|\beta|}\Big) \ge \frac{1 - \xi_i}{|\beta|}.$$

Then, if $\xi_{i_0} > 1$, and there is a hyperplane $H = \{z : z^T\beta^* + \beta_0 = 0\}$ which misclassifies only $Y_{i_0}$ and $Y_{i_0}(Z_{i_0}^T\beta^* + \beta_0) = -\delta < 0$, we see that, for some $0 < C < \infty$, $-\delta/C \ge 1 - \xi_{i_0}$, and so $|\beta^*|/C$ is feasible, i.e. a candidate for the minimizing $|\beta|$ as in (12.2.20) subject to (12.2.22). Evidently, if we permit the $\xi_i \ge 0$ to be arbitrary, we can make the inf in (12.2.20) equal to 0. Thus the restriction $\sum_{i=1}^n \xi_i \le K$ is added, and we arrive at

$$\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2 \quad\text{subject to}\quad \xi_i \ge 0,\ Y_i(Z_i^T\beta + \beta_0) \ge 1 - \xi_i,\ i = 1, \ldots, n,\ \sum_{i=1}^n \xi_i \le K. \tag{12.2.23}$$
By Lagrangian duality (see, for instance, Boyd and Vandenberghe (2004)), we note that the problem is equivalent to

$$\max_\alpha \Big\{\sum_{i=1}^n \alpha_i - \frac{1}{2}\Big|\sum_{i=1}^n \alpha_i Y_i Z_i\Big|^2\Big\} \quad\text{subject to}\quad 0 \le \alpha_i \le \gamma,\ \sum_{i=1}^n \alpha_i Y_i = 0, \tag{12.2.24}$$

for some $\gamma$ depending on $K$, and with the relation between the optimizing $\alpha^*$ and $\beta^*$, $\beta_0^*$, $\xi = (\xi_1, \ldots, \xi_n)$, and $\gamma$ given by

$$\alpha_i^*\big[Y_i(Z_i^T\beta^* + \beta_0^*) - (1 - \xi_i)\big] = 0, \tag{12.2.25}$$

$$\xi_i(\alpha_i^* - \gamma) = 0, \tag{12.2.26}$$

for $i = 1, \ldots, n$, and

$$\beta^* = \sum_{i=1}^n \alpha_i^* Y_i Z_i. \tag{12.2.27}$$

Note that $\alpha_i^* > 0$ iff $Y_i(Z_i^T\beta^* + \beta_0^*) = 1 - \xi_i$. The corresponding $Z_i$ are called support vectors. Note also that all $\alpha_i^*$ with $\xi_i > 0$ equal $\gamma$. Any $0 < \alpha_i^* < \gamma$ will correspond to a correctly classified $Z_i$ at minimum distance from the hyperplane corresponding to $(\beta^*, \beta_0^*)$. For further discussion of the implementation of these methods via reproducing kernel Hilbert spaces, we refer to Hastie et al (2009).
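The soft-margin problem (12.2.23) has an equivalent penalized hinge-loss form, $n^{-1}\sum_i (1 - Y_i(Z_i^T\beta + \beta_0))_+ + (\lambda/2)|\beta|^2$, which can be attacked by subgradient descent. A rough sketch, not a production solver; the function name, step size, and simulated data are illustrative assumptions.

```python
import numpy as np

def svm_subgradient(Z, Y, lam=0.1, n_iter=2000, lr=0.01):
    """Linear soft-margin SVM by subgradient descent on the penalized
    hinge loss (1/n) sum_i (1 - Y_i(Z_i'beta + beta_0))_+
    + (lam/2)|beta|^2.  Y takes values in {-1, 1}."""
    n, d = Z.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = Y * (Z @ beta + beta0)
        active = margins < 1.0  # margin violators drive the subgradient
        beta -= lr * (lam * beta
                      - (Y[active][:, None] * Z[active]).sum(0) / n)
        beta0 -= lr * (-Y[active].sum() / n)
    return beta, beta0
```

On two well-separated Gaussian clusters the fitted hyperplane classifies the training sample perfectly.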
c) Boosting

The initial form of this method, as limited to classification into 2 classes, was in terms of reweighting. We are given

(i) A set of "weak" classifiers, $h_k : \mathcal{Z} \to \{-1, 1\}$, $k = 1, 2, \ldots, M$, $M \le \infty$, where $\{-1, 1\}$ correspond to the classification decisions. "Weak" loosely means that the classifiers are expected to have $P[\text{Misclassification}] < \frac{1}{2}$, but barely so.

(ii) A training sample $(Z_i, Y_i)$, $i = 1, \ldots, n$, where $Z_i \in R^d$ and $Y_i \in \{-1, 1\}$.

Boosting then produces a sequence of linear classifiers of the form

$$G_j(Z) = \mathrm{sgn}\Big(\sum_{k=1}^M \alpha_{kj} h_k(Z)\Big), \quad j = 1, 2, \ldots,$$

by selecting the weights so as to emphasize the classifiers doing particularly well on the hard-to-classify observations.

There are many flavors of boosting; see Meir and Rätsch (2003) for an overview. One division can be made between algorithms in which (1) the $h_k$ are considered in a prespecified order, so that $\alpha_{kj} = 0$ for $k > \min(j, M)$, and (2) the other, more common, approach where $\alpha_{kj} \ne 0$ for at most $\min(j, M)$ indices, but the $h$ to be considered at stage $j$ is chosen optimally.
The basic form of the original AdaBoost algorithm describes a process of generating the $\alpha_{kj}$ by iterative reweighting of the training sample. Specifically, suppose we introduce $h_m$ at step $m$.

AdaBoost 1.

1. Initialize

$$G_1(Z) = h_1(Z), \quad w_{i1} = \frac{1}{n},\ i = 1, \ldots, n, \quad \mathrm{err}_1 = \sum_{i=1}^n w_{i1}\, 1\big(Y_i \ne G_1(Z_i)\big), \quad \alpha_1 = \log\big[(1 - \mathrm{err}_1)/\mathrm{err}_1\big].$$

2. Given $G_m = \mathrm{sgn}\big(\sum_{k=1}^m \alpha_k h_k\big)$, define $G_{m+1}$ by

$$w_{i(m+1)} = w_{im}\exp\big(\alpha_m 1(Y_i \ne h_m(Z_i))\big),\ i = 1, \ldots, n, \quad \mathrm{err}_{m+1} = \frac{\sum_{i=1}^n w_{i(m+1)}\, 1\big(Y_i \ne h_{m+1}(Z_i)\big)}{\sum_{i=1}^n w_{i(m+1)}},$$

$$\alpha_{m+1} = \log\big[(1 - \mathrm{err}_{m+1})/\mathrm{err}_{m+1}\big], \quad G_{m+1} = \mathrm{sgn}\Big(\sum_{k=1}^{m+1}\alpha_k h_k\Big).$$

3. Iterate for $m = 1, \ldots, M, \ldots$.

Note. Iterations need not stop once $h_M$ has been considered; we can restart with $G_M$ replacing $G_1$.
AdaBoost 2. Instead of the $h_k$ being presented in the original order, in the more common version of the algorithm, $h_{i_1}, h_{i_2}, \ldots$ are determined in data-driven order. Thus, set $h_{(k)} \equiv h_{i_k}$ and initialize as before. At stage $m$,

$$G_m = \mathrm{sgn}\Big(\sum_{k=1}^m \alpha_k h_{(k)}\Big);$$

then at stage $m + 1$,

$$h_{i_{m+1}} = \arg\min_k \frac{\sum_{i=1}^n w_{i(m+1)}\, 1\big(Y_i \ne h_{(k)}(Z_i)\big)}{\sum_{i=1}^n w_{i(m+1)}},$$

where

$$w_{i(m+1)} \equiv w_{im}\exp\big(\alpha_m 1(Y_i \ne h_{(m)}(Z_i))\big).$$

Then

$$G_{m+1} = \mathrm{sgn}\Big(\sum_{k=1}^{m+1}\alpha_k h_{(k)}\Big)$$

with

$$\alpha_{m+1} = \log\big[(1 - \mathrm{err}_{m+1})/\mathrm{err}_{m+1}\big]$$

and

$$\mathrm{err}_{m+1} = \sum_{i=1}^n w_{i(m+1)}\, 1\big(Y_i \ne h_{(m+1)}(Z_i)\big). \tag{12.2.28}$$

✷
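The data-driven scheme can be sketched with decision stumps standing in for the weak classifiers $h_k$; the stumps, function names, and data are illustrative assumptions, not from the text.

```python
import numpy as np

def adaboost(Z, Y, weak_classifiers, n_rounds):
    """Greedy AdaBoost (the data-driven version above): at each stage
    pick the weak classifier with smallest normalized weighted error,
    then reweight the sample to emphasize misclassified points.
    weak_classifiers: list of functions h(Z) -> array of +/-1."""
    w = np.full(len(Y), 1.0 / len(Y))
    model = []
    for _ in range(n_rounds):
        errs = [np.sum(w * (h(Z) != Y)) / np.sum(w)
                for h in weak_classifiers]
        k = int(np.argmin(errs))
        h = weak_classifiers[k]
        alpha = np.log((1.0 - errs[k]) / max(errs[k], 1e-12))
        model.append((alpha, h))
        w = w * np.exp(alpha * (h(Z) != Y))  # w_i(m+1) update
    return model

def boost_predict(model, Z):
    """G_m(Z) = sgn(sum_k alpha_k h_(k)(Z))."""
    score = sum(alpha * h(Z) for alpha, h in model)
    return np.where(score >= 0, 1, -1)
```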
As was discovered by a number of authors (see Meir and Rätsch (2003)), the first of these algorithms is just coordinate ascent as described in Section 2.4.2, and the more common version is a variant, usually called the Gauss-Southwell algorithm, for a particular optimization problem. We next establish this result. Given $P$, a distribution of $(Z, Y)$, let

$$Q(\alpha, P) \equiv \int \exp\Big\{-\sum_{j=1}^M \alpha_j h_j(z)\, y\Big\}\, dP(z, y).$$

This is evidently a convex function of $\alpha \in R^M$, and if $\hat{P}_n$ is the empirical distribution of the training sample, we note that

$$Q_n(\alpha) \equiv Q(\alpha, \hat{P}_n) = \frac{1}{n}\sum_{i=1}^n \exp\Big\{-\sum_{j=1}^M \alpha_j h_j(Z_i)\, Y_i\Big\}.$$
If we fix $\alpha_{jm} = \alpha_{j(m-1)}$, $1 \le j \le m - 1$, and $\alpha_{jm} = 0$, $j > m$, then

$$\frac{\partial}{\partial\alpha_m} Q_n(\alpha) = \frac{1}{n}\sum_{i=1}^n \exp\big(-G_{m-1}(Z_i)Y_i\big)\big[-e^{-\alpha_m}\, 1(h_m(Z_i)Y_i = 1) + e^{\alpha_m}\, 1(h_m(Z_i)Y_i = -1)\big] = 0$$

iff

$$e^{2\alpha_m} + 1 = \frac{\sum_{i=1}^n e^{-G_{m-1}(Z_i)Y_i}}{\sum_{i=1}^n e^{-G_{m-1}(Z_i)Y_i}\, 1(h_m(Z_i)Y_i = -1)},$$

or

$$\alpha_m = \frac{1}{2}\log\frac{1 - \mathrm{err}_{m-1}}{\mathrm{err}_{m-1}}.$$

The correspondence to the first version of AdaBoost we described is complete when we notice that (Problem 12.2.12)

$$e^{-G_{m-1}(Z)Y_i} = \prod_{j=1}^{m-1} e^{-\alpha_j h_j(Z)Y_i} = \Big(\prod_{j=1}^{m-1} e^{2\alpha_j 1(h_j(Z)Y_i = -1)}\Big)\, e^{-\sum_{j=1}^{m-1}\alpha_j}. \tag{12.2.29}$$
Define

$$\mathrm{logit}(u) = \log[u/(1-u)], \quad 0 < u < 1,$$

and suppose that

$$\mathrm{logit}\, P(Y = 1 | Z) = \sum_{j=1}^M \alpha_j h_j(Z) \tag{12.2.30}$$

for some $\alpha \equiv (\alpha_1, \ldots, \alpha_M)^T$. If $M$ is fixed and if the sample optimizer $\hat{\alpha}$ of $Q(\alpha, \hat{P}_n)$ exists, then if we iterate either version of AdaBoost, the algorithms converge to

$$\delta(Z) = 1 \text{ iff } \sum_{j=1}^M \hat{\alpha}_j h_j(Z) > 0$$

with probability tending to 1 for fixed $n > M$ (Problem 12.2.14). In turn, $\hat{\alpha}$ converges as $n \to \infty$ in probability to the minimizer of $Q(\alpha, P)$ (Problem 12.2.15). However, suppose $\mathrm{logit}\, P(Y = 1|Z)$ is of the form (12.2.30) with $M = \infty$ and it is approximable by functions of the form $\sum_{j=1}^m \alpha_j h_j(Z)$. This is, for instance, the case if the $[h_j(Z) + 1]/2$ are taken as indicators of arbitrary hyperrectangles in $R^d$. Then it may be shown that the population version of the algorithm of type 2, where the "best" direction $h_j$ is chosen on each iteration, converges to the Bayes classifier, but the sample version may not, unless it is regularized in some way by stopping the algorithm early. See Problem 12.2.16.
Convex objective functions other than $Q(\alpha, P)$ can also be considered, and these ideas can also be applied to nonparametric regression and even density estimation — see Bickel, Ritov, Zakkai (2001), Friedman, Hastie, Tibshirani (2000), and references therein for further discussion of boosting from a statistical point of view. See also Meir and Rätsch (2003) and Rudin, Daubechies and Schapire (2004) for a discussion from a machine learning/dynamical systems point of view.
We note that support vector machines, as defined by (12.2.24), may be put in the boosting framework (Problem 12.2.17). In particular, solving (12.2.24) is equivalent to minimizing the convex objective function

$$R(\alpha, P) = \int W\Big(\Big(\beta_0 + \sum_{j=1}^n \alpha_j h_j(z)\Big)y\Big)\, dP(z, y) + \lambda|\alpha|^2, \tag{12.2.31}$$

where $\lambda = (2\gamma)^{-1}$ and the function $W$ is given by $W(t) = (1-t)_+$, with $x_+ = x1(x \ge 0)$. $W$ may be thought of as the closest convex approximation to $1(t \le 1)$. Note (Problem 12.2.17) that, if $\lambda = 0$, and the Bayes rule is of the form

$$\delta_B(Z) = \mathrm{sgn}\Big(\sum_{j=1}^M \alpha_j^* h_j(Z)\Big),$$

then $\delta_B$ minimizes $R(\alpha, P)$, as well as the Bayes risk,

$$P\Big[Y\sum_{j=1}^M \alpha_j^* h_j(Z) < 0\Big].$$

For a fuller discussion of these relations, see Hastie et al (2001, 2009).
d) Tree-structured Methods
We define trees informally and ask the reader to examine the caption of Figure 12.1 for the definitions of the terms we use in our discussion. We consider the two-class case, that is, $Y = 0$ or 1, say. The general multiclass case is developed in Problem 12.2.17.
We will consider tree-based classifiers based on classification regions derived from a
training sample (Z1 , Y1 ), . . . , (Zn , Yn ). At level 1, a classification rule δ (1) based on a
predictor Z ∈ Z can be viewed as specifying a “critical” region C1 = {z : δ (1) (z) = 1},
where Y is classified as 1, and its complement, C0 ≡ {z : δ (1) (z) = 0}, where Y is
classified as 0. We can represent δ (1) as a two-branch tree as follows. We ask the question:
Does the observed Z belong to C1 ? If so, send Z to node 1 , on level 1 in Figure 12.1,
and predict Y = 1, or, if not, then send Z to node 0 and predict Y = 0. For level 2, we
view C1 and C0 as separate universes with classifiers δ (2) and δ̄ (2) where δ (2) takes values
0 and 1 on $C_1$ while $\bar{\delta}^{(2)}$ does so on $C_0$. Let $C_{11}, C_{10}, C_{01}, C_{00}$ be the sets defined by $\{\delta^{(1)} = 1, \delta^{(2)} = 1\}$, $\{\delta^{(1)} = 1, \delta^{(2)} = 0\}$, $\{\delta^{(1)} = 0, \bar{\delta}^{(2)} = 1\}$, $\{\delta^{(1)} = 0, \bar{\delta}^{(2)} = 0\}$. These form a partition of $\mathcal{Z}$, just as $C_0$ and $C_1$ did. Classify as follows: if $\delta^{(1)}$ sends $Z$ to $C_1$, ask whether $Z \in C_{11}$; if yes, send $Z$ to node 11 on level 2 and classify $Y$ as 1; if not, send $Z$ to node 10 and classify $Y$ as 0, etc. We have in this way constructed a more complex classifier taking values 1 on $C_{11} \cup C_{01}$ and 0 on $C_{10} \cup C_{00}$. Given a sequence of such classifiers, we can evidently grow as large a tree as we wish. The classification we make for a given $Z = z$ always corresponds to the value of the last rule, for the terminal node we send $z$ to.

[Figure 12.1. An example of a classification tree. The nodes at the final level are called terminal nodes. Y is classified to equal the last digit in the terminal node.]
We are immediately faced with the questions:
1) How do we select the consecutive rules for the different levels of the tree?
2) How deeply do we grow the tree?
Our goals in 1) are to have each member of the sequence easily computable but exhibiting the feature that the complex rules corresponding to growing the tree to a large depth
can approximate the Bayes rule. Question 2) is more complex and can be viewed as a
regularization question as discussed in Section 12.5.
To understand how Question 1 is answered, we begin with the unrealistic assumption
that we know P , that is, π, p0 , and p1 . Then,
A: We specify a class of simple rules. If $Z$ is real, the class of rules for classification and regression trees (CART) (Breiman, Friedman, Olshen and Stone (1984)) is

$$\{\delta_c, \bar{\delta}_c : \delta_c \equiv 1[c, \infty),\ \bar{\delta}_c \equiv 1 - \delta_c\}.$$

That is, $\delta_c(z) = 1$ if $z \ge c$, 0 otherwise; and $\bar{\delta}_c(z) = 1$ if $z < c$, 0 otherwise. This means that we consider all possible ways of classifying $Y$ according as $z$ is larger or smaller than a threshold.
B: Construct level 1. We begin by examining how well the populations would be split if we used $\delta_c$ or $\bar{\delta}_c$. A perfect or "pure" split would occur if

$$P[Y = 1 | Z \ge c] = 1 = P[Y = 0 | Z < c]$$

or conversely. A way to measure the "goodness of split," proposed by Breiman et al. (1984), is to use

$$\tau^2(c) \equiv E\big[\mathrm{Var}\big(Y | \delta_c(Z)\big)\big],$$

which vanishes iff the split is "pure," as an impurity measure of the split. In the two-class case this means: choose $c$ so that, if $A_c \equiv \{\delta_c(Z) = 1\}$,

$$\tau^2(c) \equiv P(A_c)P[Y = 1|A_c]P[Y = 0|A_c] + P(\bar{A}_c)P[Y = 1|\bar{A}_c]P[Y = 0|\bar{A}_c]$$

is minimal. Once the optimal $c$, $c_{\mathrm{OPT}}$, is chosen, we use $\delta_c$ or $\bar{\delta}_c$ according as

$$\pi P[Y = 1|A_c] > (1 - \pi)P[Y = 1|\bar{A}_c]$$

or not. An alternative to $\tau^2$, proposed by Quinlan (1993), is to use the entropy (see Problem 12.2.18)

$$-E\big[E\big(\log P[Y = 1 | \delta_c(Z)]\big)\big].$$
Here is an algorithm.

The T algorithm. Suppose $Z = (Z_1, \ldots, Z_k)^T \in R^k$ is the vector of predictors. For each $j$, $1 \le j \le k$, compute $c_{\mathrm{OPT}}(j)$, the split of variable $j$ which gives highest purity, i.e. smallest $\tau^2(c)$. Then, for the first level of the tree, choose the variable $Z_j$ which minimizes the impurity $\tau^2\big(c_{\mathrm{OPT}}(j)\big)$ over all possible variable choices. Call the index of this variable $j_1$ and the optimal splitting point $c^{(1)}$. This gives a rule $\delta^{(1)}$ for the first level of the tree. For the second level, consider splitting $[c^{(1)}, \infty)$ and $(-\infty, c^{(1)})$ separately into $[c^{(1)}, c')$, $[c', \infty)$ on the right and $(-\infty, c)$, $[c, c^{(1)})$ on the left, for $c < c^{(1)}$ and $c' \ge c^{(1)}$, respectively. That is, use $\delta^{(2)}(z) = \delta_{c'}(z)$ or $\bar{\delta}_{c'}(z)$, defined for $z \ge c^{(1)}$ and $c' \ge c^{(1)}$, and similarly $\bar{\delta}^{(2)}(z) = \delta_c(z)$ or $\bar{\delta}_c(z)$ for $z < c^{(1)}$, $c < c^{(1)}$, and proceed as in level 1 to obtain $(c^{(21)}, j_{21})$, $(c^{(20)}, j_{20})$. Here $c^{(21)}$, $c^{(20)}$ are the optimal cut points for variables $Z_{j_{21}}$, $Z_{j_{20}}$. Two nodes now arise from each of the nodes of level 1, with predictions carried out as described earlier — see Figure 12.1. We can continue in this fashion indefinitely if $P \leftrightarrow (\pi, p_0, p_1)$, where $p_0$, $p_1$ are densities. ✷
It is intuitively clear, and can be shown rigorously (Problem 12.2.24), that if each member of the partition defined by level $m$ of the tree has positive probability and $C_m(z)$ is the member of the partition containing $z$, then $C_m(z) \supset C_{m+1}(z)$ for all $m$ and $\cap_m C_m(z) = \{z\}$. Hence, if $|C_m(z)|$ denotes the length of $C_m(z)$,

$$P[Z \in C_m(z) | Y = y]\,/\,|C_m(z)| \to p_y(z) \text{ as } m \to \infty,$$

where $p_y(z)$ is the conditional density of $Z$ given $Y = y$. Thus, the decision rule

$$\delta_m(z) = 1 \text{ iff } \pi P[Y = 1 | Z \in C_m(z)] \ge (1 - \pi)P[Y = 0 | Z \in C_m(z)]$$

converges to the Bayes rule for each $z$, as $m \to \infty$.

If we have a training sample $Z_1, \ldots, Z_n$ and replace $P$ by the empirical probability $\hat{P}$, $C_m$ with the version $\hat{C}_m$ based on $\hat{P}$, and $\pi$ with $\hat{\pi} = \hat{P}[Y = 1]$, we have a way of growing empirical classification trees. We claim that if we stop at any level $m$ and define $\hat{p}_{mj}(z) = \hat{P}[Z \in \hat{C}_m(z) | Y = j]$, the ratio of the number of $Y_i = j$ which appear in the node to which $z$ belongs to the total number of $Y_i = j$, then the $\hat{p}_{mj}(z)$ are estimates of the $p_j(z)$ and the rule

$$\hat{\delta}(z) = 1 \text{ iff } \hat{\pi}\,\hat{p}_{m1}(z) \ge (1 - \hat{\pi})\,\hat{p}_{m0}(z)$$

converges to the Bayes rule provided that $n \to \infty$ and $m_n \to \infty$ slowly — see Breiman, Friedman, Olshen and Stone (1984) for rules for selecting the depth $m$.
CART and Sieves

It should be clear that, if we stop at stage $m$, the estimates $\hat{p}_{mj}$ are just histograms, but with bin size adapted to the data. They can be related to a sieve as follows. Take without loss of generality the support of $P$ to be the cube $I^k$. Let $\mathcal{P}_m = \{$all histograms on $I^k$ with up to $m$ break points in each coordinate$\}$. That is, if $p_j \leftrightarrow P \in \mathcal{P}_m$, $j = 0, 1$,

$$p_j(z) = \sum_{i=1}^{2^m} a_{ij}\, 1(z \in A_{mij}),$$

where $\{A_{mij}\}$ is a partition of $I^k$ into $k$-cubes determined in each coordinate by break points $0 = b_0 \le b_1 \le \ldots \le b_{m+1} = 1$. So $\mathcal{P}_m$ is parametrizable by $\{a_{ij}, 1 \le i \le 2^m - 1, j = 0, 1\}$, the $2^m - 1$ left-hand vertices of the $k$-cubes, and the indices $(j_1, \ldots, j_m)$ of the variables. If we call this parameter $\theta$, then the tree gives us an estimate $\hat{\nu}_m$ of a parameter $\nu_m(P)$ such that $\nu_m(P_\theta) = \theta$. However, the method of estimation is not maximum likelihood, or more generally of minimum contrast form. As we have noted, if $P \notin \mathcal{P}$, then $P_{\nu_m(P)}$, the member of $\mathcal{P}_m$ which is "closest" in the appropriate sense to $P$, converges to $P$ as $m \to \infty$. Then, again, regularization, which in this case is the choice of the tuning parameter $m$, is called for. For an extensive discussion of tree-structured methods, see Breiman et al (1984). ✷
Summary. In this section we describe and analyze in detail some of the procedures mentioned previously. In Section 12.2.1 we establish consistency and determine rates of uniform convergence of the Nadaraya-Watson estimate over suitable nonparametric regression models. In Section 12.2.2 we study consistency of $k$th nearest neighbour rules in classification. In Section 12.2.3 we also touch on logistic regression and ridge regression as examples of sieve methods, and on their connections to empirical Bayes. Finally, in Section 12.2.4 we describe support vector machines, boosting, and the tree-based method CART.
12.3 Asymptotic Risk Criteria
In what sense are the various procedures we have discussed best, or even good? The classical decision theory point of view we have discussed in Section 1.3 of Volume I is to first consider a probability model $\mathcal{P}$ for $X \in \mathcal{X}$, specify a decision space $D$, a loss function $l : \mathcal{P} \times D \to R^+$, and the class of all possible (possibly randomized) decision procedures $\delta : \mathcal{X} \to D$. We measure the performance of a decision procedure $\delta$ by

$$R(P, \delta) \equiv E_P\, l\big(P, \delta(X)\big), \quad P \in \mathcal{P}. \tag{12.3.1}$$

The principle focussed on in the analysis of nonparametric classification and regression procedures is minimax. That is, we compute

$$R(\delta) \equiv \max_P R(P, \delta),$$

and consider $\delta^*$ minimax optimal if

$$R(\delta^*) = \min_\delta R(\delta).$$

A key result which we extend is Theorem 3.3.3, which states that $\delta^*$ is minimax if we can find a sequence of priors $\pi_k$ such that

$$\inf_\delta r(\pi_k, \delta) \to R(\delta^*),$$

where $r(\pi_k, \delta)$ denotes the Bayes risk of $\delta$.
Recall from Section 12.1 that classification, regression, and other prediction problems are naturally split into two parts:

(i) "$P$ known": Here we envisage observing a vector of predictors $Z \in \mathcal{Z}$ and know $P$, where $P$ is the distribution of $X = (Z, Y)$. $Y$ is unknown and we want to predict $Y$ for each given $Z = z$. A decision rule is a prediction rule $d(\cdot) : \mathcal{Z} \to \mathcal{Y}$, where $\mathcal{Y}$ is the set of values or classes to be predicted. The risk of $d(\cdot)$ is

$$R\big(P, d(\cdot)\big) = \int l\big(y, d(z)\big)\, dP(z, y) = E_P\, l\big(Y, d(Z)\big),$$

where $l(y, d(z))$ is the loss associated with predicting $y$ to be $d(z)$. If we think of $Y$ as a random "parameter," there is, for any such problem, an optimal rule, the Bayes rule of Proposition 3.2.1,

$$\delta_P^*(z) = \arg\min_d E_P\big[l(Y, d(z)) \,|\, Z = z\big],$$

which yields the minimum Bayes risk,

$$R_P^* \equiv E_P\, R\big(P, \delta_P^*(Z)\big).$$

(ii) The "real" problem: $P$ unknown. We are given training data $X_i \equiv (Z_i, Y_i)$, $i = 1, \ldots, n$, i.i.d. as $X \sim P$. A decision procedure is a mapping

$$\delta(\cdot : X_i, 1 \le i \le n) : \mathcal{Z} \to D$$
and has risk

$$R(P, \delta) = E_P \int l\big(y, \delta(z : X_i, 1 \le i \le n)\big)\, dP(z, y),$$

where $l$ is the loss when predicting $y$ to be $\delta(z : X_i, 1 \le i \le n)$.

It is useful in this framework to measure the performance of $\delta$ equivalently by its regret,

$$\tilde{R}(P, \delta) \equiv R(P, \delta) - R_P^*, \tag{12.3.2}$$

and to ask first that, asymptotically, we have, for a sequence $\{\delta_n\}$,

Consistency: $\tilde{R}(P, \delta_n) \to 0$ as $n \to \infty$ for all $P$, uniformly on $\mathcal{P}$.

And second, given consistency, that we have

Asymptotic minimax regret on $\mathcal{P}$:

$$\tilde{R}(\delta_n) \equiv \sup\{\tilde{R}(P, \delta_n) : P \in \mathcal{P}\} \asymp \inf_\delta \sup\{\tilde{R}(P, \delta) : P \in \mathcal{P}\} \equiv \tilde{R}_n. \tag{12.3.3}$$

Here $\asymp$ can mean strongly,

$$\frac{\tilde{R}(\delta_n)}{\tilde{R}_n} \to 1, \tag{12.3.4}$$

or weakly,

$$\tilde{R}(\delta_n) = O(\tilde{R}_n) \quad\text{and}\quad \tilde{R}_n = O\big(\tilde{R}(\delta_n)\big), \tag{12.3.5}$$

noting that we are in a situation where $\tilde{R}_n \to 0$. Typically, unless $\mathcal{P}$ is regular parametric, (12.3.5) is the best that can be hoped for.

When proving minimaxity, the essential argument always has the same nature: exhibit, or at least show the existence of, a sequence of Bayes priors and corresponding Bayes rules whose Bayes regret, that is, the regret of the Bayes rules, is asymptotically the same as, or no better than, the maximum regret $\tilde{R}(\delta_n)$ of the candidate minimax procedures. We begin with the parametric case.
12.3.1 Optimal Prediction in Parametric Regression Models
We establish the consistency and strong asymptotic minimax regret property of an empirical plug-in Bayes predictor for squared error loss when $\mathcal{P}$ is a regular parametric model. Let $X = (Z, Y)$ be distributed according to $P_\theta$ with density $p(z, y|\theta)$, $\theta \in R^p$, such that the model $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^p$, is regular and obeys the conditions of Theorem 6.2.2. Suppose also that, as usual, the marginal density $q(z)$ of $Z$ does not depend on $\theta$, and that

(i) $Y$ is bounded: $|Y| \le M$ for some $M > 0$.

(ii) $q(\cdot)$ is bounded below: $P\big[q(Z) \ge \delta\big] = 1$ for some $\delta > 0$.

(iii) $\max_{\theta \in \Theta}\Big\{\int \nabla_\theta^T e(z, \theta)\, I^{-1}(\theta)\,\nabla_\theta e(z, \theta)\, q(z)\, dz\Big\} < \infty$,

where, by (1.4.5), the Bayes regret of $\delta$ at $P_\theta$ for squared error loss is

$$R_\theta(\delta) \equiv E_\theta \int \big(\delta(z : X_1, \ldots, X_n) - e(z, \theta)\big)^2 q(z)\, dz.$$

Here, $E_\theta$ is expectation with respect to $P_\theta$, and by (1.4.4), the optimal predictor of $Y$ given $\theta$ and $Z = z$ is

$$e(z, \theta) = E_\theta(Y|z) = \int_{-\infty}^{\infty} y\, p(z, y|\theta)\, dy \Big/ q(z). \tag{12.3.6}$$
When $\theta$ is unknown, the natural estimate of $e(z, \theta)$ is the plug-in estimate $e(z, \hat{\theta})$, where $\hat{\theta}$ is the MLE of $\theta$, which we assume behaves regularly. Thus, we take

$$\delta^*(z : X_1, \ldots, X_n) \equiv e(z, \hat{\theta}).$$

By the expansion

$$e(z, \hat{\theta}) = e(z, \theta) + \nabla_\theta^T e(z, \theta)(\hat{\theta} - \theta) + o_P(|\hat{\theta} - \theta|),$$

formally,

$$R_\theta(\delta^*) = E_\theta \int \big(e(z, \hat{\theta}) - e(z, \theta)\big)^2 q(z)\, dz = E_\theta \int \big(\nabla_\theta^T e(z, \theta)(\hat{\theta} - \theta)\big)^2 q(z)\, dz\,\big(1 + o_P(1)\big) = n^{-1}\int \nabla_\theta^T e(z, \theta)\, I^{-1}(\theta)\,\nabla_\theta e(z, \theta)\, q(z)\, dz\,\big(1 + o(1)\big), \tag{12.3.7}$$

where $I(\theta)$ is the Fisher information matrix. These formal calculations can be justified under our conditions (Problem 12.3.1) and establish Bayes consistency of the decision rule $\delta^*$.

Proposition 12.3.1. Under (i), (ii), (iii), and the assumptions of Theorem 6.2.2, the predictor $\delta^* = e(z, \hat{\theta})$ is asymptotically Bayes consistent for squared error loss.
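As a concrete, hypothetical instance of $\delta^*$ (not from the text): take $Y = \theta Z + \varepsilon$ with $\varepsilon \sim N(0, 1)$, so $e(z, \theta) = \theta z$ and the MLE $\hat{\theta}$ is the least squares slope; the plug-in predictor is $e(z, \hat{\theta}) = \hat{\theta} z$.

```python
import numpy as np

def plugin_predictor(Z, Y):
    """Plug-in predictor delta*(z) = e(z, theta_hat) in the toy model
    Y = theta * Z + eps, eps ~ N(0, 1): here e(z, theta) = theta * z
    and theta_hat is the MLE (the least squares slope)."""
    theta_hat = float(np.sum(Z * Y) / np.sum(Z * Z))
    return lambda z: theta_hat * z
```

With a true slope of 1.5 and a moderate sample, the plug-in prediction at $z = 1$ is close to 1.5, illustrating the Bayes consistency asserted above.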
To show that the asymptotic minimax regret property (12.3.4) holds, we will examine what the Bayes solution of this problem is if we assign to $\theta$ a positive, continuous, bounded prior density $\pi$. Then, for $Z$ independent of $X_1, \ldots, X_n$ and $\theta$, the Bayes risk of a decision rule $\delta$ is

$$\int R_\theta(\delta)\pi(\theta)\, d\theta = E_\pi E_\theta\big[\big(\delta(Z : X_1, \ldots, X_n) - e(Z, \theta)\big)^2 \,\big|\, \theta\big] = E\, E_\pi\big[\big(\delta(Z : X_1, \ldots, X_n) - e(Z, \theta)\big)^2 \,\big|\, Z, X_1, \ldots, X_n\big], \tag{12.3.8}$$

and Theorem 1.4.1 leads to the Bayes rule,

$$\delta_\pi(Z : X_1, \ldots, X_n) = E_\pi[e(Z, \theta) \,|\, Z, X_1, \ldots, X_n] = E_\pi[e(Z, \theta)\,|\,X],$$

where now $X$ denotes $(X_1, \ldots, X_n)$, and the Bayes regret for $\delta_\pi$, the minimizer of the Bayes risk, is

$$R_\pi = \int E_\theta\big(E_\pi[e(Z, \theta)|X] - e(Z, \theta)\big)^2 \pi(\theta)\, d\theta. \tag{12.3.9}$$
Next we argue that we get a close approximation to $R_\pi$ if in (12.3.9) we replace $E_\pi[e(Z, \theta)|X]$ with $e(Z, \hat{\theta})$. By assumptions (i), (ii), and (iii),

$$e(z, \hat{\theta} + vn^{-1/2}) - e(z, \hat{\theta}) = \nabla_\theta^T e(z, \hat{\theta})\, v\, n^{-1/2} + O_P(n^{-1}), \tag{12.3.10}$$

uniformly for $v$ in a compact set. Writing $e(z, \theta) \cong e(z, \hat{\theta}) + \nabla_\theta^T e(z, \hat{\theta})(\theta - \hat{\theta})$, we have

$$\mathcal{L}_\theta\big(E_\pi[e(Z, \theta)|X] - e(Z, \hat{\theta})\big) \cong \mathcal{L}_\theta\big(\nabla_\theta^T e(Z, \hat{\theta})\, E_\pi(\theta - \hat{\theta} \,|\, X)\big).$$

Noting that $e$ is uniformly bounded, and applying the Bernstein–von Mises theorem to $\sqrt{n}(\theta - \hat{\theta})\,|\,X$ (Theorems 5.5.2 and 6.2.3), we get that

$$\int E_\theta\big(E_\pi[e(Z, \theta)|X] - e(Z, \hat{\theta})\big)^2 \pi(\theta)\, d\theta = \frac{1}{n}\int E_\theta\Big(\int \nabla_\theta^T e(Z, \hat{\theta})\, v\, \varphi\big(v, I^{-1}(\hat{\theta})\big)\, dv\Big)^2 \pi(\theta)\, d\theta\,\big(1 + o(1)\big), \tag{12.3.11}$$

where $\varphi(v, \Sigma)$ is the $N(0, \Sigma)$ density. By the symmetry of the $N(0, \Sigma)$ distribution, the inside integral is zero. Then (12.3.11) is $o(n^{-1})$, and using (1.4.5) we can conclude that

$$R_\pi = \int E_\theta\big(e(Z, \hat{\theta}) - e(Z, \theta)\big)^2 \pi(\theta)\, d\theta + o(n^{-1}). \tag{12.3.12}$$

An application of the argument leading to the Bernstein–von Mises theorem similarly yields

$$\int R_\theta(\delta^*)\pi(\theta)\, d\theta = R_\pi\big(1 + o(1)\big). \tag{12.3.13}$$
What we have shown is that, by (12.3.12) and the expansion leading to (12.3.7):

Proposition 12.3.2. Under the conditions of Proposition 12.3.1,

(i) the Bayes regret is
\[
R_\pi = \frac{c(\pi)}{n}\,\big( 1 + o(1) \big) ,
\]
where
\[
c(\pi) = \int E_\theta\big[ \nabla_\theta^T e(Z, \theta)\,I^{-1}(\theta)\,\nabla_\theta e(Z, \theta) \big]\,\pi(\theta)\,d\theta ,
\]
and thus

(ii) $e(\cdot, \hat\theta)$ is an asymptotically efficient estimate of $e(\cdot, \theta)$.

Section 12.3  Asymptotic Risk Criteria  337
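The $n^{-1}$ regret rate in Proposition 12.3.2 can be checked by simulation. The toy model below is our own illustrative choice, not the text's: $X_1, \ldots, X_n$ i.i.d. $N(\theta, 1)$, so $\hat\theta = \bar X$ and $I(\theta) = 1$, with prediction target $e(z, \theta) = \theta z$ and $Z$ uniform on $[0,1]$, for which (12.3.7) predicts risk $\approx \frac{1}{n}\int_0^1 z^2\,dz = 1/(3n)$.

```python
import numpy as np

# Monte Carlo check of the expansion (12.3.7) in a toy Gaussian model
# (an assumption for illustration): theta_hat = X_bar, I(theta) = 1,
# e(z, theta) = theta * z, Z ~ Uniform(0,1), so risk ~ 1/(3n).
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 200, 20000

theta_hat = theta + rng.standard_normal((reps, n)).mean(axis=1)  # MLE per replication
# IMSE of the plug-in predictor: E \int_0^1 (theta_hat*z - theta*z)^2 dz
#                              = E (theta_hat - theta)^2 / 3
mc_risk = np.mean((theta_hat - theta) ** 2) / 3.0
asymptotic_risk = 1.0 / (3.0 * n)

print(mc_risk, asymptotic_risk)
assert abs(mc_risk - asymptotic_risk) / asymptotic_risk < 0.05
```

The Monte Carlo risk agrees with the leading term of (12.3.7) to within simulation error.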
We next show that $\delta^*$ has the strong asymptotic minimax property.

Proposition 12.3.3. Under the assumptions of Proposition 12.3.1, $\delta^*$ is asymptotically minimax regret in the sense of (12.3.4).

Proof. Take $\pi = \pi_t$ with $\pi_t$ converging weakly to point mass at $\theta^*$ as $t \to 0$, where
\[
\theta^* \equiv \arg\max_\theta E_\theta\big[ \nabla_\theta^T e(Z, \theta)\,I^{-1}(\theta)\,\nabla_\theta e(Z, \theta) \big] .
\]
Since the integral of a function against a probability density is no larger than the function's maximum,
\[
\max_\pi c(\pi) = \max_\theta E_\theta\big[ \nabla_\theta^T e(Z, \theta)\,I^{-1}(\theta)\,\nabla_\theta e(Z, \theta) \big] .
\]
It follows that the maximum Bayes regret is
\[
\frac1n \max_\theta E_\theta\big[ \nabla_\theta^T e(Z, \theta)\,I^{-1}(\theta)\,\nabla_\theta e(Z, \theta) \big]\,\big( 1 + o(1) \big) ,
\]
and hence we conclude that $\delta^*$ is asymptotically minimax regret in the strong sense of (12.3.4) by the arguments establishing Theorem 3.3.3. That is, we have shown that there are priors whose Bayes regret has the same asymptotic behavior as the maximum regret of our candidate rule $\delta^*$. ✷
12.3.2  Optimal Rates of Convergence for Estimation and Prediction in Nonparametric Models
In this section we will develop methods for showing weak regret minimaxity in general
nonparametric frameworks. Again, this involves introducing a candidate procedure and
finding a “least favorable” prior where the Bayes risk converges to the maximum risk of the
candidate rule. Our main application is
Example 12.3.1. Nonparametric regression. We show that, with proper bandwidth choice, the Nadaraya-Watson kernel estimate of Section 12.2.1 is asymptotically minimax regret in the weak sense for the nonparametric problem of predicting $\mu(z)$, where $\mu(z) \equiv E(Y \mid Z = z)$ satisfies a first order Lipschitz condition,
\[
\mathcal{P} = \big\{ P : |\mu(z) - \mu(z')| \le M|z - z'| \text{ for all } z, z' \big\} .
\tag{12.3.14}
\]
We recall first that the rate $n^{-2/(2+d)}$ is achievable in nonparametric regression for the
problem of predicting µ(Z) = E(Y |Z) with squared error loss on the basis of (Z, Y ),
Z ∈ Rd , when the joint distribution P of (Z, Y ) belongs to P given by (12.3.14). The
following theorem, which is easily generalized to more general Z, Y , follows from the
proof of Theorem 12.2.1 in Appendix D.7.
Theorem 12.3.1. Let $\mathcal{P}_0 = \{ P \in \mathcal{P} : Z$ uniform on $I^d \}$. Let $\hat\mu_{NW}$ be the NW estimate given in (12.2.7) with product kernel $\prod_{j=1}^d K_0(t_j)$, where $K_0$ is non-negative, continuous, bounded, and has compact support. Then, using bandwidth $h = c n^{-1/(2+d)}$, $c > 0$:

(i) For fixed $z$ in the interior of $I^d$, $\sup\big\{ E_P |\hat\mu_{NW}(z) - \mu(z)|^2 : P \in \mathcal{P}_0 \big\} = O\big( n^{-\frac{2}{2+d}} \big)$.

(ii) $\sup\big\{ E_P |\hat\mu_{NW}(Z) - \mu(Z)|^2 : P \in \mathcal{P}_0 \big\} = O\big( n^{-\frac{2}{2+d}} \big)$.  (12.3.15)
Remark 12.3.1. (a) In the problems we will show that the rate $n^{-2s/(2s+d)}$ can be obtained if we generalize (12.3.14) to the Hölder class
\[
\mathcal{P}_s \equiv \big\{ P : |D^{s-1}\big( \mu(z) - \mu(z') \big)| \le M|z - z'| \text{ for all } z, z' \big\} ,
\]
where $D^{s-1}$ is the $(s-1)$st differential of $\mu$, $D^0 \mu = \mu$, provided we replace the Nadaraya-Watson estimate by a suitable local polynomial estimate.

(b) These rates for Hölder classes or similar Sobolev classes can be achieved by a variety of predictors; see Tsybakov (2008) and Györfi et al. (2002) for examples. This is also true for classes which permit discontinuous functions. In this case an example of appropriate procedures is methods based on wavelet expansions of the members of $\mathcal{P}$ in (12.3.14). See Donoho and Johnstone (1994) for basic results in this direction. ✷
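The rate in Theorem 12.3.1 can be illustrated by simulation. The sketch below uses $d = 1$, a Lipschitz $\mu$ and a box kernel of our own choosing; with $h = c\,n^{-1/3}$ the pointwise MSE should shrink by roughly $8^{2/3} = 4$ when $n$ grows by a factor of 8.

```python
import numpy as np

# Simulation sketch of Theorem 12.3.1 for d = 1: Nadaraya-Watson with
# bandwidth h = c n^{-1/(2+d)} has MSE of order n^{-2/(2+d)} = n^{-2/3}.
# mu and the uniform kernel are illustrative assumptions, not the text's.
rng = np.random.default_rng(1)
mu = lambda z: np.abs(z - 0.5)          # Lipschitz on [0, 1]

def nw_mse(n, z0=0.3, reps=400, c=0.5):
    h = c * n ** (-1.0 / 3.0)
    err = []
    for _ in range(reps):
        Z = rng.uniform(0, 1, n)
        Y = mu(Z) + rng.standard_normal(n)
        w = (np.abs(Z - z0) <= h).astype(float)   # K0 = box kernel
        est = (w @ Y) / max(w.sum(), 1.0)
        err.append((est - mu(z0)) ** 2)
    return np.mean(err)

m1, m2 = nw_mse(200), nw_mse(1600)
print(m1, m2, m1 / m2)
# n grows by 8, so the MSE should fall by roughly 8^{2/3} = 4
assert m2 < m1
assert 2.0 < m1 / m2 < 8.0
```

The observed ratio is close to 4, consistent with the $n^{-2/3}$ rate.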
Lower bounds

Using lower bounds on risk, we will show that for $\mathcal{P}$ as in (12.3.14), $n^{-\frac{2}{2+d}}$ is indeed the best minimax rate one can hope for. We also reveal something equally significant. For IMSE, the "least favorable" prior distributions are quite unlike those in Section 12.3.1. The former can be thought of as smooth densities concentrating around a least favorable point $\theta^*$. The latter concentrate around probabilities $P$ whose corresponding $\mu(\cdot)$ is as wiggly as possible while still obeying the constraints on $\mathcal{P}$. Such $\mu$ seem rather implausible in actual practice, and the minimax theorems are not as compelling as in the low dimensional parametric case. As Einstein said (in English translation), "God is subtle but not malevolent." ✷
In general, to prove lower bound results, we necessarily want to use asymptotic versions
of the minimax theory of Chapter 8. There are a number of approaches to lower bounds.
Some of the ones we present and others are discussed in full detail in Tsybakov (2008),
Chapter 2.
The testing approach to lower risk bounds

Suppose that our decision problem is of the form: we observe $X \in \mathcal{X}$, $X \sim P$, $P \in \mathcal{P}$, and want to estimate a parameter $\theta : \mathcal{P} \to T$, where we are given a metric $\rho$ on $T$, and our goal is to construct $\tilde\theta : \mathcal{X} \to T$ so that $\tilde\theta$ has the minimax property
\[
\max_P E_P\,\rho\big( \tilde\theta(X), \theta(P) \big) = \min_{\hat\theta} \max_P E_P\,\rho\big( \hat\theta(X), \theta(P) \big) .
\]
Evidently prediction in regression is a problem of this type with $\theta(P)(\cdot) = E_P(Y \mid \cdot) = \mu(\cdot)$. Here $X = (Z, Y)$, $T$ is a collection of functions of $Z$, and a possible metric to use as a loss function is
\[
\rho(t_1, t_2) \equiv \Big( \int \big( t_1(z) - t_2(z) \big)^2 d\nu(z) \Big)^{1/2}
\]
for some probability $\nu$ and $t_1, t_2 \in T$. For this $\theta(P)$ and $\rho$, if $\nu = Q$ is the distribution of a $Z$ independent of the data vector $X$,
\[
E\,\rho^2\big( \hat\theta(X), \theta(P) \big) = E_Q\,MSE\big( \hat\mu(Z) \big) = IMSE(\hat\mu)
\]
by Fubini's theorem. Another example is $\nu =$ point mass at a given point $z$.
Returning to the general case, let $\theta_0 \equiv \theta(P_0)(\cdot)$, where $P_0$ is fixed, and for $r > 0$ let
\[
S_r \equiv \big\{ P : \rho\big( \theta(P), \theta_0 \big) \ge r \big\} .
\]
Our strategy is to select a subset
\[
\Omega_r = \{ Q_\lambda \in S_r : \lambda \in \Lambda \subset \mathbb{R} \}
\]
of $S_r$, parametrized by $\lambda \in \Lambda$, and to select a prior $\pi_r(\lambda)$ on $\Lambda$ such that the Bayes test generates lower bounds on risk. We will identify $P \in \Omega_r$ with $\lambda$ and write $\Pi_r(dP)$ for $\Pi_r(\lambda)\,d\lambda$ as well as $\Pi_r(\Omega_r)$ for $\Pi_r(\Lambda)$.
Theorem 12.3.2. Suppose all $P \in \mathcal{P}_r \equiv \{P_0\} \cup \Omega_r$ have discrete or continuous case density functions $p$. Consider the testing problem $H : P = P_0$ vs. $K : P \in \Omega_r$ with 0-1 loss. Let $\Pi_r$ be a prior distribution on $\mathcal{P}_r$ such that $\frac12 = \Pi_r\{P_0\} = 1 - \Pi_r(\Omega_r)$, and let $\varphi_B$ be a corresponding Bayes test,
\[
\varphi_B(x) = 1 \ \text{ if } \ L \equiv \int_{\Omega_r} \frac{p(x)}{p_0(x)}\,\Pi_r(dP) > c , \qquad \varphi_B(x) = 0 \ \text{ if } \ L < c .
\]
Let
\[
R_B(\Pi_r) \equiv r(\Pi_r, \varphi_B) = \frac12 E_0\,\varphi_B(X) + \int_{\Omega_r} E_P\big( 1 - \varphi_B(X) \big)\,\Pi_r(dP)
\tag{12.3.16}
\]
be the minimum Bayes risk, and assume $0 < R_B(\Pi_r) < 1$. Then, for any $\hat\theta$ and any metric $\rho$,
\[
\max_{\mathcal{P}_r} P\Big[ \rho\big( \hat\theta, \theta(P) \big) \ge \frac r2 \Big] \ge R_B(\Pi_r) .
\]
Proof: Consider the test
\[
\varphi(x) = 1 \ \text{ if } \ \rho(\hat\theta, \theta_0) > \frac r2 , \qquad \varphi(x) = 0 \ \text{ if } \ \rho(\hat\theta, \theta_0) \le \frac r2 .
\]
Then $\varphi$ has Bayes risk no lower than the Bayes test; thus
\[
\frac12 E_0\,\varphi(X) + \max_{\Omega_r} E_P\big( 1 - \varphi(X) \big) \ge R_B(\Pi_r) .
\tag{12.3.17}
\]
So, either
\[
P_0\Big[ \rho(\hat\theta, \theta_0) > \frac r2 \Big] \ge R_B(\Pi_r)
\]
or
\[
\max_{\Omega_r} P\Big[ \rho(\hat\theta, \theta_0) \le \frac r2 \Big] \ge R_B(\Pi_r) .
\tag{12.3.18}
\]
In the second case, writing $\theta$ for $\theta(P)$, we have, by the triangle inequality,
\[
\rho(\hat\theta, \theta_0) \ge \rho(\theta, \theta_0) - \rho(\hat\theta, \theta) \ge r - \rho(\hat\theta, \theta) .
\]
Then,
\[
\rho(\hat\theta, \theta_0) \le \frac r2 \implies \rho(\hat\theta, \theta) \ge \frac r2 .
\]
Using (12.3.18), we see there exists $P \in \Omega_r$ such that
\[
R_B(\Pi_r) \le P\Big[ \rho(\hat\theta, \theta_0) \le \frac r2 \Big] \le P\Big[ \rho(\hat\theta, \theta) \ge \frac r2 \Big] ,
\]
and the result follows. ✷
Further, by Markov's inequality we obtain a minimax risk bound:

Corollary 12.3.1. Under the conditions of Theorem 12.3.2,
\[
\min_{\hat\theta} \max_P E_P\,\rho\big( \hat\theta, \theta(P) \big) \ge \frac r2\,R_B(\Pi_r) .
\]
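Theorem 12.3.2 can be made concrete in the simplest Gaussian setting; the singleton $\Omega_r$ and the values of $n$ and $r$ below are illustrative choices of ours, not the text's.

```python
import math

# Numerical illustration of Theorem 12.3.2 with Omega_r = {P_1}:
# P_0 = N(0,1)^n, P_1 = N(r,1)^n, theta(P) = mean, rho = absolute distance.
# The Bayes test rejects when X_bar > r/2, so
#   R_B = (1/2)[P_0(X_bar > r/2) + P_1(X_bar <= r/2)] = Phi(-sqrt(n) r / 2).
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, r = 25, 0.4
RB = Phi(-math.sqrt(n) * r / 2.0)

# For theta_hat = X_bar (which is N(theta, 1/n)),
# P_theta(|X_bar - theta| >= r/2) = 2 Phi(-sqrt(n) r / 2) at both theta = 0 and theta = r.
max_prob = 2.0 * Phi(-math.sqrt(n) * r / 2.0)

print(RB, max_prob)
assert max_prob >= RB        # the lower bound of Theorem 12.3.2 holds
assert 0.0 < RB < 1.0
```

Here the bound is attained up to a factor of 2 by the sample mean.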
Bounds from asymptotic likelihoods in the Gaussian contiguous case

The main weakness of the bound based on Theorem 12.3.2 is that we are left with the problem of lower bounding the Bayes risk $R_B(\Pi_r)$ itself. However, in the case that $(X_1, \ldots, X_n)^T$ is a vector of independent random variables with density $p(x) = \prod_{j=1}^n p^{(j)}(x_j)$, there is a natural approach to asymptotic bounds using the contiguity ideas of Section 9.5. Let $r = r_n$ depend on $n$ and write $\Pi_n$ for $\Pi_{r_n}$. We are looking for priors $\pi_n(\cdot)$ on $S_{r_n} \cup \{P_0\}$ which make the testing problem as hard as possible, i.e., an (approximately) least favorable prior distribution $\pi_n$. Let $\Omega_n$ denote $\Omega_{r_n}$ and
\[
L_n(P) = \log \prod_{j=1}^n \frac{p^{(j)}}{p_0^{(j)}}(X_j) ,
\]
where $P \in \Omega_n$. Suppose that we can choose $\Omega_n$ and $\Pi_n$ such that $\pi_n(\Omega_n) = \frac12 = \pi_n(P_0)$ and, as $n \to \infty$,
\[
\pi_n\Big\{ P : L_n \overset{P_0}{\Longrightarrow} N\Big( -\frac{\sigma^2}{2}, \sigma^2 \Big) \Big\} \to 1 .
\]
It readily follows that, asymptotically, we should take $c = \frac12$ in the Bayes test $\varphi_B$, and then (Problem 12.3.7), as $n \to \infty$,
\[
R_B(\Pi_r) \to 1 - \Phi\Big( \frac{\sigma}{2} \Big) .
\]
Thus, a general approach to getting lower bounds is to construct Ωn and Πn as above. We
can then conclude by Corollary 12.3.1 that it will not be possible to estimate θ(P )(·) at a
rate faster than rn ↓ 0.
Example 12.3.1. (Continued) Nonparametric regression.
We next apply the likelihood idea to the Lipschitz model P0 of Theorem 12.3.1 and
show minimaxity of the Nadaraya-Watson estimate. As before, for simplicity, assume Z is
uniform on the unit cube I d = [0, 1]d .
(a) Estimation of µ(·) at a Point
We begin with the easier problem of estimating $\mu(z)$ at a point, say $z_0 \equiv (\frac12, \ldots, \frac12)^T$, with loss function $|\hat\mu(z_0) - \mu(z_0)|$. For our submodel, we consider
\[
Y_i = \mu_h(Z_i) + N_i , \quad i = 1, \ldots, n ,
\tag{12.3.19}
\]
where $N_1, \ldots, N_n$ are i.i.d. $N(0, 1)$ and we assume, without loss of generality, that
\[
\mu_0(\cdot) \equiv \mu_{P_0}(\cdot) = 0 .
\]
Given that $\hat\mu_{NW}$ is the candidate rule, it is natural to consider a prior $\pi_n$ putting all its mass on $\mu_h(\cdot)$ given by
\[
\mu_h(z) \equiv h\,g\Big( \frac{z - z_0}{h} \Big)\,1\Big( |z - z_0| \le \frac h2 \Big) ,
\tag{12.3.20}
\]
where $h = h_n$ depends on $n$, and the kernel $g$ is of the form
\[
g(z) \equiv \prod_{j=1}^d g_1(z_j)
\]
with $g_1$ differentiable and, for some $M_0 > 0$,
\[
g_1 : [0, 1] \to \mathbb{R}, \qquad g_1 \ne 0, \qquad g_1(0) = g_1(1) = 0, \qquad |g_1'| \le M_0 .
\tag{12.3.21}
\]
Note that, for $z^{(1)}, z^{(2)} \in I^d$,

(i) $|g(z^{(k)})| \le M_0^d$, $k = 1, 2$;

(ii)
\[
|g(z^{(1)}) - g(z^{(2)})| \le M_0^{d-1} \sum_{j=1}^d |g_1(z_j^{(1)}) - g_1(z_j^{(2)})|
\le M_0^d \sum_{j=1}^d |z_j^{(1)} - z_j^{(2)}| \le M_0^d\,d^{1/2}\,|z^{(1)} - z^{(2)}| .
\tag{12.3.22}
\]
Thus if $M_0 \equiv (M d^{-1/2})^{1/d}$ for some $M > 0$, and $|z^{(k)} - z_0| \le \frac12 h$, $k = 1, 2$, then
\[
|\mu_h(z^{(1)}) - \mu_h(z^{(2)})| \le h\,\Big| g\Big( \frac{z^{(1)} - z_0}{h} \Big) - g\Big( \frac{z^{(2)} - z_0}{h} \Big) \Big| \le M\,|z^{(1)} - z^{(2)}| .
\tag{12.3.23}
\]
In view of (12.3.21), $h\,g_1\big[ (z - \frac12)/h \big]\,1\big( |z - \frac12| \le h \big)$ is differentiable on $(0, 1)$ with derivative bounded by $M_0$ in absolute value. Thus the first order Lipschitz condition (12.3.14) holds for all $z_1, z_2$. Moreover, by construction,
\[
\mu_h(z) \le h M_0^d ,
\tag{12.3.24}
\]
and
\[
L_n(P) = \sum_{i=1}^n \Big[ \mu_h(Z_i) N_i - \frac{\mu_h^2(Z_i)}{2} \Big] ,
\tag{12.3.25}
\]
where the $N_i$ are i.i.d. $N(0, 1)$. So, given $Z_1, \ldots, Z_n$,
\[
L_n \sim N\Big( -\frac12 \sum_{i=1}^n \mu_h^2(Z_i),\ \sum_{i=1}^n \mu_h^2(Z_i) \Big) .
\]
Now,
\[
\mu_h(Z_i) = 0 \ \text{ if } |Z_i - z_0| > h ; \qquad \mu_h(Z_i) = h\,g\Big( \frac{Z_i - z_0}{h} \Big) \ \text{ otherwise.}
\]
Evidently,
\[
P\big[ |Z_i - z_0| \le h \big] \asymp h^d ,
\]
and given $|Z_i - z_0| \le h$, $\frac{Z_i - z_0}{h}$ is uniform on $I^d$. Thus,
\[
E \sum_{i=1}^n \mu_h^2(Z_i) \asymp n h^{d+2}, \qquad
\operatorname{Var} \sum_{i=1}^n \mu_h^2(Z_i) \le n\,E\,\mu_h^4(Z) \asymp n h^{d+4} ,
\]
and if we take $h = n^{-\frac{1}{d+2}}$, then $\sum_{i=1}^n \mu_h^2(Z_i) \overset{P}{\longrightarrow} \sigma^2$. Hence,
\[
L_n(P) \Longrightarrow N\Big( -\frac12 \sigma^2, \sigma^2 \Big) ,
\tag{12.3.26}
\]
where $\sigma^2 = \big( \int_0^1 g_1^2(z)\,dz \big)^d$. In view of (12.3.23), (12.3.24) and (12.3.26) we can apply Corollary 12.3.1 and obtain that the lower bound on the risk has the rate $n^{-\frac{2}{d+2}}$ we got for the Nadaraya-Watson estimate in Theorem 12.2.1. That is, the Nadaraya-Watson estimate achieves the optimal rate of convergence.
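The convergence $L_n \Longrightarrow N(-\sigma^2/2, \sigma^2)$ in (12.3.26) can be checked by simulation. The sketch below takes $d = 1$ and a cosine bump of our own choosing for $g$, for which $\sigma^2 = n h^3 \cdot \frac12 = \frac12$ exactly when $h = n^{-1/3}$.

```python
import numpy as np

# Simulation sketch of (12.3.25)-(12.3.26) with d = 1: under P_0, L_n for the
# bump alternative mu_h of (12.3.20) is asymptotically N(-sigma^2/2, sigma^2).
# Illustrative kernel (our assumption): g(u) = cos(pi*u) on |u| <= 1/2, so
# sigma^2 = n h^3 * (1/2) = 1/2 when h = n^{-1/3}.
rng = np.random.default_rng(2)
n = 4000
h = n ** (-1.0 / 3.0)
z0 = 0.5

def mu_h(z):
    u = (z - z0) / h
    return h * np.cos(np.pi * u) * (np.abs(u) <= 0.5)

L = []
for _ in range(3000):
    Z = rng.uniform(0, 1, n)
    N = rng.standard_normal(n)          # noise under P_0
    m = mu_h(Z)
    L.append(np.sum(m * N - 0.5 * m ** 2))
L = np.array(L)

print(L.mean(), L.var())
assert abs(L.mean() + 0.25) < 0.06      # mean -> -sigma^2/2 = -0.25
assert abs(L.var() - 0.50) < 0.10       # variance -> sigma^2 = 0.5
```

The empirical mean and variance of $L_n$ match $-\sigma^2/2$ and $\sigma^2$ to within simulation error.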
(b) Estimating $\mu(\cdot)$ as a Function. We approach this problem from the estimation point of view with a prior suggested by our analysis of estimation at a point. Let
\[
\mu_h(z, \varepsilon) \equiv h \sum_{j=1}^N \varepsilon_j\,g\Big( \frac{z - z_j}{h} \Big)\,1(z \in C_j) ,
\tag{12.3.27}
\]
where the $\varepsilon_j$ are i.i.d. with $P[\varepsilon_j = 0] = P[\varepsilon_j = 1] = \frac12$, $\varepsilon \equiv (\varepsilon_1, \ldots, \varepsilon_N)^T$, and
\[
g(z) = \prod_{j=1}^d g_1(z_j) ,
\]
where $g_1(z) = z\,1(0 \le z \le \frac12) + (1 - z)\,1(\frac12 < z \le 1)$ for $0 \le z \le 1$, $C_1, \ldots, C_N$ are a partition of $I^d$ into $N = h^{-d}$ cubes of side length $h$, and the $z_j$ are the centers of the $C_j$. It can be shown (Problem 12.3.4) that if $N_1, \ldots, N_n$ are i.i.d. $N(0, 1)$ and
\[
Y_i = \mu_h(Z_i, \varepsilon) + N_i , \quad i = 1, \ldots, n ,
\tag{12.3.28}
\]
then, for $M = d^{-1/4}$, (12.3.27) and (12.3.28) define a prior distribution $\pi$ on $\mathcal{P}$ with
\[
|\mu(z_1, \varepsilon) - \mu(z_2, \varepsilon)| \le M|z_1 - z_2| .
\]
We now apply (12.3.6) and (12.3.9), treating $\varepsilon \equiv (\varepsilon_1, \ldots, \varepsilon_N)^T$ as $\theta$. Then $e(z, \varepsilon) = \mu_h(z, \varepsilon)$ and
\[
E\big[ e(z, \varepsilon) \mid X_1, \ldots, X_n \big] = h \sum_{j=1}^N P[\varepsilon_j = 1 \mid X_1, \ldots, X_n]\,g\Big( \frac{z - z_j}{h} \Big)\,1(z \in C_j) .
\tag{12.3.29}
\]
Using the fact that the $1_j \equiv 1(z \in C_j)$ are orthogonal, we see from (12.3.9) and (12.3.29) that the Bayes risk is
\[
\begin{aligned}
R_\pi &= E\,h^2 \int \Big[ \sum_{j=1}^N \big( P[\varepsilon_j = 1 \mid X_1, \ldots, X_n] - \varepsilon_j \big)\,g\Big( \frac{z - z_j}{h} \Big)\,1_j(z) \Big]^2 dz \\
&= h^2\,E \sum_{j=1}^N E_\varepsilon\big( P[\varepsilon_j = 1 \mid X_1, \ldots, X_n] - \varepsilon_j \big)^2 \int g^2\Big( \frac{z - z_j}{h} \Big)\,1_j(z)\,dz ,
\end{aligned}
\tag{12.3.30}
\]
where $E_\varepsilon$ indicates expectation over $\varepsilon$, while the outer $E$ is over $X_1, \ldots, X_n$. But
\[
\int g^2\Big( \frac{z - z_j}{h} \Big)\,1_j(z)\,dz = \Big( \int g^2(z)\,dz \Big)\,h^d ,
\tag{12.3.31}
\]
and
\[
\begin{aligned}
E_\varepsilon\big( P[\varepsilon_j = 1 \mid X_1, \ldots, X_n] - \varepsilon_j \big)^2
&= \frac12 \big( 1 - P[\varepsilon_j = 1 \mid X_1, \ldots, X_n] \big)^2 + \frac12 P^2[\varepsilon_j = 1 \mid X_1, \ldots, X_n] \\
&= \frac12 - P[\varepsilon_j = 1 \mid X_1, \ldots, X_n]\big( 1 - P[\varepsilon_j = 1 \mid X_1, \ldots, X_n] \big) \\
&\ge \frac12 - \frac14 = \frac14 .
\end{aligned}
\tag{12.3.32}
\]
Put $h = n^{-\frac{1}{d+2}}$ and combine (12.3.30)–(12.3.32) to get that, for some universal $C$,
\[
R_\pi \ge \frac{h^2}{4}\,M_0^2 \int_{I^d} g^2(z)\,dz = C n^{-\frac{2}{2+d}}
\]
since $N h^d = 1$.

Thus the rate of convergence to zero of the lower bound on risk is no faster than the rate of convergence of $\hat\mu_{NW}(\cdot)$, and we have established the asymptotic rate optimality of the Nadaraya-Watson estimate for IMSE. ✷
As we have seen in our examples, establishing the required lower bound on Bayes risks, even with plausible least favorable priors, can be subtle. For instance, computing the natural bound on integrated absolute error is not easy using the method we developed for IMSE in the second part of Example 12.3.1.
The multiple decision approach to lower bounds
There are a number of general purpose lower bounds for the Bayes risk in problems
with finite parameter spaces and corresponding finite multiple decision problems which
generalize Lemma 12.3.1 and can be similarly applied. Chief among these are lemmas due
to Fano (1952) and Assouad (1983). See also Tsybakov (2008), Theorem 2.4 for a more
direct generalization of Lemma 12.3.1 to multiple decision procedures rather than just tests
as we did in Corollary 12.3.1. We discuss Fano's Lemma and refer to Tsybakov (2008) for a general form of Assouad's and other lemmas.

Let $\mathcal{P} = \big\{ P_\theta : \theta \in \{1, \ldots, m\} \big\}$ be probability distributions on some $\mathcal{X}$ and consider the decision problem of deciding, given an observation $X$, which is the true $P_j$, with 0-1 loss. Identify $P_j$ with $j$. A possible rule $\delta : \mathcal{X} \to \{1, \ldots, m\}$ has risk
\[
R(j, \delta) = P_j[\delta(X) \ne j], \quad 1 \le j \le m .
\tag{12.3.33}
\]
We introduce two quantities familiar in information theory; see Cover and Thomas (1991), Chapter 2. For $(X, Y)$, a pair of random variables, with $X$ taking on the $m$ values $\{1, \ldots, m\}$ only, define
\[
H(X) = -\sum_{j=1}^m P[X = j] \log P[X = j] ,
\]
the entropy of $X$. It is immediate that
\[
H(X) = \log m - K(P_X, U) ,
\tag{12.3.34}
\]
where $P_X$ is the distribution of $X$, $U$ is the uniform distribution on $\{1, \ldots, m\}$, and $K$ is the Kullback-Leibler divergence defined in (2.2.24). Next define the mutual information of $(X, Y)$ as
\[
I(Y, X) = I(X, Y) \equiv K\big( P_{(X,Y)}, P_X P_Y \big)
= \iint p_X(x)\,p_{Y|X}(y \mid x) \log \frac{p(x, y)}{p_X(x)\,p_Y(y)}\,dx\,dy ,
\tag{12.3.35}
\]
the Kullback-Leibler divergence between the joint distribution of $(X, Y)$ and the distribution with independent coordinates having $p_X$, $p_Y$ as marginals. Identifying $\Theta$ with $Y \sim U$ and $\mathcal{L}(X \mid \Theta = j)$ with $P_j$ as above, Fano's inequality states:

Theorem (Fano). For any $\delta$ as in (12.3.33), $m \ge 2$,
\[
\max_j R(j, \delta) \ge \frac1m \sum_{j=1}^m R(j, \delta) \ge 1 - \frac{I(\Theta, X)}{\log m}
\tag{12.3.36}
\]
\[
\ge 1 - \Big[ \max_{k,l} \frac{K(P_k, P_l) + \log 2}{\log m} \Big] \frac{m-1}{m} .
\tag{12.3.37}
\]
The interpretation of (12.3.36) is natural in information theory; see Cover and Thomas
(1991) for a discussion and proof of (12.3.36). The modification (12.3.37) is not the best
possible — see Le Cam (1986) for a better but somewhat harder to prove version.
To prove (12.3.37) we argue as follows. If $Y \sim U$ and $X$ takes values in $\{1, \ldots, m\}$,
\[
\begin{aligned}
\log \frac{p(x, y)}{p_X(x)\,p_Y(y)} &= \log\Big[ p(x \mid y)\,\Big( m^{-1} \sum_{k=1}^m p(x \mid k) \Big)^{-1} \Big] \\
&= \log p(x \mid y) - \log\Big( m^{-1} \sum_{k=1}^m p(x \mid k) \Big) \\
&\le \log p(x \mid y) - \frac1m \sum_{k=1}^m \log p(x \mid k) = \frac1m \sum_{k=1}^m \log \frac{p(x \mid y)}{p(x \mid k)} ,
\end{aligned}
\tag{12.3.38}
\]
since $-\log z$ is convex in $z$. Substituting (12.3.38) back in (12.3.35) we get
\[
I(\Theta, X) \le \frac1m \sum_{j=1}^m \frac1m \sum_{k=1}^m \int p(x \mid j) \log \frac{p(x \mid j)}{p(x \mid k)}\,dx \le \frac{m-1}{m} \max_{j,k} K(P_j, P_k) ,
\]
and (12.3.37) follows. ✷
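Both the identity (12.3.34) and the key inequality in the proof above can be checked numerically; the three distributions $P_j$ on a 4-point sample space below are arbitrary illustrative choices.

```python
import math

# Checks of (12.3.34), H(X) = log(#values) - K(P_X, U), and of the bound
# I(Theta, X) <= ((m-1)/m) max_{j,k} K(P_j, P_k) with Theta uniform.
P = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
m = len(P)

# marginal of X when Theta ~ U, and the entropy identity on the 4-point space
px = [sum(P[j][x] for j in range(m)) / m for x in range(4)]
H = -sum(q * math.log(q) for q in px)
K_px_U = sum(q * math.log(q * 4.0) for q in px)   # K(P_X, U), U uniform on 4 points
assert abs(H - (math.log(4.0) - K_px_U)) < 1e-12

# mutual information and the bound from the proof of (12.3.37)
I = sum((1.0 / m) * P[j][x] * math.log(P[j][x] / px[x])
        for j in range(m) for x in range(4))
maxK = max(sum(P[j][x] * math.log(P[j][x] / P[k][x]) for x in range(4))
           for j in range(m) for k in range(m))

print(H, I, maxK)
assert 0.0 <= I <= (m - 1) / m * maxK
```

Here $I(\Theta, X) \approx 0.21$ while $\frac{m-1}{m}\max K \approx 0.78$, so the bound holds with room to spare.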
Fano's inequality can be utilized as a generalization of the testing argument of Corollary 12.3.1. We look for $m$ points $P_1, \ldots, P_m$ in $\mathcal{P}$ such that $\rho\big( \theta(P_k), \theta(P_l) \big) \ge r_n$ for all $k \ne l$ and $\bigcup_{k=1}^m S\big( \theta(P_k), \frac{r_n}{2} \big) = \mathcal{P}$, where $S(\theta, r)$ is the $\rho$ sphere of radius $r$ around $\theta$, and define the rule $\delta(x) = j$ iff $\rho\big( \hat\theta, \theta(P_j) \big) < r_n/2$. Note that $j$ is uniquely defined (Problem 12.3.8). Then, by Markov's inequality, we have
\[
E_P\,\rho\big( \hat\theta, \theta(P) \big) \ge \frac{r_n}{2}\,P\Big[ \rho\big( \hat\theta, \theta(P) \big) \ge \frac{r_n}{2} \Big] \ge \frac{r_n}{2}\Big( 1 - \frac{\max_{l \ne k} K(P_l, P_k)}{\log m} \Big) .
\]
Therefore
\[
\max_P E_P\,\rho\big( \hat\theta, \theta(P) \big) \ge r_n\,\delta
\tag{12.3.39}
\]
for some $\delta > 0$ and all $n$, provided that $\max_{l \ne k} K(P_l, P_k)/\log m$ is bounded above by $1 - \delta$.
We apply this method to Example 12.3.1 with integrated squared error loss. It is natural to consider the $m = 2^N$ functions
\[
\mu_\varepsilon(z) = \sum_{j=1}^N \varepsilon_j\,h\,g\Big( \frac{z - z_j}{h} \Big)
\]
as $\varepsilon$ varies. Any two such functions $\mu_{\varepsilon^1}, \mu_{\varepsilon^2}$ differ on at least one of the $C_j$ and, by the calculation made for part (i) of the example, if $\rho^2(\mu_1, \mu_2) = \int_{I^d} \big( \mu_1(z) - \mu_2(z) \big)^2 dz$, then
\[
\rho^2(\mu_{\varepsilon^1}, \mu_{\varepsilon^2}) \ge h^2 \int_{I^d} g^2(z)\,dz .
\tag{12.3.40}
\]
On the other hand, the maximum divergence corresponds to the functions
\[
\mu_2(z) = \sum_{j=1}^N h\,g\Big( \frac{z - z_j}{h} \Big)\,1(z \in C_j), \qquad \mu_1(z) \equiv 0 .
\]
If $P_1^n, P_2^n$ are two product probabilities for which $(Z_i, Y_i)$, $1 \le i \le n$, are i.i.d., then
\[
K(P_1^n, P_2^n) = n\,K(P_1^1, P_2^1) ,
\tag{12.3.41}
\]
and if we let $W \sim N(0, 1)$, then
\[
K(P_1^1, P_2^1) = -E_1 \log \frac{\varphi\big( Y_1 - \mu_2(Z) \big)}{\varphi(Y_1)}
= -E\big( \mu_2(Z)\,W \big) + \frac12 E\,\mu_2^2(Z)
= \frac12 h^2 \int_{I^d} g^2(z)\,dz .
\tag{12.3.42}
\]
If we put $r_n = n^{-\frac{1}{2+d}}$ and $h = r_n$, so that $N = n^{\frac{d}{2+d}}$, we see from (12.3.40)–(12.3.42) that (12.3.39) with $\delta = \frac12$ yields
\[
\max_P E_P\big( \hat\mu(Z) - \mu(Z) \big)^2 \ge \frac{r_n^2}{2} ,
\]
in accordance with our results in part (ii) of Example 12.3.1.
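That the construction above keeps the Fano ratio $\max K / \log m$ bounded in $n$ can be verified directly; the sketch below takes $d = 1$ and the triangular $g_1$ of part (b), for which $\int_0^1 g_1^2 = \frac{1}{12}$.

```python
import math

# Check that with h = r_n = n^{-1/(2+d)}, N = h^{-d}, m = 2^N, the ratio
# K(P_1^n, P_2^n) / log m from (12.3.41)-(12.3.42) is constant in n,
# so (12.3.39) applies. d = 1 and the triangular g_1 are illustrative.
def fano_ratio(n, d=1):
    h = n ** (-1.0 / (2 + d))
    N = round(h ** (-d))              # number of cells; m = 2^N points
    int_g2 = (1.0 / 12.0) ** d        # \int_{I^d} g^2 for triangular g_1
    K = 0.5 * n * h ** 2 * int_g2     # (12.3.41)-(12.3.42)
    return K / (N * math.log(2.0))    # K / log m

r1, r2, r3 = fano_ratio(10**3), fano_ratio(10**6), fano_ratio(10**9)
print(r1, r2, r3)
for r in (r1, r2, r3):
    assert r < 0.5                    # bounded away from 1, e.g. by 1 - delta = 1/2
```

The ratio is about 0.06 at every sample size, so the Kullback-Leibler condition in (12.3.39) holds uniformly in $n$.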
Remark 12.3.2. (a) This method based on Fano's inequality permits us to consider other loss functions $\rho$, such as $\rho(\mu_1, \mu_2) = \int_{I^d} |\mu_1(z) - \mu_2(z)|\,dz$.

(b) We can also consider other $\mathcal{P}$ than the ones corresponding to Lipschitz $\mu$. For instance, suppose $\mathcal{P}$ corresponds to $\{ \mu : |D^s \mu| \le M < \infty \}$, where $D^s$ is the collection of $s$th partial derivatives of $\mu$, $s \ge 1$. Then, if we use integrated squared error, we can show (Problem 12.3.9) that the minimax risk of the Nadaraya-Watson estimate is no smaller in order than $n^{-\frac{2s}{2s+d}}$. It is also fairly easy to show (Problem 12.3.10) that this rate can be achieved. This type of result was first obtained by Stone (1982).
12.3.3  The Gaussian White Noise (GWN) Model
Donoho and Johnstone (1994) and Kerkyacharian and Picard (1995) initiated the study of the Gaussian white noise model, which captures the behaviour we expect in nonparametric formulations generally.

Let $\mu \equiv (\mu_1, \ldots, \mu_p, \ldots)^T$ be an infinite dimensional vector lying in $l_2 \equiv \{ \mu : \sum_j \mu_j^2 < \infty \}$. Suppose $\sigma^2 = \sigma_0^2$ is known and that we observe $Y_1, \ldots, Y_n$ i.i.d. as $Y$ with
\[
Y = \mu + \varepsilon ,
\]
where $\mu \in l_2$ and $\varepsilon$ is a vector of independent $N(0, \sigma_0^2)$ variables. Note that $Y \notin l_2$ since $\sum_i \varepsilon_i^2 = \infty$ with probability 1. By sufficiency of $\bar Y$ this model reduces to the model
\[
\bar Y = \mu + \bar\varepsilon ,
\tag{12.3.43}
\]
where $\bar\varepsilon$ is a vector of i.i.d. $N(0, \sigma_0^2/n)$ variables. This, (12.3.43), is the Gaussian white noise (GWN) model.
Note that if the vector were $p$- rather than $\infty$-dimensional, this would be a known-variance linear model. Just as we did in Remark 9.5.4, we will argue that, at least formally, the GWN model is the limit of nonparametric models in analyzing decision procedures, just as the canonical linear model is the limit of regular parametric models. As we shall see, doing so suggests not only convergence rates and the difficulty of problems, but also procedures. We illustrate by considering again
Example 12.3.1 (Continued). Nonparametric regression. Using the same notation as in Example 11.3.2, suppose that we estimate by orthogonal series. For instance, consider first one-dimensional Fourier series on $I \equiv [0, 1]$. That is,
\[
\mu(x) = \sum_{k=1}^\infty \theta_k \varphi_k(x) ,
\]
where
\[
\varphi_1(x) \equiv 1 , \quad \varphi_{2k}(x) = \sqrt2 \cos(2\pi k x) , \quad \varphi_{2k+1}(x) = \sqrt2 \sin(2\pi k x)
\]
is the Fourier basis. Note that this basis is orthonormal in $L_2(0, 1)$ because
\[
\varphi_{2k}(x) + i\,\varphi_{2k+1}(x) = \sqrt2\,e^{2\pi i k x} .
\]
We can promote these functions, by tensor products, to orthogonal series for $\mu(\cdot)$ on $I^d$. We consider
\[
\mathcal{H} = \Big\{ h : h(x) = \prod_{j=1}^d \varphi_{i_j}(x_j) : 1 \le i_j < \infty \Big\} .
\]
Then, if $\mu \in L_2(I^d)$, we can write, for $h_l, h_k \in \mathcal{H}$,
\[
\mu(Z) = \sum_{l=1}^\infty \theta_l h_l(Z) , \qquad \theta_k(P) = \int_{I^d} \mu(z) h_k(z)\,dz .
\tag{12.3.44}
\]
Consider the sieve of approximating regression models,
\[
Y_i = \sum_{k=1}^p \theta_k h_k(Z_i) + \varepsilon_i , \quad i = 1, \ldots, n ,
\]
where the $\varepsilon_i$ are i.i.d. $N(0, \sigma_0^2)$. By Theorem 6.1.4, the MLE of $\theta^{(p)} \equiv \big( \theta_1(P), \ldots, \theta_p(P) \big)^T$ is
\[
\hat\theta^{(p)} = G_n^{-1}\,(\hat\theta_1^*, \ldots, \hat\theta_p^*)^T ,
\]
where
\[
G_n \equiv \Big\| \frac1n \sum_{i=1}^n h_j(Z_i) h_k(Z_i) \Big\|_{p \times p}
\quad \text{and} \quad
\hat\theta_j^* = \frac1n \sum_{i=1}^n Y_i h_j(Z_i), \quad 1 \le j \le p .
\]
Given $Z_1, \ldots, Z_n$, by Section 6.1,
\[
\hat\theta^{(p)} \sim N_p\big( \theta^{(p)}, \sigma_0^2 G_n^{-1}/n \big) .
\tag{12.3.45}
\]
As $n \to \infty$, for fixed $p$,
\[
\sqrt n\,\big( \hat\theta^{(p)} - \theta^{(p)} \big) \Longrightarrow N(0, \sigma_0^2 J_p)
\]
since $G_n \overset{P}{\longrightarrow} J_p$ by our choice of $h_1, \ldots, h_p, \ldots$. If we now let $p \to \infty$ we see that we have arrived at the Gaussian white noise model (12.3.43). Then, estimation of $\mu(\cdot)$ and estimation of $\theta(P) = \big( \theta_1(P), \ldots, \theta_p(P), \ldots \big) \in l_2$ are formally equivalent, and for IMSE and squared error on $\theta$ rigorously so, since
\[
\int_{I^d} \big( \hat\mu(z) - \mu(z) \big)^2 dz = \sum_j (\hat\theta_j - \theta_j)^2
\]
by the Parseval identity. That suggests that if we can carry over descriptions in $\mu(\cdot)$ space into ones on $\theta$ in $l_2$, we can gain some insight into the behaviour of procedures, even as we did from the correspondence between "limits" of finite dimensional parametric models and the canonical Gaussian linear model in Chapter 6. We will only proceed formally, but want to note that the rigorous theory of asymptotic approximation of experiments, developed by Le Cam, which fully justifies the equivalence of results such as Wilks' theorem in the parametric case, has been developed by Brown et al., Nussbaum, and others in an important series of papers (1996–2004) to apply to regression, density estimation, and other nonparametric methods. ✷
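The Parseval identity invoked above can be checked numerically. In the sketch below the test function is built from the Fourier basis itself (an illustrative choice), so the integrated squared error of a truncated series equals the sum of squared dropped coefficients exactly, up to quadrature error.

```python
import numpy as np

# Numerical check of the Parseval identity:
# \int_0^1 (mu - mu_p)^2 dz = sum of squared dropped Fourier coefficients.
x = np.linspace(0.0, 1.0, 100001)
dx = x[1] - x[0]

def phi(j):  # phi_1 = 1, phi_{2k} = sqrt(2)cos(2 pi k x), phi_{2k+1} = sqrt(2)sin(2 pi k x)
    if j == 1:
        return np.ones_like(x)
    k = j // 2
    trig = np.cos if j % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * k * x)

def integrate(f):                      # trapezoid rule on the uniform grid
    return (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1]) * dx

coef = {1: 2.0, 2: 0.5, 5: -1.0, 8: 0.3}
mu = sum(c * phi(j) for j, c in coef.items())

approx = sum(c * phi(j) for j, c in coef.items() if j <= 4)  # drop j = 5, 8
ise = integrate((mu - approx) ** 2)
dropped = coef[5] ** 2 + coef[8] ** 2

print(ise, dropped)
assert abs(ise - dropped) < 1e-6
```

The integrated squared error equals $\theta_5^2 + \theta_8^2 = 1.09$ to numerical precision.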
12.3.4  Minimax Bounds on IMSE for Subsets of the GWN Model
In Example 12.3.1 (continued) we restrict $\theta$ to subsets of $l_2$ of the form
\[
\Theta_\beta = \Big\{ \theta : \sum_{k=1}^\infty (k^\beta \theta_k)^2 \le M \Big\} , \quad \beta > 0 .
\tag{12.3.46}
\]
These sets are, if the $\theta_k$ are Fourier coefficients, sets that specify some degree of smoothness of $\mu(\cdot)$. For instance, $\Theta_1$ corresponds to the set $\{ \mu(\cdot) : \mu(\cdot)$ is absolutely continuous and $\int_0^1 \mu'(s)^2\,ds \le M \}$. To see this we can use Parseval's identity, since the Fourier coefficients of $\mu'$ are (Problem 12.3.5)
\[
\tilde\theta_1 = 0, \qquad \tilde\theta_{2k} = 2\pi k\,\theta_{2k+1}, \qquad \tilde\theta_{2k+1} = -2\pi k\,\theta_{2k} .
\tag{12.3.47}
\]
We consider estimation of $\theta$ by $\hat\theta \in \mathbb{R}^\infty$ with loss
\[
l(\hat\theta, \theta) \equiv \sum_{j=1}^\infty (\hat\theta_j - \theta_j)^2
\tag{12.3.48}
\]
and return to the problem of setting lower bounds in the GWN model with IMSE risk. We use a direct approach. Note that, because $\sum_k k^{-\alpha}$ converges iff $\alpha > 1$,
\[
\Theta_\beta \supset \tilde\Theta_\lambda \equiv \big\{ \theta : |\theta_k| \le M_\lambda k^{-\lambda},\ k \ge 1 \big\} , \qquad \lambda > \beta + \tfrac12 ,
\]
for suitable $M_\lambda$.

We restrict, without loss of generality, to priors making the $\theta_k$ independent (see Problem 12.3.6). It turns out that such priors are asymptotically least favorable. In that case, it seems reasonable to consider $\theta_k \sim F_k$, where $F_k$ is least favorable for the univariate model $\{ \theta_k : |\theta_k| \le M k^{-\lambda} \}$. It has been shown (see Casella and Strawderman (1981), Bickel (1981), and Levit (1981)) that the least favorable priors $\pi_\delta$ for estimating $\mu$, given $X \sim N(\mu, 1)$, $|\mu| \le m$, $m > 0$, have finite support, are symmetric on points $\{-c, 0, c\}$ as $m \downarrow 0$, and in the limit are point mass at 0. That is, there exists $m_1 > 0$ such that the least favorable prior is of the above form for $m \le m_1$. Replacing $Y_i$ by $\sqrt n\,Y_i \sim N(\mu_i, 1)$ with $\mu_i \equiv n^{1/2} \theta_i$, we see that $\tilde\Theta_\lambda$ corresponds to $\{ \mu : |\mu_k| \le M_\lambda k^{-\lambda} n^{1/2} \}$, so that for $k \ge (M_\lambda n^{1/2}/m_1)^{1/\lambda} \equiv k_0$ the least favorable prior is point mass at 0.
In general, let $\delta_{\{t\}}$ denote point mass at $t$ and consider the prior
\[
\pi_\delta = \varepsilon\,\delta_{\{0\}} + (1 - \varepsilon)\,\frac{\delta_{\{-m\}} + \delta_{\{m\}}}{2} .
\]
The Bayes estimate is (Problem 12.3.12)
\[
\delta_m(t) = (1 - \varepsilon)\,m\,\frac{\varphi(t - m\sqrt n\,) - \varphi(t + m\sqrt n\,)}{\varphi(t - m\sqrt n\,) + \varphi(t + m\sqrt n\,)}
= (1 - \varepsilon)\,m\,\frac{1 - e^{-2m\sqrt n\,t}}{1 + e^{-2m\sqrt n\,t}} ,
\tag{12.3.49}
\]
and its Bayes risk is
\[
R_m \equiv (1 - \varepsilon)\,E_m\big( \delta_m(Y) - m \big)^2 + \varepsilon\,m^2
= m^2\Big[ 4(1 - \varepsilon)\,E\,e^{-4m\sqrt n\,Y}\big( 1 + e^{-2m\sqrt n\,Y} \big)^{-2} + \varepsilon \Big] .
\tag{12.3.50}
\]
Put $m = n^{-\frac12}$, $\varepsilon = 0$, and $Y = Z + m$, where $Z \sim N(0, 1)$, and get
\[
R_m = \frac4n\,E\,e^{-4mZ}\big( 1 + e^{-2mZ} \big)^{-2}\,\big( 1 + o(1) \big) \le \frac4n\,\big( 1 + o(1) \big) ,
\]
since $e^{-4a}(1 + e^{-2a})^{-2} = (1 + e^{2a})^{-2} \le 1$.
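With $\varepsilon = 0$ the Bayes estimate (12.3.49) simplifies, since $(1 - e^{-2a})/(1 + e^{-2a}) = \tanh(a)$, to $\delta_m(t) = m \tanh(m\sqrt n\,t)$, and the bound $R_m \le 4/n$ can be checked by Monte Carlo; the values of $n$ and the number of replications below are illustrative.

```python
import numpy as np

# Monte Carlo sketch of (12.3.49)-(12.3.50) with eps = 0:
# delta_m(t) = m * tanh(m sqrt(n) t), and at theta = m = n^{-1/2}
# its risk is bounded by 4/n, as in the display above.
rng = np.random.default_rng(3)
n = 100
m = n ** -0.5
reps = 200000

t = rng.standard_normal(reps) + m * np.sqrt(n)   # scaled observation, theta = m
delta = m * np.tanh(m * np.sqrt(n) * t)
risk = np.mean((delta - m) ** 2)

print(risk, 4.0 / n)
assert risk < 4.0 / n
assert risk > 0.0
```

The simulated risk is well below the bound $4/n$; the bound is crude but has the right $n^{-1}$ order.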
It is easy to see that putting $\theta_k \sim \pi_m$ with $m = n^{-\frac12}$ for $k < k_0$, and $m = 0$ for $k \ge k_0$, yields a Bayes risk for the loss (12.3.48) that is bounded by
\[
\frac4n\,(k_0 - 1) + \sum_{k = k_0}^\infty \theta_k^2 \le \frac{4 k_0}{n} + M_\lambda^2 \sum_{k = k_0}^\infty k^{-2\lambda} \asymp n^{-\left( 1 - \frac{1}{2\lambda} \right)} .
\]
Since this holds for all $\lambda > \beta + \frac12$, we expect that $n^{-2\beta/(2\beta + 1)}$ is the correct minimax order bound for $\Theta_\beta$. In particular, $\beta = 1$ gives us the familiar Lipschitz $n^{-\frac23}$ rate. This is, in fact, established in Theorem 2.7 of Tsybakov (2008). The Bayes estimate achieves minimaxity in terms of rate, but the simpler estimate
\[
\hat\theta_k = Y_k , \quad 1 \le k \le k_0 ; \qquad \hat\theta_k = 0 , \quad k > k_0 ,
\tag{12.3.51}
\]
does as well. One can, in this and related cases, do better and determine the actual asymptotic value of the minimax bound by using Pinsker's theorem (1980); see also Theorem 3.1 in Tsybakov (2008). By taking $\hat\theta_k$ of the form $(1 - \alpha_k) Y_k\,1(k \le k_0)$, it is possible to obtain the bound asymptotically.
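The rate achieved by the truncation estimate (12.3.51) can be checked by simulation; the particular sequence $\theta_k = k^{-1.6}$ below is an arbitrary member of $\Theta_1$, and with $k_0 = n^{1/3}$ the risk should fall by roughly $8^{2/3} = 4$ when $n$ grows by 8.

```python
import numpy as np

# Simulation sketch of the truncation estimate (12.3.51) in the GWN model
# (12.3.43) with beta = 1: theta_hat_k = Ybar_k for k <= k0 = n^{1/3}, else 0.
# theta_k = k^{-1.6} is an illustrative member of Theta_1; sigma_0 = 1.
rng = np.random.default_rng(4)
K = 5000
theta = np.arange(1, K + 1, dtype=float) ** -1.6

def risk(n, reps=2000):
    k0 = int(n ** (1.0 / 3.0))
    losses = []
    for _ in range(reps):
        ybar = theta + rng.standard_normal(K) / np.sqrt(n)
        th = np.where(np.arange(1, K + 1) <= k0, ybar, 0.0)
        losses.append(np.sum((th - theta) ** 2))
    return np.mean(losses)

r1, r2 = risk(100), risk(800)
print(r1, r2, r1 / r2)
# n grows by 8; the risk should fall by roughly 8^{2/3} = 4
assert r2 < r1
assert 2.0 < r1 / r2 < 8.0
```

The observed ratio is close to 4, matching the $n^{-2/3}$ rate for $\Theta_1$.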
12.3.5  Sparse Submodels
The GWN model also accommodates subsets which are "sparse" but where the small $\{\theta_k\}$ can occur anywhere, and not just for $k$ large. These types of models arise quite generally. An example arises with covariance matrices of vectors $(X_1, \ldots, X_d)^T$ which represent gene expression in a microarray. Then we expect that most genes are independent, corresponding to 0's in the off-diagonal part of the covariance matrix, but, if genes are dependent, we expect the dependence to be strong. Donoho and Johnstone (1994) suggested a class of GWN subsets exhibiting this type of sparsity,
\[
\Theta_{q,r} \equiv \Big\{ \theta : \theta_k = 0,\ k > r,\ \sum_{k=1}^r |\theta_k|^q \le M \Big\}
\tag{12.3.52}
\]
for $0 \le q < 1$. The case $q = 0$, where the sum simply counts the nonzero elements in $\{\theta_k : 1 \le k \le r\}$, is the prototype. The minimax risk of squared error estimation for such models is discussed extensively in Donoho and Johnstone (1994), Kerkyacharian and Picard (1995), and more recently in Abramovich, Benjamini, Donoho and Johnstone (2006) and Johnstone and Silverman (2005). A key emphasis is on adaptive procedures, defined here as procedures achieving uniform minimax rates. We will only derive minimax rates for $\Theta_{0,n}$ and exhibit procedures which achieve them.
The principle that we should consider priors with independent components is still legitimate, as is the principle that least favorable priors asymptotically concentrate on $\{-c, 0, c\}$. However, we do not know which are likely to be the 0 coordinates, other than $\theta_j$, $j > r$. On the other hand, there is no reason to behave differently for different nonzero coordinates. Fano's inequality suggests considering the points $\theta_S$, where $S$ ranges over all subsets of size $M$ of $\{1, \ldots, r\}$, given by $\theta_j = 0$, $j \notin S$, and $\theta_j = \pm c$, $j \in S$. The $l_2$ distance between two such points is $\ge c$. The Kullback-Leibler divergence between the probability distributions corresponding to two such points is
\[
K(P_{\theta^1}, P_{\theta^2}) = n\,\frac{c^2}{2} \sum_{j=1}^r 1(\theta_j^1 \ne \theta_j^2) ,
\]
so that
\[
\max K(P_{\theta^1}, P_{\theta^2}) = n M c^2 .
\]
There are $m = 2^M \binom{r}{M}$ such sets. Since $\big( \frac Mr \big)^M \big( 1 - \frac Mr \big)^{r - M} \binom{r}{M} \le 1$, we obtain from (12.3.39), for $M/r \to 0$, the bound
\[
\min_{\hat\theta} \max_{\Theta_{0,r}} E_\theta\,l(\theta, \hat\theta) \ge \delta c^2 \quad \text{if} \quad \frac{n M c^2}{M \log\frac rM} \le 1 - \delta .
\]
For $r = n$ and $M$ bounded we obtain the bound $M \delta n^{-1} \log(n/M)$ for $c^2$. We have shown:

Proposition 12.3.4. For squared error loss, the asymptotic minimax rate for $\Theta_{0,n}$ is $n^{-1} \log n$.
The natural way to achieve this rate is by hard or soft thresholding (Problem 12.3.11). Here
\[
\delta_c^h(y) \equiv y\,1(|y| \ge c)
\]
and
\[
\delta_c^S(y) = (y - c\,\operatorname{sgn} y)\,1(|y| \ge c)
\]
are the hard and soft thresholded estimates, respectively.

Essentially, for hard thresholding, we test $H : \theta_j = 0$ for each $j$ and use a critical value of the form $K(n^{-1} \log n)^{\frac12}$. If we accept, we take $\hat\theta_j = 0$; else, we leave $\hat\theta_j = Y_j$. On theoretical grounds, there is no reason to prefer one type of thresholding over the other.
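The two thresholding rules are one-liners in the GWN model; the sparse signal and the constant $K = 2$ below are illustrative choices, not the text's.

```python
import numpy as np

# Hard and soft thresholding in the GWN model (12.3.43), with critical
# value K (n^{-1} log n)^{1/2}, K = 2. The signal has 5 nonzero coordinates
# out of n = 2000, a Theta_{0,n}-type configuration (our assumption).
rng = np.random.default_rng(5)
n = 2000
theta = np.zeros(n)
theta[:5] = 3.0

y = theta + rng.standard_normal(n) / np.sqrt(n)   # Ybar in (12.3.43)
c = 2.0 * np.sqrt(np.log(n) / n)

hard = y * (np.abs(y) >= c)
soft = np.sign(y) * np.maximum(np.abs(y) - c, 0.0)

loss_hard = np.sum((hard - theta) ** 2)
loss_soft = np.sum((soft - theta) ** 2)
loss_keep_all = np.sum((y - theta) ** 2)          # no thresholding: loss ~ 1

print(loss_hard, loss_soft, loss_keep_all)
assert loss_hard < loss_keep_all
assert loss_soft < loss_keep_all
```

Both thresholded estimates beat keeping every coordinate by two orders of magnitude here, reflecting the $n^{-1}\log n$ versus constant-order risk gap.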
Summary. In Section 12.3.1 we go more fully into asymptotic theory for classification
and regression, introducing optimal rates of convergence in various models, in particular,
regular parametric ones. In Section 12.3.2 we introduce asymptotic minimax lower bounds
and develop ways of obtaining these explicitly, in particular via testing, Theorem 12.3.2,
and Fano’s inequality, and apply these to obtain minimax rates for nonparametric regression. We show the asymptotic rate optimality of the Nadaraya-Watson estimate under a
first order Lipschitz condition. In Section 12.3.3 we introduce the Gaussian white noise
model (GWN), which enables us to put nonparametric problems in a canonical context just
as we did for parametric ones in Chapter 6. Section 12.3.4 discusses minimax rate bounds
on IMSE for subsets of the GWN model and Section 12.3.5 establishes such a bound for a
sparse submodel of the GWN model.
12.4  Oracle Inequalities
As we mentioned in Section I.1, oracle inequalities are another approach to measuring the performance of prediction methods. In the minimax approach we consider, for a model $\mathcal{P}$, all possible prediction rules from the point of view of their worst behaviour on $\mathcal{P}$. The oracle inequality point of view is instead to consider a limited class of rules $\mathcal{D}$. We compare the statistician's performance to what an "oracle" who knows $P$ but is restricted to using $\delta \in \mathcal{D}$ could achieve, at each $P \in \mathcal{P}$. Abstractly, suppose $\mathcal{D} = \{\delta_t\}$, $t \in T$, and we define
\[
R_t(P) = R(P, \delta_t) = E_P\,\ell(P, \delta_t) ,
\]
where $\ell$ is the loss we consider. An oracle then picks $t^*(P)$ such that
\[
R^*(P) \equiv R_{t^*(P)}(P) = \min_t R_t(P) .
\]
Suppose the statistician has a data-determined rule for choosing $t$, call it $\hat t$. We would like
\[
R(P, \delta_{\hat t}) \approx R^*(P) .
\tag{12.4.1}
\]
An oracle inequality makes (12.4.1) precise through a statement such as
\[
R(P, \delta_{\hat t}) \le R^*(P) + \Delta(P) ,
\tag{12.4.2}
\]
where $\Delta(P)$ can be explicitly bounded independent of $P$.

Example 12.4.1. As in Section I.4, we consider $X = (Z, Y)$, where $Z$ is a vector of predictors and $Y$ the response variable to be predicted. Let $\lambda(y, d)$ denote the loss when $d$ is the predicted value of $y$. Given a predictor $\delta : \mathcal{Z} \to \mathcal{Y}$, we define the prediction loss of $\delta$ as the expected loss under $P$,
\[
\ell(P, \delta) = \int \lambda\big( y, \delta(z) \big)\,dP(x) .
\tag{12.4.3}
\]
With this formulation we can also think of the oracle as choosing
\[
t^* = \arg\min_t \ell(P, \delta_t) .
\]
Suppose we have a training sample $X = (X_1, \ldots, X_n) = \big( (Z_1, Y_1), \ldots, (Z_n, Y_n) \big)$ from $P$ and use the $\hat t$ that minimizes a criterion for how closely $\delta_t(Z_i)$ predicts $Y_i$, $1 \le i \le n$. In this case we want an inequality of the form
\[
P\Big[ \ell(P, \delta_{\hat t}) \le \ell(P, \delta_{t^*}) + \Delta(P, \hat P, \varepsilon) \Big] \ge 1 - \varepsilon ,
\tag{12.4.4}
\]
where $\hat P$ is the empirical probability and $\Delta(P, \hat P, \varepsilon)$ can be bounded independent of $P$. An inequality of the type (12.4.4) implies one of type (12.4.2) under integrability conditions if $\Delta(P, \hat P, \varepsilon)$ is boundable independent of $\varepsilon$. ✷
Oracle inequalities have three important features:

(i) They provide information at each $P$.

(ii) They are "nonasymptotic," since the bounds $\Delta(\cdot)$ are more or less explicit.

(iii) They exhibit explicitly important features of the problem, such as the role of dimension in the performance of $\delta_t$.

The first two features are limited, however, since the first requires one to know how well the best of class $\mathcal{D}$ can do against a particular $P$ (the oracle's performance), and the second can be thought of as merely making asymptotic calculations carefully, since the bounds are only of interest if the bound on $\Delta$ tends to 0 as the sample size $n$ tends to infinity. Also, the bound, when made independent of $P$, corresponds to worst-case behaviour and is very conservative in practice.
How do we choose $\hat t$? At first sight we might try
\[ \hat t = \arg\min_t \ell(\hat P, \delta_t). \tag{12.4.5} \]
Unfortunately, as we have seen in Section I.7, this will not work unless P is regular parametric. We are typically considering P = ∪t Pt where Pt is a sequence of nested parametric models with δt based on a minimum contrast estimate of the parameters in Pt . Then,
(12.4.5) inevitably leads to choosing the largest model in P, i.e. “overfitting”.
354
Prediction and Machine Learning
Chapter 12
As we have noted in Sections I.7, 9.1 and Chapter 11, we need to regularize, typically
using a penalty pen(t), which can also depend on the data. That is, we choose $\hat t$ as
\[ \hat t = \arg\min_t \big\{ \ell(\hat P, \delta_t) + \mathrm{pen}(t) \big\}, \]
and ∆(P ) as defined in (12.4.2) reflects the penalty.
We will follow Tsybakov (2008) in first clarifying these notions in the context of the
GWN model.
Example 12.4.2. Nested linear models. Consider data $(X_1, Y_1)^T, \dots, (X_n, Y_n)^T$ i.i.d. as (X, Y), where X is an N-vector and Y is scalar, and consider the model
\[ Y_i = X_i^T \beta^{(j)} + \varepsilon_i \ \text{ for some } 1 \le j \le N, \quad i = 1, \dots, n, \]
where $\{\varepsilon_i\}$ are i.i.d. $N(0, \sigma_0^2)$ and $\beta^{(j)} = (\beta_1, \dots, \beta_j, 0, \dots, 0)^T_{N\times 1}$. This is a sequence of nested regression models, $\mathcal P_j$. As we did in Section 6.1.1 we can, by an orthogonal transformation $A(X_1, \dots, X_n)$, transform the model to one involving its sufficient statistics only, given by
\[ \hat\eta_k = \eta_k + \frac{\varepsilon_k'}{\sqrt n}, \quad 1 \le k \le N, \]
where the model $\mathcal P_j$ now corresponds to $\eta_k = 0$, $k > j$, and $\{\varepsilon_k'\}$ are i.i.d. $N(0, \sigma_0^2)$, $1 \le k \le N$. This is just a GWN model with coordinates equal to zero after the N th entry.
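This reduction can be sketched numerically. Below, a QR decomposition stands in for the orthogonal transformation $A(X_1, \dots, X_n)$; the particular design, dimensions, and variable names are our own illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 100, 5
X = rng.normal(size=(n, N))                      # rows X_i^T of the design
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])      # beta^(j) with j = 3: trailing zeros
sigma0 = 1.0
Y = X @ beta + sigma0 * rng.normal(size=n)

# Orthogonal reduction: with X = QR, eta_hat_k = q_k . Y / sqrt(n) has mean
# eta_k = (R beta)_k / sqrt(n) and variance sigma0^2 / n; since R is upper
# triangular, eta_k = 0 for k > j, mirroring the nested models P_j.
Q, R = np.linalg.qr(X)
eta_hat = Q.T @ Y / np.sqrt(n)
eta_true = R @ beta / np.sqrt(n)
max_err = float(np.abs(eta_hat - eta_true).max())
```

Here `max_err` is on the order of $\sigma_0/\sqrt n$, the noise level of the resulting sequence model.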
12.4.1 Stein’s Unbiased Risk Estimate
A key tool for studying and constructing estimates in the GWN model is Stein’s unbiased risk estimate (Stein (1981)). We follow our notation in Section 12.3.3, identifying $\hat\eta_i$ with $Y_i$ and $\eta_i$ with $\mu_i$, where $Y_i = \mu_i + \varepsilon_i$, with $\varepsilon_1, \dots, \varepsilon_n$ independent $N(0, \sigma_0^2/n)$. Let $\delta(Y_1, \dots, Y_n)$ be an estimate of $\mu \equiv (\mu_1, \dots, \mu_N)^T$. Using (8.3.33) we find that if δ satisfies the conditions of Theorem 8.3 and
\[ \Delta(y) \equiv \delta(y) - y, \]
then,
\[ E_\mu|\delta(\mathbf Y) - \mu|^2 = \frac{N\sigma_0^2}{n} + \frac{2\sigma_0^2}{n}\, E_\mu \sum_{j=1}^N \frac{\partial \Delta_j}{\partial y_j}(\mathbf Y) + E_\mu|\Delta(\mathbf Y)|^2. \tag{12.4.6} \]
If we define $\sigma_n \equiv \sigma_0/\sqrt n$ we arrive at Stein’s unbiased risk estimate of $R(\mu, \delta) - N\sigma_0^2/n$,
\[ \hat R_U(\delta) = 2\sigma_n^2 \sum_{j=1}^N \frac{\partial \Delta_j}{\partial y_j}(\mathbf Y) + |\Delta(\mathbf Y)|^2. \tag{12.4.7} \]
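To see (12.4.7) at work, one can check its unbiasedness by simulation in the simplest case, a linear shrinkage rule $\delta(Y) = \lambda Y$, for which $\Delta(y) = (\lambda-1)y$ and $\sum_j \partial\Delta_j/\partial y_j = N(\lambda - 1)$. The constants below (N, n, µ, λ) are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, sigma0 = 20, 50, 1.0
sigma_n2 = sigma0 ** 2 / n
mu = np.linspace(1.0, 0.1, N) / 5.0          # illustrative mean vector
lam = 0.7                                     # shrinkage delta(Y) = lam * Y

reps = 20000
Y = mu + np.sqrt(sigma_n2) * rng.normal(size=(reps, N))

# SURE (12.4.7) for Delta(y) = (lam - 1) y
sure = 2 * sigma_n2 * N * (lam - 1) + (1 - lam) ** 2 * (Y ** 2).sum(axis=1)

# Exact value of R(mu, delta) - N sigma_0^2 / n for this delta
target = (lam ** 2 - 1) * N * sigma_n2 + (1 - lam) ** 2 * (mu ** 2).sum()
mc_mean = float(sure.mean())
```

The Monte Carlo mean of the SURE values matches the exact centered risk up to simulation error.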
Following our general prescription, given a family $\{\delta_t\}$, $t \in T$, of estimates of η with $\delta_t \equiv (\delta_{t1}, \dots, \delta_{tN}, \dots)^T$ such that $\delta_{tj} = 0$, $j > N$, it seems reasonable to pick $\hat t$ by minimizing $\hat R_U(\delta_t)$. Consider the classes of estimates
\[ \mathcal D_0 \equiv \{\delta_\lambda(\hat\eta) = (\lambda\hat\eta_1, \dots, \lambda\hat\eta_N)^T,\ 0 \le \lambda \le 1\}, \]
\[ \mathcal D_1 \equiv \{\delta_m(\hat\eta) = (\hat\eta_1, \dots, \hat\eta_m, 0, \dots, 0, \dots)^T,\ 1 \le m \le N\}, \]
indexed by t = λ and t = m, respectively. The members of $\mathcal D_0$ are Stein-type shrinkage estimators. They “shrink” $\hat\eta_j = Y_j$ towards zero when 0 ≤ λ < 1, uniformly for all coordinates. The members of $\mathcal D_1$ are truncated estimators.
12.4.2 Oracle Inequality for Shrinkage Estimators
We first compute the MSE of δ ∈ D0 for η ∈ PN where PN is the GWN model with mean
vector (η1 , . . . , ηN , 0, . . .). This MSE is evidently
\[ r_{N,n}(\lambda) \equiv \sum_{j=1}^N E(\lambda\hat\eta_j - \eta_j)^2 = N\lambda^2\, \frac{\sigma_0^2}{n} + (1 - \lambda)^2 \sum_{j=1}^N \eta_j^2. \tag{12.4.8} \]
Using calculus, for a given η the oracle risk is obtained for
\[ \lambda(\eta) = \frac{|\eta|^2}{N\sigma_n^2 + |\eta|^2}, \]
where $\sigma_n^2 \equiv \sigma_0^2/n$. By substitution in (12.4.8), the oracle risk is given by
\[ R_O(N, n, \eta) = \frac{N\sigma_n^2\, |\eta|^2}{N\sigma_n^2 + |\eta|^2}. \tag{12.4.9} \]
To consider this problem in a different form we note that every $\delta \in \mathcal D_0$ can, for some γ, be identified with the penalty estimator
\[ \tilde\delta_\gamma \equiv \arg\min_\eta \Big\{ \sum_{i=1}^N (Y_i - \eta_i)^2 + \gamma \sum_{i=1}^N \eta_i^2 \Big\}, \]
which we recognize as the posterior mode if the ηi are considered as i.i.d. N (0, 1/γ), and
also as ridge regression (Section 12.2.3) for the canonical Gaussian regression model of
Section 6.1.1.
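Since the criterion defining $\tilde\delta_\gamma$ is separable in the coordinates, its minimizer is $\eta_i = Y_i/(1+\gamma)$, i.e. a member of $\mathcal D_0$ with $\lambda = 1/(1+\gamma)$. A quick numerical check (the numbers are arbitrary illustrations of ours):

```python
import numpy as np

Y = np.array([1.2, -0.4, 0.9, 2.0, -1.5, 0.3, 0.0, -2.2])
gamma = 0.5

# The penalized criterion is coordinatewise, minimized at eta_i = Y_i / (1 + gamma),
# i.e. the D0 shrinkage rule with lambda = 1 / (1 + gamma)
eta_ridge = Y / (1 + gamma)
lam_equiv = 1.0 / (1 + gamma)

# Brute-force check of the first coordinate over a fine grid
grid = np.linspace(-3, 3, 20001)
crit = (Y[0] - grid) ** 2 + gamma * grid ** 2
eta0_numeric = float(grid[crit.argmin()])
```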
For $\delta \in \mathcal D_0$ and $\eta \in \mathcal P_N$, Stein’s unbiased estimate of risk is (Problem 12.4.1)
\[ \hat R_O = -2N\sigma_n^2(1 - \lambda) + (1 - \lambda)^2 \sum_{j=1}^N \hat\eta_j^2, \tag{12.4.10} \]
which when minimized over λ ∈ [0, 1] gives Stein’s positive part type estimate,
\[ \delta_{\hat\lambda}(\hat\eta) = \Big(1 - \frac{N\sigma_n^2}{|\hat\eta|^2}\Big)_+ \hat\eta. \tag{12.4.11} \]
What can we say about the risk of this estimate?
Theorem 12.4.1. For N ≥ 4 and $\eta \in \mathcal P_N$, the risk of $\delta_{\hat\lambda}(\hat\eta)$ for squared error loss satisfies
\[ R(\eta, \delta_{\hat\lambda}) = E_\eta|\delta_{\hat\lambda}(\hat\eta) - \eta|^2 \le R_O(N, n, \eta) + 4\sigma_n^2. \]
Proof. (After Tsybakov (2008)). By (8.3.39), if
\[ \delta^*(\hat\eta) = \Big(1 - \frac{N\sigma_n^2}{|\hat\eta|^2}\Big)\hat\eta \]
is the Stein estimator, then
\[ R(\eta, \delta_{\hat\lambda}) \le R(\eta, \delta^*). \]
Now, by (8.3.31),
\[ R(\eta, \delta^*) = N\sigma_n^2 - (N^2 - 4N)\sigma_n^4\, E_\eta|\hat\eta|^{-2}. \]
But, by Jensen’s inequality,
\[ E|\hat\eta|^{-2} \ge (E|\hat\eta|^2)^{-1} = (|\eta|^2 + \sigma_n^2 N)^{-1}. \]
Substituting in, after some arithmetic (Problem 12.4.1),
\[ R(\eta, \delta_{\hat\lambda}) \le \frac{N\sigma_n^2|\eta|^2}{|\eta|^2 + N\sigma_n^2} + \frac{4\sigma_n^4 N}{|\eta|^2 + \sigma_n^2 N} \le R_O(N, n, \eta) + 4\sigma_n^2. \]
✷
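Theorem 12.4.1 can be checked by simulation; the sketch below (with our own choices of N, n, η) estimates the risk of the positive part estimate (12.4.11) by Monte Carlo and compares it with the bound $R_O(N, n, \eta) + 4\sigma_n^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, sigma0 = 10, 40, 1.0
sigma_n2 = sigma0 ** 2 / n
eta = 0.3 * np.ones(N)                        # illustrative eta
reps = 20000
Yh = eta + np.sqrt(sigma_n2) * rng.normal(size=(reps, N))

# Positive part Stein-type estimate (12.4.11)
shrink = np.clip(1 - N * sigma_n2 / (Yh ** 2).sum(axis=1), 0.0, None)[:, None]
risk_mc = float(((shrink * Yh - eta) ** 2).sum(axis=1).mean())

eta2 = float((eta ** 2).sum())
oracle_risk = N * sigma_n2 * eta2 / (N * sigma_n2 + eta2)   # (12.4.9)
bound = oracle_risk + 4 * sigma_n2
```

The simulated risk falls comfortably below the oracle-plus-penalty bound, while the oracle risk itself is below the risk $N\sigma_n^2$ of the unshrunk estimate.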
Note. The oracle inequality penalty term for not knowing η does not depend on η.
Remark 12.4.1.
(a) If we know β and can choose N, we can achieve the minimax rate of $n^{-2\beta/(2\beta+1)}$ over $\Theta_\beta$. The oracle risk is achieved by taking $N = n^{1/(2\beta+1)}$, since |η| is uniformly bounded over $\Theta_\beta$ (Problem 12.4.1).
(b) In the absence of the possibility of varying N, an asymptotically minimax rule is not possible using $\mathcal D_0$.
(c) On the other hand, as we shall see in the next subsection, using $\mathcal D_1$, the estimate selected using Stein’s unbiased risk estimate is what we shall define as adaptively minimax over all β > 0.
12.4.3 Oracle Inequality and Adaptive Minimax Rate for Truncated Estimates
We first establish that the oracle minimax rate for η ∈ Θβ and D1 agrees with the general
minimax rate of Section 12.3.4, and then show we can obtain this rate adaptively. Here, for
δm ∈ D1 and σn2 = σ02 /n,
\[ R(\eta, \delta_m) = m\sigma_n^2 + \sum_{i=m+1}^\infty \eta_i^2. \tag{12.4.12} \]
Let
\[ m_0(\eta) = \arg\min_m R(\eta, \delta_m). \]
Then, if $\eta \in \Theta_\beta = \big\{\eta : \sum_{j=1}^\infty (j^\beta \eta_j)^2 \le M\big\}$, some M > 0,
\[ R(\eta, \delta_{m_0(\eta)}) \le N\sigma_n^2 + N^{-2\beta} M, \]
since $\sum_{j=N+1}^\infty \eta_j^2 \le N^{-2\beta} \sum_{j=N+1}^\infty (j^\beta \eta_j)^2$. Minimizing over N we obtain
\[ N = \{2\beta M n/\sigma_0^2\}^{1/(2\beta+1)} \]
and the oracle minimax rate
\[ R(\eta, \delta_{m_0(\eta)}) \asymp n^{-2\beta/(2\beta+1)}. \tag{12.4.13} \]
We proceed to show, using an oracle inequality, that for procedures in $\mathcal D_1$, if we select m using Stein’s unbiased estimator of risk to obtain $\hat m$, then $\delta_{\hat m}$ has the correct minimax rate for all β. This is an example of adaptive (asymptotic) minimaxity, which we next define in the context of i.i.d. sampling. Generally, for a statistical procedure $\hat\delta_n$ with risk R, $\hat\delta_n(X_1, \dots, X_n)$ is asymptotically minimax over a model $\mathcal P$ iff
\[ \sup\{R(P, \hat\delta_n) : P \in \mathcal P\} \asymp \inf_\delta \sup\{R(P, \delta) : P \in \mathcal P\} \]
as n → ∞, where δ ranges over all decision procedures. Now suppose $\mathcal P_1, \mathcal P_2, \dots$ is a sieve of models with $\overline{\cup_{m=1}^\infty \mathcal P_m} = \mathcal M$ (closure in law). Then $\hat\delta_n$ is asymptotically adaptive minimax over $\mathcal M$ iff $\hat\delta_n$ is asymptotically minimax for each $\mathcal P_m$. Note that $\hat\delta_n$ has no knowledge of $\mathcal M$.
Here is our procedure based on Stein’s unbiased risk estimator. For rules in $\mathcal D_1$, $\Delta \equiv (\Delta_1, \dots, \Delta_n)^T$ is given by $\Delta_j = 0$, $j \le m$; $\Delta_j = -y_j$, $m+1 \le j \le n$. It follows that Stein’s unbiased estimate of risk is
\[ \hat R(m) = -2\sigma_n^2(n - m) + \sum_{j=m+1}^n \hat\eta_j^2 = \Big(\sum_{j=1}^n \hat\eta_j^2 - 2\sigma_n^2 n\Big) + \Big(2\sigma_n^2 m - \sum_{j=1}^m \hat\eta_j^2\Big). \tag{12.4.14} \]
The second term is just “Mallows’ Cp ” and since this is the only term that depends on m,
minimizing Stein’s unbiased risk estimate leads to Mallows’ selector
\[ \hat m = \arg\min\Big\{ 2\sigma_n^2 m - \sum_{j=1}^m \hat\eta_j^2 : 1 \le m \le n \Big\}. \]
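Mallows’ selector is easy to implement in the sequence model. The following sketch (with an illustrative polynomially decaying η of our choosing) computes $\hat m$ and compares the risk (12.4.12) of the selected truncation with the oracle choice $m_0(\eta)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
sigma_n2 = 1.0 / n
j = np.arange(1, n + 1)
eta = 1.0 / j ** 1.5                          # smooth, rapidly decaying sequence
eta_hat = eta + np.sqrt(sigma_n2) * rng.normal(size=n)

# Mallows' selector: minimize 2 sigma_n^2 m - sum_{j <= m} eta_hat_j^2
crit = 2 * sigma_n2 * j - np.cumsum(eta_hat ** 2)
m_hat = int(crit.argmin()) + 1

# True risk (12.4.12) of each truncation rule and the oracle choice m_0(eta)
tail = np.concatenate([(eta ** 2)[::-1].cumsum()[::-1][1:], [0.0]])
risk = j * sigma_n2 + tail
m_oracle = int(risk.argmin()) + 1
```

By construction the oracle risk is a lower bound on the risk of the data-selected truncation, and `m_hat` typically lands near `m_oracle`.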
Theorem 12.4.2. For all M and all β, 0 < β < ∞,
\[ \sup\big\{ E_\eta|\hat\eta_{\hat m} - \eta|^2 : \eta \in \Theta_\beta \big\} = O\big(n^{-2\beta/(2\beta+1)}\big). \tag{12.4.15} \]
The technical proof of this result is sketched in Problems 12.4.2–12.4.6.
Theorem 12.4.2 does establish rate optimality as n → ∞, defined by
\[ \overline{\lim}_n\, \sup_P\, \frac{E_P\, \ell(P, \delta_{\hat t})}{\ell(P, \delta_{t^*(P)})} \le K_0. \tag{12.4.16} \]
Thus, as n → ∞, the statistician’s method, in this case Mallows’, achieves the same rate of
convergence of the risk to 0 as does the oracle, uniformly on P.
More difficult to achieve, but evidently even more satisfactory, is asymptotic equivalence, defined by
\[ \lim_{n\to\infty}\, \sup_P\, E\, \frac{\ell(P, \delta_{\hat t})}{\ell(P, \delta_{t^*})} = 1, \tag{12.4.17} \]
uniformly on P. More sophisticated methods are able to achieve (12.4.17) for a class of
rules. See Tsybakov (2008), Theorem 3.6.
Remark 12.4.2.
1) Other Models: Although, as we have seen, the methods discussed apply to regression
with Gaussian errors, the Gaussian white noise model cannot be appealed to for density estimation or regression with non-Gaussian errors, not to speak of nonlinear semiparametric
models such as logistic regression. However, deep results of Nussbaum (1996) and Brown
and collaborators (1997, 2004) show that it is possible to generalize these conclusions to
models which can be approximated in the strong sense of Le Cam’s distance between experiments and include the models cited.
2) Bayesian Selection: If we assume that $P \in \mathcal P_{m_0}$ for some $m_0$, and if $m_0$ is known, we can estimate η efficiently using $\delta_{m_0}$ and obtain the optimal minimax rate of $n^{-1}$. This is possible without knowing $m_0$ if one uses Schwarz’s Bayes criterion (SBC) (Schwarz (1978)), also called the Bayes information criterion (BIC). (See also Rissanen (1983) for the, in this case equivalent, Minimum Description Length criterion.) For SBC, for some constant c > 0, we select (see Section 12.6.1)
\[ \hat m = \arg\max\Big\{ \sum_{j=1}^m \hat\eta_j^2 - 2c^2 m \log n : 1 \le m \le n \Big\}. \]
If, however, η is in the closure of $\cup_{m=1}^\infty \mathcal P_m$, then the rate of convergence of the risk of the SBC $\hat\eta_{\hat m}$ is worse than that of Mallows’ $\hat\eta_{\hat m}$ by a factor of log n (Problem 12.6.1). Note that the Mallows rule has the same formulation as SBC with 2 log n replaced by 2.
For the SBC $\hat m$ (Problem 12.4.7),
\[ \sigma_n^{-1}\, |\hat\eta_{\hat m}| \le \sqrt{2\log n}. \]
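The two selectors differ only in the per-coordinate penalty (2σ_n² for Mallows versus 2c² log n for SBC), so by the usual monotonicity of penalized argmins, SBC can never choose a larger model than Mallows. A small illustration (taking c² = σ_n², which is our assumption, and a truth with three nonzero coordinates):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
sigma_n2 = 1.0 / n
eta = np.zeros(n)
eta[:3] = [0.8, 0.5, 0.3]                     # true model: three nonzero coordinates
eta_hat = eta + np.sqrt(sigma_n2) * rng.normal(size=n)

m_grid = np.arange(1, n + 1)
cum = np.cumsum(eta_hat ** 2)
# Mallows: penalty 2 sigma_n^2 per coordinate; SBC/BIC: 2 c^2 log n with c^2 = sigma_n^2
m_mallows = int((2 * sigma_n2 * m_grid - cum).argmin()) + 1
m_sbc = int((2 * sigma_n2 * np.log(n) * m_grid - cum).argmin()) + 1
```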
12.4.4 An Oracle Inequality for Classification
The oracle inequality formulation in the forms (12.4.2) and (12.4.4) arose early on in the
work of Vapnik (1998) on classification. Classification is a problem with a more complex
loss function than quadratic loss and the difficulty of the problem is determined not only by
that of estimating the regression,
P [Y = 1|Z] ,
but also the contours of the regression, a task governed by what is called the margin (Mammen and Tsybakov (1995)). On the other hand, the boundedness of all variables involved
makes inequalities come out as simple consequences of the empirical process bounds of
Section 7.1 and Appendix D.2. Specifically consider a two-class problem so that the training samples (Zi , Yi ) are i.i.d. as (Z, Y ) ∼ P with Y = ±1 and Z ∈ H. In this case,
δt : H → {−1, 1} .
We use 0–1 loss, so that
\[ R(P, \delta_t) = P[Y\delta_t(Z) = -1]. \]
Let λ(y, d) denote the loss when y is classified as d ∈ {−1, 1}. An oracle would use the $t^*(P)$ that minimizes the classification loss $\ell(P, \delta) = E_P\, \lambda\big(Y, \delta(Z)\big)$,
\[ \ell(P, \delta_{t^*}) = \min_t \ell(P, \delta_t). \]
Since we don’t know P but have a training sample $X = \{(Z_i, Y_i) : 1 \le i \le n\}$, we want to construct $\delta_{\hat t}(\cdot)$ such that
\[ R(P, \delta_{\hat t}) = E\big\{ P\big[Y_{n+1}\delta_{\hat t}(Z_{n+1}) = -1 \mid X\big] \big\} \]
is small. A natural approach in the context of this section is to find an unbiased estimate $\hat R_t$ of $R(P, \delta_t)$ and minimize it. If we know nothing about P a natural approach is to use
\[ \hat R_t \equiv \frac1n \sum_{i=1}^n 1\big[Y_i \delta_t(Z_i) = -1\big] \]
and use $\delta_{\hat t}$ where
\[ \hat t = \arg\min_t \hat R_t. \]
We expect that $\hat R_{\hat t}$ is an underestimate of $R(P, \delta_{\hat t})$. We use an oracle inequality to relate $R(P, \delta_{\hat t})$ to $R(P, \delta_{t^*})$, although this does not actually give us an idea of what $R(P, \delta_{\hat t})$ is. What an oracle inequality requires is $\Delta_P$ such that
\[ E\ell(P, \delta_{\hat t}) \le E\ell(P, \delta_{t^*}) + \Delta_P. \]
Before going to an oracle inequality we establish the fundamental
Theorem 12.4.3.
\[ R(P, \delta_{t^*}) \le R(P, \delta_{\hat t}) \le R(P, \delta_{t^*}) + 2E\sup_t |\hat R_t - \ell(P, \delta_t)|. \tag{12.4.18} \]
Proof. By definition,
\[ \hat R_{\hat t} \le \hat R_{t^*}, \qquad R(P, \delta_{t^*}) \le R(P, \delta_{\hat t}). \]
Hence,
\[ \ell(P, \delta_{\hat t}) \le \hat R_{\hat t} + \sup_t |\hat R_t - \ell(P, \delta_t)| \le \hat R_{t^*} + \sup_t |\hat R_t - \ell(P, \delta_t)| \le \ell(P, \delta_{t^*}) + 2\sup_t |\hat R_t - \ell(P, \delta_t)|. \]
The theorem follows.
✷
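The optimism of the empirical risk under empirical risk minimization can be seen in a toy two-class problem: threshold classifiers $\delta_t(z) = \mathrm{sign}(z - t)$ with $Z \sim N(0,1)$ and labels flipped with probability 1/4 (all specifics here are our own illustration). The minimized empirical risk systematically underestimates the true risk of the selected rule:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def true_risk(t):
    # R(P, delta_t) = P[Y sign(Z - t) = -1] for Z ~ N(0,1) and
    # P(Y = sign(Z)) = 0.75; this works out to 0.5 * Phi(|t|), minimized at t = 0
    return 0.5 * Phi(abs(t))

rng = np.random.default_rng(6)
T = np.linspace(-1.5, 1.5, 31)               # finite class of threshold rules
n, sims = 50, 400
gaps = []
for _ in range(sims):
    Z = rng.normal(size=n)
    Y = np.where(rng.random(n) < 0.25, -np.sign(Z), np.sign(Z))   # flip with prob 1/4
    emp = np.array([(np.sign(Z - t) != Y).mean() for t in T])     # R-hat_t
    i = int(emp.argmin())                                          # ERM choice t-hat
    gaps.append(true_risk(T[i]) - emp[i])
optimism = float(np.mean(gaps))
```

`optimism` is the average amount by which the minimized empirical risk falls short of the true risk of the selected rule — exactly the gap the uniform deviation term in (12.4.18) controls.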
We now go to empirical process theory. Suppose that the bracketing number of $\mathcal F_T \equiv \{\ell(P, \delta_t) : t \in T\}$ satisfies
\[ N_{[\,]}\big(\delta, \mathcal F_T, L_2(P)\big) \le c\,\delta^{-d}, \tag{12.4.19} \]
for all P, δ sufficiently small, c independent of P. Then we can use a simplified version of Theorem 7.1.3, given below, which is appropriate since $|\ell(P, \delta_t)| \le 1$, and we do not need to bound the variance of $|\ell(P, \delta_t)|$ separately. Recall the notation
\[ E_n(f_t) = n^{-1} \sum_{i=1}^n \big(f_t(X_i) - E_P f_t(X_i)\big). \]
Theorem 12.4.4. (van der Vaart and Wellner (1996), Theorem 2.14.29). Given $\mathcal F_T = \{f_t : t \in T\}$ with $f_t$ satisfying $|f_t|_\infty \le 1$ for all $t \in T$, and $N\big(\delta, \mathcal F_T, L_2(P)\big) \le c\delta^{-d}$ for all P, δ, and c as in (12.4.19), then, for a universal constant $D \equiv D(c)$,
\[ P\Big[ n^{1/2} \sup_{t\in T} |E_n(f_t)| > a \Big] \le \frac{D^d}{\sqrt d}\, e^{-2a^2}. \tag{12.4.20} \]
We can now give an oracle inequality.
Theorem 12.4.5. If $\mathcal F_T$ satisfies (12.4.19), then
\[ R(P, \delta_{\hat t}) \le \inf_t R(P, \delta_t) + n^{-1/2}\, \frac{D^d\sqrt{2\pi}}{4\sqrt d}. \tag{12.4.21} \]
Proof. Use Theorem 12.4.3 and (12.4.20) and the standard (see Example 4.4.7) formula
\[ EU = \int_0^\infty P[U \ge v]\, dv \quad \text{if } U \ge 0. \]
✷
Remark 12.4.3. 1. The bound (12.4.21) does not rely on P belonging to any family $\mathcal P$ and as such may be poor. For instance, suppose $\delta_t$ is the Bayes rule if $P = Q_t$, where $\{Q_t : t \in \mathbb R^p\}$ is a regular parametric model. Then, although $\delta_{\hat t}$ is not chosen optimally, as it is in Example 12.3.1, it may still be shown that its Bayes regret is $O(n^{-1})$ whenever $P = Q_{t_0}$ for some $t_0$, rather than the $O(n^{-1/2})$ given by (12.4.21).
2. On the other hand, (12.4.21) is indeed correct for ascertaining minimax bounds over large families $\mathcal P$ — see Marron (1983), where the first such results were presented.
3. The bounds can give qualitative information on where ordinary asymptotic theory may not hold well, such as the behaviour of the best one can do as a function of the bracketing number “dimension” d. For regular parametric families $\mathcal F_T$, this is just the dimension of the parameter, whose effect on performance is not apparent in results such as Theorem 6.2.2. This additional type of information is a valuable feature of oracle inequalities.
4. This particular oracle inequality does not capture the phenomenon of “margin.” Roughly speaking, Bayes classification requires that we estimate, not necessarily $P[Y = 1|Z = z]$ as a function of z well, but rather its contours $C = \{z : P[Y = 1|Z = z] = \tfrac12\}$. This may be easy even if the function is estimated badly, provided the set of z’s near C has small probability. On the other hand, the contours of a polynomial in several dimensions may be very hard to estimate. The correct combination of conditions on P which determine minimax classification risk was determined by Mammen and Tsybakov (1995).
Summary. In this section we introduce the notion of oracle inequalities, the principal tool in the machine learning literature for characterizing the performance of classification rules both for restricted and unrestricted model classes $\mathcal P$. They are of value not only in their own right but also as principal tools in establishing the performance of potentially minimax rules. We first consider oracle inequalities for submodels of the Gaussian white noise model, in which we also study the behaviour of Stein’s unbiased risk estimate in constructing adaptive estimates. We next consider the basic type of such inequalities, valid for all P, considered first by Vapnik (1998) in the context of classification. The literature on and uses of such inequalities are perpetually growing. For many more examples see Györfi et al. (2002), Bishop (2006), and van de Geer (2000). A very thorough treatment is in Bühlmann and van de Geer (2010).
12.5 Performance and Tuning via Cross Validation
In the previous section we have discussed some methods of choosing the tuning parameter t that determines the level of regularization. We continue this theme here for a highly important approach referred to as cross validation. Not only does this approach
yield good prediction but it also provides methods of assessing prediction performance.
We have considered prediction performance from the point of view of oracle inequalities.
These bounds are worst case lower bounds on performance and as such are much too conservative. On the other hand we have, in the past, considered data-based measures such as
confidence intervals which though not giving the guarantees of oracle bounds are asymptotically correct and do well in practice. The analogue in the “machine learning context”
is to give reasonable estimates of misclassification probabilities and confidence bounds in
regression.
To illustrate the difference between oracle bounds and data-based methods consider the
simple problem of estimating the mean µ of a distribution concentrated on [−1, 1] using a
sample X1 , . . . , Xn i.i.d. as X. Our usual data-based approach outside this chapter would
be to use
\[ \bar X \pm z_{1-\alpha/2}\, \hat\sigma_n\, n^{-1/2} \]
as an approximate confidence interval, with $\hat\sigma_n^2 = \frac1n \sum_{i=1}^n (X_i - \bar X)^2$. On the other hand, we could give an oracle type guaranteed bound on µ. By Hoeffding’s inequality (7.1.10), since |X| ≤ 1,
\[ P[\sqrt n\, |\bar X - \mu| \ge v] \le 2e^{-v^2/2}. \]
This leads to guaranteed level (1 − α) confidence bounds for µ,
\[ \bar X \pm \sqrt{-2\log(\alpha/2)}\; n^{-1/2}. \]
Evidently, if α and Var(X) are small, this is a grossly conservative bound and Hoeffding’s
inequality gives a poor estimate of the performance of X̄. Thus the data-based bound is
preferable. Cross validation is a data-based approach.
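The contrast is easy to quantify. With the values below (our illustrative choices), the guaranteed interval obtained by solving $2e^{-v^2/2} = \alpha$ is several times wider than the CLT interval when the sample standard deviation is small:

```python
from math import log, sqrt

n, alpha = 400, 0.05
sigma_hat = 0.3                    # illustrative sample sd; |X| <= 1 caps the true sd at 1
z = 1.959963984540054              # z_{1 - alpha/2}

clt_halfwidth = z * sigma_hat / sqrt(n)                          # approximate CI
hoeffding_halfwidth = sqrt(-2.0 * log(alpha / 2.0)) / sqrt(n)    # guaranteed bound
ratio = hoeffding_halfwidth / clt_halfwidth
```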
Cross validation will be used to estimate criteria functions. Let δt denote a decision
rule with tuning parameter t ∈ T . Crude estimates of performance such as the empirical
misclassification rate
\[ \frac1n \sum_{i=1}^n 1\big[\delta_t(Z_i) \ne Y_i\big] \tag{12.5.1} \]
are poor, as oracle inequalities make clear. Since regularization is based on optimizing
an estimate of performance over a family of procedures, using a good estimate of performance is of importance for choosing the tuning parameter that corresponds to an appropriate amount of regularization; see Arlot and Celisse (2010). We pursue this aspect of
data determined selection of the tuning parameter that yields the optimal estimated risk,
returning to measurement of performance at the end.
12.5.1 Cross Validation for Tuning Parameter Choice
A number of methods involving sample splitting or approximations to such methods have
come under the heading of cross validation. The one most widely studied in the statistical
literature is
Leave one out cross validation and empirical risk minimization
The cross validation idea works for any prediction problem and is closely related to the
jackknife — see Section 10.3. Let $S \equiv \{X_i = (Z_i, Y_i), i = 1, \dots, n\}$, with $Y_i \in \mathbb R$, $Z_i \in \mathbb R^p$, be an i.i.d. sample, let t be a tuning parameter, and let $\{\delta_t(\cdot, X_1, \dots, X_n) : t \in T\}$ be a family of predictors which has been determined, for example, by the fitting of a sieve. Let l(y, d) be the prediction loss corresponding to the response y to be predicted and decision d. Thus, we have the examples
\[ l(y, d) = 1(y \ne d) \qquad \text{(0--1 loss)} \]
\[ l(y, d) = (y - d)^2 \qquad \text{(squared error loss)} \]
The naive estimate of the Bayes risk of δt (·) is, as we have noted, the empirical risk,
\[ \hat R(\delta_t) = \frac1n \sum_{i=1}^n l\big(Y_i, \delta_t(Z_i)\big). \]
Although typically asymptotically correct for any fixed t, this is, as we have noted, overoptimistic, since the performance of the rule is being judged on the sample used to create it.
One way to reduce this negative bias is to use leave one out cross validation (jackknife)
with risk defined as
\[ \hat R_J(\delta_t) = \frac1n \sum_{i=1}^n l\big(Y_i, \delta_t^{(-i)}(Z_i)\big), \tag{12.5.2} \]
where $\delta_t^{(-i)}$ is a rule based on $X^{(-i)} \equiv \{X_j, j \ne i\}$. Implicit in this is the assumption that $\delta_t(\cdot|X^{(-i)})$ is defined. We consider $\delta_t$ that can be written
\[ \delta_t(\cdot|X_1', \dots, X_m') = \delta_t(\cdot|\hat P_m), \]
where $\hat P_m$ is the empirical distribution of a subsample $X_1', \dots, X_m'$ of S, and $\delta_t(\cdot|P)$ is assumed to be defined for all P used in the following discussion. An indicator of the promise of leave one out cross validation is that $\hat R_J(\delta)$ is an unbiased estimate of the Bayes risk of δ for sample size n − 1, that is,
\[ E\hat R_J(\delta) = E\int l\big(y, \delta(z|\hat P_{n-1})\big)\, dP(z, y). \tag{12.5.3} \]
We define the leave one out cross validation choice of t as
\[ \hat t_J = \arg\min_t \hat R_J(\delta_t). \]
Barron, Birgé and Massart (1999) exhibit oracle inequalities for $\delta_{\hat t_J}$ and show a close relation between $\hat t_J$ and Mallows’ $C_p$.
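A minimal sketch of leave one out cross validation for a tuning parameter — here t is the degree of a polynomial regression fitted by least squares (the model, the loss, and the constants are our own illustration, not the text’s example):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
Z = np.sort(rng.uniform(-1, 1, n))
Y = np.sin(2 * Z) + 0.3 * rng.normal(size=n)

degrees = [1, 2, 3, 5, 8]                 # tuning parameter t = polynomial degree
loo_risk = []
for d in degrees:
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        coef = np.polyfit(Z[mask], Y[mask], d)     # delta_t fitted on X^(-i)
        errs.append((Y[i] - np.polyval(coef, Z[i])) ** 2)
    loo_risk.append(float(np.mean(errs)))
t_J = degrees[int(np.argmin(loo_risk))]
```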
V fold Cross Validation
A more general, computationally faster and much more broadly applicable method is V fold cross validation. We consider
\[ t \in T \equiv \{1, \dots, M_n\}. \]
The sample S is split into V disjoint parts, each with m = n/V observations. Call the V subsamples $S_1, \dots, S_V$. V = 5 and V = 10 are customary choices. If n = 500 and V = 10, we will have 10 subsamples of S, each with m = 50 observations. The estimate $\hat R_V$ of the Bayes risk of a rule $\delta_t$ is defined as follows. Let $S^{(-j)} = S - S_j$ and let $\delta_t^{(j)} = \delta_t(\cdot|P^{(j)})$, where $P^{(j)}$ is the empirical distribution of $\{X_i : X_i \in S^{(-j)}\}$. Then, for loss function l,
\[ \hat R_V(\delta_t) \equiv \frac{1}{mV} \sum \Big\{ l\big(Y_i, \delta_t^{(j)}(Z_i)\big) : i \in S_j,\ j = 1, \dots, V \Big\}. \tag{12.5.4} \]
That is, leave out the first m X’s, use the remaining m(V − 1) X’s to compute $\delta_t^{(j)}$, and average the loss on the m X’s left out. Then do the same for the next m observations, and so on for the V subsamples, then average over the subsamples. In practice, the division into V pieces may be made several times and the resulting $\hat R_V$’s averaged.
Define the V fold cross validation tuning parameter selector as
\[ \hat t \equiv \arg\min_t \hat R_V(\delta_t). \]
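The estimate (12.5.4) can be sketched directly. Below, V = 10 and the predictors are again least squares polynomial fits (an illustration of ours, not the text’s example):

```python
import numpy as np

rng = np.random.default_rng(8)
n, V = 200, 10
m = n // V
Z = rng.uniform(-1, 1, n)
Y = np.sin(2 * Z) + 0.3 * rng.normal(size=n)

degrees = [1, 2, 3, 5, 8]
idx = rng.permutation(n).reshape(V, m)    # the V subsamples S_1, ..., S_V
cv_risk = []
for d in degrees:
    total = 0.0
    for j in range(V):
        test = idx[j]
        train = np.setdiff1d(np.arange(n), test)
        coef = np.polyfit(Z[train], Y[train], d)   # delta_t^{(j)} from S - S_j
        total += float(((Y[test] - np.polyval(coef, Z[test])) ** 2).sum())
    cv_risk.append(total / (m * V))                # R-hat_V of (12.5.4)
t_hat = degrees[int(np.argmin(cv_risk))]
```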
We will consider properties of $\delta_{\hat t}$ for the problem of estimating $\mu(z) = E(Y|Z = z)$
using the squared error loss function, ℓ(y, d) = (y − d)2 . In particular, we shall state a set
of assumptions under which V fold cross validation for quadratic loss is optimal in a sense
we shall define.
We give the argument for sample splitting cross validation. Our proof will make
it clear that V fold cross validation is argued in the same way. We split the sample
(Z1 , Y1 ), . . . , (Zn , Yn ) into a test sample of cardinality m and a training sample of cardinality n − m. The predictor is computed for the training set and its loss is evaluated at
the test set. Without loss of generality let the test set be S1 = (Z1 , Y1 ), . . . , (Zm , Ym ).
Use the notation, for a ≡ a(Z, Y ),
\[ \|a\|_P^2 \equiv \int a^2(z, y)\, dP(z, y), \qquad \|a\|_m^2 \equiv \frac1m \sum_{i=1}^m a^2(Z_i, Y_i). \]
We introduce the following assumptions and further notation.
A0: Y is bounded: |Y| ≤ C < ∞.
A1: Write $\delta_t(\cdot)$ for $\delta_t(\cdot|P^{(1)})$, where $P^{(1)}$ is the empirical distribution of the training set $\{X_i : X_i \notin S_1\}$.
Section 12.5
Performance and Tuning via Cross Validation
365
Let
\[ t^* = \arg\min_t \|Y - \delta_t(Z)\|_P^2 = \arg\min_t \|\mu(Z) - \delta_t(Z)\|_P^2. \]
Assume there exists a best oracle rule defined by
\[ \delta_O(z) \equiv \delta_{t^*}(z), \]
which, for fixed $P \in \mathcal P$ and some sequence $r_n$, satisfies
\[ \|\delta_O(Z) - \mu(Z)\|_P^2 = r_n^2\, (1 + o(1)). \]
A2: There exists a best “CV selected rule” defined by $\hat\delta \equiv \delta_{\hat t}$, where
\[ \hat t = \arg\min_t \|Y - \delta_t(Z)\|_m^2. \]
A3: Let $\Delta_t(Z) = |\delta_t(Z) - \mu(Z)|$, and assume that for fixed $P \in \mathcal P$,
\[ \max\Big\{ \Big| \frac{\|\Delta_t(Z)\|_m^2}{\|\Delta_t(Z)\|_P^2} - 1 \Big| : t \in \{1, \dots, M_n\} \Big\} = o_P(1). \]
If we assume $|\delta_t(Z)| \le |Y| \le C_n < \infty$ for all $t \in \{1, \dots, M_n\}$ and $n^{-1}\log M_n = o(1)$, then A3 holds by Hoeffding’s inequality.
A4: $\dfrac{\log^{1/2} M_n}{m r_n^2} = o(1)$.
Let
\[ \hat r_n = \|\delta_{\hat t} - \mu\|_P. \]
Theorem 12.5.1. Under A0–A4,
\[ \hat r_n = r_n\, (1 + o_P(1)). \tag{12.5.5} \]
Remark 12.5.1. For Theorem 12.5.1,
1) A0 can be weakened with the result still holding.
2) The conditions can be strengthened to yield an oracle inequality form of the Theorem.
3) The difference between this sample splitting result and the result for V > 2 is that
sample splitting is only carried out once to form the risk estimate. This is inessential
because the average of V such risk estimates is more accurate although the summands
are highly correlated.
4) It is, as we have seen, common to have, for $P \in \mathcal P$ nonparametric,
\[ r_n \sim c\, n^{-1+\delta}, \qquad \delta > 0. \]
If $M_n = n^K$, K < ∞, and $m = n/\omega(n)$, where ω(n) ↑ ∞ slowly enough and A3 and A4 hold, then the theorem holds. The take-home story is that V fold cross validated choice with V = ω(n), i.e. almost bounded, yields oracle prediction for quadratic loss.
Proof of Theorem 12.5.1. By definition,
\[ \hat R(\delta_{\hat t}) \le \hat R(\delta_{t^*}), \tag{12.5.6} \]
\[ \|Y - \delta_{t^*}\|_P^2 \le \|Y - \delta_{\hat t}\|_P^2. \tag{12.5.7} \]
Rewrite (12.5.6) as
\[ \frac2m \sum_{i=1}^m \big(\mu(Z_i) - \hat\delta(Z_i)\big)\varepsilon_i + \|\hat\delta - \mu\|_m^2 \le \frac2m \sum_{i=1}^m \big(\mu(Z_i) - \delta_O(Z_i)\big)\varepsilon_i + \|\delta_O - \mu\|_m^2, \tag{12.5.8} \]
where $\varepsilon_i \equiv Y_i - \mu(Z_i)$. Note that
\[ \Big(\frac1m \sum_{i=1}^m \big(\mu(Z_i) - \hat\delta(Z_i)\big)\varepsilon_i\Big)^2 \le \max_{1\le t\le M_n} \Big\{ \Big(\frac1m \sum_{i=1}^m \frac{\mu(Z_i) - \delta_t(Z_i)}{\|\mu - \delta_t\|_m}\, \varepsilon_i\Big)^2 \Big\}\; \|\mu - \hat\delta\|_m^2. \tag{12.5.9} \]
The first factor of the RHS of (12.5.9) is of the form
\[ U_n \equiv \max\Big\{ \Big(\frac1m \sum_{i=1}^m a_{it}\varepsilon_i\Big)^2 : t \in \{1, \dots, M_n\} \Big\}, \]
where $\{a_{it}\}$ and $\varepsilon_i$ are independent given $Z_1, \dots, Z_m$, $E(\varepsilon_i|Z_1, \dots, Z_m) = 0$, and $\frac1m\sum_{i=1}^m a_{it}^2 = 1$.
By an inequality of Pinelis, which is an application of Hoeffding’s inequality (7.2.9), and the boundedness of the $\varepsilon_i$, $|\varepsilon_i| \le 2C$, we have
\[ EU_n \le 16 C^2 \log M_n / m. \tag{12.5.10} \]
Conditioning on $X_{m+1}, \dots, X_n$ we obtain from (12.5.8)–(12.5.10) that
\[ \|\mu - \hat\delta\|_m^2 \le 4C m^{-1/2} (\log M_n)^{1/2}\, \|\mu - \hat\delta\|_m + \|\mu - \delta_O\|_m^2. \tag{12.5.11} \]
Next,
\[ \Big| \frac{\|\Delta_{\hat t}\|_m^2}{\|\Delta_{\hat t}\|_P^2} - 1 \Big| \le \max_t \Big| \frac{\|\Delta_t\|_m^2}{\|\Delta_t\|_P^2} - 1 \Big| = o_P(1), \tag{12.5.12} \]
by A3, and the same holds for $\Delta_{t^*}$. From (12.5.11) and A4, we obtain
\[ \|\mu - \hat\delta\|_m^2 \le \big(1 + o_P(1)\big)\, \|\mu - \delta_O\|_m^2, \]
and applying (12.5.12) the theorem follows.
✷
Remark 12.5.2. This V fold criterion is the most widely used one for tuning parameter choice. It can be shown (Barron, Birgé and Massart (1999), Propositions 1, 2 and Theorem 3) that if we consider multiple regression with all $2^p - 1$ possible submodels as a sieve and p ∼ log n, then leave one out cross validation does not regularize least squares sufficiently while V fold, for suitable V, n, p, does.
We now turn briefly to assessment of performance.
12.5.2 Cross Validation for Measuring Performance
As we have noted, the naive estimate $\hat R$ is typically an underestimate of misclassification risk. Efron (1983) shows both theoretically, through higher order expansion, and experimentally, by Monte Carlo, that the leave one out jackknife estimate $\hat R_J$ is a better estimate of the Bayes risk than $\hat R$. He develops an even better procedure using a combination of 2 fold cross validation and $\hat R$. In this case, 2 fold cross validation itself is not very satisfactory, since one is essentially unbiasedly estimating the behaviour of the rule using a training sample of size n/2, thus overestimating the Bayes risk for a sample of size n. Since $\hat R$ is an underestimate, it is reasonable to combine the two, and Efron shows that the weights .632 for the cross validation estimate and .368 for $\hat R$ are best. It seems plausible, in view of Theorem 12.5.1, that using log n fold cross validation to estimate the Bayes risk of, say, $\delta_{\hat t}$ should behave well. Unfortunately, using log n cross validation for t, itself chosen using log n cross validation, is apt to yield an underestimate. Arlot has shown that using Efron’s .632 estimate to select t has given excellent results; see Arlot and Celisse (2010). But again the goals of performance assessment and optimization clash, although we could apply the .632 approach to the optimized rule.
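The combination described above — weight .632 on the 2 fold cross validation estimate and .368 on the naive estimate $\hat R$ — can be sketched for a toy classifier (the data generating process and the thresholding rule are our own illustration, not Efron’s original bootstrap construction):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
Z = rng.normal(size=n)
Y = (rng.random(n) < np.where(Z > 0, 0.8, 0.2)).astype(int) * 2 - 1   # labels in {-1, 1}

def fit_threshold(Zs, Ys):
    # simple classifier: threshold at the midpoint of the two class means
    return 0.5 * (Zs[Ys == 1].mean() + Zs[Ys == -1].mean())

t_full = fit_threshold(Z, Y)
apparent = float((np.sign(Z - t_full) != Y).mean())     # naive R-hat, optimistic

# 2 fold cross validation estimate
half = n // 2
t1 = fit_threshold(Z[:half], Y[:half])
t2 = fit_threshold(Z[half:], Y[half:])
cv2 = 0.5 * (float((np.sign(Z[half:] - t1) != Y[half:]).mean())
             + float((np.sign(Z[:half] - t2) != Y[:half]).mean()))

# Efron's weights as quoted in the text
r632 = 0.632 * cv2 + 0.368 * apparent
```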
Summary. In this section we first study cross validation, the chief method for selecting
tuning parameters in prediction methods, and show the wide applicability of V fold cross
validation as opposed to leave one out cross validation with a theorem on the asymptotic
optimality of V fold cross validation for squared error loss. Actually estimating performance of a final data determined choice of decision rule is difficult although Efron’s .632
rule may be generally helpful.
12.6 Model Selection and Dimension Reduction
The first of these two topics starts with a model point of view more akin to Chapters 1–5
where we focussed on parametric models. We still, as in the previous sections, assume we
have a nested sieve of models whose closure is nonparametric, but add the assumption that
one of the finite dimensional members of the sieve is generating the data. Specifically, if
we are, say, considering regression models with a fixed number k of real factors Z1 , . . . , Zk
and we can expand E(Y |Z) by means of tensor bases to a function of an infinite number of
regression covariates, as in Example 12.3.1, we assume that one of these, say a polynomial
of order d in the factors, is, in fact, generating the data.
Our second topic is the more fundamental but the hardest to quantify. The point of
departure is a set of data, say a sample from a population, which is very high dimensional
and has a very complex representation. The object is to find a low dimensional, simple
representation which can give subject matter insights and lead to efficient prediction and
classification procedures. There is no commitment to a particular model a priori. We are
dealing, for example, with a sample of small size in relation to the dimension of the data
so that there is no hope of nonparametric estimation of the density. We seek to find a
representation of the data and/or the model which is sufficiently low dimensional to be
worked with, but sufficiently accurate to enable us to find the essential features of the
situation we are analyzing. In this context we will introduce Principal Component Analysis
(PCA), a classical method for dimension reduction. We will discuss some difficulties of this
and related methods in modern applications. We will also discuss the relationship between
the importance of factors and sparsity of models.
12.6.1 A Bayesian Criterion for Model Selection
We consider the problem of selecting a model $\mathcal M$ from a sieve of regular parametric models, $\mathcal M_0 \subset \dots \subset \mathcal M_p$, where
\[ \mathcal M_k \equiv \{P_{\theta^{(k)}}(\cdot) : \theta^{(k)} \in \Theta_k \text{ open in } \mathbb R^{N_k}\}, \qquad N_0 < \dots < N_p. \]
For a given P in the sieve let $k_0 = k_0(P)$ be the smallest k such that $P \in \mathcal M_k$, and let $\theta_0 = \theta_0^{(k_0)} \in \Theta_{k_0}$ denote the subscript on this P. Then P is also in the models with $k > k_0$. If P is in the sieve and generates the data, $k_0(P)$ and $\theta_0^{(k_0(P))}$ are assumed identifiable. We will discuss the asymptotic determination of $k_0$ and estimation of $\theta_0^{(k_0)}$ within $\Theta_{k_0}$. Here we consider the simplest model $\mathcal M_{k_0}$ to be the most parsimonious or “best” model, and we want to have high probability of selecting $\mathcal M_{k_0}$ for large sample sizes.
To focus ideas, first consider a simple sieve with data $X_i = (Z_i, Y_i)$ i.i.d. and the familiar regression model,
\[ Y_i = \beta_0 + \sum_{j=1}^d \beta_j Z_{ij} + \varepsilon_i, \quad i = 1, \dots, n, \]
where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$ and $Z_i = (Z_{i1}, \dots, Z_{id})^T$. Suppose we can order the factors $Z_1, \dots, Z_d$ in terms of importance to Y. Then $\mathcal M_k$ is parametrized by $\theta^{(k)} \equiv (\beta_0, \dots, \beta_k, \sigma^2)^T$, $\Theta_k \subset \mathbb R^{k+2}$, $N_p = d + 2$, and $P \in \mathcal M_k$ iff $\beta_{k+1} = \dots = \beta_d = 0$. Determining $k_0$, the number of factors $Z_j$ with non-zero beta coefficients, is the subject of much research. We could try the following: “Consecutively test $H_j: \beta_j = 0$ vs $\beta_j \ne 0$,
j = 1, . . . , d, stopping as soon as we do not reject Hj , and set k0 = j − 1”. The main issue
with this approach is what critical value to use, that is, how to balance type I and II error
probabilities. Here we turn to the simplest approach, which is based on a Bayesian point of
view.
Returning to the general case, we place prior mass $\alpha_k$ on $\mathcal M_k$, $\sum_{k=0}^p \alpha_k = 1$, $\alpha_k > 0$. That is, $\alpha_k$ is the probability that $k_0 = k$. Given $k_0 = k$ and $P_\theta \in \mathcal M_k$, we assume $\theta = \theta^{(k)}$ has a prior density $\pi_k(\cdot)$ on $\Theta_k$ with $\pi_k$ positive and continuous. A natural rule for choosing $k_0(P)$ is “Select the $\hat k$ which maximizes the posterior probability $\hat\alpha_k$ of $\mathcal M_k$.” That is,
\[ \hat k = \arg\max_k \hat\alpha_k \equiv \arg\max_k \mathrm{Prob}(k_0 = k \mid X), \]
T
where X = (X1 , . . . , Xn ) with X1 , . . . , Xn i.i.d. as X ∈ Rq , q ≥ 1. We prove, under suitable regularity conditions, a theorem which gives a general asymptotically optimal
b
Bayes rule b
k. Schwarz (1978) gave this result for the special case of exponential families.
In our Bayesian framework, we assume that given k and θ(k) , Xi has density p(·|θ(k) ),
θ(k) ∈ Θk . Then the density of X given k is
Z
Πni=1 p(xi |θ(k) )πk (θ (k) )dθk .
p(x|k) =
Θk
Let Ik(θ^(k)) = −E_{θ^(k)} ∇∇^T log p(X|θ^(k)), θ^(k) ∈ Θk, be the Fisher information matrix, and let K(θ^(k), θ0) = −E_{θ0} log[p(X|θ^(k))/p(X|θ0)] be the Kullback–Leibler divergence, where θ0 ≡ (θ01, ..., θ0Nk0)^T ≡ θ0^(k0) and k0 = k0(P) corresponds to the true P. We assume

(i) {p(·|θ^(k))}, k ≥ 0, satisfy the conditions of Theorem 6.2.3 for each Mk.

(ii) K(θ*^(k), θ0) > 0 for k < k0, where θ*^(k) ≡ arg min{K(θ^(k), θ0) : θ^(k) ∈ Θk}.

(iii) If k < k0 and Pθ0 generates the data, then the assumptions of Theorem 6.2.1 apply with ψ = log p(x|θ^(k)) and θ(P) = θ*^(k).

Conditions that imply (iii) can be obtained from the M-estimate consistency results of Chapters 6 and 9 because θ̂^(k) maximizes n^{-1} Σ_{i=1}^n log[p(Xi|θ^(k))/p(Xi|θ0)], so that the corresponding population parameter minimizes K(θ^(k), θ0) over θ^(k) ∈ Θk, k < k0.
By Bayes' formula, for the appropriate constant c > 0, the posterior probability of model Mk is
\[
\hat\alpha_k = \alpha_k \int \prod_{i=1}^{n} p(x_i \mid \theta^{(k)})\,\pi_k(\theta^{(k)})\,d\theta^{(k)} \Big/ c .
\]
We next give approximations to the maximizer k̂ of α̂k and establish their consistency in a frequentist context.
370
Prediction and Machine Learning
Chapter 12
Theorem 12.6.1. Under (i), (ii), and (iii), let Lk = Π_{i=1}^n p(Xi|θ̂^(k)), where θ̂^(k) is the MLE of θ^(k) in model Mk, 0 ≤ k ≤ p. Suppose the data are generated by P ≡ Pθ0 with θ0 = θ0^(k0). Then, if |·| denotes determinant,

(a) For k > k0, as n → ∞,
\[
\log\frac{\hat\alpha_k}{\hat\alpha_{k_0}} = \log\frac{L_k}{L_{k_0}} - \frac{k - k_0}{2}\log\frac{n}{2\pi} - \frac{1}{2}\log\frac{|I_k(\theta_0^{(k)})|}{|I_{k_0}(\theta_0^{(k_0)})|} + \log\frac{\alpha_k}{\alpha_{k_0}} + o_P(1) . \tag{12.6.1}
\]

(b) For k < k0,
\[
\frac{\hat\alpha_k}{\hat\alpha_{k_0}} \xrightarrow{P} 0 \text{ as } n \to \infty . \tag{12.6.2}
\]

(c) If \(\hat k\) is the maximizer of α̂k and \(\hat{\hat k}\) maximizes
\[
2\log L_k - k(\log n - \log 2\pi) + 2\log\alpha_k - \log|I_k(\hat\theta^{(k)})| , \tag{12.6.3}
\]
then \(\hat{\hat k}\) is consistent in the sense that
\[
P[\hat{\hat k} = \hat k = k_0(P)] \to 1 \text{ as } n \to \infty . \tag{12.6.4}
\]

(d) The universal criterion
\[
k^* = \arg\max_k \Big\{\log L_k - \frac{1}{2}\,k \log n\Big\} \tag{12.6.5}
\]
satisfies
\[
P[k^* = \hat k = k_0(P)] \to 1 \text{ as } n \to \infty .
\]
Proof. We begin with (12.6.1). Letting k > k0, write
\[
\log\frac{\hat\alpha_k}{\hat\alpha_{k_0}} = \log\frac{L_k}{L_{k_0}} - \log\int \prod_{i=1}^{n} \frac{p(X_i\mid\theta^{(k_0)})}{p(X_i\mid\hat\theta^{(k_0)})}\,\pi_{k_0}(\theta^{(k_0)})\,d\theta^{(k_0)} + \log\int \prod_{i=1}^{n} \frac{p(X_i\mid\theta^{(k)})}{p(X_i\mid\hat\theta^{(k)})}\,\pi_k(\theta^{(k)})\,d\theta^{(k)} + \log\frac{\alpha_k}{\alpha_{k_0}} . \tag{12.6.6}
\]
Arguing as in the proofs of Theorems 5.5.2 and 6.2.3, reasoning as for (5.5.12), noting that P_{θ0^(k)} = P_{θ0^(k0)} and that both θ0^(k) and θ0^(k0) are interior points of their respective parameter spaces, we find that as n → ∞ the two integrals are
\[
A = n^{-\frac{k_0+1}{2}} \int \exp\Big\{-\frac{1}{2}\,\mathbf t^T I_{k_0}(\theta_0^{(k_0)})\,\mathbf t\Big\}\,d\mathbf t\,\big(1 + o_P(1)\big) = \Big(\frac{2\pi}{n}\Big)^{\frac{k_0+1}{2}} |I_{k_0}(\theta_0^{(k_0)})|^{-1/2}\,\big(1 + o_P(1)\big)
\]
\[
B = n^{-\frac{k+1}{2}} \int \exp\Big\{-\frac{1}{2}\,\mathbf t^T I_k(\theta_0^{(k)})\,\mathbf t\Big\}\,d\mathbf t\,\big(1 + o_P(1)\big) = \Big(\frac{2\pi}{n}\Big)^{\frac{k+1}{2}} |I_k(\theta_0^{(k)})|^{-1/2}\,\big(1 + o_P(1)\big)
\]
where the A integral is over R^{Nk0} and the B integral is over R^{Nk}, and the evaluations use the fact that the integral of a normal density is 1. Substituting log B − log A in (12.6.6), we obtain (12.6.1). Note that even though the information matrices in Mk and Mk0 are both evaluated at the point corresponding to P, they are typically different.
To prove (12.6.2) note that, for k < k0,
\[
\log(\hat\alpha_{k_0}/\hat\alpha_k) = \log\int \prod_{i=1}^{n} \Big[\frac{p(X_i\mid\theta^{(k_0)})}{p(X_i\mid\hat\theta^{(k_0)})}\Big]\,\pi_{k_0}(\theta^{(k_0)})\,d\theta^{(k_0)} - \log\int \prod_{i=1}^{n} \Big[\frac{p(X_i\mid\theta^{(k)})}{p(X_i\mid\hat\theta^{(k_0)})}\Big]\,\pi_k(\theta^{(k)})\,d\theta^{(k)} + \log\frac{\alpha_{k_0}}{\alpha_k} \equiv C_1 + C_2 + C_3 .
\]
The first term is identical to the corresponding k > k0 term and thus equals log A, which is negative and of order log n. We establish (12.6.2) by showing that the second term is positive and of order n.
Because θ̂^(k) is an MLE, p(Xi|θ̂^(k)) ≥ p(Xi|θ^(k)) for all θ^(k); after replacing θ^(k) by θ̂^(k) in the likelihood ratio, only the πk part of the integrand depends on θ^(k), and the integral of a density is one. Hence
\[
C_2 = -\log \int \prod_{i=1}^{n}\Big[\frac{p(X_i\mid\theta^{(k)})}{p(X_i\mid\hat\theta^{(k_0)})}\Big]\,\pi_k(\theta^{(k)})\,d\theta^{(k)}
\;\ge\; \sum_{i=1}^{n} \log\Big[\frac{p(X_i\mid\hat\theta^{(k_0)})}{p(X_i\mid\hat\theta^{(k)})}\Big] . \tag{12.6.7}
\]
Centering at θ0 and θ*^(k), we obtain
\[
\frac{1}{n}\,C_2 \ge \frac{1}{n}\sum_{i=1}^{n} \log\Big\{\frac{p(X_i\mid\hat\theta^{(k_0)})/p(X_i\mid\theta_0)}{p(X_i\mid\hat\theta^{(k)})/p(X_i\mid\theta_0)}\Big\}
\]
\[
= \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i\mid\theta_0)}{p(X_i\mid\theta^{(k)}_{*})}
+ \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i\mid\hat\theta^{(k_0)})}{p(X_i\mid\theta_0)}
- \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i\mid\hat\theta^{(k)})}{p(X_i\mid\theta^{(k)}_{*})}
\equiv D_1 + D_2 + D_3 .
\]
The first term D1 converges by the law of large numbers almost surely to K(θ*^(k), θ0) > 0. For the second term, note that 2nD2 is the likelihood ratio statistic for testing H : θ = θ0, and thus converges in law to a χ²_{k0} random variable by Theorem 6.3.2. It follows that D2 = OP(n^{-1}). We can also show that 2nD3 converges in law to a χ²_k random variable by arguing as we did to obtain Theorem 6.3.2 from Theorem 6.2.2. Thus D3 = OP(n^{-1}) also. We conclude that C2 is positive and of order n. Thus α̂k/α̂k0 →P 0 as n → ∞ when k < k0.
To show consistency of k̂, note that maximizing α̂k is equivalent to maximizing ρ̂k ≡ log[α̂k/α̂k0]. Because ρ̂k0 = 0, we have max_k ρ̂k ≥ 0. We next use this to show that P(k̂ ≠ k0) → 0.
a) If k < k0, then P(k̂ = k) ≤ P(ρ̂k ≥ 0) → 0 by (12.6.2).

b) If k > k0, then, by (12.6.1),
\[
P(\hat k = k) \le P(\hat\rho_k \ge 0) = P\big(2\log[L_k/L_{k_0}] \ge (k - k_0)\log n + O_P(1)\big) .
\]
By Theorem 6.3.2, 2 log[Lk/Lk0] converges in law to a χ² variable. Thus, P(k̂ = k) → 0.

Next consider the consistency of \(\hat{\hat k}\). It is of the form arg max{ρ̂k + ak} where ak = OP(1); thus if k ≠ k0, then P(\(\hat{\hat k}\) = k) ≤ P(ρ̂k ≥ OP(1)) → 0 as before. A similar argument applies to k*. ✷
Remark 12.6.1
(a) As we have noted in Section I.7, in the context of regression models, the asymptotic Bayes criterion k* in (12.6.5) (called SBC and BIC) has the same structure as Mallows' Cp but with log n replacing 2. So log n, corresponding to a test with significance level tending to 0, gives the correct threshold according to this criterion. The equivalence up to threshold continues to hold for any sequence of smooth nested models if Cp is replaced by the Akaike AIC criterion.
(b) As we have noted in the Gaussian white noise (GWN) framework, Section 12.4, if the models are not nested, for instance, if we consider all regression models with d or fewer variables, the correct threshold for prediction changes to something close to BC. For instance, this is the case if we modify Section I.7 slightly and suppose we observe
\[
X_{ij} = \mu_j + \varepsilon_{ij}, \quad j = 1, \dots, d,\ i = 1, \dots, n,
\]
with εij i.i.d. N(0, 1), that is, one way ANOVA. Let any subset {μ_{j1}, ..., μ_{jk}}, k ≤ d, correspond to the model where μ_{j1} = ··· = μ_{jk} = 0, with all other μ's varying freely. Then, for fixed j, the best test is to reject H : μj = 0 if |X̄_j| ≥ c/√n. If we want to choose the correct model with probability tending to 1 we need to take c of order √(2 log d). But if we consider the sparse set
\[
\Theta_M \equiv \Big\{\mu : \sum_{j=1}^{d} 1(\mu_j \ne 0) \le M\Big\}
\]
we saw in Section 12.3.3 that, if M is bounded, hard thresholding with c of order (log d)^{1/2} yields the minimax MSE order (n^{-1} log d)^{1/2}.
(c) Software for the SBC rule can be found under “BIC” in “R” and other software packages.
(d) The prior αk does not appear in the rule k ∗ .
✷
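The thresholding rule of Remark 12.6.1(b) can be sketched in a few lines; this is our illustration, not the book's code, and the data values are invented. A mean is kept only when |X̄_j| exceeds c/√n with c = √(2 log d):

```python
import math

def hard_threshold(xbar, n, d):
    """Keep the j-th mean only if |xbar_j| >= c/sqrt(n) with
    c = sqrt(2 log d), the order-(log d)^(1/2) threshold of
    Remark 12.6.1(b)."""
    c = math.sqrt(2.0 * math.log(d))
    return [x if abs(x) >= c / math.sqrt(n) else 0.0 for x in xbar]

# d = 8 group means, n = 100 observations each; two strong signals survive.
xbar = [0.9, -0.5, 0.01, -0.02, 0.015, 0.0, -0.01, 0.005]
kept = hard_threshold(xbar, n=100, d=8)
```

Here c/√n ≈ 0.20, so only the first two means survive; the remaining small averages are set to 0, which is what keeps the MSE at the (n^{-1} log d)^{1/2} order over the sparse set Θ_M.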
12.6.2 Inference after Model Selection

Again consider the sieve of Section 12.6.1 and let k̂ denote a consistent estimate of k0. Given consistency of k̂, it follows that acting as if M_{k̂} is the true model gives asymptotically correct results for testing and estimation of the parameters θ1, ..., θ_{k0}. To see this, suppose Tn is a standardized estimate or test statistic such that Tn →L T as n → ∞ for model M_{k0}. Then
\[
P_0(T_n \le t) = P_0(T_n \le t \mid \hat k = k_0)\,P_0(\hat k = k_0) + P_0(T_n \le t \mid \hat k \ne k_0)\,P_0(\hat k \ne k_0),
\]
which implies P0(Tn ≤ t | k̂ = k0) → P0(T ≤ t) as n → ∞.
Consider our regression example with d coefficients β1, ..., βd and p = d + 2. It seems rather unlikely in most cases that the number of factors p − k0 which have zero beta coefficients bears no relation to the selection of k and asymptotic inference. Put another way, if p − k0 → ∞, it is unlikely that there exists a k0 = k0(P) that is bounded independently of p. We expect indeterminacy of k0 to affect inference procedures. In this case, the regression example makes clear what happens. Suppose the Zj = hj(Z) are a basis of L2(Z), where Z is a univariate regressor, and consider the model
\[
Y_i = \sum_{j=1}^{d} \beta_j h_j(Z_i) + \varepsilon_i, \quad i = 1, \dots, n
\]
where {εi} are i.i.d. N(0, σ²) and Ehj(Zi) = 0. If the hj are orthogonal, i.e., ∫ hj(z)hk(z) dP(z) = 0 for j ≠ k, the MLE β̂j of βj stabilizes no matter how big the fixed p is, since the vectors (hj(Z1), ..., hj(Zn))^T are asymptotically orthogonal.
Using Example 6.2.1, we can show that
\[
\hat\beta_j = \sum_{i=1}^{n} Y_i h_j(Z_i) \Big/ \sum_{i=1}^{n} h_j^2(Z_i) + o_P(1) \tag{12.6.8}
\]
and, arguing formally,
\[
E(\hat\beta_j) = E\{E(\hat\beta_j \mid \mathbf Z)\} = \beta_j + o(1)
\]
because
\[
E(\hat\beta_j \mid \mathbf Z) = \frac{E(Y h_j(Z) \mid \mathbf Z)}{E h_j^2(Z)} + o_P(1) = \beta_j + o_P(1) , \tag{12.6.9}
\]
and E(Y|Z) = Σ_{j=1}^d βj hj(Z). Similarly, by (1.4.6),
\[
\operatorname{Var}\hat\beta_j = n^{-1}\,\frac{E[h_j^2(Z)\operatorname{Var}(Y \mid Z)]}{[E h_j^2(Z)]^2} + o(n^{-1}) \tag{12.6.10}
\]
since E(β̂j | Z) is nearly constant. Thus inference does not depend on the number of nonzero coefficients to first order. However, if the hj(Z) are not orthogonal, Example 6.2.1 shows how β̂j depends on h1(Z), ..., hd(Z) in general. For fixed 1 ≤ k0 < d,
\[
E(\hat\beta_j \mid \mathbf Z) = \frac{E(Y h_{jk_0}(Z) \mid \mathbf Z)}{E h_{jk_0}^2(Z)} + o_P(1) \tag{12.6.11}
\]
where h_{jk0}(Z) = hj(Z) − Π(hj(Z) | hl(Z), 1 ≤ l ≤ k0, l ≠ j) and Π is the projection on the linear span of h1(Z), ..., h_{k0}(Z). This reduces to
\[
E(\hat\beta_j) = E\Big\{\sum_{l=1}^{d} \frac{h_l(Z)\,h_{jk_0}(Z)\,\beta_l}{E h_{jk_0}^2(Z)}\Big\} + o(1)
= \beta_j + \sum_{l=k_0+1}^{d} \frac{E\{h_l(Z)\,h_{jk_0}(Z)\}}{E h_{jk_0}^2(Z)}\,\beta_l + o(1) . \tag{12.6.12}
\]
Thus, β̂j is biased, with the bias depending on d. Its true variance is not the same as that obtained by acting as if k̂ = k0 is correct. If we let d → ∞, which makes k̂ → ∞, the bias goes away, but at a rate slower than n^{-1}. This is the same phenomenon we have observed with nonparametric regression and kernel density estimates. The minimax root MSE rate is of order smaller than n^{-1/2}, and both bias and variance are of that order. Although asymptotic normality still holds, the mean and variance depend on the smoothness assumptions made on the nonparametric function being estimated.
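To make (12.6.8) concrete, here is a small sketch (ours; the basis vectors are invented for illustration) showing that the marginal estimator Σ Yi hj(Zi)/Σ h²j(Zi) recovers each coefficient exactly when the basis functions are orthogonal in the sample:

```python
def beta_hat(y, h):
    """Leading term of (12.6.8): sum(Y_i h_j(Z_i)) / sum(h_j(Z_i)^2)."""
    return sum(yi * hi for yi, hi in zip(y, h)) / sum(hi * hi for hi in h)

# Two exactly orthogonal, mean-zero basis columns evaluated at n = 4 points.
h1 = [1.0, -1.0, 1.0, -1.0]
h2 = [1.0, 1.0, -1.0, -1.0]
y = [2.0 * a + 0.5 * b for a, b in zip(h1, h2)]   # beta_1 = 2, beta_2 = 0.5
b1, b2 = beta_hat(y, h1), beta_hat(y, h2)
```

With non-orthogonal basis columns the same marginal formula picks up exactly the bias terms displayed in (12.6.12).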
12.6.3 Dimension Reduction via Principal Component Analysis

Approximating nonparametric models by finite dimensional parametric models is one way of reducing the dimension of the problem, and in Sections 9.1.4, 11.4.2, 12.1.1 and 12.2.3 we have seen how sieves play a role in this approach. Techniques which also play a major role involve approximating high dimensional data by lower dimensional data. A very important method in this regard is principal component analysis (PCA), which is applied to samples {Xi : 1 ≤ i ≤ n} of p dimensional vectors where p is large and can exceed n. It reduces the number of covariates by replacing them with a small number of weighted sums of the covariates. The theory is developed in Rao (1973), Seber (1984) and Anderson (2003), while applications can be found in Mardia et al. (1979); see also Examples 8.3.11 (continued), 9.1.5, Problem 8.3.24, Johnson and Wichern (2003), and Hastie, Tibshirani and Friedman (2009).
The focus is on the covariance matrix Σ(P) of X = (X1, ..., Xp)^T, where the X's have been centered so that EX = 0. By spectral theory (see Appendix B.10),
\[
\Sigma(P) = \sum_{j=1}^{K} \lambda_j(P)\,\xi_j(P)\,\xi_j^T(P)
\]
where K ≤ p, the λj are the (not necessarily distinct) eigenvalues of Σ(P), and the p dimensional eigenvectors {ξj} are orthonormal. The linear combination
\[
Z_j = \xi_j^T \mathbf X, \quad 1 \le j \le K,
\]
is called the jth principal component (PC). A small K corresponds to a sparse subspace of the parameter space of Σ.
Although the eigenvalues up to this point appear as artificial constructs, they are quite natural parameters. If λ1(P) > ··· > λp(P) are distinct, the eigenvector ξ1 corresponding to λ1(P) solves the problem
\[
\xi_1 = \arg\max\{\operatorname{Var}(\mathbf a^T \mathbf X) : |\mathbf a| = 1\} .
\]
That is, the first principal component Z1 is the linear combination Σj a1j Xj, subject to Σj a²_{1j} = 1, with the maximum variance; equivalently, ξ1 is the maximizer of a^T Σ a. Moreover (Problem 12.6.3),
\[
\lambda_1(P) = \operatorname{Var}\Big(\sum_{j=1}^{p} \xi_{1j} X_j\Big) = \xi_1^T\,\Sigma\,\xi_1 . \tag{12.6.13}
\]
The second principal component Z2 is the linear combination Σj a2j Xj, subject to Σj a²_{2j} = 1 and a1^T a2 = 0, with the maximum variance, and so on. Also
\[
\lambda_2(P) = \operatorname{Var}\Big(\sum_{j=1}^{p} a_{2j} X_j\Big) \quad\text{and}\quad \sum_{l=1}^{p} \lambda_l(P) = \sum_{j=1}^{p} \operatorname{Var} X_j .
\]
Even more, ξ1 gives the direction of the line in R^p through the origin closest, on average, to X. More precisely, let x = αa, α ∈ R, be the line through the origin in direction a, where a^T a = 1. The squared distance between an arbitrary point y ∈ R^p and the line is y^T y − (a^T y)² (Problem 12.6.4). Next consider an unobserved random vector X in R^p with E(X) = 0 and Cov(X) = Σ. Finding the line αa such that the expected squared distance of X from the line is a minimum means maximizing
\[
E(\mathbf a^T \mathbf X)^2 = \mathbf a^T E(\mathbf X \mathbf X^T)\,\mathbf a = \mathbf a^T \Sigma\,\mathbf a .
\]
In other words, ξ1 solves the minimization problem
\[
(\alpha_1, \xi_1) = \arg\min_{\alpha,\,\xi}\{E|\mathbf X - \alpha\xi|^2 : \alpha \in \mathbb R,\ |\xi| = 1\} .
\]
We have shown that ξ1 defines the one dimensional (dependent on L(X)) linear subspace of R^p which is, on average, closest to X. Similarly, ξ2 defines the line in R^p through the origin perpendicular to αξ1 closest to X, etc.
The eigenvectors and eigenvalues can be estimated from the empirical covariance matrix Σ̂ using established methods such as the singular value decomposition, yielding λ̂j and ξ̂j = (ξ̂1j, ..., ξ̂pj)^T, 1 ≤ j ≤ p. The estimated eigenvalues are arranged in decreasing order and thresholded to 0 when they become too small. The resulting first d < p eigenvectors and corresponding eigenvalues give a dimension reduction by replacing the original {Xi; 1 ≤ i ≤ n} by the sample principal components {Ẑi; 1 ≤ i ≤ n}, where
\[
\hat Z_{ij} = \sum_{k=1}^{p} \hat\xi_{kj} X_{ik}, \quad 1 \le j \le d,\ 1 \le i \le n .
\]
These principal components are uncorrelated. It follows that if we regress {Yi} on {Ẑij} using least squares, the fitted regression is
\[
\hat{\mathbf Y} = \bar Y \mathbf 1 + \sum_{j=1}^{d} \hat\eta_j \hat{\mathbf Z}_j
\]
where Ẑ_j = (Ẑ_{1j} − Z̄̂_j, ..., Ẑ_{nj} − Z̄̂_j)^T and
\[
\hat\eta_j = \sum_{i=1}^{n} (Y_i - \bar Y)(\hat Z_{ij} - \bar{\hat Z}_j) \Big/ \sum_{i=1}^{n} (\hat Z_{ij} - \bar{\hat Z}_j)^2 .
\]
In the context of the regression model where we observe (X1, Y1), ..., (Xn, Yn) i.i.d., Xi ∈ R^p, Y ∈ R with p large, sparse modeling with small d makes it possible to do classification and prediction using (Ẑ1, Y1), ..., (Ẑn, Yn).
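The PCA machinery above can be sketched end to end in the bivariate case, where the 2 × 2 eigendecomposition is available in closed form (our illustration; practical implementations for general p use the SVD of the centered data matrix):

```python
import math

def pca_2x2(xs):
    """First principal component of bivariate data: closed-form
    eigendecomposition of the empirical 2x2 covariance matrix."""
    n = len(xs)
    mx = sum(x for x, _ in xs) / n
    my = sum(y for _, y in xs) / n
    sxx = sum((x - mx) ** 2 for x, _ in xs) / n
    syy = sum((y - my) ** 2 for _, y in xs) / n
    sxy = sum((x - mx) * (y - my) for x, y in xs) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace and determinant.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc
    # Unit eigenvector for the largest eigenvalue.
    vx, vy = (sxy, lam1 - sxx) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    xi1 = (vx / norm, vy / norm)
    # Sample principal components (scores) of the centered data.
    scores = [(x - mx) * xi1[0] + (y - my) * xi1[1] for x, y in xs]
    return lam1, lam2, xi1, scores

lam1, lam2, xi1, scores = pca_2x2([(-1.0, -1.0), (0.0, 0.0),
                                   (1.0, 1.0), (2.0, 2.0)])
```

For points lying exactly on a line, λ̂2 = 0 and the first sample PC carries all the variance, matching (12.6.13).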
Association studies

In the preceding PCA approach the effects of the individual original variables on the response are not apparent. In genome-wide association studies (GWAS) this is fixed by singling out one variable Xk at a time and regressing Yi on {Xik, Ẑij}, 1 ≤ i ≤ n, 1 ≤ j ≤ d. When the number p of X's is larger than n, this approach is difficult to implement because Ẑij is unstable. However, for testing the hypothesis that Xk and Y are uncorrelated, the software package "eigensoft" gives tests that produce approximately correct Bonferroni p-values for balanced case-control studies, where the number of cases and the number of controls are nearly the same, and other assumptions are satisfied (see Yang (2013), and Yang, Doksum, and Tsui (2014)).
Remark 12.6.2. Because Σ and Σ̂ depend on the scale of the X's, the corresponding correlation matrices are often used in their place. ✷
Remark 12.6.3. One approach to sparse PCA is to use only the first 10 principal components, which often can be justified by checking that the λ̂j, j > 10, are nearly zero. When p is large, using d = 10 PCs often produces nearly the same results as d chosen by model selection procedures, e.g., cross-validation. Another approach is to replace Σ_{j=1}^p a²_{kj} = 1 in the PC optimization problem with adjusted versions of the Lasso type condition Σ_{j=1}^p |a_{kj}| = 1. Adjustments are made to obtain convex optimization problems. See Hastie, Tibshirani, and Wainwright (2015) and Remark 8.3.2.

We do not discuss this further here but note again that for large p, the empirical eigenvectors and eigenvalues can be very poor estimates of the population quantities of interest. Here too, as in regression, we typically regularize, reflecting our belief that the data nearly live in a much lower dimensional subspace. For instance, λ_{k+1} = ··· = λp = 0 corresponds to a sparse subspace where p − k of the predictors are linearly dependent on the others and X is in a k dimensional subspace of R^p. We will discuss such issues even more briefly in our last Section 12.7, in which we point to enormous areas, barely touched by this book, and current areas of research interest.
Summary

In this section we have briefly focused on the huge body of research on model selection and dimension reduction of data, one of the most important issues in modern statistical practice. The reduction in model dimension centered on an asymptotic Bayes criterion, while for dimension reduction of data we briefly discussed principal component analysis. We had already considered, in a slightly different context, dimension reduction via Mallows' Cp and cross validation in Section 12.4.
12.7 Topics Briefly Touched and Current Frontiers
Complex data structures

As we have noted in the past, modern statistical data are full of structure. To the old types of temporal data, time series, have now been added spatio-temporal data, matrix time series, images (which can be thought of as spatial data, but have a special complexity and structure), networks as in genomics or the World Wide Web, dynamical computer models, hierarchical structures such as text, and so on.
Beyond a few examples in time series such as autoregressive processes we have not
discussed modelling such situations parametrically or nonparametrically mainly because
models are evolving for these complex situations.
For stationary Gaussian time series, the spectrum defines the most general semiparametric model, and its estimation and analysis have been highly developed (see, for example, Brillinger (2001)). However, just as with linear regression, the spectrum is the tool of choice in the Gaussian case; and while spectral techniques have been of great value even in the absence of Gaussianity, robustness and other considerations have led to time domain modelling.
Here semiparametric models such as general autoregressive and more general Markov
and hidden Markov models have been used and analyzed. Asymptotics as the time series
lengthens have been the major theoretical tools. Bootstrap and other Monte Carlo techniques have been developed for some of these models (Politis, Romano and Wolf (1999),
Bühlmann and Künsch (1995), Künsch (1989), and Bickel, Götze and van Zwet (1997)).
Hierarchical, latent variable and Bayesian graphical models
There has been an enormous revival of Bayesian modelling. It has taken the form of
hierarchical models, or, more generally, Bayesian graphical models of the type also familiar in the frequentist world, such as factor analysis, random effects, and mixed models, i.e.
models where the observed variable structure is postulated to be due to unobservable or unknown hidden variables. A mixed model, for example, might be modelling genetic factors
in Type II diabetes. To what extent do genetic factors account for Type II diabetes? This development has been driven by the possibility of incorporating partial scientific knowledge
into the statistical model in this way and also because, with the advent of the MCMC techniques discussed in Section 10.4, computation of posterior distributions becomes, in principle, feasible in complex situations. Accounts may be found in Berger (1985) and Gelman,
Carlin, Stern, Dunson, Vehtari and Rubin (2014). These approaches have proved successful in the context of prediction and machine learning and are favored by many Bayesians, frequentists, and, the most common type of statistician these days, pragmatists. See Wainwright and Jordan (2008). The reason for this, as we indicated earlier, is that no matter
how a prediction algorithm is constructed its value is judged nonparametrically through
checking predictions from cross validation based on training samples on known cases.
From a scientific point of view there is an interest in establishing causal relationships, importance of variables, etc. Equivalently, we may start with a graph of variables (nodes) and edges (associative relations). We want to end with a directed graph where the directions represent causation. This can be problematic. We typically observe association between variables which can be causally explained in many ways. For instance, if we observe (X, Y) ∼ N2(0, 0, 1, 1, ρ) we can't tell whether Y was generated from Y = ρX + ε1, ε1 ⊥ X, ε1 ∼ N(0, 1 − ρ²), or from X = ρY + ε2 with ε2 ⊥ Y and ε2 distributed as ε1. However, some causative relations can be ruled out as being inconsistent with others. For instance, a loop in the graph, A → B → C → A, is problematic since it would make A, through B and C, a cause of itself. Important applications of these notions have appeared in statistics and biostatistics; see Pearl (2009), Rubin (1990), and van der Laan and Robins (2003). Not infrequently, the hierarchical Bayesian approach appears dangerous. Distributional and causal assumptions are thrown in at various points of the hierarchy only for convenience of computation, and if final posterior probabilities are taken seriously, it is difficult to disentangle what is coming from the data, what is coming from reasonable scientific assumptions, and what is coming from convenience assumptions. This is a problem shared by frequentists building hierarchical models as much as by Bayesians. It seems plausible that what matters in a complex hierarchical model are the parametric assumptions made along the way, rather than whether one puts a prior at the root of a hierarchical model step or uses maximum likelihood. This last remark is supported by the Bernstein-von Mises theorem (Theorems 5.5.2 and 6.2.3).
Modelling Deterministic Situations
Computer models of great complexity for traffic, wildfires, and atmospheric science
are now common; see Sacks et al (1989). Although, in principle, these situations are deterministic and can, again in principle, be modelled only by subjective Bayesians, in fact
probabilistic models are very useful and important tools in their analysis. For instance, numerical PDE computations running on super computers for climate prediction are “assimilated” with data (with stochastic error structure) to produce far more accurate predictions
than could be done using data alone.
Computational questions

An issue that high dimensionality and possibly sample size have made paramount is computation time. Despite the ever increasing speed and capabilities of computers, running all possible regressions on a set of n observations with p covariates requires on the order of 2^p n operations, an impossibly large number if p is even in the hundreds. Things are, of course, much worse for optimization algorithms for functions on R^p arising routinely in maximum likelihood, or for that matter for Markov chain Monte Carlo simulations in high dimensional spaces. It has been clear that to get good statistical performance, we need to add computational performance, to get so-called polynomial (P) rather than non-polynomial (NP) algorithms, although even that may be excessive. That is, if p characterizes the number of parameters, numbers, or objects that the algorithm needs to deal with, then, for some α, a (P) algorithm takes at most on the order of p^α operations to yield an answer, whatever the values of the parameters. An (NP) algorithm either needs more than p^α operations for every α in some situations or has unknown worst case performance. The issues are,
in fact, rather subtle since, as for decision procedures, this is based on a worst case analysis, and an algorithm may perform well in practice even though it is known to be NP. A famous example is the simplex method for solving linear programming problems. These issues are only now starting to be treated systematically; see Chandrasekaran and Jordan (2013) for instance.
Behaviour of classical procedures when p, n are comparable or p ≫ n

It has been known for some time in the statistical physics and probability communities that if p/n → c > 0, then the behaviour of various statistical functions of the empirical covariance matrix is very different than if p/n → 0. A striking example, studied by Wachter (1978), Geman (1984), and more recently by El Karoui (2005) following classical work in statistical physics, is that of the empirical covariance matrix of n i.i.d. vectors which are multivariate normal with mean 0 and variance-covariance matrix the p × p identity. The distribution of the eigenvalues of this matrix does not tend to point mass at 1, as they should and do for fixed p. Moreover, for slight perturbations of the identity covariance, the empirical eigenvectors have strange properties. Thus, principal component analysis (PCA) becomes suspect. Similar difficulties arise for the classical Gaussian linear discriminant function (Bickel and Levina (2008a,b)). These can be studied by using models with different forms of sparsity and regularization, such as those we have analysed earlier in this chapter for regression. See Remark 8.3.2 for references to these studies of sparse PCA.
It is worth considering situations where the vector of parameters is unstructured a priori, such as the p × p population covariance matrix when sampling from a p-dimensional population, and situations where some of the population parameters are structured, for instance, the Fourier coefficients of a smooth density function appearing in a semiparametric model. Here, smoothness corresponds to a sparsity assumption, i.e., most of the coefficients vanish. Examples of this type of large p situation have been dealt with in Sections 12.3, 12.4, and 12.6. Another example of a situation where many parameters appear but sparsity is assumed is the following:
Studying many parameters simultaneously from testing and related points of view

In genomics and other fields, it has become routine to test many null hypotheses at the same time, in the expectation that only some small fraction of these, if any, will not be null. The classical approach here has been to use the Bonferroni inequality (see Section 4.4). That is, if there are N hypotheses, to ensure that the probability of a Type I error, in the sense of making any mistake whatsoever, is no more than α, one tests at level α/N. The probability of at least one Type I error when testing N hypotheses is called the family-wise error rate (FWER). That is, if V denotes the number of true null hypotheses rejected, then FWER = P(V ≥ 1). If the model distribution P assumes that all N null hypotheses are true, then FWER ≤ α is called weak control of FWER, while if P is not required to satisfy this assumption, FWER ≤ α is called strong control of FWER. The popularity of the Bonferroni multiple testing method is in part due to:

Proposition 12.7.1. Suppose we have available tests φ1, ..., φN for testing N null hypotheses H1, ..., HN. Let p̂j denote the p-value of φj for testing Hj, j = 1, ..., N. Then the multiple testing procedure that rejects the Hj with p̂j ≤ α/N strongly controls FWER.
Proof. Let J0 = {j : Hj is true} and let N0 be the cardinality of J0. Then
\[
FWER = P(V \ge 1) = P\Big(\bigcup_{j \in J_0} \{\hat p_j \le \alpha/N\}\Big) \le \sum_{j \in J_0} P(\hat p_j \le \alpha/N) \le \frac{N_0}{N}\,\alpha \le \alpha . \qquad ✷
\]
Holm (1979) constructed a simultaneous testing procedure that strictly improves on the Bonferroni method: Let p(1) ≤ ··· ≤ p(N) denote the ordered p-values for the N tests. Step down through the ordered p-values, rejecting the hypothesis corresponding to p(j) as long as
\[
p_{(j)} \le \frac{\alpha}{N - j + 1} ,
\]
and stopping at the first j for which the inequality fails. A proof of strong control of FWER for the Holm method is outlined in Problem 12.7.1.
When the number of hypotheses is large and α ≤ 0.1, the Bonferroni and Holm methods rule out all but the very strongest signals and thus make discovery of strong but less extreme signals impossible. To remedy this failing, Benjamini and Hochberg (1995) proposed replacing the Bonferroni criterion by bounding the False Discovery Rate (FDR), the expected ratio of the number of falsely rejected hypotheses to the total number of rejections, and derived a simple rule for doing so with desirable properties. This rule rejects the hypotheses Hj with pj ≤ p(ĵ), where
\[
\hat j = \arg\max\{ j : p_{(j)} \le j\alpha/N \} .
\]
If the p-values corresponding to the true Hj are independent, the rule controls the FDR at level α. An overview may be found in papers of Benjamini and Hochberg (1995), Lehmann, Romano and Shaffer (2003), Dudoit, Shaffer and Boldrick (2003), and others.
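The three procedures just described differ only in the threshold applied to the ordered p-values; a compact sketch (ours, with made-up p-values) makes the comparison explicit:

```python
def bonferroni(pvals, alpha):
    """Reject H_j iff p_j <= alpha/N (strong FWER control,
    Proposition 12.7.1)."""
    N = len(pvals)
    return [p <= alpha / N for p in pvals]

def holm(pvals, alpha):
    """Holm step-down: walk through the ordered p-values, rejecting
    while p_(j) <= alpha/(N - j + 1), stopping at the first failure."""
    N = len(pvals)
    order = sorted(range(N), key=lambda j: pvals[j])
    reject = [False] * N
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= alpha / (N - rank + 1):
            reject[j] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, alpha):
    """BH: find the largest j with p_(j) <= j*alpha/N and reject all
    hypotheses with p-value <= p_(j); controls FDR at alpha under
    independence."""
    N = len(pvals)
    ps = sorted(pvals)
    jhat = max((j for j in range(1, N + 1) if ps[j - 1] <= j * alpha / N),
               default=0)
    cut = ps[jhat - 1] if jhat else -1.0
    return [p <= cut for p in pvals]

pvals = [0.001, 0.012, 0.03, 0.04, 0.2]
```

On the same p-values the rejection sets are nested, Bonferroni ⊆ Holm ⊆ Benjamini-Hochberg, reflecting the move from FWER to FDR control.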
The monograph of Efron (2010) gives an important overview of issues in modern large
data contexts from an empirical Bayesian point of view. This is an area of continuing
research — see, for example, the problems and a related solution to a different problem in
Q. Li et al (2011) as well as the study of other features of the distribution of p values which
underlies the work of Benjamini and Hochberg.
Dimension reduction, clustering, and related concepts
Sufficiency is not usually a useful concept in the high dimensional world we are now
dealing with. But dimension reduction and the related concepts, clustering, or, as the information theorists call it, “lossy compression,” are essential for analysis and understanding.
Such methodologies interact intimately with the regularization methods we have mentioned. If we do not regularize or, more specifically, set some parameters to 0, the resulting analysis may give us an incorrect qualitative picture of the mechanics behind the data.
We end with an augmentation of a famous quotation from Box (1979):
“Models, of course, are never true but fortunately it is only necessary that they be
useful” to which we would like to add
“Useful models generate conclusions which can point to new verifiable questions.”
Randomized methods as part of statistics
A number of procedures have arisen in recent years in addition to the bootstrap, MCMC
and similar methods, which provide new methodology for high dimensional data. An intriguing example now called “sketching” is a class of methods for dimension reduction.
Section 12.8
Problems and Complements
381
Here random projections of the data points into lower dimensional subspaces are analyzed,
yielding both computational and possibly inferential advantages at a possible cost of information loss. We see this as an emerging frontier on the borders of statistics and computation. See Baraniuk, Cevher, and Wakin (2010) as a recent nonstatistical reference.
A method using randomization intrinsically that we discuss more fully is Random Forests (RF). This method for classification and regression (supervised learning) and clustering (unsupervised learning) was introduced by Breiman (2001) (see also Amit and Geman (1997)) as a follow-up to CART. In this approach, ensembles of random classification or regression trees are grown on randomly chosen training sets. Classification is determined by majority voting, and regression by averaging the empirical predictors produced by the trees. The randomness is introduced in two ways.
1) Each tree is grown on a bootstrap sample of the same sample size as the training
sample. Each bootstrap sample, because of sampling with replacement, leaves out
about 30% of the data vectors. This leaves, as a complement of the bootstrap training
sample, an independent subsample for assessing importance of predictors (also called
features). But the trees are typically dependent since the training samples overlap.
2) Even more significantly, when the vector of features/predictors is large, each tree uses only a small random subset of features on which to determine a split. That is, the CART splits are each made by optimizing over their own small subset of variables, of the order of 5, with the sets varying randomly from split to split. For classification, although this is not necessary, the trees are grown to purity: each leaf contains only cases of one data type.
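The two randomization devices can be seen in a few lines of code. The sketch below is not Breiman's full algorithm — it uses depth-one trees (stumps) and a simplistic median split rule purely for brevity — but it shows exactly where the bootstrap resampling and the random feature subsets enter.

```python
import numpy as np

def grow_stump(X, y, feat_idx):
    """Depth-1 tree: choose the best split among a random subset of features.
    Splitting at the median is a simplification of CART's exhaustive search."""
    best = None
    for j in feat_idx:
        t = np.median(X[:, j])
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        lv = 1.0 if left.mean() >= 0 else -1.0     # majority label on each side
        rv = 1.0 if right.mean() >= 0 else -1.0
        acc = (np.sum(left == lv) + np.sum(right == rv)) / len(y)
        if best is None or acc > best[0]:
            best = (acc, j, t, lv, rv)
    return best[1:]

def forest_fit(X, y, n_trees=100, n_feat=2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                   # device 1: bootstrap sample
        feats = rng.choice(d, size=n_feat, replace=False)   # device 2: random feature subset
        trees.append(grow_stump(X[boot], y[boot], feats))
    return trees

def forest_predict(trees, X):
    votes = sum(np.where(X[:, j] <= t, lv, rv) for j, t, lv, rv in trees)
    return np.where(votes >= 0, 1, -1)                      # majority vote
```

Even with such weak base trees, the vote-aggregated classifier is typically far more accurate and more stable than any single tree grown on the same data.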
Various advantages accrue. The method is much more stable than CART since it smooths by averaging, even though the trees are not independent. More significantly, the use of different variables at each split and the use of subsamples enable efficient exploration of high dimensional spaces. The details of RF make analysis of the method difficult; however, its empirical performance is superb, and research on and understanding of the method have recently advanced. See Hastie, Tibshirani and Friedman (2009) for recent references, and Scornet, Biau, and Vert (2015).
Summary
In this final section we have listed a number of topics which are of active interest at
least to us. Many equally workable directions have not been mentioned. But this is a book
of “Selected Topics.”
12.8
Problems and Complements
Problems for Section 12.1
1. With the notation of Section 12.1, suppose, initially, Z = (1, Z1)^T where Z1 ∈ [0, 1]. Show that if we enlarge Z to Z* = (1, Z1, Z1², …, Z1^d)^T, then for any continuous µ(Z1) and ε > 0, there exist d and a ≡ (a0, …, ad) such that
E(µ(Z1) − a^T Z*)² ≤ ε².
382
Prediction and Machine Learning
Chapter 12
Hint: Use the Weierstrass approximation theorem.
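The conclusion of Problem 1 can be checked numerically. The sketch below (the particular continuous µ is an arbitrary choice for illustration) computes the least squares polynomial approximation on a fine grid and shows the L2 error shrinking as the degree d grows, as the Weierstrass theorem guarantees.

```python
import numpy as np

def poly_l2_error(mu, d, n_grid=2001):
    """Root mean squared error of the best degree-d polynomial fit to mu on [0, 1]."""
    z = np.linspace(0.0, 1.0, n_grid)
    coef = np.polyfit(z, mu(z), d)               # least squares in the basis 1, z, ..., z^d
    return np.sqrt(np.mean((mu(z) - np.polyval(coef, z)) ** 2))

mu = lambda z: np.exp(np.sin(2 * np.pi * z))     # an arbitrary continuous mu
errors = [poly_l2_error(mu, d) for d in (1, 3, 5, 9, 15)]
```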
2. Suppose x1 , . . . , xN are N points in general position in Rd and (y1 , . . . , yN ) are associated signs y ∈ {−1, 1}. Show that there exists a 1–1 map Φ : Rd → RM for some M ≥ d
and a vector v ∈ RM such that
sgn(vT xi ) = yi ,
i = 1, . . . , N .
Hint: Let e1 , . . . , eN be the coordinate axes in RN , map xi onto ei , and let v = (y1 , . . . ,
yN )T .
Problems for Section 12.2
1. Let f̂_h(x) denote the multivariate convolution kernel estimate. Assume that K has convex compact support and is of order 1, and assume that D²f(x) ≤ M for all x ∈ S(f) and all f, some M > 0. Show that for x in the interior S⁰ of S(f),
E_F f̂_h(x) = f(x) + O(|h|²),
Var_F f̂_h(x) = (n|h|^d)^{−1} f(x) ∫ K²(u) du (1 + o(1)).
Hint: Follow the steps in the proofs of Propositions 11.2.1 and 11.2.2.
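A quick Monte Carlo check of these two expansions — a sketch under assumed values d = 1, the uniform kernel K(u) = 1[|u| ≤ 1]/2 (so ∫K² = 1/2), and N(0, 1) data; any smooth f would do:

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_at(x, data, h):
    """Kernel density estimate at x with the uniform kernel K(u) = 1[|u| <= 1]/2."""
    return np.mean(np.abs(data - x) <= h) / (2 * h)

# repeat the experiment to estimate the mean and variance of the estimator at x0
n, h, x0, reps = 2000, 0.05, 0.0, 400
vals = np.array([kde_at(x0, rng.normal(size=n), h) for _ in range(reps)])

f0 = 1 / np.sqrt(2 * np.pi)          # true N(0, 1) density at x0 = 0
var_theory = f0 * 0.5 / (n * h)      # (n h)^{-1} f(x) ∫K²(u) du
```

With h small, the simulated mean sits within O(h²) of f(0) and the simulated variance tracks the asymptotic formula.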
2. Assume that 0 < ∫ |D²f(x)|² dx < ∞. Let f̂_h(x) be as defined by (12.2.1).
(a) Show that the AMISE expression (12.2.5) for f̂_h(x) is minimized by
h = a_f n^{−1/(d+4)}
for some a_f > 0. Give an expression for a_f.
(b) Show that the minimum of the AMISE expression (12.2.5) is
b_f n^{−4/(d+4)}
for some b_f > 0. Give an expression for b_f.
(c) Evaluate the constant a_f ≡ a(Σ) in part (a) of this problem when X is N_d(0, Σ) and K_h(u) is a product kernel with K_j(u) = 1[|u| ≤ 1]/2. Let Σ̂ be the sample covariance matrix. Then
ĥ = a(Σ̂) n^{−1/(d+4)}
is an estimate of the h that yields the optimal asymptotic IMSE if the true density is N_d(µ, Σ). Here N_d(0, Σ) is called a reference distribution. See Section 11.2.3.
3. Boundary kernel density estimates. Consider h = (h, …, h). Suppose the support of f is a rectangle ∏_{j=1}^d [a_j, b_j] and that D²f(x) ≤ M for all x ∈ S(f) and some M > 0. Also assume that K(u) = ∏_{j=1}^d K0(u_j), where K0 is symmetric and has support [−1, 1]. Let B_λ = ∏_{j=1}^d [a_j, a_j + λh], 0 < λ < 1, denote the left boundary region, and let x_h = (a_1 + λh, …, a_d + λh) denote a point in B_λ.
(a) Show that the kernel estimate is asymptotically biased in B_λ. That is, as h → 0,
|f̂_h(x_h) − f_h(x_h)| = O_P(1).
(b) Let W(x) = ∫_{S(f)} K_h(z − x) dz and define
f̃_h(x) = [n W(x)]^{−1} Σ_{i=1}^n K_h(X_i − x) · 1(x ∈ S(f)).
Show that f̃_h(x_h) is asymptotically unbiased in B_λ. That is, as h → 0,
f̃_h(x_h) − f_h(x_h) = o_P(1).
Hint: See Problem 11.3.1. Jiang and Doksum (2003) give further asymptotic properties of f̃_h(x).
(c) Show that f̃_h(x) = β0(x, F̂), where
β0(x, F) = arg min_{β0} E{ [f(Z) − β0]² | X = x }
and (Z | X = x) has density
q_h(z | x) = K_h(z − x) ∏_{j=1}^d 1(a_j ≤ z_j ≤ b_j) / W(x).
Hint: See Section 11.3.
4. Let D^k be the differential operator defined in Section B.8, let ‖·‖_∞ denote sup norm over S, and let m_j and ν_j be as defined in Section 11.6.2. Suppose that inf{f(x) : x ∈ S} > 0, ‖Df‖_∞ < ∞, ‖D^{J+2}µ‖_∞ < ∞, ‖Dσ²‖_∞ < ∞, m_{J+2}(K) < ∞, ν_5(K) < ∞, m_j(K) = 0, j = 1, …, J, and J is even. Show that
(a) as n → ∞, h → 0, nh³ → ∞, uniformly for x ∈ S,
MSE( µ̂_NW(x) | X^{(n)} ) ≡ E( |µ̂_NW(x) − µ(x)|² | X^{(n)} ) = O_P(h^{J+2}) + O_P(n^{−1}h^{−d}).
(b) The minimizer of the asymptotic version of MSE( µ̂_NW(x) | X^{(n)} ) is of the form
h = c [ 1/((J + 2)n) ]^{1/(J+2+d)},
which leads to MSE( µ̂_NW(x) | X^{(n)} ) of order n^{−(J+2)/(J+2+d)}.
Hint: See Section 11.6.2.
5. In Example 12.2.2, show that if Π̂_j = C^{−1}, 1 ≤ j ≤ C, and if p̂_{jk}(·) denotes the kth nearest neighbour estimate, then the classifier based on replacing p_j(·) with p̂_{jk}(·) in the Bayes rule classifies I_{n+1} as I_{ĵ_k}, where ĵ_k is the subscript of the kth nearest neighbour to X_{n+1}.
6. Assume that d = 1, that the support of f is [a, b] with a and b finite, that sup{|f″(x)| : x ∈ [a, b]} ≤ M, and that k = k_n → ∞, n^{−1}k_n → 0 as n → ∞. Then the kth nearest neighbour classifier is consistent.
Hint: See Section 11.2 and Problem 11.4.3.
7. Assume that f satisfies the conditions of Problem 6 above. Show that the bandwidth ĥ that corresponds to the nearest neighbour classifier satisfies ĥ = O_P(k_n/n).
8. Suppose (X_i, Y_i) are i.i.d. where Y_i = 1 with probability π0, Y_i = −1 with probability 1 − π0, and X_i ∼ N(µ0 Y_i, Σ0) given Y_i. Assume 0-1 loss and that µ0, Σ0 and π0 are known.
(a) Show that the Bayes classification rule for Y_{n+1} given X_{n+1} is
Ŷ_{n+1} = 1 if (X_{n+1} − µ0)^T Σ0^{−1} (X_{n+1} − µ0) − (X_{n+1} + µ0)^T Σ0^{−1} (X_{n+1} + µ0) < 2 log[π0/(1 − π0)],
Ŷ_{n+1} = −1 otherwise; or equivalently,
Ŷ_{n+1} = 1 iff 2 µ0^T Σ0^{−1} X_{n+1} > log[(1 − π0)/π0].
(b) Show that the Bayes risk if π0 = 1/2 is
R_θ = 1 − Φ( (µ0^T Σ0^{−1} µ0)^{1/2} ).
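Both parts can be checked by simulation. The sketch below (with an arbitrary µ0 and Σ0) implements the equivalent linear rule of part (a) for π0 = 1/2 and compares the Monte Carlo misclassification rate with 1 − Φ((µ0^T Σ0^{−1} µ0)^{1/2}).

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(x / sqrt(2)))

rng = np.random.default_rng(0)
mu0 = np.array([1.0, 0.5])                    # arbitrary illustrative parameters
Sigma0 = np.array([[1.0, 0.3], [0.3, 1.0]])
w = np.linalg.solve(Sigma0, mu0)              # Σ0^{-1} µ0

n = 200_000
Y = rng.choice([-1, 1], size=n)               # π0 = 1/2
L = np.linalg.cholesky(Sigma0)
X = Y[:, None] * mu0 + rng.normal(size=(n, 2)) @ L.T   # X | Y ~ N(Y µ0, Σ0)

Yhat = np.where(X @ w > 0, 1, -1)             # the Bayes rule when π0 = 1/2
risk_mc = np.mean(Yhat != Y)
risk_theory = 1 - Phi(sqrt(mu0 @ w))          # 1 − Φ((µ0ᵀ Σ0^{-1} µ0)^{1/2})
```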
9. Assume (X_i, Y_i), i = 1, …, n, are i.i.d. as (X, Y) with Y = 1 or −1 and
logit P(Y = 1 | X) = Σ_{j=1}^d α_j X_j + α0
where logit(u) = log[u/(1 − u)], 0 < u < 1. This is the logistic regression model.
(a) Show that the Bayes classification rule given X_{n+1} = (X_{1,n+1}, …, X_{d,n+1})^T is
Ŷ_{n+1} = 1 iff Σ_{j=1}^d α_j X_{j,n+1} > −α0.
(b) Suppose that α is unknown. Show that as n → ∞ the rule
Ŷ_{n+1} = 1 iff Σ_{j=1}^d α̂_j X_{j,n+1} + α̂0 > 0,
where
α̂ = arg max_α Σ_{i=1}^n { 1(Y_i = 1)(Σ_{j=1}^d α_j X_{ji} + α0) − log(1 + exp(Σ_{j=1}^d α_j X_{ji} + α0)) },
converges to the Bayes rule.
(c) Show that if the density of X is π0 N(µ0, Σ0) + (1 − π0) N(−µ0, Σ0), then the Bayes rule of (a) coincides with that of Problem 12.2.8.
(d) Assume 0-1 classification loss. If we replace Σ0, µ0, π0 by their MLEs under the Gaussian model mentioned in Section 12.2.3, show that the resulting classifier (LDA) is, as n → ∞, no better than the logistic regression classifier and, in general, is worse.
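Part (b) can be illustrated numerically. The sketch below fits α̂ by Newton–Raphson (one standard way to compute the logistic MLE; the particular true values of (α0, α) are arbitrary choices) and checks that the plug-in rule agrees with the Bayes rule of part (a) over most of the sample space.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson for logit P(Y=1|x) = a0 + x^T a, with y coded in {-1, 1}."""
    Z = np.hstack([np.ones((len(X), 1)), X])   # prepend intercept column
    b = np.zeros(Z.shape[1])
    t = (y + 1) / 2                            # recode labels to {0, 1}
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ b))
        W = p * (1 - p)
        b += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (t - p))
    return b                                   # (a0_hat, a_hat)

rng = np.random.default_rng(0)
alpha0, alpha = -0.5, np.array([1.0, -2.0])    # arbitrary true parameters
n = 20_000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(alpha0 + X @ alpha)))
y = np.where(rng.uniform(size=n) < p, 1, -1)

b = fit_logistic(X, y)
# compare the plug-in rule 1{a0_hat + x^T a_hat > 0} with the Bayes rule
Xtest = rng.normal(size=(5000, 2))
agree = np.mean((b[0] + Xtest @ b[1:] > 0) == (alpha0 + Xtest @ alpha > 0))
```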
10. (a) Show that the minimizer in µ of
L(x, µ) ≡ (x − µ)² + λ|µ|
is given by soft thresholding with c = λ/2:
µ̂ = 0 if |x| ≤ c; µ̂ = x − c sgn(x) if |x| > c.
(b) Suppose Y is n × 1 and X is n × d. Deduce that if λ > 0 and n > d the Lasso:
Minimize |Y − Xβ|² + λ|β|_1
has a unique sparse solution β̂ = (β̂_1, …, β̂_d)^T where, for some S(λ) ⊂ {1, …, d},
β̂_j = 0, j ∈ S(λ).
(c) Show that if we replace |β|_1 = Σ_{j=1}^d |β_j| by |β|_c = Σ_{j=1}^d |β_j|^c, c > 1, then if Y has a density, β̂ is never sparse, with probability 1.
Hint. Use Lagrange multipliers or, equivalently, the Karush-Kuhn-Tucker (KKT) conditions.
Remark. The adaptive Lasso (Zou (2006)) replaces λ|β|_1 with λ Σ w_j|β_j|, where the w_j are weights of the form 1/|β*_j| with the β*_j consistent estimates of β_j, 1 ≤ j ≤ d. It has desirable consistency and oracle properties. Software is available online.
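The soft-thresholding rule of part (a), and the companion hard-thresholding rule that appears as the L0-penalty solution in Problem 13 of the Section 12.3 problems, can be written and verified directly; a sketch (the brute-force grid check is only illustrative):

```python
import numpy as np

def soft_threshold(x, c):
    """Minimizer of (x - mu)^2 + lambda|mu| with c = lambda/2:
    0 for |x| <= c, else x - c*sgn(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - c, 0.0)

def hard_threshold(x, lam):
    """Minimizer of (x - mu)^2 + lambda^2 * 1(mu != 0): keep x iff |x| > lam."""
    return np.where(np.abs(x) > lam, x, 0.0)

# brute-force check of the soft-thresholding formula on a fine grid of mu values
x, lam = 1.3, 1.0
mus = np.linspace(-3, 3, 60001)
brute = mus[np.argmin((x - mus) ** 2 + lam * np.abs(mus))]
```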
11. Let M (H) denote the margin of the separating hyperplane H. Show that the hyperplane
that maximizes M (H) can be found by solving (12.2.19).
Hint. H = {x : β T (x − x0 ) = 0} where β T x0 = β0 and π(x|β) = (β T x)β.
12. Verify (12.2.29).
386
Prediction and Machine Learning
Chapter 12
13. Argue geometrically that the support vector machine optimization problem may be stated as in (12.2.19).
14. Assume model (12.2.30). For M fixed show that the first Adaboost algorithm converges for fixed n > M to the classifier specified by the sample version of (12.2.30), i.e.
δ(Z) = 1 iff Σ_{j=1}^M α̂_j h_j(Z) > 0
where α̂ = arg min Q(α, P̂_n) is assumed to exist.
Hint. Apply the methods used to prove Theorem 12.2.2.
15. Assume model (12.2.30). Show that as n → ∞,
(a) the minimizer of Q(α, Pbn ) converges to the minimizer of Q(α, P ), and
(b) if the true distribution P (z, y) is such that Z has finite support, then the first Adaboost
algorithm is Bayes consistent in the sense of (12.2.10).
16. Suppose logit P(Y = 1 | Z) = Σ_{j=1}^∞ α_j h_j(Z), with α_j ≠ 0 for all j, and Adaboost 1 is iterated forever using the functions h_1, h_2, h_3, …. Show that Adaboost 1 doesn't converge.
17. (a) Show that minimizing R(α, P) in (12.2.31) with λ = (2γ)^{−1} is equivalent to solving (12.2.24).
(b) Show that if λ = 0 in (12.2.31) and the Bayes classifier for model (12.2.30) is of the form δ_B = sgn( Σ_j α*_j h_j(Z) ), then δ_B minimizes R(α, P) with λ = 0 and achieves the minimum Bayes risk P( Y Σ_{j=1}^M α*_j h_j(Z) < 0 ).
(c) Show that under the assumptions of (b), support vector machines can be put in the framework of boosting.
18. (Breiman et al. (1984)). Define a measure of impurity of a finite probability distribution {p_1, …, p_K} ∈ S_K, the K-simplex, as a function φ : S_K → R such that
(i) φ is maximized uniquely for p_j = 1/K, 1 ≤ j ≤ K;
(ii) φ is minimized for any e_j ≡ {δ_{ij} : i = 1, …, K}, where δ_{ij} = 1(i = j);
(iii) φ is symmetric.
(a) Show that with 0 log 0 = 0 the following are both impurity measures with minimum value zero:
φ1 = 1 − Σ_{j=1}^K p_j²;  φ2 = −Σ_{j=1}^K p_j log p_j.
(b) Show that both φ1 and φ2 are strictly concave functions of p; that is,
φ(αp_1 + (1 − α)p_2) ≥ αφ(p_1) + (1 − α)φ(p_2)
for all α ∈ [0, 1], with equality for α ∈ (0, 1) iff p_1 = p_2.
Hint. Because a sum of concave functions is concave, it suffices to check concavity coordinatewise (K = 1).
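The two measures, and the nonnegativity of the purity gain they induce (see Problem 19(a)), are easy to code; a small sketch:

```python
import numpy as np

def gini(p):
    """phi1(p) = 1 - sum p_j^2: zero at a vertex, maximal at the uniform distribution."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """phi2(p) = -sum p_j log p_j, with the convention 0*log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def purity_gain(phi, parent, children, weights):
    """Delta(S) = phi(parent) - sum_j pi_j phi(child_j); nonnegative by concavity
    when parent = sum_j pi_j child_j."""
    return phi(parent) - sum(w * phi(c) for w, c in zip(weights, children))
```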
19. Suppose Y ∈ {1, …, K}, so that there are K classes. Let t be a node with class fractions (p_1, …, p_K). Split the node into K pieces with corresponding compositions (p_{1j}, …, p_{Kj}), j = 1, …, K, Σ_{i=1}^K p_{ij} = 1, where π_j is the fraction assigned to descendant node j.
(a) Show that for φ1, φ2 as in Problem 18, for any such split S, if we define the total purity gain of the descendant nodes by
∆(S) ≡ φ(p_1, …, p_K) − Σ_{j=1}^K π_j φ(p_{1j}, …, p_{Kj}),
then ∆(S) ≥ 0.
Hint. Use concavity.
(b) Show that for K = 2, φ = φ2, the splitting rule described in the text as the T algorithm is
Ŝ ≡ arg max_S ∆(S).
(c) Show that ∆(S) = 0 iff p_{ij} = p_i for all i, j.
20. (a) Show that if f, g are densities of P and Q with respect to µ,
sup_A |P(A) − Q(A)| = (1/2) ∫ |f − g| dµ.
(b) Deduce that if X = {x_1, …, x_m} and (p_1, …, p_m), (q_1, …, q_m) are corresponding vectors of probabilities,
max_A [ Σ{p_j : j ∈ A} − Σ{q_j : j ∈ A} ] = (1/2) Σ_j |p_j − q_j|.
Hint. |P(A) − Q(A)| = |∫_A (f − g) dµ|. Take A = {x : f(x) ≥ g(x)}.
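Part (b) can be checked directly on a small support, comparing the closed form with brute-force maximization over all subsets A (feasible only for small m); a sketch:

```python
import numpy as np

def tv_distance(p, q):
    """(1/2) sum |p_j - q_j|: the maximal discrepancy max_A |P(A) - Q(A)|."""
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def tv_by_enumeration(p, q):
    """max over all 2^m subsets A of |P(A) - Q(A)|."""
    m = len(p)
    best = 0.0
    for mask in range(1 << m):
        a = [(mask >> j) & 1 for j in range(m)]
        best = max(best, abs(sum(pj * aj for pj, aj in zip(p, a))
                             - sum(qj * aj for qj, aj in zip(q, a))))
    return best
```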
21. Consider the following splitting scheme, called twicing, for a node t with fractions (p_1, …, p_K). Let C = {j_1, …, j_m}, C ⊂ {1, …, K}, C̄ = {1, …, K} − C. Split the node t into t_L, t_R such that the node t_L corresponds to (p_{1L}, …, p_{KL}) and similarly for t_R. If π_L, π_R are the fractions of the two nodes, let p(1|t) = Σ{p_j : j ∈ C}. Define p(1|t_L), p(1|t_R) similarly. Then
p(1|t) = π_L p(1|t_L) + π_R p(1|t_R).
Show that the purity gain from a split into two classes C and C̄ is
∆(S, C) = 2[ p(1|t)(1 − p(1|t)) − π_L p(1|t_L)(1 − p(1|t_L)) − π_R p(1|t_R)(1 − p(1|t_R)) ].
22. Suppose K = 2 and φ = φ1 in the previous problem. Given a split S, let t, t_L, t_R be the parent, left, and right nodes with compositions (p(1|t), 1 − p(1|t)), (p(1|t_L), 1 − p(1|t_L)), (p(1|t_R), 1 − p(1|t_R)), where p(1|t), p(1|t_L), p(1|t_R) are the (conditional) probability masses for class 1 and π_L = 1 − π_R is the marginal probability of being assigned to t_L. Show that
∆(S, C) = 2 π_L π_R ( p(1|t_L) − p(1|t_R) )².
Hint. Write p(1|t_L) = 1/2 + ∆_1, p(1|t_R) = 1/2 + ∆_2, p(1|t) = 1/2 + π_L ∆_1 + π_R ∆_2. Note that p²(1|t) + (1 − p(1|t))² = 1 − 2p(1|t)(1 − p(1|t)) and
∆(S, C) = −2[ π_L(1/4 − ∆_1²) + π_R(1/4 − ∆_2²) − (1/4 − (π_L ∆_1 + π_R ∆_2)²) ].
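The asserted identity can be confirmed numerically against the two-class purity gain of Problem 21; a sketch:

```python
import numpy as np

def two_class_gain(pL, pR, piL):
    """phi1 purity gain for a binary split:
    2[p(1-p) - piL*pL(1-pL) - piR*pR(1-pR)], with p = piL*pL + piR*pR."""
    piR = 1.0 - piL
    p = piL * pL + piR * pR
    return 2 * (p * (1 - p) - piL * pL * (1 - pL) - piR * pR * (1 - pR))

def closed_form(pL, pR, piL):
    """The closed form of Problem 22: 2 piL piR (pL - pR)^2."""
    return 2 * piL * (1 - piL) * (pL - pR) ** 2
```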
23. Let
p(C|t) = Σ{p_j : j ∈ C}, p(C|t_L) = Σ{p_{jL} : j ∈ C}, p(C|t_R) = Σ{p_{jR} : j ∈ C},
and let ∆(S, C) be as in Problem 21. Show that the split S(C*) which maximizes ∆(S, C) over all S, C may be obtained by maximizing
π_L π_R ( Σ_j |p_{jL} − p_{jR}| )²
over all splits, yielding s*, and the optimal
C* = {j : p(j|t*_L) ≥ p(j|t*_R)}
where t*_L, t*_R are the results of s*.
24. Show that in the T algorithm for classification trees, if p_i > 0, i = 0, 1; C_{jm}, j = 1, …, 2^m, is the partition of R^m defined by level m of the tree; and C_m(z) is the member of the partition containing z, then C_m(z) ⊃ C_{m+1}(z) for all m and ∩_m C_m(z) = {z}.
Problems for Section 12.3
1. Justify the calculations in (12.3.7) using the conditions of Theorem 6.2.2.
2. Suppose (Z, I) has joint density p defined by the logistic regression model
log[ p(z, θ) / (1 − p(z, θ)) ] = θ^T z
where z is d × 1, θ ∈ R^d, and
p(z, θ) ≡ P[I = 1 | Z = z] = 1 − P[I = 0 | Z = z].
(a) Show that if X_i = (Z_i, I_i), 1 ≤ i ≤ n, are i.i.d., the joint conditional density of the I_i given the Z_i ≡ (Z_{i1}, …, Z_{id})^T follows a canonical exponential family with statistics
T_j(X) = Σ_{i=1}^n Z_{ij} I_i.
(b) Using classical MLE theory show that the conditional MLE θ̂ is asymptotically Gaussian, and give its limiting mean and variance-covariance matrix.
3. Let (Z, I) be as in Problem 12.3.2. Suppose the conditional density of Z given I = j is N(µ_j, Σ), j = 0, 1, and that P[I = j] = π_j, j = 0, 1.
(a) Show that the joint distribution of (Z, I) can be written, for suitable η = (µ, Σ) and g, as
g(Z, θ, η) exp(I θ^T Z) (1 + e^{θ^T Z})^{−1}.
(b) Assume 0-1 classification loss. Deduce that if Z ∼ g(·, θ, η), then the linear discriminant analysis (LDA) classification rule is better than the logistic regression classification rule in terms of regret, but the asymptotic Bayes risk of LDA is the same as that of logistic regression.
(c) Show that if g is not of the form in (a), LDA can be arbitrarily worse than logistic regression and even not asymptotically Bayes.
Hint. See Problem 12.2.9.
4. Show that in the nonparametric regression example, (12.3.27) and (12.3.28) define a prior distribution with
|µ(z_1, ε) − µ(z_2, ε)| ≤ d^{−1/2} |z_1 − z_2|.
5. Establish the Fourier identities (12.3.47).
6. For the framework of Section 12.3.4 show that for any prior π and quadratic loss there
is a prior with independent components which has the same Bayes risk.
Hint. Given π consider the prior with independent components and the same marginals as
π.
7. Show that in the Gaussian contiguous case, c = 1/2 is the best asymptotic choice to put in the Bayes test ϕ_B, and that the Bayes risk (12.3.16) then converges to 1 − Φ(σ/2) as n → ∞.
Hint. Λ_n ⇒ N(σ²/2, σ²) under P^{(n)} ∈ Ω_n, and Φ̄( (t + σ²/2)/σ ) + Φ( (t − σ²/2)/σ ) is minimized for t = 0. See Example 3.3.2.
8. Consider the notation in the proof of Fano's inequality. Establish the existence and uniqueness of the decision rule δ̂(x) = j iff ρ(θ̂, P_j) < r_m/2.
9. For P as in Remark 12.3.2(b), show that the minmax risk of the Nadaraya-Watson estimate is no smaller in order than n^{−2s/(2s+d)}.
Hint: Apply Fano’s inequality for suitable P1 , . . . , Pm .
10. Exhibit an estimate achieving the rate of Problem 12.3.9.
Hint: Use Nadaraya-Watson with kernel having vanishing moments.
11. In the framework of Proposition 12.3.2, show that by using the hard thresholding rule
δch (y) with suitable c = cn one can construct estimates minmax on Θ0,n .
12. Establish (12.3.49) and (12.3.50).
13. Consider the Gaussian shift model x_i = θ_i + ε_i, i = 1, 2, …, where the x_i are the observations and the θ_i are the parameters to be estimated. Assume that the parameter vector θ = (θ_1, θ_2, …) lies in an ellipsoid Σ_{i=1}^∞ i^{2m} θ_i² ≤ M, where M is a fixed positive constant and m is a known positive integer. The ε_i are independent normal random variables with mean 0 and variance n^{−1}.
In the following problems, you may use this fact: there exists a positive constant C_1 such that Σ_{i=1}^∞ (1 + λρ_i)^{−2} ≤ C_1 λ^{−1/2m} for all λ > 0, where ρ_i = i^{2m} − 1. You may also assume the existence of the solutions to the problems below.
(a) The penalized likelihood estimator θ̂ is the solution to the optimization problem
min_θ { Σ_{i=1}^∞ (x_i − θ_i)² + λ Σ_{i=1}^∞ ρ_i θ_i² }.
Show that if λ is of order O(n^{−2m/(2m+1)}), then there exists C_2 > 0 such that
Σ_{i=1}^∞ E(θ̂_i − θ_i)² ≤ C_2 n^{−2m/(2m+1)}.
(b) The projection estimator θ̃ is defined by
θ̃_i = x_i if i ≤ n^{1/(2m+1)}, and θ̃_i = 0 otherwise.
Show that there exists C_3 > 0 such that Σ_{i=1}^∞ E(θ̃_i − θ_i)² ≤ C_3 n^{−2m/(2m+1)}.
(c) Show that the estimator resulting from L1 penalty estimation,
min_θ { Σ_{i=1}^∞ (x_i − θ_i)² + 2λ Σ_{i=1}^∞ |θ_i| },
has the form θ̂_i = sgn(x_i)[|x_i| − λ]_+.
(d) Show that the estimator resulting from L0 penalty estimation,
min_θ { Σ_{i=1}^∞ (x_i − θ_i)² + λ² #{i : θ_i ≠ 0} },
has the form θ̂_i = x_i 1{|x_i| > λ}.
(e) Consider now a rectangular subset of the original ellipsoid of parameters: |θ_i| ≤ ξ_i, where the ξ_i are given positive numbers satisfying Σ_{i=1}^∞ (1 + ρ_i) ξ_i² ≤ M. Find the minimax linear estimator of the form θ̂_i = α_i x_i for all i, in terms of mean squared error.
Problems for Section 12.4
1. (a) Verify (12.4.10) and (12.4.11).
(b) Complete the proof of Theorem 12.4.1 by establishing the last inequality in the proof.
(c) Derive the results of Remark 12.4.1 (a).
2. χ² inequality. Suppose X ∼ χ²_m, X = Σ_{i=1}^m Z_i², with the Z_i ∼ N(0, 1) independent. Show that
P[ |X − m| > t√m ] ≤ 2 exp{ −(t²/4)(1 + t/√m)^{−1} } ≤ 4( e^{−t²/8} + e^{−√m t/8} ).
Hint: P[ X ≥ m + t√m ] ≤ inf_s e^{−s(√m + t)} (1 − 2s/√m)^{−m/2}, s < √m/2. The minimizing s satisfies
2s = t (1 + t/√m)^{−1} < √m;
use −log(1 − x) ≤ x + x², 0 ≤ x < 1/2. Similarly, P[ X < m − t√m ] = P[ m − X > t√m ] ≤ inf_s e^{s(√m − t)} (1 + 2s/√m)^{−m/2}.
3. Deduce that if X_m ∼ χ²_m, then for some universal constants c and D,
E sup{ ( |X_m − m| − c √m log n )_+ : 1 ≤ m ≤ n } ≤ D.
Hint: If sup_m ∆_{m,n} denotes the random variable above, then
E sup_m ∆_{m,n} ≤ K + ∫_K^∞ P[ sup_m ∆_{m,n} > t ] dt ≤ K + Σ_{m=1}^n ∫_K^∞ P[ ∆_{m,n} > t ] dt. (12.8.1)
By Problem 12.4.2,
∫_K^∞ P[ |X_m − m|/√m > c log n + t/√m ] dt ≤ 4 ∫_K^∞ e^{−(c log n + t/√m)²/8} dt + 4 ∫_K^∞ e^{−(t + c√m log n)/8} dt ≤ 4√m ∫_{c log n}^∞ e^{−v²/8} dv + C n^{−c√m/8}.
The claimed bound follows for c sufficiently large by substituting this bound in (12.8.1).
4. Show that if the η̂_i are as in the GWN model,
E sup_m { Σ_{i=m+1}^n η_i(η̂_i − η_i) − cρ ( Σ_{i=m+1}^n η_i² )^{1/2} : 1 ≤ m ≤ n } ≤ D √(log n).
Hint: ρ² Σ_{i=m+1}^n η_i² = Var( Σ_{i=m+1}^n η_i(η̂_i − η_i) ).

5. If R̃(m) is defined by
R̃(m) = 2σ_n² m − Σ_{j=1}^m η̂_j²,
show that
(a) E R̃(m) = σ_n² m − Σ_{j=1}^m η_j²;
(b) Var R̃(m) = Σ_{j=1}^m Var η̂_j² = 2σ_n⁴ m + 4σ_n² Σ_{j=1}^m η_j²;
(c) max_{Θβ} min_{1≤m≤n} E R̃(m) = O(n^{−2β/(2β+1)}).
Hint: Plug in m = n^{1/(2β+1)}.
6. Show that if m̂ is defined as in Theorem 12.4.2,
max{ E( Σ_{i=1}^{m̂} (η̂_i − η_i)² + Σ_{i=m̂+1}^n η_i² ) : Θβ } ≤ K n^{−2β/(2β+1)} log n for K = K(β).
Hint: By Problem 12.4.3 (using C, D, etc. generically),
E Σ_{i=1}^{m̂} (η̂_i − η_i)² − σ_n² E m̂ ≤ D + C σ_n² √(log n) E m̂^{1/2} ≡ ∆.
Expand η̂_i² = (η̂_i − η_i)² + 2(η̂_i − η_i)η_i + η_i², write Σ_{i=1}^{m̂} η_i² = Σ_{i=1}^n η_i² − Σ_{i=m̂+1}^n η_i², and use E Σ_{i=1}^n (η̂_i − η_i)η_i = 0 to deduce that
E( Σ_{i=1}^{m̂} (η̂_i − η_i)² + Σ_{i=m̂+1}^n η_i² ) ≤ E( 2σ_n² m̂ − Σ_{i=1}^{m̂} η̂_i² ) + Σ_{i=1}^n η_i² − 2 E Σ_{i=m̂+1}^n (η̂_i − η_i)η_i + 2∆.
By Problem 12.4.4,
| E Σ_{i=m̂+1}^n (η̂_i − η_i)η_i | ≤ C σ_n √(log n) E( Σ_{i=m̂+1}^n η_i² )^{1/2} + D.
Since m̂ minimizes R̃(m) = 2σ_n² m − Σ_{i=1}^m η̂_i², for any fixed m_0,
E( 2σ_n² m̂ − Σ_{i=1}^{m̂} η̂_i² ) ≤ E( 2σ_n² m_0 − Σ_{i=1}^{m_0} η̂_i² ) = σ_n² m_0 − Σ_{i=1}^{m_0} η_i²,
so that
E( Σ_{i=1}^{m̂} (η̂_i − η_i)² + Σ_{i=m̂+1}^n η_i² ) ≤ σ_n² m_0 + Σ_{i=m_0+1}^n η_i² + 2∆ + 2C σ_n √(log n) E( Σ_{i=m̂+1}^n η_i² )^{1/2} + 2D.
Absorb the square-root terms into the left-hand side using 2ab ≤ a² + b²: for instance,
2σ_n √(log n) E( Σ_{i=m̂+1}^n η_i² )^{1/2} ≤ 2σ_n² log n + (1/2) E Σ_{i=m̂+1}^n η_i²,
and similarly σ_n² E m̂^{1/2} ≤ (1/2)σ_n² + (1/2)σ_n² E m̂, with σ_n² E m̂ ≤ E Σ_{i=1}^{m̂}(η̂_i − η_i)² + ∆. Finally take m_0 = n^{1/(2β+1)}, so that σ_n² m_0 = n^{−2β/(2β+1)} and, on Θβ, Σ_{i=m_0+1}^n η_i² ≤ M m_0^{−2β} = M n^{−2β/(2β+1)}; the result follows.
7. Show that in Remark 12.4.1, if m̂ is selected by SBC (BIC), then
|η̂_{m̂}| ≤ σ_0 n^{−1/2} √(2 log n).
Problems for Section 12.5
1. Establish the sufficient conditions for A3 leading to Theorem 12.5.1.
Hint. Apply the law of large numbers, using (1) to reduce to the case ‖∆_S^{(Z)}‖_P² = 1. Apply Hoeffding's or a similar inequality.
2. (The Efron–Stein Inequality). Let T(x_1, …, x_{n−1}) be a symmetric function of n − 1 variables, and let X_1, …, X_n be i.i.d. as X ∈ R. Set T_i = T(X^{(−i)}) and T. = n^{−1} Σ_{i=1}^n T_i, where
X^{(−i)} = (X_1, …, X_{i−1}, X_{i+1}, …, X_n)^T.
(a) Show that
Var T(X_1, …, X_{n−1}) ≤ E Σ_{i=1}^n (T_i − T.)².
Hint. Use Var U = (1/2)E(U − U′)², where U, U′ are i.i.d. as U. Also, condition on X^{(−i)}.
(b) Use (a) to show that the leave-one-out cross-validation estimate of variance tends to overestimate the variance.
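A Monte Carlo sketch of part (a), comparing the two sides of the inequality for a nonlinear statistic (the sample maximum here; any symmetric T would do):

```python
import numpy as np

rng = np.random.default_rng(0)

def efron_stein_sides(T, n=12, reps=4000):
    """Estimate Var T(X_1..X_{n-1}) and E sum_i (T_i - T.)^2 by simulation."""
    tvals, bounds = [], []
    for _ in range(reps):
        X = rng.normal(size=n)
        Ti = np.array([T(np.delete(X, i)) for i in range(n)])   # leave-one-out values
        tvals.append(T(X[:-1]))                                 # T on an (n-1)-sample
        bounds.append(np.sum((Ti - Ti.mean()) ** 2))
    return np.var(tvals), np.mean(bounds)

var_T, bound = efron_stein_sides(np.max)
```

Up to Monte Carlo error, the estimated variance sits below the Efron–Stein bound, consistent with (a).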
3. Why crossvalidate? Consistent estimation of prediction error (PE). Let (X_1, Y_1), …, (X_n, Y_n), where X_i = (X_{i1}, …, X_{ip})^T, be i.i.d. as (X, Y), X ∈ R^p, Y ∈ R. Assume the model
Y_i = X_i^T β + ε_i, ε_i i.i.d. N(0, σ²),
with E|X|⁴ < ∞, Var(X) = Σ_0, where Σ_0 is nonsingular and known. Let X_D ≡ (X_{ij})_{n×p} be the design matrix and let β̂ = (X_D^T X_D)^{−1} X_D^T Y be the least squares (and maximum likelihood) estimate of β (see (6.1.14)). Show that if β = β_0 is known,
(a) The average prediction error when we use X_i^T β_0 to predict Y_i, 1 ≤ i ≤ n, is
PE = E n^{−1} Σ (Y_i − X_i^T β_0)² = σ².
(b) The estimate of PE obtained by replacing X_i^T β_0 by X_i^T β̂ in the formula in (a) is biased; in fact,
E[ n^{−1} Σ_{i=1}^n (Y_i − X_i^T β̂)² ] = σ² (1 − p/n).
(c) Show that if p is fixed,
β̂ = β_0 + n^{−1} Σ_0^{−1} Σ_{i=1}^n X_i ε_i + O_P(n^{−1}).
(d) Show that if we use the least squares estimate β̂_m based on (X_i, Y_i), i = 1, …, m, to estimate β, and we use (X_i, Y_i), i = m + 1, …, n, to estimate prediction error, then
PÊ ≡ (n − m)^{−1} Σ_{i=m+1}^n (Y_i − X_i^T β̂_m)² = σ² + O_P(n^{−1/2})
provided (m/n) → λ for some λ ∈ (0, 1).
Hint. For (b), condition on X_D. For (c) and (d) use the delta method and E(X_i^T Σ_0^{−1} X_i) = E|Σ_0^{−1/2} X_i|².
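Parts (b) and (d) can be seen in a short simulation — a sketch with arbitrary dimensions, where p/n is taken fairly large so the downward bias of the in-sample (apparent) error is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 50, 1.0       # p/n = 0.25, chosen to make the bias visible
beta = rng.normal(size=p)
m = n // 2

apparent, heldout = [], []
for _ in range(200):
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    apparent.append(np.mean((y - X @ b_full) ** 2))         # in-sample MSE; mean σ²(1 − p/n)
    b_half = np.linalg.lstsq(X[:m], y[:m], rcond=None)[0]
    heldout.append(np.mean((y[m:] - X[m:] @ b_half) ** 2))  # sample-split estimate of PE
```

The apparent error underestimates σ² by the factor (1 − p/n), while the held-out estimate stays above σ² (it also carries the extra estimation error of β̂_m, which vanishes as m grows for fixed p).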
Problems for Section 12.6
1. Show that if we apply the Bayesian criterion (12.6.5) to choose m in the sieve method for estimating θ in Θβ of the GWN model as in Theorem 12.4.2, and if η is in the closure of ∪_{m=1}^∞ P_m, then we obtain the minimax risk rate (log n) n^{−2β/(2β+1)} rather than the n^{−2β/(2β+1)} obtained by using Stein's unbiased risk estimate (Mallows' C_p), as shown in Problems 12.4.2–12.4.6.
2. Consider the one-sample model
Yi = µ + εi , i = 1, . . . , n
with {εi } i.i.d. N (0, 1). Let M0 be the model with µ = 0 and let M1 be the model with
µ ∈ R arbitrary.
Section 12.8
395
Problems and Complements
(a) Find the model selection rule based on the Bayesian criterion (12.6.5).
(b) Show that the rule in (a) selects the correct model with probability tending to one as
n → ∞.
(c) The AIC criterion with known σ² = 1 takes the form
AIC(M_k) = 2 Σ_{i=1}^n log p(y_i | θ̂_k) − 2k.
Find the model selection rule based on AIC.
Find the model selection rule based on AIC.
(d) Show that the rule in (c) satisfies P_{µ=0}(Decide model M_1) = P(χ²_1 > 2). That is, it provides a test with level P(χ²_1 > 2). It does not select the correct model with probability tending to one as n → ∞.
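Parts (b) and (d) can be illustrated by simulation. For this one-sample problem, twice the log-likelihood gain of M_1 over M_0 is n ȳ², the AIC penalty is 2, and the Bayesian (BIC/SBC) penalty grows like log n — the sketch below assumes that standard form of the criteria.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_model(y, criterion):
    """Return 1 if M1 (mu free) is selected over M0 (mu = 0); sigma^2 = 1 known."""
    n = len(y)
    gain = n * y.mean() ** 2                  # 2*(loglik of M1 - loglik of M0)
    penalty = 2.0 if criterion == "AIC" else np.log(n)
    return 1 if gain > penalty else 0

def selection_rate(mu, n, criterion, reps=2000):
    return np.mean([choose_model(mu + rng.normal(size=n), criterion)
                    for _ in range(reps)])
```

Under µ = 0, AIC picks the larger model with probability near P(χ²_1 > 2) ≈ 0.157 no matter how large n is, while the Bayesian rule's error probability shrinks with n; both select M_1 essentially always when µ ≠ 0 is fixed.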
3. Show that the maximal eigenvalue of the covariance matrix Σ is given by
λ_1(P) = Var( Σ_{j=1}^p a*_j X_j )
where |a*| = 1 and a* = arg max{ Var( Σ_{j=1}^p a_j X_j ) : |a| = 1 }.
4. In Section 12.6.3, show that the squared distance between a point y ∈ R^p and the line x = αa, |a| = 1, is y^T y − (a^T y)².
Hint. Consider the point t ∈ Rp where the perpendicular from y to the line intersects the
line. Then t = αa where α must satisfy aT (y − αa) = 0. Because aT a = 1, this gives
α = aT y. Thus
|y − t|2 = |y − (aT y)a|2 .
Next expand the right hand side.
Problems for Section 12.7
1. Let p_(1) ≤ … ≤ p_(N) be the ordered p-values and let H_(j) denote the null hypothesis corresponding to p_(j), 1 ≤ j ≤ N. Set p_(N+1) = 1 and define
j* = min_{1≤j≤N+1} { j : p_(j) > α/(N + 1 − j) }.
The Holm (1979) multiple testing procedure rejects H_(1), …, H_(j*−1). Show that this method strongly controls FWER at level α.
Hint. Let J_0 = {j : H_(j) is true}, N_0 = cardinality of J_0, and j_0 = N − N_0 + 1. Let A be the event "p_(j*) < α/N_0" and B = "p_(j) > α/N_0 for all j ∈ J_0." Show that B ⇒ "j* < j_0" ⇒ A, and that P(B) ≥ 1 − α.
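The step-down procedure is short to implement; the sketch below returns the rejection set in the original order, and the FWER claim can be checked by simulating null p-values.

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: reject H_(1),...,H_(j*-1), where j* is the first rank j
    with p_(j) > alpha/(N + 1 - j). Returns a boolean vector in original order."""
    p = np.asarray(pvals, dtype=float)
    N = len(p)
    order = np.argsort(p)
    reject = np.zeros(N, dtype=bool)
    for rank, idx in enumerate(order):       # rank = j - 1
        if p[idx] > alpha / (N - rank):      # threshold alpha/(N + 1 - j)
            break                            # stop at the first non-rejection
        reject[idx] = True
    return reject
```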
2. Optimal multiple testing.
(a) (Spjøtvoll, 1972). Let g_{01}, …, g_{0N} and g_1, …, g_N be integrable functions of a data vector X and let S(γ) be all multiple tests ψ = (ψ_1, …, ψ_N) of the null hypotheses H_1, …, H_N that satisfy
Σ_{j=1}^N ∫ ψ_j(x) g_{0j}(x) dν(x) = γ (12.8.2)
for a given probability ν. Show that for an appropriate c > 0, the multiple test
ψ = (ψ_1, …, ψ_N) = { 1( g_j(x) > c g_{0j}(x) ) : 1 ≤ j ≤ N } (12.8.3)
maximizes
Σ_{j=1}^N ∫ ψ_j(x) g_j(x) dν(x) (12.8.4)
among all tests ψ ∈ S(γ).
Hint. Extend the proof of the Neyman-Pearson Lemma.
Hint. Extend the proof of the Neyman-Pearson Lemma.
Remark. The functions gj and g0j need not be densities. They are general integrable
functions.
(b) Let R = Σ_{j=1}^N 1( ψ_j(x) = 1 ) be the number of "discoveries," f_{0j}(x) and f_j(x) densities of X under the hypothesis H_j and alternative A_j, respectively, g_{0j} ν = [f_{0j}/R] 1(R > 0) if H_j is true, 0 otherwise, and g_j ν = [f_j/R] 1(R > 0) if H_j is false, 0 otherwise. With this notation, by definition, (12.8.2) is the False Discovery Rate (FDR) and (12.8.4) is the Correct Discovery Rate (CDR). Give the rule that maximizes the CDR subject to FDR = γ.
(c) (Spjøtvoll (1972), Storey (2007), Sun and Cai (2007)). Give the multiple test that maximizes the expected number of true positives for a fixed level γ of the expected number of false positives.
(d) Suppose we express ψ_j in a multiple test rule as ψ_j = ψ(p̂_j) = 1(p̂_j ≤ a), where p̂_j is the p-value for the hypothesis H_j. Express the solutions to (b) and (c) in terms of p-values.
Hint. When H_j is true, p̂_j ∼ Unif[0, 1].
(e) Bayes. Let π_j denote the prior probability that H_j is true, and let f_0 and f_j denote the densities of X when H_j is true and false, respectively. The posterior probability of H_j is
P(H_j | x) = π_j f_0(x) / [ π_j f_0(x) + (1 − π_j) f_j(x) ].
Consider the Bayes rule ψ_B = (ψ_B^1, …, ψ_B^N) that rejects those H_j with P(H_j | x) ≤ q for some preassigned q ∈ (0, 1). Then
ψ_B = { 1( q(1 − π_j) f_j(x) > (1 − q) π_j f_0(x) ) : 1 ≤ j ≤ N }.
Thus, if π_j = π, ψ_B is optimal in the sense of problem (a).
(f) Empirical Bayes (Efron 2008, 2010). Suppose we write ψ_j in the multiple test rule as ψ_j(Z) = 1(Z_j ≤ b_j), where Z_j = Φ^{−1}(p̂_j) and Φ is the N(0, 1) df. Then Z_j ∼ N(0, 1) when H_j is true. Let f_0 denote the N(0, 1) density, let π_j denote the prior probability that H_j is true, and let f_j denote the density of Z_j when H_j is false. Then the posterior probability of the null hypothesis H_j given Z_j = z_j is
P(H_j | z_j) = π_j f_0(z_j) / [ π_j f_0(z_j) + (1 − π_j) f_j(z_j) ].
Now reject H_j if P(H_j | z_j) ≤ q, q ∈ (0, 1).
Efron (2010) operationalizes this rule by constructing estimates of π_j and f_j, thereby obtaining an empirical Bayes rule, which can be found in the R package locfdr (local False Discovery Rate).
3. False Discovery Distributions. Let T_1, …, T_N be test statistics for testing H_j : µ_j = 0, 1 ≤ j ≤ N. Each is based on a sample of size n. Assume that the joint distribution of T_1, …, T_N is defined by the equation
T_j = √n µ_j + ρZ + (1 − ρ²)^{1/2} ε_j
where Z, ε_1, …, ε_N are i.i.d. N(0, 1). Let p_j be the p-value of the test that rejects H_j for |T_j| large, let
R(u) = Σ_{j=1}^N 1(p_j ≤ u) = Σ_{j=1}^N 1(|T_j| > −z_{u/2})
be the number of discoveries, let J_0 = {j : µ_j = 0}, and let the number of false discoveries be
V(u) = Σ_{j∈J_0} 1(p_j ≤ u) = Σ_{j∈J_0} 1(|T_j| > −z_{u/2}).
(a) Let N_0 = #{j : µ_j = 0}. Show that the conditional distribution of V(u) given Z can be approximated by the distribution of
W(u) ≡ N_0 [ Φ( (z_{u/2} − ρZ)/√(1 − ρ²) ) + Φ( (z_{u/2} + ρZ)/√(1 − ρ²) ) ], Z ∼ N(0, 1).
(b) Show that (i) W(u) →P N_0 u as ρ → 0; and (ii) W(u) →P N_0 as ρ → 1. Find the limit as ρ → −1.
Hint for (ii): Condition on I = 1(Z > 0).
(c) Give the range of values of the probability P[V(u) ≥ 1] of at least one false rejection as ρ ranges from −1 to 0 and from 0 to 1.
(d) Let J_1 = {j : µ_j ≠ 0} and N_1 = #{j : µ_j ≠ 0}. Find a variable S(u) whose distribution approximates the conditional distribution of R(u) given Z when µ_i = µ/√n and (i) N_1/N_0 → 0, (ii) N_1/N_0 → λ ∈ (0, 1).
(e) In (d) above, find the probability limit of S(u) as ρ → 0 and as ρ → ±1 for the cases (i) and (ii).
Fan, Han and Gu (2012), among others, have given approaches to FDR for dependent p-values.
Appendix D
SOME AUXILIARY RESULTS
D.1
Probability Results
A collection {Y_{ni} : 1 ≤ i ≤ n} of random variables is called a double array if the probability distribution P_{ni} of Y_{ni} depends on both i and n. An example is the regression model
Y_{ni} = α + β_n x_{ni} + ε_i,
where β_n = c/√n, n^{−1} Σ (x_{ni} − x̄_n)² → τ ∈ (0, 1), and ε_1, …, ε_n are i.i.d. Such models are used when investigating asymptotic power. See Section 6.3.2. Another example is the "bootstrap" asymptotic theory of Section 10.3.4. A key result for double arrays is
1. The Lindeberg-Feller Central Limit Theorem.
Theorem D.1 (Lindeberg-Feller) Suppose that for each fixed n, Y_{n1}, …, Y_{nn} are independent with 0 < σ²_{ni} ≡ Var(Y_{ni}) < ∞. Set σ_n² = Σ_{i=1}^n σ²_{ni}. Then, for n → ∞,
max_{1≤i≤n} σ²_{ni} / σ_n² → 0 (D.1.1)
and
(1/σ_n) Σ_{i=1}^n [Y_{ni} − EY_{ni}] →L N(0, 1) (D.1.2)
iff for each ε > 0,
lim_{n→∞} (1/σ_n²) Σ_{i=1}^n E[ (Y_{ni} − EY_{ni})² 1(|Y_{ni} − EY_{ni}| > εσ_n) ] = 0. (D.1.3)
Corollary D.1 Suppose Y_1, …, Y_n are i.i.d. as Y ∼ P, where P does not depend on n. Let µ = EY, σ² = Var Y, and T_n = Σ_{i=1}^n c_i Y_i. Then, for n → ∞, the following are equivalent:
( T_n − µ Σ_{i=1}^n c_i ) / ( σ² Σ_{i=1}^n c_i² )^{1/2} →L N(0, 1), (D.1.4)
v_n ≡ ( Σ_{i=1}^n c_i² ) / max_{1≤i≤n} c_i² → ∞. (D.1.5)
400
Supplements to Text
Appendix D
Proof. See Problem D.1.2.
Theorem D.2 The condition (D.1.3) holds for all ε > 0 if for some δ > 0
Σ_{i=1}^{k_n} E|Y_{ni} − EY_{ni}|^{2+δ} = o(σ_n^{2+δ}). (D.1.6)
Remark D.1 (D.1.3) is called Lindeberg’s Condition and (D.1.6) is called Liapunov’s Condition.
Remark D.2 Suppose n^{−1}σ_n² → c for some c > 0. Then (D.1.3) becomes
n^{−1} Σ_i E[ (Y_{ni} − EY_{ni})² 1(|Y_{ni} − EY_{ni}| > εσ_n) ] → 0. (D.1.7)
A multivariate version is
Theorem D.3. (Multivariate Lindeberg-Feller.) Suppose for each n, Y_{n1}, …, Y_{n,k_n} are independent random vectors in R^d with Cov(Y_{ni}), 1 ≤ i ≤ k_n, finite. Let |·| denote Euclidean norm. If for every ε > 0,
lim_{n→∞} Σ_{i=1}^{k_n} E[ |Y_{ni}|² 1(|Y_{ni}| > ε) ] = 0
and
Σ_{i=1}^{k_n} Cov(Y_{ni}) → Σ
for Σ positive definite, then
Σ_{i=1}^{k_n} (Y_{ni} − EY_{ni}) →L N(0, Σ).
Remark D.3. In Theorem D.3, the Lindeberg condition is for uncentered Yni . This turns
out to be convenient for the asymptotics of bootstrap methods. Clearly Theorem D.3 also
holds if Lindeberg’s condition is for centered Yni .
2. Almost Sure Convergence and the Borel–Cantelli Lemma
Almost sure convergence of a sequence of random variables and Kolmogorov’s Strong
Law of Large Numbers (SLLN) appear in Section B.7 of Volume I. Here are some useful
additional facts about a.s. convergence.
D.4 Convergence in probability and a.s. convergence. Z_n →P Z as n → ∞ iff every subsequence {Z_{n_k}}, k ≥ 1, has a sub-subsequence {Z_{n′_k}}, {n′_k} ⊂ {n_k}, k ≥ 1, such that Z_{n′_k} →a.s. Z as k → ∞.
D.5 If Z_n →a.s. Z, then Z_n →P Z.
D.6 Z_n →a.s. Z iff, for every ε > 0, lim_{n→∞} P( sup_{m≥n} |Z_m − Z| > ε ) = 0.
D.7 Uniform SLLN (Chung (1951)). Let Z1, Z2, . . . be i.i.d. as Z ∈ R^d, Z ∼ P ∈ P. Let | · | be Euclidean norm and let Z̄n = n^{-1} Σ_{i=1}^n Zi. If E|Z| < ∞ and

lim_{M→∞} sup_{P∈P} EP |Z| 1(|Z| ≥ M) = 0 ,

then the SLLN holds uniformly; that is, for every ε > 0,

lim_{n→∞} sup_{P∈P} P( sup_{m≥n} |Z̄m − EP(Z)| ≥ ε ) = 0 .
D.8 The Borel–Cantelli Lemma. Suppose

Σ_{n=1}^∞ P[ |Zn − Z| ≥ ε ] < ∞

for all ε > 0. Then Zn →a.s. Z as n → ∞.
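Fact D.8 is the standard route to strong laws: if the tail probabilities are summable for every ε, convergence is almost sure. A small sketch for Bernoulli sample means, where Hoeffding's inequality supplies the summable bound (the value p = 0.3 and the Hoeffding bound itself are assumptions outside this appendix):

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps, n_max = 0.3, 0.05, 10_000

# Hoeffding: P(|Zbar_n - p| >= eps) <= 2 exp(-2 n eps^2), summable in n,
# so D.8 applied to Zbar_n gives Zbar_n -> p almost surely.
n = np.arange(1, n_max + 1)
bounds = 2.0 * np.exp(-2.0 * n * eps**2)
tail_sum = bounds.sum()                     # finite: the series converges fast

x = rng.random(n_max) < p                   # one Bernoulli(p) sample path
zbar = np.cumsum(x) / n
print(tail_sum, abs(zbar[-1] - p))
```

The summability of `bounds` is the hypothesis of D.8 with Z = p; the simulated path's running mean settling within ε of p illustrates the conclusion on one realization.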
3. Kolmogorov’s Inequality
The following result is useful for establishing uniform convergence in probability of
nonparametric regression estimates. See Problems 11.6.8 and 11.6.9.
Theorem D.4. Let X1, X2, . . . be independent random variables with E(Xj) = 0, E(Xj²) < ∞, and let Sj = Σ_{i=1}^j Xi. Then, for every ε > 0,

P( max_{1≤j≤n} |Sj| > ε ) ≤ Var(Sn)/ε² .
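Theorem D.4 can be checked by simulation; the Uniform(−1, 1) steps and the particular ε below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps, reps = 100, 12.0, 4000

x = rng.uniform(-1.0, 1.0, size=(reps, n))  # E X_j = 0, Var X_j = 1/3
s = np.cumsum(x, axis=1)                    # partial sums S_1, ..., S_n per row
p_hat = np.mean(np.abs(s).max(axis=1) > eps)

bound = (n / 3.0) / eps**2                  # Var(S_n) / eps^2 from Theorem D.4
print(p_hat, bound)
```

The Monte Carlo estimate of P(max_{j≤n} |Sj| > ε) stays below Var(Sn)/ε², typically by a comfortable margin, since the inequality is not sharp for light-tailed steps.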
D.2 Weak Convergence of Functions. Supplement to Section 7.1
Proof of Theorem 7.1.1. The full proof of this theorem may be found in van der Vaart
and Wellner (1996), for instance. However, it is easier and less technical to see that if q is
Lipschitz continuous, then (i) and (ii) imply that L(q(Zn (·))) → L(q(Z(·))). Here’s the
argument for q Lipschitz. Let tmj, 1 ≤ j ≤ km, m ≥ 1, be points such that tmj ∈ Tmj. Let Zn^(m)(t) = Σ_{j=1}^{km} 1(t ∈ Tmj) Zn(tmj) and define Z^(m) similarly. Comparing the characteristic functions of q(Zn) and q(Z), we compute

|E exp{isq(Zn)} − E exp{isq(Z)}|
  ≤ |E exp{isq(Zn)} − E exp{isq(Zn^(m))}|
  + |E exp{isq(Zn^(m))} − E exp{isq(Z^(m))}|
  + |E exp{isq(Z^(m))} − E exp{isq(Z)}| .  (D.2.1)
Let ∆m(Zn) = max{ sup{|Zn(s) − Zn(t)| : s, t ∈ Tmj} : 1 ≤ j ≤ km } and let Amn = {∆m(Zn) ≤ εm}. Note that |Zn − Zn^(m)|∞ ≤ ∆m(Zn). We claim that

|E exp{isq(Zn)} − E exp{isq(Zn^(m))}| ≤ |s| E|q(Zn) − q(Zn^(m))| 1(Amn) + 2P[Amn^c] .  (D.2.2)
This follows from |eia − eib | ≤ |a − b| and |eia | ≤ 1. But by the Lipschitz nature of q,
E|q(Zn ) − q(Zn(m) )|1(Amn ) ≤ M εm
and by (7.1.5), P [Acmn ] ≤ δm . By the separability of Z and the FIDI convergence of Zn (·),
we have P [∆m (Z) ≥ εm ] ≤ δm also (Problem 7.1.1), and the third term in (D.2.1) can be
bounded just as the first term has been. Combining these bounds, we have from (D.2.1),
lim sup_n |E exp{isq(Zn)} − E exp{isq(Z)}|
  ≤ 2(M|s|εm + δm) + lim sup_n |E exp{isq(Zn^(m))} − E exp{isq(Z^(m))}| .  (D.2.3)
The second term in (D.2.3) is 0 by the FIDI convergence of Zn^(m)(·) to Z^(m)(·). Now, let m → ∞ to complete the proof. Note that the Z^(m) as defined yield tightness of Z(·).
✷
The general proof requires showing that (D.2.3) implies that, for every ε > 0, there exists a compact set Kε such that P[Zn(·) ∈ Kε] ≥ 1 − ε and P[Z(·) ∈ Kε] ≥ 1 − ε. By a classical result, functions which are continuous on a compact set are uniformly continuous.
Proof of Theorem 7.1.2. Consider the bracket set collection, Tm , corresponding to δ =
m−α , α > 1 to be chosen later, m = 1, 2, . . .. Without loss of generality, we can suppose
that the bracket sets Tmj form a partition of T in the sense that if Tmj = (f mj , f¯mj ) ∈ Tm ,
j = 1, . . . , Jm , then Tmj ∩ Tmk = ∅ unless j = k. This can be achieved in such a
way (Problem D.2.2) that (ii) will still hold for some constant c. Further, let gmj be a
representative member of Tmj for each j, m. It follows that we can identify each f ∈ T by
a sequence (j1 (f ), j2 (f ), . . . , ) where f ∈ Tm,jm (f ) , m = 1, 2, . . .. That such jm (f ) exist
follows from the partition property.
Let gm = g_{m,jm(f)}, and define f̄m, f_m similarly. That (j1(f), j2(f), . . .) characterize W_P^0(f) (and f) (up to sets of probability 0) follows from

W_P^0(f) = W_P^0(g1) + Σ_{m=1}^∞ (W_P^0(g_{m+1}) − W_P^0(g_m))  (D.2.4)

in the sense that

W_P^0(g1) + Σ_{j=1}^m (W_P^0(g_{j+1}) − W_P^0(g_j)) = W_P^0(g_{m+1})  (D.2.5)

and

EP(f − g_{m+1})²(X) ≤ EP(f̄_{m+1} − f_{m+1})²(X) ≤ m^{-α} → 0  (D.2.6)
as m → ∞. Let b, w1, w2, . . . be non-negative weights, which will in fact depend on λ, with b + Σ_{m=1}^∞ wm = 1. Pick m0, which will also depend on λ, and write

W_P^0(f) = W_P^0(g_{m0}) + Σ_{m=m0}^∞ (W_P^0(g_{m+1}) − W_P^0(g_m)) .

Note that

[|W_P^0(f)| ≥ λ] ⊂ [|W_P^0(g_{m0})| ≥ bλ] ∪ ∪_{m=m0}^∞ [|W_P^0(g_{m+1}) − W_P^0(g_m)| ≥ wm λ] .
Hence, using (D.2.5),

P[ sup{|W_P^0(f)| : f ∈ T} ≥ λ ]
  ≤ P[ max{|W_P^0(g_{m0,j})| : 1 ≤ j ≤ J_{m0}} ≥ bλ ]
  + Σ_{m=m0+1}^∞ P[ max{|W_P^0(g_{m+1,j}) − W_P^0(g_{mj′})| : 1 ≤ j ≤ J_{m+1}, 1 ≤ j′ ≤ J_m, T_{m+1,j} ∩ T_{mj′} ≠ ∅} ≥ wm λ ] .  (D.2.7)
The bound (D.2.7) is fundamental. It comes from a generalization of the classical P[|X + Y| ≥ ε] ≤ P[|X| ≥ ε/2] + P[|Y| ≥ ε/2], bounding |W_P^0(g_m)| by max{|W_P^0(g_{mj})| : 1 ≤ j ≤ J_m} and using a similar bound for the differences. Go further and replace the terms on the right-hand side of (D.2.7) by

Σ_{j=1}^{J_{m0}} P[ |W_P^0(g_{m0,j})| ≥ bλ ]
  + Σ_{m=m0+1}^∞ Σ_{j=1}^{J_{m+1}} Σ_{j′=1}^{J_m} P[ |W_P^0(g_{m+1,j}) − W_P^0(g_{mj′})| ≥ wm λ ] 1(T_{m+1,j} ∩ T_{mj′} ≠ ∅) .  (D.2.8)
Although gm+1,j may not belong to Tm,jm (f ) , because f ∈ Tm,jm (f ) ∩ Tm+1,jm+1 (f ) , it is
still true that
E(gm+1,j − gmj )2 ≤ 2{E(gm+1,j − f )2 + E(gmj − f )2 } ≤ 4m−2α .
Therefore, we can use the bounds (7.2.8) and (D.2.6) to bound (D.2.8) by

2 J_{m0} exp(−λ²b²γ^{-2}/2) + 2 Σ_{m=m0+1}^∞ J_m J_{m+1} exp(−(λ²/8) wm² m^{2α}) .  (D.2.9)
Finally, use (ii) to bound (D.2.9) by

C1 m0^{αd} exp(−λ²b²γ^{-2}/2) + C2 Σ_{m=m0+1}^∞ m^{2αd} exp(−(λ²/8) wm² m^{2α}) ,  (D.2.10)
where C1 and C2 are generic constants. By taking wm = c m^{-α′}, α′ > 1, it is not hard to show (Problem D.2.4) that we can bound (D.2.10) as specified in the statement of the theorem.
✷
Discussion: This argument uses the idea of “chaining” introduced by Kolmogorov. It
exploits the representation of WP0 (f ) given by (D.2.4) which in turn reflects the “decimal”
representation of f . The key observation is that the increments in (D.2.4) have variances
which go down and, as a consequence of the bound (7.2.10), even if the deviation λ from
the mean 0 is weighted by a small wm , the probability of that deviation drops much more
rapidly than wm does. Note that this argument shows that the order of magnitude of the probability of deviation by λ or more of the maximum is essentially of the same order, exp{−λ²/2γ²}, as that of a single term.
D.3 Functional Derivatives. Supplement to Section 7.2
Recall that if f : O → R, where O ⊂ R^d is open, then, at a point x ∈ O, f is partially differentiable with partial derivatives (∂f/∂xj)(x) iff the limits

(∂f/∂xj)(x) = lim_{h→0} [f(x + h ej) − f(x)]/h

exist, where ej is the jth basis vector. The function f has a total differential iff a linear approximation to f holds. That is, suppose

Df(x)(t) = lim_{h→0} [f(x + ht) − f(x)]/h  (D.3.1)

holds uniformly for |t| bounded. Then Df(x) is the linear map t → Σ_{j=1}^p (∂f/∂xj)(x) tj, t = (t1, . . . , tp)^T, given by (D.3.1). As is well known, functions with partial derivatives but no total differential are easy to exhibit.
If f : B → R, where B is a linear space, and

DG f(x)(t) = lim_{h→0} [f(x + ht) − f(x)]/h ,  x, t ∈ B ,  (D.3.2)

exists and is well defined, DG is called the Gâteaux derivative of f at x in the direction t. The Gâteaux derivative is the analogue of the partial derivative. The more useful and stronger notions correspond to uniformity-in-t statements about the limit, with B a normed linear (Banach) space. That is, we want D such that

sup{ |f(x + ht) − f(x) − hDf(x)(t)| : t ∈ T } = o(h) .  (D.3.3)
If T ranges over the bounded sets in B, it follows that Df(x)(·) is a bounded linear functional on B and Df(x)(·) is called the Fréchet derivative, while if T ranges over the compact sets, Df(x)(·) is called the Hadamard derivative and Df(x) is a linear functional. If B = R^d the two notions produce the same total differential.
Our interest typically focuses on B = {all finite signed measures μ on R^d} with total variation norm ‖μ‖ = sup{|μ(A)| : A ∈ A}, where A is the class of all sets in R^d to which we can assign probability. Here "all signed measures μ on R^d" is the class of all μ on A of the form {aP − bQ : a, b ∈ R, P and Q probabilities on R^d}. For this B, a natural subspace of linear functionals can be identified with the space of all bounded continuous functions q on R^d, with corresponding linear functional Lq(μ) = ∫ q(v) dμ(v), where dμ(v) = a dF(v) − b dG(v) with F and G equal to the df's corresponding to P and Q, v ∈ R^d. If Df(x) is of the form Lq and Df is Hadamard, then if t is a signed measure,

Df(x)(t) = ∫ q(v) dt(v)  (D.3.4)

and q(x) = Df(x)(δx), where δx is point mass at x.
Identifying these quantities with objects in Rd we see that (D.3.4) corresponds to the
usual formula for the derivative in a particular direction and q(x) corresponds to the partial
derivatives in the directions of the coordinate axes. See Example 7.2.1, for these analogies.
If B is not the space of all signed measures but just the convex cone of probabilities, we can still carry out the limit (D.3.2) in a direction t = G − F at x = F, where F and G are df's, since then F + h(G − F) = (1 − h)F + hG, a member of the cone if 0 ≤ h ≤ 1. The q we obtain is then not unique since ∫ [q(v) + c] d(G − F0)(v) = ∫ q(v) d(G − F0)(v), but if we require ∫ q(v) dF0(v) = 0 we obtain the derivative as a unique influence function.
The Hadamard sense is what is needed to carry out the delta method, while its computation is via the Gâteaux derivative. See van der Vaart (1998) for a careful treatment using this notion of derivative and the preceding treatments of Huber (1974) and Reeds (1976).
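The contamination construction above can be carried out numerically: for a plug-in functional T, the influence function at x is the Gâteaux derivative of T along (1 − h)F + h δx. A sketch for the mean functional, whose influence function q(x) = x − ∫ v dF(v) is classical; the finite-difference helper below is an illustrative construction, not from the text:

```python
import numpy as np

def gateaux_derivative(functional, sample, x, h=1e-5):
    """(T((1 - h) F_hat + h delta_x) - T(F_hat)) / h for a weighted plug-in T."""
    m = len(sample)
    base = functional(sample, np.full(m, 1.0 / m))
    w = np.append(np.full(m, (1.0 - h) / m), h)   # contaminate F_hat toward delta_x
    return (functional(np.append(sample, x), w) - base) / h

def mean_functional(values, weights):
    return float(np.dot(values, weights))

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 1000)
x0 = 2.0
influence = gateaux_derivative(mean_functional, data, x0)
print(influence, x0 - data.mean())                # the two agree: q(x) = x - mean
```

Because the mean functional is linear in F, the finite difference is exact up to rounding; for nonlinear functionals (e.g. the median) the same recipe gives a numerical approximation to the influence function.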
D.4 Asymptotics for the Cox Estimate. Supplement to Section 9.2.2
The Influence Function of the Cox Partial Likelihood Estimate
1. Derivation of the Gâteaux derivative of Γ̇τ(β, P). To find the Gâteaux derivative set

Pε = (1 − ε)P + εQ,  Pjε = (1 − ε)Pj + εQj;  pjε = dPjε, j = 1, 2 .

Then we seek

(∂/∂ε) Γ̇τ(β, Pε)|ε=0 = (∂/∂ε)[ −∫_0^τ (Ṡ0/S0)(t, β, Pε) dP2ε(t) + ∫ 1(t ≤ τ) z dPε(z, t) ]|ε=0 ≡ I + II .

By taking Q1 to be point mass at Z, we find

II = Z 1(T ≤ τ) − E[Z 1(T ≤ τ)] .
Write S0(ε), Ṡ0(ε) for S0(t, β, Pε), Ṡ0(t, β, Pε) and S0, Ṡ0 for S0(0), Ṡ0(0). For I, we need

(∂/∂ε)(Ṡ0/S0)(ε) = [(∂/∂ε) Ṡ0(ε)] S0^{-1} − Ṡ0 S0^{-2} (∂/∂ε) S0(ε) ,
where

(∂/∂ε) S0(ε) = (∂/∂ε) E_{Pε}[e^{β^T Z} 1(T ≥ t)] = e^{β^T Z} 1(T ≥ t) − S0

is obtained by letting Q be point mass at X = (Z, T). Similarly,

(∂/∂ε) Ṡ0(ε) = Z e^{β^T Z} 1(T ≥ t) − Ṡ0 .

Thus

(∂/∂ε)(Ṡ0/S0)(ε)|ε=0 = S0^{-1}[Z e^{β^T Z} 1(T ≥ t) − Ṡ0] − S0^{-2} Ṡ0 [e^{β^T Z} 1(T ≥ t) − S0]
  = S0^{-1} e^{β^T Z} 1(T ≥ t) [Z − (Ṡ0/S0)] .
Because p2ε|ε=0 = p2 and ∂p2ε(t)/∂ε = q2 − p2 for t ≤ τ, this gives

I = −e^{β^T Z} ∫_0^T S0^{-1}[Z − (Ṡ0/S0)(t, β, P)] dP2(t) − (Ṡ0/S0)(T, β, P) + ∫_0^τ (Ṡ0/S0)(t, β, P) dP2(t) .

Now I + II = γ(X, P).
2. Proof of Theorem 9.2.2. Suppose we can show that γ(X, P) is an influence function for Γ̇τ^{-1}(β̂, P) in the sense of (7.2.2). Then

√n(β̂ − β) = Γ̈τ^{-1}(β*, P̂) √n [n^{-1} Σ_{i=1}^n γ(Xi, P)] + oP(1) .

By Slutsky's theorem (Corollary 7.1.2) for processes, we can replace Γ̈τ^{-1}(β*, P̂) with Γ̈τ^{-1}(β, P), and results (ii) and (iii) follow. We leave the details to the reader in Problem D.4.1.
✷
D.5 Markov Dependent Data. Supplement to Section 10.4
A. Markov Sequences
Let X1 , X2 , . . . , be a sequence of X valued random variables. The i.i.d. sequences we
have considered so far have two critical properties that we will generalize in this section.
(i) Independence
L(Xm |Xm−1 , . . . , X1 ) = L(Xm ), m ≥ 2 .
(ii) Identical distribution
L(Xm ) = L(X1 ),
m≥2.
The natural generalization of the first property is the Markov property defined by
L(Xm |Xm−1 , . . . , X1 ) = L(Xm |Xm−1 ),
m≥2.
(D.5.1)
The most important generalization of the second property is stationarity defined by
L(X1 , . . . , Xm ) = L(X1+k , . . . , Xm+k ), m ≥ 1, k ≥ 0 .
(D.5.2)
An intermediate property (Problem D.5.1) implied by (D.5.1) and (D.5.2), but not by either
one singly, is called (e.g. Grimmett and Stirzaker, Section 6.1) homogeneous Markov. It is
defined by (D.5.1) and “L(Xm |Xm−1 ) does not depend on m,” that is,
L(Xm |Xm−1 ) = L(X2 |X1 ),
m≥2.
A sequence is stationary and Markov iff it is homogeneous Markov and
L(Xm ) = L(X1 ),
m≥2.
(D.5.3)
Thus, an i.i.d. sequence is stationary Markov.
To study Markov sequences {Xn , n ≥ 1} it is best to think of n as time and Xn as the
state of a dynamical system initiated at X1 . Thus, we refer to X as the state space and its
members as states. We think of the evolution mechanism, L(X2 |X1 ), as being given by a
Markov kernel K : X × X → R defined by
K(x, y) = P(X2 = y | X1 = x),  x, y ∈ X ,

if X is discrete (countable). If X is Euclidean and X2 | X1 has a continuous case density pX2|X1(x2 | x1), we define the Markov kernel by

K(x, y) = pX2|X1(y | x),  x, y ∈ X .
We call L(X1 ) the initial distribution and if the Markov sequence is stationary, L(X1 ) is a
stationary distribution because in this case L(Xn ) = L(X1 ) for all n ≥ 1.
In what follows, we will define stationary distributions more generally and describe
how they, under certain conditions, approximate the distribution of Xn for n large.
B. Homogeneous Finite Markov Chains
We sketch the theory for X = {1, . . . , N }, that is for finite Markov chains. In fact,
for Markov Chain Monte Carlo (MCMC), and many other applications, we need Markov
sequences for X countable or X general Euclidean. Such generalizations hold only under
additional conditions which we shall give appropriate references to, as needed.
Homogeneous chains satisfy P(X_{m+n} = y | Xm = x) = P(X_{n+1} = y | X1 = x), so we can call

pn(x, y) ≡ P(X_{n+1} = y | X1 = x)

n-step transition probabilities. The n-step transition matrix Pn is defined by

Pn = (pn(i, j))_{N×N} .
A key property is

Lemma D.5.1. If X1, X2, . . . is a finite homogeneous Markov chain, then

Pn = K^n,  n = 1, 2, . . . ,

where K^n is the nth power of the transition matrix K = (K(i, j))_{N×N}. In particular,

P(X_{n+1} = j | X1 = i) = (K^n)_{ij} ,

the (i, j) entry of K^n.
Proof. Argue by induction: P1 = K by definition. By A.4.4, the Markov property, homogeneity, and induction,

P(X_{n+1} = j | X1 = i) = Σ_{k=1}^N P(X_{n+1} = j | Xn = k, X1 = i) P(Xn = k | X1 = i)
  = Σ_{k=1}^N P(X_{n+1} = j | Xn = k) P(Xn = k | X1 = i)
  = (K^{n−1} K)_{ij} = (K^n)_{ij} .
✷
Next, consider the marginal probabilities

q^(n) = (p1^(n), . . . , pN^(n))^T,  pi^(n) = P(Xn = i),  i ∈ {1, . . . , N} .

Lemma D.5.2. Under the conditions of Lemma D.5.1, (q^(n+1))^T = (q^(1))^T K^n.

Proof. By A.4.4 and Lemma D.5.1,

pj^(n+1) = Σ_i P(X_{n+1} = j | X1 = i) P(X1 = i) = Σ_i pi^(1) (K^n)_{ij} = ((q^(1))^T K^n)_j .
✷

Thus the behavior of the chain is determined by the transition matrix K and the initial distribution q^(1). Moreover, we can determine the asymptotic distribution of Xn by examining the limit of K^n.
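Lemmas D.5.1 and D.5.2 are easy to verify numerically; the 3-state kernel below is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

K = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
q1 = np.array([1.0, 0.0, 0.0])              # X_1 = state 0 with probability 1
n_steps = 9

# Marginal of X_{n+1} via the lemmas: (q^(1))^T K^n
exact = q1 @ np.linalg.matrix_power(K, n_steps)

# Monte Carlo check: simulate the chain for n_steps transitions
reps = 5000
counts = np.zeros(3)
for _ in range(reps):
    state = 0
    for _ in range(n_steps):
        state = rng.choice(3, p=K[state])
    counts[state] += 1
print(exact, counts / reps)
```

The simulated frequencies of the final state match the matrix-power computation, which is exactly the content of the lemmas.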
C. Some Definitions
Important features of homogeneous Markov chains include
(i) Irreducibility: All points in X are accessible from any other point. That is, P(Xm = j | X1 = i) > 0 for some m, all i, j. For finite chains this is equivalent to "K^m has all elements positive from some m on."
(ii) Recurrence: A chain is recurrent if for all x ∈ X,

P( Xk = x for some k ≥ 2 | X1 = x ) = 1 .
(iii) Positive recurrence: This refers to recurrent chains with Eτx < ∞ for all x ∈ X, where τx = min{n ≥ 2 : Xn = x} is the first return time to x.
(iv) Aperiodicity: The period k of a state x is the largest integer k for which returns to state x occur in multiples of k steps. Formally, k = gcd{n : P(X_{n+1} = x | X1 = x) > 0}, where "gcd" denotes greatest common divisor. A Markov chain is aperiodic if k = 1 for all x ∈ X.
(v) A stationary probability distribution on X is a vector π = (π(1), . . . , π(N))^T such that π(i) ≥ 0, Σ_i π(i) = 1, and Σ_i π(i)K(i, j) = π(j); that is,

π^T K = π^T .

Equivalently, if π is the distribution of X1 and π^T K = π^T, then the sequence is stationary Markov. A stationary distribution may not exist.
(vi) A stationary Markov chain with kernel K and pi = P(Xn = xi) satisfies detailed balance if

pi K(i, j) = pj K(j, i) .

Summing over j we obtain

pi = Σ_{j=1}^N pj K(j, i)

so that (p1, . . . , pN)^T = π is a stationary distribution. Such chains are also called reversible because detailed balance is equivalent to (X1, X2) ∼ (X2, X1). Algebraically, if D = diag(p1, . . . , pN), then KD = DK^T.
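Detailed balance is the property that makes MCMC constructions work: a Metropolis chain with a symmetric proposal and acceptance probability min(1, π(j)/π(i)) satisfies (vi) by design, so the target π is automatically stationary. A small sketch; the target π and the uniform proposal are illustrative choices:

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])                    # target distribution
prop = np.full((3, 3), 1.0 / 3.0)                 # symmetric proposal kernel

K = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            K[i, j] = prop[i, j] * min(1.0, pi[j] / pi[i])
    K[i, i] = 1.0 - K[i].sum()                    # stay put on rejection

balance = pi[:, None] * K                          # matrix of pi_i K(i, j)
print(np.allclose(balance, balance.T), np.allclose(pi @ K, pi))
```

The symmetry of `balance` is detailed balance pi K(i, j) = pj K(j, i); stationarity π^T K = π^T then follows by the summation argument above.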
(vii) If π = (π(1), . . . , π(N))^T is a stationary distribution, the stationary transition matrix is the N × N matrix Π with every row equal to (π(1), . . . , π(N)). If Y1, Y2, . . . is a homogeneous Markov chain with transition matrix Π, then Y1, Y2, . . . satisfy detailed balance and Y1, Y2, . . . are i.i.d. as π.
(viii) The operator norm on the class of N × N matrices is defined by

‖M‖ = sup{|Mx| : |x| ≤ 1} ,

where x is N × 1 and | · | denotes Euclidean norm. The Frobenius norm for M = (mij)_{N×N} is (Σ_{i,j} mij²)^{1/2}. It is equivalent to ‖ · ‖ for finite matrices.
D. Fundamental Theorem of Finite Markov Chains. Let X1 , X2 , . . . be an irreducible,
aperiodic, homogeneous finite Markov chain. Then there exists a unique stationary distribution π, and if Pn is the n-step transition matrix, then
‖Pn − Π‖ → 0 .  (D.5.4)

If π satisfies detailed balance, then more is true: there exist c < ∞ and ρ < 1 such that for all n,

‖Pn − Π‖ ≤ cρ^n .  (D.5.5)
Note that (D.5.4) implies that, for any initial distribution q^(1),

|(q^(1))^T Pn − π^T| → 0 .  (D.5.6)
Moreover, (D.5.5) is equivalent to (Problem D.5.4)

Σ_{i=1}^N |P(Xn = i) − π(i)| ≤ M ρ^n  (D.5.7)

for some M < ∞. That is, we can approximate the stationary distribution arbitrarily closely by starting at any distribution and running the Markov chain. This is sometimes called the forgetting property.
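The geometric bound (D.5.5) and the forgetting property can be seen directly by powering up a small kernel; the particular K is an arbitrary irreducible, aperiodic example:

```python
import numpy as np

K = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.4, 0.3]])

# Stationary distribution: normalized left eigenvector of K for eigenvalue 1.
w, v = np.linalg.eig(K.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
Pi = np.tile(pi, (3, 1))                         # matrix with all rows equal to pi^T

errs = [np.linalg.norm(np.linalg.matrix_power(K, n) - Pi) for n in (1, 5, 10, 20)]
print(errs)                                      # decays geometrically in n
```

The decay rate is governed by the subdominant eigenvalue of K, which is what the Diaconis–Stroock bounds on c and ρ control.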
The proof of the theorem and bounds on c are in Diaconis and Stroock (1991). See also Grimmett and Stirzaker (2001, page 295).
Remark D.5.1. As we noted, such results are needed for countable state spaces. What
modifications are needed for this case? See Grimmett and Stirzaker (2001, Chapter 6, page
232) for details. The conditions needed are irreducibility, aperiodicity, and positive recurrence. Unfortunately only (D.5.4) can be established under these conditions for countable
X.
The potential slowness of convergence if |X | = ∞ is not only a weakness of the proof.
In fact, (see Gilks et al (1995)) arbitrarily slowly converging Markov chains can be constructed. The situation when X = Rd is even worse. Conditions for (D.5.4) with k · k
the variational norm are more difficult to formulate and convergence in high dimensional
situations is known empirically and theoretically to typically be very slow. See Meyn and
Tweedie (2009).
D.6 Asymptotics for Locally Polynomial Estimates. Supplement to Section 11.6
1. Proof of Theorem 11.6.1. We will use (11.6.11). Write

Snj = h^{j−1} Σ_{i=1}^n Ui^j K(Ui),  j = 0, 1, . . . ,  (D.6.1)

where Ui = (Xi − x)/h has density h f(x + hu). By Chebychev's inequality (A.15.2) and Taylor expansion,

Snj = E Snj + OP(√(Var Snj))
  = nh^j ∫ u^j K(u) f(x + hu) du + OP(h^{j−1} √(n E[U^{2j} K²(U)]))
  = nh^j {f(x) mj + h f′(x) m_{j+1} + OP(cn)} ,  (D.6.2)
where cn = (nh)^{-1/2}(1 + h²). The last equality follows because if K is symmetric, so is K², νj(K) = 0 for j odd, and

E[U^{2j} K²(U)] = h ∫ u^{2j} K²(u) f(x + hu) du
  = h ∫ u^{2j} K²(u) {f(x) + hu f′(x) + ½ f″(x + hu*) h²u²} du,  |u*| ≤ |u| ,
  = h [f(x) ν_{2j}(K) + O(h²)] .
Next, for ε = εn = OP(cn) and any constant c ≠ 0, we have

(c + ε)^{-1} = c^{-1} − c^{-2}ε + OP(ε²)  as ε → 0 ,

which gives

Sn0^{-1} = n^{-1}{[f(x)]^{-1} + OP(cn)} .  (D.6.3)

By combining (11.6.11), (D.6.1), (D.6.2), and (D.6.3), the result follows.
✷
2. Proof of Theorem 11.6.2. For Xi close to x consider the approximation

σ²(Xi) = σ²(x) + OP(Xi − x) .

Using this and the arguments leading to (D.6.2) with j = 0, we have

Σ_{i=1}^n σ²(Xi) Kh²(Xi − x) = h^{-1} n f(x) σ²(x) ν(K){1 + oP(1)} .

Similarly, Sn0^{-2} = n^{-2}{f^{-2}(x) + oP(1)}, and the result follows from (11.6.12).
✷
3. The Bias of Local Polynomial Estimates. Let Sn = Z^T WZ = (S_{n,j+ℓ})_{0≤j≤p,0≤ℓ≤p}, Tn = (S_{n,p+1}, . . . , S_{n,2p+1})^T, S = (m_{j+ℓ})_{0≤j≤p,0≤ℓ≤p}, S^{-1} = (S^{jℓ})_{0≤j≤p,0≤ℓ≤p}, mp = (m_{p+1}, . . . , m_{2p+1})^T, and H = diag(1, h, . . . , h^p).

Theorem D.6.1 Under the conditions of Theorem 11.6.3,

(i) Bias[β̂(x) | X] = H^{-1} S^{-1} mp β_{p+1} h^{p+1} {1 + oP(1)} .
Proof. Using (11.6.17), (11.6.18), and (D.6.2), the conditional bias is

Sn^{-1} Z^T W ( β_{p+1}(Xi − x)^{p+1} + oP((Xi − x)^{p+1}) )_{1≤i≤n}
  = Sn^{-1} β_{p+1} Tn + oP(nh^{p+1})  (D.6.4)
  = Sn^{-1} β_{p+1} nh^{p+1} {f(x) H mp + o(h)} + oP(nh^{p+1}) .  (D.6.5)

Next note that (D.6.2) also gives

Sn = n f(x) HSH {1 + oP(1)} ,  (D.6.6)
which together with (D.6.5) yields Theorem D.6.1. Theorem 11.6.3 follows by using
H−1 = diag(1, h−1 , . . . , h−p ) and a little algebra (Problem D.6.2).
✷
Let η_{p+1} ≡ η_{p+1}(K) be the first entry of the vector S^{-1} mp. A useful result is (see Problem D.6.1):

Lemma D.6.1. Under the conditions of Theorem 11.6.3, if p is even, then η_{p+1}(K) = 0.

Proof. Recall that S = (m_{j+ℓ})_{0≤j≤p,0≤ℓ≤p} and mp = (m_{p+1}, . . . , m_{2p+1})^T. When K is symmetric, mj = 0 for j odd. It follows that the vector S^{-1} mp has zero as its first entry; that is, η_{p+1}(K) = 0.
✷
The local linear case.

Lemma D.6.2. Assume the conditions of Theorem 11.6.3. If p = 1, then η2(K) = m2(K).

Proof. The symmetry of K implies that mj = mj(K) = 0 for j odd. Thus, in the local linear case, where p = 1, we have S^{-1} = diag(1, m2^{-1}) and m1 = (m2, 0)^T, which implies η2(K) = m2(K).
✷
We have shown:
Theorem D.6.2. Under the conditions of Theorem 11.6.3, if p = 1, then

Bias[μ̂(x) | X] = ½ m2(K) μ″(x) h² + oP(h²) .
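The h² bias law of Theorem D.6.2 is easy to see in a sketch with a noiseless quadratic target, for which μ″ ≡ 2 and the Gaussian kernel has m2(K) = 1, so the predicted bias at x = 0 is h². The uniform design, Gaussian kernel, and bandwidths below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def local_linear(x0, X, Y, h):
    """Weighted LS fit of Y on (1, X - x0); the intercept estimates mu(x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)        # Gaussian kernel weights
    Z = np.column_stack([np.ones_like(X), X - x0])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return beta[0]

n = 20_000
X = rng.uniform(-1.0, 1.0, n)
Y = X**2                                          # mu(x) = x^2, so mu(0) = 0
bias_h = local_linear(0.0, X, Y, 0.2)             # predicted ~ 0.2^2
bias_h2 = local_linear(0.0, X, Y, 0.1)            # predicted ~ 0.1^2
print(bias_h, bias_h2, bias_h / bias_h2)          # ratio near 4 = (0.2/0.1)^2
```

Halving the bandwidth cuts the bias by about a factor of four, the signature of the h² rate; adding noise to Y would leave the bias unchanged but introduce the variance trade-off treated next.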
4. The Variance of Local Polynomial Estimates. Let σ²(x) = Var(Y | X = x) and Σ = diag[Kh²(Xi − x)σ²(Xi)]_{1≤i≤n}; then from (11.6.16),

Var(β̂(x) | X) = (Z^T WZ)^{-1}(Z^T ΣZ)(Z^T WZ)^{-1} .  (D.6.7)

Let Sn* = Z^T ΣZ = (S*_{n,j+ℓ})_{0≤j≤p,0≤ℓ≤p}, where

S*_{n,j} = Σ_{i=1}^n (Xi − x)^j Kh²(Xi − x) σ²(Xi) ,

and let S* = (ν_{j+ℓ})_{0≤j≤p,0≤ℓ≤p}, where νj(K) = ∫ u^j [K(u)]² du.
Theorem D.6.3. Under the conditions of Theorem 11.6.4,

Var(β̂(x) | X) = [σ²(x)/(f(x)nh)] H^{-1} S^{-1} S* S^{-1} H^{-1} {1 + oP(1)} .

Proof: By using the expansion leading to (D.6.2), we find

Sn* = n h^{-1} f(x) σ²(x) H S* H [1 + oP(1)] .

Theorems D.6.3 and 11.6.4 follow from this and (D.6.7) (Prob