advertisement

Volume II Statistics The book covers asymptotic efficiency in semiparametric models from the Le Cam and Fisherian points of view as well as some finite sample size optimality criteria based on Lehmann– Scheffé theory. It develops the theory of semiparametric maximum likelihood estimation with applications to areas such as survival analysis. It also discusses methods of inference based on sieve models and asymptotic testing theory. The remainder of the book is devoted to model and variable selection, Monte Carlo methods, nonparametric curve estimation, and prediction, classification, and machine learning topics. Features • Develops basic asymptotic tools, including weak convergence for random processes, empirical process theory, and the functional delta method • Discusses the classical theory of statistical optimality in a decision-theoretic context • Presents inference procedures and their properties in a variety of applications and models, such as Cox’s regression model, models for censored data, and partial linear models • Describes properties of Monte Carlo/simulation-based methods, including the bootstrap and Markov chain Monte Carlo (MCMC) • Examines the nonparametric estimation of functions of one or more variables • Covers many topics related to statistical learning, including support vector machines and classification and regression trees (CART) Using the tools and methods developed in this textbook, you will be ready for advanced research in modern statistics. Numerous examples and problems illustrate statistical modeling and inference concepts. Measure theory is not required for understanding. Mathematical Statistics Mathematical Statistics: Basic Ideas and Selected Topics, Volume II presents important statistical concepts, methods, and tools not covered in the authors’ previous volume. This second volume focuses on inference in non- and semiparametric models. It not only reexamines the procedures introduced in the first volume from a more sophisticated point of view but also addresses new problems originating from the analysis of estimation of functions and other complex decision procedures and large-scale data analysis. Texts in Statistical Science Mathematical Statistics Basic Ideas and Selected Topics Volume II Bickel Doksum • • • • • Access online or download to your smartphone, tablet or PC/Mac Search the full text of this and other titles you own Make and share notes and highlights Copy and paste text and figures for use in your own documents Customize your view by changing font size and layout K25643 ISBN: 978-1-4987-2268-1 90000 9 781498 722681 w w w. c rc p r e s s . c o m WITH VITALSOURCE ® EBOOK Peter J. Bickel Kjell A. Doksum Mathematical Statistics Basic Ideas and Selected Topics Volume II CHAPMAN & HALL/CRC Texts in Statistical Science Series Series Editors Francesca Dominici, Harvard School of Public Health, USA Julian J. Faraway, University of Bath, UK Martin Tanner, Northwestern University, USA Jim Zidek, University of British Columbia, Canada Statistical Theory: A Concise Introduction F. Abramovich and Y. Ritov Practical Multivariate Analysis, Fifth Edition A. Afifi, S. May, and V.A. Clark Practical Statistics for Medical Research D.G. Altman Interpreting Data: A First Course in Statistics A.J.B. Anderson Introduction to Probability with R K. Baclawski Linear Algebra and Matrix Analysis for Statistics S. Banerjee and A. Roy Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, Second Edition P. J. Bickel and K. A. Doksum Mathematical Statistics: Basic Ideas and Selected Topics, Volume II P. J. Bickel and K. A. Doksum Analysis of Categorical Data with R C. R. Bilder and T. M. Loughin Statistical Methods for SPC and TQM D. Bissell Introduction to Probability J. K. Blitzstein and J. Hwang Bayesian Methods for Data Analysis, Third Edition B.P. Carlin and T.A. Louis Second Edition R. Caulcutt The Analysis of Time Series: An Introduction, Sixth Edition C. Chatfield Introduction to Multivariate Analysis C. Chatfield and A.J. Collins Problem Solving: A Statistician’s Guide, Second Edition C. Chatfield Statistics for Technology: A Course in Applied Statistics, Third Edition C. Chatfield Bayesian Ideas and Data Analysis: An Introduction for Scientists and Statisticians R. Christensen, W. Johnson, A. Branscum, and T.E. Hanson Modelling Binary Data, Second Edition D. Collett Modelling Survival Data in Medical Research, Third Edition D. Collett Introduction to Statistical Methods for Clinical Trials T.D. Cook and D.L. DeMets Applied Statistics: Principles and Examples D.R. Cox and E.J. Snell Multivariate Survival Analysis and Competing Risks M. Crowder Statistical Analysis of Reliability Data M.J. Crowder, A.C. Kimber, T.J. Sweeting, and R.L. Smith An Introduction to Generalized Linear Models, Third Edition A.J. Dobson and A.G. Barnett Nonlinear Time Series: Theory, Methods, and Applications with R Examples R. Douc, E. Moulines, and D.S. Stoffer Introduction to Optimization Methods and Their Applications in Statistics B.S. Everitt Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models J.J. Faraway Linear Models with R, Second Edition J.J. Faraway A Course in Large Sample Theory T.S. Ferguson Multivariate Statistics: A Practical Approach B. Flury and H. Riedwyl Readings in Decision Analysis S. French Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition D. Gamerman and H.F. Lopes Bayesian Data Analysis, Third Edition A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin Multivariate Analysis of Variance and Repeated Measures: A Practical Approach for Behavioural Scientists D.J. Hand and C.C. Taylor Practical Longitudinal Data Analysis D.J. Hand and M. Crowder Logistic Regression Models J.M. Hilbe Richly Parameterized Linear Models: Additive, Time Series, and Spatial Models Using Random Effects J.S. Hodges Statistics for Epidemiology N.P. Jewell Stochastic Processes: An Introduction, Second Edition P.W. Jones and P. Smith The Theory of Linear Models B. Jørgensen Principles of Uncertainty J.B. Kadane Graphics for Statistics and Data Analysis with R K.J. Keen Mathematical Statistics K. Knight Introduction to Multivariate Analysis: Linear and Nonlinear Modeling S. Konishi Nonparametric Methods in Statistics with SAS Applications O. Korosteleva Modeling and Analysis of Stochastic Systems, Second Edition V.G. Kulkarni Exercises and Solutions in Biostatistical Theory L.L. Kupper, B.H. Neelon, and S.M. O’Brien Exercises and Solutions in Statistical Theory L.L. Kupper, B.H. Neelon, and S.M. O’Brien Design and Analysis of Experiments with R J. Lawson Design and Analysis of Experiments with SAS J. Lawson A Course in Categorical Data Analysis T. Leonard Statistics for Accountants S. Letchford Introduction to the Theory of Statistical Inference H. Liero and S. Zwanzig Statistical Theory, Fourth Edition B.W. Lindgren Stationary Stochastic Processes: Theory and Applications G. Lindgren Statistics for Finance E. Lindström, H. Madsen, and J. N. Nielsen The BUGS Book: A Practical Introduction to Bayesian Analysis D. Lunn, C. Jackson, N. Best, A. Thomas, and D. Spiegelhalter Introduction to General and Generalized Linear Models H. Madsen and P. Thyregod Time Series Analysis H. Madsen Pólya Urn Models H. Mahmoud Randomization, Bootstrap and Monte Carlo Methods in Biology, Third Edition B.F.J. Manly Introduction to Randomized Controlled Clinical Trials, Second Edition J.N.S. Matthews Statistical Methods in Agriculture and Experimental Biology, Second Edition R. Mead, R.N. Curnow, and A.M. Hasted Statistics in Engineering: A Practical Approach A.V. Metcalfe Statistical Inference: An Integrated Approach, Second Edition H. S. Migon, D. Gamerman, and F. Louzada Beyond ANOVA: Basics of Applied Statistics R.G. Miller, Jr. A Primer on Linear Models J.F. Monahan Decision Analysis: A Bayesian Approach J.Q. Smith Elements of Simulation B.J.T. Morgan Applied Statistics: Handbook of GENSTAT Analyses E.J. Snell and H. Simpson Applied Stochastic Modelling, Second Edition B.J.T. Morgan Probability: Methods and Measurement A. O’Hagan Introduction to Statistical Limit Theory A.M. Polansky Applied Bayesian Forecasting and Time Series Analysis A. Pole, M. West, and J. Harrison Statistics in Research and Development, Time Series: Modeling, Computation, and Inference R. Prado and M. West Introduction to Statistical Process Control P. Qiu Sampling Methodologies with Applications P.S.R.S. Rao A First Course in Linear Model Theory N. Ravishanker and D.K. Dey Essential Statistics, Fourth Edition D.A.G. Rees Stochastic Modeling and Mathematical Statistics: A Text for Statisticians and Quantitative Scientists F.J. Samaniego Statistical Methods for Spatial Data Analysis O. Schabenberger and C.A. Gotway Bayesian Networks: With Examples in R M. Scutari and J.-B. Denis Large Sample Methods in Statistics P.K. Sen and J. da Motta Singer Spatio-Temporal Methods in Environmental Epidemiology G. Shaddick and J.V. Zidek Analysis of Failure and Survival Data P. J. Smith Applied Nonparametric Statistical Methods, Fourth Edition P. Sprent and N.C. Smeeton Data Driven Statistical Methods P. Sprent Generalized Linear Mixed Models: Modern Concepts, Methods and Applications W. W. Stroup Survival Analysis Using S: Analysis of Time-to-Event Data M. Tableman and J.S. Kim Applied Categorical and Count Data Analysis W. Tang, H. He, and X.M. Tu Elementary Applications of Probability Theory, Second Edition H.C. Tuckwell Introduction to Statistical Inference and Its Applications with R M.W. Trosset Understanding Advanced Statistical Methods P.H. Westfall and K.S.S. Henning Statistical Process Control: Theory and Practice, Third Edition G.B. Wetherill and D.W. Brown Generalized Additive Models: An Introduction with R S. Wood Epidemiology: Study Design and Data Analysis, Third Edition M. Woodward Practical Data Analysis for Designed Experiments B.S. Yandell Texts in Statistical Science Mathematical Statistics Basic Ideas and Selected Topics Volume II Peter J. Bickel University of California Berkley, California, USA Kjell A. Doksum University of Wisconsin Madison, Wisconsin, USA CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2016 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Version Date: 20150909 International Standard Book Number-13: 978-1-4987-2270-4 (eBook - PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http:// www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com To Erich L. Lehmann CONTENTS PREFACE TO THE 2015 EDITION xv I 1 1 5 5 6 8 8 10 11 15 20 INTRODUCTION AND EXAMPLES I.0 Basic Ideas and Conventions I.1 Tests of Goodness of Fit and the Brownian Bridge I.2 Testing Goodness of Fit to Parametric Hypotheses I.3 Regular Parameters. Minimum Distance Estimates I.4 Permutation Tests I.5 Estimation of Irregular Parameters I.6 Stein and Empirical Bayes Estimation I.7 Model Selection I.8 Problems and Complements I.9 Notes 7 TOOLS FOR ASYMPTOTIC ANALYSIS 7.1 Weak Convergence in Function Spaces 7.1.1 Stochastic Processes and Weak Convergence 7.1.2 Maximal Inequalities 7.1.3 Empirical Processes on Function Spaces 7.2 The Delta Method in Infinite Dimensional Space 7.2.1 Influence Functions. The Gâteaux and Fréchet Derivatives 7.2.2 The Quantile Process 7.3 Further Expansions 7.3.1 The von Mises Expansion 7.3.2 The Hoeffding and Analysis of Variance Expansions 7.4 Problems and Complements 7.5 Notes 21 21 21 28 31 38 38 47 51 51 54 62 72 ix x Contents 8 DISTRIBUTION-FREE, UNBIASED, AND EQUIVARIANT PROCEDURES 8.1 Introduction 8.2 Similarity and Completeness 8.2.1 Testing 8.2.2 Testing Optimality Theory 8.2.3 Estimation 8.3 Invariance, Equivariance, and Minimax Procedures 8.3.1 Group Models 8.3.2 Group Models and Decision Theory 8.3.3 Characterizing Invariant Tests 8.3.4 Characterizing Equivariant Estimates 8.3.5 Minimaxity for Tests: Application to Group Models 8.3.6 Minimax Estimation, Admissibility, and Steinian Shrinkage 8.4 Problems and Complements 8.5 Notes 73 73 74 74 85 87 92 92 94 96 102 104 107 112 123 9 INFERENCE IN SEMIPARAMETRIC MODELS 9.1 Estimation in Semiparametric Models 9.1.1 Selected Examples 9.1.2 Regularization. Modified Maximum Likelihood 9.1.3 Other Modified and Approximate Likelihoods 9.1.4 Sieves and Regularization 9.2 Asymptotics. Consistency and Asymptotic Normality 9.2.1 A General Consistency Criterion 9.2.2 Asymptotics for Selected Models 9.3 Efficiency in Semiparametric Models 9.4 Tests and Empirical Process Theory 9.5 Asymptotic Properties of Likelihoods. Contiguity 9.6 Problems and Complements 9.7 Notes 125 125 125 133 142 145 151 152 153 161 175 181 193 209 10 MONTE CARLO METHODS 10.1 The Nature of Monte Carlo Methods 10.2 Three Basic Monte Carlo Metheds 10.2.1 Simple Monte Carlo 10.2.2 Importance Sampling 10.2.3 Rejective Sampling 211 211 214 215 216 217 Contents 10.3 The Bootstrap 10.3.1 Bootstrap Samples and Bias Corrections 10.3.2 Bootstrap Variance and Confidence Bounds 10.3.3 The General i.i.d. Nonparametric Bootstrap 10.3.4 Asymptotic Theory for the Bootstrap 10.3.5 Examples Where Efron’s Bootstrap Fails. The m out of n Bootstraps 10.4 Markov Chain Monte Carlo 10.4.1 The Basic MCMC Framework 10.4.2 Metropolis Sampling Algorithms 10.4.3 The Gibbs Samplers 10.4.4 Speed of Convergence and Efficiency of MCMC 10.5 Applications of MCMC to Bayesian and Frequentist Inference 10.6 Problems and Complements 10.7 Notes xi 219 220 224 227 230 235 237 237 238 242 246 250 256 263 11 NONPARAMETRIC INFERENCE FOR FUNCTIONS OF ONE VARIABLE 11.1 Introduction 11.2 Convolution Kernel Estimates on R 11.2.1 Uniform Local Behavior of Kernel Density Estimates 11.2.2 Global Behavior of Convolution Kernel Estimates 11.2.3 Performance and Bandwidth Choice 11.2.4 Discussion of Convolution Kernel Estimates 11.3 Minimum Contrast Estimates: Reducing Boundary Bias 11.4 Regularization and Nonlinear Density Estimates 11.4.1 Regularization and Roughness Penalties 11.4.2 Sieves. Machine Learning. Log Density Estimation 11.4.3 Nearest Neighbor Density Estimates 11.5 Confidence Regions 11.6 Nonparametric Regression for One Covariate 11.6.1 Estimation Principles 11.6.2 Asymptotic Bias and Variance Calculations 11.7 Problems and Complements 265 265 266 269 271 272 273 274 280 280 281 284 285 287 287 290 297 12 PREDICTION AND MACHINE LEARNING 12.1 Introduction 307 307 xii Contents 12.2 12.3 12.4 12.5 12.6 12.7 12.8 12.1.1 Statistical Approaches to Modeling and Analyzing Multidimensional data. Sieves 12.1.2 Machine Learning Approaches 12.1.3 Outline Classification and Prediction 12.2.1 Multivariate Density and Regression Estimation 12.2.2 Bayes Rule and Nonparametric Classification 12.2.3 Sieve Methods 12.2.4 Machine Learning Approaches Asymptotic Risk Criteria 12.3.1 Optimal Prediction in Parametric Regression Models 12.3.2 Optimal Rates of Convergence for Estimation and Prediction in Nonparametric Models 12.3.3 The Gaussian White Noise (GWN) Model 12.3.4 Minimax Bounds on IMSE for Subsets of the GWN Model 12.3.5 Sparse Submodels Oracle Inequalities 12.4.1 Stein’s Unbiased Risk Estimate 12.4.2 Oracle Inequality for Shrinkage Estimators 12.4.3 Oracle Inequality and Adaptive Minimax Rate for Truncated Estimates 12.4.4 An Oracle Inequality for Classification Performance and Tuning via Cross Validation 12.5.1 Cross Validation for Tuning Parameter Choice 12.5.2 Cross Validation for Measuring Performance Model Selection and Dimension Reduction 12.6.1 A Bayesian Criterion for Model Selection 12.6.2 Inference after Model Selection 12.6.3 Dimension Reduction via Principal Component Analysis Topics Briefly Touched and Current Frontiers Problems and Complements D APPENDIX D. SUPPLEMENTS TO TEXT D.1 Probability Results D.2 Supplement to Section 7.1 D.3 Supplement to Section 7.2 D.4 Supplement to Section 9.2.2 D.5 Supplement to Section 10.4 309 313 315 315 315 320 322 324 333 334 337 347 349 350 352 354 355 357 359 361 362 367 367 368 372 374 377 381 399 399 401 404 405 406 Contents xiii D.6 Supplement to Section 11.6 D.7 Supplement to Section 12.2.2 D.8 Problems and Complements 410 413 419 E SOLUTIONS FOR VOLUME II 423 REFERENCES 437 SUBJECT INDEX 455 AUTHOR INDEX 461 PREFACE TO THE 2015 EDITION Volume II This textbook represents our view of what a second course in mathematical statistics for graduate students with a good mathematics background should be. The mathematics background needed includes linear algebra, matrix theory and advanced calculus, but not measure theory. Probability at the level of, for instance, Grimmett and Stirzaker’s Probability and Random Processes, is also needed. Appendix D1 in combination with Appendices A and B in Volume I give the probability that is needed. However, the treatment is abridged with few proofs. This Volume II of the second edition presents what we think are some of the most important statistical concepts, methods, and tools developed since the first edition. Topics included are: asymptotic efficiency in semiparametric models, semiparametric maximum likelihood estimation, finite sample size optimality including Lehmann-Scheffé theory, survival analysis including Cox regression, prediction, classification, methods of inference based on sieve models, model and variable selection, Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo, nonparametric curve estimation, and machine learning including support vector machines and classification and regression trees (CART). The basic asymptotic tools developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, empirical process theory, and the functional delta method. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics. Volume II includes too many topics to be covered in one semester. Chapter 8 can be omitted without losing much continuity. The following outline of Volume II chapter contents can be used to select topics to be included in a one semester course. A course leaning towards basic statistical theory could include most of Chapters 7, 8, 9, and 10 plus Sections 11.6, 12.4, and 12.6, while a course leaning towards statistical learning could include most of Chapters 7, 9, 10, and 12 plus Section 11.4. A great number of other possible combinations can be constructed from the following chapter outline. Volume II Outline Chapter I is an introductory chapter that starts by presenting basic ideas about statistical modeling and inference, then gives a number of examples that illustrate some of the basic xv xvi Preface to the 2015 Edition ideas. Chapter 7 gives the asymptotic tools to be used in parts of the rest of the book. These tools include asymptotic empirical process theory, the delta method and derivatives on function spaces and their use in deriving influence functions, as well as von Mises and Hoeffding expansions. Chapter 8 presents some of the classical theory of statistical optimality in a decision theoretic context. Here, we derive for fixed sample sizes procedures which are optimal over restricted classes of methods, such as distribution free tests and unbiased, invariant tests or equivariant estimates. Alternatively, we study the weak but general property of minimaxity, as well as admissibility. Some results important for Chapter 12 such as Stein’s identity and estimate are also introduced. The asymptotic counterparts of these notions are introduced in Chapter 9, enabling us to present generalizations which go well beyond classical exponential family and invariant models. The methods of Chapter 7 are key in this development. In Chapter 9, we first introduce the concepts of regularization, modified maximum likelihood, and sieves. We then develop asymptotic efficiency for semiparametric models, asymptotic optimality for tests, and Le Cam’s asymptotic theory of experiments. These concepts are applied to estimation in Cox’s semiparametric proportional hazard regression model, partially linear models, semiparametric linear models, biased sampling models, and models for censored data. For testing, the asymptotic optimality of Neyman’s Cα and Rao’s score tests are established. Chapter 10 develops Monte Carlo methods, that is, methods based on simulation. Simulation enables us to approximate methods stochastically, which are difficult to compute analytically. Examples range from distributions of test statistics and confidence bounds to likelihood methods to estimates of risk. In this chapter we first give methods based on simple random sampling where data are generated from specified distributions, including simple posterior distributions in a Bayesian setting. Next, importance sampling and rejective sampling are introduced as methods that are able to generate random variables with a desired distribution when simple random sampling is not possible. In Chapter 10 we also present Efron’s bootstrap. Here, important features of the population distribution, that is, parameters, are expressed as functionals of this unknown distribution and then the distribution is replaced by an estimate such as the empirical distribution of the experimental data. The functional evaluated at the empirical distribution is approximated by drawing Monte Carlo samples from the empirical distribution. Chapter 10 goes on to present Markov Chain Monte Carlo (MCMC) techniques. These are appropriate when direct generation of independent identically distributed variables is not possible. In this approach, a sequence of random variables are generated according to a homogenous Markov chain in such a way that variables far out in the chain are approximately distributed as a sample from a target distribution. The Metropolis algorithm and the Gibbs sampler are developed as special cases. MCMC is in particular important for Bayesian statistical inference, but also in frequentist contexts. Chapter 11 examines nonparametric estimation of functions of one variable, including density functions and the nonparametric regression of a response on one covariate. Estimates considered are based on kernels, series expansions, roughness penalties, nearest neighbors, and local polynomials. Asymptotic properties of mean squared errors are xvii Preface to the 2015 Edition developed. Chapter 12 has a number of topics related to what is known as statistical learning. Topics include prediction, classification, nonparametric estimation of multivariate densities and regression functions, penalty estimation including the least absolute shrinkage and selection operator (Lasso), classification and regression trees (CART), support vector machines, boosting, Gaussian white nose modeling, oracle inequalities, Steinian shrinkage estimation, sparsity, Bayesian model selection, regularization, sieves, cross-validation, asymptotic risk and optimality properties of predictors and classifiers, and more. We thank Akichika Ozeki, Sangbum Choi, Sören Künzel, Joshua Cape, and many students for pointing out errors and John Kimmel and CRC Press for production support. For word processing we thank Dee Frana and especially Anne Chong who processed 95% of Volume II and helped with references, indexing and error detection. Kjell Doksum thanks the statistics departments of Harvard, Columbia and Stanford Universities for support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has, with the field, taken on a new life. Volume I For convenience we repeat the preface to Volume I. In recent years statistics has changed enormously under the impact of several forces: (1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects. (2) The generation of enormous amounts of data—terabytes (the equivalent of 1012 characters) for an astronomical survey over three years. (3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore’s “law”) and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from small sample optimality results in a number of directions: (1) Methods for inference based on larger numbers of observations and minimal assumptions—asymptotic methods in non- and semiparametric models, models with “infinite” number of parameters. (2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule. xviii Preface to the 2015 Edition (3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo. (4) The development of techniques not describable in “closed mathematical form” but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious. (5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some though theoretically attractive cannot be implemented in a human lifetime. (6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories. There have been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work. In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. Volume I Outline This volume presents the basic classical statistical concepts at the Ph.D. level without requiring measure theory. It gives careful proofs of the major results and indicates how the theory sheds light on the properties of practical methods. The topics include estimation, prediction, testing, confidence sets, Bayesian analysis and the more general approach of decision theory. We include from the start in Chapter 1 non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is extensive material on Bayesian models and analysis and extended discussion of prediction and k-parameter exponential families. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B. Chapter 2 deals with estimation and includes a detailed treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm (coordinate descent) for convex optimization, applied, in particular to computing MLEs in multiparameter exponential families. We also give an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference. Chapters 3 and 4 are on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. Preface to the 2015 Edition xix Chapter 5 is devoted to basic asymptotic approximations with one dimensional parameter models as examples. It includes proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference and a section relating Bayesian and frequentist inference via the Bernstein–von Mises theorem. Finally, Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality and optimality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the one dimensional parameter case in Chapter 5. Chapter 6 also develops the asymptotic joint normality of estimates that are solutions to estimating equations and presents Huber’s Sandwich formula for the asymptotic covariance matrix of such estimates. Generalized linear models, including binary logistic regression, are introduced as examples. Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II. Volume I includes Appendix A on basic probability and a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. As in the first edition, we do not require measure theory but assume from the start that our models are what we call “regular.” That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis. Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field. For the first volume of the second edition we would like to add thanks to Jianging Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Peter J. Bickel bickel@stat.berkeley.edu Kjell Doksum doksum@stat.wisc.edu Chapter I INTRODUCTION AND EXAMPLES I.0 Basic Ideas and Conventions Recall from Volume I that in the field of statistics we represent important data-related problems and questions in terms of questions about distributions and their parameters. Thus our goal is to use data X ∈ X to estimate or draw conclusions about aspects of the probability distribution P of X. The probability distribution P is assumed to belong to a class P of distributions called the model. Examining what models are useful for answering data-related questions is an important part of statistics. In Volume I we considered three cases with a focus on the first: (1) P is a member of a parametric class of distributions {Pθ : θ ∈ Θ}, Θ ⊂ Rd , and our interest is in θ or some vector q(θ). (2) P is arbitrary except for regularity conditions, such as finite second moments or continuity of the distribution function, and our interest is in functionals ν(P ) that may be real valued, vectors, or functions. (3) Our class of distributions is neither smoothly parametrizable by a Euclidean parameter nor essentially unrestricted. In Volume I we focussed mostly(1) on parametric cases and on situations where the number of parameters we were dealing with was small in at least one of two ways: (i) The complexity of the regular model {Pθ : θ ∈ Θ}, Θ ⊂ Rd , as measured by the dimension d of the parametrization, was small in relation to the amount of information, as measured by the sample size n of the data. In particular, when examining the properties of statistical procedures, d does not increase with n. (ii) The procedures we considered, estimation of low dimensional Euclidean parameters, testing, and confidence regions, corresponded to simple (finite or low dimensional) action spaces A, where A is the range of the statistical decision procedure. In this volume we will focus on inference in non- and semiparametric models. In doing so, we will not only reexamine the procedures introduced in Volume I from a more sophisticated point of view but also come to grips with new problems originating from our analysis 1 2 Introduction and Examples Chapter I of estimation of functions and other complex decision procedures that appear naturally in these contexts. The mathematics needed for this work is often of a higher level than that used in Volume I. But, as before, we present what is needed in the appendices with proofs or references. Modeling Conventions The guiding principle of modern statistics was best formulated by George Box (see Section 1.1.): “Models, of course, are never true, but fortunately it is only necessary that they be useful” One implication of this statement is that the parameters we deal with are the parameters of the distribution in our model class closest to the unknown true distribution. See Sections 2.2.2, 5.4.2, and 6.2.1. For instance, a linear regression model can detect linear trends that provide useful information even if the true population relationship is not linear. See Figure 1.4.1. This leads to an interesting dilemma and accompanying research questions: The more general a class we postulate the more closely we will be able to approximte the true population distribution. However, using a very general class of models means more parameters and more variability of statistical methods. Achieving a balance leads to useful models. One approach is to use a nested sequence of “regular” parametric models (sieves) that become more general as we add parameters and then select the number of parameters by minimizing estimated prediction error (cross validation). See Chapter 12. As in Volume I (see Section 1.1.3), except for Bayesian models, our parametric models are restricted to be regular parametric models. P = {Pθ : θ ∈ Θ}, Θ ⊂ Rd , where Pθ is either continuous or discrete, and in the discrete case {x : p(x; θ) > 0} does not involve θ. But see Section I.5 for a general concept of regular and irregular parameters that includes semiparametric and nonparametric models. As in our discussion of Bayesian models in Section 1.2, conditioning of continuous variables by discrete variables and vice versa generally preserves the interpretation of the conditional p as being a continuous or discrete case density, respectively. If X = (I, Y )T where RI is discrete and Y is continuous, then p(i, y) is a density if it satisfies P (I = i, Y ≤ y y) = −∞ p(i, t)dt. Readers familiar with measure theory will recognize that all results remain meaningful when p = dP/dµ, where µ is a σ-finite measure dominating all P under discussion, and conditional densities are interpreted as being with respect to the appropriate conditional measure. All of the proofs can be converted to this general case, subject only to minor technicalities. Finally, we will write h = 0 when h = 0 a.s. (almost surely). More generally, a.s. equality is denoted as equality. As in Volume I, throughout this volume, for x ∈ Rd we shall use p(x) interchangeably for frequency functions p(x) = P [X = x] and for continuous case density R functions p(x). We willP call p(·) a density function in both cases. When we write h(x)dP (x) R we will mean x h(x)P [X = x] or h(x)p(x)dx. Unless we indicate otherwise, statements which can be interpreted under either interpretation are valid under both, although proofs in the text will be given under one formalism or the other. That is, when we write R R h(x)p(x)dx, for instance, we really mean h(x)dP (x) as interpreted above. When R R d = 1, we let F (x) = P (−∞, x] and often write h(x)dF (x) for h(x)dP (x). Section I.0 3 Basic Ideas and Conventions Selected Topics Statistical methods in Volume II include the bootstrap, Markov Chain Monte Carlo (MCMC), Steinian shrinkage, sieves, cross-validation, censored data analysis, Cox proportional hazard regression, nonparametric curve (kernel) estimation, model selection, classification, prediction, classification and regression trees (CART), penalty estimation such as the Lasso, and Bayesian procedures. The effectiveness of statistical methods is examined using classical concepts such as risk, Bayes risk, mean squared error, power, minimaxity, admissibility, invariance, and equivariance. Statistical methods that are optimal based on such criteria are obtained in a finite sample context in Chapter 8. However, most of the book is concerned with asymptotic theory including empirical process theory and efficient estimation in semiparametric models as well as the development of the asymptotic properties of the statistical methods listed above. We present in Section I.1–I.7 a few more details of some of the topics in Volume II. Notation X ∼ F , X is distributed according to F statistic, a function of observable data X only L(X), the distribution, or law, of X Xn =⇒ X, L(Xn ) → L(X) Xn converges weakly (in law) to X df, distribution function J, identity matrix = diag(1, . . . , 1) F̄ = 1 − F , the survival function [t], greatest integer less than or equal to t i.i.d., independent identically distributed sample, X1 , . . . , Xn i.i.d. as X ∼ F Pb and Pbn , empirical probability of a sample X1 , . . . , Xn B(n, θ), binomial distribution with parameters n and θ Ber(θ), Bernoulli distribution = B(1, θ) E(λ), exponential distribution with parameter λ (mean 1/λ) H(D, N, n), hypergeometric distribution with parameters D, N , n M(n, θ1 , . . . , θq ), multinomial distribution with parameters n, θ1 , . . . , θq N (µ, σ 2 ), normal (Gaussian) distribution with mean µ and variance σ 2 4 Introduction and Examples Chapter I ϕ, N (0, 1) density Φ, N (0, 1) df zα , αth quantile of Φ : zα = Φ−1 (α) N (µ1 , µ2 , σ12 , σ22 , ρ), bivariate normal (Gaussian) distribution N (µ, Σ), multivariate normal (Gaussian) distribution P(λ), Poisson distribution with parameter λ U(a, b), uniform distribution on the interval (a, b) d.f., degrees of freedom χ2k , chi-square distribution with k d.f. ≡, defined to be equal to ⊥, orthogonal to, uncorrelated 1(·), indicator function The OP , ≍P , and oP Notation The following asymptotic order in probability notation is from Section B.7. Let Un and Vn be random vectors in Rd and let | · | denote Euclidean distance. Un = oP (1) Un = OP (1) iff iff Un = oP (Vn ) iff Un = OP (Vn ) iff Un ≍P Vn iff Un = ΩP (Vn ) P Un −→ 0, that is, ∀ǫ > 0, P (|Un | > ǫ) → 0 ∀ǫ > 0, ∃M < ∞ such that ∀n P [|Un | ≥ M ] ≤ ǫ |Un | = op (1) |Vn | |Un | = OP (1) |Vn | Un = OP (Vn ) and Vn = OP (Un ) iff Un ≍P Vn Note that OP (1)oP (1) = oP (1), OP (1) + oP (1) = OP (1), L (I.1) and Un −→ U ⇒ Un = OP (1). Suppose Z1 , . . . , Zn are i.i.d. as Z with E|Z| < ∞. Set µ = E(Z), then Z̄n = 1 µ + op (1) by the weak law of large numbers. If E|Z|2 < ∞, then Z̄n = µ + Op (n− 2 ) by the central limit theorem. Section I.1 I.1 Tests of Goodness of Fit and the Brownian Bridge 5 Tests of Goodness of Fit and the Brownian Bridge Let X1 , ..., Xn be i.i.d. as X with distribution P . For one dimensional observations, the distribution function (df) F (·) = P [X ≤ ·] is a natural infinite dimensional parameter to consider. In Example 4.1.5 we showed how one could use the Kolmogorov statistic T (Fb , F0 ), where T (Fb , F ) ≡ supt∈R |Fb (t) − F (t)| and Fb is the empirical df, to construct a test of the hypothesis H : F = F0 . The test is designed so that one can expect it to be consistent against all alternatives, so that our viewpoint is fully nonparametric. In Example 4.4.6 we showed how to construct a simultaneous confidence band for F (·) using the pivot T (Fb , F ). In both cases we noted that the critical values needed for the test and confidence band could be obtained by determining the distribution of T (Fb , U), where U is the U nif [0, 1] distribution function under F = U nif [0, 1], and stated that these values could be determined by Monte Carlo simulation. How does T (Fb, F ) behave qualitatively? We will show in Section 7.1 that, although infinite dimensional, F (·) is a “regular” parameter. In this case, what “regular” means is that the stochastic process, √ (I.2) En (x) ≡ n (Fb (x) − F (x)), x ∈ R , converges in law in a strong sense (called “weak convergence!”) to a Gaussian process W 0 (F (·)). Here W 0 (u), 0 ≤ u ≤ 1, is a Gaussian process called the “Brownian bridge” with mean 0 and covariance structure given by, Cov(W 0 (u1 ), W 0 (u2 )) = u1 (1 − u2 ), u1 ≤ u2 . By “Gaussian” we mean that the distribution of W 0 (u1 ), . . . , W 0 (uk ) is multivariate normal for all u1 , . . . , uk . Note that Cov W 0 (F (x1 )), W 0 (F (x2 )) = Cov(En (x1 ), En (x2 )) = F (x1 )(1−F (x2 )), x1 ≤ x2 . The weak convergence of En (·) to W 0 F (·) , to be established in Section 7.1, will enable us to derive the Kolmogorov theorem, that when F = U nif [0, 1], T (Fb , F ) converges in law to L(sup{|W 0 (u)| : 0 ≤ u ≤ 1}), which is known analytically. This approach is based on heuristics due to Doob (1949) and developed in Donsker (1952). See also Doob (1953). We will discuss the heuristics in Section 7.1 and apply them to this and other examples in Section 7.2. These results will provide approximate size α critical values for the Kolmogorov statistics and other interesting functionals of distribution functions. The critical values yield confidence regions for distribution functions and related parameters. See Examples 4.4.6, 4.4.7 and Problems 4.4.17–4.4.19, 4.5.14–4.5.16. ✷ I.2 Testing Goodness of Fit to Parametric Hypotheses In Examples 4.1.6 and 4.4.6 we considered the important problem of testing goodness-offit to a Gaussian distribution H : F (·) = Φ( ·−µ σ ) for some µ, σ. We deduced that the 6 Introduction and Examples goodness-of-fit statistic Chapter I b sup |G(x) − Φ(x)|, x b is the empirical distribution function of (Z1 , . . . , Zn ) with Zi = (Xi −X̄)/b where G σ , has a null distribution which does not depend on µ and σ, so that critical values can be calculated by simulating from N (0, 1). In Section 8.2, we will consider other classes of hypothesis models which admit reasonable tests whose critical values can be specified without knowledge of which particular hypothesized distribution is true. However, when we consider the general problem of testing H : X ∼ P ∈ P where P = {Pθ : θ ∈ Θ} is a regular parametric model, we quickly come to situations where the methods of Chapter 8 will not apply. For instance, suppose that in the Gaussian goodness-of-fit problem above, our observations X1 , . . . , Xn which are i.i.d. as F (x) = Φ([x − µ]/σ) are truncated at 0, that is, we assume that we observe Y1 , . . . , Yn i.i.d. distributed as X|X ≥ 0 where X ∼ N (µ, σ 2 ). Then, P [0, ∞) = 1 and H is = 1 − Φ µ−t Φ µσ , t ≥ 0 P [Y ≤ t] ≡ Gµ,σ2 (t) = PP(0≤X≤t) (X≥0) σ = 0, t < 0. The only promising approach here is to estimate µ and σ 2 consistently using, for instance, maximum likelihood or the method of moments (Problem I.2.1) by µ b and σ b2 and estimate the null distribution of √ Tn ≡ sup n|Fb (t) − Gµb,bσ2 (t)| t≥0 by simulating samples of size n from N (b µ, σ b2 ) truncated at 0, i.e., keep as observations only the nonnegative ones. But can this method, called the “parametric bootstrap,” be justified? To answer this question we need to consider asymptotics: It turns out that, under H, Tn converges in law to a limit. This helps us little in approximating the null distribution of Tn since an analytic form for its limiting distribution is not available. But, as we show in Section 9.4, such results are essential in justifying the parametric bootstrap. The more important “nonparametric bootstrap” and other methods for simulating or approximately simulating observations from complicated distributions, often dependent on the data, such as Markov Chain Monte Carlo, are developed in Chapter 10. I.3 Regular Parameters. Minimum Distance Estimates We have seen in Chapters 5 and 6 how to establish asymptotic normality and approximate linearity of estimates that are solutions to estimation equations (M estimates) and then used these results to establish efficiency of the MLE under suitable conditions. There are many types of estimates which cannot be characterized as solutions of estimating equations. Examples we have discussed are the linear combinations of order statistics, such as the trimmed mean introduced in Section 3.5, n−[nα] X̄α = (n − 2[nα])−1 X i=[nα]+1 X(i) Section I.3 7 Regular Parameters. Minimum Distance Estimates where X(1) ≤ · · · ≤ X(n) are the ordered Xi . Here is another important class of such estimates. Suppose the data X ∈ X have probability distribution P . Let {Pθ : θ ∈ Θ, Θ ⊂ Rq } be a regular parametric model. There may be a unique θ such that Pθ = P , or if not, we choose the θ that makes Pθ closest to P in some metric d on the space of probability distributions on X . For instance, if X = R, examples of such metrics are d∞ (P, Q) = sup |P (−∞, t] − Q(−∞, t]| (I.3) t and d22 (P, Q) = Z ∞ −∞ (P (−∞, t] − Q(−∞, t])2 ψ(t)dt where ψ(·) is a nonnegative weight function of our choice with (I.4) R∞ −∞ ψ(t)dt < ∞. A minimum distance estimate θ(Pb ) is obtained by plugging the empirical distribution distribution Pb into the parameter θ(P ) = arg min{d(P, Pθ ) : θ ∈ Θ} where we assume that d(Q, Pθ ) is well defined for Q in M, a general class of distributions containing all distributions with finite support. Thus M contains the probability distribution P generating X and the empirical probability Pb. See Problems 7.2.10 and 7.2.18 for examples and properties of minimum distance estimates θ(Pb ). These problems show √ n consistency of θ(Pb ). They also show that θ(Pb) may not have a linear approximation in the sense of Section 7.2.1, and they may not be asymptotically normally distributed. Note that theRminimum contrast estimates of Section 2.1 are of this form but, in this case, d(Q, Pθ ) = ρ(x, θ)dQ(x), which is not a metric, but is linear in Q, whereas metrics are not. Can minimum distance estimates θ(Pb ) be linearized in the sense of (6.2.3), and are they asymptotically Gaussian as we have shown M estimates to be in Section 6.2.1 and 6.2.2? When this is true asymptotic inference is simple as we have seen in Section 6.3. We have effectively studied this question for X finite in Theorem 5.4.1. To do the general case, we need to extend the notion of Taylor expansion to function spaces, and apply so called maximal inequalities discussed in Section 7.1. In fact, we shall go further and examine function valued estimates such as the quantile function and study conditions under which these can be linearized and shown to be asymptotically Gaussian in the sense of weak convergence which will be rigorously defined in Section 7.1. Moreover, we want to conclude that asymptotic Gaussianity holds uniformly in a suitable sense. This is an important issue which we did not focus on in Volume I. As the Hodges Example 5.4.2 shows, it is possible to have estimates whose asymptotic behavior is not a good guide to the finite n case because of lack of uniformity of convergence when we vary the underlying distribution. In Section 9.3 we will be concerned with regular parameters θ(P ), ones whose plug 1 in estimates θ(Pb) converge to θ(P ) at rate n− 2 uniformly over a suitable subset M0 of a nonparametric family M of probability distributions. 8 Introduction and Examples Chapter I Definition I.1. θ(Pb) converges to θ(P ) at rate δn over M0 ⊂ M iff for all ε > 0 there exists c < ∞ such that sup{P [ |θ(Pb) − θ(P )| ≥ cδn ] : P ∈ M0 } ≤ ε . There is another important set of questions having to do with the extension of the notion of efficient estimation from parametric to non- and semiparametric models. For instance, consider the parameter corresponding to a minimum contrast estimate Z ν(P ) = argmin ρ(x, θ)dP (x) : θ ∈ Θ . Suppose that, as usual, we assume ν is defined for all P ∈ M, a nonparametric model. Is there any estimate of ν(P ) which behaves regularly and yet is able to achieve smaller asymptotic variance than the minimum contrast estimate ν(Pb ) at some P in M? Note that this is not the same question as asking whether ν(Pb ) is improvable by another such estimator η(Pb) such that η(Pθ ) = ν(Pθ ) = θ on a parametric model {Pθ : θ ∈ Θ}. Intuitively, ν(Pb ) is to first order the only regular estimate of ν(P ) on all of M and should not be and indeed, is not improvable. Regularity is essential here to exclude the Hodges Example phenomena. We develop this theme further in the context of semiparametric as well as nonparametric models in Section 9.3. I.4 Permutation Tests In Section 4.9.3 we considered the Gaussian two-sample problem. We want to compare two samples X1 , . . . , Xn1 (control) and Y1 , . . . , Yn2 (treatment) from distributions F and G and, in particular, want to test the hypothesis H : F = G of no treatment effect. We studied the classical case in which F and G were N (µ, σ 2 ), N (µ + ∆, σ 2 ), respectively. Then H becomes ∆ = 0 and we arrived at the classical two-sample t-test. In Example 5.3.7 R we showed that, even if F = G was not Gaussian, if x2 dF (x) < ∞, the asymptotic level of the test is preserved as n1 , n2 → ∞. This can be thought of as a result for testing the hypothesis that the semiparametric model {(F, R G) : F = G} R holds within the full nonparametric model {(F, G) : F, G arbitrary, x2 dF (x) < ∞, y 2 dG(y) < ∞}. But is it possible to construct tests with reasonable properties which have level 0 < α < 1 for fixed n1 , n2 and all F, G? The answer is yes. We shall study such permutation tests and their simple special subclass, the rank tests, in Sections 8.2 and 8.3 in the context of classes of composite semiparametric and parametric hypotheses H : P ∈ P0 ⊂ P which allow the construction of tests whose null distribution does not depend on where we are in P0 . We have, in fact, already seen examples of such parametric hypotheses in the Gaussian oneand two-sample problems with unknown variance. I.5 Estimation of Irregular Parameters We now show that the phenomena we encounter with irregular parameters are not simply a function of infinite dimension but rather manifest themselves as soon as the parameter Section I.5 9 Estimation of Irregular Parameters space complexity p, as measured naively by parameter space dimension, is comparable to the information in the data as measured naively by the sample size n. Although the distribution function is a reasonable object to study for X = R, a much more visually informative parameter, even for this dimension and certainly for X = Rd with d > 1, is the density function, assuming that we postulate that only P which have continuous case densities p (in the usual sense) are considered. If P ∈ P = {Pθ : θ ∈ Θ} is a regular parametric model and Pθ has density p(·, θ), we can estimate the density by plugging in, b where θb is the MLE. If P is essentially nonparametric, there is no natural say, pb = p(·, θ) extension of the function valued parameter ν(P ) ≡ p(·) to all of the nonparametric class M since, if Pb denotes the empirical probability (2.1.15), ν(Pb ), the density of Pb , has no meaning. This leads to a strategy of “regularization” by which we first approximate ν(P ) on P by νn (P ), which extends smoothly to νen on M, i.e., for which νen (Pb) makes sense and yet νn (P ) − ν(P ) → 0 in some uniform sense as n → ∞. “Regularization” refers precisely to the change of estimation from the “irregular” ν to the “regular” νn . In essence, there is now a natural decomposition of the estimation error, νen (Pb ) − ν(P ) = (e νn (Pb) − νn (P )) + (νn (P ) − ν(P )). (I.5) The first term in parenthesis can be interpreted as the source of random variability, and the second as that of deterministic “bias.” We will loosely refer to this as the “bias-variance decomposition.” For instance, consider the usual histogram estimate of a one dimensional density p(·), pbh (t) = Pb [Ij (t)]/h where Ij = (jh, (j + 1)h], −∞ < j < ∞, and Ij (t) is the unique Ij which contains t. Now pbh (t) is the plug-in estimate for the parameter, ph (t) ≡ P [Ij (t)]/h. Of course, ph 6= p for h > 0 but if h = hn ↓ 0, then ph (t) → p(t) for all t and phn (·) is a sequence of approximating parameters. Now, E(b ph (t) − ph (t)) = 0, Var pbh (t) = ph (t)(1 − hph (t))/hn. Thus, the “variance” part of the decomposition tends to 0 only if h → 0 slower than n−1 . On the other hand, the rate of convergence of the “bias” part to 0 BIAS(h) ≡ 1 h Z (j+1)h jh (p(s) − p(t))ds is fastest when h → 0 fastest. In fact, typically, at best h−1 BIAS(h) → c 6= 0 (Problem I.5.1). So we see a tension present here in choosing the rate at which h → 0, which is a new phenomenon whose consequences we shall investigate in Chapter 11 and 12. Irregular parameters play a critical role in nonparametric regression, classification, and prediction which we shall also study in Chapters 11 and 12, as well as even in some aspects 10 Introduction and Examples Chapter I of regular parameter estimation in semiparametric models. As mentioned in Section I.3, the distinction between regular and irregular is loose, roughly corresponding to the distinction 1 between parameters which can at least asymptotically be estimated unbiasedly at rate n− 2 1 b and for which the usual asymptotic Gaussianb ≍ n− 2 for some θ, in the sense that MSE(θ) based methods of inference apply straightforwardly; and those which cannot be treated in this way. I.6 Stein and Empirical Bayes Estimation For conceptual reasons, we consider the analysis of variance p-sample Gaussian model (Example 6.1.3), specified by X = {Xij : 1 ≤ i ≤ n, 1 ≤ j ≤ p} where the Xij are independent, N (µj , σ02 ), with µp = (µ1 , . . . , µp )T unknown. We write P(n, p) for this class of distributions for X. The MLE of µp is X̄p ≡ (X·1 , . . . , X·p )T Pn −1 2 where X·j ≡ n i=1 Xij . Evidently, X̄p ∼ N (µp , (σ0 /n)J) where Jp×p is the iden2 tity. Let our loss function be l(µp , d) = |µp − d| /p, where |t| is the Euclidean norm of the vector t. Then, the MSE is, p σ2 1X E(X·j − µj )2 = 0 . R(µp , X̄p ) = p j=1 n (I.6) We can show X̄p is minimax (Problem I.6.1) for each n and p and asymptotically efficient as n → ∞ for p fixed. But, even if p is only ≥ 3, a remarkable phenomenon discovered by Stein (1956(b)) occurs: X̄p is not admissible, whatever be n. That is, there exist minimax estimates δ ∗ (X̄p ) such that R(µp , δ ∗ (X̄p )) < σ02 /n for all µp . A famous simple and intuitively reasonable estimate is Stein’s positive part estimate, p−2 ∗ δ (X̄p ) = 1 − X̄p (I.7) |X̄p |2 + where if y is a scalar, y+ ≡ max(y, √ 0). This estimate shrinks X̄p towards 0 and if the distance of X̄p from 0 is smaller than p − 2 declares the estimate to be 0. This result can be made more plausible by considering why X̄p becomes a poor estimate as p → ∞ for fixed n. Suppose n is fixed. Because X̄p is sufficient and normally distributed with independent components we can without loss of generality set n = 1. In this case we write X1 , . . . , Xp for the data. Put a prior distribution Π on Rp according to which the µi are i.i.d. with density π0 on R. If π0 is known, the posterior mean of (µ1 , . . . , µp ) given (X1 , . . . , Xp ), which minimizes the Bayes risk, is δ 0 (X1 , . . . , Xp ) = (δ0 (X1 ), . . . , δ0 (Xp )) where x−µ µ φ π0 (µ)dµ σ0 f ′ (x) −∞ = x + σ02 0 δ0 (x) = R ∞ x−µ f0 (x) π0 (µ)dµ −∞ φ σ0 R∞ (I.8) Section I.7 and 11 Model Selection 1 f0 (x) = σ0 Z ∞ −∞ φ x−µ σ0 π0 (µ)dµ , the marginal density of X1 (Problem I.6.2). The Bayes risk of this estimate is just r(π0 , δ0 ) ≡ σ02 [1 − σ02 I(f0 )] < σ02 (I.9) R where I(f0 ) = {[f0′ (x)]2 /f0 (x)} dx, the Fisher information for location of f0 (Problem I.6.4). Suppose now that π0 is unknown. We shall show, in Chapter 12, following Robbins (1956), that we can construct estimates fb0′ /fb0 using X1 , . . . , Xp which, when plugged in to (I.8), yield a purely data dependent estimate δb0 , such that if r(π0 , δ) denotes Bayes risk (see Section 3.3), then r(π0 , δb0 ) → r(π0 , δ0 ) as p → ∞ for all π0 . This strictly improves the performance of X̄p for n = 1, for all π0 (but not uniformly). Here δb0 is an example on an empirical Bayes estimate. What if both p and n tend to ∞? There is no change in our conclusion that δ0 is optimal if π0 is fixed and known. √ An alternative and more informative analysis leads us to consider priors π0n such that nµ/σ0 , the signal to noise ratio, stabilizes. The extent to which we can or want to try to estimate π0n now depends on the family of priors. We shall consider this as well as related questions in the context of the so called Gaussian white noise model in Chapter 12. The next section looks at this situation from a different point of view. I.7 Model Selection How complex should our model be? Usually this question can be reduced to asking how many parameters should be included in the model. We continue to consider the model P(n, p) of Section I.6. We identify the model with its parameter space Rp for (µ1 , . . . , µp ). Consider nested submodels Pt (n, p), 0 ≤ t ≤ p, specified by ωt = {µp : µt+1 = · · · = µp = 0}. Here t is unknown and is to be selected on the basis of the data X. Consider the problem of estimating µp as a vector with quadratic loss l(µp , d). If we knew t and, in fact, that µp ∈ ωt , t < p we could use δt (X) = (X·1 , . . . , X·t , 0, . . . , 0)T and obtain R(µp , δt ) = pσ 2 tσ02 < 0 = R(µp , δp ), µp ∈ ωt n n where R is the risk function R(µ, δ) = E|δ(X) − µ|2 and | · | denotes Euclidean distance. This seemingly artificial appearing situation is, in fact, a canonical model for a general linear model replicated n times: Yki = p X j=1 zkj βj + eki , 1 ≤ k ≤ p, 1 ≤ i ≤ n 12 Introduction and Examples Chapter I where the eki are i.i.d. N (0, σ02 ) and the zkj are distinct covariate (predictor) values. Then, b denotes the MLE, if β p 2 b ≡ (βb1 , . . . , βbp )T ∼ Np (β , σ0 [ZT Zp ]−1 ) β p p n p where β p = (β1 , . . . , βp )T , Zp = kzkj kp×p , and our model P(n, p) is the special case ZTp Zp = Jp×p . Our submodels {βp : βt+1 = · · · = βp = 0} are natural if we think the p predictors or covariates Z1 , . . . , Zp whose influence on the distribution of Y is governed by the βj can be ordered in terms of importance and we expect (p − t) of them to be independent of the response Y . In the context of the model P(n, p), the questions we address briefly now and more extensively in Chapter 12 are (1) Suppose p = ∞ (the possible number of predictors we can measure is “very large”) and we believe that µ ∈ ωt for some t < ∞. That is, µ ∈ {µ : µj = 0, j > t}. Can we, without knowledge of t, estimate t by b t so that, as n → ∞, E|δbt (X) − µ|2 = tσ02 (1 + o(1)) n (I.10) P∞ where |a|2 = j=1 a2j for a = (a1 , a2 , . . .)T , that is, can we asymptotically do as well not knowing t as knowing it? The optimal procedure for the case where t is known is called the oracle solution. P∞ (2) Suppose that all µj can be nonzero but j=1 µ2j < ∞. Then, we can write the risk as Vn (t, µ) ≡ E|δ t (X) − µ|2 = ∞ X tσ02 + µ2j n j=t+1 (I.11) There is clearly a best t(µ, n), one such that t(µ, n) = arg min Vn (t, µ). t Note that t(µ, n) = ∞, that is, estimating each µj by X·j is always a bad idea! Putting t = 0 will always do better. Can we select b t(n) such that E|δ bt(n) (X) − µ|2 Vn (t(µ, n), µ) −→ 1 (I.12) as n → ∞? For question (1) we want (I.10) and (I.12) to hold uniformly over “moderately large” sets of µ. It is natural to consider b t of the form: b t is the largest k such that for a suitable decreasing sequence {cn } of positive numbers, |X·j | ≤ cn for all j such that k+1 ≤ j ≤ n. Suppose p ≤ n. Finding the cn such that this b t solves (I.10) then turns out to lead to a special case of the solution to Schwarz’s Bayes criterion (SBC) (1978) which also is called Section I.7 13 Model Selection the “Bayes information criterion” (BIC). Let Pµ denote computation under µ; then take cn such that P0 [max{|X·j | : 1 ≤ j ≤ n} ≥ cn ] → 0 (I.13) Pµ [|X·j | ≤ cn ] → 0 . (I.14) and Since, for j ≥ t + 1, the X·j are i.i.d. N (0, σ02 /n) it is easy to see (Problem I.7.1) that p cn = σ0 (2 log n)/n will achieve (I.10) without knowledge of t. To see this compute for µ ∈ ωt , Pµ [b t 6= t] = Pµ [b t < t] + Pµ [b t > t] (I.15) ≤ Pµ [|Xt | ≤ cn ] + Pµ [max{|X·j | : t + 1 ≤ j ≤ n} > cn ] (Problem I.7.2). So, (I.13), (I.14), and (I.15) establish our answer to question (1). The solution to question (2) is subtler and somewhat different and was first proposed in a time series contextPby Akaike (1969) and in the regression context by Mallows (1973). n We will show that j=t+1 X·j2 can be used to obtain an unbiased estimate of Vn (t, µ). Evidently n n X X σ2 µ2j . Eµ X·j2 = (n − t) 0 + n j=t+1 j=t+1 Therefore, 2 σ 0 X·j2 + 2t − σ02 = Vn (t, µ), Eµ n j=t+1 n X and it seems plausible since σ02 doesn’t depend on t that minimizing the contrast ρn (X, k) ≡ n X j=k+1 X·j2 + 2k σ02 , n 1≤k≤n will yield bb t which achieves (I.12). This can be shown under suitable conditions (Shibata (1981)). However, it is interesting to note that this solution, which is called Mallows’ Cp , is quite different from the SBC solution. Because ρn (X, j) − ρn (X, j − 1) = −X·j2 + 2 σ02 n then ρn (X, j) ≤ ρn (Xn , j − 1) iff (nX·j2 /σ02 ) ≥ 2. That is, we prefer j parameters to j − 1 √ √ √ √ iff n|X·j | ≥ 2. Thus, Mallows’ Cp essentially uses a threshold of 2 on n|X·j |/σ0 14 Introduction and Examples Chapter I √ rather than the corresponding threshold 2 log n → ∞ for SBC. These methods and others will be discussed and contrasted with oracle solutions in Chapter 12. ✷ Remark I.7.1. Mallows’ Cp is usually discussed in the context of squared prediction error rather than squared estimation error. These two criteria are equivalent under general conditions; see Problem I.7.5. Remark I.7.2. More generally we can consider risks other than those based on squared error. We shall do so in Chapter 12. Remark I.7.3. This introductory chapter has barely touched on some of the main topics of Chapter 12. These topics include what is referred to as “Statistical Learning,” See Section 12.1 for an introduction to this topic. Summary We have in this chapter introduced, through examples, some of the main issues, topics, and procedures to be considered in this volume. One issue is the level of accuracy possible in the estimation of a parameter. In Section I.3, we illustrated regular parameters, √ which are parameters that are relatively easy to estimate in the sense that n times the estimation error is well behaved as n → ∞. We also discussed, in Section I.1, Doob’s heuristic, which suggests how to approximate the distribution of a functional of a sample-based stochastic process by the distribution of the functional of the limiting stochastic process. Other methods for obtaining approximations, Monte Carlo Methods and the bootstrap, are illustrated in Section I.2. We also gave, in Section I.3, important parameters that are regular but nevertheless do not fall under the framework of Vol. I and discussed efficient estimation of parameters for a given model. In Section I.4 we discussed situations where we can construct tests, called permutations tests, whose significance level α remain the same for every member of a general class of distributions P0 . An important subclass of the permutation tests is the rank tests. In Section I.5 we illustrated irregular parameters, and the notion of plugging into approximations of irregular parameters such as densities. In this context we introduced the bias variance tradeoff. In Section I.6 we introduced Stein estimation, which, in our illustration with p Gaussian population means, consists of providing vector estimates with smaller average mean squared error than the vector of p sample means. We then connected this approach to the empirical Bayes approach where we solve the Bayes estimation problem with p Gaussian means assuming the prior is known and then use the data to estimate the unknowns in the Bayes procedure. Finally, in Section I.7 we considered model selection where the question is how many parameters should be included in a model used to analyze the results of an experiment. A complex model with many parameters may represent the true distribution of the data better than a simpler model with fewer parameters. However, we illustrated that trying to estimate too many parameters for the given sample size may result in worse inference than fitting a simpler model providing a less adequate approximation to the truth. This is another reflection of the bias-variance tradeoff discussed in Section I.5. We used a simple p-sample framework to outline the derivations of two procedures that give the most accurate result in the sense of minimizing average mean squared error for the problem of estimating a vector of means. The first method, which applies to the situation where all but a finite number of parameters are zero, turns out to be a special case of the method generated by Schwarz’s Bayes criterion (also called BIC), while the Section I.8 15 Problems and Complements second method, which allows for an infinite sequence of positive means, yields a special case of Mallows’ Cp . I.8 PROBLEMS AND COMPLEMENTS Problems for Section I.1 P 1. Let X1 , . . . , Xn be i.i.d. as X ∼ F , let Fb (x) = n−1 ni=1 1[Xi ≤ x] be the empirical √ b distribution function, and let En (x) = n[F (x) − F (x)], x ∈ R. L (a) Show that for fixed x0 , En (x0 ) → N (0, F (x0 )[1 − F (x0 )]) (b) Find Cov(En (x1 ), En (x2 )) for x1 ≤ x2 . (c) Find the limiting law of (En (x1 ), En (x2 )) for x1 ≤ x2 using the bivariate central limit theorem. 2. Let X1 , . . . , Xm be i.i.d. as X ∼ F , Y1 , . . . , Yn i.i.d. as Y ∼ G, and let the X’s and b be the empirical df ’s, define Y ’s be independent (see Example 1.1.3). Let Fb (·) and G(·) N = m + n, and r mn b EN (t) = {G(t) − Fb (t) − [G(t) − F (t)]} N Assume that m N → λ as N → ∞ with 0 < λ < 1. (a) Show that for fixed t0 , as N → ∞, p L √ EN (t0 ) → λZ1 (t0 ) + (1 − λ)Z2 (t0 ) where Z1 (t0 ) and Z2 (t0 ) are independent with Z1 (t0 ) ∼ N (0, G(t0 )[1 − G(t0 )]) and Z2 (t0 ) ∼ N (0, F (t0 )[1 − F (t0 )]). (b) Describe the weak limit of EN (t). Problems for Section I.2 1. Let X1 , . . . , Xn be i.i.d. as X ∼ F (x) = Φ([x−µ]/σ). Suppose we observe Y1 , . . . , Ym i.i.d. as Y ∼ (X|X ≥ 0). Let µ = E(X), σ 2 = Var(X), and θ = (µ, σ 2 )T . (a) Express E(Y ) and Var(Y ) as functions of θ. By replacing E(Y ) and Var(Y ) by Y and σ bY2 in these two equations and solving them for θ numerically, we have method of moment estimates of µ and σ 2 . (b) Use Proposition 5.2.1 to outline an argument for the consistency of the estimates of µ and σ 2 in part (a). 16 Introduction and Examples Chapter I (c) Use Theorem 5.2.3 to outline an argument for the consistency of the MLE of θ. Problems for Section I.5 1. Suppose p is positive and Lipschitz continuous over Ij , that is, for some γ > 0 and all x, z ∈ Ij , |p(x) − p(z)| ≤ γ|x − z|. Show that (a) BIAS pbh (t) = O(h), all t ∈ Ij . (b) Var pbh (t) = O((nh)−1 ), all t ∈ Ij . (c) If p(x) is not constant on Ij , then for some constant c > 0, c < inf h−1 |BIASb ph (t)| ≤ sup h−1 |BIASb ph (t)| ≤ c−1 . h>0 (d) Let t be as in (c) above. Assume that p′ exists and that 0 < |p′ (x)| ≤ M < ∞ for x in a neighbourhood of t. Show that 2 infh MSE pbh (t) = Var pbh (t) + [BIAS pbh (t)]2 ≍ (n− 3 ) , where An ≍ Bn iff An = O(Bn ), Bn = O(An ). Hint (a) and (b). By the mean value theorem, for some x0 ∈ Ij , Z p(x)d(x) = hp(x0 ) . P [Ij (t)] = Ij Now show that |BIAS pbh (t)| ≤ γ|x0 − t| ≤ γh, Var pbh (t) ≤ p(x0 ) . nh Hint (d). Show that MSE pbn (t) = b(nh)−1 + ch2 plus smaller order terms for some constants b and c. Now minimize A(h) ≡ b(nh)−1 + ch2 with respect to h. Problems for Section I.6 1. Show that X̄P is the minimax estimate for the model P(n, p) and the risk (I.6). Hint. See Example 3.3.3. 2. Derive formula (I.8). Hint. The first equality is from (3.2.6). 3. Show that the Bayes risk for estimating q(θ) with quadratic loss when θ has prior π is Z 2 q(θ) − δ ∗ (x) π(θ)p(x, θ)dθdx, EVar(q(θ|X) = where δ ∗ (x) is the Bayes estimate of q(θ). Section I.8 17 Problems and Complements 4. Derive formula (I.9). Hint. Use (I.8), (5.4.32), and Problem I.6.3. Problems for Section I.7 1. Show that if X1 , . . . , Xn are i.i.d. N (0, 1), then p P [max(X1 , . . . , Xn ) ≤ 2 log n] → 1 . Hint. Use the following refinement of (7.1.9) (Feller (1968), p. 175) (x−1 − x−3 )φ(x) ≤ (1 − Φ(x)) ≤ x−1 φ(x) for all x > 0. 2. Verify (I.15). Hint. If Z1 , . . . , Zn are i.i.d. N (0, 1), P [max{|Zi | : 1 ≤ i ≤ n} ≥ z] ≤ n P [|Z1 | ≥ z]. Use 1 − Φ(z) ≤ φ(z)/z for all z > 0. 3. Selecting the model to minimize MSE. The t-test revisited. Consider the simple linear regression model (e.g. Example 2.2.2) Yi = β0 + β1 zi + εi , i = 1, . . . , n (I.8.1) 2 where ε1 , . . . , εn are i.i.d. P as ε ∼ N (0, σ ) and the zi ’s are not all equal. Without loss of generality we assume zi = 0 and β0 = E(Y ). We are interested in estimating µi ≡ µ(zi ) ≡ E(Yi ) = β0 + β1 zi , i = 1, . . . , n. In this problem we assume that (I.8.1) is the true model. Even so, it may be that the MLE µ b10 = βb0 = Y based on the submodel with β1 = 0 has smaller risk than the MLE b = (b µ1 , . . . , µ bn )T we define µ b1i ≡ Y + βb1 zi based on the full model. For any estimate µ T the risk for estimating µ = (µ1 , . . . , µn ) as n b = R(µ, µ) 1 1X b − µ|2 = E|b µ i − µ i |2 E|µ n n i=1 where the expectation is computed for the full model (I.8.1). b j = (b (a) Show that for µ µj1 , . . . , µ bjn )T , (1 − j)β12 (1 + j)σ 2 + n n Hint. See Example 6.1.2, pages 381–382. bj) = R(µ, µ Pn 2 i=1 zi , j = 0, 1 b j ), j = 0, 1, are (b) Show that unbiased estimates of R(µ, µ b b 0 ) ≡ n−1 [RSS0 − RSS1 ] = n−1 |b b 0 |2 = βb12 R(µ, µ µ1 − µ (I.8.2) Pn b b 1 ) = 2s2 /n R(µ, µ Pn bji ]2 , j = 0, 1, and s2 = RSS1 /(n − 2). where RSSj = i=1 [Yi − µ Hint. See Section 6.1.3. 2 i=1 zi n 18 Introduction and Examples Chapter I b b1) < (c) Show that the model selection rule that decides to keep β1 in the model iff R(µ, µ b b 0 ) is equivalent to a likelihood ratio test of H : β1 = 0 vs K : β1 6= 0. Show R(µ, µ that the rule and the test can be written as “Keep β1 in the model iff t2 > 2” Pn √ where t = βb1 /(s/ nb σz ) with σ bz2 = n−1 i=1 zi2 is the t-statistic for regression. Hint. See Example 6.1.2, pages 381–382. (d) Show that√the limit as n → ∞ of the significance level of the test in (c) equals P (|Z| > 2) = 0.16 where Z ∼ N (0, 1). That is, we decide to keep β1 if the p-value is less than 0.16. Show this result without the normality assumption. Instead assume that ε1 , . . . , εn are i.i.d. as ε with E(ε) = 0, E(ε2 ) = σ 2 , E(ε4 ) < ∞. Also assume maxi {zi2 /Σzi2 } → 0. Hint. Use Lindeberg-Feller and Slutsky’s theorems. See Example 5.3.3. √ b 1 ) < R(µ, µ b 0 ) iff τ 2 > 1 where τ = β1 /(σ/ nb (e) Using (a), show that R(µ, µ σz ) is the noncentrality parameter of the distribution of the t-statistic. Hint. See page 260, Section 4.9.2. (f) Show that τb2 ≡ ct2 − 1 with c = (n − 4)/(n − 2) is an unbiased estimate of τ 2 . Using τb2 , deciding in favor of keeping β1 is now equivalent to t2 > 2(n − 2)/(n − 4) for n ≥ 5. p L 2 , and Z and V are Hint. t =(Z + τ )/ V /(n − 2) where Z ∼ N (0, 1), V ∼ Xn−2 independent. Use Problem B.2.4. b b b 1 ) < R(µ, b 0 ) is also equivalent to r2 > 2/n, where r2 is the sam(g) Show that R(µ, µ µ ple correlation coefficient and we assume n ≥ 3. Also show t2 ≥ 2(n − 2)/(n − 4) iff r2 ≥ 2/(n − 2). P Hint. t2 = nβb12 σ bz2 /RSS1 and r2 = βb12 σ bz2 /b σY2 where σ bY2 = n−1 ni=1 (Yi − Y )2 . Now use the ANOVA decomposition nb σY2 = RSS1 + βb2 nb σz2 (see Section 6.1.3). 4. Selecting the model to minimize MSE. The F-test revisited. Consider the linear model (see 6.1.26) (I.8.3) Y = Z1 β1 + Z2 β2 + ε where Z 1 is n×q and Z 2 is n×(p−q), β 1 is q×1, β 2 is (p−q)×1, and ε = (ε1 , . . . , εn )T with ε1 , . . . , εn i.i.d. as ε ∼ N (0, σ 2 ). The entries of Z 1 and Z 2 are constants. Assume that Z ≡ (Z 1 , Z 2 ) has rank p. We are interested in estimating the effect of the covariates on the response mean E(Y ). Thus our parameters are (1) (2) µi ≡ µ(z i ) ≡ E(Yi ) = z i β 1 + z i β 2 , i = 1, . . . , n (j) where z i is the ith row of Z j , j = 1, 2. It may be that the coefficients in β 2 are so small that in the bias-variance tradeoff we get more efficient estimates of the µi ’s if we use the Section I.8 19 Problems and Complements b 0 = H1 Y model with β 2 = 0. Thus we want to compare the risks of the two estimates µ b = HY where H1 and H are the hat matrices (e.g., H1 = Z 1 (Z T1 Z 1 )−1 Z T1 ) for and µ the models with β2 = 0 and general β = (β T1 , β T2 )T , respectively. The risk for estimating µ = (µ1 , . . . , µn )T is b ) = n−1 E|b R(µ, µ µ − µ|2 = n−1 n X i=1 E[b µ i − µ i ]2 where the expectation is computed for the full model. (a) Set µ0 = H1 µ. Show that b0) = Rq ≡ Rq (µ, µ qσ 2 |µ − µ0 |2 + , 1≤q≤p. n n b0 = µ b , and Rp (µ, µ b ) = pσ 2 /n. Note that when q = p, then µ Hint. See (6.1.15). P (b) (i) Let s2 = [n − (p + 1)]−1 [Yi − Ybi ]2 where Ybi is the predicted value of Yi in the full model (I.8.3). Show that an unbiased estimate of Rp − Rq for model (I.8.3) is 2 b−µ b 0 |2 bp − R bq = 2(p − q)s − |µ R . n n bp < R bq is equivalent to (ii) Show that deciding to keep Z2 β2 in the model when R deciding β 2 = 0 when F > 2(p − q) where “F > 2(p − q)” is a likelihood ratio test of H : β2 = 0 vs K : β 2 6= 0. Hint. See Example 6.1.2. (c) (i) Use (a) to show that Rp < Rq is equivalent to θ2 > p − q, where θ2 = σ −2 |µ−µ0 |2 is the noncentrality parameter of the distribution of the F -statistic of Example 6.1.2. (ii) Show that θb2 = (n − p)−1 (p − q)(n − p − 2)F − (p − q) is an unbiased estimate of θ2 . Thus, using θb2 > (p − q), we select the full model iff F > (p − q + 1)(n − p) , n>p+2. (p − q)(n − p − 2) P 2 −1 −1 Hint. By Problem 8.3.13, F = [(p − q)−1 (Z1 + θ)2 + p−q V ] i=1 Zi (n − p) 2 where Zi ∼ N (0, 1), V ∼ Xn−p , and Z1 , . . . , Zp−q , V are independent. 5. Prediction and estimation are equivalent for squared error. Assume that E(Yi ) = µi and Var(Yi ) = σi2 where µi depends on a vector z i of covariates while σi2 does not, i = b be any estimate of µ 1, . . . , n. Let Y = (Y1 , . . . , Yn )T , µ = (µ1 , . . . , µn )T , and let µ based on Y with 0 < Varb µi < ∞ for i = 1, . . . , n. Let Y10 , . . . , Yn0 be variables to be predicted using Y1 , . . . , Yn . We assume that Y 0 = (Y10 , . . . , Yn0 )T is independent of Y , 20 Introduction and Examples Chapter I b to predict and that Yi0 has the same mean and variance as Yi , i = 1, . . . , n. We will use µ Y 0 and define the mean squared prediction error as MSPE = n−1 E|b µ − Y 0 |2 . Show that selecting the model (covariates, as in Problem 4 preceding) that minimizes MSPE is equivalent to selecting the model (covariates) that minimizes the mean squared error MSE = n−1 E|b µ − µ|2 . Hint. µ bi − Yi0 = [µi − Yi0 ] + [b µi − µi ]. Complete the square keeping the square brackets intact. Because Yi0 is independent of µ bi , we have E[b µi − Yi0 ]2 = σi2 + E[b µ i − µ i ]2 . Pn Note that the equivalence fails if i=1 σi2 depends on the covariates. I.9 Notes Notes for Section I.1. (1) Important exceptions are Sections 1.4 and 6.6. (I.8.4) Chapter 7 TOOLS FOR ASYMPTOTIC ANALYSIS In this chapter we will present the basic tools needed for most of our subsequent chapters. These tools include empirical process theory and maximal inequalities, the delta method in infinite dimensional space, influence functions, functional derivatives, and Taylor type expansions in function space including the von Mises and Hoeffding expansions. These tools are essential for developing asymptotic inference for semiparametric models. Some of the proofs and references will be deferred to Appendix D. 7.1 Weak Convergence in Function Spaces 7.1.1 Stochastic Processes and Weak Convergence As we discussed in Section I.1, when our inference involves estimates of functions such as the distribution F and functionals ν(F ) such as ν(F ) = supx |F (x)− F0 (x)|, we are faced with problems involving the convergence of stochastic processes such as √ { n[Fb (x) − F0 (x)] : x ∈ R} . We now turn to this subject. Our treatment largely parallels that of van der Vaart and Wellner (1996). Our story has to do with sequences of collections of random variables {Z(t) : t ∈ T }, also written as {Z(·)} or {Z(t; w) : t ∈ T, w ∈ Ω}, defined on a suitable probability space (Ω, A, P ), which have a type of approximability property called separability by Doob (1953). Specifically, {Z(t) : t ∈ T } is separable if there exists a countable set C ≡ {tj ∈ T : j ≥ 1} such that for any open interval (a, b) ⊂ R and any S ⊂ T, ω ∈ Ω0 : Z(t; w) ∈ (a, b), all t ∈ S = ω ∈ Ω0 : Z(t; w) ∈ (a, b), all t ∈ S ∩ C (7.1.1) where Ω0 = Ω − N with N a set in A with P (N ) = 0. See Doob (1953) or van der Vaart and Wellner (1996) for more details. 21 22 Tools for Asymptotic Analysis Chapter 7 The reason we need to introduce separability is that if T is uncountable, the distribution of functionals such as sup{Z(t) : t ∈ T } is not determined by the finite dimensional distributions. For instance, suppose Z(t) = c > 0 for all 0 ≤ t ≤ 1. Let τ be uniform on T = [0, 1] independent of Z(·) and let Z ∗ (t) = Z(t) , ∗ Z (τ ) = 2c . t 6= τ Then Z ∗ (t) = c with probability 1, but sup{Z ∗ (t) : t ∈ T }. Doob essentially shows that for certain collections of finite dimensional distributions Lt1 ,...,tk , tj ∈ T , 1 ≤ j ≤ k, there is, on a suitable probability space, a separable stochastic process {Z(t) : t ∈ T } with the same finite dimensional distributions — see Doob (1953) for an extensive discussion. What (7.1.1) means in addition to guaranteeing that we can rigorously talk about the probability of the left hand side of (7.1.1), is that if, for all {ti1 , . . . , tim } ⊂ S ∩ C, all m ≥ 1, P [Z(tij ) ∈ (a, b), 1 ≤ j ≤ m] ≥ 1 − ǫ, then P [Z(t) ∈ (a, b) for all t ∈ S] ≥ 1 − ǫ. We shall call a collection {Z(t) : t ∈ T } satisfying (7.1.1) a stochastic process. An important example of a sequence of processes which we have already encountered in Section I.1 is the classical empirical process, √ En (t) ≡ n Fb(t) − F (t) , t ∈ R, where Fb is the empirical df of X1 , . . . , Xn , i.i.d. F . Note, however, that this does not mean that we restrict T to be discrete or Euclidean(1). Definition 7.1.1. Finite Dimensional Convergence. A sequence of stochastic processes F IDI Zn (·) converge FIDI to the stochastic process Z(·) Zn (·) −→ Z(·) iff L(Zn (t1 ), . . . , Zn (tk )) → L(Z(t1 ), . . . , Z(tk )) (7.1.2) as n → ∞ for all {t1 , . . . , tk } ∈ T , all k < ∞. In Definition 7.1.1 Z(·) is not uniquely defined by (7.1.2) but we assume that a particular Z(·) is given. F IDI As we noted in Section I.1, En (·) −→ W 0 (F (·)), where {W 0 (u) : 0 ≤ u ≤ 1} is the Brownian bridge defined in Section I.1. As the notation suggests, it is possible and important to think of any Z(·) as an abstract valued random quantity, at the very least Z(·) ∈ F (T ), the set of all real valued functions on T . More precisely, if points of the set Ω, on which Z(·) are defined, are denoted by ω, then, for each ω ∈ Ω, Z(·; ω) is a function on T . If T is Euclidean, it is customary to speak of a realization of Z(·) as a sample function. It is clear that En (·) has bounded sample Section 7.1 23 Weak Convergence in Function Spaces functions. It is far less obvious but true (Problem D.2.3) that W 0 (·) has sample functions which are not only bounded but continuous. P [ω : W 0 (·; ω) is continuous ] = 1 . (7.1.3) Much more is known about W 0 (·) including the distribution of random variables such R1 R1 as sup{|W 0 (u)| : 0 ≤ u ≤ 1}, 0 [W 0 ]2 (u)du, and 0 W 0 (u) a(u) du (Problem 7.1.11). A discussion of some of its properties and the closely related Brownian motion or Wiener process W (·) defined in Section 7.1.2 is given in Billingsley (1968) and Shorack and Wellner (1986). Here is an example which is important for asymptotic inference. Example 7.1.1. The log likelihood process. Let X1 , . . . , Xn be i.i.d. P , where P ∈ P = {Pθ : θ ∈ R}, a regular 1 dimensional model satisfying conditions A0–A4 and A6 of Section 5.4.2 applied to ψ = ∂l/∂θ. Define as usual, ln (θ) = n X l(Xi , θ), i=1 where l ≡ log p and p = p(·, θ) is the density of X under Pθ . Define for all t ∈ R, t Zn (t) = ln (θ0 + √ ) − ln (θ0 ) n with Pθ0 being the true underlying probability. Then, for all t ∈ R, by Problem 5.4.5, as n → ∞, Zn (t) = tZn0 − t2 I(θ0 ) + oPθ0 (1) 2 (7.1.4) 1 P where I(θ0 ) is Fisher information and Zn0 = n− 2 ni=1 l′ (Xi , θ0 ) with l′ = ∂l/∂θ. By the central limit theorem and Slutsky’s theorem, for all t1 , . . . , tk , L (Zn (t1 ), . . . , Zn (tk )) →(Z(t1 ), . . . Z(tk )) where Z(t) ≡ tZ 0 − t2 I(θ0 ) 2 (7.1.5) F IDI and Z 0 ∼ N (0, I(θ0 )). Therefore, Zn (·) −→ Z(·) for the index set T = R. The result may readily be generalized to the case Θ ⊂ Rp , Θ open with the model satisfying A0–A4 and A6 of Section 6.2.1. We take T = Θ and define, for all t ∈ Θ, n sufficiently large, 1 Zn (t) ≡ ln (θ 0 + tn− 2 ) − ln (θ 0 ). 24 Tools for Asymptotic Analysis Chapter 7 It follows (Problem 7.1.2) that 1 Zn (t) = tT Z 0n − tT I(θ0 )t + oPθ0 (1) 2 (7.1.6) where I(θ0 ) is the Fisher information matrix, and Z 0n ≡n − 12 n X l̇(Xi , θ0 ) i=1 F IDI with l̇ the gradient vector (∂l/∂θ1 , . . . , ∂l/∂θp)T . Generalizing (7.1.5), Zn (·) −→ Z(·) where 1 Z(t) = tT Z 0 − tT I(θ0 )t 2 and Z 0 ∼ Np (0, I(θ0 )). ✷ We return to the general case where the index set T may not be Euclidean but for instance be a function space. Let |h|∞ ≡ sup{|h(t)| : t ∈ T }, where | · | is the Euclidean norm in Rd . Definition 7.1.2. For the index set T , l∞ (T ) denotes the class of sample functions Z(·) on T consisting of the set of all bounded real valued functions h on T endowed with the sup norm |h|∞ . Note that En (·) ∈ l∞ (R) and W 0 (·) ∈ l∞ ([0, 1]). Doob’s (1949) heuristics which started the fields of weak convergence and empirical processes were of the following type: F IDI We know from the multivariate central limit theorem that En (·) −→ W 0 (F (·)). This does not imply that the law of functions of En (·) such as supt |En (t)| whose value depends on more than a fixed finite number of t’s is necessarily close to that of the corresponding function of W 0 (·), but it seems plausible that further analysis will yield L(supt |En (t)|) −→ L(supt |W 0 (t)|). This heuristic was shown to be correct in this and related cases by Donsker (1952), but the theory has taken on its most satisfactory form only during the last two decades. F IDI What are needed are conditions on processes Zn (·) −→ Z(·) and on functions q : F (T ) −→ R such that L (q(Zn (·)) −→ L(q(Z(·))). This strong and useful notion is called weak convergence. Definition 7.1.3. Suppose that Z(·) and Zn (·), 1 ≤ i ≤ n, are in l∞ (T ). Zn (·) converges weakly to Z(·), written L(Zn (·)) −→ L(Z(·)), or more commonly, Zn =⇒ Z, iff F IDI (i) Zn −→ Z. (ii) For every continuous function q(·) : l∞ (T ) −→ R such that q(Zn (·)) is a random variable, L(q(Zn (·))) −→ L(q(Z(·)) . Section 7.1 Weak Convergence in Function Spaces 25 (iii) For each m, there exists a partition of T into a finite number m of sets and Z (m) (·) ∈ P l∞ (T ), each constant on the members of the partition, such that |Z (m) − Z|∞ −→ 0, as m → ∞. Continuity of q here means that if hn , h ∈ l∞ (T ) and |hn − h|∞ → 0, then q(hn ) → q(h). Note that q(h) = |h|∞ is certainly continuous, in fact, uniformly continuous, that is, for all ǫ > 0, there exists δ(ǫ) > 0 such that if |h1 − h2 |∞ ≤ δ(ǫ), then |q(h1 )− q(h2 )| ≤ ǫ. Even more, this q is Lipschitz continuous, that is, |q(h1 ) − q(h2 )| ≤ M |h1 − h2 |∞ , for all h1 , h2 for some constant M > 0. Property (iii), finite approximation, is called tightness and any Z(·) with this property is called tight. It implies separability. We relate our definition of weak convergence to classical definitions such as the one in Billingsley (1968) or more modern ones in terms of outer measures such as the one in van der Vaart and Wellner (1996) by example as follows: Let C(T ) be all real valued continuous functions on T . If P Z(·) ∈ C(T ) = 1 and T = [0, 1], then (iii) is satisfied by taking tmj = j/m, 0 ≤ j ≤ m, and Z (m) the piecewise constant interpolation of Z(·) based on Z(tmj ), 0 ≤ j ≤ m . More generally, (iii) essentially says that with high probability, Z(·) is arbitrarily close to a compact subset of l∞ (T ) such as an equicontinuous uniformly bounded subset of l∞ (T ). See Theorem 7.1.1 below. See also van der Vaart and Wellner (1996) for further discussion. We can now state the basic theorem giving sufficient (and essentially necessary) checkable conditions for weak convergence. Theorem 7.1.1. Suppose {Zn (·)}, Z(·) ∈ l∞ (T ), and F IDI (1) Zn (·) −→ Z(·). m Tmj = T and δm , εm → 0 as (2) There exist subsets Tm1 , . . . , Tmkm of T with ∪kj=1 m → ∞ such that, for all m, lim sup P [max{sup[|Zn (s) − Zn (t)| : s, t ∈ Tmj ] : 1 ≤ j ≤ km } > εm ] ≤ δm . n→∞ Then, Zn =⇒ Z. A proof of a weaker form of this theorem is given in Appendix D.2. The full proof may be found in van der Vaart and Wellner (1996), for instance. The basic idea is that Pk for processes Zn of the form Zn (t) = j=1 Zn (tj )1(t ∈ Tmj ) where tj ∈ Tmj , FIDI and weak convergence coincide. Condition (2) links uniform approximation of {Zn (·)} by processes {Znm (·)} in a way which makes the weak limit of the Znm , call them Z (m) , approximate Z in the sense of condition (iii) of Definition 7.1.3. Here is a development of Example 7.1.1. Example 7.1.2.(Example 7.1.1 continued) Weak convergence of likelihood processes. Note that, for θ ∈ R, conditions A0–A4 and A6 of Section 5.4.2 applied to ψ = ∂l/∂θ imply 26 Tools for Asymptotic Analysis Chapter 7 that, under Pθ0 , sup{|Zn (t) − tZn0 + t2 I(θ0 )| : |t| ≤ M } = oPθ0 (1). 2 We claim that weak convergence of Zn (·) to Z(·) holds on all intervals T = [−M, M ]. To see this note first that Zn (·) is separable on all such T since θ → ln (θ) is continuous. FIDI convergence was shown in Example 7.1.1. Next, to show condition (2) of Theorem 7.1.1, j let Tmj = M j−1 , m m , −m + 1 ≤ j ≤ m. Then, sup{|(s − t)Zn0 − t2 s2 − 2 2 I(θ0 )| : s, t ∈ Tmj } ≤ |Zn0 | M + I(θ0 ) M . m m It follows that, P max sup[|Zn (s) − Zn (t)| : s, t ∈ Tmj ], : −m + 1 ≤ j ≤ m > ε m X s2 t2 ε P max{sup[|(s − t)Zn0 − ( − )I(θo )| : s, t ∈ Tmj ]} > ≤ 2 2 2 j=−m+1 s2 ε + P sup{|Zn (s) − sZn0 + I(θ0 )| : |s| ≤ M } > 2 2 εm + o(1) . ≤ 2m P |Zn0 | + M I(θ0 ) > 2M (7.1.7) Choose ε = εm → 0 so slowly that 1 εm m . ≤ P |Z 0 | + M I(θ0 ) > 2M 2m2 Since Z 0 is Gaussian, using Problem D.2.10 or (7.1.9) of the next subsection, εm = (log m)/m will do. Now take δm = 1/m and weak convergence follows from Theorem 7.1.1. This example also illustrates condition (iii) of Definition 7.1.3 since it is evident that we can approximate Z(t) on the partition member Tmj by Z (m) (t) = M 2 j 1 j Z0 − M I(θ0 ). m m 2 ✷ Theorem 7.1.1 has an easy corollary (Problem 7.1.3). Let pm (ε) = lim sup max{P [sup{|Zn (s)−Zn (t)| : s, t ∈ Tmj } ≥ ε] : 1 ≤ j ≤ km } . (7.1.8) n FIDI Corollary 7.1.1. If km pm (ε) → 0 for all ε > 0 and Zn −→ Z, then Zn ⇒ Z. Section 7.1 27 Weak Convergence in Function Spaces Here is a continuous mapping theorem for processes: Proposition 7.1.1. If Zn =⇒ Z ∈ l∞ (T ) and g is a continuous map from l∞ (T ) to l∞ (S) for some set S, then g(Zn ) =⇒ g(Z) ∈ l∞ (S) . Proof. We show that for any function q that maps l∞ (S) to R continuously, q g(Zn ) → q g(Zn ) on l∞ (S). Because the continuity of q and g implies the continuity of the compostion q · g, (ii) of Definition 7.1.3 follows from the weak convergence of Zn . We leave (i) and (iii) of Definition 7.1.3 for Problem 7.1.6. ✷ The previous concepts extend readily from processes Zn (t) ∈ l∞ (T ) to processes Zn (t), Un (s) ∈ l∞ (T ) × l∞ (S). For such processes a useful result is Slutsky’s theorem for processes. Corollary 7.1.2. For sets T , S, and V , suppose g : l∞ (T ) × l∞(S) → l∞ (V ) continuously and, (i) Zn =⇒ Z P (ii) Un −→ u P where Zn ∈ l∞ (T ), Un , u ∈ l∞ (S), u is a nonrandom function, and −→ convergence is in l∞ (S). Then, g(Zn , Un ) =⇒ g(Z, u) . Proof. It is equivalent to show that (Zn , Un ) =⇒ (Z, u). We establish the result for Zn , Un obeying the conditions of Theorem 7.1.1. Let {Tmj } be the sets for {Zn }, {Smj } for {Un }. Take {Qml } = {Smj × Tmk : all j, k}. Evidently {Qml } satisfy (2) of Theorem 7.1.1 for (Zn , Un ). On the other hand, F IDI Zn (t1 ), . . . , Zn (ta ), Un (s1 ), . . . , Un (sb ) =⇒ Z(t1 ), . . . , Z(ta ), u(s1 ), . . . , u(sb ) by the “classical” Slutsky theorem (Theorem B.7.2). Thus, by Theorem 7.1.1, (Zn , Un ) =⇒ (Z, u) and the result follows. Example 7.1.3. The standardized empirical process. Let X1 , . . . , Xn be i.i.d. F on R with F continuous. Assume Donsker’s theorem 7.1.4, En (·) =⇒ W 0 (·) . For t such that Fb(t) 1 − Fb (t) > 0, define √ Fb (t) − F (t) q Zn (t) = n , Fb (t) 1 − Fb (t) 28 Tools for Asymptotic Analysis Chapter 7 and set Zn (t) = 0, otherwise. Let I ≡ F −1 (ε), F −1 (1 − ε) for ε > 0. Then W 0 F (·) Zn (·) =⇒ q F (·) 1 − F (·) on I. This follows by representing Xi as Xi = F −1 (Ui ), i = 1, . . . , n, where Ui are i.i.d. uniform (0, 1) so that √ n Fb (t) − F (t) = En F (t) , and using Corollary 7.1.2 — see Problem 7.1.9. Remark 7.1.1. Weak convergence in this general sense also has many of the important properties of FIDI convergence, mainly Theorems B7.1, B7.2, B7.4, B7.5 and generalizations of these to functions g : l∞ (T ) → l∞ (S). We shall use these results as needed in this section and subsequently. Summary. We introduced the concept of a stochastic process as a collection {Z(t) : t ∈ T } of random variables that are separable in the sense that the probability that Z(t) will stay in the interval (a, b) for all t in a set S ⊂ T can be computed by restricting t to the set S ∩ C for some countable set C. We consider stochastic processes that take values in the class l∞ (T ) of all bounded real valued functions h on T with the sup norm |h|∞ . A sequence {Zn (t) : t ∈ T } of stochastic processes converges weakly to a stochastic process Z(t), t ∈ T , if each finite collection Zn (t1 ), . .. , Zn (tk ) converges in law to Z(t1 ), . . . , Z(tk ), if q Zn (·) converges in law to q Z(·) for all continous real valued P functions q on l∞ (T ), and if Z(·) is tight in the sense that |Z (m) − Z|∞ −→ 0 as m → ∞ for some {Z (m) (·)} ∈ l∞ (T ), where each Z (m) (·) takes on only a finite set of values. We give conditions on fluctuations of Zn (·) over small sets that imply weak convergence in Theorem 7.1.1 and Corollary 7.1.1, and verify that they hold for the likelihood process √ which is defined to be the increments of the log likelihood over intervals (θ 0 , θ0 + t/ n] for a smooth parametric model for i.i.d. X1 , . . . , Xn . 7.1.2 Maximal Inequalities Corollary 7.1.1 suggests that the essential tools we need are bounds on P [sup{|Zn (s) − Zn (t)| : s, t ∈ S} ≥ λ] for subsets S of T of the form Tmj described in Theorem 7.1.1. There is a standard way of bounding tail probabilities for random variables (see (A.15.4)): P [|W | ≥ λ] ≤ Eh(W ) for h ≥ 0, h non-decreasing. h(λ) By studying h(t) = exp(γt) for appropriate γ > 0, one can derive Bernstein’s and Hoeffding’s inequalities (B.9.5), (B.9.6). If W is N (0, 1), an important bound we shall use (Problem D.2.10) is, for λ ≥ 1, 2 φ(λ) λ . (7.1.9) P [|W | ≥ λ] ≤ 2 ≤ 2 exp − λ 2 Section 7.1 29 Weak Convergence in Function Spaces An important refinement of Bernstein’s inequality (B.9.5) — see Hoeffding (1963) or Shorack and Wellner (1986), is the following Hoeffding’s inequality. Let Y1 , . . . , Yn be independent, |Yi | ≤ M for some constant M > 0, E(Yi ) = 0 for i = 1, . . . , n, and σn2 = Var(Y1 + . . . + Yn ). Then for each λ > 0, ( −1 ) |Y1 + . . . + Yn | Mλ 1 2 P . (7.1.10) > λ ≤ 2 exp − λ 1 + σn 2 3σn The M λ/3σn term makes this inequality weaker than (7.1.9) for the normal case. If, however, we have a collection of random variables, {Z(t) : t ∈ T }, tail bounds P (|Z(t)| > λ) on the individual Z(t) are not translatable into useful maximal inequalities without additional conditions. To take a trivial example, suppose T is the rational numbers and Z(t) are independent N (0, 1). Then, no matter how small δ > 0 is, for each λ > 0 P [sup{|Z(t)| : |t| ≤ δ} ≥ λ] ≡ 1. Yet, as we shall see in this section and Appendix D.2, it is quite possible to have inequalities of the form (7.1.9) and (7.1.10) improved to maximal inequalities of the same form for stochastic processes. Essential ingredients are appropriate tail bounds on |Z(s) − Z(t)| and the structure of T . To develop intuition we begin with T = [0, 1]. The following criterion, due to Kolmogorov, has been elaborated and extended in Billingsley ((1968), p.89). Proposition 7.1.2 Suppose that the separable stochastic process Z(·) on [0, 1] satisfies P [|Z(t) − Z(s)| ≥ ε] ≤ M (ε)δ 1+γ (7.1.11) for all s, t, |s − t| ≤ δ, all δ > 0, and some γ > 0. Then, if the length of S ⊂ [0, 1] is ≤ δ0 , P [sup{|Z(t) − Z(s)| : s, t ∈ S} ≥ ε] ≤ KM (ε)δ01+γ (7.1.12) for a universal constant K = K(γ). Note that by (A.15.4), (7.1.11) follows if, for some c > 0, r > 0 and γ > 0, E|Z(t) − Z(s)|r ≤ c|t − s|1+γ . (7.1.13) A remarkable feature of the proposition is that (7.1.11) is a condition which can be checked using bivariate distributions. In the following application, we will need Definition 7.1.4. The Wiener process or Brownian motion on [0,1] is a separable Gaussian process W (t), 0 ≤ t ≤ 1, with EW (t) = 0, Cov(W (s), W (t)) = s ∧ t. Example 7.1.4. The partial sum process. The following process arises in statistics primarily in the context of sequential analysis. Let X1 , . . . , Xn be i.i.d. as X, E(X) = Pk 0, V ar(X) = σ 2 < ∞, E|X|2+γ < ∞, γ > 0. Define Sk = j=1 Xj , 1 Wn (t) = n− 2 S[nt] , 0≤t≤1 σ (7.1.14) 30 Tools for Asymptotic Analysis Chapter 7 where [x] is the largest integer ≤ x. Proposition 7.1.3. If W (·) is the Wiener process on [0,1], then Wn ⇒ W . Proof: Note that Wn (t) = [nt] n 21 S[nt] 1 σ[nt] 2 , n−1 ≤ t ≤ 1. Let t1 < . . . < tk . Then (Problem 7.1.16), by the k variate central limit theorem, (Wn (t1 ), Wn (t2 ) − Wn (t1 ), . . . , Wn (tk ) − Wn (tk−1 )) ⇒ (U1 , . . . , Uk ), (7.1.15) where U1 , . . . , Uk are independent U1 ∼ N (0, t1 ), Uj ∼ N (0, tj − tj−1 ), 2 ≤ j ≤ k. It is easy to see that (U1 , U1 + U2 , . . . , U1 + . . . + Uk ) ∼ (W (t1 ), . . . , W (tk )) . Now we show that convergence is not just FIDI but weak by using Proposition 7.1.1 and Corollary 7.1.1. By Markov’s inequality P [|Wn (t) − Wn (s)| ≥ λ] ≤ λ−(2+γ) E|Wn (t) − Wn (s)|2+γ = λ−(2+γ) E [nt] X j=[ns]+1 Xj √ σ n 2+γ . (7.1.16) Without loss of generality, take σ = 1. By an extension of the proof of (5.3.3) to all γ > 0 based on the Marcinkiewicz-Zygmund’s inequality (1939), E [nt] X j=[ns]+1 2+γ Xj γ ≤ ([nt] − [ns])1+ 2 E|X|2+γ , and thus for M = E|X|2+γ , γ E|Wn (t) − Wn (s)|2+γ ≤ M |s − t|1+ 2 . (7.1.17) We conclude from Proposition 7.1.2 that, for all fixed t0 , γ P [sup{|Wn (t0 ) − Wn (s)| : t0 ≤ s ≤ t0 + δ} ≥ λ] ≤ λ−(2+γ) M δ (1+ 2 ) and P [sup{|Wn (t) − Wn (s)| : t0 ≤ s, t ≤ t0 + δ} ≥ λ] λ ≤ 2P [sup{|Wn (s) − Wn (t0 )| : t0 ≤ s ≤ t0 + δ} ≥ ] 2 −(2+γ) γ+3 1+ γ2 . ≤ λ 2 Mδ (7.1.18) Section 7.1 Weak Convergence in Function Spaces 31 j j+1 Take Tmj = m , m , 0 ≤ j ≤ m − 1. Then, δ = m−1 , and by (7.1.18), pm (ε) of γ Corollary 7.1.1 is bounded above by cm−(1+ 2 ) . Thus because km = m, Corollary 7.1.1 establishes weak convergence. ✷ Remark 7.1.2. 1) The structure of the processes Wn (·) appears only in a weak way, through the FIDI convergence and the inequality (7.1.12). Thus, results such as Proposition 7.1.2 can be proved for dependent Xj as well. 2) The condition E|X|2+γ < ∞ in Example 7.1.3 is superfluous. Only second moments are needed—see Billingsley(1968). 3) The process W (·) has continuous sample functions (Problem D.2.3). 4) The property (7.1.15) reveals the fundamental property of the Wiener process. It has stationary independent Gaussian increments. ✷ Summary. We gave a result which gives bounds on the probability that the maximum of the fluctuations of a stochastic process over a set S exceeds ε. We introduced the Wiener process and the partial sum process and showed how the result can be used to establish the weak convergence of the partial sum process to a Wiener process. 7.1.3 Empirical Processes on Function Spaces We could at this stage prove weak convergence results for the classical empirical process of Section I.1. We choose instead to introduce and develop the general theory of weak convergence of empirical processes which we use in Chapters 9–12. We specialize in the rest of this section to a subset T of the set of functions h from X to R, in the context of X1 , . . . , Xn ∈ X i.i.d as X ∼ P , and EP h2 (X) < ∞. We define the empirical process on T by n 1 X (h(Xi ) − EP h(Xi )) En (h) = √ n i=1 (7.1.19) for h ∈ T . The empirical process on the line that we defined earlier is a special case with T = {h : h(·) = hx (·), x ∈ R} where hx (·) = 1(−∞, x](·). Note that we have made a notational change by now writing En (hx ) for what was denoted by En (x) in Section I.1 and Section 7.1.1. Weak convergence of En for this T will be what we need for Donsker’s theorem and, as we noted, maximal inequalities for En on suitable subsets of T are what we’ll need for the proof. Another En on a suitable T appeared in Examples 7.1.1 and 7.1.2 with T = {log p(·, θ) : θ ∈ Θ} . If we define, generally, ||En ||T ≡ sup{|En (h)| : h ∈ T } we will find that remainder terms in the proofs of asymptotic results in this book will be naturally bounded by ||En ||T for suitable T . 32 Tools for Asymptotic Analysis Chapter 7 To obtain useful maximal inequalities for En (·), we need essentially to consider T which are not too big. If T = {all bounded functions} for instance, we do not get useful bounds (Problem 7.1.7). There are many ways of restricting T . We limit ourselves here to bounding the bracketing number. Note in passing that by specializing to En , we have completely specified the joint behavior of En (h1 ), . . . , En (hk ) and hence FIDI limits of the En (·). The bracketing numbers take advantage of this structure. Definition 7.1.5. A bracket, (f , f¯), is a set of functions h on X such that h ∈ (f , f¯), that is f (x) ≤ h(x) ≤ f¯(x) for all x. We assume both f and f¯ are in L2 (P ), but f and f¯ don’t have to belong to the bracket. Definition 7.1.6. The δ bracketing number N[ ] (δ, T, L2(P )) for a subset T of functions on X is the smallest number of brackets (f i , f¯i ) with f i , f¯i ∈ T such that (a) for every h ∈ T , there exist f i , f¯i such that h ∈ (f i , f¯i ), (b) EP (f¯i − f i )2 (X) ≤ δ 2 Definition 7.1.7. The envelope function for sets T in Definition 7.1.6 is s(T ) = sup{|h(X )| : h ∈ T }. We can get upper bounds on N[ ] by finding the number of brackets for some collection of brackets satisfying (b). Here is an example. Example 7.1.5. The classical empirical process. Suppose X = R, F (x) = P (−∞, x] is continuous and strictly increasing, and T = {1(−∞, x] : x ∈ R}. For a given δ ∈ (0, 1/2), set k = [1/δ 2 ] where [t] is the greatest integer ≤ t and let x(δ 2 ), x(2δ 2 ), . . . , x(kδ 2 ) be the unique δ 2 , 2δ 2 , . . . , kδ 2 quantiles of P . We set x(0) = −∞ and x((k + 1)δ 2 ) = ∞. Next, for i = 0, . . . , k, define the brackets (f i , f¯i ) = [1(−∞, x(iδ 2 )], 1(−∞, x((i + 1)δ 2 )]] . Then, for each h ∈ T , there is i ∈ {0, . . . , k} such that h ∈ (f i , f¯i ), and EP (f¯i − f i )2 (X) = P (x(iδ 2 ) < X ≤ x((i + 1)δ 2 )) = δ 2 . For this collection of brackets, the number of brackets is k + 1 = [1/δ 2 ] + 1. Thus, N[ ] (δ, T, L2 (P )) ≤ 1 + [1/δ 2 ]. This upper bound is, in fact, sharp — but only upper bounds are needed. For this X and T , the envelope function is s(T ) ≡ 1. ✷ We continue by stating and proving a maximal inequality not for the empirical process but for its Gaussian limit, the “Brownian bridge.” We then state the appropriate analogue for the empirical process and discuss the new difficulties introduced but only give proofs of the theorem in Appendix D.2. Let T be a set of functions as above. Definition 7.1.8. The Brownian bridge on T (with respect to P) is the Gaussian process WP0 (f ), f ∈ T with EWP0 (f ) = 0, Cov(WP0 (f ), WP0 (g)) = CovP (f (X), g(X)) . Section 7.1 33 Weak Convergence in Function Spaces Note that by the central limit theorem, for hj ∈ T , En (h1 ), . . . , En (hk ) will converge in law to WP0 (h1 ), . . . , WP0 (hk ), which incidentally establishes that WP0 is definable as a Gaussian process as well as that En (·) converges FIDI to WP0 (·). The critical connection between WP0 (f ) and bracketing is that if (f , f¯) is a bracket with bracketing number δ and if f, g ∈ (f , f¯), then E(WP0 (g) − WP0 (f ))2 ≤ EP (g − f )2 (X) ≤ E(f¯ − f )2 (X) ≤ δ 2 . We state an analogue of Theorem 2.14.16 of van der Vaart and Wellner (1996), which is also a crude version of a result of Talagrand (1994). Theorem 7.1.2. Suppose that for all x ∈ X , f ∈ T (i) For some fixed enveloped function s ∈ T , |(f (x)| ≤ s(x). (ii) For some positive constants c and d, N[ ] (δ, T, L2 (P )) ≤ cδ −d for all 0 < δ < 1. (iii) For some positive constant γ, EP f 2 (X) ≤ γ 2 . Then, 2d(1+ǫ) λ λ2 P [sup{|WP0 (f )| : f ∈ T } ≥ λ] ≤ C 1 + exp − 2 , γ 2γ (7.1.20) for all ǫ > 0 and C a constant. Note that (ii) implies (i) (Problem D.2.9). The proof we shall sketch in Appendix D.2 is essentially that of van der Vaart and Wellner (1996) following Pollard (1984). A much better result due to Talagrand (1994) shows that 2d(1+ǫ) can be replaced by d−1 with C replaced by Cγ −2d — see Proposition A.2.7 in van der Vaart and Wellner. This is the best possible (Problem D.2.7). In the future, we shall refer to the condition (ii) above as the “polynomial bracketing” condition. In the discussion following the proof of Theorem 7.1.2 in Appendix D.2 it is argued that the result follows from a “chaining” argument which involves showing that we can write WP0 (f ) = WP0 (gm0 ) + ∞ X m=m0 [WP0 (gm+1 ) − WP0 (gm )] (7.1.21) where the {gm } are representative members of the bracket sets (f mj , f mj ). It is also argued that the order of magnitude of P {sup |WP0 (f )| : f ∈ T } ≥ λ is of the same order exp{−λ2 /2γ 2 } as that of a single WP0 (f ). This result enables us to show that WP0 (f ) is continuous in | · |∞ on T — see below. Remarkably, a result almost as good is available for En , independent of n. This is a slightly cruder form of Theorem 2.14.16 of van der Vaart and Wellner (1996). 34 Tools for Asymptotic Analysis Chapter 7 Theorem 7.1.3. Suppose T satisfies the conditions of Theorem 7.1.2 and the envelope function s(·) in conditon (i) of Theorem 7.1.2 is uniformly bounded by M < ∞. Then, 4d 1 λ 1 P [||En ||T ≥ M λ] ≤ Cγ −2d 1 + exp{− [λ2 (γ 2 + (3 + λ)n− 2 )−1 ]} . (7.1.22) γ 2 The argument for this theorem, which we discuss somewhat further in Appendix D.2, replaces the use of the Gaussian tail inequality (7.1.9) by Hoeffding’s inequality (7.1.10) and the change in the bound reflects this. The chaining method described in Appendix D.2 relies on bounds for the probabilities of deviation from 0 of increments with small variance. For En , this behaviour is intrinsically worse than for WP0 . For instance, the classical √ empirical process sample functions, n(Fb (·) − F (·)), have jumps while WP0 (F (·)) has continuous sample functions. This leads to the need for arguing that although individual gm+1 − gm may be unusually large on their scale, having more than one such element of a chain deviate is sufficiently unlikely. Furthermore, the requirement that s(·) is bounded has to be imposed since, for an arbitrary F (·), a bound of the type given in (7.1.22) couldn’t hold even if T were a single function. We now apply Theorem 7.1.2 to a first example. Corollary 7.1.3. Continuity of WP0 (f ). Suppose T satisfies the conditions of Theorem 1 7.1.2. Let ||f ||2 ≡ EP2 f 2 (X). Then, WP0 (·) is continuous in probability with respect to || · ||2 , in the sense that, for f0 ∈ T , P [sup{|WP0 (f ) − WP0 (f0 )| : ||f − f0 ||2 ≤ δ, f ∈ T } ≥ ǫ] −→ 0 (7.1.23) for each ǫ > 0 as δ → 0. 0 0 Proof. Set W0 (f ) = W W0 (f ) is a Gaussian process with mean P (f ) − WP (f 0 ). Then zero and covariance E W0 (f )W0 (g) = E (f − f0 )(X)(g − f0 )(X) . That is, W0 (f ) has the same distribution as WP0 (f − f0 ). The result follows from the bound of Theorem 7.1.2. ✷ Note that (7.1.23) is weaker than the more natural definition of continuity with probability one: On a suitable probability space (Ω, A, P ), we can define WP0 (·, w) such that P [ω : sup{|WP0 (f, ω) − WP0 (f0 , ω)| : ||f − f0 ||2 ≤ δ, f ∈ T } → 0 as δ → 0] = 1. (7.1.24) But this can also be derived. See Problem D.2.3. As an application of Corollary 7.1.3, take T = {1[0, u] : u ∈ [0, 1]} and P the uniform distribution on [0, 1]. Identify 1[0, u] with u. Then W 0 (u) is the usual Brownian bridge of Section I.1. We deduce from Corollary 7.1.3 that, for u0 ∈ [0, 1], P [sup{|W 0 (u) − W 0 (u0 )| : |u − u0 | ≤ δ, u ∈ [0, 1]} ≥ ǫ] → 0 as δ → 0 for all ǫ > 0. To see this, note that if v > u, E{1(X ∈ [0, v]) − 1(X ∈ [0, u])}2 = E{1(X ∈ (u, v])}2 = v − u, Section 7.1 35 Weak Convergence in Function Spaces and use Example 7.1.4 and Theorem 7.1.2. The statement (7.1.24) also holds in this example. We continue with the identification of T with {1[0, u] : u ∈ [0, 1]} and write En (u) and WP0 (u) for En (1[0, u]) and WP0 1[0, u] as before. Theorem 7.1.4 (Donsker (1952)). If T = {1[0, u] : u ∈ [0, 1]} and P = U[0, 1], then for En and W 0 as above En =⇒ W 0 . Proof. The structure of this argument parallels that of Proposition 7.1.2. We need only check the conditions of Corollary 7.1.1. FIDI convergence, as we noted previously, follows from the multivariate central limit theorem. Let Tmj = (jδm , (j + 1)δm ], 0 ≤ j < [1/δm ] − 1 with δ = δm to be chosen. Now En (v) − En (jδm ) = En (1(jδm , v]). The set of all {1(jδm , v] : jδm < v ≤ (j + 1)δm } is easily seen to obey the conditions of Theorem 7.1.2 with d = 1, γ 2 = δm (1 − δm ) by arguing as in Example 7.1.4. Thus using (7.1.22), λ P sup{|En (v) − En (jδm )| : jδm < v ≤ (j + 1)δm } > 2 2 1 1 λ −2 4 −1 − 21 −1 . ≤ Cδm (1 + λδm ) exp − (δm + (3 + λ)n ) 2 2 Put λ = 2/m, δm = 1/m3 (say), and the condition of Corollary 7.1.1 holds. The theorem follows. ✷ More generally, we state (see Problem 7.1.17) Theorem 7.1.5. Suppose T can be partitioned into sets {Tmj }, j = 1, . . . , km , such that each Tmj satisfies the conditions of Theorem 7.1.3, polynomial {c, d} bracketing, and uniformly bounded envelope s(·) ≤ M . Suppose c, d, and M are independent of m, and suppose 2 sup{EP (f − g)2 (X) : f, g ∈ Tmj } ≤ γm 2 for all m, where γm = o (1/ log km ) . Then, if X1 , . . . , Xn are i.i.d. P and f ∈ T , En (f ) =⇒ WP0 (f ) . Note, for future reference, that the only properties specific to En that are used are the maximal inequalities. Thus weak convergence results can be expected to hold for much wider classes of processes, for instance, empirical processes for weakly dependent stationary sequences — see Doukhan (1995) for instance. We continue with Example 7.1.6. The Kolmogorov statistic. We immediately deduce from Donsker’s theorem that if F0 = U(0, 1), under H : F = F0 , ∞ X √ 2 2 (−1)j+1 e−2j λ (7.1.25) P [ n sup |Fb(u)−F0 (u)| > λ] → P [sup |W 0 (u)| > λ] = u u j=1 36 Tools for Asymptotic Analysis Chapter 7 where the second equality is from Shorack and Wellner (1986). From Proposition 4.1.1, we know that the left hand side of (7.1.25) in fact has the same distribution under F0 for any F0 continuous. What if F0 is not continuous? It follows from Theorem 7.1.5 that, under H : F = F0 , for T = {1(−∞, u] : u ∈ R}, with 1(−∞, u] identified with u, En (·) =⇒ WF00 (·) ≡ W 0 F0 (·) . Although the distribution of supx |WF00 (x)| depends on F0 if F0 is not continuous, it is √ still true that the distribution of n supx |Fb (x) − F0 (x)| converges weakly to that of supx |WF00 (x)|. We will use this remark and the bootstrap in Chapter 10 to set confidence bounds for F based on Kolmogorov statistic which are narrower than ones using the tabled values based on the right hand side of (7.1.25). See Example 10.3.4. ✷ Example 7.1.7. The Cramér-von Mises statistic of Problem 4.1.11 is Z ∞ Cn ≡ n (Fb(x) − F0 (x) )2 dF0 (x). −∞ If F0 is continuous, the distribution of this statistic under H does not depend on F0 . Take R1 F0 = U(0, 1). Since the map f → 0 f 2 (u)du is continuous from l∞ [0, 1] to R+ , we immediately deduce from Donsker’s theorem that Z 1 LF0 (Cn ) → L( [W 0 ]2 (u) du). 0 Although the density function of this limit does not have a nice analytic form (see Example 7.3.1), its characteristic function does. Z 1 − 12 E exp{it [W 0 ]2 (u) du)} = Π∞ j=1 (1 − 2λj it) 0 with λj = (jπ)−2 from Shorack and Wellner (1986), p.215. ✷ Classes of functions T for which En (·) =⇒ WP0 (·) are called Donsker classes. They include indicators of quadrants and hyperplanes in Rd and much else. Validation of weak convergence involves, as we have seen, being able to find partitions of the sets of functions T such that the oscillations of En over each member of the partition tend to be small. This requires using maximal inequalities in the way we have. An important, but in general overly crude application of the maximal inequalities we have, establishes generalizations of the law of large numbers. We obtain from Theorem 7.1.3 Proposition 7.1.4. If T obeys the conditions of Theorem 7.1.3, n sup{ 1X P |h(Xi ) − EP h(X)| : h ∈ T } −→ 0. n i=1 Proof. The result follows because Theorem 7.1.3 tells us that sup{|En (h)| : h ∈ T } = OP (1) (7.1.26) Section 7.1 37 Weak Convergence in Function Spaces 1 which implies (7.1.26) on multiplying both sides by n− 2 . ✷ In fact, more can be shown: The convergence in (7.1.26) is almost sure (Problem 7.1.10). A famous special case is the Glivenko-Cantelli theorem: With probability 1 sup |Fb(x) − F (x)| → 0 (7.1.27) x We will apply the weak convergence results further in Sections 7.2 and 9.2. The maximal inequalities will be directly useful both in Section 7.2 and in Chapters 9 and 11. We close the discussion in this section noting that we do not enter further into the other beautiful ideas underlying maximal inequalities, such as symmetrization, viewing the sample obtained as a sample without replacement from a larger sample and VapnikChervonenkis theory, among others. Our focus in this section and the following sections has been and will be the application of these ideas to develop the asymptotic tools needed for statistical theory. Summary. We introduced separable stochastic processes {Z(t) : t ∈ T } on a general space T which typically is a space of functions and defined the weak convergence of a sequence of stochastic processes Zn (·) to a tight stochastic process Z(·). Weak convergence allows us to conclude that the distribution of a l∞ continuous function q of Zn (·) converges to the distribution of q(Z(·)). We gave sufficient conditions for weak convergence in terms of probability bounds on oscillations of Zn (·). This led to the consideration of maximal inequalities that provide such bounds. We applied this framework and results to the likelihood and partial sum processes and established weak convergence of the partial sum process to the Wiener process. We devoted Section 7.1.3 to the empirical process 1 En (h) = n 2 Z [h(x) − Eh(X)] dPb (x) for h in some space T of functions, where Pb is the empirical probability of X1 , . . . , Xn i.i.d. as X ∼ P . En (·) converges weakly to a Brownian bridge WP0 (·) on T , which is defined to be a zero mean Gaussian process with Cov WP0 (g), WP0 (h) = CovP g(X), h(X) . For such processes, we introduced the bracketing entropy (number) and established maximal inequalities for WP0 (·) and En (h) under envelope bounds on h(x) and polynomial bounds on the bracketing number (the polynomial bracketing condition). We established Donsker’s theorem which states that for uniform P the classical empirical process converges weakly to the Brownian bridge W 0 and we generalized this result to give conditions 0 under which {En (h) : h ∈ T } converges weakly to {WP (h) : h ∈ T }. A class of functions T such that En (h) : h ∈ T converges weakly to WP0 (h) : h ∈ T is called a Donsker class. Finally, we used the results to derive the asymptotic distributions of the Kolmogorov and Cramér-von Mises statistics, and to give a proof of the Glivenko-Cantelli theorem. 38 Tools for Asymptotic Analysis 7.2 Chapter 7 The Delta Method in Infinite Dimensional Space We have applied the stochastic delta method successfully to M estimates in Sections 5.4 and 6.2 by Taylor expanding the estimating equations defining the parameter ν(P ). There are many examples, such as the minimum distance estimates and trimmed mean of Section I.3, where it is unclear what a linear approximation to νbn = ν(Pb ) means. The first and perhaps the most valuable part of this section is the development of a heuristic method which yields the “correct” linear approximation if one exists. In this section, as in most of the text, we are limited to the i.i.d. case. 7.2.1 Influence Functions. The Gâteaux and Fréchet Derivatives Let Pb denote the empirical distribution of X1 , . . . , Xn i.i.d. as P . What is a linear approximation to a statistic νbn = ν(Pb ) which is a consistent estimate of a parameter ν(P )? In analogy with the one and d dimensional approximations (5.3.23), (5.4.22), and (6.2.3), we define it to be an expression of the form n ν(P ) + 1X ψ(Xi , P ) = ν(P ) + n i=1 Z ψ(x, P )dPb(x) (7.2.1) where EP ψ(X, P ) = 0 and EP ψ 2 (X, P ) < ∞. Since n 1 1X ψ(Xi , P ) = OP (n− 2 ) n i=1 we shall say that (7.2.1) is a valid linear approximation to νbn if Z 1 νbn − ν(P ) − ψ(x, P )dPb (x) = oP (n− 2 ). (7.2.2) √ Thus νbn − ν(P ) is asymptotically an average of i.i.d. variables and n νbn − ν(P ) will be asymptotically normal. For instance, if νbn = ν(Pb) = h(X̄) for some differentiable function h, then ψ(x, P ) = h′ (µ)(x − µ) and √ √ n h(X̄) − µ = n h′ (µ)(X̄ − u) + oP (1) . See the proof of (A.14.17). The function ψ(·, P ) is unique if it exists (Problem 7.2.19) and we shall call it the influence function of ν(Pb). The terminology and the basic idea of the approximation come from Section 5.4.1. Let ν:M→R be a parameter where Section 7.2 The Delta Method in Infinite Dimensional Space 39 (i) M ⊃ P, with P the model of primary interest. (ii) M ⊃ all distributions with finite support ⊂ X , i.e., M includes the empirical distribution. (iii) M is convex. If P and Q belong to M, then (1 − ε)P + ε Q ∈ M for all 0 ≤ ε ≤ 1 . Define, in accord with Huber (1981) the Gâteaux derivative of ν at P in the direction of Q by ψ0 (Q, P ) = ∂ ν( (1 − ε)P + εQ)|ε=0 ∂ε (7.2.3) where the one-sided derivative is assumed to exist. If the influence function as defined above exists, then under general conditions (e.g., Huber (1981), Bickel, Klassen, Ritov and Wellner (1993,1998)), it can be computed as a function of x ∈ X as, ψ(x, P ) ≡ ψ0 (δx , P ) (7.2.4) where δx is point mass at x. Under appropriate conditions (see Example 7.2.1 and Problem 7.2.3) Z ψ0 (Q, P ) = ψ(x, P ) dQ(x) (7.2.5) for all Q ∈ M. In particular, since ψ0 (P, P ) = 0, we have a fundamental property of the influence function, EP ψ(X, P ) = 0 . (7.2.6) Why should we expect (7.2.2) and (7.2.5) to hold? Here is an example where they are valid. If X = {x1 , . . . , xk }, then we are in the multinomial case we studied in Section 5.4.1. Example 7.2.1. The multinomial case. We can identify P with (p1 , . . . , pk )T where pj = P {X = xj }, j = 1, . . . , k, and we can write the parameter ν(P ) as h(p1 , . . . , pk ) for some h : Rk → R. In this case we know from Theorem 5.3.2 and Section 5.4.1 that the influence function ψ(x, P ) is obtained from the total differential. We take M = P = {q : 0 < qi < P P 1, ki=1 qi = 1}. Because of the restriction ki=1 qi = 1, the total differential evaluated at x = xj under the conditions of Section 5.4.1 is (see (B.8.14)) 1 ψ(xj , P ) = lim (h((1 − ε)p1 , . . . , (1 − ε)pj−1 , (1 − ε)pj + ε, ε↓0 ε (1 − ε)pj+1 , . . . , (1 − ε)pk ) − h(p1 , . . . , pk )) = k k X X ∂h ∂h ∂h (p) − (p)pt = (p) 1(xt = xj ) − pt (7.2.7) ∂pj ∂p ∂p t j t=1 t=1 40 Tools for Asymptotic Analysis Chapter 7 and if Q ↔ (q1 , . . . , qk ) , 1 ψ0 (Q, P ) = lim (h((1 − ε)p1 + εq1 , . . . , (1 − ε)pk + εqk ) − h(p1 , . . . , pk )) ε↓0 ε k k X X ∂h ψ(xj , P )qj (7.2.8) (p)(qj − pj ) = = ∂pj j=1 j=1 which is just (7.2.5). We see that ψ(X, P ) = k X ∂h (p)(1(X = xj ) − pj ) ∂p j j=1 is the function appearing in (5.4.10) and that (5.4.10) exactly gives (7.2.2) with νbn = ν(Pb ) where Pb is the empirical probability, that is, n 1 Nk 1X N1 ψ(Xi , P ) + oP (n− 2 ) ) = ν(P ) + ν(Pb ) = h( , . . . , n n n i=1 (7.2.9) Equation (7.2.9) just says that (7.2.1) is a valid linear approximation to ν(Pb) in this case. ✷ ∂h (·) in a neighborhood of p is enough to leIn the multinomial case, continuity of ∂p j gitimize the existence of a total differential for h at p, which is what (7.2.7) corresponds R to. Also, all expressions such as Ψ(x, P )dQ(x) naturally exist and are finite. There are different generalizations of the total differential to infinite dimensional spaces. At this point, we simply operate formally, derive the influence function, and show, by example, the validity of (7.2.5), and, in some cases, (7.2.2) for a number of examples. We also relate the influence function to the sensitivity curve of Section 3.5 (see Remark 3.5.1 and Problem 7.2.2). Here is an extension of the multinomial example. Example 7.2.2. Moments and functions of moments. Suppose M is as in (i), (ii), and (iii), Z ν(P ) = h g(x)dP where g(·) = (g1 , . . . , gk )T is a vector of functions from X to R such that Z |g|2 (x)dP (x) < ∞ for all P ∈ M R and h : Rk → R has a total differential at µg (P ) ≡ g(x)dP (x). Then, by definition, for P, Q ∈ M, Z Z ν((1 − ε)P + εQ) = h g(x)dP (x) + ε g(x)d(Q − P )(x) Section 7.2 The Delta Method in Infinite Dimensional Space 41 and, by definition, ∂h (µ (P ) + ε(g(x) − µg (P )))|ε=0 ∂ε g k X ∂h = (µ (P ))(gj (x) − µgj (P )) ∂µgj g j=1 ψ0 (δx , P ) = = ∇T h(µg (P ))(g(x) − µg (P )) . (7.2.10) It can be shown (Problem 7.2.3) that (7.2.5) holds. Moreover, if we set ψ(x, P ) = ψ0 (δx , P ), then by (5.3.23), n ν(Pb ) = ν(P ) + 1 1X ψ(Xi , P ) + oP (n− 2 ) n i=1 (7.2.11) and (7.2.2) holds. We have already seen in Chapters 5 and 6 how to apply (7.2.10) to get expressions for the asymptotic behaviour of moments and cumulants. ✷ b = Recall that a minimum contrast (MC) estimate is defined in Section 5.4.2 as θ R arg min ρ(x, θ)dFb (x) for some contrast function ρ, while a M estimate is defined as R the solution to ψ(x, θ) dFb (x) = 0 for some funtion ψ. An important special case of M estimates is obtained by choosing ψ as the derivative of a contrast function. Example 7.2.3. Quantiles. The median is a MC estimate which can be shown to be consistent and asymptotically normal (Problem 5.4.1), yet does not satisfy the conditions of Theorems 5.4.2. Specifically, suppose X ∈ R has distribution function F with density f and uniquely defined median ν(F ) = F −1 (.5), such that f (ν(F )) > 0. Then, from Problem 5.4.1, if Fb −1 (.5) is a sample median, √ L( n(Fb −1 (.5) − ν(F ))) → N (0, σ 2 (F )) where, σ 2 (F ) = 1/4f 2(ν(F )). The mean and median are informative of the center of a population. Sometimes quantiles are also of interest. For instance in addition to median income we may want to examine the 10th or the 99th percentile of income. We next formally work out the influence function of the α quantile να (F ), 0 < α < 1, defined by F (να (F )) = α . (7.2.12) We assume that equation (7.2.12) has a unique solution for F ↔ P ∈ M. In general, να (F ) corresponds (Problem 7.2.7) to the MC estimate, with ρ(x, θ) = = (1 − α)|x − θ|, x < θ α|x − θ|, x≥θ. If F is continuous R and increasing, να (F ) is the parameter (corresponding to an M estimate) which solves φα (x, θ) dF (x) = 0 with φα (x, θ) = −(1 − α), = α, x<θ x≥θ. (7.2.13) 42 Tools for Asymptotic Analysis Chapter 7 The median indeed corresponds to α = .5. If F = Fb , as we know for the median, να (Fb ) is not uniquely defined in general. For convenience, we use the sample quantile (see (2.1.17)) x bα = 1 b −1 F (α) + FbU−1 (α) = X([nα]+1) , if nα ∈ / {1, . . . , n} 2 1 X([nα]) + X([nα]+1) , otherwise = 2 (7.2.14) where FbU−1 (α) = sup{x : Fb(x) ≤ α}, X(n+1) = X(n) , and X(1) ≤ . . . ≤ X(n) are the ordered X1 , . . . , Xn . We now proceed formally from (7.2.12) assuming να (F ) is uniquely defined on M, and that we are evaluating the influence function at F ↔ P such that f (να (F )) is defined and > 0. Then, we can differentiate (7.2.12) in the form ((1 − ε)F + εG)(να ((1 − ε)F + εG)) = α with respect to ε, evaluated at ε = 0, to get (G − F )(να (F )) + f (να (F )) ∂ να (F + ε(G − F ))|ε=0 = 0 ∂ε which, by (7.2.4), yields, by substituting G(y) = 1(y ≥ x) corresponding to δx , the influence function, − 1(να (F ) ≥ x) − α ψα (x, F ) = . (7.2.15) f (να (F )) Expressions (7.2.15) and (7.2.2) lead to the expectation that we have, for xα = F −1 (α), n x bα = xα + 1 1 X −(1(Xi ≤ F −1 (α)) − α) + oP (n− 2 ) −1 n i=1 f (F (α)) (7.2.16) and hence, by the central limit theorem, √ n(b xα − xα ) =⇒ N (0, σα2 (F )) where σα2 (F ) = α(1 − α)/f 2 (F −1 (α)). That (7.2.16) is, in fact, correct is shown in Problem 7.2.8. ✷ The Fréchet derivative To obtain a general theorem yielding (7.2.2) we need stronger notions of differentiation than that due to Gâteaux. These necessarily rely on some definition of (metric) topology on M, a convex subset of probabilities on X containing P, and all discrete distributions. We can regard the Gâteaux derivative as a partial derivative in the direction (1 − ε)P + εQ. We next turn to a stronger differentiability that corresponds to the existence of a total differential. Definition 7.2.1. d is a suitable metric on M, if it satisfies the usual a)–d) below, and e). Section 7.2 The Delta Method in Infinite Dimensional Space 43 a) d ≥ 0 b) d(P, Q) = d(Q, P ) c) d(P, Q) = 0 iff P = Q d) d(P, Q) ≤ d(P, R) + d(R, Q) all P, R, Q ∈ M e) d((1 − ε)P + εQ, P ) ≤ εd(P, Q) all P, Q ∈ M Examples of suitable metrics are d(P, Q) = sup{| Z f (dP − dQ)| : f ∈ C} R R where C is a Donsker class satisfying: f dP = f dQ for all f ∈ C =⇒ P = Q. For instance, if X = R and C = {1z (·) : z ∈ R} where 1z (t) = 1(t ≤ z), then d is the Kolmogorov (sup) metric. The strongest (often too strong) definition of differentiability of a function ν : M → R is F (for Fréchet) differentiability. In the following we write “Rem” for the remainder of R the approximation ψ(x, P )dQ(x) to ν(Q) − ν(P ). Definition 7.2.2. ν is F differentiable at P iff there exists ψ(·, P ) such that R (a) ψ(x, P )dP (x) = 0. (b) For all ε > 0, there exists δ(ε) > 0 such that if d(P, Q) ≤ δ(ε), then Z Rem(P, Q) ≡ |ν(Q) − ν(P ) − ψ(x, P )dQ(x)| ≤ εd(P, Q). R R Note that ψ(x, P )dQ(x) = ψ(x, P )d(Q − P ). Equivalently, if Qm is any sequence such that d(Qm , P ) → 0, and if (a) holds, then (b) is equivalent to Rem(P, Qm ) = o(d(P, Qm )). Write Pbn for the empirical probability Pb to clearly indicate the dependence on sample size. Theorem 7.2.1. If ν is F differentiable at P with derivative ψ with respect to a suitable metric d and (i) d(Pbn , P ) = OP (n− 2 ) 1 (ii) ψ(·, P ) ∈ L2 (P ) then n ν(Pbn ) = ν(P ) + 1 1X ψ(Xi , P ) + oP (n− 2 ). n i=1 That is, νbn = ν(Pbn ) has influence function ψ. 44 Tools for Asymptotic Analysis Chapter 7 Proof. By hypothesis, for δ(ε) as in Definition 7.2.2, P [d(Pbn , P ) ≤ δ(ε)] → 1 for every ε > 0. Therefore, by F differentiability, Z b P [|ν(Pn ) − ν(P ) − ψ(x, P )d(Pbn − P )(x)| ≤ ε′ d(Pbn , P )] → 1 for every ε′ > 0. By (i), for all ε > 0, there exists M such that 1 Take ε′ = P [d(Pbn , P ) ≤ M n− 2 ] ≥ 1 − ε. ε M, then, for every ε > 0, Z 1 b P [|ν(Pn ) − ν(P ) − ψ(x, P )d(Pbn − P )(x)| ≤ εn− 2 ] → 1. ✷ Example 7.2.4. Cramér-von Mises (C-vM) Statistic. We want to see how the C-vM statistic behaves when H : P = P0 is false and X = R. The C-vM statistic of Example 7.1.6 is ν(Fb), where Z ν(F ) = (F (y) − F0 (y))2 dF0 (y) . By a standard calculation (Problem 7.2.9), the influence function is given by Z ∞ ψ(x, P ) = 2 (1(y ≤ x) − F (y))(F (y) − F0 (y))dF0 (y) . (7.2.17) −∞ It can be shown (Problem 7.2.9) that, if we take d as the Kolmogorov metric d(P, Q) = sup |P (−∞, y] − Q(−∞, y]| y then, Rem(P, Q) = o(d(P, Q)). if P 6= P0 , the Cramér-von Mises statistic is asymptotically normal with mean RThus, R ∞ 2 −1 ∞ 2 (F − F ) (y)dF (y) and variance n ψ (x, P )dF (x) where ψ is given by 0 0 −∞ −∞ (7.2.17). This is quite different from its behaviour under the hypothesis H : F = F0 where its asymptotic behaviour is non-Gaussian as we will see in Example 7.3.1. ✷ There is a weaker but more broadly applicable differentiability notion due to Hadamard. We refer the reader to van der Vaart (1998), Chapter 20, for a clear discussion of this notion and its utility. We shall give some more applications of Theorem 7.2.1 in the problems. Unfortunately, it is hard to apply, even when weakened to the Hadamard form. The problem is Section 7.2 The Delta Method in Infinite Dimensional Space 45 that many parameters are defined implicitly and what R is most serious — the argument P appears as dP , as in the median which minimizes |x − θ|dP (x). Convergence in a suit√ 1 able metric having the n consistency property, d(Pb, P ) = OP (n− 2 ), is not sufficient to have continuity of such ν(P ) in general, much less differentiability. It is often the case that, after formal calculation of the influence function, proving that it provides a suitable approximation is more easily done by ad hoc methods. We illustrate this in the following Theorem 7.2.2. In this result, we also illustrate that the influence function gives us all relevant first order asymptotic information about ν(Pb ) if we have a vector parameter ν(P ) = (ν1 (P ), . . . , νd (P ))T . If the expansion (7.2.2) is valid for νj (P ), j = 1, . . . , d, and ψj (x, P ) is the influence function corresponding to νj (P ), then n √ 1 X ψ(Xi , P ) + oP (1) n(ν(Pb) − ν(P )) = √ n i=1 (7.2.18) where ψ(x, P ) ≡ (ψ1 (x, P ), . . . , ψd (x, P ))T and EP ψ(X, P ) = 0, EP |ψ|2 (X, P ) < ∞. Here ψ is the natural extension of the concept of an influence function from one dimensional parameters to d dimensional parameters. By the multivariate central limit theorem, we conclude that √ n(ν(Pb) − ν(P )) =⇒ Nd (0, EP ψψ T (x)) . (7.2.19) We now present a method of Huber(1967) for establishing influence function approximations for M estimates which covers situations like the median or its analogues in more than 1 dimension. It is an extension of Theorem 6.2.2. Theorem 7.2.2. Suppose φ : X × Rp → Rp satisfies bn is such that A0’: θ Z bn )dPbn (x) = oP (n− 12 ) . φ(x, θ (7.2.20) R A1: θ(P ) uniquely satisfies φ x, θ(P ) dP (x) = 0 for all P ⊂ Q. 2 A2: EP φ X, θ(P ) < ∞ for all P ∈ Q. bn is consistent for θ(P ) for P ∈ Q. A5’: θ A4’: The set Tε = {φ(·, θ) − φ(·, θ(P )) : |θ − θ(P )| ≤ ε} satisfies the conditions of Theorem 7.1.3 with envelope function s(·, ε) such that EP s2 (X, ε) = o(1) as ε → 0, where statements about the vector φ refer to each of its components φj . R A3’: The map θ → φ(x, θ)dP (x) has a total differential H(P ) at θ(P ) which is nonsingular, ∂ H(P ) = EP φi (X, θ(P )) ∂θj 1≤i,j≤p 46 Tools for Asymptotic Analysis Chapter 7 where [·] denotes a matrix. Then, n X 1 b = θ(P ) + 1 θ [−H]−1 (P )φ(Xi , θ(P )) + oP (n− 2 ) . n i=1 (7.2.21) Note (Problem 7.2.11) that formal calculation of the influence function for θ(P ) as a solution of (6.2.2) yields H −1 φ. Proof: By A4’ and Theorem 7.1.3, sup{|En (φ(·, θ) − φ(·, θ(P )))| : |θ − θ(P )| ≤ εn } = oP (1) (7.2.22) if εn → 0. By A5’, it follows that bn ) − φ(·, θ(P ))) = op (1) . En (φ(·, θ (7.2.23) Translating (7.2.23) by using (7.1.19) and using A0’ and A1, we obtain that 1 1 b n ) = n− 2 −n 2 EP φ(X, θ 1 But, by A3’ and A1, since n− 2 n X φ(Xi , θ(P )) + oP (1) . (7.2.24) i=1 Pn i=1 φ(Xi , θ(P )) = OP (1) by the central limit theorem, bn ) = H(P )(θ bn − θ(P )) + oP (|θ bn − θ(P )|) EP φ(X, θ since Ep φ(X, θ(P )) = 0. Combining (7.2.24) and (7.2.25) we obtain the result. (7.2.25) ✷ Here is an example. Example 7.2.5. Multivariate “medians.” Define for P on Rd . Z θ(P ) ≡ arg min |x − θ|r dP (x) where |x| is the Euclidean norm and 1 ≤ r ≤ 2. It is not hard to show (Problem 7.2.12) R that θ(P ) is defined if |x|r dP (x) < ∞ and is unique for r > 1. It is defined when r = 1, which corresponds to the median for d = 1, but then θ(P ) may not be unique. If P has a density p > 0, then θ(P ) is unique even if r = 1 (Problem 7.2.12). Suppose P is such that (i) P has a positive density p. R (ii) |x|2(r−1) p(x)dx < ∞. Then, where θ(Pbn ) = θ(P ) + n 1 1X ψ(Xi , P ) + oP (n− 2 ) n i=1 ψ(x, P ) = M (P )(x − θ(P ))|x − θ(P )|r−2 Section 7.2 The Delta Method in Infinite Dimensional Space 47 and, if X ≡ (X1 , . . . , Xd )T , δij = 1[i = j], (Xi − θi (P ))(Xj − θj (P )) −1 r−2 M (P ) = EP . (7.2.26) (r − 2) + δij |X − θ(P )| |X − θ(P )|4−r The matrix M (P ) is well defined for r > 1 and corresponds to [−H] in (7.2.21). For r = 1, expression (7.2.26) becomes ∞ − ∞, but passage to the limit as r decreases to 1 yields ψ(x, P ) = sgn(x − θ(P ))/2p(θ(P )) as in Problem 5.4.1(e). If r > 1, EP |X|2r−2 < ∞ makes (7.2.26) well defined — see Problem 7.2.13. Checking the conditions of Theorem 7.2.2 otherwise requires some arguments — see Huber (1967) and Problems 7.2.11 and 7.2.12. ✷ 7.2.2 The Quantile Process We now give another approach to justifying the quantile representation (7.2.13) and deducing weak convergence for an infinite dimensional parameter, the quantile process defined by √ (7.2.27) Qn (t) ≡ n(Fb−1 (t) − F −1 (t)), 0 < t < 1 for suitable F . The approach is due to Shorack (1982); see also Shorack and Wellner (1986). Let W 0 (u) denote the Brownian bridge on [0, 1]. Note that −W0 (u) is also a Brownian bridge. We state the result as Theorem 7.2.3. Suppose P on R has distribution F with continuous positive density f = F ′ . Then, F F −1 (t) = t for 0 < t < 1 and (i) For all 0 < ε ≤ 21 , Qn (t) can be approximated by −En F −1 (t) /f F −1 (t) in the sense that sup{|Fb −1 (t)−F −1 (t)+ 1 (Fb(F −1 (t)) − t) | : ε ≤ t ≤ 1−ε} = oP (n− 2 ). (7.2.28) −1 f (F (t)) (ii) If T = [ε, 1 − ε], and Qn (t) defined in (7.2.27) is viewed as a stochastic process on T , then Qn =⇒ W 0 (·)/f (F −1 (·)) (7.2.29) where, by definition, W 0 (·)/f (F −1 (·)) is a Gaussian process with mean 0 on (0, 1), W 0 (s) s(1 − t) W 0 (t) Cov = , , s ≤ t, f (F −1 (s)) f (F −1 (t)) f (F −1 (s))f (F −1 (t)) which has continuous sample functions since W 0 (·) does. (iii) If f (F −1 (t)) ≥ δ, 0 ≤ t ≤ 1, then ε can be taken as 0 in (i) and (ii). 48 Tools for Asymptotic Analysis Chapter 7 (iv) If P = U(0, 1) and En (·) is the classical empirical process, we have kQn + En k∞ = oP (1) (7.2.30) and Qn (·) converges weakly to the Brownian bridge −W 0 (·). Proof. Write (Fb−1 (t) − F −1 (t)) Fb −1 (t) − F −1 (t) = .(F (Fb −1 (t)) − t) F (Fb−1 (t)) − t (Fb −1 (t) − F −1 (t)) b−1 F (F (t)) − Fb (Fb −1 (t)) + ∆n1 (t) = F (Fb−1 (t)) − F (F −1 (t)) ≡ I(t) × II(t) (7.2.31) where ∆n1 (t) = Fb Fb−1 (t) − t satisfies |∆n1 (t)| ≤ 1/n by the definition of Fb −1 (t) since F is continuous. We prove (iv) first. If F (x) = x, 0 ≤ x ≤ 1, kFb −1 − F −1 k∞ = oP (1) (7.2.32) by the Glivenko-Cantelli Theorem (Problem 7.2.16). Now, in the U(0, 1) case, I(t) ≡ 1. On the other hand, 1 1 II(t) = −(En (Fb−1 (t)) + O(n− 2 ))n− 2 = −(En (t) + ∆n2 (t) + O(n − 12 ))n (7.2.33) − 21 where |∆n2 (t)| ≤ k∆n2 k∞ ≤ sup{|En (u) − En (v)| : |u − v| ≤ kFb −1 − F −1 k∞ } . Then k∆n2 k∞ > ε ⊂ sup{|En (u) − En (v)| : |u − v| ≤ δ} > ε + kFb −1 − F −1 k > δ by intersecting the event on the left hand side with kFb −1 − F −1 k∞ ≤ δ and its complement. Thus, lim sup P [k∆n2 k∞ > ε] ≤ lim sup P sup |En (s) − En (t)| : |s − t| ≤ δ > ε n n + lim sup P kFb −1 − F −1 k∞ > δ (7.2.34) n for all δ, ε > 0. But the first term in (7.2.34) is just P sup{kW 0 (s) − W 0 (t)k : |s − t| ≤ δ} > ε Section 7.2 49 The Delta Method in Infinite Dimensional Space and the second is 0 for all δ > 0 by the Glivenko-Cantelli theorem (Problem 7.2.16). Then (7.2.30) for P = U(0, 1) follows from (7.2.33) and (7.2.34). b be their empirical df. Then, if For the (i) case, let U1 , . . . , Un be i.i.d. U(0, 1) and G −1 Xi ≡ F (Ui ), 1 ≤ i ≤ n, X1 , . . . , Xn are i.i.d. with df F . Taking Xi represented in this way as our sample from F , we can write and b−1 (t)) − F −1 (G−1 (t)) Fb−1 (t) − F −1 (t) = F −1 (G (7.2.35) b −t. Fb(F −1 (t)) − t = G(t) (7.2.36) P −1 b −1 (t). Then tn −→ G (t) = t uniformly on [ε, 1 − ε] by (7.2.32). Thus, Set tn = G uniformly on [ε, 1 − ε], F −1 (tn ) − F −1 (t) P ∂ −1 1 . −→ F (t) = −1 tn − t ∂t f F (t) More precisely, by (7.2.30), where b−1 (t)) − F −1 (G−1 (t)) 1 F −1 (G = (1 + ∆n (t)) −1 (t)) −1 −1 b f (F G (t) − G (t) P sup{|∆n (t)| : ε ≤ t ≤ 1 − ε} −→ 0 because f (F −1 (t)) is continuous and bounded away from 0 on [ε, 1 − ε] if ε > 0, by assumption. Thus (i) follows. The case (iii) also follows. Finally, (ii) follows from Donsker’s theorem (Theorem 7.1.4) and Slutsky’s theorem generalized to weak convergence in function spaces (Corollary 7.1.2). ✷ Here is a first application of this result. Example 7.2.6. Linear combinations of order statistics. A class of parameters which includes the trimmed means (3.5.3) are Z 1 ν(P ) = F −1 (t)dΛ(t) (7.2.37) 0 where Λ(t) is the df of a probability distribution on [0, 1]. If Λ(t) = 0, = 1, t<ε t>1−ε or equivalently Λ[ε, 1 − ε] = 1, the parameter ν(P ) is defined for all F . A special case is Λα with density λα (t) = (1 − 2α)−1 , α ≤ t ≤ 1 − α = 0 otherwise (7.2.38) 50 Tools for Asymptotic Analysis Chapter 7 for 0 < α < 12 . Then ν(Pb ) is just the α trimmed mean. From Theorem 7.2.3 (iii), since h→ Z 1−ε h(u)dΛ(u) ε is continuous in k · k∞ we have that Z 1 Z 1 √ n(Fb−1 (t) − F −1 (t))dΛ(t) =⇒ [f (F −1 (t))]−1 W 0 (t)dΛ(t). (7.2.39) 0 0 The limit is N (0, σ 2 (Λ, f )) (Problem 7.2.16) with Z 1Z 1 2 σ (Λ, f ) = [f (F −1 (s))f (F −1 (t))]−1 (s ∧ t − st)dΛ(s)dΛ(t). 0 (7.2.40) 0 We can get a simpler expression for σ 2 (Λ, f ) and more insight by using part (ii). We obtain, if λ(s) = 0, s ≤ ε or s ≥ 1 − ε, ε > 0, Z 1 0 (Fb−1 (s) − F −1 (s))λ(s)ds = − So the influence function of R1 ψ(x, P ) = − = − = − 0 Z 0 Z 0 1 1 (Fb (F −1 (s)) − s) λ(s)ds + oP (n− 2 ) . −1 f (F (s)) Fb−1 (s)λ(s)ds is 1 0 Z Z 1 (1(x ≤ F −1 (s)) − s) λ(s)ds f (F −1 (s)) (1(x ≤ F −1 (s)) − s)λ(s)dF −1 (s) (1(x ≤ y) − F (y))λ(F (y))dy Z ∞ Z ∞ = − λ(F (y))dy − EP ( λ(F (y))dy . x (7.2.41) X Specializing to the trimmed mean, with λ = λα given by (7.2.38), we get, as influence function, F −1 (α) − µα (P ), x ≤ F −1 (α) 1 − 2α x = − µα (P ), F −1 (α) ≤ x ≤ F −1 (1 − α) 1 − 2α F −1 (1 − α) − µα (P ), x ≥ F −1 (1 − α) = 1 − 2α ψα (x, P ) = (7.2.42) where µα (P ) = α(F −1 (α) + F −1 (1 − α)) + (1 − 2α)−1 1 − 2α Z F −1 (1−α) F −1 (α) x dF (x) . Section 7.3 51 Further Expansions From (7.2.41) the asymptotic variance of √ nνα (Pb ) is σ 2 (Λ, f ) = (1 − 2α)−2 {(F −1 (α) − µα (P ))2 + (F −1 (1 − α) − µα (P ))2 Z F −1 (1−α) (x − µα (P ))2 dF (x)} . (7.2.43) + F −1 (α) Note that if f is symmetric, f (a + x) = f (a − x) for some a, then −1 να (P ) = (1 − 2α) Z F −1 (1−α) x dF (x) = a. F −1 (α) Note the agreement between this formula and the sensitivity curve of Figure 3.5.2. ✷ Remark 7.2.1. Finally, note that (7.2.39) makes sense even if λ > 0 on all of (0,1) and is correct for the mean itself, λ ≡ 1. In fact (7.2.39) and (7.2.41) are valid much more generally; see Problems 7.2.21–7.2.24 and results in Bickel (1967), Chernoff, Gastwirth and Johns (1967), and Stigler (1969). ✷ Summary. We introduced a generalization of the stochastic delta method from functions of i.i.d. vectors of dimension d to functions ν(Pb) of the empirical distribution. We did this by introducing Gâteaux and Fréchet derivatives of functions defined on ∞ dimensional linear spaces. These derivatives correspond, respectively, to partial differentiation in a fixed direction and the existence of a total differential. They lead to the important notion of P the influence function which tells us what is the “right” approximation of the n form n−1 i=1 ψ(Xi , P ) to ν(Pb) we should aim for. We derive influence functions for functions of multinomial proportions, functions of sample moments, sample quantiles, the Cramér-von Mises statistic, multivariate medians, and linear combination of order statistics. The Fréchet derivative is used to obtain the validity of influence function approximation for statistics which are not simply functions of vector means or characterizable as M estimates. Empirical process theory is used to justify more general results for M estimates than those of Chapter 5, and in Section 7.2.2 yields weak convergence of the quantile process. 7.3 7.3.1 Further Expansions The von Mises Expansion The influence function gives us the analogue of the first derivative of a parameter. The von Mises expansion is the formal analogue of a Taylor series and tells us, for instance, what we can expect if the “first derivative” vanishes. Write, given ν : M → R, ν(P + ε(Q − P )) = ν(P ) + m X ν (k) (P − Q) k ε + o(εm+1 ) k! k=1 ∂k ν(P + ε(Q − P ))|ε=0 where ν (k) (Q − P ) = ∂εk (7.3.1) 52 Tools for Asymptotic Analysis Chapter 7 the usual Taylor expansion. Suppose, as can easily be shown to hold under mild conditions in case X is finite (Problem 7.3.1), that ∂k ν(P + ε(Q − P ))|ε=0 = ∂εk Z ... Z ψ(x1 , . . . , xk , P ) Πkj=1 dQ(xj ) (7.3.2) where (i) ψ is symmetric in x1 , . . . , xk . R (ii) ψ(x, X2 , . . . , Xk , P ) dP (x) = 0 with P probability 1 or, equivalently, EP (ψ(X1 , . . . , Xk , P ) | X2 , . . . , Xk ) = 0 . (7.3.3) From (ii) it follows that Z ... Z ψ(x1 , . . . , xk , P )Πkj=1 dQ(xj ) = Z ... Z ψ(x1 , . . . , xk , P )Πkj=1 d(Q−P )(xj ) . (7.3.4) Further, ψ may be computed formally as follows. For Q1 , . . . , Qk ∈ M, define k ψ0 (Q1 , . . . , Qk , P ) ≡ where εj ≥ 0, Pk j=1 εj X ∂k εj (Qj − P ))|ε1 =...=εk =0 ν(P + ∂ε1 . . . ∂εk j=1 (7.3.5) < 1. Then, extending (7.2.4), ψ(x1 , . . . , xk , P ) = ψ0 (δx1 , . . . δxk , P ) . (7.3.6) It is easy to see that (7.3.5) and (7.3.2) imply (7.3.6) and (7.3.3) (Problem 7.3.2). Formally, (7.3.2) and (7.3.4) suggest that 1 ν(Pb ) = ν(P )+n− 2 Z ψ(x, P )dEn (x)+ n−1 2 Z ψ(x, y, P )dEn (x)dEn (y)+. . . . (7.3.7) 1 b 2 Still formally, if ψ(·, P ) 6= 0 (7.3.7) R suggests that n (ν(P ) − ν(P )) is of order 1 and its behaviour is the same as that of ψ(x, P ) dEn (x). This is just the linear approximation (7.2.2). Interpretation of von Mises expansion terms. Stochastic integrals As we have already noted, by the multivariate central limit theorem, for T = L2 (P ), R FIDI En −→ WP0R. Now, En (f ) = f (x)dEn (x) by definition. Can we similarly interpret WP0 (f ) as f (x)dWP0 (x) where WP0 (x) is the (transformed) Brownian bridge? This cannot be done in the Riemann-Stieltjes or Lebesgue-Stieltjes way since the sample functions Section 7.3 53 Further Expansions of W 0 are not of bounded variation (Breiman (1968), pp. 261–263). However, if f is piecewise constant, J X aj 1(cj < x ≤ cj+1 ) , f (x) = j=1 R PJ then, if we naturally define f (x)dWP0 (x) ≡ j=1 aj (WP0 (cj+1 )− WP0 (cj )), it is easy to R see that, indeed, f (x)dWP0 (x) has a N (0, VarP f (X)) distribution and can be identified with WP0 (f ). More generally, if f1 , . . . , fk are k such functions the joint distribution of Z fj (x)dWP0 (x), 1 ≤ j ≤ k coincides with that of {WP0 (fj ) : 1 ≤ j ≤ k}. It may be shown, since any R f ∈ L2 (P ) can be approximated in the L2 (P ) sense by such functions, we can define f dWP0 for all f ∈ L2 (PR) with the correct FIDI’s for any finite set of such variables. It turns out that, for k ≥ 2, if b2 (x)dP (x) < ∞, it is, unfortunately, possible to define Z Z . . . b(x1 , . . . , xk ) dWP0 (x1 ) . . . dWP0 (xk ) in two distinct ways as the Ito or Stratonovich stochastic integral. However, if b(x) = 0 whenever x = (x1 , . . . , xk ) has at least two coordinates equal, then the two integrals coincide and our heuristics apply. Our heuristics suggest that if ψ(·, P ) = 0, then the expansion (7.3.7) is of order 2 and Z Z 0 0 b 2n(ν(P ) − ν(P )) =⇒ L ψ(x1 , x2 , P )dWP (x1 )dWP (x2 ) . (7.3.8) If P [X1 = X2 ] = 0, the right hand side What makes (7.3.8) valuable, R R is uniquely defined. 0 0 if it holds, is that the distribution of ψ(x , x , P )dW (x 1 2 1 )dWP (x2 ) is always of the P P∞ 2 form k=1 λk Zk R, where the λk are the eigenvalues of the integral operator which sends h ∈ L2 (P ) into ψ(x1 , x2 , P )h(x1 )dP (x1 ) and the Zj are i.i.d. N (0, 1). Here is an example. Example 7.3.1. (Example 7.2.4. continued). Null distribution of the Cramér-von Mises statistic. Consider ν(F ) given in that example. By (7.2.17), ψ(x, P0 ) = 0 if F0 ↔ P0 . However, we can compute, if F0 is continuous and Gj is the df of the probability Qj , Z ∂ν (F0 +ε1 (G1 −F0 )+ε2 (G2 −F0 ))|ε1 =ε2 =0 = 2 (G1 −F0 )(G2 −F0 )(z) dF0 (z) . ∂ε1 ∂ε2 Thus, ψ(x, y, P0 ) = 2 Z (1(z ≥ x) − F0 (z))(1(z ≥ y) − F0 (z))dF0 (z) = 2F0 (x ∧ y) − (F02 (y) + F02 (x)) + 2 . 3 54 Tools for Asymptotic Analysis Chapter 7 Without loss of generality (Problem 4.1.11), we can take F0 (t) = t, 0 ≤ t ≤ 1, and obtain the required eigenvalues as λj = 1/j 2 n2 . It may be shown (Shorack and WellR1 R1R1 ner (1986), Ch. 5) that, indeed, 0 [W 0 ]2 (u)du = 0 0 (2(x ∧ y) − (x2 + y 2 ) + 2 0 0 ✷ 3 )dW (x)dW (y). Unfortunately, higher order terms in the von Mises expansion do not have closed form limiting distributions. However, the limit theorems we can prove do suggest how to propm erly estimate the limit distribution of n 2 (ν(Pb ) − ν(P )) if ψ(x1 , . . . , xj , P ) = 0 for all j < m via the m out of n bootstrap discussed in Chapter 10. 7.3.2 The Hoeffding and Analysis of Variance Expansions Here is another expansion of a symmetric statistic. Its terms are harder to compute but have important orthogonality properties which ease asymptotic analysis. Let Pb be the empirical df of the sample X1 , . . . , Xn and think of ν(Pb) as a function q(X1 , . . . , Xn ) where q is symmetric, and represent q as Z q(x1 , . . . , xn ) = q(y)dHx1 (y1 ) . . . dHxn (yn ) where Hx (y) = 1(y ≥ x) the distribution function of δx , point mass at x. On the other hand, consider Z EP q(X1 , . . . , Xn ) = q(y) dP (y1 ) . . . dP (yn ) . Note the expansion to n terms for the product Πni=1 yi around y = x Πni=1 yi = Πni=1 xi + n X X {Πxi : i ∈ J¯m }{Π(yi − xi ) : i ∈ Jm } m=1 Jm where Jm ranges over all distinct subsets of m integers out of {1, . . . , n} and J¯m is the complement of Jm . We get, after some algebra, q(X1 , . . . , Xn ) = EP q(X1 , . . . , Xn ) + (7.3.9) Z Z n XX { . . . q(y1 , . . . , yn )}Π{dP (yi ) : i ∈ J¯m }Π{d(HXi − P )(yi ) : i ∈ Jm }. m=1 Jm (0) (n) (1) and let Mq = EP q(X1 , . . . , Xn ). Write X = Mq (Jm ) , Call the sums in the series Mq , . . . , Mq Mq(m) Jm where Mq (Jm ) is the integral in (7.3.9) for a fixed Jm . Note that Mq (Jm ) is a function of {Xi : i ∈ Jm } only. Thus, Mq ({i}) = EP {q(X1 , . . . , Xn )|Xi } − Mq(0) . (7.3.10) Section 7.3 55 Further Expansions So we may write, abusing notation, Mq ({i}) as Mq (Xi ), and similarly, Mq ({i, j}) as Mq (Xi , Xj ) given by, Mq ({i, j}) = EP {q(X1 , . . . , Xn )|Xi , Xj } − Mq ({i}) − Mq ({j}) + Mq(0) . (7.3.11) If S is a subset of {1. . . . , n} with |S| = m + 1 where |S| denotes cardinality, then, more generally, expanding in (7.3.9) we get the recursion Mq (S) = EP {q(X1 , . . . , Xn )|{Xi : i ∈ S}} X − {Mq (T ) : T ⊂ S, |T | = m} X + {Mq (T ) : T ⊂ S, |T | = m − 1} X − {Mq (T ) : T ⊂ S, |T | = m − 2} + · · · (−1)m+1 Mq(0) . (7.3.12) This is the Moebius expansion of Mq (S) — see Stanley (1986), p.116. Suppose we drop the assumption of symmetry on q and simply take Xi independent with Xi ∼ Pi , i = 1, . . . , n. Then, our development remains valid once we replace P by Pi appropriately in (7.3.9) and, in particular, q(X1 , . . . , Xn ) = n X j=0 Mq(j) = n X X j=0 {Mq (S) : |S| = j} (7.3.13) (0) with Mq (S) defined by Mq = E q(X1 , . . . , Xn ), and the recursion (7.3.12) (provided that E|q(X1 , . . . , Xn )| < ∞). The expression simplifies considerably if q is symmetric and X1 , . . . , Xn are i.i.d. In that case, Mq (T ) = uq (Xi : i ∈ T ) with |T | = m, and the integral in (7.3.9) is Z Z m wq (x1 , . . . , xm ) ≡ . . . Eq(y1 , . . . , ym , Xm+1 , . . . , Xn )πi=1 d(Hxi − P )(yi ) = m X r=0 (−1)n−r X {vq (xi1 , . . . , xir ) : {i1 , . . . , ir } ⊂ {1, . . . , m}} (7.3.14) where vq (x1 , . . . , xr ) = Eq(x1 , . . . , xr , Xr+1 , . . . Xn ). The importance of the expansion (7.3.9) comes from the following interpretation (The(j) orem 7.3.1) of the Mq which does not make use of either the identical distribution of the Xi or the symmetry of q. Suppose X1 , . . . , Xn are independent. Let Λ0 be the space of all variables of the form P P constants, let Λ1 be the linear subspace of L2 (P ) of all random u (X ), Λ that of all random variables of the form, i 2 1≤i<j≤n uij (Xi , Xj ), and i=1 i ⊥ so on. Let Vj = Λj ∩ Λj−1 , the linear subspace of all members of Λj orthogonal to all members of Λj−1 , and let π(q|Vj ) denote the L2 (P ) projection of q on Vj in L2 (P ). We write U ⊥ V iff E(U V ) = 0. See Section B.10.3.2. We state our interpretation as 56 Tools for Asymptotic Analysis Chapter 7 Theorem 7.3.1. With the definitions and assumptions of the previous paragraph, if Eq 2 (X1 , . . . , Xn ) < ∞, then Mq(j) = π(q(X1 , . . . , Xn )|Vj ) (7.3.15) and the representation, q(X1 , . . . , Xn ) = n X Mq(j) , (7.3.16) j=0 is a decomposition of q(X1 , . . . , Xn ) into mutually orthogonal functions of X. In fact, Mq (S) ⊥ Mq (T ) (7.3.17) (j) unless S = T and so each of the Mq have an orthogonal representation as X Mq(j) = {Mq (S) : |S| = j} (7.3.18) Proof. The key observation is that, if Eh2 (Xi : i ∈ S) < ∞, and S ∩ T 6= S, then Z E (7.3.19) h(yi : i ∈ S)Π{d(HXi − Pi )(yi ) : i ∈ S}|Xk : k ∈ T = 0. The reason is that the conditional expectation form passes through the integral and product, since the variables are independent, and E[{Π{d(HXi − Pi )(yi ) : i ∈ S}|Xk : k ∈ T } (7.3.20) = Π{E[d(HXi − Pi )(yi )|Xk : k ∈ T ] : i ∈ S} = 0 since one of the terms in the product must vanish given S ∩T 6= S. Statements like (7.3.20) are evidently rigorous if the Pi are discrete, so that d(HXi − Pi )(yi ) = 1(Xi = yi ) − P [Xi = yi ]. The general case can be obtained by induction on the cardinality of S (Problem 7.3.7). Claim (7.3.17) follows since S 6= T implies that either S ∩ T 6= S or S ∩ T 6= T or both and EMq (S)Mq (T ) = EMq (S)E{Mq (T )|Xi : i ∈ S} = EMq (T )E{Mq (S)|Xi : i ∈ T } . Then, (7.3.19) implies that one or the other conditional expectation is 0. Finally, we claim that Mq(j) ∈ Vj , 0≤j ≤m. (7.3.21) Section 7.3 57 Further Expansions It is enough to argue that, if |S| = j, Mq (S) ∈ Λj , and that X Mq (S) ⊥ {hT {Xi : i ∈ T } : |T | = j − 1} . That Mq (S) ∈ Λj follows by definition. On the other hand, EMq (S)hT {Xi : i ∈ T } = E[E{Mq (S)|Xi : i ∈ T }]hT (Xi : i ∈ T ) = 0 since |S| = j, |T | = j − 1 implies that S 6= T . Claim (7.3.21) follows. Write X π(q|Vj ) = Mq(j) + π( Mq(i) |Vj ) i6=j = Mq(j) + X i6=j (i) The second term is 0 since by construction Mq (i) (i) Mq ∈ Λj−1 while, if i > j, Mq theorem follows. π (Mq(i) |Vj ). ⊥ Vj for i < j since Vj ⊂ Λ⊥ j−1 and ⊥ Vj since Vj ⊂ Λj ⊂ Λi−1 . Claim (7.3.15) and the ✷ Theorem 7.3.1 has a number of interesting consequences: Let τ 2 (S) = Var Mq (S) . Then, Var q(X1 , . . . , Xn ) = n X X m=1 {τ 2 (S) : |S| = m} = n X VarMq(m) . (7.3.22) m=1 If we specialize to q symmetric, X1 , . . . , Xn i.i.d., (7.3.22) becomes Var q(X1 , . . . , Xn ) = n X n r=1 r Var vq (X1 , . . . , Xr ) . (7.3.23) A number of important inequalities and other results follow from (7.3.22) and (7.3.23); see Serfling (1980) and van Zwet (1984) as well as the classic paper of Hoeffding (1948). (1) The first term Mq of the Hoeffding expansion of a statistic of the form ν(Pb) is, by (7.3.10), a sum of i.i.d. variables, and is not surprisingly closely related to the first term of the von Mises expansion — the influence function approximation to ν. Unfortunately, E{ν(Pb )|X1 = x} is not, in general, easy to compute and unlike ψ(x, P ) depends on n. However, showing that it is a good approximation can be easier. Here is a simple result. Theorem 7.3.2. Suppose X1 , . . . ,Xn are i.i.d., qn (X1 , . . . , Xn ) is a sequence of symmetric statistics with 0 < Var qn (X) < ∞, and un (x) ≡ E(qn (X)|X1 = x) . 58 Tools for Asymptotic Analysis Chapter 7 Write qn for qn (X). Suppose Var(qn ) = n Var un (X1 ) + o(1) . Then, qn = Eqn + n X un (Xi ) + oP (1). (7.3.24) i=1 n 1hP io n If further L n− 2 → N (0, σ 2 ), then i=1 un (Xi ) − E un (Xi ) 1 L n− 2 qn − E(qn ) → N (0, σ 2 ) . (0) Proof. Define Mq (1) = Eqn (X), and Mq Mq(1) = n X i=1 (7.3.25) via (un (Xi ) − Eun (Xi )) . The result follows because n n X 2 X un (Xi )) un (Xi ) = Var (qn ) − Var ( E qn − E(qn ) − i=1 i=1 since (qn − E(qn )) − n X i=1 un (Xi ) ⊥ n X un (Xi ) i=1 by the definition of un and Theorem 7.3.1. So (7.3.24) follows and then so does (7.3.25) by Slutsky’s theorem. ✷ We illustrate with, Example 7.3.2. U statistics. Our model is X1 , . . . , Xn i.i.d. as X ∼ P ∈ P. Given a function φ : X → R which is symmetric, a U statistic of order m is given by Un ≡ 1 X {φ(Xi1 , . . . Xim ) : 1 ≤ i1 < . . . < im ≤ n} . n (7.3.26) m If P = {P : EP Un < ∞}, then it may be shown that Un is the UMVU estimate of θn (P ) ≡ EP φ(X1 , . . . , Xm ) (Problem 7.3.7). A U statistic of order 1 is just a mean of i.i.d. variables such as the sample mean. An example of a U statistic of order 2, for X = R, is the one-sample Wilcoxon rank statistic for testing symmetry of P , W = 1 X 1(Xi + Xj > 0) . n 2 i<j Section 7.3 59 Further Expansions To apply Theorem 7.3.2, set qn = P i<j 1(xi + xj > 0), and note E 1(X1 + X2 > 0|X1 = x) = P (x + X2 > 0) = 1 − F (−x) . It follows that the projection of qn is qbn ≡ n X un (Xi ) = n X i=1 i=1 1 − F (−Xi ) . Consider the hypothesis H : F is symmetric about zero. Then VarH (qn ) = 4(n − 2)Var 1 − F (−X1 ) and the assumptions of Theorem 7.3.2 are satisfied. Because 1 − Fp(−X1 ) = F (X1 ) ∼ 1 U(0, 1) under H, Var(qn ) = (n − 2)/3, and n− 2 (qn − 21 ) → N (0, 1/3). For X = R2 , X = (Z, V ), Kendall’s τ statistic for testing independence of Z and V , τ= 1 X 1((Zi − Zj )(Vi − Vj ) > 0), n 2 i<j is also a U statistic of order 2. We will study these two statistics further in Problems 7.3.11 and 7.3.16. Next we analyze U statistics generally using the Hoeffding and von Mises approaches. Note first that for a U statistic of order m with Eφ2 (X1 , . . . , Xm ) < ∞, the Hoeffding expansion is truncated, since Un ∈ Λm , Un = m X (j) MTn . j=0 Therefore, from (7.3.22) and (7.3.23), for vq (x1 , . . . , xj ) ≡ E Un (x1 , . . . , xj , Xj+1 , . . . , Xn ) , Var (Un ) = m X n j=1 j Var vq (X1 , . . . , Xj ) . By symmetry, vq (x) = E = h 1 X i {φ(Xi1 , . . . , Xim ) : 1 ≤ i1 ≤ . . . ≤ in X1 = x n m m−1 m−1 φ(x, X2 , . . . , Xm ) n m = m Eφ(x, X2 , . . . , Xm ) . n (7.3.27) 60 Tools for Asymptotic Analysis More generally, for 1 ≤ r ≤ m, vq (x1 , . . . , xr ) = so that n−r m−r n m Var (Un ) = Chapter 7 Eφ(x1 , . . . , xr , Xr+1 , . . . , Xm ) m X r=1 m r where n−r m−r n m ξr , ξr ≡ VarE φ(X1 , . . . , Xm ) X1 , . . . , Xr . (7.3.28) (7.3.29) For the one-sample Wilcoxon statistic we have E φ(X1 , X2 )|X1 = x = 1 − F (−x) and E φ(X1 , X2 )|X1 , X2 ) = 1(X1 + X2 > 0). It follows that ξ1 = Var 1 − F (−X1 ) , ξ2 = Var 1(X1 + X2 > 0) , and Var(W ) = 2 2(n − 2)ξ1 + ξ2 (1 − ξ2 ) /n(n − 1) . ✷ Note that closely related to Un is the von Mises statistic, Z Z Vn ≡ . . . φ(x1 , . . . , xm ) dPb (x1 ) . . . dPb (xm ) = So, Vn = n m nm Un + 1 X {φ(Xi1 , . . . , Xim ) : 1 ≤ i1 , . . . , im ≤ n} . nm Z ... S Z φ(x1 , . . . , xm ) dPb (x1 ) . . . dPb (xm ) (7.3.30) where S = {(x1 , . . . , xm ) : xi = xj for some i 6= j}. The influence function calculation readily yields for Vn , Z Z ψ(x, P ) = m( . . . φ(x, x2 , . . . , xm ) dP (x2 ) . . . dP (xm ) − EP φ(X1 , . . . , Xm )) . However, application of Theorem 7.3.2 for Vn is difficult since F appears as dF . On the other hand, consider Un . Proposition 7.3.1. For Un as (7.3.26), if 0 < Var vq (X1 ) < ∞, then where 1 n 2 Un − EP φ(X1 , . . . , Xm ) =⇒ N 0, EP ψ 2 (X1 , P ) ψ(X, P ) = mEP φ(X1 , . . . , Xm )|X1 . Section 7.3 61 Further Expansions 1 Proof. By (7.3.27) and (7.3.23), if qn = n 2 Un , 1 vq (X1 ) = n− 2 ψ(X1 , P ) and from (7.3.28), 1 Var(n 2 Un ) = mξ1 + OP (n−1 ) , since n −1 r (7.3.31) = O(n−r ). Now apply Theorem 7.3.2 to qn above. Then, n X 1 un (Xi ) = n− 2 i=1 n X ψ(Xi , P ) i=1 so that, since ξ1 = Var ψ(X1 , P ), the conditions of the theorem are satisfied and Proposition 7.3.1 follows. ✷ Remark 7.3.1. If ψ(·, P ) ≡ 0 the limit of 2n[Un − E(Un )] is of the form Z Z ψ(x, y, P )dW 0 F (x) dW 0 F (y) . See (7.3.8). This is also true for Vn . That is, the second term in (7.3.30) is negligible if Eφ2 (X(i1 ) , X(i2 ) , . . . , X(ir ) ) < ∞ for all i1 + . . . + ir = m and X(ij ) = (Xj , Xj , . . . , Xj ) with the dimension of X(ij ) being ij . We leave this to Problem 7.3.9. The von Mises and Hoeffding expansions can be compared quite precisely for U statistics if we assume that φ(x1 , . . . , xm ) = 0 whenever xi = xj for some i 6= j. In that case, −m n Un Vn = n m and further, for ψ(x1 , . . . , xj , P ) and wq as defined in (7.3.14) and (7.3.6), w0 (x1 , . . . , xj ) = ψ(x1 , . . . , xj , P )n−j (7.3.32) for j = 1, . . . , m. We leave this result to Problem 7.3.9. We again refer to Serfling (1980) for an extensive treatment of U statistics. ✷ The analysis of variance expansion Suppose Z1 , . . . , Zk are independent predictor variables, Zj taking on values in Zj ≡ {zj1 , . . . , zjnj } with equal probabilities n−1 j , for 1 ≤ j ≤ k. Then if Y is a “response” variable we are interested in the regression surface, µ(z1 , . . . , zk ) ≡ E(Y |Z1 = z1 , . . . Zk = zk ) . 62 Tools for Asymptotic Analysis Chapter 7 Here µ(Z1 , . . . , Zk ) is a general function of (Z1 , . . . , Zk ) and thus can be expanded in the Hoeffding sense as µ(Z1 , . . . , Zk ) = Eµ(Z1 , . . . , Zk ) + k X j=1 + . . . + Uµ(k) . {E(µ(Z1 , . . . , Zk )|Zj ) − Eµ(Z1 , . . . , Zk )} Here the first term is the mean response which is X 1 Eµ(Z1 , . . . , Zk ) ≡ µ(·, · · ·, ·) = {µ(z1 , . . . , zk ) : zj ∈ Zj , 1 ≤ j ≤ k} . n1 . . . nk The next term is E(µ(Z1 , . . . , Zk )|Z1 = z) − Eµ(Z1 , . . . , Zk ) = [µ(z, ·, · · ·, ·) − µ(·, · · ·, ·)] where we use dots in the usual ANOVA sense for averaging. But in the case where the Z’s are treatments, this is just the effect of treatment 1 at level z. Continuing, U (2) ({1, 2}) (z1 , z2 ) is just the first order interaction between Z1 and Z2 at levels z1 and z2 . So in this context, the Hoeffding expansion is the classical ANOVA expansion mentioned in terms of the corresponding variance decomposition in Example 6.1.3 and discussed extensively in books such as Box and Hunter (1978). Summary. In Section 7.3.1 we consider a generalization of Taylor expansion due to von Mises. The limiting behaviour of statistics whose first order Gâteaux derivative vanishes is related to the Brownian bridge. Finally in Section 7.3.2 we introduce another type of expansion due to Hoeffding and closely related to the analysis of variance. This is an expansion into orthogonal terms and as such enables us to develop powerful bounds used in establishing asymptotic approximations. The theory is applied to U statistics. 7.4 PROBLEMS AND COMPLEMENTS Problems for Section 7.1 1. Establish (7.1.4) using Problem 5.4.5. 2. Establish (7.1.6) using the multivariate Taylor expansion and the multivariate CLT. 3. Prove Corollary 7.1.1 from Theorem 7.1.1. Hint. Let Amj be the event where sup |Zn (s) − Zn (t)| : s, t ∈ Tmj | > ε . Then P (Uj Amj ) ≤ Σj P (Amj ). 4. Suppose T is a metric space and Zn =⇒ Z on l∞ (T ) and P [Z is continuous] = 1. Let P tn −→ t on T , tn possibly random but t a constant. Show that then Zn (tn ) =⇒ Z(t) . Section 7.4 63 Problems and Complements Hint. The map M : l∞ (T ) × T → R given by M (f t) = f (t) is continuous at all (f0 , t0 ) such that f0 is continuous at t0 . Apply Corollary 7.1.2. 5. Assuming the existence of the Wiener process W (·) on [0, 1], show that if U (t) ≡ W (t) − tW (1), then U (·) ∼ W 0 (·) where W 0 (·) is the Brownian bridge. 6. Establish FIDI convergence and tightness in Proposition 7.1.1. Hint. [nt]/n − [ns]/n = (t − s) + O(1/n). 7. Show that, if X = R, T = {All f with |f |∞ ≤ 1} and P is continuous, then 1 sup{|En (f )| : f ∈ T } = n 2 . 8. Show that in Definition 7.1.3, (ii) implies (i). 9. Prove Theorem 7.1.5. 10. Show that the convergence in (7.1.27) is almost sure. Hint. Use the Borel-Cantelli Lemma; see Appendix D.1. 11. Let W 0 (u), 0 ≤ u ≤ 1, be the Brownian bridge. R1 (a) Show that 0 W 0 (u)a(u)du has a N (0, σ 2 (a)) distribution where 2 σ (a) = 2 Z 1 Z 1 0 0 s≤t (b) If A(v) ≡ Rv 0 a(s)ds, R1 0 s(1 − t)a(s)a(t)dsdt . a(s)ds = 0, show that σ 2 (a) = Z 1 A2 (s)ds . 0 0 Hint. From Problem D.2.3, W is continuous. 12. For the Section I.2 example show that the process t → P √ b n(F (t) − Gµb,bσ2 (t)) con- P verges weakly to a Gaussian process Z(· ; µ, σ 2 ) if µ b → µ, σ b2 → σ 2 , whenever the data are generated from the conditional distribution of Zn given Zn > 0 where Zn ∼ N (µn , σn2 ) and µn → µ, σn2 → σ 2 . 13. Suppose U, V, W1 , W2 , . . . are i.i.d. uniform [0, 1] and set Z(t) = U + V t, Zn (t) = Z(t) + n−1 rn X i=1 1[Wi ≤ t], t ∈ [0, 1] where the integer rn satisfies (rn /n) → 0 as n → ∞. Show that F IDI (a) Zn → Z. F IDI Hint. Compute |Zn − Z| and show that Zn − Z → 0. Apply Slutsky’s theorem. 64 Tools for Asymptotic Analysis Chapter 7 (b) Zn ⇒ Z. Hint. Zn (s) − Zn (t) = (s − t)V + n−1 Introduce Tmj rn X i=1 {1[Wi ≤ s] − 1[Wi ≤ t]} . j−1 j = , j = 1, . . . , km , km km −1 and show that |Zn (s) − Zn (t)| is bounded on Tmj by km + rn /n. 14. Let U, V, W1 , W2 , . . . be i.i.d. uniform [0, 1], let Φ be the N (0, 1)df , and set rn X t − Wi t−U Φ , Zn (t) = Z(t) + n−1 . Z(t) = Φ V +1 V +1 i=1 Show that if rn /n → 0 as n → ∞, then F IDI (a) Zn (·) −→ Z(·), (b) Zn (·) ⇒ Z(·). 15. If X1 , . . . , Xn are i.i.d. with continuous df F and empirical df Fb, then, in l∞ (R), √ n Fb (·) − F (·) =⇒ W 0 F (·). Hint. Show that for any df F (·) if U1 , . . . , Un are i.i.d. U(0, 1) then F −1 (U1 ), . . . , F −1 (Un ) are i.i.d. F , where F −1 (t) = inf{x : F (x) ≥ t}. 16. Establish (7.1.15). 17. Establish Theorem 7.1.5 by extending the proof of Theorem 7.1.4. 18. (a) Show there exists a constant α > 0 (α ≈ 6.308), such that for every real random variable X and every h > 0, Z h α −1 P (|X| ≥ h ) ≤ {1 − φX (t)} dt , 2h −h where φX (t) = E{exp(itX)} denotes the characteristic function for X. Remark: In this problem, we do not have the assumption E(|X|) < ∞. Hint. You can use (without proving) the inequality, sin(x) ≤ sin(1), which holds for x any |x| ≥ 1. (b) Let X1 , X2 , . . . be a sequence of real random variables. Assume that φXn converges pointwise to φ, as n → ∞, where φ is continuous (but not necessarily a characteristic function). Show that Xn = OP (1). 19. Show the following (a) For δ > 0, Z ∞ 2 2 δe−δ /2 e−δ /2 −x2 /2 ≤ e dx ≤ . 1 + δ2 δ δ Hint. You may find integration by parts helpful for the first inequality. Section 7.4 65 Problems and Complements (b) For X1 , . . . , Xn independent and identically distributed N (0, 1) random variables, p (i) P max |Xi | > 2 log(n) → 0, as n → ∞, 1≤i≤n p p (ii) log(n)P max |Xi | > 2 log(n) → 1≤i≤n (iii) nc/2−1 √1 , π as n → ∞, p p log(n)P max |Xi | > c log(n) → 1≤i≤n √ √2 , cπ as n → ∞. 20. Let {Zn } be a sequence of independent and identically distributed N (0, 1) random variables. Prove that there is a random variable X such that P (X < ∞) = 1 and p |Zn | ≤ X log(n), for all n ≥ 2 . 21. Let ε1 , . . . , εn be independent and identically distributed random variables with P (ε1 = 1) = P (ε1 = −1) = 1/2. Pn (a) Let a1 , . . . , an be real numbers and let kak2 = i=1 a2i . Prove that for any x > 0, ! n X x2 εi ai > x ≤ 2 exp − P . 2k2ak2 i=1 Hint: Show that for any real λ, E exp(λε1 ) ≤ exp(λ2 /2). (b) Let Z1 , . . . , Zn be independent and identically distributed random variables, independent of ε1 , . . . , εn such that 0 < EZ12 = τ < ∞. Prove that for any x > 0, ! n x2 X √ lim sup P εi Zi > nx given Z1 , . . . , Zn ≤ 2 exp − 2τ n→∞ i=1 with probability one. Problems for Section 7.2 1. Show that (7.2.5) holds if the map, Q → ψ(Q, P ), is continuous for weak convergence, i.e. Qm =⇒ Q implies that ψ(Qm , P ) → ψ(Q, P ) . Hint. Use the result of Section 5.4.1. 2. The sensitivity curve ν(Pb) is defined by SCn (x) = n(ν(Pbn−1,x ) − ν(Pbn−1 )) where Pbn−1 is the empirical distribution of X1 , . . . , Xn−1 , and 1 n−1 b Pn−1 + δx . Pbn−1,x = n n Suppose that the Gâteaux derivative of ν(·) is well defined in every direction and (7.2.2) holds at all P ∈ M. 66 Tools for Asymptotic Analysis Chapter 7 (a) Show that if νb = ν(Pb) has influence function ψ, then Z n1 SCn (x : νb) = n ψ(x, (1 − t)Pbn−1 + tδx ) + ψ Pbn−1 , (1 − t)Pbn−1 + tδx dt . 0 L (b) Deduce that if Pm → P implies that ψ(x, Pm ) → ψ(x, P ), then, as n → ∞, P SCn (x, νb) → ψ(x, P ) . 3. Show that ψ(x, P ) as given by (7.2.10) satisfies (7.2.5). 4. Let (X1 , Y1 ), . . . , (Xn , Yn ) be i.i.d. as (X, Y ) ∼ P , where 0 < E(X 4 ), E(Y 4 ) < ∞. Let ν1 (P ) = Cov(X, Y ) and let ν2 (P ) = ν12 (P )/Var(X)Var(Y ) be the squared correlation coefficient. Compute the influence function of (a)ν1 (P ), (b)ν2 (P ). Hint. Use (7.2.10) and Example 5.3.6. 5. In the Hardy-Weinberg model where three categories have frequencies p1 = θ2 , p2 = 2θ(1 − θ), p3 = (1 − θ)2 , 0 < θ < 1 √ √ (Examples 2.1.4, 2.2.6) we can write the parameter θ as ν1 (P ) = p1 , ν2 (P ) = 1 − p3 , 1 or ν3 (P ) = p1 + 2 p2 . Use Example 7.2.1 to compute the influence functions ψ1 , ψ2 , and ψ3 of ν1 (Pb ), ν2 (Pb ), and ν3 (Pb). Which ψj has the smaller value of the asymptotic variance EP ψj2 (X, P )? 6.(a) Show that if ψα is given by (7.2.13), then the definition of ν(Fb) in Problem 5.4(a) is equivalent to (7.2.14). Hint. Eψα (X, θ) = α[1 − Fb− (θ)] − (1 − α)Fb− (θ). R (b) Show that for ρ and x bα as defined in Example 7.2.3, x bα = arg minθ ρ(x, θ)dFb(x). 7. Show that να (F ) minimizes R ρ(x, θ)dF (x) for ρ and να (F ) as in Example 7.2.3. 8. Establish (7.2.16) under appropriate assumptions. Hint. Use the assumptions and method of Problem 5.4.1 to show that √ ( n(b xα − F −1 (α)), 1 n− 2 n X ψα (Xi , F )) i=1 have an asymptotically joint Gaussian distribution concentrating in the plane on the diagonal {(u, v) ∈ R2 : u = v}, which establishes the claim. 9. (a) Establish (7.2.17). (b) Give the details needed to establish the asymptotic normality of the Cramér-von Mises statistics if the true F 6= F0 using Theorem 7.2.1. 10. Consistency for minimum distance estimates. Let d be a suitable metric and {Pθ : θ ∈ Θ compact ⊂ Rp } be a regular parametric model such that θ is identifiable and b = arg min d(Pb, P ). θ → d(P, Pθ ) is continuous for fixed P . Let θ θ Section 7.4 Problems and Complements 67 b is consistent. (a) Show that θ 1 Hint. If θ 0 is true, d(Pb , Pθ0 ) = OP (n− 2 ). But d(Pθ 0 , P b ) ≤ d(Pθ 0 , Pb ) + d(P b , Pb) ≤ 2d(Pθ 0 , Pb) θ θ and 1-1 continuous maps on compacts have continuous inverses. √ (b) Show that if the conditions of Theorem 7.2.1 are satisfied, then θb is n consistent in √ b the sense that n(θ − θ) = OP (1). 11. In Theorem 7.2.2, show that if θ(Q) is defined as the unique solution of Z φ(x, θ(Q))dQ(x) = 0 for all Q in a convex neighbourhood of P , and A3’ holds, then the influence function of θ(P ) is indeed ψ(x, P ) = [H −1 (P )]φ(x, θ(P )) . R R 12. (a) In Example 7.2.5 show that if |x|r dP (x) < ∞, then |x − θ|r dP (x) takes its infimum over θ for r ≥ 1 and the minimizing θ(P ) is unique if r > 1. R r (b) Show that θ(P ) = arg min (|x − θ| − |x|r )dP (x) which is defined iff R r−1 |x − θ| dP (x) R < ∞. R Hint. If |θ| → ∞, |x − θ|r dP (x) → ∞ and θ → |x − θ|r dP (x) is strictly convex for r > 1. (c) Show that θ(P ) is uniquely defined even if r = 1 if P has a positive density. Hint. (i) Prove the result for d = 1 where θ(P ) is just the median. (ii) If the minimizer is not unique, then by convexity, there exists θ0 , ∆ 6= 0 such that all θ0 + ε∆ are minimizers 0 ≤ ε ≤ 1. Rotate and scale so that ∆ = (1, 0, . . . , 0). 13. (a) Show that Ψ(·, P ) ∈ L2 (P ) and M (P ) given by (7.2.26) is well defined if r > 1 and EP |X|2r−2 < ∞ or r = 1 and P has a positive density. Hint. For r = 1, use spherical polar coordinates. (b) Check the conditions of Theorem 7.2.2 for the multivariate medians. 14. Show that A4’ in Theorem 7.2.2 is implied by: A4”: sup{ |ψ(x, θ) − ψ(x, θ 0 )| : |θ − θ0 | ≤ γ]} ≤ V (x, θ 0 )γ α , α > 0 for all θ0 , γ > 0 where EP V 2 (X, θ) < ∞ for all |θ − θ(P )| ≤ ε. Hint. Consider a box {θ : |θ − θ(P )|∞ ≤ ε} and break it up into a grid of (ε/δ)p boxes of side δ. If θ(1) , . . . , θ(K) , where K = 2p are the vertices of such a box B, then for θ in this box, ψ(x, θ) ≥ f where f is defined as min{ψ1 (x, θ) − ψ(x, θ(P )) : θ in B} and similarly for f . 15. Apply Problem 7.2.14 to Example 7.2.4 to justify A4 for ψ(x, θ) ≡ (x − θ) |x − θ|r−1 for 1 ≤ r ≤ 2, p ≥ 2, by showing that |ψ(x, θ) − ψ(x, θ0 ) | ≤ |θ − θ0 | |x − θ|r−2 . 68 Tools for Asymptotic Analysis Chapter 7 16. (a) Establish (7.2.32). Hint. F −1 F (t) = t, t ∈ R. (b) Show that (7.2.40) follows from (7.2.39). 17. Derive (7.2.43) from (7.2.41). 18. Let Fθ = F0 (· − θ) be a location parameter family. Let θb be the minimum distance estimate Z ∞ θb = arg min (Fb(x) − Fθ (x))2 dF0 (x) . θ −∞ b (a) Compute formally the influence function of θ. R∞ (b) Show that the expansion (7.2.1) is valid if −∞ |f ′ (x)|dx < ∞. Hint. Without loss of generality, take θ0 = 0. Note that θ(F ) solves Z (F0 (x − θ) − F (x))f0 (x − θ)dF0 (x) = 0 . R∞ 2 (c) Set d(F, Fθ ) = −∞ F (x) − F0 (x − θ) dF0 (x). Show that θb is consistent using Problem 7.2.10 and ∂2d (F0 , F0 ) 6= 0 . ∂θ2 ∂d (F0 , F0 ) = 0 , ∂θ (d) Let θb = arg min sup |Fb (x) − Fθ (x)| . θ x √ Show that n(θb − θ) is not asymptotically normal and hence does not have an influence function. √ 1 Hint. Consider θ with |θ| = O(n− 2 ) and Zn (x; θ) = n Fb(x) − Fθ (x) = √ b √ n F (x) − F0 (x) + f0 (x) nθ + Op (1). Apply Donsker’s theorem. P 1 19 (a) Show that in (7.2.1), n−1 ni=1 ψ(Xi , P ) = OP (n− 2 ). (b) Show that ψ is unique if the approximation is valid. That is, show that if Pn 1 ν(Pb) = ν(P ) + n1 i=1 ψ1 (Xi , F ) + op (n− 2 ) Pn 1 = ν(P ) + n1 i=1 ψ2 (Xi , F ) + op (n− 2 ) with EP ψj (X) = 0, EP ψj2 (X) < ∞, j = 1, 2, then P [ψ1 (X) = ψ2 (X)] = 1. 1 Pn Hint. Use the CLT and n− 2 i=1 ∆(Xi ) = oP (1) for ∆ ∈ L2 (P ) iff ∆ = 0. 20. Let X(1) < . . . < X(n) be the order statistics of X1 , . . . , Xn i.i.d. as X ∼ F where F is continuous. Show that if E|X|ε < ∞ for ε > 0, then for all r > 0 there exist Cr > 0 and n(r, k) such that for n ≥ n(r, k) E|X(k) |r ≤ Cr if α ≤ k ≤ 1 − α, n 0<α< 1 . 2 Section 7.4 69 Problems and Complements 21. Show that if f = F ′ > 0 and continuous, then, if n Cov(X(k) , X(l) ) → k n → s, s(1 − t) l n f (F −1 (s))f (F −1 (t)) → t, ; s≤t uniformly for α ≤ nk , nl ≤ 1 − α. Hint. You may use the fact that if Un =⇒ U and E|Un |r+1 ≤ C, all n, then EUnr → EU r uniformly. If Un (·) =⇒ U (·), supt,n E|Un (t)|r+1 ≤ C, then supt |EUnr (t) − EU r (t)| → 0. Use Problem 7.1.1.4 and the quantile process convergence. 22. If (U, V ) are such that E(U 2 + V 2 ) < ∞ and E(V | U = u) is nondecreasing in u, then Cov(U, V ) ≥ 0. Hint. Cov(U, V ) = Cov(U, E(V |U )) = 12 E(U − U ′ )(E(V |U )− E(V ′ |U ′ )) where (U, V ) and (U ′ , V ′ ) are i.i.d.. (See below A.11.14.) 23. Show that under the conditions of Problem 21, we always have Cov(X(k) , X(l) ) ≥ 0 . 24. Suppose that λ = Λ is as in Example 7.2.5, save that we no longer require λ(t) = 0 for t < ε and t > 1 − ε but only that λ(t) ≤ C < ∞ and EX 2 < ∞. Show that the conclusion (7.2.39) still holds. (Bickel (1967)). Hint. Argue that R 1−ε √ n(Fb−1 (t) − F −1 (t))λ(t)dt =⇒ N (0, σε2 (Λ)). (i) ε (ii) R1 1−ε √ b −1 n(F (s) − F −1 (s))λ(s)ds)2 → 0 as ε → 0 if λ ≡ 1 and Problem 7.2.20 ensures that we can pass to the general case. 25. Consider the nonparametric regression model Y = µ(X) + e, where µ(·) is unknown, E(e) = 0, Var(e) = σ 2 , and X and e are independent. The nonparametric correlation is defined as η 2 = CORR2 (µ(X), Y ). Assume η 2 exists. E µ2 (X) −µ2Y (a) Show that η 2 = . Var(Y ) R (b) Let ν(P ) = E[µ2 (X)] = µ2 (x)f (x)dx where X ∼ F , f = F ′ is assumed to exist. Informally, show that the influence function is µ2 (x) + 2µ(x)e by computing the Gâteaux derivative and evaluating it at Q = δ(x,y) . Note: For the model (1 − ε)P + εQ, both f (x) and µ(x) depend on ε. (c) Find the influence function of η 2 . (d) Evaluate your answer to (c) when µ(X) = α + bX. (e) Compare your answer to (d) with the influence function of the Pearson squared correlation coefficient ρ2 when µ(X) = α + βX. 70 Tools for Asymptotic Analysis Chapter 7 Hint. See Problem 9.2.4. Problems for Section 7.3 1. Verify (7.3.2) when X is finite. 2. Show that (7.3.2) and (7.3.5) imply (7.3.3) and (7.3.6) and that ψ is symmetric in x1 , . . . , xn . 3. (a) Verify that if h in Theorem 5.4.1 has continuous derivatives order of k, then θ(P ) ≡ h(p) can be represented as in (7.3.1). (b) Identify the ψ(·, . . . , P ) functions in terms of the derivatives of h and expectations of these. (c) Verify (7.3.2) in this case. 4. A U statistics of order 2 is called degenerate iff U= 1 X u(Xi , Xj ) n 2 i<j with E{u(X1 , X2 )|X1 } = 0 and hence Eu(X1 , X2 ) = 0. Suppose u(x, x) = 0, Eu2 (X1 , X2 ) < ∞. Show that R (a) nU = u(x, y)dEn (x)dEn (y). (b) If L(nU ) → L0 for some probability law L0 , what form would you expect L0 to have? (c) Is the result you conjecture a consequence of weak convergence of En (·)? 5. Establishing weak convergence for degenerate U statistics. Suppose φ(x, y) in (7.3.2) is symmetric. Then there exist real eigenvalues λj , j ≥ 1, and eigenfunctions ψj (x) of the kernel K(x, y) = u(x, y), such that Z ψj (y)K(x, y)dF (y) = λj ψj (x), K(x, y) = ∞ X λj ψj (x)ψj (y) j=1 and Z ψi (x)ψj (x)dF (x) = δij ≡ 1[i = j]. (a) Assuming the preceding, show that Z u(x, y)dEn (x)dEn (y) = = ∞ X j=1 ∞ X λj Z ψj (x)ψj (y)dEn (x)dEn (y) Z λj ( ψj (x)dEn (x))2 . j=1 | {z } Zjn Section 7.4 71 Problems and Complements (b) Show that Cov(Zin Zjn ) = δij and hence, (Z1n , . . . , Zkn ) =⇒ (Z1 , . . . , Zk ), Zj i.i.d. N (0, 1) . P∞ P∞ 2 j=1 λj Zj . j=1 |λj | < ∞, then nU =⇒ R Pk Pk 2 Hint. j=1 λj ψj (x)ψj (y)dEn (x)dEn (y) =⇒ j=1 λj Zj and (c) If E = E Z (U (x, y) − ∞ X j=k+1 k X j=1 2 2 λj ψj (x)ψj (y)dEn (x)dEn (y) 2 Zjn λj = Var ∞ X j=k+1 2 Zjn λj + ∞ X j=k+1 2 λj . 6. Let U ∼ Np (0, 1) and let Mp×p be symmetric. Show that Pp 2 (a) UT M U ∼ j=1 λj Zj where the Zj are i.i.d. N (0, 1) and λ1 , . . . , λk are the eigenvalues of M . That is, M = P ΛP T (7.4.1) where Λ = diag(λ1 , . . . , λp ), and P is orthonormal. (You may assume (7.4.1), a generalization of the principal axis theorem. (B.10.1.1)) (b) Show that (7.4.1) is equivalent to the existence of unique v1 , . . . , vp orthonormal such that M vj = λj vj , vjT M = λj vjT , 1 ≤ j ≤ p. 7. Show that Un defined in (7.3.26) is the UMVU estimate of EP φ(X1 , . . . , Xm ). 8. Establish (7.3.20) rigorously. 9. Establish the conclusion of Remark 7.3.1 for Vn . 10. Establish (7.3.32). 11. statistic W is asymptotically normal with mean R (a) Show that the one-sample Wilcoxon (1 − F (−x))dF (x) and variance σ 2 (F )/n, where Z σ 2 (F ) = (1 − F (−x))2 dF (x) − µ2 (F ) and F (x) = P [X ≤ x]. (b) Show that if F (x) is continuous and F (x) = 1 − F (−x) for all x, then µ(F ) = 1/2 and σ 2 (F ) = 1/12. f ≡ n−2 P 1(Xi + Xj > 0). 12. Suppose W i,j 72 Tools for Asymptotic Analysis Chapter 7 f is a von Mises statistic and that under the assumption of Problem 11 (a) Show that W f (b), W = W + Op (n−1/2 ), where W is the one-sample Wilcoxon statistic. (b) Show that the identity of (a) fails if F is not continuous. 13. Develop a multivariate central limit theorem for U statistics. We state it for U statistics of order 2 but the generalization is obvious. Let ϕ = (ϕ1 , . . . , ϕk ) be a function from R X × X → Rk such that ϕ2j (x, y)dF (x)dF (y) < ∞, for all j, and ϕj are symmetric, 1 ≤ j ≤ k. P (a) Show that U n = n2 i<j ϕ(Xi , Xj ) is asymptotically normal with expectation Eϕ (X1 , X2 ) and variance 2Var(E[ϕ(X1 , X2 )|X1 ]). (b) Extend this result to U statistics of order m1 , m2 , . . . , mk . 14. Show that even if P [Xi = Xj ] > 0 for i 6= j nevertheless the von Mises statistic X ϕ(Xi , Xj ) n−2 i,j is asymptotically normal if Eϕ2 (X1 , X2 ) < ∞ and Eϕ2 (X1 , X1 ) < ∞. Hint. Use Problem 13. f in Problem 12 generally. 15. Apply Problem 14 to derive the asymptotic distribution of W 16. Derive the results parallel to Problems 11, 12, and 15 for Kendall’s τ . 17. Establish (7.3.31). 7.5 Notes Notes for Section 7.1 (1) It may be shown that if T is Euclidean, given the FIDIS of {X(t) : t ∈ T }, it is always possible to choose X(t) separable and X(·, ·) defined by X(u, v) = X(v) − X(u) separable — see Doob (1953), pp. 46–47. Unfortunately the empirical processes in general, can be defined on sets of functions T , which may not be identified with a subset of Euclidean, or even a separable metric space. The only recourse is then to go to so called “outer probabilities” — see van der Vaart and Wellner (1996) and Pollard (1990). However, in all cases of interest that we treat, separability of X(·) and X(·, ·) (and measurability) are more or less evident, so we shall ignore this technicality. Chapter 8 DISTRIBUTION-FREE, UNBIASED, AND EQUIVARIANT PROCEDURES 8.1 Introduction In this chapter we leave asymptotic theory and elaborate on some general inference principles initiated in Chapter 1, Sections 1.2 and 1.3 and Chapter 3, Sections 3.3 and 3.4. The results will guide the construction of procedures with optimal asymptotic properties for models much more general than those of this chapter. Recent work relating to the topics in this chapter deals with optimal inference after data based model selection. See Fithian, Sun and Taylor (2014). One motivating question is, when can we construct a test of composite hypotheses whose power function, probability of Type I error, is constant on the hypothesis and, if so, how can we do it? In examining answers to this question we will establish results important for answering other decision theoretic questions having to do with estimation as well. The first answer, elaborated on in Section 8.2, is based on the concept of a complete sufficient statistic when the null hypothesis is our model. Such a statistic, which, if it exists, must be minimal sufficient, can be conditioned on, and the resulting tests with conditional Type I error probability equal to α have the desired property. An important example discussed is permutation tests. It turns out that the notion of completeness is also what’s needed to construct uniformly minimum variance unbiased (UMVU) estimates in situations where the information bound for unbiased estimates is not achievable. This is a much more powerful method than the one discussed in Section 3.4. The second answer is based on the notion that some models exhibit symmetries which, for loss functions having corresponding symmetries, lead us to consider decision procedures which are also symmetric. That is, there is a group acting on the model which is isomorphic to a group acting on the action space A which in turn is isomorphic to a group acting on decision procedures. For instance in testing H : θ = 0 vs K : θ 6= 0 when we observe X1 , . . . , Xn i.i.d. with density f (x − θ) and f is symmetric about 0 it seems reasonable to restrict consideration to tests δ such that δ(X1 , . . . , Xn ) = δ(−X1 , . . . , −Xn ). The argument, much elaborated in Section 8.3, is that if H holds for X1 , . . . , Xn it does so for −X1 , . . . , −Xn as well and if it doesn’t then −X1 , . . . , −Xn are symmetrically distributed about −θ 6= 0. So, it seems reasonable that X1 , . . . , Xn and −X1 , . . . , −Xn 73 74 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 should lead to the same action. Note, however, that, implicitly, we are postulating a problem of a malevolent nature which might flip the signs on the Xi we see to confuse us. This suggests via Section 3.3 a connection to the minimax principle which we shall explore. How do symmetries relate to our initial question? Symmetry of the procedure leads to critical regions with constant probability of Type I error if θ = 0. If in the problem we just discussed we specify that f (x) = (1/σ)f0 (x/σ), with f0 known and σ unknown, then symmetry arguments lead us to require δ(aX1 , . . . , aXn ) = δ(X1 , . . . , Xn ), a 6= 0. We pursue this argument to show that such tests, which are called invariant with respect to multiplication, have constant probability of Type I error in σ if θ = 0. These notions can be extended to requiring properties for estimators δ of θ, such as δ(X1 + c, . . . , Xn + c) = δ(X1 , . . . , Xn ) + c. (8.1.1) Examples of estimators satisfying (8.1.1) are the mean and the median. See Problem 3.5.6. Such estimates are called equivariant with respect to addition. Such notions turn out to be closely linked to minimaxity so that we can find minimax procedures by restricting to such procedures to begin with. Although these ideas are initially applicable only in restricted classes of models, their applicability where the X i are i.i.d. N (µ, Σ0 ) with Σ0 known can be used to establish asymptotic versions of these results for all smooth, regular i.i.d. parametric models. 8.2 Similarity and Completeness We have seen that in the case of the t test under the N (µ, σ 2 ) model, the t-statistic has the same t-distribution for all values of σ 2 . In this section, we shall study a key property that a number of models have which permits such a reduction. This reduction is based on a set of ideas introduced and developed by Lehmann and Scheffé (1950, 1955) that have another interesting application we shall discuss. Using these notions, it is possible to essentially characterize situations in which we can expect a UMVU estimate to exist and give a constructive method to obtain these procedures. As we shall see, this approach is far more powerful than that of Section 3.4. We develop the concept of completeness for a model we denote by P0 . In the testing context, P0 is the class of probabilities specified by the hypothesis. In the estimation context, it is typically the class P of all probabilities under consideration. 8.2.1 Testing We consider the general testing problem in the Neyman–Pearson framework: we observe X ∈ X , X ∼ P ∈ P, with hypothesis H : P ∈P0 , P0 ⊂ P and alternative K : P ∈ / P0 . Recall that a test φ has level α if EP φ(X) ≤ α for all P ∈ P0 and φ has size α if sup{EP φ(X) : P ∈ P0 } = α. Definition 8.2.1. A (randomized) level α test φ is similar(1) level α if EP φ(X) = α for all P ∈ P0 . Section 8.2 Similarity and Completeness 75 Evidently, a similar test has size α as well as level α. A statistic T (X) is distributionfree if P (T (X) ≤ t) is the same for all P ∈ P0 . For such T(X), tests of the form ϕ(X) = 1(T (X) ∈ A) are similar. Similar tests other than φ(X) ≡ α may not exist, and may exist only for some α (Problem 8.2.19). There is one very important general situation in which they may readily be constructed: when a complete, sufficient statistic exists. Similarity is a property of a test and a model P0 . Completeness is a property of a statistic T (X) and the model obtained when X ∼ P, P ∈ P0 . We consider T (X), and the model Q0 = {P T −1 ; P ∈ P0 }, the set of distributions P T (X) ≤ t of T (X) as P ranges over P0 . Definition 8.2.2. The model Q0 = {P T −1 ; P ∈ P0 } and the statistic T are complete for Q0 and P0 , respectively, if, for any function v(T ), the system of equations, EQ v(T ) = 0, for all Q ∈ Q0 , or, equivalently, implies that (8.2.1) EP v T (X) = 0, for all P ∈ P0 , Q[v(T ) = 0] = P [v T (X) = 0] = 1 for all Q ∈ Q0 , respectively, P ∈ P0 . That is, the system (8.2.1) has only the trivial solution v = 0. The main utility of completeness in testing comes from the following theorem. Theorem 8.2.1. Let T (X) be a complete, sufficient statistic for P0 . Let φ be a randomized test of H. Then, φ is similar level α iff (8.2.2) EP φ(X)|T (X) = α with P probability 1 for all P ∈ P0 . Property (8.2.2) is sometimes called Neyman structure. Proof. Evidently, by the iterated expectation theorem, B.1.20, (8.2.2) implies similarity. On the other hand, suppose that, for P ∈ P0 , EP φ(X) = EP EP φ(X)|T (X) = α (8.2.3) where, by sufficiency, EP φ(X)|T (X) is a function of T (X) not depending on P so that the subscript P on EP φ(X)|T (X) can be dropped. That is, E φ(X)|T (X) is a statistic. Rewrite (8.2.3) as (8.2.4) EP E φ(X)|T (X) − α = 0 for all P ∈ P0 . Completeness leads to (8.2.2). The following simple but useful result follows from (8.2.3). ✷ 76 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 Proposition 8.2.1. If T is sufficient for P0 and if EP φ(X)|T ≤ α for P ∈ P0 , then φ is level α. Completeness and sufficiency are properties of a statistic that push in opposite directions. For instance, the trivial sufficient statistic T (X) = X is not complete unless P0 contains all point masses. On the other hand, if T (X) is a constant, then T (X) is complete but sufficient only if P0 = {P0 }. To see if the concept of completeness is useful, we need nontrivial examples of complete, sufficient statistics. Example 8.2.1. Bernoulli trials. Suppose X1P , . . . , Xn are i.i.d. as X where Pθ [X = 1] = θ = 1 − Pθ [X = 0], 0 < θ < 1. Then, T ≡ ni=1 Xi is sufficient for θ. We claim that T is complete. To see this, write (8.2.1) as n X n k v(k) θ (1 − θ)n−k = 0, 0 < θ < 1 . k k=0 or, equivalently, with ρ = θ/(1 − θ), n X k=0 n k v(k) ρ = 0, k (8.2.5) for all ρ > 0. But (8.2.5) evidently requires v(k) = 0, 0 ≤ k ≤ n, since any polynomial of degree n has at most n roots. ✷ Not surprisingly, Example 8.2.1 is a special case of a general result for exponential families. Theorem 8.2.2. Suppose p(x; θ), θ ∈ Θ ⊂ Rk , is an exponential family of rank k with T sufficient statistic T(X) ≡ T1 (X), . . . , Tk (X) , p(x; θ) = exp k nX j=1 o ηj (θ)Tj (x) − A η(θ) h(x) , and η(Θ) has a nonempty interior. Then, T(X) is complete as well as sufficient. Proof. The system of equations (8.2.1) implies that Z Rk h(t) exp k nX j=1 o ηj tj − A(η) v(t) dt = 0 for all η belonging to an open set, or, equivalently, Z Rk h(t) exp k nX j=1 o ηj tj v(t) dt = 0 (8.2.6) for all η in an open set. Now (8.2.6) implies that h(t)v(t) = 0 by a classical theorem on multivariate Laplace transforms (Widder (1941)). ✷ Section 8.2 77 Similarity and Completeness We gave the proof of this result for the continuous case but it applies equally well in the discrete case or for any exponential family dominated by some measure µ. Here are some immediate consequences of Theorem 8.2.2. (i) If X1 , . . . , Xn are i.i.d. N (µ, σ 2 ) with µ, σ 2 both unknown, then (X̄, σ b2 )T is com2 plete and sufficient. If σ is known, X̄ is complete, sufficient. If µ is known, P (Xi − µ)2 is complete, sufficient. Pn (ii) If X1 , . . . , Xn are i.i.d. Poisson (λ), λ > 0, i=1 Xi is complete, sufficient. Completeness is not limited to exponential families. Example 8.2.2. Completeness in the U(0, θ) family, θ > 0. If X1 , . . . , Xn are i.i.d. U(0, θ), then Mn ≡ maxi {Xi } is sufficient. Note that P (Mn ≤ t) = P (all Xi ≤ t) = (t/θ)n for all 0 ≤ t < θ so that (8.2.1) becomes 1 θ Rθ Z 0 θ v(t)n t n−1 dt = nθ−n θ Z θ v(t)tn−1 dt = 0 0 or, equivalently, 0 v(t)tn−1 dt = 0 for all θ > 0. Differentiating with respect to θ we obtain v(θ)θn−1 = 0 =⇒ v(θ) = 0 and completeness is proved. ✷ Example 8.2.3. Completeness of the order statistics. Here is an example which will prove important for both testing and estimation. Let X1 , . . . , Xn be i.i.d. f ∈ F where F is the set of all continuous case densities. Then, X(1) < . . . < X(n) , the order statistics, are sufficient (Problem 1.5.8). We next argue that they are complete. Now, v(x1 , . . . , xn ) defined for x1 < x2 < · · · < xn can be extended to all of Rn by symmetry v(xi1 , . . . , xin ) ≡ h(x1 , . . . , xn ) for all permutations (i1 , . . . , in ). So (8.2.1) becomes Z Z . . . v(x1 , . . . , xn )f (x1 ) . . . f (xn )dx1 , . . . , dxn = 0 (8.2.7) for all f ∈ F . Let g1 , . . . , gn be arbitrary densities in F so that Pn αj ≥ 0, for all j, j=1 αj = 1. Then (8.2.7) implies G(α1 , . . . , αn , g1 , . . . , gn ) ≡ Z ... Z v(x1 , . . . , xn ) i=1 for all α, (g1 , . . . , gn )T as above. Write αj = wj / Simplifying further, G(w1 , . . . , wn , g1 , . . . , gn ) = Z ... n n X Y Z Pn k=1 j=1 Pn j=1 αj gj ∈ F if αj gj (xi ) dxi = 0 wk where wk ≥ 0, 1 ≤ k ≤ n. v(x1 , . . . , xn ) n n X Y i=1 j=1 wj gj (xi ) dxi = 0 . (8.2.8) 78 Distribution-Free, Unbiased, and Equivariant Procedures The coefficient of Qn j=1 Chapter 8 wj is ∂nG (w1 , . . . , wn , g1 , . . . , gn ) ∂w1 . . . ∂wn w=0 = Z ... Z v(x1 , . . . , xn ) n Y gi (xi ) dxi . i=1 (8.2.9) Put gj (x) = 1(xεAj )/|Aj |, the uniform density on Aj , where |A| is the length of A and A1 , . . . , An are disjoint arbitrary intervals. Then (8.2.9) becomes Z Z v(x1 , . . . , xn ) dx1 , . . . , dxn = 0 (8.2.10) ,..., A1 An for all A1 , . . . , An and (8.2.10) implies that v = 0. ✷ The order statistics are also complete for the case where Xi ∼ F with F continuous. See Bell, Blackwell and Breiman (1960). Let us now apply completeness to obtain distribution-free tests in a number of important examples. Testing hypotheses in k parameter exponential families Suppose X ∼ {Pθ : θ ∈ Θ ⊂ Rk }, a canonical k parameter exponential family, with k ≥ 2 and Θ open, with density function, p(x; θ) = exp k nX j=1 o θj Tj (x) − A(θ) h(x) . We want to test H : θ1 = θ10 where θ10 is a specified value, that is, θ ∈ Θ0 = {θ : θ1 = θ10 , θ ∈ Θ}. Thus, H is Pθ ∈ P0 where P0 is the k − 1 parameter exponential family with densities q(x; θ2 , . . . , θk ) = exp k nX j=2 o θj Tj (x) − B(θ2 , . . . , θk ) h0 (x) where h0 (x) = exp[θ10 T1 (x)]h(x). By Theorem 8.2.2, (T2 (X), . . . , Tk (X))T is comT plete and sufficient for P0 . Thus, if φ(T), where T = T1 (X), . . . , Tk (X) , is any test such that Eθ φ(T) = α, θ ∈ Θ0 , we must have Eθ φ(T) (T2 , . . . , Tk ) = α . (8.2.11) Computing the conditional expectation in (8.2.11) is straightforward since the conditional distribution of T given Tj = tj , j ≥ 2 is determined by that of T1 given Tj = tj , j ≥ 2. By the k dimensional version of Theorem 1.6.1, (T1 , . . . , Tk ) has density of the form p(t1 , . . . , tk ; θ) = exp{θT t − A1 (θ)}h1 (t) . Section 8.2 79 Similarity and Completeness Here (Problem 8.2.20), the conditional distribution of T1 given T[2,k] ≡ (T2 , . . . , Tk )T has density of the form p(t) = exp{θ10 t − B0 (θ10 , t2 , . . . , tk )}h0 (t2 , . . . , tk ) so that (8.2.11) is just Z φ(t, t2 , . . . , tk )p(t)dt = α . (8.2.12) (8.2.13) The usual analogous formulae apply for the discrete case. Conversely, suppose we start with any test statistic S(T) and φ(T) = 1 if S(T) > s = 0 if S(T) < s . In general, we can not choose s so that Eθ φ(T) ≤ α for all θ ∈ Θ0 . However, consider the randomized test, φ∗ (T) = 1, = γ, = 0, S(T) > s(T2 , . . . , Tk ) S(T) = s(T2 , . . . , Tk ) S(T) < s(T2 , . . . , Tk ) , where s(t2 , . . . , tk ) and γ(t2 , . . . , tk ) are determined by Eθ {φ∗ (T)|T2 , . . . , Tk } = α under H. This is a useful test since the conditional distribution of S(T) given T2 = t2 , . . ., Tk = tk does not depend on θ2 , . . . , θk and it has level α by Theorem 8.2.1. By Proposition 8.2.1, if we are satisfied with Eθ φ(T ) ≤ α for all θ ∈ Θ0 , then we can dispense with randomization in φ∗ (T ). In the continuous case φ∗ (T ) is similar without randomization. Note that φ∗ (T) may not be the same as the test φ(T) we started with, but asymptotically they are usually equivalent to first order. Remark 8.2.1 We can start with any statistic S(T) to construct a similar test φ∗ (T ). What makes the test we have just obtained appropriate? The test φ∗ (T ) is “aimed” at one-sided alternatives. We shall see later in this section that the tests we have introduced do have optimality properties. ✷ Example 8.2.4. Another look at testing location and scale for the Gaussian distribution. Suppose X1 , . . . , XnPare i.i.d. N (µ, σ 2 ). This is an exponential family with T1 (X) = Pn n 2 2 2 i=1 Xi , T2 (X) = i=1 Xi , θ1 = µ/σ , θ2 = −1/2σ . We seek a test with guaranteed level α of H : µ = 0 vs K : µ > 0. This is the appropriate test if the Xi are case control differences. If σ 2 = σ02 is known, the UMP test is φ(T1 ) = 1 iff n X i=1 Xi ≥ √ nσ0 Φ−1 (1 − α) . 80 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 If we are wrong about σ0 , this test can have arbitrarily large level. We can apply the principles of this section to conclude that the test with T1 = P φ(T1 , T2 ) = 1 iff T1 ≥ c(T2 , α) Xi and Pθ [T1 ≥ c(t2 , α)|T2 = t2 ] = α (8.2.14) for all t2 has Neyman structure. In our case, the distribution of T1 |T2 = t2 can be determined from P X X ( Xi )2 2 (T1 , T2 ) = Xi , (Xi − X̄) + n P P where, by Theorem B.3.3, Xi , (Xi − X̄)2 are independent N (0, σ 2 /n), σ 2 χ2n−1 , 2 2 respectively. Here V ∼ σ X means that (V /σ 2 ) ∼ X 2 . Consider the t-statistic t= √ √1 T1 nX̄ n = 1 . 1 s { n−1 [T2 − T12 /n]} 2 For fixed T2 = t2 , t is an increasing function of T1 . Thus for each s(t2 ) and T2 = t2 , T1 ≥ s(t2 ) is equivalent to t ≥ d(t2 ) for some d(t2 ), and it is enough to find d(t2 ) such that P T1 ≥ s(t2 )|T2 = t2 = PH t ≥ d(T2 )|T2 = t2 = α . (8.2.15) Here the distribution of t does not depend on σ; thus, if bα is such that PH (t ≥ bα ) = α then by the Neyman structure result (8.2.2), for all σ > 0, PH t ≥ bα |T2 = t2 = α (8.2.16) and we can determine s(t2 ) from (8.2.15) and (8.2.16). Note that (8.2.16) holding for all α ∈ (0, 1) is equivalent to independence of t and T2 under H. This example illustrates the computation of the conditional critical value s(t2 ), and it shows connections to earlier normal model procedures. In practice, we would carry out the test as in Vol. I using the t-distribtuion of t. We can extend the argument leading to (8.2.13) for testing the hypothesis H : θ1 = θ10 to H : θS = θ 0S where θ S ≡ {θi : i ∈ S}, S ⊂ (1, . . . , k). Here is a simple application. Example 8.2.5. Gaussian goodness-of-fit. Suppose we wish to test H : F ∈ {N (µ, σ 2 ) : µ ∈ R, σ 2 > 0} versus a particular alternative with specified density f which is not Section 8.2 81 Similarity and Completeness Gaussian. If we knew µ = µ0 and σ 2 = σ02 , the Neyman-Pearson Lemma implies that the most powerful statistic would be S(X) = n X log f (Xi ) + i=1 1 X 2 µ0 X nµ20 . X − X + i i 2σ02 σ02 2σ02 ∗ Using our φ (T ) construction the similar size α test based on S does not depend on µ0 and σ02 . It is just, if n ≥ 3, φ∗ (X) = 1 n X iff i=1 = 0 where P n X i=1 otherwise n X 2 Xi − X̄ log f (Xi ) ≥ c X̄, i=1 n 2 X Xi − X̄ log f (Xi ) ≥ c X̄, Pn i=1 2 X Xi , X Xi2 = α. is uniformly distributed on the surface Since, given X̄ and i=1 (Xi −X̄) , (X1 , . . . , Xn )TP of the n sphere with center (X̄, . . . , X̄) and radius ni=1 (Xi − X̄)2 , computing involves evaluating the area of the intersection of the surface of the sphere with ) ( n X T log f (xi ) ≥ c (x1 , . . . , xn ) : i=1 which is, unfortunately, not trivial in general. Pn Note that unlike the test “Reject iff i=1 log f (Xi ) is large,” φ∗ (X) has probability of Type I error α whatever be µ and σ 2 . ✷ Here is an important and typical example where we can, in general, not obtain similarity without randomization but can guarantee level α. Example 8.2.6. Independence in contingency tables. Suppose (Xi , Yi ), 1 ≤ i ≤ n are, as in Section 6.4, i.i.d. with Xi taking values x1 , . . . , xr and Yi values y1 , . . . , ys , with arbitrary probabilities, θab = P [X = xa , Y = yb ], 0 < θab < 1 for all a, b. We can immediately reduce by sufficiency to N = {Nab : 1 ≤ a ≤ r, 1 ≤ b ≤ s} with P Nab ≡ 1(Xi = xa , Yi = yb ) which has a multinomial M(n, {θab }) distribution. The hypothesis of independence of X and Y is H : θab = θa+ θ+b , for all a, b where + indicates summation over the index. The Wald test of Section 6.3.2 for this hypothesis is based on the Pearson χ2 statistic, 2 r X s X n Nab − Na+nN+b 2 χ = , Na+ N+b a=1 b=1 see (6.4.9). Note that we can write the density function of N under the hypothesis as nX o p(N : η) = exp Nab (ηa1 + ηb2 ) a,b 82 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 where ηa1 = log θa+ , ηb2 = logP θ+b , an exponential family in canonical Pr form with canonis cal sufficient statistics, Na+ ≡ b=1 Nab , 1 ≤ a ≤ r, and N+b = a=1 Nab , 1 ≤ b ≤ s. Let N r+ = (N1+ , . . . , Nr+ ) and N +s = (N+1 , . . . , N+s ). Consider the test which rejects H iff χ2 ≥ c(N r+ , N +s ), and let c be chosen so that PH χ2 ≥ c(na+ , n+b , N r+ , N +s )|Na+ = na+ , N+b = n+b , N r+ , N +s ≤ α and thus the test has level α under H. Note that (see Problem 8.2.23), given Na+ = na+ , N+b = n+b , for all a, b, 2 χ =n ( X a.b 2 Nab −1 na+ n+b ) . (8.2.17) To implement the test, we need to compute, under H, the conditional distribution of {Nab } given N r+ = nr+ , N +s = n+s , which is multiple hypergeometric. In fact, computing the conditional null distribution of this statistic is computationally difficult if r and s are large. Exact results for low values of r and s are given in the package STATEXACT. More generally, Markov chain Monte Carlo methods such as that of Diaconis and Stumfels (1998) can be used. A particularly interesting example is Fisher’s exact test, where r = s = 2. We note that the conditional χ2 test given N1+ = n1+ , N+1 = n+1 can be based on the distribution of N11 given N1+ , N+1 which, under H, is hypergeometric (n, n1+ , n+1 ). The one-sided version of the test, rejecting H for N11 large, has optimality properties. See Lehmann and Romano (2005). ✷ Example 8.2.7. Two-sample permutation tests. Consider the nonparametric two-sample problem. We observe X1 , . . . , Xm i.i.d. F and Y1 , . . . , Yn i.i.d. G with the X’s independent of the Y ’s, and want to test H : F = G against the alternative that G tends to give higher values than F , which can be formally expressed by K : Ḡ(x) ≥ F̄ (x) for all x, which we refer to as G is stochastically larger than F — see Problems 1.1.4 and 8.3.11(d). If F and G are N (µ, σ 2 ) under H and N (µ, σ 2 ), N (µ+∆, σ 2 ) under K, we have the Gaussian two-sample problem discussed in Section 4.9.5. The test for that parametric model is to reject iff q mn m+n (Ȳ − X̄) ≥ tm+n−2 (1 − α) T ≡ s Pm Pn where s2 = (m + n − 2)−1 b)2 + i=1 (Yi − µ b)2 and µ b = λX̄ + (1 − λ)Ȳ , i=1 (Xi − µ λ ≡ m/(m + n). This test is not similar for general H : F = G. What is the similar version for this nonparametric hypothesis when F and G are continuous? Let Z1 , . . . , ZN be X1 , . . . , Xm , Y1 , . . . Yn , N = m + n, and let Z(1) , . . . , Z(N ) be the ordered values. Since (Z(1) , . . . , Z(N ) ) is a complete sufficient statistic under H by Example 8.2.3, the Section 8.2 83 Similarity and Completeness similar test is 1 if T ≥ c(Z(1) , . . . , Z(N ) ) = γ(Z(1) , . . . , Z(N ) ) if T = c(Z(1) , . . . , Z(N ) ) 0 otherwise where c and γ are chosen so that E ψ(X, Y) | Z(1) , . . . , Z(N ) = α . ψ(X, Y) We simplify by noting that m X i=1 Then T = r Xi + n X Yi = −1 Z(i) , i=1 i=1 n mn N +1 N X Pn m X Xi2 + i=1 P N n X Yi2 = N X 2 . Z(i) i=1 i=1 Pn Yi − i=1 Z(i) − i=1 Yi m P 2 PN 1 2 − ( Z(i) ) Z i=1 (i) m+n−2 N i=1 −1 , Pn PN a monotone function of i=1 Yi given Z(1) , . . . , Z(N ) and hence given i=1 Z(i) and PN 2 i=1 Z(i) . Our test is therefore equivalent to Pn 1 if Pi=1 Yi > c(Z(1) , . . . , Z(N ) ) ψ(X, Y) = γ(Z(1) , . . . , Z(N ) ) if ni=1 Yi = c(Z(1) , . . . , Z(N ) ) 0 otherwise . How do we find c? Note that, given (Z(1) , . . . , Z(N ) ), the values of Y1 , . . . , Yn are known, but not their order. Each ordering is equally likely under H; thus, the conditional distribution of Y1 , . . . , Yn given (Z(1) , . . . , Z(N ) ) is P [Yj = Z(ij ) , 1 ≤ j ≤ n] = −1 N . n The test is thus equivalent to Pn (i) Form s(1) < s(2) < . . . < s(N ) , where the s(k) are the ordered s = i=1 Z(ij ) as n (i1 , . . . , in ) ranges over distinct permutations. Pn (ii) “Reject H if j=1 Yj is among the top N + 1 of the s(k) and reject with n α N Pn probability γ(Z(1) , . . . , Z(N ) ) if i=1 Yi is the n α th sk .” This test is distribution-free in the sense that it does not require the knowledge of the distribution F = G under H. It is the so called size α permutation t-test introduced by R.A. Fisher. In practice, randomization is not used and one settles for a level α test with size ≤ α. The main downside with permutation tests is that the calculation and ordering of the sj become prohibitive for m, n large. However, there is a simple alternative, the Monte Carlo 84 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 ∗ ∗ permutation test: Resample Z1b , . . . , Znb , 1 ≤ b ≤ B, from {Z(1) , . . . , Z(N ) } with equal probability ∗ P [Z11 = z(i1 ) , . . . , ∗ Zn1 = z(in ) |Z(i) −1 N = z(i) , 1 ≤ i ≤ N ] = . n ∗ ∗ That is, draw n times without replacement from {ZP (1) , . . . , Z(N ) } to get (Z11 , . . . , Zn1 ), n ∗ then repeat B times. Order the resulting B sums i=1 Zib , 1 ≤ b ≤ B, and reject if P n ✷ i=1 Yi is among the top [Bα] + 1. This test has level α. Basu’s theorem We noted that in Example 8.2.4, we were able Pnto eliminate the need for computing a conditional distribution since the t-statistic and i=1 Xi2 are independent under H. The existence of such statistics is closely linked to ancillarity. Definition 8.2.3. A statistic S(X) is said to be ancillary for P = {Pθ : θ ∈ Θ} if Lθ S(X) is the same for all θ ∈ Θ. Theorem 8.2.3. (Basu) Suppose T (X) is sufficient and complete for P. If S(X) is ancillary for P, then T (X) and S(X) are statistically independent for all P ∈ P. If the support of P is the same for all P ∈ P, then a converse holds: If T (X) and S(X) are independent for all P , then S(X) is ancillary. Proof. For any set A in the range of S write P [S(X) ∈ A] = Eθ P [S(X) ∈ A|T ] . (8.2.18) Note that we need not put a θ subscript on the left hand side since S is ancillary, nor on the inside term of the right hand side since T is sufficient. Now (8.2.18) becomes Eθ P [S(X) ∈ A|T ] − P [S(X) ∈ A] = 0 for all θ. By completeness P [S(X) ∈ A|T ] = P [S(X) ∈ A] and the independence of S and T follows. We leave the converse for Problem 8.2.21. ✷ P 2 √ Example 8.2.4 (Continued). Here T2 = Xi is complete, sufficient, and t = nX̄/s is ancillary under H : µ = 0. Thus T2 and t are independent under H. Example 8.2.7 (Continued). Rank tests. Let Ri be the rank of Yi among Z(1) , . . . , Z(N ) , that is, Yi = Z(Ri ) . Then R = (R1 , . . . , Rn ) is, under H, necessarily uniform on all combinations {r} of size n from {1, . . . , N }. That is, P (R = r) = 1/ N n . So R is ancillary and tests based on it are distribution-free. An example Pn of such a test is the twosample Wilcoxon test which is based on the statistic W = i=1 Ri , or equivalently, U = Pn Pm 1(X < Y ). Most software provide p-values for the Wilcoxon test when i j j=1 i=1 1 min{m, n} ≤ 10. For min{m, n} ≥ 10, (U − 2 mn)/σW is close to N (0, 1), where 2 σW = mn(m + n + 1)/12. ✷ Section 8.2 8.2.2 Similarity and Completeness 85 Testing Optimality Theory We begin with Definition 8.2.4. A test φ for H : θ ∈ Θ0 vs. K : θ ∈ Θ1 is unbiased level α iff Eθ φ(X) ≤ α, θ ∈ Θ0 , and Eθ φ(X) ≥ α, θ ∈ Θ1 . Thus an unbiased test does at least as well as the trivial test φ ≡ α at all θ. As with unbiasedness of estimates, being biased in testing is not necessarily bad. For instance, in testing H : µ = 0√vs K : µ 6= 0 on the basis of n observations from N (µ, 1), the one-sided √ tests, “Reject iff nX̄ ≥ z(1 − α)” and “Reject iff nX̄ ≤ z(α)” are both biased yet reasonable. Nevertheless, unbiasedness often leads to reasonable procedures. Recall that UMP stands for “uniformly most powerful.” Definition 8.2.5. The test φ∗ is a UMP unbiased level α test if it is unbiased and Eθ φ∗ (X) ≥ Eθ φ(X) for all θ ∈ Θ1 if φ is also unbiased level α. Let ω ≡ ∂Θ0 ∩∂Θ1 denote the intersection of the boundaries of Θ0 and Θ1 . We assume that ω is not empty. Note that if φ is unbiased level α and θ → Eθ φ(x) is continuous, then φ is similar on ω, that is, Eθ φ(x) = α for θ ∈ ω. Lemma 8.2.1. Lehmann. If θ → Eθ φ(X) is continuous for each test φ, and if φ∗ is level α and UMP among the class of test φ satisfying Eθ φ(X) = α for θ ∈ ω, then φ∗ is UMP unbiased α. Proof. The class of tests that satisfies Eθ φ(X) = α for θ ∈ ω contains the class of tests that are unbiased level α. ✷ Remark 8.2.2. If P = {Pθ : θ ∈ Θ}, Θ ⊂ Rd is an exponential family of distributions, then θ → Eθ φ(X) is continuous, for every test function φ. The connection between unbiasedness and completeness comes via Theorem 8.2.4. Suppose θ → Eθ φ(X) is continuous for all φ and T (X) is sufficient and complete for P0 = {Pθ : θ ∈ ω}. Suppose θ1 ∈ Θ1 . Then the UMP unbiased level α test of H : θ ∈ Θ0 vs K : θ = θ1 is > c T (X), θ1 1 P X|T (X) θ φ∗ (X) = γ(T ) if 1 = c T (X), θ1 pw X|T (X) < c T (X), θ 0 1 with c, γ determined by Eθ [φ∗ (X)|T (X)] = α, θ ∈ ω. Proof. Optimality among all unbiased level α tests follows from: (i) Every unbiased level α test is similar on ω = ∂0 Θ0 ∩ ∂Θ1 . (ii) Every test similar on ω has Neyman structure on ω. (iii) The conditional power, Eθ1 φ(X)|T (X) = t , is maximized for each t by φ∗ . Here the conditional probability defining the Neyman-Pearson likelihood ratio is p x; θ1 1 T (x) = t pθ1 X = x, T (X) = t . = pθ1 (x|T (X) = t) = Pθ1 T (X) = t Pθ1 T (X) = t 86 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 By the iterated expectation theorem, (B.1.20), φ∗ is also unconditionally most powerful. ✷ A simple application of this result is Theorem 8.2.5. Suppose X is distributed according to P ∈ P, a canonical exponential family with log p(x, θ) = θ T T(x) − A(θ) + b(x). Write T = (T1 , TT2 )T and θ = (θ1 , θT2 )T . Then, the test = 1 if T1 (x) > c T2 (x) (8.2.19) φ∗ (x) = γ T2 (x) if T1 (x) = c T2 (x) = 0 otherwise is UMP unbiased level α for H : θ1 ≤ θ10 vs K : θ1 > θ10 , where c and γ are determined by Eθ φ∗ (X)|T (X) = α, θ ∈ ω. Similarly, 1 − φ∗ is UMP unbiased for H : θ1 ≥ θ10 vs K : θ1 < θ10 . Proof. The family of conditional distributions of X given T2 (X) = t is a one parameter exponential family of the form q(x; t) = exp{θ1 T1 (x) − A(θ1 , t)}h(x, t)dx . Thus, we can apply Theorem 8.2.4 to conclude that the test (8.2.19) is most powerful among all unbiased level α tests. The test φ∗ is UMP for K : θ1 > θ10 because it does not depend on θ1 . Remark 8.2.3. It is possible to extend this theory to show that procedures such as the two-sided t-test are U M P unbiased. See Lehmann and Romano (2005). Example 8.2.6. (Continued). 2×2 contingency tables. Suppose that in Example 8.2.6, we have r = s = 2 and we are interested in testing H vs K : θab > θa+ θ+b for all a, b. Write N −N N −N N11 N10 p = θ11 θ011+ 11 θ10 θ00+0 10 N11 N10 θ11 θ10 N N = θ011+ θ10+0 θ01 θ00 N1+ θ10 N n−N = γ N11 θ011+ θ10 +1 θ00 n = γ N11 λN1+ η N+1 θ01 θ11 . θ10 P [Y = 1|X = 1]P [Y = 0|X = 0] where γ = = θ01 θ00 P [Y = 1|X = 0]P [Y = 0|X = 1] (8.2.20) and λ and η are appropriately defined from (8.2.20). Note that γ > 0 ⇐⇒ θab > θa+ θ+b for all a, b, and thus γ = 0 corresponds to independence. Therefore, by Theorem 8.2.4, the test which rejects H if N11 > c(N1+ , N+1 ) randomizes on the boundary and accepts otherwise, is UMP unbiased. By Remark 8.2.3, there is also a two-sided UMP unbiased test φ∗ of the form, Reject if N11 > c2 (N1+ , N+1 ) or N11 < c1 (N1+ , N+1 ) with randomization γ1 , γ2 at both c1 and c2 , determined by Section 8.2 87 Similarity and Completeness (i) EH φ∗ (N)|N1+ , N+1 = α and (ii) EH N11 φ∗ (N)|N1+ , N+1 = αEH N11 |N1+ , N+1 = α N1+nN+1 . The condition (ii) comes from the requirement that the conditional power function of the test has derivative 0 at γ = 0 which is necessary for unbiasedness. Unfortunately, this test doesn’t coincide with the conditional χ2 test discussed earlier nor with the naive two-tailed test: Reject iff N11 < c∗1 (N1+ , N+1 ) or N11 > c∗2 (N1+ , N+1 ) with randomization on the boundary, all chosen so that, if N + 1 denotes (N1+ , N+1 ), + + + + ∗ PH [N11 > c∗1 (N + 1 )|N 1 ] + γ1 (N 1 )P [N11 = c1 (N 1 )|N 1 ] = α 2 with a similar identity for c2 , γ2 . Finally, consider ✷ Example 8.2.7. (Continued). Permutation t-test. Consider an alternative with F = N (µ, σ 2 ), and G = N (µ + ∆, σ 2 ), ∆ > 0. Then the test of Theorem 8.2.4 can be written as Reject H if p(X, Y; θ 1 ) 1 T (X, Y) = T > c(Z(1) , . . . , Z(N ) ) p(X, Y; θ 0 ) (8.2.21) where θ 1 = (µ, µ + ∆, σ 2 ) and θ0 = (µ, µ, σ 2 ). This holds, since by sufficiency of T (X, Y) = (Z(1) , . . . , Z(N ) )T on w = Θ0 = {θ0 : µ ∈ R, σ 2 > 0} and the general principles of conditioning, . pθ0 [X, Y|T(X, Y) = t] = p(X, Y; θ0 )1[T(X, Y) = t] Pθ0 T(X, Y) = t for any θ 0 ∈ Θ0 corresponding Pn to H. But by Section 4.9.3 the ratio Pn in (8.2.21) is a monotone increasing function of i=1 Yi given Z(1) , . . . , Z(N ) . Since i=1 Yi and Z(1) , . . . , Z(N ) do not involve any parameters, we conclude that the permutation t test is UMP unbiased against the usual {N (µ, σ 2 ), N (µ + ∆, σ 2 ) : ∆ > 0} alternatives. Remark 8.2.4. Recently the Lehmann-Scheffé theory of this section has been used to derive most powerful unbaised selective tests for inference in exponential family models after data based model selection. See Fithian, Sun and Taylor (2014). 8.2.3 Estimation We will now show how completeness provides the most powerful tool that we have for the construction of uniformly minimum variance unbiased (UMVU) estimates. Recall from Section 3.4.1 that given a model P ≡ {Pθ : θ ∈ Θ}, an estimate δ of a parameter q(θ) ∈ R is unbiased iff Eθ δ(X) = q(θ) for all θ . A UMVU estimate δ ∗ is an unbiased estimate such that, for all θ and all unbiased δ, Varθ δ ∗ (X) ≤ Varθ δ(X) . 88 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 Given any unbiased estimate δ(X) and sufficient statistic T (X) we now show how to construct an unbiased Rao-Blackwell (RB) estimate depending on T only which is at least as good. Theorem 8.2.6. (Rao-Blackwell) Suppose δ is an estimate of q(θ) and Eθ δ 2 (X) < ∞ for all θ. Let T (X) be a sufficient statistic and denote the Rao-Blackwell estimate of q(θ) by (8.2.22) δRB T (X) = E{δ(X)|T (X)} . Then, for all θ ∈ Θ, a) Biasθ (δRB ) = Biasθ (δ). b) MSEθ δRB (T ) ≤ MSEθ δ(X) . (8.2.23) Equality holds in (8.2.23) iff δ(X) is a function of T (X) (with probability 1). Note. As in the case of conditional tests, sufficiency enables us to conclude that δRB does not depend on θ so that it is a bona fide estimate. Proof. Part a) is a consequence of the iterated expectation theorem, Eθ δRB T (X) = Eθ E δ(X) T (X) = Eθ δ(X) . (8.2.24) Further, 2 MSEθ δ(X) = Eθ E δRB (X) − q(θ) T (X) 2 = Eθ E δ(X) T (X) − q(θ) 2 = Eθ E δ(X) − q(θ) T (X) 2 2 ≤ Eθ E δ(X) − q(θ) T (X) = Eθ δ(X) − q(θ) by applying the inequality (EU )2 ≤ EU 2 conditionally on T (X). By A.11.9, equality can hold iff δ(X) is constant given T (X). The theorem follows. ✷ Note. The theorem is a special case of Problem 3.4.2 which shows that δRB improves on δ(X) for convex rather than just quadratic loss functions. The following Lehmann-Scheffé theorem (1950, 1955) completes the story of complete sufficient statistics and UMVU estimates. Corollary 8.2.2. (Lehmann–Scheffé) Suppose T (X) is complete as well as sufficient for P and δ(X) is unbiased for q(θ). Then δRB (T ) is the unique UMVU estimate of q(θ). e unbiased, is better than Proof. By part a) of the theorem, δRB (T ) is unbiased. Suppose δ, 2 2 e δRB (T ), i.e. Eθ δ(X) − q(θ) ≤ Eθ δRB (X) − q(θ) with strict inequality for some θ. Then δeRB (T ), the Rao-Blackwell version of δe has 2 2 Eθ δeRB (T ) − q(θ) ≤ Eθ δRB (T ) − q(θ) . (8.2.25) Section 8.2 Similarity and Completeness 89 On the other hand, Eθ δRB T (X) − Eθ δeRB T (X) = Eθ δRB (T ) − δeRB (T ) = 0 for all θ. The result follows because completeness of T implies that δRB (T ) − δeRB (T ) = 0 . ✷ Constructing UMVU estimates when a complete, sufficient T is available. Method 1. Given q(θ), find δ such that Eθ δ T (X) = q(θ) for all θ. Then δ T (X) is UMVU. Method 2. Given q(θ), find δ(X) such that Eθ δ(X) = q(θ) for all θ. Then δ ∗ ≡ E δ(X) T (X) is UMVU. Example 8.2.8. Exponential families. T (X) is UMVU for Eθ T (X) ≡ Ȧ(θ). Recall two special cases: a) N/n is UMVU for p if N ∼ M(n, p). All aT N are UMVU estimates of aT p. b) X̄ is UMVU for µ if X1 , . . . , Xn are i.i.d. Np (µ, Σ). This example is also covered by Theorem 3.4.3. ✷ More examples will be given in the problems. However, consider the closely related Example 8.2.9. Estimating the variance covariance matrix Σ when X1 , . . . , Xn are i.i.d. Np (µ, Σ) with both µ, Σ unknown Pn for n ≥ 2. By Theorem 8.2.2, a complete sufficient statistic here is T(X) = n−1 i=1 Xi XT where X = (Xij ), 1 ≤ i ≤ n, 1 ≤ j ≤ d. P b ≡ n−1 n (Xi − X̄)(Xi − X̄))T has We try Method 2. The MLE Σ i=1 ( n ) X 1 b = E E(Σ) (Xi − µ)(Xi − µ)T − (X̄ − µ)(X − µ)T n i=1 n−1 1 = = Σ 1− . (8.2.26) n n b is an unbiased estimate of Σ and since it is a function of T only, it Therefore n/(n − 1) Σ b is not a linear function of T so that the theory of Section 3.4 doesn’t is UMVU. Note that Σ apply. ✷ Remark 8.2.5. We could not establish in Section 3.4 that the unbiased estimate s2 of σ 2 in the Gaussian linear model is UMVU because Var(s2 ) does not achieve the information lower bound. Now we have shown that s2 is UMVU. Here is another example. 90 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 Example 8.2.10. Estimation of p for geometric observations. For illustration, we apply both methods in this example. Suppose X1 , . . . , Xn are i.i.d. geometric (p) P [X1 = k] = q k p, 0 < p < 1, k = 0, 1, . . . , . Then, T ≡ Pn i=1 Xi is complete and sufficient and n+k−1 k n P [T = k] = q p , k = 0, 1, . . . , . n−1 We want an unbiased estimate of p. Method 1. We are supposed to find an estimate δ such that ∞ X n+k−1 k n p= δ(k) q p , n−1 k=0 or, equivalently, 1= ∞ X k=0 n + k − 1 k n−1 δ(k) q p . n−1 Then, for n ≥ 2, by the unicity of power series and the fact that the negative binomial distribution with parameters n, k, and p assigns probability 1 to the nonnegative integers, n+k−1 n+k−2 δ(k) = . n−1 n−2 Thus the solution is δ(k) = n+k−2 n−2 n+k−1 n−1 =1− k n−1 = . n+k−1 n+k−1 Then, δ ∗ (T ) = (n − 1)/(n + T − 1), n ≥ 2, is UMVU. Note that the MLE of pb solves the likelihood equation, Epb(X̄) = X̄ , or, equivalently, 1 − 1 = X̄ pb yielding pb = n/(n + T ), which is close to but not the same as the UMVUE. Method 2. We begin with a simple unbiased estimate, δ(X1 ) ≡ 1(X1 = 0). Then, E δ(X1 )|T = k = P [X1 = j|T = k] # " n i hX X Xi = k Xi = k − j /P = P X1 = j, = i=2 n+k−j −2 . n+k−1 . n−2 n−1 Section 8.2 Similarity and Completeness 91 By Corollary 8.2.2, δ ∗ (k) = P [X1 = 0|T = k] is the UMVU estimate which we already derived using Method 1. ✷ Here is a nonparametric example. Example 8.2.11. Estimation of F . Suppose X is univariate, F is the df of X which is assumed completely unknown. We assume P is the family of all distributions with continuous case densities. Here X(1) < . . . < X(n) are complete, sufficient by Example 8.2.3. Since n n 1X 1X Fb (x) ≡ 1(Xi ≤ x) = 1(X(i) ≤ x) n i=1 n i=1 and Fb is unbiased, we conclude that Fb (x) is an UMVU estimate of F (x) for all x. More generally, every function q(Fb) of Fb(·) is a UMVU estimate of its own expectation. ✷ Remark 8.2.6. The basis of our treatment and many more examples may be found in Lehmann and Casella (1998) and the problems. Summary. We introduce the concept of completeness for a class of probability distributions P and show how it can be used in the construction of tests and estimates with desirable properties. In essence, P is complete if it admits a sufficient statistic T such that the only function v(T ) that is an unbiased estimate of zero for all P ∈ P is the function v(T ) = 0. For this case T is also called complete. We call a test φ(x) similar or distribution-free with respect to a hypothesized class of probabilities P0 if EP φ(X) = α for all P ∈ P0 and show that if T (X) is a complete, sufficient statistic for P0 , then a randomized test φ is similar level α iff EP φ(X)|T(X) = α with P probability one for all P ∈ P0 . A test φ satisfying EP φ(X)|T (X) = α, P ∈ P0 , for a complete, sufficient statistic T is said to have Neyman structure. Important examples of complete, sufficient statistics are the natural sufficient statistic in k-dimensional exponential family models and the vector of order statistics or the empirical distribution. It is shown that when a complete, sufficient statistic exists, it can be used to construct similar tests by conditioning on a statistic which is complete and sufficient for the null hypothesis class of probabilities P0 . Important special cases are testing problems in exponential family models, Gaussian models, contingency tables, and the two-sample models where conditioning leads to permutation tests and rank tests. We define S(X) to be an ancillary statistic with respect to P if LP S(X) does not depend on P for P ∈ P, and show Basu’s theorem stating that if S(X) is ancillary for P and T (X) is complete, sufficient for P, then S(X) and T (X) are independent when X ∼ P ∈ P. For testing H : θ ∈ Θ0 vs k : θ ∈ Θ1 , we define a test φ to be unbiased level α if Eθ φ(X) ≤ α for all θ ∈ Θ0 and Eθ φ(X) ≥ α for all θ ∈ Θ1 ; that is, φ is no more likely to accept H when K is true than when H is true. We define φ to be UMP unbiased level α if it is UMP among all unbiased level α tests and show how such tests can be found for some models when a simple statistic which is complete, sufficient with respect to {Pθ : θ ∈ ∂Θ0 ∩ ∂Θ1 } is available. We give the Rao-Blackwell theorem which establishes that if S is any estimate of q(θ), then the conditional expected value E(S|T ) of S given a complete, sufficient statistic T has mean squared error at least as small as that of S. The Lehmann-Scheffé theorem establishes that if S is unbiased, E(S|T ) is UMVU. We give 92 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 two methods for finding UMVU estimates and show, for example, that in the multivariate T P P normal model X̄ = n−1 Xi and (n − 1)−1 Xi − X̄ Xi − X̄ are UMVU for µ and Σ. 8.3 Invariance, Equivariance, and Minimax Procedures The topics we present in this section are a small subset of the extensive discussion given in Lehmann’s classics, Theory of Point Estimation (TPE) and Testing Statistical Hypotheses (TSH), now Lehmann and Casella (1986) and Lehmann and Romano (2005). 8.3.1 Group Models In Section 8.2 we have seen how to reduce composite hypotheses to simple ones by conditioning on a sufficient statistic. Basu’s theorem further enabled us to identify situations where ancillary statistics could be used to turn conditional tests into unconditional ones. In this section we will begin by identifying an important structural property which is typical of such situations and which has further important implications. We recall the definition of a group(1) G of transformations g of a set L. Recall that all g ∈ G map L onto itself in a 1 − −1 fashion. Group multiplication is composition, and G is closed under composition and inversion. That is, i) If g1 , g2 ∈ G so does g1 ◦ g2 given by g1 ◦ g2 (x) ≡ g1 (g2 (x)). ii) If g ∈ G so does g −1 defined by g ◦ g −1 = g −1 ◦ g = j where j(x) = x, all x ∈ L. Here is an important example of such a group. The affine group on Rd Here L = Rd and G can be parametrized by θ ≡ (A, b) where Ad×d ranges over all nonsingular matrices and b over Rd . Then, g(A,b) (x) ≡ Ax + b. It is easy to see that G is a group and that the group G is isomorphic to the group G on A × Rd where A is the set of all nonsingular matrices via the correspondence g(A1 ,b1 ) ◦ g(A2 ,b2 ) = g(A1 A2 ,A1 b2 +b1 ) −1 g(A,b) = g(A−1 ,−A−1 b) . As we know G is not commutative. Most of the groups we consider will be subgroups, some commutative, of the affine group. Groups on the sample space X induce models. The simplest situation is when we start with a single P0 and define P = {P0 g −1 : g ∈ G} (8.3.1) or, equivalently, if X ∼ P0 , g(X) ∼ P0 g −1 ∈ P. Note that G parametrizes P. If the parametrization is identifiable, P0 g0−1 = P0 g1−1 ⇒ g0 = g1 , then G induces a group G Section 8.3 Invariance, Equivariance, and Minimax Procedures 93 on P defined by g(P ) ≡ P g −1 ∈ P. More generally, if θ → Pθ , θ ∈Θ is an identifiable parametrization of P, then G induces a group G on Θ via Pgθ = Pθ g −1 . (8.3.2) It is easy to establish (Problem 8.3.1) that (8.3.2) defines g, a transformation of Θ uniquely, and that G = {g ↔ g ∈ G} is a group isomorphic to G. That is, the map g → g has the properties g1 ◦ g2 = g 1 ◦ g 2 , g −1 = g −1 . We shall call models such as (8.3.1) transformation models induced by P0 , G. Here is a basic example. Example 8.3.1. The multivariate Gaussian model. Take P0 = Nd (0, J), where J is the d × d identity matrix. That is, if ε ≡ (ε1 , . . . , εd ) ∼ P0 , the ε1 , . . . , εd are i.i.d. N (0, 1). The transformation model induced by P0 and the affine group is just P = {Nd (µ, Σ); µ ∈ Rd , Σ nonsingular} (Problem 8.3.2). Note that the parametrization PA,b corresponding to g(A,b) (ε) = Aε + b is not identifiable—see Example 9.1.5. ✷ More generally, we define a group model P by the property that if P ∈ P, X ∼ P , and g ∈ G, then Pg−1 ∈ P. Here is a simple example. Example 8.3.2. The shift model. Fix a density f0 on Rd and let G0 be the group of all translations on Rd , gc (x) = x + c, and P0 the corresponding transformation model. G0 is clearly a subgroup of the affine group and if f0 is Gaussian P0 ⊂ P of Example 8.3.1. Note that the parametrization is identifiable. The model can be written structurally, X=c+ε (8.3.3) where ε ∼ f0 . For Example 8.3.1, X = Aε + b, where A is nonsingular and ε ∼ N (0, J). Now consider the set of all probability distributions corresponding to (8.3.3) when we take f0 to be arbitrary. This remains a group model under the shift group but is not a transformation model since we cannot transform one density shape, say Gaussian, into another, say Cauchy, by simply shifting. Equivalently the parametrization (c, f ) → f (·−c) is not identifiable. ✷ The major group we consider in addition to the affine group, and some of its subgroups, is the group of increasing transformations: Example 8.3.3. The group G of all continuous strictly increasing functions from R onto R. Let F0 be the cdf of P0 where F0 ∈ G. Then the model P generated by P0 and G is the set of all P on R with continuous strictly increasing df. If we consider the subgroup of g which have positive derivatives and P0 has a density which is positive we generate the submodel of all P having positive densities (Problem 8.3.3). ✷ Remarks 8.3.1. (a) Any given point x0 generates an orbit Ox ≡ {y : y = gx for some g ∈ G}. Belonging to the same orbit is an equivalence relation on X (Problem 8.3.4). (b) The groups we have defined so far allow only for samples of size 1. However, if a group G is defined on X and we have a sample (X1 , . . . , Xn ) taking values in X n then we 94 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 naturally define g (n) (x1 , . . . , xn )T = (g(x1 ), . . . , g(xn )) (8.3.4) the tensor group with the natural inherited group operations and properties. (c) For n = 1 all the groups we have considered are transitive, that is, there is only one orbit, X itself. That is, given x0 , x1 ∈ X there exists g ∈ G such that gx0 = x1 . For n > 1, following the definition in (8.3.4) this is no longer the case. For instance, for the shift group with d = 1, orbits are indexed by Rn−1 . Given (x01 , . . . , x0n−1 )T the orbit is the hyperplane, (x01 , . . . , x0n−1 , 0)T + c(1, . . . , 1)T (Problem 8.3.5). We re-encounter such quantities subsequently. 8.3.2 Group Models and Decision Theory The basic connection between group models and inference is made via a group G ∗ on the action space A and the loss function relation, l(P g −1 , g ∗ a) = l(P, a) (8.3.5) for all P ∈ P, g ∈ G, g ∗ ∈ G ∗ , a ∈ A. In other words, the cost of taking action a when you observe X ∼ P is the same as that of taking action g ∗ a when you observe gX ∼ P g −1 . Consistency with (8.3.5) for a decision rule δ is defined by δ(gx) = g ∗ δ(x) (8.3.6) for all x ∈ X . Such rules are called equivariant. In the case of tests they are called invariant for reasons that will become apparent. In other words again, equivariance means that if action a is “right” when you see x then g ∗ a is “right” if you see gx. Requiring decision procedures to have this property has the same nature as requiring unbiasedness. This time rules in the class obey certain symmetries. What G ∗ appear? Here are examples showing what happens in estimation and testing. Example 8.3.4. Testing in the Gaussian two-sample problem. Suppose that as in Section 4.9.3, we observe a sample X1 , . . . , Xn1 of i.i.d. N (µ, σ 2 ) observations and an independent sample Y1 , . . . , Yn2 of N (µ + ∆, σ 2 ) observations. We wish to test H : ∆ = 0, and postulate the usual 0 − 1 loss function. The model P can be viewed as a group model with respect to the shift group applied to (X1 , . . . , Xn1 , . . . , Y1 , . . . , Yn2 ). If Xi → Xi + c, 1 ≤ i ≤ n1 , Yj → Yj + c, 1 ≤ j ≤ n2 under gc , and we parametrize P by θ ≡ (µ, ∆, σ 2 ) in the standard way, then g c θ = (µ + c, ∆ + c, σ 2 ). Note that Θ0 ≡ {(µ, ∆, σ 2 ) : ∆ = 0} is an orbit of G as is Θ1 = {(µ, ∆, σ 2 ) : ∆ 6= 0}. ✷ Following Lehmann and Romano (2005), we shall in general call a testing problem H : θ ∈ Θ0 vs K : θ ∈ Θ1 invariant if P is a group model and Θ0 and Θ1 are disjoint Section 8.3 95 Invariance, Equivariance, and Minimax Procedures unions of G orbits. Consider now the usual 0 − 1 loss. If Θ0 is an orbit, if θ0 ∈Θ0 , l(gθ, 0) = 1 − l(gθ, 1) for all g and similarly for Θ1 . It is then natural to take G ∗ = {Identity}, the trivial group, and see that (8.3.6) defines an invariant test δ(gx) = δ(x) (8.3.7) for all g, x. We take (8.3.7) as our general definition of an invariant test. Example 8.3.4 (continued). In this example invariance means δ(x1 + c, . . . , xn1 + c, y1 + c, . . . , yn2 + c) = δ(x1 , . . . , xn1 , y1 , . . . , yn2 ) (8.3.8) for all x, y, c. For n1 = n2 = 1, this model is invariant under the location-scale group, g(a,b) (x1 , y1 ) = (ax1 + b, ay1 + b) a 6= 0, b ∈ R . Now g (a,b) (µ, ∆, σ 2 ) = (aµ + b, aµ + b + a∆, a2 σ 2 ) and Θ0 and Θ1 are again orbits. This invariance is preserved for general n1 , n2 if g is extended to Rn1 × Rn2 as in (8.3.4). Now test invariance becomes more stringent, δ(ax1 + b, . . . , axn1 + b, ay1 + b, . . . , ayn2 + b) = δ(x, y) for all a 6= 0, b (8.3.9) ✷ Example 8.3.2. The shift model (continued). Suppose we observe X1 , . . . , Xn i.i.d. P from the shift model with fixed f0 . Parametrize the shift model by θ ∈ R, P = {fθ ≡ f0 (· − θ) : θ ∈ R}. Suppose we want to estimate θ with loss l(θ, a) ≡ λ(|θ − a|), λ : R+ → R+ . Then, l(gc θ, a) = λ(|θ + c − a|) = λ(|θ − (a − c)|) . So, for equivariances, we must define gc∗ (a) = a + c and G ∗ is the shift group on R. Thus a translation equivariant estimate δ satisfies δ(x1 + c, . . . , xn + c) = δ(x1 , . . . , xn ) + c, all x1 , . . . , xn , c . (8.3.10) ✷ Example 8.3.5. The Gaussian one-sample problem. Suppose we observe X1 , . . . , Xn i.i.d. N (µ, σ 2 ). As in Example 8.3.4 this model is a group model for G = {gab : xi → axi + b, 1 ≤ i ≤ n, a 6= 0, b ∈ R}. If θ = (µ, σ 2 ), then g (a,b) (µ, σ 2 ) = (aµ + b, a2 b2 ) , a group operating on R×R+ . Suppose we want to estimate µ with the equivariant KullbackLiebler loss (Vol. I, page 169), which in this case is equivalent to scaled quadratic loss, l(θ, d) = (µ − d)2 /σ 2 . Then, as in the previous example, ′ l(g(a,b) θ, d′ ) = (µ − d a−b )2 (aµ + b − d′ )2 = . a2 σ 2 σ2 96 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 ∗ Thus we must define g(a,b) (d) = ad + b and G ∗ is the affine group on R. Correspondingly, an estimate δ of µ is said to be location-scale equivariant iff δ(ax1 + b, . . . , axn + b) = aδ(x1 , . . . , xn ) + b, all a 6= 0, b ∈ R . 8.3.3 (8.3.11) Characterizing Invariant Tests Invariant functions Rather than restrict ourselves to invariant test functions we more generally define an invariant function ψ : X → τ , where τ is arbitrary, as one such that ψ(gx) = ψ(x), all g ∈ G, x ∈ X . (8.3.12) Invariant functions can equivalently be characterized as functions which are constant on orbits of G. This leads naturally to the definition of a maximal invariant (function) M : X → L, where L is general, as a function which is invariant, but further, M (x1 ) 6= M (x2 ) if x1 , x2 are not in the same orbit. Thus M labels orbits uniquely. Clearly every invariant ψ is a function of M ; there exists λ : L → τ such that ψ(x) = λ(M (x)) . (8.3.13) From its definition we see that M is far from unique. In fact, for those familiar with measure theory the appropriate object to consider is the invariant σ field I = {A measurable ⊂ X : gA = A for all g ∈ G}. M simply induces I and (8.3.13) can be interpreted as:“Every invariant function is measurable with respect to I.” Suppose that testing H : θ ∈ Θ0 vs K : θ ∈Θ1 is invariant under G as discussed in the previous section. We define ϕ∗ (x) (possibly randomized) as the (uniformly) most powerful (UMP) invariant level α, test of H vs K1 if ϕ∗ is invariant, has level α and is at least as powerful as any other level α invariant test (defined by ϕ invariant). Our discussion makes it plain that a UMP invariant (UMPI) test is simply the UMP test based on observing M (X) only. Here are three examples of maximal invariants. Example 8.3.6. The shift group. We consider the group applied to [Rd ]n , gc : (x1 , . . . , xn ) → (x1 + c, . . . , xn + c). If, for all c, ψ(x1 + c, . . . , xn + c) = ψ(x1 , . . . , xn ), let c1 = −x1 . Then, ψ(x1 , . . . , xn ) = ψ(0, x2 − x1 , . . . , xn − x1 ) . (8.3.14) But the right hand side of (8.3.14) is clearly invariant with respect to shift. Evidently M (x1 , . . . , xn ) ≡ (x2 −x1 , . . . , xn −x1 ) is a maximal invariant since (x02 −x01 , . . . , x0n − Section 8.3 Invariance, Equivariance, and Minimax Procedures 97 x01 ) = (x12 − x11 , . . . , x1n − x11 ) implies that (x01 , . . . , x0n ) = (x11 + c, . . . , x1n + c) where c = x01 − x11 , which is just the condition required by maximality. Note that we could just as well have taken M (x1 , . . . , xn ) = (x1 − x, . . . , xn − x) where x is the mean of the xi . Note also that (x1 , M (x1 , . . . , xn )) is an alternative representation of (x1 , . . . , xn ). Thus, the value of M (x1 , . . . , xn ) tells us which orbit we’re on and x1 parametrizes points in the orbit, or equivalently G. More generally, if G is parametrized smoothly by θ → gθ where θ ∈ Θ, an l dimensional manifold, and the dimension of X is p, we expect M (x) to range over a p − l dimensional manifold. If l ≥ p, then G is transitive. ✷ Example 8.3.7. The orthogonal group. Let X = Rd and G be the group of orthogonal d×d matrices; the set of all Ad×d such that AAT = AT A = J, the identity. As is well known orthogonal transformations preserve Euclidean distances between points, |Ax − Ay|2 = |A(x − y)|2 = (x − y)T AT A(x − y) = |x − y|2 . This shows that M (x) ≡ |x|2 is invariant. On the other hand |x|2 = |y|2 implies that there exists A orthogonal such that Ax = y (Problem 8.3.6). Thus, |x|2 is a maximal invariant. ✷ Example 8.3.3 (continued). The group of monotone transformations. Suppose g is a continuous strictly monotone increasing function from R onto R and x(1) < . . . < x(n) . Evidently, g(x(1) ) < . . . < Png(x(n) ). Thus if X = {(x1 , . . . , xn ) : x1 6= x2 . . . 6= xn } the ranks Rj (x1 , . . . , xn ) = i=1 1(xi ≤ xj ), 1 ≤ j ≤ n, are invariant. On the other hand, suppose (R1 (x), . . . , Rn (x)) = (R1 (y), . . . , Rn (y)), x, y ∈ X . Then, there exists g ∈ G such that g(x(i) ) = y(i) , 1 ≤ i ≤ n, where x(1) , . . . , x(n) , y(1) , . . . , y(n) are the ordered x’s and y’s. Hence (R1 , . . . , Rn ) is a maximal invariant (Problem 8.3.7). Note in this case a natural representation of (x1 , . . . , xn ) is (R1 (x), . . . , Rn (x), x(1) , . . . , x(n) ). ✷ We now construct UMPI tests. Example 8.3.8. Testing one shift family against another. Let X ∈ Rd be distributed according to Pj,c , j = 0, 1, c ∈ Rd where Pj,c has continuous case density fj (· − c), j = 0, 1, and f0 , f1 are of different types, that is, f1 does not belong to the location family generated by f0 . We want to test H : P ∈ P0,c for some c vs K : P ∈ P1,c for some c. The problem is clearly invariant under the shift group G. By Example 8.3.6, we need only consider tests based on M (X) = (X2 − X1 , . . . , Xd − X1 ). Note that the distribution of M (X) does not depend on c and M (X) has density on Rd−1 given by Z ∞ gj (m) = fj (u, m1 + u, . . . md−1 + u)du . (8.3.15) −∞ Thus a UMPI level α test exists and is of the form Reject H if g1 (M (X)) > cg0 (M (X)), Accept H if g1 (M (X)) < cg0 (M (X)) with c determined by g0 and α. ✷ This example suggests the result: Proposition 8.3.1. The distribution of any maximal invariant M (X) of a group G leaving a model P = {Pθ : θ ∈Θ} invariant depends on θ only through M (θ), i.i.d. as (X, Y ) 98 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 the maximal invariant of the group G induced on Θ by G. Proof: See Problem 8.3.8 or Lehmann and Romano (2005). Example 8.3.9. The Gaussian linear model with known variance. Suppose that as in Section 6.1, Y n×1 = µn×1 + ε where ε ∼ N (0, σ02 J) and µ ∈ V , a linear subspace of Rn of dimension d. We want to test µ = 0 vs µ 6= 0. This testing problem is left invariant by the group G of orthogonal matrices operating on Rn , since Aµ = 0 iff µ = 0 and Aε ∼ N (0, σ02 AJAT ) = N (0, J) if A is orthogonal. By Example 8.3.6 Pd M ≡ |X|2 = j=1 Xi2 is maximal invariant. Under H, M has a σ02 χ2d distribution. More generally, M ∼ σ02 χ2d (|µ|2 σ02 ) where χ2d (θ2 ) is a noncentral χ2 distribution with noncentrality parameter θ2 . It may be shown (Problem 8.3.9) that the χ2d (θ2 ) family is Monotone Likelihood Ratio (MLR) in θ2 and M . Therefore the test, Reject H iff M ≥ χd (1 − α) σ02 where χd (1−α) is the (1−α) quantile of χ2d , is UMP invariant by Theorem 4.3.1. Extension of this result to the case σ 2 unknown and linear hypotheses are given in Lehmann and Romano (2005). ✷ Example 8.3.10. The nonparametric two-sample problem. Let X1 , . . . , Xn1 be i.i.d. F , Xn1 +1 , . . . , Xn be i.i.d G where F, G are continuous but otherwise unknown. All the X’s are independent. The testing problem, H : F = G vs K : F 6= G, is invariant under the transformation group of Example 8.3.3, applied to X = (X1 , . . . , Xn )T . Thus R ≡ (R1 , . . . , Rn )T is a maximal invariant. Under H, we have already seen that P [(R1 , . . . , Rn ) = (i1 , . . . , in )] = 1/n! for all permutations {i1 , . . . , in } of {1, . . . , n}. Under the alternative, however, although there is a formula for the distribution of R, there is no UMP invariant test (Problem 8.3.10). However, as was first noted by Lehmann (1953) there are semiparametric group submodels P ≡ {Pθ,F : G = q(F, θ), θ ∈ R} for which (R1 , . . . , Rn ) has a distribution depending on θ only (Problem 8.3.11). For instance, Savage (1956) considers G = 1 − (1 − F )θ , (8.3.16) θ ≥ 1, a model which includes F (t) = 1 − e−λt , G(t) = 1 − e−λθt , the exponential lifetime two-sample model. The Lehmann-Savage(2) model is a special case of the Cox proportional hazard model, Example 9.1.4, with λ being the hazard rate of F and λθ that of G. See Problems 1.1.12 and 1.1.13. By arguing as in Example 8.3.8, for the model G = 1 − (1 − F )θ , there is a UMP invariant (wrt F ) test for H : θ = 1 vs K : θ = θ1 > 1, which in this case can be computed (Problem 8.3.10(c)). Unfortunately, it depends on θ1 and we need a new concept: A test ψ of H : θ = θ0 vs K : θ > θ0 is locally MP if there exists ε > 0 such that ψ is MP for all θ ∈ (θ0 , θ0 + ε). A formula for computing such tests is given in Problems 8.3.10-8.3.12. For model (8.3.16) there is a locally most powerful (see Problems 8.3.12 and 8.3.13) UMP invariant test called the Savage exponential scores test. Section 8.3 Invariance, Equivariance, and Minimax Procedures 99 Let aE (j) = E(X(j) ) where X(1) < . . . < X(n) are the exponential, E(1), order statistics (see Problem 8.3.13). Then the Savage exponential scores test is given by Reject H iff n X aE (Ri ) < c i=n1 +1 with c determined under F = G by the level α and the uniform distribution on permutations. Hoeffding (1951) considered the optimal rank test for the Gaussian two-sample problem. Note that all rank tests are permutation tests and similar and they have the attractive feature that when H : F = G holds their distributions do not depend on F = G. They are distribution-free. Example 8.3.11. Invariant regression tests. Copula models. As in Section 6.1 consider the framework where the ith measurement (response) Yi among n independent observations has a continuous distribution Fi and we want to investigate whether Fi depends on a vector zi = (zi1 , . . . , zid )T of available constants (predictor values), 1 ≤ i ≤ n, d < n, where zi1 , . . . , zin are not collinear. We consider the testing problem H : Fi = F , 1 ≤ i ≤ n, vs K : Fi 6= Fk , some i, k, which is invariant under the group of continuous increasing transformations of Example 8.3.3 applied to Y = (Y1 , . . . , Yn )T . Thus the rank vector R = (R1 , . . . , Rn )T of Y is maximal invariant. Consider semiparametric group submodels of the form Fi (·) = Ci F (·) , 1 ≤ i ≤ n , where Ci , which is called a copula, is a continuous df on (0, 1) which is known except for a parameter θi that depends on zi ; and F is an unknown continuous baseline df. The term “copula,” introduced by Sklar (1959), is a measure of the dependence between variables X1 , . . . , Xp . It is defined as C(u1 , . . . up ) = P (G1 (X1 ) ≤ u1 , . . . , Gp (Xp ) ≤ up ), p ≥ 2 , where the marginal df’s G1 , . . . , Gp are assumed to be continuous. Here we extend this definition to the p = 1 case: the regression copula for Yi ∼ Fi w.r.t. the hypothesis H : Fi = F , 1 ≤ i ≤ n, for continuous F is Ci (u) = P F (Yi ) ≤ u , = Fi F −1 (u), 1 ≤ i ≤ n . The copula Ci (u) is a measure of the dependence of the distribution Fi on zi . The term “baseline distribution” df refers to the hypothesis distribution F that does not depend on zi . Set Vi = F (Yi ), then Ri = Rank(Vi ) because the ranks are invariant under increasing transformations. The df of Vi is Ci F F −1 (v) = Ci (v), 0 ≤ v ≤ 1 , and the distribution of R only involves Ci (·). Useful choices are the Gaussian copula Ci (v) = Φ Φ−1 (v) − θi and θi = β T zi , in which case Φ−1 (Vi ) = Φ−1 F (Yi ) ∼ 100 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 N (θi , 1) and we may assume Fi = N (θi , 1) when we deal with rank tests. For this copula, when d = 1 and θi = α + βzi , the locally UMP invariant (w.r.t. α, F ) test of H0 : β = 0 vs H1 : β > 0 is based on the normal scores statistic 1 T Φ = n− 2 n X i=1 aΦ (Ri )(zi − z̄i ) , where aΦ (k) = E(z (k) ) with Z (1) < . . . < Z (n) the N (0, 1) order statistics (Problem 8.3.14(d)). The normal scoresaΦ (k), which are also called the Fisher-Yates scores, can be closely approximated by Φ−1 (k − 3/8)/(n + 1/4) . The locally UMP invariant test for the logistic copula Ci (u) = L(L−1 (u) − θi ) with −1 L(t) = 1 + exp(−t) and θi = α + βzi is based on the uniform scores statistic 1 T U ≡ n− 2 n X Ri (zi − z̄) . n+1 i=1 Critical values for the tests based on TΦ and TU can√be obtained from the distribution of the ranks or their asymptotic distributions: TΦ /s and 12TU /s are asymptotically N (0, 1) under H with n X (zi − z̄)2 /(n − 1) s2 = i=1 provided (zi ), 1 ≤ i ≤ n, satisfy the Lindeberg condition max (z − z̄)2 Pn i i →0. 2 i=1 (zi − z̄) See Section 9.5 and Hajek and Sidak (1967), Section V.1.5. ✷ T The parameter β in the model θ = β z in the previous example is only identifiable if i i d ≤ n and (z1j , . . . , znj )T , 1 ≤ j ≤ d are not collinear; see Problem 1.1.9. However, modern technology has made it possible to generate data with d very large, so called “big data.” In particular, d > n. We turn to this case: Example 8.3.11 (continued). Invariant high dimensional data analysis. One approach to Pd the d > n case is to replace β T zi in the preceding copula model with aT zi = j=1 aj zij , where the aj are selected to maximize the spread of {aT zi , : 1 ≤ i ≤ n} subject to aT a = 1. This makes sense if we regard aT zi as predictors because regression analysis is most effective when the predictors are spread out. Thus we choose a to maximize the sample variance of {aT zi : 1 ≤ i ≤ n}. That is, we use (see Section B.10.12 and Problem 8.3.24) the first sample eigenvector b : aT a = 1} b a = arg max{aT Σa b is the sample covariance matrix of (zij )n×d . Let where Σ ti = b azi , 1 ≤ i ≤ n . Section 8.3 Invariance, Equivariance, and Minimax Procedures 101 Here t1 , . . . , tn , which are called the first principal component sample values, can be computed using available principal component analysis (PCA) software, e.g. “eigensoft.” Now we use the semiparametric copula regression submodel Fi (·) = Ci (F (·)) with Ci having parameter ηi = α + βti . This in an example of a sparse model. Several principal components (Problem 8.3.24) can also be used. Typically, ten or fewer are used. The locally UMP invariant PCA tests of H0 : β = 0 vs H1 : β > 0 for the Gaussian and logistic regression copulas are the normal scores and uniform scores tests, respectively, with zi replaced by ti . b and the eigenvector b Remark 8.3.2. Σ a are known to be inconsistent estimates of their population counterparts when d > n, unless the model for the data is restricted to a submodel. One approach is to assume that the predictor is a random vector Z ∈ Rd whose covariance matrix Σ is restricted; see Bickel and Levina (2008 a, b). A similar approach is to restrict the ordered population eigenvalues λ1 ≥ λ2 ≥ . . . (as defined in Section B.10.12 and Problem 8.3.24) of Σ. Such approaches are referred to as sparse PCA, e.g. see Patterson et al. (2006), Paul (2007), El Karoui (2008), Amini and Wainwright (2009), Jung and Marron (2009), Johnstone and Lu (2009), Witten, Tibshirani and Hastie (2009), Shen, Shen and Marron (2011), Birnbaum, Johnstone, Nadler and Paul (2013), and Hastie, Tibshirani, and Wainwright (2015). These references focus on the estimation of the covariance matrix of Z and its eigenvalues and eigenvectors. For papers that focus on the association between individual predictors Zk and a response Y , see Price et al. (2006), Lin and Zheng (2011), and Yang, Doksum and Tsui (2014). ✷ Example 8.3.11 (continued). Invariant tests in transformation models. Consider the transformation regression model with T h(Yi ) = θi + εi , 1 ≤ i ≤ n , where θi = β zi , εi has a known continuous distribution G, and h is an increasing continuous unknown function. In this case h−1 (Yi ) ∼ θi + εi . Thus if G = Φ and we use rank tests, we are in the realm of the Gaussian copula model and can use the normal scores rank tests described earlier in this example. The proportional hazard and proportional odds models are tranformation regression models. See Problem 8.3.14. Example 8.3.12. Invariant tests of independence. Bivariate copulas. Consider the problem of testing the independence of two random variables X and Y with continuous df F (x, y). The testing problem is invariant under increasing continuous transformations h(X), g(Y ), and for a sample (X invari 1 , Y1 ), . . . , (X Pn Pnn , Yn ) i.i.d. as (X, Y ) the maximal ant is (R1 , S1 ), . . . , (Rn , Sn ) where Ri = k=1 1(Xk ≤ Xi ) and Si = k=1 1(Yk ≤ Yi ) are the ranks. Sklar (1959) introduced models of the form Fθ (x, y) = Cθ F1 (x), F2 (y) for some continuous df (copula) Cθ on [0, 1] × [0, 1] which is known except for the parameter θ, where F1 and F2 are the marginal df’s of X and Y . A useful choice is the bivariate Gaussian copula −1 2 Cρ (u, v) = Φρ Φ−1 1 (u), Φ2 (v) , Φ1 = N (µ1 , σ1 ), Φ2 = N (µ2 , σ22 ), Φρ = N (µ1 , µ2 , σ12 , σ22 , ρ) . 102 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 The locally UMP invariantP test of H0 : ρ = 0 vs H1 : ρ > 0 is based on the bivarin ate normal scores statistic i=1 aΦ (Ri )aΦ (Si ) where aΦ (k), 1 ≤ k ≤ n, are the normal scores (see Problem 8.3.16). The null distribuion is obtained by rearranging the pairs (R1 , S1 ), . . . , (Rn , Sn ) so that R1 < R2 < . . . < Rn and noting that the resulting Y -ranks S∗ has P (S∗ = s) = 1/n! for each permutation s of (1, . . . , n). Klaassen and Wellner (1997) show that in the Gaussian copula model the normal scores correlation coefficient Pn a (R )a (S ) PnΦ i2 Φ i ρbΦ = i=1 i=1 aΦ (i) is an asymptotically semiparametrically efficient estimate of ρΦ = Corr Φ−1 (F1 (X)), Φ−1 (F2 (Y1 )) for each ρ ∈ (−1, 1) uniformly in F1 , F2 over the class of all regular estimates. It can be shown that for the bivariate normal copula model, |ρΦ | is the maximum monotone correlation coefficient (Problem 8.3.16) in the sense that |ρΦ | = sup corr a(X), b(Y ) a,b for a and b monotone. Approximate critical value for H : ρ = 0 can be obtained from 1 n− 2 ρbΦ ∼ N (0, 1). For more on rank tests see the problems, Hajek and Sidak (1967), Lehmann (2006), and Lehmann and Romano (2005). 8.3.4 Characterizing Equivariant Estimates Characterizing equivariant estimates is relatively simple in the case of the shift group. Suppose as in Example 8.3.2 we observe X 1 , . . . , X n ∈ Rd i.i.d. f (· − θ) for f and θ unknown. We want to estimate θ and our loss function is of the form λ(|θ − a|), with λ increasing. As we argued for d = 1 an appropriately equivariant estimate obeys δ(x1 + c, . . . , xn + c) = c + δ(x1 , . . . , xn ) . (8.3.17) Examples of equivariant estimates are δ ∗ = X, or (med X1i , . . . , med Xdi )T , or even i i X 1 . Note that if δ, δ ∗ are equivariant then δ − δ ∗ is invariant by (8.3.17). Thus, if M (x) is a maximal invariant under G, any equivariant estimate can be written for some ψ as δ(x1 , . . . , xn ) = δ ∗ (x1 , . . . , xn ) + ψ(M (x1 , . . . , xn )). (8.3.18) Taking δ ∗ (x1 , . . . , xn ) = x1 gives us a particularly simple expression in (8.3.18). From this formula it is easy to deduce that there is a uniformly best equivariant estimate and derive its form, at least for λ(t) = t2 and similar loss functions. Section 8.3 103 Invariance, Equivariance, and Minimax Procedures Theorem 8.3.1. If there exists an equivariant estimate for the shift group with finite risk, then there is a unique (Uniform Minimum Risk Equivariant) UMRE estimate, given by δ ∗ (x1 , . . . , xn ) = x1 − ψ ∗ (M (x)) (8.3.19) where M is a maximal invariant and ψ ∗ (M ) = arg minE0 (λ(X 1 + c)|M ). c Proof. Write, keeping f = f0 fixed, R(θ, δ) = R(0, δ) = E0 λ(X 1 + ψ(M (X 1 , . . . , X n ))) . (8.3.20) The reason for the first identity is that X 1 − θ and ψ(M (X 1 , . . . , X n )) have a distribution not depending on θ. But E0 λ(X 1 + ψ(M )) = E0 {E0 (λ(X 1 + ψ(M ))|M )} . Since ψ is arbitrary the result follows. ✷ 2 If we specialize to λ(t) = t we readily obtain, by Theorem 1.4.1 (Problem 8.3.17), δ ∗ (X 1 , . . . , X n ) = X 1 − E0 (X 1 |M ) . If X 1 has density f0 we can obtain a very suggestive form, R Qn ∗ Rd θQ i=1 f0 (X i − θ)dθ R δ (X 1 , . . . , X n ) = . n i=1 f0 (X i − θ)dθ Rd (8.3.21) (8.3.22) We derive (8.3.22). Let M (X 1 , . . . , X n ) = (X 2 − X 1 , . . . , X n − X 1 ). Then, the conditional density of X 1 |M under θ = 0 is Qn−1 f0 (u) i=1 f0 (mi + u) f (u|M ) = R . Qn−1 i=1 f0 (mi + v)dv Rd f0 (v) So, R Qn−1 uf0 (u) i=1 f0 (mi + u)du E0 (X 1 |M ) = R Qn−1 i=1 f0 (mi + v)dv Rd f0 (v) Rd and X 1 − E0 (X 1 |M ) = R Rd Qn−1 (X 1 − u)f0 (u) i=1 f0 (mi + u)du . R Qn−1 f (v) i=1 f0 (mi + v)dv Rd 0 We obtain (8.3.22) by changing variables, from v to θ = X 1 − v. The UMRE estimate (8.3.22) known as Pitman’s estimate is the (improper) Bayes estimate of θ when θ has prior “density,” π(θ) ≡ constant, the “uniform distribution” on Rd . In particular, Example 8.3.13. Estimating the mean of a multivariate Gaussian distribution. Let X ∼ N (µ, σ02 JP ) where JP is the identity matrix. By sufficiency, because X̄ is also Gaussian, we can consider n = 1 and suppose X = θ + e where e ∼ Np (0, Jp ). Then, we obtain readily that X is the UMRE estimate of θ, just as we have seen it is UMVU.We will consider this and other estimates of µ in Section 8.3.6. ✷ 104 8.3.5 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 Minimaxity for Tests: Application to Group Models Recall Theorems 3.3.2 and 3.3.3 where we saw that minimax procedures δ ∗ were Bayes or approximately Bayes with respect to prior distributions which concentrated on the set of parameters that produce the maxima of the risk of δ ∗ . We have already noted that invariant and equivariant procedures have large sets of constant risk, the orbits of G in Θ. Thus, best equivariant procedures are natural candidates for minimaxity. Testing: We begin with a general study of minimaxity for testing. Here, the asymmetry of the two types of error leads to an asymmetric formulation of minimaxity not covered in Chapter 3. Definition 8.3.1. ϕ∗ is minimax level α for testing H : θ ∈ Θ0 vs K : θ ∈ Θ1 iff ϕ∗ is level α and inf{Eθ ϕ∗ (x) : θ ∈ Θ1 } ≥ inf{Eθ ϕ(x) : θ ∈ Θ1 } for all other level α tests ϕ. Thus the minimax test maximizes the minimum power, where for 0-1 loss, power = 1 risk. Such ψ ∗ is sometimes called a maxmin test. Suppose we are given a prior distribution π for θ on Θ0 ∪ Θ1 ≡ Θ such that i) Π(Θ0 ) = λ = 1 − Π(Θ1 ), 0 < λ < 1 ii) θ given θ ∈ Θj has distribution Πj , j = 0, 1 and a loss function lu given for 0 < u < 1 by lu (θ, 1) = u1(θ ∈ Θ0 ), lu (θ, 0) = 1(θ ∈ Θ1 ) . From a simple extension of Example 3.2.2 (Problem 8.3.18) we see that Bayes tests are of the form, for 0 < λ < 1, R p(x, θ)Π1 (dθ) ϕc (x) = 1 if RΘ1 >c Θ0 p(x, θ)Π0 (dθ) = 0 if the inequality is reversed (8.3.23) where c = λu/(1 − λ). The argument in Example 3.2.2 makes it clear that (8.3.23) doesn’t depend of the finiteness of Θ. We can apply Theorem 3.3.2 to obtain Theorem 8.3.2. Suppose ϕc given by (8.3.23) is size α, and that Π0 {θ : Eθ ϕc (X) = supEθ′ ϕc (X)} = 1 Θ0 Π1 {θ : Eθ (1 − ϕc (X)) = supEθ′ (1 − ϕc (X))} = 1 . Θ1 Then, ϕc is minimax level α. (8.3.24) Section 8.3 Invariance, Equivariance, and Minimax Procedures 105 Proof. Let β = supEθ′ (1 − ϕc (X)) and Θ1 u= α λ c , = . β 1−λ u (8.3.25) Note that ϕc is Bayes for Π, lu and that for R(θ, ϕc ) = Eθ lu θ, ϕc (X) , supR(θ, ϕc ) = supR(θ, ϕc ) = supR(θ, ϕc ) . Θ0 Θ1 (8.3.26) Θ Now, ϕc Bayes, (8.3.26), and Theorem 3.3.2 yield that ϕc is minimax for lu and that the Bayes risk equals the maximum risk. Then, given a competing ϕ of level α, by hypothesis, Z Z λα + (1 − λ)uβ = λ Eθ ϕc (X)Π0 (dθ) + (1 − λ)u Eθ (1 − ϕc (X))Π1 (dθ) Z Z ≤ λ Eθ ϕ(X)Π0 (dθ) + (1 − λ)u Eθ (1 − ϕ(X))Π1 (dθ) ≤ λα + (1 − λ)u supEθ (1 − ϕ(X)) Θ1 and the result β ≤ sup Eθ (1 − ϕ(X)) follows. Θ1 Corollary 8.3.1. If Θ1 = {θ1 }, ϕc is given by (8.3.23) and the first condition of (8.3.24) holds; then ϕc is UMP level α for H : θ ∈ Θ0 vs K : θ = θ1 . It may be shown that the conditions of Theorem 8.3.2 and Corollary 8.3.1 are also necessary for minimaxity and UMP properties of test rules under regularity conditions on the model P and compactness of Θ0 , Θ1 (Blackwell and Girshick (1954, 1979)). Example 8.3.14. Minimaxity in testing in the Gaussian linear model with σ 2 known. H : µ = 0 vs K : |µ| = r0 . Consider the test of Example 8.3.9, which we have shown to be UMP invariant under the orthogonal group. Note that the power function of this test depends on |µ|/σ0 only and hence (8.3.24) holds automatically. We need to exhibit Π0 and Π1 . Without loss of generality take σ0 = 1. Evidently Π0 is point mass at 0. For Π1 it is natural to choose the uniform distribution on the sphere surface {µ : |µ| = r0 }. Then Z 2 2 p(X, µ) 1 1 L(X) ≡ (8.3.27) dΠ1 (µ) = e 2 |X| E(e− 2 |X−r0 U | |X) Θ1 p(X, 0) where X, U are independent, X ∼ N (0, J), and U is uniform on the surface of the unit sphere. Now (8.3.27) becomes L(X) = e− 2 r0 2 E(er0 (X,U ) |X) = e− 2 r0 2 E(er0 (|X|(X/|X|,U ) |X) . Since the distribution of U is invariant under rotations we can replace X/|X| by, say, (1, 0, . . . , 0)T without changing the result. Thus L(X) is a monotone increasing function 106 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 of |X|. We conclude that the UMP invariant test is of the form (8.3.23) and hence we can apply Theorem 8.3.2 to conclude minimaxity. ✷ Example 8.3.15. A UMP test for a composite multivariate Gaussian hypothesis. Let X 1 , . . . , X n be i.i.d. Np ((ϑ, η T )T , Σ0 ). The scalar ϑ and β ≡ (η2 , . . . , ηp )T are unknown; Σ0 is nonsingular but known. We wish to test H : ϑ = ϑ0 vs K : ϑ = ϑ1 > ϑ0 . Note that both hypothesis and alternative are composite and multivariate so that the UMP test theory of Chapter 4 doesn’t apply. First reduce by sufficiency to the case n = 1 by considering X. Consider the case Σ0 = J. Define Π0 , Π1 as corresponding to point masses at (ϑ0 , η 0 )T and (ϑ1 , η 0 )T . Then, if X ≡ (X1 , . . . , Xp )T , R p(X, θ)dΠ1 (θ)dΠ1 (θ) L(X) ≡ Θ1 R Θ0 p(X, θ)dΠ0 (θ) 1 = exp − {(X − θ1 )T (X − θ1 ) − (X − θ0 )T (X − θ0 )} 2 where θ j = (ϑj , η2 , . . . , ηp )T , j = 0, 1. Then, 1 log L(X) = (ϑ1 − ϑ0 )X1 − (|θ 1 |2 − |θ0 |2 ) 2 (8.3.28) and thus the family of Bayes tests ϕc for the loss function lu (θ, ϕ) is equivalent in this case to the family {Reject H iff X1 > d} . But under H, X1 ∼ N (ϑ0 , 1). The conditions of Corollary 8.3.1 are clearly satisfied since the tests have power depending on ϑ only. It follows (Problem 8.3.16(a)) that the size α test for this hypothesis is, in fact, UMP for H : ϑ ≤ ϑ0 vs K : ϑ > ϑ0 . Now consider the case of general Σ0 . Make the well-specified linear transformation, X → U = (X1 , X T[2,p] − Cov(X1 , X[2p] )(VarX1 )−1 X1 )T , (8.3.29) where X [2,p] ≡ (X2 , . . . , Xp )T . Note that ΣR ≡ Cov(X1 , X [2,p] ) = (σ12 , . . . , σ1p ) and VarX1 = σ11 so that the second term has dimension p − 1. Write U ≡ (U1 , . . . , Up )T ≡ AX and σ11 Σ12 Σ0 = . Σ21 Σ22 It follows that U ∼ Np (U , AΣ0 AT ) where U ≡ Aϑ. Since µ1 = ϑ1 we still test H : ϑ = ϑ0 vs K : ϑ = ϑ1 . Most significantly, AΣ0 AT is of the form σ11 0T 0 Σ∗22 where Σ∗22 = Var(U2 , . . . , Up )T . It is easy to see (Problem 8.3.16(b)) that in this case, just as with the special case Σ0 = J = diag(1, . . . , 1), the Bayes test with the same prior as before is of the form {Reject H iff X1 > c}. Therefore this test is UMP for H : ϑ ≤ ϑ0 Section 8.3 Invariance, Equivariance, and Minimax Procedures 107 vs K : ϑ > ϑ0 no matter what Σ0 is. This is the likelihood ratio test and is both UMP invariant and unbiased (Problem 8.3.19(c),(d)). Remark 8.3.3. (a) It turns out, see Lehmann and Romano (2005), that some of the classical likelihood ratio tests for H : σ ≤ σ0 , etc. when we observe X1 , . . . , Xn i.i.d. N (µ, σ 2 ) are UMP and not just UMP unbiased, as we saw in Section 8.2, but most are not. (b) The theory of minimax testing and, as we shall see, estimation in group models is very closely tied to the existence of (invariant) Haar measures on groups—see Nachbin (1965) for a treatment. If G can be smoothly indexed by a Euclidean parameter θ ∈ K, a compact set in Rd , these can be taken as probability distributions, PL , PR on G. They have the property that if A (measurable) ⊂ G then PL (gA) = PL (A) PR (Ag) = PR (A) for all A and g ∈ G. Here gA = {h = gb, b ∈ A} and similarly for Ag. The pair PL , PR are unique, although PL 6= PR is possible if G is not commutative. If G = {gθ : θ ∈ Θ} and G is the group induced on Θ we can produce a prior distribution on any orbit of G by considering gθ0 for θ0 fixed and g ∼ PL or PR . It turns out that PR is the Haar measure important in our context. In our Example 8.3.9, where G is the orthogonal group, indeed PL = PR is the uniform distribution on G and not surprisingly this generates the uniform prior on a sphere (an orbit of G). If G is not compact, improper Haar measures arise. We discuss this further in the next subsection. A very complete treatment of the theory of UMP and minimax tests is given by Lehmann and Romano (2005). 8.3.6 Minimax Estimation, Admissibility, and Steinian Shrinkage We have discussed minimax estimation briefly in Chapter 3. In this section we want to go further and foreshadow some phenomena discussed in Chapters 10 and 11. We begin by recalling Section I.6 in simplified form. Suppose X1 , . . . , Xp are independent Gaussian with common known variance σ02 and EXi ≡ µi unknown and arbitrary. Equivalently, X ∼ Np (µ, σ02 J) where J = diag(1, . . . , 1). We want estimate µ with quadratic loss. That is, if d = (d1 , . . . , dp )T , l(µ, d) = Pto p 2 |µ−d| = i=1 (µi −di )2 . This is a special case of Example 8.3.2 with f0 ↔ Np (0, σ02 J). It is easy to see that Pitman’s estimator here is X itself since M (X) is degenerate (n = 1) and so E0 (X|M (X)) = E0 X = 0. We know that X is UMVU and that, by Problem 3.3.20, X is minimax. If we have a multivariate normal sample, X̄ is Gaussian, so the remarks about X also apply to X̄. Remark 8.3.4. As with testing, minimaxity and equivariance here can be tied to Haar’s measure on the shift group. X1 , . . . , Xp are the formal Bayes estimates with respect to 108 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 R∞ R∞ the “uniform distribution” on Rp . Since −∞ −∞ dt1 . . . dtp = ∞ the uniform is an improper prior. It corresponds to a Haar measure which is not a probability distribution. If R λ(A) ≡ A dt1 . . . dtp then λ(A + c) = λ(c + A) = λ(A). Corresponding Haar measures on scale and other groups generate priors, usually improper, which play an important role in Bayesian inference, being viewed as “uninformative.” Various arguments can be raised for their use including frequentist efficiency. There is now little controversy in the literature that these methods are as legitimate as maximum likelihood. There is more controversy about the nonsubjective interpretation of posterior probabilities calculated under such assumptions. Arguments in favor of the subjective point of view may be found in Bernardo and Smith (1994), Berger(1985), and Gelman, Carlin, Stern and Rubin (1995). ✷ We introduce a decision theoretic notion now only rarely considered which played an important role in the discovery of the practical weaknesses of the minimax principle. A decision rule δ ∗ in a given decision problem is said to be admissible iff there exists no other δ such that R(P, δ) ≤ R(P, δ ∗ ) for all P ∈ P with < for some P . Otherwise δ ∗ is said to be inadmissible. By refining the methods of Section 3.3 it is possible to obtain usable criteria for proving admissibility (Blyth (1951)). This is also possible using the information inequality (Hodges and Lehmann (1951)). This work and relations to complete class theorems such as those mentioned in Section 1.3 are discussed in some detail in Lehmann and Casella (1998)—see also Schervish (1995). We do not proceed further into the depths of this topic but make the trivial point that if a procedure has constant risk and is minimax but inadmissible then any improving procedure is necessarily also minmax. Stein’s remarkable (1956(b)) discovery was that although for the risk function E(|X − µ|2 ), X is minimax for all p, it is admissible only for p = 1, 2. In a sense this can be viewed as an indication that Robbins’ (1964) empirical Bayes view that, for p large, X is improvable in fact already holds for p > 2. The basic tool in this analysis which will prove its value more broadly is an identity due to Stein (1981) giving what has become known as Stein’s unbiased risk estimate. Theorem 8.3.3. Let X ∼ N (µ, J) and δ : Rp → Rp be an estimate of µ. Define ∆(x) = δ(x) − x. (8.3.30) Suppose that, for 1 ≤ j ≤ p, the jth component of ∆ is piecewise continuously differentiable with respect to xj and Pp (i) Eµ |∆(x)2 | = j=1 Eµ ∆2j (X) < ∞ and h i Pp d∆j (ii) j=1 E | dxi (X)| |Xj − µj | < ∞ Then, the quadratic loss risk of δ is given by Eµ |δ(X) − µ|2 = p + Eµ |∆(X)|2 + 2Eµ p X ∂∆j j=1 ∂xj (X) . (8.3.31) Section 8.3 109 Invariance, Equivariance, and Minimax Procedures Proof. The proof relies on a lemma important not only for this result but, more generally, as the foundation of a large body of results in probability referred to as the Stein-Chen method (Stein (1981)). Lemma 8.3.1. Let X ∼ N (µ, σ 2 ). Let g : R → R be a continuous, piecewise differentiable function such that (i) E|g ′ (X)| < ∞, (ii) E|X − µ|g(X) < ∞. Then, Eg ′ (X) = E (X − µ) g(X) . σ2 (8.3.32) Proof. Use integration by parts to get 1 σ A Z 1 1 A x−µ x−u dx = g(x) ϕ − )dx g (x) ϕ g(x)ϕ′ ( σ σ σ σ −A −A −A A Z A (x − µ) x−µ x−µ 1 dx + g(x) ϕ = g(x) ϕ σ σ σ2 σ −A −A Z A ′ x−µ σ and the result follows since g(x)ϕ x−µ σ |A −A → 0 as A → ∞. ✷ The reason for the utility of this elementary lemma is that it holds for arbitrary g. A simple generalization provides a unique characterization of N (µ, r2 ). Condition (ii) may be dispensed with (Stein (1981)). Proof of Theorem 8.3.3. By assumption, Eµ |δ(X) − µ|2 = Eµ |X − µ|2 + 2Eµ ∆T (X)(X − µ) + Eµ |∆(X)|2 p Z ∞ Z ∞ p X Y = p+2 ϕ(xi − µi )dxi + Eµ |∆(X)|2 . ∆j (x)(xj − µj ) j=1 −∞ −∞ i=1 Write Z ∞ −∞ = Z ∞ −∞ Z Z ∞ −∞ ∞ ∆j (x)(xj − µj ) Y −∞ i6=j p Y i=1 ϕ(xi − µi )dxi Z ϕ(xi − µi )dxi ∞ −∞ ∆j (x)(xj − µj )ϕ(xj − µj )dxj . Apply Stein’s identity (8.3.32) to ∆j (x) viewed as a function of xj with {xi : i 6= j} fixed. The theorem follows. ✷ 110 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 This identity does indeed provide an unbiased estimate of the risk (mean squared error) of any estimate δ satisfying the conditions of Theorem 8.3.4. It can be reinterpreted as saying that b ≡ p + |∆(X)|2 + R p X ∂∆j j=1 ∂xj (X) (8.3.33) is an unbiased estimate of R(µ, δ) ≡ Eµ |δ(X) − µ|2 . The application of (8.3.33) to studying inadmissibility of X with respect to a competitor δ(X) is given by Corollary 8.3.2. Suppose δ satisfies the conditions of Theorem 8.3.4, and ∆ defined by (8.3.30) is such that 2 p X δ∆j j=1 δxj (x) + |∆|2 (x) ≤ 0, (8.3.34) with strict inequality on a set of positive probability. Then for all µ, R(µ, δ(X)) < R(µ, X) = p . (8.3.35) Note that still, sup R(µ, δ(X)) = p (8.3.36) µ Stein’s original (1956(b)) proposal of an estimate satisfying (8.3.34) for p ≥ 3 was δ s (X) = p−2 X 1− |X|2 (8.3.37) where (8.3.35) was established. The nature of δ s is interesting. It always lies along the vector X. For large values of |X| it essentially coincides with X but for |X| > p − 2 it is shrunk more and more towards 0 and, for 0 < |X| < p − 2, it points in the direction opposite to that of X. The last property is patently unreasonable and, indeed, Stein (1981) showed that the Stein positive part estimate, δ+ s (X) p−2 = 1− X, |X|2 + (8.3.38) strictly improves δ s (X) and hence X itself. Here x+ ≡ max(x, 0). δ + s (X) has a very natural interpretation. The standard test of µ = 0, Reject iff |X|2 > p − 2 is carried out. If the test accepts, estimate µ by 0, else use [1 − (p − 2)/|X|2 ]X. Section 8.3 111 Invariance, Equivariance, and Minimax Procedures Next we establish Stein’s result for δ s using his identity. Here, ∆s (X) = − (p − 2) X |X|2 ∂∆js = −(p − 2) ∂xj x2j 1 − 2 |X|2 |X|4 ! Thus, if ∆s ≡ (∆1s , . . . , ∆ps )T 2 |∆| + 2 = p X ∂∆js j=1 ∂xj (x) = (p − 2)2 2p(p − 2) 4(p − 2) − + |X|2 |X|2 |X|2 (p − 2)(2 − p) < 0 if p ≥ 3 |X|2 (8.3.39) and Corollary 8.3.2 shows that δs renders X inadmissible. The same argument establishes that δs+ renders X inadmissible. Unfortunately showing that δs+ renders δs inadmissible requires showing that the difference of their risks satisfies p + X ∂(∆ − ∆ ) sj sj 2 Eµ |∆s |2 − |∆+ ≥0. s | +2 ∂x j j=1 (8.3.40) + + Here ∆+ s , ∆sj are defined analogously to ∆s , ∆sj with δs replaced by δs . An enormous literature has grown up from Stein’s results. One of the major conclusions of the theory is that “shrinking” towards linear submodels, not just 0, is an improvement for relatively small values of p. An important method in applied regression analysis, “Ridge regression,” invented independently, successfully utilizes shrinkage towards regression models of dimension k < p if there are p regressors. A very extensive discussion of this literature and many more results may be found in Lehmann and Casella (1998). We shall return to this topic in Chapters 10 and 11. We close this section, following Efron and Morris (1973), by showing how δs may be motivated from a parametric empirical Bayes point of view. Empirical Bayes and the James-Stein estimates δ s Suppose X ∼ Np (µ, J) as we have assumed throughout this subsection. Now, following the empirical Bayes point of views of Section I.6, suppose µ1 , . . . , µp are i.i.d. N (0, σ 2 ) with σ 2 unknown. If σ 2 were known the Bayes estimate of µ is seen from Example 3.2.1 to be δ B (X) = (δB (X1 )), . . . , δB (Xp ))T where δB (x) = σ 2 x/(1 + σ 2 ). Can we estimate σ 2 from our data? Note that, uncondition2 2 ally, the Xi are i.i.d. N (0, 1 + σ 2 ). Thus σ b2 ≡ |X| p − 1 is the UMVU, MLE of σ and we 112 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 are led to bB (X) ≡ δ 1− p |X|2 X. (8.3.41) This is close to but not quite the James-Stein (1961) estimate. It may in fact be shown,see Theorem 5.1, Lehmann and Casella (1998), for instance, that δc (X) ≡ 1 − c(p−2) X |X|2 uniformly improves X provided that p ≥ 3 and 0 < c < 2. Thus, the estimate (8.3.41) uniformly improves X for p > 4. Remark 8.3.5. We have seen that the Stein-type estimators of a vector parameter with p ≥ 3 have a smaller average mean squared error than that of the usual unbiased estimator. However, the mean square error of any one of the components in the Stein vector estimate may be larger than that of the usual unbiased estimate. See Rao and Shinozakt (1978). Summary: This section explores connections between equivariance, invariance, minimax decision rules, and group models. We consider models generated by location groups, location-scale groups, affine groups, and monotone transformation groups. It is found that when we restrict decision rules to those that have the same equivariance properties as the groups generating the models, we can in some cases derive uniformly minimum risk equivariant estimates and equivariant minimax tests. Testing problems that are invariant with respect to the group of increasing transformations lead to rank tests as the invariant tests. We consider rank tests that are locally UMP invariant for semi-parametric model subgroups called copulas. We also discuss weaknesses of the “minimax equivariant” principle by presenting results of Stein that show the inadmissibility of such minimax procedures. 8.4 PROBLEMS AND COMPLEMENTS Problems for Section 8.2 1. Let X ∼ B(n, θ). Show that X(n − X)/n(n − 1) is the UMVU estimate of θ(1 − θ). 2. Variable Selection. Methods for selecting the covariates to be included in a regression model are often based on unbiased estimates of mean squared estimation error (or equivalently (Problem I.7.5), mean squared prediction error). b b b 1 )−R(µ, b 0 ) is an UMVU estimate of R(µ, µ b 1 )− (a) In Problem I.7.3, show that R(µ, µ µ b 0 ) when model (I.8.1) holds. R(µ, µ bp − R bq is an UMVU estimate of Rp − Rq when model (b) In Problem I.7.4, show that R (I.8.2) holds. 3. Let X1 , . . . , Xn be a sample from U(θ1 , θ2 ) where θ1 , θ2 are unknown. (a) Show that T (X) = min(X1 , . . . , Xn ), max(X1 , . . . , Xn ) is sufficient. Section 8.4 113 Problems and Complements (b) Assuming that T (X) is complete, find a UMVU estimate of (θ1 + θ2 )/2. (c) Show that T (X) is complete. 4. Let N = (N1 , . . . , Nk ) have a M(n, θ1 , . . . , θk ) distribution, k ≥ 2, θ1 , . . . , θk unknown. Find a UMVU estimate of θ2 − θ1 . 5. Let X1 , . . . , Xn be a sample from a N (µ, σ 2 ) population. (a) Show that if σ 2 is known and µ is not, then X̄ is an UMVU estimate of µ. (b) Show that the UMVU estimate of σ 2 if µ = µ0 is known, and σ 2 is not, is n 1X (X1 − µ0 )2 . n i=1 6. Let X1 , . . . , Xn be as in Problem 4.2.5 and σ = 1. Find the UMVU estimate of Pµ [X1 ≥ 0] = Φ(µ). Hint. (X1 , X̄) has a bivariate normal distribution. Apply Theorem B.4.2. 7. Let X1 , . . . , Xn be a sample from a P(θ) distribution. Find the UMVU estimate of Pθ [X1 = 0] = e−θ . 8. Let X1 , . . . , Xn be a sample from a Γ(p, λ) population where both p and λ are unknown. Find the UMVU estimate of p/λ. 9. Let X = (X1 , . . . , Xn ) be a N (µ, σ 2 ) sample. Show that if n ≥ 2, X though sufficient is not complete. Hint. Consider S(X) = X2 − X1 . More instructive counterexamples may be found in Stigler (1972). See Problem 18. 10. Let X = (X1 , . . . , Xn ) be a U(0, θ] sample, θ > 0. Show that Mn = X(n) is the MLE of θ while T ∗ = [(n + 1)/n]Mn is the UMVU estimate of θ. Show that, if R denotes mean squared error, there exists an estimate T such that R(θ, T ) < min{R(θ, Mn ), R(θ, T ∗ )}. Thus Mn and T ∗ are inadmissible. n+2 1 2 Hint. Consider T = n+1 Mn . Note that R(θ, T ) = (1+n) 2θ . 11. Suppose that T1 and T2 are two UMVU estimates of q(θ) with finite variances. Show that T1 = T2 . 2 . Use the correlation inequality. Hint. If T1 and T2 are unbiased so is T1 +T 2 Problems 12, 13, and 14 give UMVU estimates of genetic trait parameters 12. Consider the genetic example of Problem 2.4.6 where X has the zero truncated binomial distribution B1 (n, θ) with n x n−x x θ (1 − θ) p(x, θ) = , x = 1, . . . , n . 1 − (1 − θ)n 114 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 (a) Show that X is complete and sufficient for θ. (b) Show that the expected number of family members with the trait given that at least one has the disease is Eθ (X) = nθ/[1 − (1 − θ)n ] and thus conclude that (X/n) is the UMVU estimate of q(θ) = θ/[1 − (1 − θ)n ]. 13. For reasons similar to those of Problem 2.4.6 the zero truncated Poisson distribution p(x, θ) = θx e−θ /x! , x = 1, 2, . . . 1 − e−θ is sometimes used as a model. Note that p(x, θ) = P (Y = x|Y ≥ 1) where Y ∼ P(θ). Show that the UMVU estimate of q(θ) = Pθ (Y ≥ 1) = 1 − e−θ is given by T ∗ = 0 if x is odd = 2 if x is even. Hint. Show that if Eθ T (X) = q(θ) then T must equal T ∗ . 14. Consider the model of Problems 2.4.6 and 8.2.12. Suppose that among the offspring of k sisters, exactly one of the children has the disease. Let X denote the number of children with the trait in the family in which one child has the disease; let Y denote the number of children with the trait in the ith of the other families; and let θ denote the true proportion of children with the trait. A reasonable model is one in which X, Y1 , . . . , Yr are independent, X has the Br (n, θ) distribution of Problem 8.2.2, and Yi ∼ B(ni , θ), where r = k − 1 and n, n1 , . . . , nr denotes the number of offspring in the families. Show that the UMVU estimate of θ is n+N −1 −1 − N T −1 T −1 n+N − N T T where T =X+ r X i=1 and N = r X ni . i=1 Pr Hint. Use Theorem 8.2.2 to show that T is complete and sufficient. Write i=1 Yi as PN i=1 Zi where Z1 , . . . , ZN are the indicators of independent B(1, θ) events. Now compute E(Z1 |T ). 15. Estimating the Probability of Early Failure. It is sometimes reasonable to think of the lifetime (time to first repair) of a piece of equipment as a random variable following an exponential distribution with parameter λ which is unknown. Suppose n “identical” pieces of equipment are run and the failure times X1 , . . . , Xn are observed. We want to estimate the probability of early failure, that is, Pλ [X1 ≤ x] = 1 − e−λx for some fixed x. Show that the UMVE is n n−1 X x T ∗ = 1 − 1 − Pn Xi ≥ x if i=1 Xi i=1 = 1 otherwise. Section 8.4 Hint. T = 115 Problems and Complements P Xi is complete, sufficient. Set S(X1 ) = 1[X1 ≤ x], then hX i x 1 E S(X1 )|T = t = P [X1 ≤ x|T = t] = P T =t . ≤ T T By the substitution theorem for conditional expectations, (B.1.16), the right hand side equals P (X1 /T ≤ (x/t)|T = t . Now, X1 /T is independent of T and has a beta, β(1, n − 1), distribution because X1 and (X2 + · · · + Xn ) are independent and the second of these variables has, by Corollary B.2.2, a Γ(n − 1, λ) distribution; next apply Theorem B.2.3. Therefore, if b1,n−1 denotes the β(1, n − 1) density, E S(X1 )|T = t = Z (x/t) b1,n−1 (u) du . 0 Since b1,n−1 (u) = (n − 1)(1 − u)n−2 for 0 < u < 1, the result follows. 16. In Problem 15, consider the UMVU estimate T ∗ and the MLE Te = 1 − exp[−x/X̄]. √ P Show that as n → ∞, n(T ∗ − Te) −→ 0. 17. Let X1 , . . . , Xn be a sample from a N (µ, σ 2 ) population. We want to estimate the proportion q(θ) of the population below a specified limit c, where θ = (µ, σ 2 ). (a) Show that the MLE of q(θ) is Φ (x − X̄)/b σ where X̄ and σ b2 are the sample mean and variance. (b) Show that the UMVU estimate of q(θ) is given by T ∗ (X) = 0 if kV ≤ −1 1 1 = G + kV if − 1 < kV < 1 2 2 = 1 if kV ≥ 1 √ where k = n/(n − 1), V = (c − X̄)/s, and G is the β 12 , 12 (n − 2) d.f. Hint. An unbiased estimate is S(X) = 1[X1 ≤ c]. Now compute b2 . E S(X)|X̄, σ b2 = P X1 ≤ c|X̄, σ 18. (Stigler (1972)). Consider the family P of distributions Pθ , θ = 1, 2, . . . , defined by Pθ [X = i] = 1 if i = 1, . . . , θ, = 0 otherwise. θ (a) Show that P is complete. Hint. Use induction. (b) Show that 2X − 1 is the UMVU estimate of θ. (8.4.1) 116 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 (c) Show that the family P0 = P − {Pk } is not complete when k is a positive integer. Hint. Consider E v(X) where v(i) = = = 0 for i 6= k, k + 1 1 for i = k −1 for i = k + 1 . (d) Show that 2X − 1 is not UMVU if Pθ ∈ P0 . Hint. Consider T (X) = 2X − 1, X 6= k, k + 1 = 2k, X = k, k + 1. Remark. This model is not regular in the sense of Sections 1.1.3 and I.0. 19. Consider model (8.4.1). For what α ∈ (0, 1) can you find a similar test of H : θ = 10 vs K : θ > 10? 20. Establish (8.2.12). 21. Establish the converse of Theorem 8.2.3. 22. Lehmann and Scheffé (1950, 1955) show that if a complete sufficient statistic exists, it must be minimal sufficient. Assume this result. Let X1 , . . . , Xn be a sample from Unif(θ − 0.5, θ + 0.5). (a) Show that (X(1) , X(n) ) is minimal sufficient. Hint. Use the factorization theorem. See Example 1.5.1 (continued). (b) Show that (X(1) , X(n) ) is not complete. Hint. Take v(s, t) = t − s − [(n − 1)/(n + 1)] and use Problem B.2.9. (c) Show that no complete sufficient statistic exists. 23. Establish (8.2.17). 24. Suppose (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. as N (µ1 , µ2 , σ12 , σ22 , ρ). Consider the two statistics: X X X X T T = Xi , Yi , Xi2 , Yi2 P (Yi − Ȳ )(Xi − X̄) ρb = P 1 . P (Xi − X̄)2 (Yi − Ȳ )2 2 If ρ = 0, are T and ρb independent? Why? Why not? 25. Let (Xi , Yi )T , i = 1, 2, . . . , n, be independent vectors with joint density fXi ,Yi (xi , yi |τ, σ) = for xi , yi > 0 and τ, σ > 0. yi2 e−yi /τ −xi yi /στ τ 3σ Section 8.4 117 Problems and Complements (a) Give a complete sufficient statistic (T1 , T2 ) for θ = (τ, σ). (b) Give the moment generating function for (T1 , T2 ) and give the corresponding joint distribution. (c) Derive the maximum likelihood estimators τb, σ b. (d) Calculate E(b τ ), E(b σ ) and construct UMVU estimators of τ and σ. 26. Let Y1 , . . . , Yn be i.i.d. from the uniform distribution U (0, θ) with an unknown θ ∈ (1, ∞). Suppose that we only observe Yi if Yi ≥ 1, Xi = i = 1, . . . , n . 1 if Yi < 1, (a) Derive a UMVUE of θ. (b) Find the MLE of θ and η ≡ P (Y1 > 1). (c) Derive a UMP test of size α for testing H : θ ≤ θ0 versus K : θ > θ0 , where θ0 is known and θ0 > (1 − α)−1/n . Problems for Section 8.3 1. Show that for the affine group on Rd , (8.3.2) defines ḡ uniquely; that Ḡ is a group, and that Ḡ is isomorphic to G. 2. Suppose X ∼ P0 = Nd (0, J), J = diag(1, . . . , 1)d×d . Show that the transformation model induced by P0 and the affine group is {Nd (µ, Σ); µ ∈ Rd , Σ ∈ S}, where S =class of d × d nonsingular matrices. 3. Establish the claim of Example 8.3.2. 4. Establish the claim of Remark 8.3.1(a). 5. Establish the claim of Remark 8.3.1(c). 6. If x, y ∈ Rd , |x|2 = |y|2 there exists A orthogonal such that Ax = y. x Hint. Construct an orthonormal basis where the first member is |x| mapping T (1, 0, . . . , 0) . See Vol. I, page 496. x |x| onto 7. (a) Show that the ranks are indeed maximal invariant in Example 8.3.2. (b) Suppose X = Rn and G is as before. Exhibit a maximal invariant in this case. 8. Establish Proposition 8.3.1. 9. Show that the Xd2 (θ2 ) family is a monotone likelihood ratio family in the parameter θ2 . Hint. See Problem B.3.12. 10. Suppose (X1 , . . . , Xn ) have joint continuous case density f1 and that f0 (x1 , . . . , xn ) = Q n i=1 h(xi ) for some univariate continuous case density h that we can choose. Let X(1) < 118 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 . . . < X(n) be the order statistics of the Xi and (R1 , . . . , Rn ) the ranks. Assume that f0 > 0 whenever f1 > 0. (a) Show that the following Hoeffding’s formula holds. P1 [R1 = r1 , . . . , Rn = rn ] = 1 Ef n! 0 f1 (X(r1 ) , . . . , X(rn ) ) h(X(r1 ) ) . . . h(X(rn ) ) . Hint. Use formula (2.4.8) and note that (X(1) , . . . , X(n) ) and (R1 , . . . , Rn ) are independent under f0 . Apply Hoeffding’s formula to deduce that when f1 (x1 , . . . xn ) = n1 Y h(xi ) i=1 n Y g(xi ) . i=n1 +1 Qn g 1 (b) P1 [R1 = r1 , . . . , Rn = rn ] = n! E0 i=n1 +1 h (X(ri ) ) . (c) Suppose h = H ′ , g = G′ and G = 1 − (1 − H)θ , θ > 0. Show that the UMP (uniformly in F0 ) rank test of H : θ = 1 vs. K : θ = θ1 > 1 where θ1 is known, is given by rejecting for large values of Tn (θ1 ) = E0 n n Y o (1 − U(ri ) )θ1 −1 , i=n1 +1 where U(1) < . . . < Un are the order statistics of a sample from U(0, 1). (d) Set h0 (u) = 1, g1 (u) = 2u, g2 (u) = 2(1 − u), 0 < u < 1, n1 = n2 = 2, and α = 1/6. Show that the MP level α rank test for H : h = g = h0 vs K : h = h0 , g = g1 rejects H iff R3 , R4 ∈ {3, 4} while the MP level α rank test for H vs (h0 , g2 ) rejects H iff R3 , R4 ∈ {1, 2}. Hint. P (R3 , R4 ∈ {3, 4}) = 42 P (X1 < X2 < X3 < X4 ), etc. (e) Describe the MP level α = k 63 rank test for testing H vs (h0 , g1 ) in model (d) when n1 = n2 = 3 and (i) k = 1, (ii) k = 2. (f) Same as (e) except for testing H vs (h0 , g2 ). 11 (a) Show that the distribution of the rank vector (R1 , . . . , Rn ) in the two-sample problem of Example 8.3.10 depends on (F, G) only through GF −1 . Thus if h is a given continuous and strictly increasing function on [0, 1], then {Pθ,h : G = hθ (F ), θ > 0} is a group submodel where the distribution of the ranks depends only on θ and h. Hint. Then rank vector is invariant under the transformation Xi → F (Xi ) ≡ Xi′ . Here Xi′ has the U(0, 1) distribution for i = 1, . . . , n1 , and the distribution GF −1 for i = n1 + 1, . . . , n. (b) In Example 8.3.10, show that GF −1 is maximal invariant. (c) Set F̄ = 1 − F and Ḡ = 1 − G. Show that the Lehmann alternative {Pθ,h : Ḡ = F̄θ } includes the exponential two-sample model. Hint. If U ∼ U(0, 1), then −λ−1 log(1 − U ) ∼ E(λ). (d) Suppose X ∼ F and Y ∼ G. Y and G are said to be stochastically larger than Section 8.4 119 Problems and Complements X and F iff P (Y ≥ t) ≥ P (X ≥ t) for all t ∈ R, that is, iff G(t) ≤ F (t), t ∈ R. In the two sample framework of Example 8.2.7, a test ϕ(x, y) of H : F = G vs K : “G is stochastically larger than F is monotone if yj′ ≥ yj , 1 ≤ j ≤ n, implies ϕ(x, y′ ) ≥ ϕ(x, y). Show that: (i) All monotone tests are rank tests. That is, functions of the ranks Pn R1 , . . . , Rn of Y1 , . . ., Yn . Also show that all tests of the form ϕ(X, Y) = 1 j=1 a(Rj ) ≥ c are monotone provided a(k ′ ) ≥ a(k) for k ′ ≥ k. (ii) All monotone tests have monotone power. That is, EF,G ψ(X, Y) ≥ EF,H ψ(X, Y) if G is stochastically larger than H. Thus monotone tests are unbiased. Remark. Wang (1996) developed a test of equality of distributions against nonparametric stochastically ordered alternatives. 12. Hoeffding (1951) and Hajek and Sidak (1967) show that for rank tests of H0 : θ = θ0 vs K : θ > θ0 , the locally UMP rank test is, under regularity conditions, based on (δ/δθ)P (R = r)|θ=θ0 . (a) Show that for allowable levels α the locally UMP rank test for the model in Problem 8.3.10(c) is the Savage exponential scores test. −λt (b) Show that UMP level α permutation test of H : F = G vs. PnK : F (t) = 1 − e , −λθt G(t) = 1 − e , θ > 1, is given by rejecting for small values of i=n1 +1 xi with critical value cn (x(1) , . . . , x(n) ). (c) Show that the test in (b) is uniformly more powerful in this parametric family of alternatives than the MP rank test given in (a). 13 (a) Show if X(1) < . . . X(n) are the exponential, E(1), order statistics, then Pthat n E(X(j) ) = i=n+1−j i−1 . Hint. See Problem B.2.14. (b) In (a), show that if j = [nu + 1], 0 < u < 1, then E(X(j) ) = − log(1 − u) + o(1). Hint. X(j) = − log(1 − U(j) ), where U(1) < . . . < U (n) are uniform, U(0, 1), order statistics. Taylor expand around U(j) ∼ = u. See Problem B.2.9. 14. Copulas. Consider Example 8.3.11. (a) Let Ci F (·) be the copula model with d = 1, θi = exp{βzi } and the LehmannSavage copula Ci (u) = 1 − [1 − u]θi , 0 < u < 1 . Use Hoeffding’s formula to show that the locally UMP invariant test of H0 : β = 0 vs H1 : β > 0 is based on the exponential scores statistic 1 T E = n− 2 n X i=1 aE (Ri )(zi − z̄) where aE (k) = E(Z (k) ) with Z (1) < . . . < Z (n) exponential order statistics. Hint. Set K(t) = 1 − exp(−t). The df of W = K −1 F (Yi ) is K(w/θi ). (Approximate critical values for TE can be obtained from TE /s ∼ N (0, 1), s2 = Σ(zi − z̄)2 /(n − 1)). 120 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 (b) In the transformation model h(Yi ) = θi + εi , set h(y) = log{− log[1 − F (y)]} for some unknown continuous df F and suppose εi ∼ 1 − exp{− exp(t)}. Let Yi ∼ Fi and assume that Fi and F have continuous case densities fi and f . Show that Fi satisfies the proportional hazard rate model where λ(y|zi ) = θi λ(y) with λ(y|zi ) = fi (y)/[1 − Fi (y)] and λ(y)/[1 − F (y)] being regression and baseline hazard rates. (c) Show that the models of (a) and (b) are the same. (d) Show that for d = 1 and θi = α + βzi in the Gaussian copula model, the normal scores statistic is locally UMP invariant. Use Hoeffding’s formula. (e) Same as (d) except logistic copula and uniform scores statistic. (f) Proportional odds model. (i) Let Ci F (·) be the copula model with d = 1, θi = α + βzi . Bell and Doksum (1966, Table 8.1), considered Ci (u) = u . u + (1 − u) exp(θ) Use Hoeffidng’s formula to show that the uniform scores test is locally UMP invariant for testing H0 : β = 0 vs H1 : β > 0. (ii) The odds ratio is defined as ri = Fi /(1 − Fi ). Let r = F/(1 − F ) where F is a continuous baseline df. Show that for the model in (i), ri (y) = exp(θi )r(y) . This is called the proportional odds model. (iii) In the monotone transformation model h(Yi ) = θi + εi with d = 1 and θ1 = α + βzi , set h(y) = log F (y)/[1 − F (y)] for some unknown continuous df F and assume that εi has a logistic distribution. Show that this model is equivalent to the model in (i) and (ii). (g) Random effect proportional hazard. Bell and Doksum (1966) considered eθi u − 1 , eθi − 1 =u, Ci (u) = θi 6= 0 θi = 0 . Sibuya (1968) and Nabeya and Miura (1972, unpublished) showed that Ci F (·) (called the SINAMI model) is a random effects proportional hazard model λ(y|zi ) = ∆i λ(y) with ∆i having a zero truncated Poisson (θi ) distibution. Show that when d = 1 and θi = α + βzi , the uniform scores test is locally UMP invariant for testing H0 : β = 0 vs H1 : β > 0. (h) Same as (f) except (i) Ci (u) = (1−θi )u+θi u2 . Show that the uniform scores test is locally UMP invariant. (ii) Ci (u) = (1 − θi )u + θi um , m ≥ 2 an integer. Find the locally UMP invariant test when m is known. Section 8.4 Problems and Complements 121 15. Monotone regression tests. In the regression framework where we observe (z1 , Y1 ), . . . , e denote (zn , Yn ) with z1 < . . . < zn nonrandom and Y1 , . . . , Yn independent. Let “<” e . . . <F e n is stochastic inequality. A test ϕ(y) of H : Fi = F , 1 ≤ i ≤ n, vs K : F1 < ′ ′ ′ ′ regression monotone if ϕ(y ) ≤ ϕ(y) for all y and y for which i < j and yi ≤ yj imply yi ≤ yj . Show that (a) All monotone tests are rank tests. (b) All monotone tests have monotone power, that is, inf Fi−1 Gi (y) ≤ Fj−1 Gj (y) for i ≤ j, all y, and if ϕ(·) is monotone, then EF ϕ(Y) ≥ EG ϕ(Y) . e . . . <F e n. Note that it follows that monotone tests are unbiased for H vs K : F1 < −1 ′ ′ ′ ′ Hint. Let Yi ∼ Gi , set Yi = Fi Gi (Yi ) , then i < j and Yi ≤ Yj =⇒ Yi ≤ Yj . (c) Let R1 , . . . , Rn bePthe ranks of Y1 , . . . , Yn . If a(i) is nondecreasing in i = 1, . . . , n, n then the test ϕ(y) = 1 i=j zi a(Ri ) ≥ c is monotone. 16. Consider Example 8.3.12. (a) Show that if (X, Y ) has the continuous case density f (x, y), then (Hoeffding’s formula for bivariate data) n 1 n X f (V (ri ) , W (si ) ) o P (R, S) = (r, s) = E n! f (V (ri ) )f2 (W (si ) ) i=1 1 where f1 and f2 are the marginal densities of f and {V (k) } and {W (k) } are the order statistics in independent samples from f1 and f2 . (b) Show that the bivariate normal scores statistic is locally UMP invariant for testing H0 : ρ = 0 vs H1 : ρ > 0 in the bivariate Gaussian copula. You may differentiate inside the expected value in Hoeffding’s formula. Hint. Because of the invariance of the ranks, assume (X, Y ) ∼ N (0, 0, 1, 1, ρ) ≡ fρ . Use n ∂ X ∂ n log fρ Πni=1 fρ . Πi=1 fρ = ∂ρ ∂ρ i=1 (c) Consider the model with g(X) = V + θZ, h(Y ) = W + θZ, where V and W have N (µ1 , σ12 ) and N (µ2 , σ22 ) densities, the df M of Z is arbitrary, V , W , and Z are independent, and g and h are increasing. Show that the bivariate normal scores statistic is locally UMP invariant for testing H : θ = 0 vs K : θ > 0. Hint. For invariant (rank) statistics, we can assume Z ∞ f (x, y; θ) = ϕ(x − θz)ϕ(y − θz)dM (z) . −∞ (d) Show that for the bivariate Gaussian copula, |ρ| = supa,b corr(a(X), b(Y )) where the sup is over monotone functions a and b. 17. Establish (8.3.21). 122 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 18. Establish (8.3.23). 19. Show in Example 8.3.13 that the size α test that rejects H iff X1 ≥ c is (a) UMP for testing H : θ ≤ θ0 vs K : θ ≥ θ0 when Σ0 is known. (b) The Bayes test for the loss function lc (u, φ) and the priors π0 and π1 corresponding to point masses at (θ0 , ηT0 )T . (c) UMP unbiased when Σ0 is unknown. (d) UMP invariant when Σ0 is unknown. 20. Show that X is minimax for quadratic loss when X ∼ Np (µ, σ02 J) where J = diag(1, . . . 1)p×p . 21. Let N have the binomial distribution B(p, 10); given N = n, let X1 , . . . , Xn be i.i.d. N (µ, 1). The observations consist of (N, X1 , . . . , XN ). (i) If p has a known value p0 , show that there exists neither a UMP test nor a UMP unbiased test of H : µ = 0 against µ > 0. (ii) If p is unknown, show that there does exist a UMP unbiased test of H : µ = 0 against µ > 0. 22. Let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ). State the MP test of H : σ = σ0 against a simple alternative (µ1 , σ1 ) for the two cases (i) σ1 > σ0 ; (ii) σ1 < σ0 , and in each case describe the least favorable distribution for µ over the parameter set (µ, σ0 ) : −∞ < µ < ∞ . 23. Let X1 , X2 be positive random variable with density 1 x1 x2 ,σ > 0 . f , 2 σ σ σ Assuming it exists, let δ(X) be the minimum risk scale equivariant estimate of σ, under the loss function (d − σ)2 /σ 2 . (a) Show that R∞ δ(X) = R0∞ 0 u2 f (uX1 , uX2 ) du . u3 f (uX1 , uX2 ) du (b) Suppose there exists a complete sufficient statistic t(X) for the model. Show that δ(X) is itself a sufficient statistic. 24. Let C be a d×d matrix. Suppose γ1 , . . . , γ2 and w1 , . . . , wd satisfy Cwj = γj wj , 1 ≤ j ≤ d. Then γ1 , . . . , γd are called eigenvalues of C and w1 , . . . , wd are Pcalled eigenvectors. See Section B.10. Consider the inner product defined by (x, y) = xi yi and the norm 1 |x| = (x, x) 2 . Let v1 = arg max{aT Ca : |a| = 1} v2 = arg max{aT Ca : |a| = 1, a ⊥ v1 } i ∴ vd = arg max{aT Ca : |a| = 1, a ⊥ [v1 , . . . , vd−1 ]} Section 8.5 123 Notes and λj = vjT P vj , 1 ≤ j ≤ d. (a) Show that v1 , . . . , vd , λ1 ≥ λ2 ≥ . . . ≥ λd are all well defined and that Cvj = λj vj , 1 ≤ j ≤ d. Thus λ1 , . . . , λd are eigenvalues and v1 , . . . , vd are eigenvectors. This is the Courant-Fischer theorem. P (b) When C is the covariance matrix of a random vector X ∈ Rd , the linear combination Lj = vj X is called the jth principal component. Show that Var(Lj ) = λj . bj which corbj , λ The term “principal components” is also used for the empirical v P c respond to , the empirical variance covariance matrix of a sample X1 , . . . , Xn . P P Sometimes the correlation matrix is used in place of and c. P (c) Establish the (a) part for the inner product (x, y) = di=1 Πi xi yi , where Πi > 0 Pd and i=1 Πi = 1. 25. Let X1 , . . . , Xn be i.i.d. discrete random variables with P (X1 = x) = γ(x)θx /c(θ), x = 0, 1, 2, . . . , P x where γ(x) ≥ 0, θ > 0 is unknown, and c(θ) = ∞ x=0 γ(x)θ . Let Y1 , . . . , Yn be i.i.d. random variables having the beta distribution with density Γ(a + b) a−1 y (1 − y)b−1 , 0 < y < 1 , Γ(a)Γ(b) where a > 0 and b > 0 are unknown. Assume that Xi ’s and Yi ’s are independent. Suppose that we observed Zi = Xi + Yi , i = 1, . . . , n. (a) Show that EX1 = θc′ (θ)/c(θ), where c′ (θ) is the first order derivative of c(θ). (b) Obtain a complete and sufficient statistic for the unknown parameter (θ, a, b). (c) Derive a uniformly minimum variance unbiased estimator of the parameter θ. (d) For a given α ∈ (0, 21 ), derive a uniformly most powerful test of size α for H0 : θ ≥ θ0 vs H1 : θ < θ0 , where θ0 > 0 is a known value. 8.5 Notes Note for Section 8.2. (1) The “similar” terminology arose from the notion that if φ was non randomized and corresponded to critical region C, then C was, under H, similar to the sample space X property P (X ) = 1 in having probability P (C) = α, P ∈ P0 , not depending on P . 124 Distribution-Free, Unbiased, and Equivariant Procedures Chapter 8 Notes for Section 8.3. (1) For general definitions of abstract groups and their properties we refer to, for instance, Birkhoff and MacLane (1998). (2) Lehmann (1953) proposed G = F θ . Savage (1956,1980) noted that if F is replaced by S = 1 − F one obtains (8.3.16). He called this Lehmann alternatives. (3) A formulation of this type is in Wijsman (1990). Chapter 9 INFERENCE IN SEMIPARAMETRIC MODELS As we indicated in Section I.1, our concern in this chapter will be with methods for inference about parameters in non- and semiparametric models whose qualitative large sample behaviour is like that of the corresponding √ procedures for regular parametric models. In the i.i.d. case, estimates converge at rate n to Gaussian distributions, and the behaviour of tests is governed by that of Gaussian processes. Section 9.1 introduces theory for estimates based on maximizing modified and empirical likelihoods in important semiparametric models such as linear models with stochastic covariates, biased sampling models, Cox proportional hazard models with censoring, and independent component analysis models. Section 9.2 and 9.3 deal, respectively, with asymptotic normality and efficiency for estimates. Section 9.4 similarly deals with asymptotic inference for tests in semiparametric models. We need to use the machinery of Chapter 7 to extend the techniques of Chapters 5 and 6. In Section 9.5 we show how the powerful notions of contiguity and local asymptotic normality can clarify power and information bound calculations. 9.1 Estimation in Semiparametric Models In Sections 9.1, 9.2, and 9.3, we want to primarily address the question of how to optimally estimate regular parameters in semiparametric models. We have defined semiparametric models by example and will continue to do so. Loosely, they are models which are naturally parametrized by both Euclidean and infinite dimensional parameters; or where distributions in the model are subject to restrictions. Nonparametric models are loosely described as ones where every probability distribution is at least approximable by distributions in the model. We will consider parametric and nonparametric models to be special cases of semiparametric models. 9.1.1 Selected Examples We have already considered a number of semiparametric models: 125 126 Inference in semiparametric models Chapter 9 Section 3.5: The symmetric location model (3.5.1) Xi = µ + εi , εi ∼ i.i.d. F, µ ∈ R , where the natural parameter is (µ, F ) with F the error distribution function. The gross error model 1 f (x) = F ′ (x) = (1 − λ) ϕ(x/σ) + λh(x) σ of Section 3.5.3 parametrized by (µ, λ, σ, h) where h is a density and µ is unidentifiable without restrictions on h(·) such as h(x) = h(−x). Section I.1.1 and 8.3: The two-sample model of Example 8.3.10 where we observed two independent samples with distributions F and G, is naturally parametrized by (F, G), which is fully nonparametric, and the hypothesis model is {(F, G) : F = G}, which is semiparametric. We pursue an important model which was introduced in Examples 6.2.1 and 6.2.2. Example 9.1.1. Semiparametric linear models with stochastic covariates. Here, we observe (Z1 , Y1 ), . . . , (Zn , Yn ) i.i.d. as (Z, Y ), Z ∈ Rp , Y ∈ R, with Y = α + ZT β + σε . Various semiparametric versions of this model are in use. Here are two: (a) Z and ε are independent, ε ∼ F, Z ∼ H, F ∈ F , H ∈ H, and F , H are general, and Σ ≡ VarH (Z) ≡ EH (Z − EH (Z))(Z − EH (Z))T is nonsingular. (b) The joint distribution Q of (Z, ε) is arbitrary save that Eε2 < ∞, E(ε|Z) = 0, and Var(Z) is nonsingular. Thus we can think of model (a) as parametrized by (α, β, σ, F, H) and model (b) by (α, β, σ, Q). It is easy to see that in neither of these models is σ identifiable and α is not in the first as well (Problem 9.1.1). However, we have already seen in Example 6.2.2 that, if in addition to assumption (a) above, we assume F with density f symmetric about 0, then β and α are both identifiable and we exhibited estimating √ equations for (α, β). In fact, symmetry is only required for identifiability of α, and β is n consistently estimable b = β + OP (n− 12 ) for model (a) by using the estimating equations of in the sense that β Example 6.2.2 with, for instance, f0 the logistic √density (Problem 9.1.1(d)). In model (b), the LSE of Example 6.2.1 is n consistent. To show this, write b =β+Σ b −1 1 β n n X i=1 (Zi − Z̄)(Yi − ZT β) (9.1.1) P b ≡ 1 n (Zi − Z̄)(Zi − Z̄)T . To check √n consistency, note that by consistency where Σ i=1 n of continuous functions of sample moments P b −1 → Σ−1 . Σ Section 9.1 Estimation in Semiparametric Models 127 Further, n 1X (Zi − Z̄)(Yi − ZT β) n i=1 = n n 1X 1X (Zi − E(Z))σεi − (Z̄ − E(Z)) εi n i=1 n i=1 1 = OP (n− 2 ) + OP (n−1 ) √ since E(Z − E(Z))ε = E(Z − E(Z))E(ε|Z) by B.1.20 and E(ε|Z) = 0, so that n consistency of βb holds in model (b). ✷ √ In these two semiparametric linear models, the construction of n consistent estimates of the interesting Euclidean parameter β proved fairly simple. However, it is not clear what alternative and possibly better procedures there might be. That is, we need to develop a concept of efficiency for semiparametric models that can be used to determine what a “best” estimate is. This will be done in the context of this example in Section 9.3. Example 9.1.2. Biased Sampling. Stratification. It is often the case that, while we want to obtain information about a population df F , or at least features of F such as the mean and variance, we are only able to observe a sample from P(F,G) , where G is a finite or infinite dimensional parameter. We begin by considering the essentially nonparametric model of biased sampling from a single population. Here, if the distribution F of Z has density (discrete or continuous case) f , observations are not on Z but on X from w(x)f (x) (9.1.2) W (F ) R where W (F ) = w(x) dF (x), and w(·) is assumed known. An important special case is “length biased” sampling where X ∈ R+ , X ∼ F , and w(x) = x. This arises classically if we are interested in, say, the proportion f (2) of twochild families in a city with a known total number of households N . If we could sample households with at least one child, and Z is the number of children in a sampled household we could estimate f (2) = P [Z = 2] directly. Suppose instead we sample n children at random and consider the proportion pF (2) of children coming from two-child families in this group. If the city is large, f (j) is the proportion of j child families in the city, j = 0, 1, 2, . . ., and X is the number of children in a sampled household, then, pF (x) = pF (2) = P [X = 2] = 2f (2) / ∞ X j f (j) . (9.1.3) j=1 In this case (9.1.2) holds with w(x) = x. Based on observations from pF (·) in (9.1.2), is F identifiable? This is true iff w(x) > 0 whenever f (x) > 0. How do we identify F in this case? We claim that (Problem 9.1.2) pF (x)/w(x) . f (x) = R 1 w(x) dPF (x) (9.1.4) 128 Inference in semiparametric models Chapter 9 Next we consider stratified populations where biased sampling may occur within each strata. A stratified population S is made up of subpopulations S1 , . . . , Sk called strata. We are interested in the density f (z) of a random draw Z from S, but sample by first randomly selecting an index I with corresponding strata SI . This situation arises if S is the set of all patients in a county and S1 , . . . , Sk are the patients in the k hospitals and clinics in the county. Length biased sampling may occur within each strata. In this case we observe X = (I, Y ) where I = 1, . . . , k, P [I = j] = λj , 1 ≤ j ≤ k, are assumed known, and given I = j, Y has density wj (y)f (y) (9.1.5) Wj (F ) R where Wj (F ) = wj (x)dF (x). Again, w1 , . . . , wk are assumed known. In stratified sampling λj is the probability that Y will be from the jth strata Sj and wj (x) = 1(x ∈ Sj ). Pk We claim that F is identifiable iff j=1 λj wj (x)f (x) > 0 (Problem 9.1.2). ✷ pF (y|j) = The next two examples come from the area of survival analysis, the study of lifetimes of subjects in clinical trials, experimental animals, pieces of equipment, etc.. The models in these examples apply more generally to experiments that involve “time to event data.” A description of lifetime distributions more suited to the subject than densities or distribution functions is the hazard rate. Definition 9.1.1. The hazard rate λ(·) of a nonnegative, continuous variable T with density f and distribution fucntion F is defined for t ≥ 0 by λ(t) = lim {P [t ≤ T ≤ t + ∆|T ≥ t]/∆} ∆→0 = lim ∆→0 f (t) d F (t + ∆) − F (t) = = (− log S(t)) ∆(1 − F (t)) 1 − F (t) dt where S(t) = P [T > t] = 1 − F (t) is defined as the survival function. It follows that, f (t) = λ(t) S(t) = λ(t)e− Rt 0 λ(s)ds ≡ λ(t)e−Λ(t) (9.1.6) where Λ(t) is the cumulative hazard function. Note that Λ(t) = − log S(t) . (9.1.7) The discrete version of the hazard rate is defined for a discrete variable T as λ(t) ≡ P [T = t|T ≥ t] = P [T = t] . P [T ≥ t] By definition, λ(t) is 0 unless P [T = t] > 0. The analogue of (9.1.6) becomes P∞ Lemma 9.1.1. If j=1 P [T = tj ] = 1, t1 < t2 < . . ., then k−1 k−1 P [T ≥ tk ] = Πj=1 (1 − λ(tj )) , P (T = tk ) = λ(tk )Πj=1 1 − λ(tj ) . (9.1.8) (9.1.9) Section 9.1 129 Estimation in Semiparametric Models k−1 Proof. P (T ≥ tk ) = Πj=1 n P [T ≥tj+1 ] P [T ≥tj ] o . Next note that P [T ≥ tj+1 ] = P [T ≥ tj ] − P [T = tj ] . The second equation follows from (9.1.8). ✷ We will call (9.1.6) and (9.1.7) the continuous case approximations to (9.1.9). In park−1 ticular, exp{−Λ(t)} approximates Πj=1 (1 − λ(tj )), and vice versa. Our next example introduces two different types of biased sampling. Example 9.1.3. Censoring and truncation. A very common type of data in many contexts is censored or truncated data. In (right) censored sampling we wish to observe “times” T1 , . . . , Tn pertaining to n units sampled from a population. Instead, events occurring at times C1 , . . . , Cn may prevent some of T1 , . . . , Tn from being observed. These censoring times are such that (T1 , C1 ), . . . , (Tn , Cn ) are independent and we observe Xi = (Yi , δi ) where Yi = min(Ti , Ci ), δi = 1(Ti ≤ Ci ) , 1 ≤ i ≤ n . That is, we observe either the time of interest T or the censoring time C and are told which one is observed. A classical situation where this occurs is when T is the time an individual with some disease dies after entering a study of duration L. If T ≤ L, the survival time is observed; if T > L, the individual is said to be lost to follow up. In this case C = L. Subtler is the case where an individual enters the study at a random time T0 assumed independent of the survival time from entry. We then observe survival time T if T ≤ L − T0 and C = L − T0 otherwise, which is assumed independent of T , and indeed 1(T ≤ C) is also observable. An extensive discussion is given for instance in Andersen, Borgan, Gill and Keiding (1988,1993) and Kalbfleisch and Prentice (2002). If we assume that T and C are independent with distributions F, G which are arbitrary and if F, G have densities f, g, then the distribution of (Y, δ) is given by Z y f (s)Ḡ(s)ds (9.1.10) P [Y ≤ y, δ = 1] = −∞ Z y P [Y ≤ y, δ = 0] = g(s)F̄ (s)ds −∞ where F̄ = 1 − F , Ḡ = 1 − G are the appropriate survival functions. Thus, the model is parametrized by (F, G). Can (F, G) be fully identified? The answer is yes, under mild conditions, as we shall see later in the continuation (Example 9.1.9) of this example. Another important type of biased sampling is truncation. Here, the distribution F of T is, as usual, what is wanted, but what we do observe is Y where Y has the conditional distribution of T given T ≥ M where M , assumed independent of T , is an observed threshold. Thus, if Y has density f and df F and M has density g and df G, then the density of X = (M, Y ) is p(m, y) = g(m)f (y)1(y ≥ m) . F̄ (m) (9.1.11) 130 Inference in semiparametric models Chapter 9 An important example is truncation in astronomical data where the luminosities L of stars are available only if their (visual) brightness B is above a minimal detection level since brightness depends both upon the luminosity and the distance to the star. B which is a function of the distance is genuinely random, and, under a hypothesis of homogeneity of star types in any given direction, can be assumed independent of L — see Babu and Feigelson (1996) for instance. The issue of identifiability of F and G arises and again the answer is affirmative under natural conditions — see Example 9.1.9. ✷ Example 9.1.4. The Cox proportional hazard model. Analogues of the Gaussian linear model that were meant to model event times T were initially built around the exponential, E(λ), family of distributions. Specifically, given a vector of covariates Z which might include factors such as age, weight, and treatment indicator, the response lifetime variable T was modelled so that the natural parameter, the conditional hazard rate λ(t|Z = z), was a function of Z and a parameter β. Since λ is positive, a specification such as λ(t|z) = exp{zT β} is natural. If, for instance, we have two groups, z = (z1 , z2 )T , z1 ≡ 1, and z2 = 0 or 1 as the observation comes from group 1 or 2, then this is just a model for two samples from arbitrary exponential distributions. This is a generalized linear model (Section 6.5) with link function g(µ) = − log µ , since µ = 1/λ is the mean of E(λ). But the E(λ) family tends to be a poor model for lifetimes that experience wear, in view of its memoryless property: the conditional distribution of T − t given T ≥ t is the same as the marginal distribution of T . The E(λ) family is characterized by λ(t) ≡ constant. Cox (1972) proposed an important semiparametric generalization of this model. He introduced an unknown continuous, positive baseline hazard rate λ(t) on [0, ∞) and then postulated the conditional hazard rate of the distribution of the nonnegative continuous variable T given Z = z to be of the form λ(t|z) = λ(t) exp{zT β} . (9.1.12) This Cox (β, λ) model is in widespread use throughout biostatistics in part because of the interpretation of the coefficients βj : By writing the model as log λ(t|z) = Σdj=1 βj zj + log λ(t), we see that βj is the change in the hazards on the log scale as zj is perturbed one unit. More generally, we consider λ(t|z) = λ(t)r(z, β) , (9.1.13) where r > 0 is a known function. The model as specified by its conditional hazard rates and the marginal distribution of Z is evidently semiparametric. Although λ(·) is arbitrary, just as with the linear model (a) of Example 9.1.1, it far from covers all λ(t|z) . It, in fact, contains a powerful assumption corresponding to the additive treatment effect assumption of the linear model: the relative effect of Z on the distribution of T is λ(t|z)/λ(t|0) = Section 9.1 131 Estimation in Semiparametric Models r(z, β)/r(0, β), which doesn’t depend on λ or t. That is, (9.1.13) is a proportional hazard rate model. The first question we ask is whether β is identifiable. We will give a full answer to this question later. For the time being, we note identifiability in an important special case. Suppose Z is discrete, Z ∈ {z0 , . . . , zk }, P [Z = zj ] > 0, 0 ≤ j ≤ k, and that (9.1.12) holds. Write rj (β) ≡ r(zj , β) and let S(t) ≡ e−Λ(t) be the survival function corresponding to λ and assume r(z, 0) ≡ 1. Then, P [T > t|Z = z j ] = S r j (β ) (t) , (9.1.14) a model essentially proposed for the two-sample case by Lehmann. See Example 8.3.10. If Pn Z is discrete as above and we observe (Z1 , T1 ), . . . , (Zn , Tn ), let Nj = i=1 1(Zi = zj ) and let n 1 X Sbj (t) ≡ 1(Ti > t, Zi = zj ), j = 0, . . . , k Nj i=1 be the empirical conditional survival functions. By the Glivenko-Cantelli theorem, as n → ∞, P sup |Sbj (t) − S rj (β) (t)| −→ 0 t and β is identifiable √ iff β → (r0 (β), . . . , rk (β)) is 1–1. In fact, r0 (β), . . . , rk (β) and thus β can be estimated n consistently in many ways — for instance, by picking a value t such that 0 < Sb0 (t) < 1 and estimating rj (β) by rbj,t = log Sbj (t) . log Sb0 (t) (9.1.15) Which is the best choice of t? It turns out there is a “best” estimate of β in this model but it is not from (9.1.15). We shall develop it in Section 9.3. ✷ Example 9.1.5. The Independent component analysis (ICA) model. Principal components. As we have seen in Appendix B.6 of Volume I, the multivariate Gaussian N (µ, Σ) model can be written as Xd×1 = AZ + µ where Z is a vector of i.i.d. N (0, 1) variables and A and µ are unknown. We know that this A may not be identifiable since Var (X) = AAT ≡ Σ and E(X) = µ identify the distribution and A may have many more parameters than Σ and µ. We will develop a particular choice of A which is identifiable by using the representation (see Appendix B.10) d X λj vj vjT Σ = OΛOT = j=1 where O = (v1 , . . . , vd )T , which is orthogonal and d × d, Λ is the diagonal matrix of eigenvalues λ = (λ1 , . . . , λd ) of Σ, with λ1 ≥ . . . ≥ λd > 0. See Problem 8.3.24. 132 Inference in semiparametric models Chapter 9 Here λ1 , . . . , λd have corresponding orthonormal eigenvectors, v1 , . . . , vd with |v| = 1, vi ⊥ vj if i 6= j and Σvj = λj vj , 1 ≤ j ≤ d . (9.1.16) If the vj are chosen so that the first nonzero element of vj is positive, they are uniquely (0) defined if λ1 > . . . > λd > 0. If there are ties between some of the λ’s, let λ1 > . . . > (0) (0) λk be the ordered distinct eigenvalues, then the set of {vj : λj = λi }, 1 ≤ i ≤ k, is the 1 identifiable parameter. Viewed in this way, O and Λ are identifiable and so is A = OΛ 2 and we have the ICA model 1 X = OΛ 2 Z + µ . (9.1.17) The linear combinations v1T X, . . . , vdT X of X’s are the principal components of X with vjT X representing the contribution of vj X to the representation X=µ+ d X (vjT X) vj . j=1 P v1 v1 = sup Var(aT X) : |a| = 1 . That is, the vector a ≡ v1 of v1T a P provides the linear combination ai Xi with the largest possible variance subject to |a| = 1. P See Section B.10.1.2, Example 8.3.11 (continued), and Problem 8.3.24. Moreover, v2 has the same property among the set of vectors a orthogonal to v1 , and so on. v2T Here v1T P There are a number of natural semiparametric generalizations of sampling from this model: X1 , . . . , Xn i.i.d. Nd (µ, Σ). The most important appears to be X = AZ where A is an unknown parameter and Z = (Z1 , . . . , Zd )T with the Zj independently distributed and with the distribution Qj of Zj arbitrary. Thus, if Q = (Q1 , . . . , Qd ), P = {P(A,Q) : P(A,Q) is the distribution of X = AZ with Q “arbitrary”} . (9.1.18) It is easy to see that the Qj are identified at most up to location and scale. What is interesting, but certainly not obvious, is that if all the Qj are not Gaussian, then, as we shall see later, A is identified up to a permutation and scale change of its columns. This is equivalent to saying that any parameter q(A) such that q(AD) = q(A), q(Aπ) = q(A) for all diagonal matrices D and permutation matrices Π is identifiable. In the engineering literature algorithms estimating such parameters have proven very effective in a number of important problems — see Hyvarinen, Karhunen and Oja (2001). This methodology is referred to as Independent component analysis (ICA). The simplest problem leading to ICA corresponds to the situation where Z1 , . . . , Zd represent the output of independent sources which are superimposed when received by d different observers, X1 , . . . , Xd . The task is to separately identify the signal coming from the sources. In practice, the Z1 , . . . , Zd are often time series, as are the X1 , . . . , Xd , and the Xj are observations taken at different times tj , j = 1, . . . , n. However, the methods developed in the i.i.d. case work in the time series situation as well. The only adjustment needed is to the estimates of the variance of our estimates of A. ✷ We now turn to methods of estimation in semiparametric models. Section 9.1 9.1.2 Estimation in Semiparametric Models 133 Regularization. Modified Maximum Likelihood There is a fundamental difficulty when trying to apply the maximum likelihood method or the plug-in approach to some non- and semiparametric models. The difficulty occurs for models P that do not contain the empirical probability Pb and where the parameter of interest, say ν(P ), is not defined at Pb . In the case of the MLE, this means that the maximum of the likelihood is not assumed. Here is an example. Example 9.1.6. We illustrate with X1 , . . . , Xn i.i.d. P ∈ P = {All probabilities P with continuous densities on R}. If we parametrize P by p, the density of P , then the likelihood for an observed vector x = (x1 , . . . , xn )T is Lx (p) = Πni=1 p(xi ) . Nothing prevents us from making p(xi ) arbitrarily large for all i = 1, . . . , n. For instance, consider the distributions Πni=1 pk (xi ) in P with, for k = 1, 2, . . ., n 1X 1 x − xi pk (x) = φ n i=1 σk σk where φ is the standard normal density and σk → 0 as k → 0. Thus, sup Lx (p) = ∞ P is not assumed for any p and this approach does not yield an estimate of the infinite dimensional parameter ν(P ) = p, nor simple one dimensional parameters such as ν(P ) = R p(x)dx. ✷ In this example, it is intuitively clear that sequences {pk } which make Lx (p) → ∞ conP verge weakly to Pb = n−1 ni=1 δxi , the empirical probability distribution, which does not have a density. From a measure theoretic point of view, the members of P are dominated by Lebesgue measure, but the “maximum likelihood” Pb is with probability 1 undominated. Kiefer and Wolfowitz (1956) proposed an approach which they called nonparametric maximum likelihood which leads to choosing the empirical distribution as the estimate of P , when we enlarge P to include all discrete distributions. This approach only makes sense in measure theoretic terms and we do not present it, but we shall discuss modified maximum likelihood methods related to empirical likelihood (Owen (1988, 2001)) in Section 9.1.3. Let P be an arbitrary specified semiparametric model and let M0 be the union of the closure of P with the set of all discrete distributions. Let P̄0 be the closure of M0 under (say) weak convergence. Then, given observations x1 , . . . , xn of i.i.d. X1 , . . . , Xn , consider Px = {P ∈ P¯0 : P {x1 , . . . , xn } = 1, P corresponds to P ∈ P} , the set of all members of P̄0 with support {x1 , . . . , xn } that satisfy the model restrictions of P. Now Px can be parametrized by o n p = (p1 , . . . , pn )T ∈ p : pi = P [X = xi ], 1 ≤ i ≤ n, some P ∈ Px , (9.1.19) 134 Inference in semiparametric models Chapter 9 a subset of the simplex in Rn . We identify Px with this set. In most applications, Px is a smooth parametric model indexed by p(·; θ), θ ∈ Rd , d ≤ n, and maximum likelihood over Px is of the usual type. That is, n n X 1 X θb = arg max p(xi ; θ) = 1 . log p(xi ; θ) : n i=1 i=1 Here p(·; θ) is determined by the restrictions that define the distributions in P. Sometimes, we can only require 0 < P {x1 , . . . , xn } < 1 with 1 − P {x1 , . . . , xn } being assigned to ∞. This happens, as we shall see, in censored data problems. We define the empirical or modified likelihood estimate of P for the model P by Pbe = arg max{Πni=1 pi : P ∈ Px } . (9.1.20) We will refer to Pbe as the nonparametric MLE (NPMLE). The NPMLE of a parameter ν(P ) is defined by νb = ν(Pbe ) when νb(Pbe ) exists. Example 9.1.7. We illustrate (9.1.20) with P = {All P with continuous densities on R}. Then P̄0 = M ≡ {All distributions on R}. When there are no ties among {x1 , . . . , xn }, Px is naturally parametrized by Θ ≡ {(p1 , . . . , pn−1 ) : pj ≥ 0, n−1 X j=1 pj ≤ 1, 1 ≤ j ≤ n − 1} . We can write Lx (p) = Πni=1 pi Pn−1 with pn = 1 − j=1 pj . Because this is a multinomial likelihood with one observation in each category, it is uniquely maximized over Θ by p1 = . . . = pn = n−1 . See Example 2.2.8. The NPMLE of P is indeed Pb , the empirical distribution. ✷ Remark 9.1.1. Ties. Categorical data. Ties occur when X is discrete, when there is roundoff, and when data are collected in categories (e.g. age groups or geographic strata). (0) (0) In the case of ties, we let {x1 , . . . , xm } be the distinct xi ’s or indicators of the categories {1, . . . , m} and let (0) Px = {P ∈ P̄0 : P {x1 , . . . , x(0) m } = m X pi = 1} . j=1 This Px is parametrized by (0) p = (p1 , . . . , pm )T ∈ {p : pj = P (X = xj ), 1 ≤ j ≤ m, some P ∈ Px } . The empirical likelihood of X1 , . . . , Xn in the case of ties or categorical data is n j Lx (p) = Πm j=1 pj , Section 9.1 135 Estimation in Semiparametric Models Pn (0) where nj = i=1 1(xi = xj ). In the case of two categories (e.g. treatment and control), m = 2, p2 = 1 − p1 , and Lx (p) = pn1 1 (1 − p1 )n−n1 is the Bernoulli likelihood. For tied or categorical data, the estimate of P for the model P is Pbe = arg max{ m Y j=1 n pj j : P ∈ Px } . When P̄0 is {all distributions on R} the solution (NPMLE) is pbj = nj /n, 1 ≤ j ≤ m. See Problem 2.2.30. ✷ There are many cases where the NPMLE of a parameter θ can be obtained as the solution of a generalized estimating equation of the form Q(θ, Pb ) = 0, where θ solves Q(θ, P ) = 0, and Q(θ, P ) is not necessarily linear in P . That is, Q may not satisfy Z Q(θ, P ) = v(θ, x)dP (x) for any function v. Stratification is an example. Example 9.1.8. Biased sampling. Stratification (Example 9.1.2). Here P is the collection n of all P with densities Π pF (xi ) where pF satisfies (9.1.2), that is, 1 w(xi )fi Px = (p1 , . . . , pn ) : pi = W (F ) Pn where W (F ) = i=1 w(xi )fi and the parameter of interest is θ = (f1 , . . . , fn )T . Here (p1 , . . . , pn ) range freely over the n simplex. As we saw in Example 9.1.7, the maximum is attained for pbi = n−1 , 1 ≤ i ≤ n, and, by inspection, a solution to pbi = w(xi )fi /W (F ) is n X −1 b fi = w (xi )( w−1 (xk ))−1 . k=1 We can write this empirical likelihood estimate fbe (x) of f (x) in terms of the nonparametric empirical probability Pb as Z −1 b b dFe (x) = w (x)dP (x)/ w−1 (y)dPb(y) in agreement with (9.1.4). We next consider the more general stratified model with X = (I, Y ). Here p(I,Y ) (j, y) = λj wj (y)f (y) , Wj (F ) 1 ≤ j ≤ k, y ∈ R. It turns out that Px is unsatisfactory since the observed x1 , . . . , xn force a coupling of the values of I and Y which is incompatible with the model distribution. However, we can define a modified likelihood naturally as follows: 136 Inference in semiparametric models Chapter 9 Suppose wj (y) > 0 for all y. Then the possible support of a member of P which puts positive mass on all the observed (Ii , Yi ), i=1,. . . ,n, is {1, . . . , k} × {y1 , . . . , yn }. The point (a, yb ) in the support of P ∈ P is assigned probability wa (yb ) p(a, yb ) = λa fb ; Wa (F ) Wa (F ) = n X fb wa (yb ). b=1 The condition wj (y) > 0 for all j is sufficient to make F identifiable. The modified likelihood is Y wa (Yb ) N λa Na+ fb +b Wa a,b Pn Pk where Nab = 1(Ib = a), Na+ = = b=1 Nab , N+b P a=1 Nab , Wa ≡ Wa (F ). The modified likelihood equations under the condition b fb = 1 introduced through a Lagrange multiplier γ yield ba = Na+ λ n for a = 1, . . . , k k N+b X Na+ w (Y ) = γ . − ca a b fbb W a=1 Multiplying by fbb and summing we get n=γ+ n k X Na+ X a=1 Hence a solution is γ = 0 and N+b Na+ a=1 W ca wa (Yb ) fbb = Pk ca W b=1 wa (Yb )fba = γ + n. . (9.1.21) Then, for j = 1, . . . , k, summing wj (Yb )fbb we obtain the following estimate of Wj (F ), cj = W n X b=1 k X Na+ wj (Yb )N+b ( wa (Yb ))−1 . c a=1 Wa If we let Pb1 be the empirical df of I and Pb2 that of Y we can rewrite (9.1.22) as Z Z wa (y) b c wj (y)( Wj − dP1 (a))−1 dPb2 (y) = 0. c Wa (9.1.22) This is an equation yielding a generalized estimating equation or generalized M estimate c ≡ (W c1 , . . . , W ck )T . That is W c = W (Pb) where P is the distribution of X, Pb is the W empirical probability, and W (P ) solves Q(W , P ) = 0. Here Qk×1 is the function Z Z −1 wa (y) dP1 (a) dP2 (y) (9.1.23) Q(W , P ) ≡ W − w(y) Wa Section 9.1 137 Estimation in Semiparametric Models where P is the joint probability distribution of (I, Y )T and P1 , P2 are the respective marginals c exists and is of I and Y . It may be shown, if 0 < P [δ = 0] < 1 (Problem 9.1.7), that W c unique with probability tending to 1. Once we have computed W we have our NPMLE Pbe of P bi wi (y)dFbe (y) dPbe (i, y) = λ Z −1 wa (y) b dFbe (y) = dPb2 (y) dP1 (a) ca W and we can, in principle, estimate any parameter ν(P ), P ∈ P, by ν(Pbe ). Example 9.1.9. Censoring and truncation (Example 9.1.3). likelihood theory for censoring. Introduce the hazard rates, λF = f , F̄ λG = ✷ We develop the empirical g Ḡ and note that we can write the joint density of (Y, δ) (for measure theory afficionados, with respect to the appropriate measure!) as p(F,G) (y, δ) = λδF (y)λ1−δ G (y) exp{−(ΛF (y) + ΛG (y))} (9.1.24) To write the likelihood for Px , where Xi = (Yi , δi ), Yi = min(Ti , Ci ), and δi = 1(Ti ≤ Ci ), we take as possible values for Y the ordered Yi denoted by y(1) , . . . , y(n) . The corresponding values of δ are (δ(1) , . . . , δ(n) ), indicating whether Y(i) is censored or not. We will use the discrete form of the hazard rates of T, C at y(i) calling them λF i , λGi . Consider y(i) , δ(i) , i = 1, . . . , n, coming from Xi = xi , 1 ≤ i ≤ n, as fixed values, and λF i , λGi as unknown parameters. Then, for X = (Y, δ) ∼ P ∈ Px , using (9.1.8) and (9.1.9), and assuming P (Ti = Ci ) = 0, 1 ≤ i ≤ n, pX (y(j) , δ(j) ) = P δ [T = y(j) ]P δ [C ≥ y(j) ] · P 1−δ [C = y(j) ]P 1−δ [T ≥ y(j) ] j−1 Y = [λF j k=1 ·[λGj (1 − λF k )]δ [ j−1 Y k=1 (1 − λGk )] j−1 Y (1 − λGk )]δ k=1 1−δ [ j−1 Y k=1 (1 − λF k )]1−δ (9.1.25) where we write δ for δ(j) . We make a further modification of the likelihood by noting that under our assumptions the supports {t1 , . . . , tn } and {c1 , . . . , cn } of the empirical versions of F and G are disjoint. Then, with εij ≡ 1(Yi = y(j) ) the log empirical likelihood for a sample X1 , . . . , Xn is lx (λF , λG ) = n X n X i=1 j=1 + j−1 X k=1 εij {(δ(j) log λF j + (1 − δ(j) ) log λGj ) [log(1 − λF k ) + log(1 − λGk )]} . (9.1.26) 138 Inference in semiparametric models Chapter 9 Let Y(L) be the largest y(i) with δ(i) = 1. For i > L, λF i does not appear in the sum. We have no information about the distribution of times larger than the largest uncensored time other than an estimate of the survival function at T(L) P. nFor i ≤ L, we obtain after some algebra (Problem 9.1.14), if there are no ties, that is, i=1 εij = 1 for all j, that the maximizer of (9.1.26) is bF i = λ δ(i) . n−i+1 (9.1.27) If δ(i) = 0, then Pb [T = y(i) ] = 0. Using (9.1.9), we arrive at the classical Kaplan-Meier estimate of the survival function S(y) = P (T > y) for y ≤ Y(L) , SbKM (y) = Π{(1 − δ(i) ) : y(i) ≤ y} . n−i+1 (9.1.28) Remark 9.1.2. 1) When there are no ties, if Pb corresponds to SbKM , then Pb [T > Y(L) ] = 0 iff L = n. 2) The formula (9.1.25) can be extended to the case where P is discrete to begin with. That is, there are ties but we still suppose the supports of T and C are disjoint or, as Lawless (1982) views it, apparent ties between censoring times and deaths are broken by moving censoring times infinitesimally to the right. 3) If we replace (9.1.25) by its continuous approximation, (9.1.24), formally given by pX (y(j) , δ(j) ) = λδF j λ1−δ Gj exp{− j X (λF k + λGk )} , (9.1.29) k=1 bF j (Problem 9.1.13), but a new expression, we arrive at the same formula for λ SbN A (y(j) ) = exp − j X k=1 bF k λ (9.1.30) as an estimate of S. This is the Nelson-Aalen estimate which we will discuss further later. To study identifiability of λF , λG , F , and G, consider P defined as all P of the form (9.1.24) and note that the unique maximizer over P of, (Problem 9.1.13) Z K(P(F,G) , P ) ≡ log p(F,G) (y, δ)dP (y, δ) for G fixed and 0 < P [δ = 0] < 1 is given by λF (P )(y) ≡ p1 (y) P [Y ≥ y] where p1 (y) = (9.1.31) d P [Y ≤ y, δ = 1] . dy Section 9.1 139 Estimation in Semiparametric Models An analogous formula holds for G. From (9.1.31) and its analogue for G, we get F, G such that P(F,G) = P . Hence (F, G) given by (9.1.31) and its G analogue is the unique maximizer of K and we have identifiability if 0 < P [δ = 0] < 1. Here the condition P [δ = 0] < 1 is necessary for identifiability of F while p1 (y) is not defined if P [δ = 1] = 0. If P [δ = 0] = 0, then G cannot be identified. Note that these identifiability results are density analogues of Problem 1.1.10, identifiability for the discrete case. Truncation is easily handled by the same hazard rate parametrization. f (y) 1(y ≥ m) F̄ (m) = g(m)λF (y) exp{ΛF (m) − ΛF (y)} 1(y ≥ m) . p(M,Y ) (m, y) = g(m) The NPMLE makes and if y(1) < . . . < y(n) , bF (y) = 0, λ (9.1.32) y∈ / {y1 , . . . , yn } bF (y(j) ) = (Nj − j)−1 λ Pn where Nj = k=1 1(Mk ≤ y(j) ), provided Nj > j. That is, we can only estimate b λF (y(j) ) for j ≥ J with J = first k such that y(k) 6= Mlk where ylk ≡ y(j) . Equivalently bF is defined with probability we can only estimate F (y) for y ≥ y(J) . We deduce that λ tending to 1 if F̄ (y) > 0 for all y. Moreover, f can be identified by λF (y) = pY (y)(P [M ≤ y] − P [Y ≤ y])−1 and if G(m) > 0 for all m, g can be similarly identified. We can also obtain b̄ (y) = Π 1 − λ bF (y(j) ) : y(j) ≤ y F b̄ (y ) . b (y )F Pb [{y }] = λ F (j) F (j) (j) We leave details of these results to the reader (Problem 9.1.8). (9.1.33) (9.1.34) ✷ Although empirical likelihood gives simple answers for nonparametric censoring and truncation, survival analysis models with no covariates, it is much less satisfactory for more complicated models such as that of Cox because of the awkward formula (9.1.9). Simpler solutions are obtained by a natural approximation to the empirical likelihood, yet another modified likelihood, suggested by (9.1.6) — but whose real foundations lie in counting process theory — see Aalen (1978), Andersen et al. (1993), and Kalbfleisch and Prentice (2002)(3). Replace, even in the discrete case, if T is a survival time, p(t) by λ(t) exp{−Λ(t)}, where Λ is the cumulative hazard rate. We illustrate the method in Example 9.1.10. The Cox model (Example 9.1.4). We enlarge P of the proportional hazard Cox model to include all distributions of (Z, T ) with T discrete and with hazard rates, given Z = z, λ(t|z) = r(z, β)λ(t) 140 Inference in semiparametric models Chapter 9 where λ(t) and λ(t|z) are defined as in (9.1.8) and (9.1.13). Here Z has density (discrete or continuous) h(·). We apply the empirical likelihood approach to this model. Given x1 = (z1 , t1 ), . . . , xn = (zn , tn ), we can parametrize Px by θ ≡ (h1 , . . . , hn , λ1 , . . . , λn , β) with hi = h(zi ), λ(ti |zi ) ≡ λi r(zi , β), 1 ≤ i ≤ n , where hi ≥ 0, λi ≥ 0 : 1 ≤ i ≤ n, and β ∈ Rd . Using (9.1.9), we obtain the modified likelihood, Lx (θ) = Πni=1 hi λi r(zi , β) Π{(1 − r(zi , β)λj ) : tj < ti } . j (9.1.35) We assume that h = (h1 , . . . , hn )T is not related to (β, λ), in which case we can treat Πni=1 hi as a constant c. See Example 6.2.1. Or we can replace hj with its NPMLE b hj = nj /n from Remark 9.1.2, in which case Πb hj is a constant c. Thus we replace Πni=1 hi with c and θ with (λ, β) in what follows. Our strategy will be to first fix β and find b λ(β) = arg max Lx (β, λ) . λ b Next consider lx (β) = Lx β, λ(β) , which is called the profile empirical likelihood. The b = arg max lx (β), and λ( b β). b Maximizing Lx (θ) with respect profile NPMLEs are now β to λ for β fixed, we get, solving ∂ log Lx (θ)/∂λk = 0, n X 1 r(zi , β) = 1(tk < ti ) . λk 1 − r(zi , β)λk i=1 This equation is awkward. It has no explicit solution for λk , and computer solutions need to be produced for a large grid of β’s. However, if we replace Y 1 − r(zi , β)λj : tj < ti Λ(ti |zi ) = j in (9.1.35) by its continuous case approximation from (9.1.6), that is Λ(ti |zi ) ∼ = r(zi , β) n X j=1 λj 1(tj ≤ ti ) , we get the approximate empirical likelihood, Lx (θ) = cΠni=1 r(zi , β)λi exp{−r(zi , β) n X j=1 λj 1(tj ≤ ti )} . (9.1.36) From (9.1.36) and ∂ log Lx (θ)/∂λk = 0, we find that the maximizers of Lx (θ) are !−1 n X bk (β) = r(zi , β) 1 (ti ≥ tk ) ,1 ≤ k ≤ n . (9.1.37) λ i=1 Section 9.1 Estimation in Semiparametric Models 141 bk (β) for λk in (9.1.36) and noting that By substituting λ n X r(zi , β) i=1 n X j=1 bj (β)1(tj ≥ ti ) = n , λ we find that the approximate profile likelihood is proportional to " # n Y r(zi , β) b Pn L(β) ≡ Lx β, λ(β) = . n−1 j=1 r(zj , β)1(tj ≥ ti ) i=1 (9.1.38) The profile empirical likelihood estimate of β based on the continuous case approximation (9.1.6) is now defined by b = arg max L(β) . β β It coincides with the Cox (1972,1975) partial likelihood estimate. See Example 9.1.11. b , 1 ≤ j ≤ d, and their approximate standard errors are available in The Cox estimates β j most statistical software packages. We next introduce expressions that will be useful for identifiability and asymptotics. Rewriting (9.1.37) with t in place of tk we get b β) = dΛ(t, Z −1 b r(z, β) 1 (s ≥ t) dP (z, s) dPb2 (t) (9.1.39) where Pb2 is the marginal empirical distribution of T and Pb(·, ·) is the empirical of (Z, T ). Given (Z, T ) ∼ P , define P1 and P2 in general as the the marginal distributions of Z and T . Let Z (9.1.40) S0 (t, β, P ) = r(z, β)1(s ≥ t)dP (z, s) = EP r(Z, β)1(T ≥ t) . By taking log in (9.1.38) and using (9.1.39) we see that maximizing L(β) as a function of β is equivalent to maximizing Γ(β, Pb ) where Γ(β, P ) = EP1 log r(Z, β) − EP2 log S0 (T, β, P ) . (9.1.41) Define β(P ) = arg max Γ(β, P ) ; b is then the estimate β (9.1.42) b = β(Pb) . β b as an empirical plug-in estimate based on the generalized estiNote that we can regard β mating equation Γ̇(β, Pb) = 0, where Γ̇ is the gradient with respect to β. This makes it easier to handle identifiability and asymptotics. See Section 9.2. 142 Inference in semiparametric models Chapter 9 Identifiability of β means that, under the Cox model, if β 0 is true, −Γ(β, P ) has a unique minimum at β = β0 , that is, it is a contrast function as defined in Section 2.1.1. This holds iff the map β → LP (r(Z, β)) is 1–1 and P [S0 (T, 0, P ) = c] < 1 for all c. A proof is sketched in Example 9.2.3 when r(z, β) = exp{βT z}. We shall show that, under the identifiability conditions, −Γ(β, P ) is strictly convex in β when r(z, β) = exp{βT z} even when the Cox model does not hold. Thus, for this r, β(P ) defined by (9.1.42) is a well defined parameter for general P . Similarly, for this r, −Γ(β, Pb ) is strictly convex in b is well defined. Note that Γ(β, P ) is not linear β with probability tending to 1 so that β in P so that M estimation theory as given in Section 6.2.1 can not be used directly for asymptotics. We return to this in Section 9.2.2. ✷ We next turn to a widely used modified likelihood. Example 9.1.11. The Cox partial likelihood. Regression analysis for censored and tied data. Consider Example 9.1.9 with censoring, but now assume we have available a covariate vector Z. As before, let λ(t|z) be the conditional hazard rate given Z = z. We change notation and let t1 < . . . < tm be the observed distinct failure times, leaving out censoring times. The Cox partial likelihood Πm i=1 qi is a conditional empirical “likelihood,” where qi is the probability that a failure occurs at time ti computed conditionally for those patients whose failure or censoring times are at least ti , that is, for patients with yj = min{tj , cj } ≥ ti . It follows from (9.1.8) that with Ri = {j : yj ≥ ti }, Πm i=1 qi = λ(ti |zi ) Πm i=1 P j∈Ri λ(ti |zj ) = Πni=1 " λ(yi |zi ) P j:yj ≥yi λ(yi |zj ) #δi (9.1.43) provided that given Zi , 1 ≤ i ≤ n, C1 , . . . , Cn are independent of T1 , . . . , Tn . As pointed out by Cox (1972,1975), Πm i=1 qi is not a “complete” likelihood, so the term “partial” is appropriate. The λ(ti ) term cancels in Πm i=1 qi under the proportional hazard assumption λ(ti |zi ) = λ(ti )r(zi , β), and we have the Cox partial likelihood for proportional hazard model: " #δi r(zi , β) m n Πi=1 qi = Πi=1 P . (9.1.44) j:yj ≥yi r(zj , β) When there is no censoring and there are no ties, this likelihood is equivalent to the modified profile likelihood (9.1.38). ✷ In the next subsection we formalize the approximation behind the modified likelihoods (9.1.36) and (9.1.38). 9.1.3 Other Modified and Approximate Likelihoods There are models where it is convenient to use discrete type modified likelihoods that do P not require pi = 1. See e.g. Murphy (1994), Murphy, Rossini and van der Vaart (1997), van der Vaart (1998), Murphy and van der Vaart (2000), Zeng and Lin (2007), and Kosorok Section 9.1 Estimation in Semiparametric Models 143 (2008). We consider this approach but start with what we call the delta method likelihood, Ld . It leads to approximate likelihoods such as (9.1.36) and the Cox likelihood (9.1.44) in a natural way. This approach is based on a very close approximation to the ordinary likelihood and does not involve discrete type modified likelihoods. Here is the justification of the delta method likelihood: Many semiparametric models can be written as models with parameters β ∈ Rd and η(·); and a likelihood that depends on β, η(·), and η ′ (·), where η ′ (y) ≥ 0. Then the likelihood of i.i.d. (Y1 , Z1 ), . . . , (Yn , Zn ) can be expressed in terms of θ(i) , η(y(i) ), γ(i) ≡ η ′ (y(i) ), and z(i) where the y(i) ’s are the yi ’s ordered, z(i) is the covariate vector corresponding to y(i) , and θ(i) = r(z(i) , β) with r(·, ·) known. To simplify the likelihood, rewrite η(y(i) ) as the telescoping sum η(y(i) ) = n X j=1 η(j) 1(y(j) ≤ y(i) ) with η(j) = η(y(j) ) − η(y(j−1) ), η(y(0) ) = 0 . Now the parameters are β, γ and η where γ = (γ(1) , . . . , γ(n) )T , η = (η(1) , . . . , η(n) )T . To eliminate γ consider the delta-approximation η(y + d) ∼ = η(y) + dη ′ (y), that is, γ(i) = η ′ (y(i) ) ∼ = η(y(i) ) − η(y(i−1) ) η(i) = y(i) − y(i−1) d(i) (9.1.45) where d(i) = y(i) − y(i−1) , y(0) = 0. We substitute γ(i) ∼ = η(i) /d(i) in the likelihood and get an approximate likelihood Ld (β, η) that depends on β and η only so that we have reduced the dimension of the parameter space. In the proportional hazard model, with η(y) = Λ(y), the resulting likelihood is equivalent to the approximate empirical likelihood b of (9.1.38), is equivalent to the Cox (9.1.36) (Problem (9.1.15), and, with η replaced by λ partial likelihood (9.1.44). The authors referred to in the first paragraph define P a similar approximate likelihood n La (β, η) by replacing η(y) with the step function η[y] ≡ i=1 γ(j) 1(y(j) ≤ y) with the γi being the basic parameter. That is, La can be obtained from Ld by setting the spacings d(i) in Ld equal to one. The two approaches based on the δ likelihood and La are equivalent for estimation of β when all terms in involving d(i) , 1 ≤ i ≤ n, do not contain any parameters. See Problem 9.1.16 for a case where they are not equivalent. Ld has the advantage that it is a very close approximation (Problem 9.1.24) to the actual semiparametric likelihood and it does not use discrete models where continuous models are natural. Murphy and van der Vaart (2000) , and Zeng and Lin (2007) among others, have shown under suitable assumptions, that if we use the profile approach where we fix β and set the estimate b (β) = arg max{La (β, η)} , η η b = arg max{La (β, η b (β)} β 144 Inference in semiparametric models Chapter 9 is asymptotically semiparametrically efficient in a sense to be introduced in Section 9.3. We b and ηb ≡ ηb(β) b the maximum approximate profile likelihood estimates (MAPLEs). call β Example 9.1.12. Semiparametric transformation models. Consider the family of composite models where the df of Y given Z = z is a composite of a parametric df G(·; θ) and a nondecreasing function η(·). That is, consider the transformation model θ = r(z, β) ∈ Θ , F (y; θ, η) = G η(y); θ), (9.1.46) where β ∈ Rd , y ∈ A ⊂ R, and η(y) ∈ [η, η] with G(η, θ) = 0, G(η, θ) = 1, all θ ∈ Θ. The proportional hazard model is of this form with G(t; θ) = 1 − exp{−θt} and η(y) = Λ(y) = − log 1 − F (y) for an unknown baseline df F . More generally, F (y) ; and G(t; θ) = G0 (t − θ), interesting cases are G(t; θ) = G1 (θt), η(y) = G−1 1 η(y) = G−1 F (y) ; for specified df’s G1 and G0 (see Problem 9.1.17). 0 If η(·) has a derivative η ′ (y) > 0, and G has density g(v; θ), the exact likelihood is Πni=1 qi with qi = η ′ (y(i) )g X η(j) ; θi j≤i (9.1.47) where η(j) = η(y(j) ) − η(y(j−1) ), η(0) = 0, and θi = r(zi , β). With the approximation η ′ (y(i) ) ∼ = η(i) /d(i) , Ld (β, η) is equivalent to Πni=1 pi with pi = η(i) g X j≤i η(j) ; θi , (9.1.48) P where we do not require ni=1 pi = 1. In this case, Ld (β, η) is constant in d(i) = y(i) − y(i−1) , 1 ≤ i ≤ n, and is equivalent to La (β, η). Moreover, La only depends on the ranks of y1 , . . . , yn , and thus the estimate of β also only depends on the ranks. The MAPLEs are formally where n X Y b = arg max b=η b (β) , ηbi (β)g ηbj (β); r(zi , β) , η β β i=1 j≤i b (β) = arg max η η n Y i=1 pi : η < ηj < η, 1 ≤ j ≤ n . As an example, supposeG(y; θ) = L(y − θ) where L(t) is the logistic distribution [1 + exp(−t)]−1 and η(t) = log F (t)/[1−F (t)] for some baseline F ; then the model (9.1.47) b is the profile NPMLE of Murphy, is the proportional odds model (Problem 9.1.17) and β Rossini and van der Vaart (1998). Existence, consistency, and asymptotic normality of the MAPLEs for this model is given in that paper. Section 9.1 9.1.4 Estimation in Semiparametric Models 145 Sieves and Regularization Empirical likelihood and its relatives are methods which give attractive answers in a number of situations, but their scope is, in fact, limited. To see this, suppose we try to apply it to the symmetric location model of Example 3.5.1. Here P = {P : P has density p(x) = f (x − θ), f is arbitrary symmetric about 0, θ ∈ R} . Unfortunately, Px cannot be defined since, in general, there is no distribution concentrated on x which puts positive mass on all points, which is also symmetric about any point. Sieves We next consider a conceptually very powerful approach, the method of sieves, introduced into statistics by Grenander (1981). In the method of sieves, the approach is to approximate the model P by a sequence of regular parametric models {Pm } called a sieve that in some sense converge to P as m → ∞. The distribution Pbm in the sieve that corresponds to the maximum likelihood estimate for Pm estimates P and provides a plug-in estimate ν(Pbm ) of a parameter ν(P ) in a semiparametric model. It is sometimes convenient to take for Pbm not maximum likelihood but minimum distance or some other estimate well behaved when P ∈ Pm . These methods can be thought of as examples of regularization, an approach first proposed by Tikhonov (1963) in the context of providing stable solutions to “ill posed” differential equations. See Section 11.4.1. The method of sieves is a method which, in principle, can be applied to any statistical problem where we have available an i.i.d. sample with empirical probability Pb . Suppose, given a model P for our data, we can determine P1 ⊂ P2 ⊂ . . . such that (i) P ⊂ ∪∞ j Pj , where Ā denotes closure of A in at least the sense of weak convergence. (ii) Pj is a regular parametric model Pj = {Pθ(j) : θ(j) ∈ Θ(j) ⊂ Rdj } where {dj } is a sequence strictly increasing to ∞ as j → ∞ with Pθ (j) having density function p(·, θ (j) ) and the map θ(j) → p(·, θ (j) ) is smooth. Let ρ(P, Q) be a measure of the discrepancy between two probabilities P and Q. For a given fitting criterion of the form Πj (Q) = arg min ρ(P, Q) : P ∈ Pj , (j) such that if the idea now is to find parameters ηj (P ) ∈ Θ ⊂ ∪∞ j=1 Θ Πj (P ) ≡ Pηj (P ) ∈ Pj then Πj (P ) → Π(P ) = P as j → ∞ 146 Inference in semiparametric models for P ∈ P, and 1 Chapter 9 Πj (Pb ) = Πj (P ) + ∆jn where ∆jn = oP (n− 2 ) for all j. Here Πj (P ) − P can be thought of as contributing to bias and Πj (Pb) − Πj (P ) as contributing to variance. Typically ∆jn cannot be made uniformly small in j for all n. In the method of sieves, we choose a model selection rule which chooses PJn as the “best” model according to some criteria (see Section I.7), and then act as if PJn were true. That is, with m = Jn and Pb the empirical probability, we estimate P by Pbm ≡ Πm (Pb ) . The estimate of a parameter ν(P ) is then νb ≡ ν(Pbm ). The three elements of the method of sieves for estimating a parameter ν(P ), P ∈ P are: a) The choice of {Pj }. b) The choice of ηj (P ). Frequently the first thing tried is the maximum likelihood (Kullback-Liebler) discrepancy of Section 2.2.2, Z ηj (P ) = arg max{ log p(x, θ(j) )dP (x) : θ (j) ∈ Θ(j) } . That is, for P ∈ P, ηj (P ) is the parameter θ in the jth model that makes Pθ closest to P in the Kullback-Leibler sense. c) The choice Jn of j. Unfortunately, the applications of the method of sieves usually lead to technical difficulties which have to be resolved by delicate empirical process theory arguments; see Bickel, Klaassen, Ritov and Wellner (1993,1998), van der Vaart and Wellner (1996), and van der Geer (2000) for theory and practice, examples, and references to the literature. Regularization Sieves are just a type of regularization. Consider the representation of maximum likelihood in terms of Kullbach-Leibler discrepancy given in Section 2.2.2. We start by writing Z 1 log Lx (p) = log p(x)dPb (x) . n We saw in Example 9.1.6 that Z arg max log p(x)dPb(x) : P corresponds to P ∈ P (9.1.49) Section 9.1 Estimation in Semiparametric Models 147 is not always well defined. Yet we know by Shannon’s Lemma (2.2.1) that, quite generally, if P0 ∈ P, Z P0 = arg max log p(x)dP0 (x) : p corresponds to P ∈ P is well defined. This suggests that we consider approximate solutions of the problem (9.1.49) which are “nice” and so will be apt to be closer to the “nice” true P0 . Sieves do this by restricting the model over which optimization is carried out. An alternative is to penalize solutions which are not “nice” by changing the quantity to be For instance, we assume that P is all densities p with a derivative p′ such that Roptimized. ′ 2 [p (x)] dx < ∞. It is natural to maximize Z Z b log p(x)dP (x) − λn [p′ (x)]2 dx . (9.1.50) For λn > 0 theR maximizer of (9.1.50) may be shown to be unique and necessarily have a density with [p′ (x)]2 dx < ∞. On the other hand if λn → 0 it is intuitively clear that the maximizer of the population version of (9.1.50) where Pb is replaced by P should approximate the true p0 . This method is studied further in Chapter 11. It is, in fact, closely linked to the idea of using the posterior mode of a Bayes prior on P. It can also be viewed as a special case of the method of sieves where, the models {Pn } of the sieve however, R are themselves infinite dimensional, Pm = p : [p′ (x)]2 dx ≤ εm . We discuss this further in Chapter 11 but note here the general principles. A review of regularization with discussion of various examples is given by Bickel and Li (2006). Examples We shall give an illustration of how sieves can be used in a, by now, classical example. In this example we are using a straightforward extension to two sequences {j1 } and {j2 }. Example 9.1.13. Partial linear model. Suppose X = (U, Z, Y ) and we postulate Y = βZ + g(U ) + ε , (9.1.51) where (U, Z) have unknown joint distribution H, ε ∼ N (0, σ 2 ) independent of (U, Z) and g is arbitrary except for regularity conditions. We shall assume (U, Z) ∈ R2 , although the model and results are readily extendable to vector U and Z and ε ∼ F arbitrary. This model, introduced by Engle et al. (1986), arises if, for instance, Z takes on only a finite set of values such as the levels of a treatment, but U is a real valued covariate of less intrinsic interest, e.g. age. Thus ν(P ) = β is the parameter of interest. Rewrite the model (with a new g) as Y = β(Z − E(Z|U )) + g(U ) + ε P ={P(β,g,σ2 ) : β ∈ R, σ 2 > 0, g arbitrary} . 148 Inference in semiparametric models Chapter 9 Then, in analogy with ordinary linear regression, because Y − EP (Y |U ) = β(Z − EP (Z|U )) + ε for P ∈ P, we can formally define a parameter β(P ) = EP (Z − EP (Z|U ))(Y − EP (Y |U )) EP (Z − EP (Z|U ))2 (9.1.52) such that β(P(β,g,σ2 ) ) = β . Now (9.1.52) clearly gives a valid definition of β(P ) if EP (Z 2 ) < ∞, EP g 2 (U ) < ∞, and EP (Z − EP (Z|U ))2 > 0. Then P is identifiable provided EP (Y 2 ) < ∞, EP (Z 2 ) < ∞, and Z is not a function of U with probability 1. In fact, we shall argue that only the condition 0 < EP (Z − EP (Z|U ))2 < ∞ is needed (Problem 9.1.12). Note that we cannot define β(Pb) because EPb (Z|U ) and EPb (Y |U ) are not well defined. However, b |U = ·) it is natural to plug-in regularized estimates of these quantities, call them E(Y b and E(Z|U = ·), to obtain a procedure which works. For simplicity, we shall use the histogram density estimate applied to p(y|u), p(z|u). Assume that (U, Y ) has a density on [0, 1] × [0, 1]. Define Ijh = (jh, (j + 1)h] for each axis of the square as in Example I.5 so (1) (2) that Ij1 ,h1 × Ij2 ,h2 form a partition of the square for 0 ≤ ji ≤ 1/hi , i = 1, 2. Let (1) (2) (1) pbn (y|u) = Pb [Y ∈ Ij2 ,h2 (y) | U ∈ Ij1 ,h1 (u)] (2) where Ij1 ,h1 (u) is the partition interval that contains u and Ij2 ,h2 (y) the one that contains y. This is an application of the method of sieves where Pj1 ,j2 corresponds to the model with all densities of (U, Y ) and (Z, Y ) constant on the rectangles described above and {θ(j1 ,j2 ) } are the probabilities of these rectangles. Let h2 → 0. Strictly speaking we no longer have a continuous case conditional density when h2 = 0, but rather the discrete density assigning equal mass to all Yi such that Ui ∈ Ij1 ,h1 . However, the procedure turns out to be satisfactory, yielding, Pn Yi 1(Ui ∈ Ij1 ,h1 (u)) b |U = u) ≡ Pi=1 . (9.1.53) EPbn (Y |U = u) ≡ E(Y n i=1 1(Ui ∈ Ij1 ,h1 (u)) b We next let h1 → 0 and get a similar definition for E(Z|U = u). Now βb ≡ β(Pbn ) is b obtained by plugging these E’s into (9.1.52). These are the plug in estimates corresponding b In this example, to the regularization given. We shall later study a simplified version of β. choosing ηj (P ) amounts to choosing the bandwidths h1 and h2 . See Chapters 11 and 12. See also Engle et al (1986), Chen (1988), and Fan et al (1998). ✷ Example 9.1.14. ICA (Example 9.1.5). The density of X in the independent component analysis (ICA) model (9.1.18) is p(x; A, Q) = |A|−1 Πdj=1 qj (a(j) x) (9.1.54) Section 9.1 149 Estimation in Semiparametric Models where a(j) is the jth row vector of A−1 . We begin with the problem of identifiability, following Kagan, Linnik and Rao (1973); see also Comon (1994). Suppose (A1 , Q1 ), (A2 , Q2 ) lead to the same distribution for X, where all components of Ql , l = 1, 2, have mean 0 and variance 1. Without loss of generality (Problem 9.1.10), we can take A2 = J, the d × d identity. For l = 1, 2, let Zl = (Z1l , . . . , Zdl )T be the vector of independent variables distributed as (Q1l , . . . , Qdl )T . Consider the characteristic function of X = A1 Z, (1) Ψ(t) = E[exp{itT X}] = E[exp{itT A1 Z}] = Πdk=1 ψk (ak t) (9.1.55) (l) where ψk is the characteristic function of Zkl , l = 1, 2, and ak is the kth row vector of AT1 . Similarly, since A2 is the identity, (2) Ψ(t) = Πdk=1 ψk (tk ) . (9.1.56) (j) (j) Taking logs in (9.1.55) and (9.1.56) appropriately, we obtain if log ψk ≡ γk , d X (1) [γk ](ak tT ) = k=1 d X (2) [γk ](tk ) k=1 for all t in a neighborhood of 0. Differentiating twice with respect to tb , tc , we obtain d X (1) [γk ]′′ (ak tT )abk ack = δbc (9.1.57) k=1 (1) where ak = (a1k , . . . , adk ) and δbc is the Kronecker delta. By assumption, [γk ]′′ (0) = 1 for all k. Hence, we see that a1 , . . . , ad are orthonormal. We can deduce from (9.1.57) that (1) either [γk ]′′ (t) is constant, i.e. Zk is Gaussian, or ak is plus or minus one of the coordinate vectors and conclude that, if at most one of the Zk is Gaussian, identifiability of A and γk , 1 ≤ k ≤ d follows (Problem 9.1.10). This argument suggests that an estimate of A can be constructed as follows: Let A∗ be the matrix obtained from A by scaling each column to have norm 1 and drop the requirement that the Zk have variance 1. Let Z |t|2 ∗ 2 b − Πd ψk (a∗ tT , a(k) b Q(A ) = |ψ(t) }dt ∗ )| exp{− k=1 k 2 (k) where ψb is the empirical characteristic function EPb [exp{itT X}] of X and ψk (·, a∗ ) is (k) (k) the empirical characteristic function of a∗ X where a∗(k) and a∗ are the kth row vectors of A∗ and [A∗ ]−1 . Chen (2003) shows that is a b∗ = arg min Q(A b ∗) A √ n consistent estimate of A∗ under mild conditions. 150 Inference in semiparametric models Chapter 9 More significantly, we can also apply the method of sieves to obtain what turns out to be an efficient estimate of A∗ . The idea is to construct suitable estimates of qk′ /qk (·, A∗ ), the score functions of the densities of Z1 , . . . , Zd assuming that A∗ is the truth. A by now standard approach (see Stone, Hansen, Kooperberg and Truong (1997)) is to approximate log qk (·, A∗ ) by a linear combination of basis functions, e.g. splines. That is, we approximate an arbitrary density q by a member of the exponential family, p(x, θ) = exp{ d X θk wk (x) − Ad (θ)} k=1 where the linear span of the wk is dense in the space of all functions for suitable metrics. See Section 11.4.2. Given the qbk (·, A∗ ) we plug back into (9.1.54) and maximize the resulting likelihood in A∗ . This is an application of the method of sieves. Appropriate A∗n can be constructed and despite technical difficulties, this approach works — see Chen (2003). We next consider a model that’s a rival to the Cox model (Reid (1994)). Example 9.1.15. The accelerated failure time (AFT) model. Consider the model where a failure time T is an accelerated version of a baseline failure time T0 in the sense that T = αT0 ; T0 ≥ 0 , where α = exp{−βT Z}, T0 corresponds to β = 0, and T0 is independent of the vector Z of predictors. We can rewrite this model as T eβ T Z = T0 . For the case where Z = Z(t) is a time dependent covariate vector, Cox and Oakes (1984, Section 5.2) and Zeng and Lin (2007) considered the model Z T eβ T Z(t) dt = T0 . (9.1.58) 0 Let Λ(t|z) and Λ(t) denote the cumulative hazards of T and T0 , then Λ(t|z) = Λ Z t eβ T 0 z(s) ds . (9.1.59) As before, let C denote a censoring time set δ = 1(T ≤ C), Y = min(T, C), assume that C is independent of T given Z, and that the distribution of C given Z does not involve β or Λ. Then the log likelihood for data {(Yi , Zi (Yi ), δi ; 1 ≤ i ≤ n} is proportional to (Problem 9.1.20), l(β, λ) = n−1 n X i=1 δi β T Zi (Yi ) + δi log λ(θi ) − Λ(θi ) (9.1.60) Section 9.2 Asymptotics. Consistency and Asymptotic Normality 151 where λ(·) = Λ′ (·) and θi = Z 0 Yi exp βT Zi (s) ds . In this example, the profile approach where we fix β and maximize l(β, λ) with respect to λ(·) does not produce useable estimates (see Problem 9.1.20). If we formally pass to the RY limit as n → ∞ in (9.1.60) we obtain θ(Y ) ≡ θ(Y, β) = 0 exp{βT Z(s)}ds. Note that T0 = θ−1 (T, β). It is natural to consider θ(·, β) as a nuisance parameter and regularize the d P δ = 1, Y ≤ θ−1 (t, β) by a kernel smoothed version profile likelihood by replacing dt which is then estimated empirically. See Chapter 11 for such estimates. The resulting estimates of β are, under regularity conditions, semiparametrically efficient in the sense of Section 9.3 (Zeng and Lin (2007)). ✷ Remark 9.1.3 (a). The proportional hazard and the AFT models coincide for the Weibull model and the periodic hazard rate model. See Problem 9.1.21. (b). The proportional quantile hazard rate and AFT models are equivalent. See Problem 9.1.22. ✷ Summary. We discuss in Section 9.1.1 several semiparametric models for a variety of common experimental frameworks including the symmetric location model, the linear model with stochastic covariates and unknown error distribution and biased sampling models. In the context of survival analysis we introduce the survival function, the hazard function, censoring, truncation, and the Cox proportional hazard model. We also consider the Independent Component Analysis (ICA) model where the basic variables are linear combinations of independent but not identically distributed variables whose distributions are “arbitrary.” We consider in Sections 9.1.2 and 9.1.3 maximum likelihood type methods for semiparametric models that are based on modifications of the maximum likelihood principle. They are usually based on approximating or replacing the arguments of the objective functions associated with the maximum likelihood (Kullback-Leibler divergence) with arguments that are √ “well behaved.” Success is achieved when n consistent (generally efficient) estimates of Euclidean parameters are obtained when the empirical probability Pb is plugged in for the population probability P . The first method considered is modified likelihood which leads to nonparametric MLEs and the second method considered in Section 9.1.4 is the method of sieves. We then illustrate these regularized maximum likelihood and sieve procedures in the semiparametric models introduced in Section 9.1.1. We obtain some classical procedures including the Kaplan-Meier, Nelson-Aalen, and Cox estimates in survival analysis. 9.2 Asymptotics. Consistency and Asymptotic Normality In this section we establish a general consistency criteria and use it to prove consistency and asymptotic normality for some of the estimates we have suggested. 152 Inference in semiparametric models 9.2.1 Chapter 9 A General Consistency Criterion We state a generalization of Theorem 5.2.3, which turns out to be broadly useful for proving consistency. Let kf kK ≡ sup{|f (t)| : t ∈ K}, K ⊂ Rd , and let {Qn (t) : t ∈ Rd } be a sequence of real valued stochastic processes such that A1: Qn (t) is convex as a function of t with probability 1. P A2: kQn − QkK −→ 0 as n → ∞ for all compact K ⊂ Rd , where the limit Q : Rd −→ R is a deterministic function. A3: Q is strictly convex and continuous with lim{Q(t) : |t| −→ ∞} = ∞. It may be shown (Rockafellar (1969), p.74) that A2 follows from P A2’: Qn (·) is separable and Qn (t) −→ Q(t) for all t. Let t0 ≡ arg min Q(t), btn = arg min Qn (t) . By convention, b tn is chosen among the set of minimizers if argmin is not unique and b tn ≡ 0 if the minimum is not assumed. Proposition 9.2.1. If A1, A2, and A3 hold, (i) t0 is uniquely defined. P (ii) b tn −→ t0 . Proof. Condition A3 guarantees (i). For (ii), let ǫ > 0, and note that “|b tn − t0 | > ǫ” implies that the minimizer of Qn (t) is in {t : |t − t0 | ≥ ǫ}. Thus P (|b tn − t0 | > ǫ) ≤ P inf{Qn (t) : |t − t0 | ≥ ǫ} ≤ Qn (t0 ) . Now (ii) follows if we argue that as n → ∞, P inf{Qn (t) : |t − t0 | ≥ ǫ} > Qn (t0 ) → 1 . (9.2.1) Note that by A3, inf{Q(t) : |t − t0 | = ǫ} > Q(t0 ) . Therefore, by A2, as n → ∞, P inf{Qn (t) : |t − t0 | = ǫ} > Q(t0 ) → 1 . Next suppose b tn ∈ {t : |t − t0 | > ǫ}. Then there exists λ ∈ (0, 1) such that t1 ≡ λb tn + (1 − λ)t0 ∈ {t : |t − t0 | = ǫ} . (9.2.2) Section 9.2 153 Asymptotics. Consistency and Asymptotic Normality By convexity of Qn , Qn (t1 ) ≤ λQn (b tn ) + (1 − λ)Qn (t0 ) ≤ λQn (t0 ) + (1 − λ)Qn (t0 ) = Qn (t0 ) . It follows that b tn ∈ {t : |t − t0 | > ǫ} implies that inf{Qn (t) : t : |t − t0 | = ǫ} ≤ Qn (t0 ) . But by (9.2.2), the probability of this event tends to zero. We have established (9.2.1). ✷ Proposition (9.2.1) is what we generally need for consistency arguments. For asymptotic normality particularly in relation to modified likelihood, we shall use this consistency argument in conjunction with the generalized delta method of Section 7.1. 9.2.2 Asymptotics for Selected Models We begin with Example 9.2.1 Biased sampling asymptotics (Examples 9.1.2 and 9.1.8). We consider the case of one population first. Let X = R and recall Fbe (x) = Z Rx x dFbe (y) = R−∞ ∞ −∞ −∞ w−1 (y)dPb (y) , w−1 (y)dPb (y) (9.2.3) where Pb is the NPMLE of P and P is assumed to have density pE (x) = w(x)f (x)/W (F ). If EP (w−1 (X))2 < ∞, then, using Theorem 7.1.5 (Problem 9.2.1), En (x) ≡ where Z(x) = √ n(Fbe (x) − F (x)) =⇒ Z(x) Rx −∞ w−1 (y)dWP0 (y) EP w−1 (X) − Rx −∞ w−1 (y)dP (y) (EP w−1 (X))2 Z ∞ −∞ w−1 (y)dWP0 (y) and WP0 (·) is the Brownian bridge on the support of f with respect to P . More generally, for the stratified framework with several populations, it follows from Theorem 7.2.1 that if EP w−2 (Y, F ) < ∞ where w(y, F ) ≡ k X j=1 λj wj (y) , Wj (F ) then W(Fb ) is well defined with probability tending to 1 and Z 1 b W(F ) = W(F ) + P(x, F )dPb (x) + oP (n− 2 ) (9.2.4) 154 Inference in semiparametric models for Ψ given by (9.1.23). It follows from (9.2.4) that if Fbe (y) = defined in Example 9.1.8 and En (y) = then √ n(Fbe (y) − F (y)) Ry −∞ Chapter 9 dFbe (u) where dFbe is Z · 1 En (y) = n 2 w−1 (u, F ) d(Fbe − F )(u) −∞ Z y n h Z ∞ w (u) a − dF (u) c−2 (u, p) d(Pb1 − P1 )(a) W a −∞ −∞ Z Z io wa (u) ψa (x, P )dP1 (a)d(Pb − P )(x) + oP (1) − 2 Wa (9.2.5) by repeated application of the chain rule where x = (i, v), ψ = (ψ1 , . . . , ψk ) and the second term comes from (9.2.4). These results are sketched in the problems. See also Gill, Vardi and Wellner (1988). ✷ Example 9.2.2. Asymptotics for the Kaplan-Meier and Nelson-Aalen estimates (Examples 9.1.3 and 9.1.9). The Kaplan-Meier estimate of the survival function can be approximated by the simpler Nelson-Aalon estimate which is easier to analyze, and is of the modified empirical likelihood method. For t < Y(L) , using log(1 − a) = −a + 12 a2 + o(a2 ), log SbKM (t) = X log 1 − = − X 1(Y(i) ≤ t) (9.2.6) ( n ) X δ(i) 1(Y(i) ≤ t) . ≤ t) + OP (n − i + 1)2 i=1 δ(i) n−i+1 δ(i) 1(Y(i) n−i+1 Recall from Theorem 7.2.3 that sup{|Y(i) − H −1 ( i i )| : ≤ 1 − ε} = oP (1) n+1 n where H(x) = 1 − F̄ (x)Ḡ(x) is the distribution function of Y . Since t < Y(L) , H̄(t) ≡ 1 − H(t) > 0, and hence n X δ(i) 1(Y(i) ≤ t) 1 = OP ( ) . 2 (n − i + 1) n i=1 (9.2.7) Define n X b 1 (t) = 1 δ(i) 1(Y(i) ≤ t) H n i=1 Section 9.2 155 Asymptotics. Consistency and Asymptotic Normality b the (sub) empirical distribution function of the uncensored observations, let H(t) be the b̄ b Then, empirical distribution function of the Yi , 1 ≤ i ≤ n, and let H = 1 − H. Z t n X b 1 (y) δ(i) dH b N A (t) 1(Y(i) ≤ t) = ≡Λ (9.2.8) n − i + 1 b̄ −∞ H(y−) i=1 b N A is the Nelson-Aalen estimate of the cumulative hazard function. This leads to where Λ the Nelson-Aalen estimate of the survival function of T , b N A (t)} . SbN A (t) = exp{−Λ (9.2.9) From (9.2.6) and (9.2.7), it is clear that if H̄(τ ) > 0, then 1 sup{|SbKM (t) − SbN A (t)| : t ≤ τ } = oP (n− 2 ) . (9.2.10) Using (9.2.10) and empirical process arguments from Chapter 7, we will show Theorem 9.2.1. If H̄(τ ) > 0, and F is continuous, then if a “*” subscript denotes either the KM or NA estimates 1 (i) sup{|Sb∗ (t) − S(t)| : t ≤ τ } = OP (n− 2 ). (ii) As a stochastic process on (−∞, τ ], Z t b 1 − H1 )(y) Sb∗ (t) = S(t) + H̄ −1 (y−)d(H − Z t −∞ −∞ 1 b̄ (H(y−) − H̄(y−))H̄ −2 (y−)dH1 (y) + oP (n− 2 ) . (iii) n (Sb∗ (·) − S(·)) =⇒ Z(·) as a stochastic process on [0, τ ] to a mean 0 Gaussian process with 1 2 Cov(Z(s), Z(t)) = K̄(s)K̄(t)(C(s ∧ t) − C(s)C(t)) where K = S/[1 + C] and Z t Z t 1 1 dΛF (y) = dF (y) C(t) ≡ H̄(y−) H̄(y−) F̄ (y−) 0 0 (9.2.11) Proof. We prove (ii) which will imply (i) and (iii). By (9.2.7), it’s enough to consider SbN A . But b N A}−exp{−ΛF } = −(ΛN A −ΛF )+OP (kΛ b N A −ΛF k2 ) (9.2.12) SbN A −S = exp{−Λ ∞ where kgk∞ ≡ sup{|g(t)| : −∞ < t ≤ τ }. Further, Z t b 1 − H1 ) d(H b N A − ΛF (t) = − Λ (y) H̄(y−) −∞ Z t b̄ + (H(y−) − H̄(y−))H̄ −2 (y−)dH1 (y) + ∆n (t) −∞ 156 Inference in semiparametric models Chapter 9 where ∆n (t) = − + Z Z t t −∞ −∞ b̄ − H̄) (H b 1 − H1 )(y) (y−)d(H b̄ H H̄ b̄ − H̄)2 (H (y)dH1 (y) . b̄ H̄ 2 H Call the terms of ∆n (·), Rem1 and Rem2 , respectively. Now, given Donsker’s theorem for b H(·), we see that b̄ − Hk2∞ H1 (τ ) sup{[H(y) b̄ H̄ 2 (y)]−1 , y ≤ τ }) = OP (n−1 ) Rem2 = OP (kH b̄ H̄ 2 (y) : y ≤ τ } = H(τ b̄ )H̄ 2 (τ ) and H̄(τ ) > 0 by assumption. because inf{H(y) For the other remainder term, write, using the probability integral transforms, ν = b 1 (y) and ν = H1 (y), H Rem1 = Z b 1 (·) H 0 = Z H1 (·) 0 Z H1 (·) b̄ b̄ − H̄)(H b −1 (v)−) (H (H − H̄)(H1−1 (v)−) 1 dv − dv b̄ H̄(H b −1 (v)−) b̄ H̄(H −1 (v)−) 0 (H H 1 1 b̄ − H̄)(H −1 (v)−) − (H b̄ − H̄)(H b −1 (v)−)} {(H 1 1 1 dv + oP (n− 2 ) . −1 H̄ 2 (H1 (v)−) −1 1 P b̄ 1 − H̄ −1 )k∞ −→ 0 (Problem By Donsker’s theorem, Rem1 = oP (n− 2 ) because k(H 1 9.2.4). This completes (ii). The weak convergence to a mean zero Gaussian process in (iii) is again a consequence of Donsker’s theorem after an integration by parts. Finally (9.2.11) may be obtained after a fairly tedious computation — see Problem 9.1.16. In fact, to obtain (9.2.11) simply, we have to approach the problem from the point of view of counting processes. We refer to Aalen (1978), Andersen et al (1993) and Shorack and Wellner (1986, p. 308) for such a development. ✷ Example 9.2.3. Cox estimate asymptotics (Examples 9.1.4 and 9.1.10). We would like to b = arg max Γ(β, Pb ) from (9.1.41) is argue that under suitable conditions, the estimate β uniquely defined, consistent, and asymptotically normal. Although this can be done, it is easier to consider a slight modification: In order to avoid technical difficulties in the right tail of the distribution of T , we limit ourselves to observations (Zi , Ti ) with Ti ≤ τ for some finite fixed constant τ > 0. Here τ can be interpreted as a preassigned time limit in the study. Recall that S0 (t, β, P ) = EP r(Z, β)1(T ≥ t) . and let Γτ (β, P ) = Z 0 τ − log S0 (t, β, P )dP2 (t) + E 1(T ≤ τ ) log r(Z, β) (9.2.13) Section 9.2 157 Asymptotics. Consistency and Asymptotic Normality where P1 and P2 are the marginal probability distributions of Z and T . Note that Γ(β, P ) as defined earlier in (9.1.14) is just Γ∞ (β, P ). Suppose r(z, β) = exp{βT z} and let β(P ) = arg max{Γτ (β, P )}. (9.2.14) We will show that β(P ) is identifiable if we use this r and make the assumptions C : (i) P [T = c(Z)] < 1 for every function c : Rd → R . (ii) EZZT is nonsingular. (iii) T is continuous with a positive density on (0, ∞). We next show that assumption C implies identifiability of β(P ) when r(z, β) = exp[β T z]. b ≡ β(Pb) under these assumptions. We also show consistency and asymptotic normality of β These results hold even when the Cox model for X = (Z, T ) is not satisfied. Here β(P ) can be thought of as the parameter of the distribution in the Cox model “closest” to P. See Section 2.2.2. Let x = (z, t). Theorem 9.2.2. Assume that (Z, T ), T ≥ 0, has a joint probability distribution P such that C is satisfied. Set r(z, β) = exp(β T z). Then β(P ) = arg max Γτ (β, P ) is identifiable. Moreover, let Ṡ0 and Γ̈τ denote gradient and Hessian of S0 and Γτ with respect to β, then P b is uniquely defined with probability tending to 1 and β b −→ (i) β β(P ) as n → ∞. R b = β(P ) + ψ(x, P )dPb (x) + oP (n− 12 ) where (ii) β ψ(x, P ) = −Γ̈−1 τ (β, P )γ(x, P ) with an expression for Γ̈τ (β, P ) given in the proof and, writing β = β(P ), ( Z τ i h Ṡ0 Ṡ0 γ x, P ) = z − E[Z1(T ≤ τ )] − (t, β, P ) − (u, β, P )dP2 (u) S0 0 S0 ) Z t io n h Ṡ T 0 −1 β z (u, β, P ) dP2 (u) 1(t ≤ τ ) . (9.2.15) −e S0 (u, β, P ) z − S0 0 1 b − β) =⇒ N (0, Σ(β, P )) where (iii) n 2 (β Z Σ(β, P ) = ψψ T (x, P )dP (x) . Proof. In view of Proposition 9.2.1, for identifiability and (i), it is enough to check that (a) Γτ (β, Pb) is concave with probability 1. 158 Inference in semiparametric models Chapter 9 (b) Γτ (β, P ) is continuous and strictly concave and Γτ (β, P ) → −∞ as |β| → ∞. P (c) kΓτ (β, Pb ) − Γτ (β, P )k −→ 0. We will establish (a) and (b) by calculating the gradient of Γτ (β, P ), Γ̇τ (β, P ) = − Z τ 0 and the Hessian, Γ̈τ (β, P ) = Z 0 τ Ṡ0 (t, β, P )dP2 (t) + E Z1(T ≤ τ ) , S0 S0−2 Ṡ02 − S0 S̈0 (t, β, P )dP2 (t) . It follows that Γτ (β, P ) is strictly concave if Ṡ02 − S0 S̈0 is negative definite (Section B.9). We show this for the case Z ∈ R. Recall that S0 (t, β, P ) = EP [eβZ 1(T ≥ t)]. Let 1 1 1 U = [eβZ 1(T ≥ t)] 2 , V = [ZeβZ 1(T ≥ t)] 2 , and W = Z[eβZ 1(T ≥ t)] 2 , then By Cauchy-Schwarz, 2 Ṡ02 − S0 S̈0 = E(V 2 ) − (EU 2 )(EW 2 ) . 2 2 (EU 2 )E(W 2 ) ≥ E(U W ) = E(V 2 ) , with equality iff W 2 = a + bU 2 for some a and b 6= 0. Equality leads to the equation (Z 2 − b)eβZ 1(T ≥ t) − a = 0 . Taking expectation with respect to L(T |z) gives P (T ≥ t|z) = ae−βz /(z 2 − b) . This is impossible unless T = c(Z) for some function c, which is ruled out by C(i). Thus Γτ (β, P ) is strictly concave. The condition Γτ (β, P ) → −∞ as |β| → ∞ follows from strict concavity. The preceding Cauchy-Schwarz argument shows that Γτ (β, Pb) is concave in β with probability one. So (a) and (b) hold. Note that these concavity results hold for all α ∈ [0, 1), where α = P (T ≤ τ ). Finally, showing condition (c) is an exercise in empirical process theory. First check that P sup{|S0 (t, β, Pb ) − S0 (t, β, P )| : t ≤ τ + ε} −→ 0 for ε > 0 sufficiently small and then apply the Glivenko-Cantelli Theorem. b of Γτ (β, Pb ) satisfies Γ̇τ (β, b Pb ) = 0. Since To obtain (ii), first note that the maximizer β b b b b β is consistent, we can expand Γ̇τ (β, P ) around β = β, set the expansion equal to zero, and use the mean value theorem to obtain √ b ∗ b √ b n(β − β) = −Γ̈−1 τ (β , P ) nΓ̇(β, P ) Section 9.2 159 Asymptotics. Consistency and Asymptotic Normality P where |β ∗ − β| −→ 0. In addition, we can show Z 1 b Γ̇τ (β, P ) = Γ̇τ (β, P ) + γ x, P ) dPb (x) + oP (n− 2 ) . (9.2.16) We establish (9.2.16) by using the infinite dimensional delta method of Section 7.2. That is, we compute the Gâteaux derivative of ν(P ) ≡ Γ̇τ (β, P ) and show that it equals γ(x, P ). Then we show that −Γ̈−1 τ (β, P )γ(x, P ) satisfies condition (7.2.2) for being an influence b Some of the details are in Appendix D4. function for β. Remark 9.2.1 The formula for γ(x, P ) is valid when τ = ∞ (Tsiatis (1981), Anderson and Gill (1982)). ✷ In the special case of the Cox model we have Proposition 9.2.2. Suppose (Z, T ) has the Cox (β 0 , λ) distribution and that C holds. Then β(P ) = β 0 . Proof. We have shown that Γ̇τ (β, P ) = 0 has a unique solution thus it remains to show that Γ̇τ (β 0 , P ) = 0. We give the proof when d = 1; the general case is only notationally more difficult. Let P0 , P10 , and P20 denote the joint and marginal probability distributions of (Z, T ) under the Cox (β0 , λ) model (9.1.12), respectively. We need to show that β0 satisfies the equation Z τ Ṡ0 (t, β0 , P0 )dP20 (t) = E0 Z1(T ≤ τ ) , 0 S0 where E0 is expectation under P0 . By conditioning on Z and using the iterated expectation theorem (B.1.20), we find dP20 (t) = dE0 P0 (T ≤ t|Z) = dE0 1 − exp{−Λ(t)eβ0 Z } = E0 λ(t)eβ0 Z exp{−Λ(t)eβ0 Z } = S0 (t, β0 , P0 )dΛ(t) . By also conditioning on Z, we find Ṡ0 (t, β0 , P0 ) = E0 Zeβ0 Z 1(T ≥ t) = E0 Zeβ0 Z P (T ≥ t|Z) = E0 Zeβ0 Z exp{−Λ(t)eβ0 Z } . It follows that Z τ Z ∞Z τ Ṡ0 (t, β0 , P0 )dP20 (t) = zeβ0 Z exp{−Λ(t)eβ0 z }dΛ(t)dP01 (z) 0 S0 −∞ 0 ! Z ∞ Z vτ (z) −v = z e dv dP01 (z) , −∞ 0 by making the integral the change of variable v = Λ(t)eβz and setting vτ (z) = Λ(τ )eβ0 z . Because V = Λ(T )eβ0 z given R ∞Z = z has a standard exponential distribution (Problem 9.2.14), the right hand side is −∞ zP (T ≤ τ |z)dP01 (z) = E(Z1(T ≤ τ )). ✷ 160 Inference in semiparametric models Chapter 9 Example 9.2.4. Partial Linear model (Example 9.1.13). Write n X βb − β(P ) = b i |Ui ))[Yi − E(Y b i |Ui )] { (Zi − E(Z i=1 n X − β Since i=1 b i |Ui )) } (Zi − E(Z n X i=1 b 2 (Zi |Ui ))2 (Zi − E −1 . εi = Yi − E(Yi |Ui ) − β(Zi − E(Zi |Ui )) we obtain Pn b i |Ui )]εi [Zi − E(Z βb − β(P ) = Pi=1 n 2 b i=1 (Zi − E(Zi |Ui )) X b i |Ui ))(E(Yi |Ui ) − E(Y b i |Ui )) + { (Zi − E(Z + β X Moreover, n X + 2 n X b i |Ui ))2 )−1 . b i |Ui )){E(Z b i |Ui ) − E(Zi |Ui )}( (Zi − E(Z (Zi − E(Z i=1 b i |Ui ))2 = (Zi − E(Z i=1 n X i=1 (9.2.17) n X i=1 (Zi − E(Zi |Ui ))2 b i |Ui )) + (Zi − E(Zi |Ui ))(E(Zi |Ui ) − E(Z (9.2.18) n X i=1 It follows that b i |Ui ) − E(Zi |Ui ))2 . (E(Z n 1 X [Zi − E(Zi |Ui )]εi 1 βb = β(P ) + + oP (n− 2 ) n i=1 E(Var(Zi |Ui )) (9.2.19) provided that we can show n 1 1X b (E(Zi |Ui ) − E(Zi |Ui ))2 = oP (n− 2 ) n i=1 (9.2.20) n 1X b i |Ui )) = oP (n− 21 ) (Zi − E(Zi |Ui ))(E(Zi |Ui ) − E(Z n i=1 (9.2.21) n 1X b i |Ui ))(E(Yi |Ui ) − E(Y b i |Ui )) = oP (n− 12 ) . (Zi − E(Z n i=1 (9.2.22) It’s relatively easy to show such results under an artificial modification of the definib which is nevertheless useful conceptually. Suppose we divide the sample into tion of E Section 9.3 161 Efficiency in Semiparametric Models {X1 , . . . , Xn−m } and {Xn−m+1 , . . . , Xn } where m ≍ nα with α < 1 to be chosen later. Use {Xn−m+1 , . . . , Xn } only to form the histogram estimates of E(Y |U = u) and b (2) (·|u). Then, define E(Z|U = u). Call these E βb∗ = Pn−m i=1 b (2) (Zi |Ui ))(Yi − E b (2) (Yi |Ui )) (Zi − E . Pn−m b (2) (Zi |Ui ))2 i=1 (Zi − E (9.2.23) b (2) (·|u) and X1 , . . . , Xn−m , the analogues The advantage is that, by the independence of E of (9.2.20)–(9.2.22) become fairly easy. Results, which we shall postpone to Chapter 11, enable us to conclude that then 2 2 b(2) (Y1 |U1 ) − E(Y1 |U1 ))2 = O(m− 3 ) = O(n− 3 α ) E(E (9.2.24) b (2) (Z1 |U1 ) holds under moment assumptions on Z and Y , the and a similar result for E smoothness of the densities of (U, Z) and (U, Y ), and tail assumptions on the density of U . It is then easy to verify (9.2.20)–(9.2.22) and hence that (9.2.19) holds for βb∗ . We leave the details to the problems and the validation of (9.2.19) without sample splitting to Chapter b is well known — see 11. The validity of (9.2.19) under different regularization estimates E Chen (1988) and Fan et al (1998) for instance. ✷ 9.3 Efficiency in Semiparametric Models As we noted in Sections 5.4.3 and 6.2.2, in regular parametric models, there is an asymptotic notion of being the “best” (efficient) estimate of a parameter ν(P ) ∈ R (or Rd ) among broad classes of estimates of ν which are asymptotically Gaussian in some uniform way. For instance, we showed in Theorem 6.2.1, that in a model P = {Pθ : θ ∈ Θ ⊂ Rd } which is “regular,” among M estimates of θ which are “regular,” the “unique” asymptoti1 cally best procedure is one which is equivalent to maximum likelihood to order oP (n− 2 ). “Regular” in these cases corresponds to assumptions A0–A6 of Section 6.2.2 holding for ˙ θ) the candidates’ influence function Ψ and for the optimal influence function ΨOPT = l(·, which corresponds to the MLE. Note that the conditions on ΨOPT correspond to smoothness conditions on the model. We also demonstrated through Example 5.4.2 (Hodges’ example) that some conditions were always necessary. The models considered in this subsection will include the important case in which the parameter ν is defined implicitly. That is, P = {P(ν,η) : ν ∈ Rd , η ∈ H} where η ranges over a function space and ν is defined by ν(P(ν,η) ) = ν. For example, in the symmetric location model (3.5.1), X ∼ P(ν,η) with density p(x, ν, η) = η(x − ν), x, ν ∈ R , where η is an arbitrary R unknown symmetric density. Here ν(P(ν,η) is uniquely defined as, for instance, ν(P ) = x dP (x), if P has a first moment. Similarly, in the Cox proportional hazard model (9.1.12), ν is identified with β and η(·) with λ(·). The partial linear model (9.1.51) has ν = β and η(·) = g(·). 162 Inference in semiparametric models Chapter 9 In this section we shall try to separate out assumptions on the model and on the candidate procedures and, using Hilbert space geometry, show how to extend results from Section 5.4.3 and 6.2.2 to semiparametric models. Regular Models Suppose P is a general model for the distribution P of a random vector X. Consider a parameter ν : P → R defined on P. Let Q be a one dimensional regular parametric submodel of P containing a point P0 in its interior. That is, (i) Q ⊂ P. (ii) P0 ∈ Q. (iii) Q = {Pt : |t| < 1} where Pt have continuous or discrete densities p(·, t). Let l(x, t) = log p(x, t). We define a model Q to be regular if (a) The map t → p(x, t) is continuously differentiable on (−1, 1) for (almost) all x. ˙ t) ≡ (b) l(x, ∂p 1 p(x,t) ∂t (x, t) is defined for (almost) all x, and for all |t| < 1 ˙ 0 < I(t) ≡ EPt [l(X, t)]2 = Z l˙2 (x, t)p(x, t)dµ(x) < ∞ where dµ(x) indicates sum or integral as defined in Section I.1; or more generally, µ is a measure. (c) The map t → I(t) is continuous. Remark 9.3.1. We only consider submodels of the form (iii) that satisfy the conditions (a), (b), and (c). These conditions are stronger than the Hellinger differentiability conditions of Le Cam given in Bickel, Klaassen, Ritov and Wellner (1993,1998) (abbreviated to BKRW), who show in Proposition 1, p. 13, that they imply the Le Cam conditions. They are easily seen to be implied by A0–A6 of Section 5.4.6 applied to ψ(x, t) ≡ l̇(x, t). ✷ Note that l(x, t)and its derivatives depend on Q and its parametrization. That is, if f (x, τ ) = p x, t(τ ) where τ (·) is one to one with inverse t(·), then ∇ log f at τ (t) is related to ∇ log p at t by the chain rule. However, we will take advantage of the equivariance of the Fisher information bound. See Problems 3.4.3 and 9.3.2. Regular Estimates We shall say νb is a regular estimate of ν(P ) on P if (i) νb is consistent on P. (ii) νb is uniformly locally asymptotically linear and Gaussian on all regular parametric submodels containing P0 for all P0 ∈ P. Section 9.3 163 Efficiency in Semiparametric Models What we mean by (ii) is the following: Fix P0 ∈ P. Let Q be any regular one dimensional submodel of P containing P0 and let Z Z L02 (P0 ) ≡ {h : h2 dP0 < ∞ and hdP0 = 0}. Then, there exists ψ(·, P0 ) ∈ L02 (P0 ) such that for X1 , . . . , Xn i.i.d. as X ∼ P0 , Pn 1 (a) νb = ν(P0 ) + n1 i=1 ψ(Xi , P0 ) + oP0 (n− 2 ) 1 (b) For any sequence tn → 0 such that tn = O(n− 2 ), √ ν − ν(Ptn ))) → N (0, σ 2 (ψ, P0 )) LPtn ( n(b R where σ 2 (ψ, P0 ) = ψ 2 (x, P0 )dP0 (x). (9.3.1) We call ψ(·, P0 ) the influence function of νb at P0 in agreement with √ our previous convention ν − ν(Pt )) converges in Section 7.2.1. Note that (b) is a weak uniformity condition: n(b 1 to the same limit in law under Pt as under P0 as long as t → 0 at rate n− 2 . Remark 9.3.2: Although conditions A0–A6 in Section 5.4.2 appear to purely pertain to P0 and not be uniform, they in fact imply that (9.3.1) holds via Le Cam’s theory of contiguity; see Section 9.5. However, the definition of regularity given here is stronger than that of BKRW. ✷ Efficient Estimates The following discussion sketches the derivation of a lower bound on the asymptotic variance of regular estimates given in Lemma 9.3.1 below. Consider any regular estimate νb with corresponding influence function ψ and any regular one dimensional submodel Q containing P0 . Note √ that q(t) = ν(Pt ) is a parameter on Q. Hence, we know that the asymptotic variance of nb ν at P0 is no smaller than the best we could do if we knew that the true P belonged to Q rather than to P. If νb were an M estimate of q(t), Pt ∈ Q, and if I(t) denotes the Fisher information for Pt , then this would mean that σ 2 (ψ, P0 ) ≥ [q ′ (0)]2 I −1 (0) ≡ I −1 (P0 : ν, Q) (9.3.2) by an extension of Theorem 5.4.3 to estimation of q(t) rather than just to estimation of t as in Proposition 3.4.2. Note that the definition of I −1 (P0 : ν, Q) given at the end of (9.3.2) indicates that the bound does not depend on the parametrization of Q (Problem 9.3.2). Suppose Z q(t) = ν(Pt ) = ν(P0 ) + ψ(x, P0 )dPt (x) + o(t) (9.3.3) and ∂ ∂t Z ψ(x, P0 )dPt (x)|t=0 = Z ˙ 0) dP0 (x) . ψ(x, P0 )l(x, (9.3.4) 164 Inference in semiparametric models Chapter 9 Expression (9.3.3) says that the same influence function approximation is valid for the {Pt } as for the empirical probability Pb and (9.3.4) that integration and differentiation can be interchanged. Together, (9.3.3) and (9.3.4) imply that q ′ (0) = ∂ν ˙ (Pt )|t=0 = CovP0 (ψ(X, P0 ), l(X, 0)) . ∂t (9.3.5) It may be shown (see Section 9.5) that by Le Cam’s contiguity theory, (9.3.5) is implied by regularity of Q and νb as we have defined these. It follows that (9.3.2) can be extended to all regular νb, Q. That is, Lemma 9.3.1. For all P0 ∈ P and all ψ corresponding to regular νb on P, σ 2 (ψ, P0 ) ≥ sup{I −1 (P0 : ν, Q) : Q ⊂ P, Q regular 1 dimensional containing P0 } Q ≡ I −1 (P0 : ν, P) . (9.3.6) Proof. By (9.3.5) and (9.3.2), 2 ˙ I −1 (P0 : ν, Q) = CovP0 ψ(X, P0 ) , l(X, 0)/I(0) I(0) , ˙ where I(0) = EP0 [l(X, 0)]2 by (iii)(b). The result follows by Cauchy-Schwarz. Note that I −1 (P0 : ν, P) doesn’t depend on any parametrization of P. ✷ Definition 9.3.1. We call an influence function ψ ∗ (·, P0 ) corresponding to a regular estimate νb∗ of ν(P ), P ∈ P, achieving the bound in (9.3.6), the efficient influence function for ν in P at P0 and we call νb∗ an efficient estimate at P0 . This ψ ∗ is often denoted by e l(·, P0 : ν, P). Adopting the terminology of Section 3.3, νb∗ is called asymptotically minimax over the class of regular one dimensional submodels Q of P. We will find such νb∗ for the class of regular estimates, but it is true generally. See BKRW. Next we sketch the derivation of a method for finding such νb∗ given in Proposition 9.3.1 below. Evidently, for ψ ∗ as in Definition 9.3.1 and all ψ corresponding to regular νb on P, we need to show that σ 2 (ψ ∗ , P0 ) ≤ σ 2 (ψ, P0 ) . On any Q as above, by Theorems 5.3.3 and 5.4.3, any νb∗ regular on Q, achieving I −1 (P0 : ν, Q), is uniquely characterized by its influence function, ψ ∗ (x, P0 ) = q ′ (0) l̇(x, 0) . I(0) (9.3.7) Note that ψ ∗ (·, P0 ) satisfying Definition 9.3.1 cannot depend on the parametrization of Q. We can show (Problem 9.3.2) that if f (x, τ ) = p(x, t(τ )) as above with t(0) = 0 and t′ (τ ) > 0, then l̇f = t′ (0)l˙p . Thus if we let ˙ 0)] = {cl(·, ˙ 0) : c ∈ R} , Q̇(P0 ) ≡ [l(·, (9.3.8) Section 9.3 Efficiency in Semiparametric Models 165 ˙ 0) depend on the parametrization of Q, Q̇(P0 ) does not. For a then, although ψ ∗ and l(·, parameter ν corresponding to a regular estimate νb, we may rephrase (9.3.7) as: “ψ ∗ (·, P0 ) is the unique member of Q̇(P0 ) such that EP0 [ψ ∗ (X, P0 )]2 = I −1 (P0 : ν, Q).” (9.3.9) Note that νb∗ corresponding to ψ ∗ in (9.3.9) may not be regular or even consistent on 2 2 P. Suppose, for instance, that P = {N (µ, (0, 1) and Pσn ) : 2µ ∈ R, σ > 0}. Let P0 = N 2 2 −1 2 Q = {N (0, σ ) : σ > 0}. Then, n X is an efficient estimate of σ on Q, but i=1 i is inconsistent for all P ∈ P with mean 6= 0. However, if we can assume that νb∗ is regular on all of P, we can conclude Proposition 9.3.1. If νb∗ is a regular estimate of ν(P ), P ∈ P, and its influence function ψ ∗ (·, P0 ) is in Q̇(P0 ) for some regular 1 dimensional Q, then σ 2 (ψ ∗ , P0 ) = I −1 (P0 : ν, P) , so that νb∗ is efficient at P0 . Proof. ψ ∗ necessarily satisfies (9.3.9) for Q and is efficient for Q. Thus, σ 2 (ψ ∗ , P0 ) = I −1 (P0 : ν, Q) . By definition I −1 (P0 : ν, P) ≥ I −1 (P0 : ν, Q) = σ 2 (ψ ∗ , P0 ). On the other hand, by Lemma 9.3.1, because νb∗ is regular on P, σ 2 (ψ ∗ , P0 ) ≥ I −1 (P0 : ν, P) , and the proposition follows. ✷ Adopting the terminology of Section 3.3, Q is asymptotically the least favorable regular one dimensional submodel for estimating ν(P ) using regular estimates. We next give a simple example to help interpret the new concepts and to show connections to concepts in Volume I. Example 9.3.1. Suppose X = (Z, Y ) with Z ∈ Rd , Y ∈ R. Let X1 , . . . , Xn be i.i.d. as X ∼ P where Y = ZT β + ε . Here ε ∼ N (0, σ02 ), σ02 is known, Z and ε are independent, and ZZT is nonsingular (cf Examples 6.2.1 and 9.1.1). Let P be the class of distributions of X, β ∈ Rd , and let Qa be the class of distributions Pt,a of X with β t = β 0 + ta, where β 0 ∈ Rd and a ∈ Rd are fixed and t ∈ R is the parameter. If la is the log likelihood for Qa , then, as a function of 166 Inference in semiparametric models Chapter 9 βt , n la (x, t) = − l˙a (x, 0) = 2 1 X yi − zTi βt + constant 2 i=1 n X (zTi a) yi − T 2 i=1 T Ia (0) = nE(a Z a) . zTi βt ≡ n X (zTi a)ei i=1 If we let cj be the d-vector with 1 in the jth entry and 0 elsewhere, then the equations l̇cj (x, 0) = 0, j = 1, . . . , d, are the normal equations as in Example 2.1.1, and yield the MLE of β as in Examples 6.1.2 and 6.2.1. As in Example 6.2.1, consider the parameter ν = ν(P ) = β1 . b = (βb1 , . . . , βbd ) be the MLE and set νb = βb1 . Then νb is regular and by (6.2.22) Let β the influence function is ψ1 (x, β) where ψ1 is the first entry in ψ = (ψ1 , . . . , ψd )T = E(ZZT )−1 ZT e. For this Qa and this ν, q(t) = ν(Pt,a ) = β01 + a1 t; and Pn a1 i=1 (ziT a)ei a21 −1 ∗ , I (P : ν, Q ) = . ψ (x, P0 ) = 0 a nE(aT ZT a)2 nE(aT ZT a)2 Morever, independently of ν, n X (zTi a)ei : c ∈ R . Q̇a (P0 ) = c i=1 Finally, using calculus (Problem 9.3.3), we find −1 sup I −1 (P0 : ν, Qa ) : a ∈ Rd = nE(ZZT ) (1,1) . The right hand side coincides with the information bound in Example 6.2.1 and we can conclude that βb1 is efficient and −1 I −1 (P0 : ν, P) = nE(ZZT ) . (1,1) ✷ Remark 9.3.3. The d dimensional case. These concepts and results extend readily from ν(P ) ∈ R to ν(P ) ∈ Rd by using Section 6.2. In this case, (9.3.6) becomes a matrix inequality where A ≥ B for d × d matrices A and B means that (A − B) is nonnegative definite. See Section B.10.2. Note that if we compute the influence function ψj∗ of each b is the vector of influence functions βbj in Example 9.3.1, then the influence function of β ∗ ∗ T (ψ1 , . . . , ψd ) . See Theorem 6.2.2 for the parametric case. ✷ Section 9.3 Efficiency in Semiparametric Models 167 Remark 9.3.4 It may be shown that, in parametric models P satisfying the conditions of Theorem 5.4.3 or 6.2.2, the MLE is efficient in the sense of Definition 9.3.2. For the 1 dimensional case this is clear, but for Θ ⊂ Rd , d > 1, it has to be shown that being best on all 1 dimensional submodels is equivalent to optimality in the full model (Problem 9.3.4). ✷ Remark 9.3.5. This method of approaching efficiency in the semi- and nonparametric context is due to Stein (1956(a)). It is encompassed in a much more general context by Le Cam (1986) and Le Cam and Yang (1990). For more background, see BKRW (Bickel, Klaassen, Ritov and Wellner (1998)). ✷ In Subsection 9.1.2, we have examined modified likelihood approaches to constructing estimates in semiparametric models. In this subsection, we check if such estimates are efficient, and determine potential efficient influence functions in advance of constructing estimates we hope to be efficient. Definition 9.3.2. The tangent set Ṗ 0 (P0 ) of P at P0 is the union of the set of all Q̇(P0 ) obtained by varying Q over all regular 1 dimensional submodels of P containing P0 . By definition, Ṗ 0 (P0 ) is a subset of the set L02 (P0 ) of mean zero L2 (P0 ) functions. Definition 9.3.3. The tangent space Ṗ(P0 ) of P at P0 is the linear closure of Ṗ 0 (P0 ) 0 in L02 (P0 ), that is, the smallest closed linear subspace of L02 (P P0m) which contains0Ṗ (P0 ). Thus Ṗ(P0 ) is the closure of the set of functions of the form j=1 cj ψj , ψj ∈ Ṗ (P0 ). We use this notation generically, i.e. P can be any model and, in particular, any regular 1 dimensional Q. ˙ 0) over regular 1 Note that Ṗ(P0 ) is not dependent on any ν. It is computed from l(·, dimensional parametric models. Ṗ(P0 ) will be used to find efficient estimates for semiparametric models containing both d dimensional and function parameters. Remark 9.3.6. The parameters set for Pt were defined in (iii) as (−1, 1). By writing t = (t/ǫ)ǫ, we see that we may use (−ǫ, ǫ) as the parameter set for t. See the proof of Proposition 9.3.2 for a case where this is useful. Here is a result that will help interpret Ṗ(P0 ) : Proposition 9.3.2. Suppose P = {Pθ : θ ∈ Θ} is a d dimensional parametric model with l0 (·, θ) = log f (·, θ), which satisfies A0–A6 of Theorem 6.2.2, and suppose Θ ⊂ Rd is open. Then, if P0 ≡ Pθ 0 , θ0 ∈ Θ, ∂l0 ∂l0 (x, θ 0 ), . . . , (x, θ 0 ) , Ṗ(P0 ) = ∂θ1 ∂θd where [S] is the linear span of S, that is, the set of all linear combinations of elements in S. Proof. The assumptions A0 – A6 imply that for all a 6= 0, and ǫ sufficiently small, t → Pθ 0 +ta , |t| < ǫ, is a regular one dimensional parametric submodel Q with l̇(·, 0) = 168 Inference in semiparametric models Chapter 9 i h ∂l0 ∂l0 ∂l0 a (·, θ ) ⊂ Ṗ(P0 ). On the other hand, if (·, θ ). Thus, (·, θ ), . . . , 0 0 0 j=1 j ∂θj ∂θ1 ∂θd p(x, t) = f x, θ(t) is a one dimensional submodel, then Pd d X ∂θj ∂l0 ˙ 0) = ∂l0 (x, θ(t)) |t=0 = (x, θ0 ) (0) l(x, ∂t ∂t ∂θj j=1 ∂l0 ∂l0 belongs to [ ∂θ , . . . , ∂θ ]. 1 d ✷ This proposition makes the rationale for calling Ṗ the tangent space clear. If we think of P as a smooth d dimensional manifold on the set of all probability distributions viewed as a much higher dimensional space, then the tangent space at P0 is, as in the Euclidean case, the linear space spanned by all tangents of smooth curves (1 dimensional regular models) through P0 . Let Π(h|L) denote the projection in L2 (P0 ) of any h in L2 (P0 ) onto the closed linear space L ⊂ L2 (P0 ) with inner product (h1 , h2 ) = CovP0 h1 (X), h2 (X) and norm khk = 1 (h, h) 2 . That is π(h|L) = arg min{EP0 [h(x) − g(x)]2 : g ∈ L} . Such projections are exemplified in Section 1.4 and discussed in terms of Hilbert spaces in Appendix B.10.3. In particular, if L is the space of functions of the form a + bX, Theorem 1.4.3 states that for X, Y with E(X) = E(Y ) = 0, π(Y |L) = Cov(X, Y )/EX 2 X = (X, Y )/|X|2 X . The natural candidate for efficient influence function is given by Proposition 9.3.3. Suppose Ṗ 0 (P0 ) is linear and ψ(·, P0 ) is the influence function of any d dimensional regular estimate. Then, the efficient influence function is given by, ψ ∗ (·, P0 ) ≡ e l(·, P0 : ν, P) = Π(ψ(·, P0 )|Ṗ(P0 )) . (9.3.10) Proof. We claim that the result holds if P = Q, a regular 1 dimensional submodel since by Theorem 1.4.3, (9.3.5), and (9.3.7), Π(ψ(·, P0 )|Q̇(P0 )) = ′ ˙ 0)) (ψ(·, P0 ), l(·, ˙ 0) = q (0) l̇(·, 0) l(·, ˙ 0)k2 I(0) kl(·, which we have shown is an efficient influence function for Q. We now write Q̇, Ṗ 0 for Q̇(P0 ), Ṗ 0 (P0 ). Since e l is itself an influence function, Π(e l|Q̇) = Π(ψ|Q̇) for all Q̇ ∈ Ṗ 0 . Since Ṗ 0 is linear, it follows that Π(e l|Ṗ) = Π(ψ|Ṗ) . Section 9.3 169 Efficiency in Semiparametric Models But, by assumption and what we have demonstrated, ke lk2 = I −1 (P0 : ν, P) = sup{ kΠ(e l|Q̇)k2 : Q ⊂ P, Q regular 1 dimensional containing P0 } . Q Thus, there exist Q̇m , m ≥ 1, such that ke lk2 = lim kΠ(e l|Q̇m )k2 which by Pythagoras’ theorem B.10.3.1 gives ke l − Π(e l|Q̇m )k2 → 0 . Thus, e l ∈ Ṗ(P0 ) and the result follows. ✷ Remark 9.3.7. We assume that there exists a regular estimate with influence function ψ(·, P0 ) for the parameter ν. Such estimates exist under mild conditions. See BKRW. It is not hard to derive Theorem 3.4.3 from Proposition 9.3.3 and the standard formula for projection onto a finite dimensional linear space (B.10.20) (Problem 9.3.5). This proposition and some others we are about to state are used in a semi-rigorous fashion. As we shall see in the applications (called examples) to important models which follow, we calculate the structure of tangent spaces formally, usually ending up with linear spaces which we can only be sure are subspaces of the tangent space. However, if we can show that the influence function of a regular estimate νb∗ belongs to the subspace we have identified or that the projection of an arbitrary influence function on such a subspace is the influence function of a regular estimate νb∗ , we can conclude without further ado that νb∗ is efficient. The following results given in BKRW are to be taken in that formal spirit also. That is, equalities as stated are really true only under regularity conditions which are model dependent. Proposition 9.3.4. Additive property of the efficient score function. Suppose P = {P(g,h) : g ∈ G, h ∈ H}, where G, H may be Euclidean, function spaces, or abstract. Let P0 = P(g0 ,h0 ) and let P1 (P0 ) ≡ {P(g,h0 ) : g ∈ G} , P2 (P0 ) ≡ {P(g0 ,h) : h ∈ H} . Then, Ṗ(P0 ) = Ṗ1 (P0 ) + Ṗ2 (P0 ) . We have encountered parameters ν that are defined implicitly by equations of the form ν(Pθ,h ) = θ. For instance, in linear regression or Cox regression the regression parameter vector β is defined this way. Similarly, so is the location parameter in the location problem. For such parameters there is a geometric way of obtaining efficient influence functions e l via orthogonal projections of the score function for θ on tangent spaces corresponding to nuisance parameters.. 170 Inference in semiparametric models Chapter 9 Proposition 9.3.5. The efficient influence function for implicitly defined parameters. Suppose G ≡ Θ open ⊂ R, and let lθ ≡ l(·, P(θ,h0 ) ), θ ∈ Θ, be the log likelihood for θ. θ Suppose ν(P(θ,h) ) = θ for all h ∈ H. Let P0 ≡ P(θ0 ,h0 ) and l˙ ≡ ∂l ∂θ (·)|θ=θ0 . Then, e l(·, P0 : ν, P) = l̇ − Π(l|˙ Ṗ2 (P0 )) , ˙ kl − Π(l|˙ Ṗ2 (P0 ))k2 I −1 (P0 : ν, P) = kl˙ − Π(l|˙ Ṗ2 (P0 ))k−2 . (9.3.11) (9.3.12) ✷ Note that e l is a normalized version of the projection of l˙ onto the orthocomplement of the nuisance parameter tangent space. The normalization ensures that (e l, e l) = 1, which is a requirement of influence functions; see BKRW (1998) for details. The tangent space of a nonparametric model Proposition 9.3.6. Let M0 be a model that contains the collection of all probabilities on R with continuous bounded densities. Then, Ṁ0 (P0 ) = L02 (P0 ). Before proving this proposition, we note its satisfying implication. Corollary 9.3.1. If ν : M0 → R is a parameter for which a regular estimate νb exists, then νb is efficient. Proof: The result is immediate from Proposition 9.3.4, since every influence function ψ(·, P0 ) ∈ L02 (P0 ) and is its own projection. ✷ For instance, if ν = g(µ), µ = E(X), then νb = g(X̄) is nonparametrically efficient for the model M0 provided g is differentiable at µ. See Theorem 5.3.3. Proof R of Proposition 9.3.4. It is enough to show that Ṁ0 (P0 ) contains all bounded h such that h(x)dP0R(x) = 0. Given such an h, let p(x, t) = exp{th(x) − A(t, h)}p(x, 0) where A(t, h) = log eth(x) dP0 (x). Then, p(·, t) is a canonical exponential family and a regular submodel with ˙ 0) = ∂ log p(x, t)|t=0 = h(x) − ∂A (0, h) . l(x, ∂t ∂t But by Corollary 1.6.1, Z ∂A (0, h) = h(x) dP0 (x) . ∂t R ˙ 0) contains h with h(x) dP0 (x) = 0 and the result follows. Thus l(x, ✷ Examples We finish this section by obtaining candidate efficient influence functions for some of our examples, seeing whether we have obtained estimates efficient at particular P0 or on all Section 9.3 Efficiency in Semiparametric Models 171 of P. As we have noted we usually do not check that we have computed the full tangent space but rather identify a subspace that gives the structure of what elements of the tangent space should look like. Then we check whether an estimate νb obtained by some criteria such as likelihood or modified likelihood has an influence function that belongs to the subspace we have identified. The subspace is computed using the functional delta methods of Section 7.2. Example 9.3.2. The semiparametric linear model (Example 9.1.1). Tangent spaces and influence functions. As we have noted we usually do not check that we have computed the full tangent space but rather identify the structure of what elements of the tangent space should look like. We consider two models that correspond to assumptions and notation given earlier in Example 9.1.1 as (a) and (b) (with σ ≡ 1, variability in σ is absorbed into the parameter f below). Model (a). Here, p(z, y; α, β, h, f ) = h(z)f (y − α − β T z) . Suppose P0 corresponds to the parameter value (α0 , β0 , h0 , f0 ) and consider formally 1 dimensional models, αt = α0 +t∆α , βt = β 0 +t∆β , log ht = log h0 (z)+t∆h (z), log ft = log f (y) + t∆f (y). Formally, if Pt corresponds to the parameter value (αt , βt , ht , ft ) and has density pt (x) with x = (z T , y)T , then ∂ f′ (9.3.13) log pt (x)|t=0 = ∆h (z) + ∆f (ε) + ∆α 0 (ε) , ∂t f0 R R where ∆h (z) h0 (z) dz = ∆f (ε) f0 (ε) dε = 0 and ε ≡ y − α0 − βT0 z. Thus, we expect that f′ f′ Ṗ(P0 ) = {a(Z) + b(ε) + [Z 0 (ε)] + [ 0 (ε)]} , f0 f0 b has influence function where a, b are arbitrary functions in L2 (P0 ). Now the LSE β (9.3.14) ψ(x, P0 ) = [Var(Z)]−1 Z − E(Z) ε which belongs to Ṗ(P0 ) if f0′ /f0 (ε) = cε for some c 6= 0, that is, if f0 corresponds to a mean 0 Gaussian distribution. It follows from our analysis in Example 6.2.2 that, if f0 is true, the efficient influence function corresponds to the MLE obtained by assuming f0 is true and known. Since we don’t know f0 , where does this leave us? It turns out that this is a situation where we can b such adapt in the sense of Example 6.2.1. That is, it is possible to construct estimates β that n X 1 f′ b =β+ 1 β (9.3.15) [Var(Z)]−1 Zi − E(Z) − 0 (εi ) + oP (n− 2 ) n i=1 f0 by estimating f0 properly. Since (9.3.15) is how the best estimate of β behaves if we knew f0 (but not α), this is the best we can hope to do. We refer to Chapters 11 and 12 of this volume as well as BKRW and van der Vaart (1998) for such constructions. 172 Inference in semiparametric models Chapter 9 Model (b). Here, for simplicity, take z ∈ R. Then, p(z, y, α, β, f ) = f (z, y − α − βz) (9.3.16) subject to Z (y − α − βz)f (z, y − α − βz)dy = 0 (9.3.17) for all z but f is otherwise an arbitrary density. Fix α, β at their true values and consider the one dimensional submodel log ft = log f0 + t∆(z, ε) with ε ≡ y − α − βz. Then, and (9.3.17) is, for all z, which is equivalent to ∂ log ft = ∆(z, ε) ∂t Z ε∆(z, ε)f (z, ε)dε = 0 E(ε∆(Z, ε)|Z) = 0 . (9.3.18) We see from this and (9.3.14) that the influence function of the LSE βb does not belong to Ṗ(P0 ) unless E(ε2 |Z) = 0. The efficient estimate which can, under some conditions, be constructed has influence function f′ E f00 (Z, ε)ε|Z (Z − EZ) f0′ (Z, ε) − ε , (9.3.19) I(P0 , β) f0 E(ε2 |Z) where and f′ n E 2 f00 (Z, ε)ε|Z o ′ 2 f0 2 I(P0 , β) = E (Z − EZ) (Z, ε) − f0 E(ε2 |Z) ∂ f0′ (Z, ǫ) = log f0 (Z, v) f0 ∂v v=ε . For these claims see Problem 9.3.8. Example 9.3.3. Tangent spaces for biased sampling (Examples 9.1.2, 9.1.8, and 9.2.1). If k = 1, Ṗ(P0 ) = L2 (P0 ) by (9.1.4). For k > 1, proceed formally as usual. Write, since (w1 , . . . , wk ), λ are known, pt (j, y) = λj wj (y)ft (y) Wj (Ft ) Section 9.3 173 Efficiency in Semiparametric Models and, given h such that R h(y)dF0 (y) = 0, ft (y) = exp{th(y) − A(t, f0 )}f0 (y) . Z Wj (Ft ) = wj (y) exp{th(y) − A(t, f0 )} dF0 (y) . Then, ˙ t (y), 0) = ∂ log pt (j, y)|t=0 = h(y) − A′ (0, f0 ) − l(f ∂t R h(y)wj (y) dF0 (y) . Wj (F0 ) Thus, at least for any bounded h, if X = (I, Y ), ˙ l(X, 0) = h(Y ) − E0 (h(Y )|I) . (9.3.20) It is, in fact, true that the tangent space is Ṗ(P0 ) = {a(I) + b(Y ) ∈ L02 (P0 ) : E0 (b(Y )|I) = −a(I)} . (9.3.21) It is not hard to check that the influence function of the estimate, θ(Fb ) of θ(F ) = F (y), or more generally, Z θ(F ) = v(y) dF (y) is of the form specified in Ṗ(P0 ), and efficiency of θ(Fb ) follows provided regularity can be shown. ✷ Example 9.3.4. Censoring and truncation tangent spaces (Examples 9.1.3, 9.1.9, and 9.2.2). It turns out that, in these cases, Ṗ(P0 ) = L02 (P0 ) so that any regular estimate of a parameter is efficient for that parameter. This follows, since given any distribution P of (Y, δ), there is a unique (F, G) such that P = P(F,G) . This is a consequence of f (y) = λF (y) e−ΛF (y) and λF (y) = h1 (y) H̄(y) where h1 (y) = d H1 (y), dy H1 (y) = P [Y ≤ y, δ = 1], H(y) = P [Y ≤ y] are determined by P and determine P (as functions) — see Problem 9.3.9. ✷ 174 Inference in semiparametric models Chapter 9 Example 9.3.5. The Cox model tangent space (Examples 9.1.4, 9.1.10, and 9.2.3). Here the model is p(z, y; β, λ, h) = h(z)λ(y)r(z, β) exp{−r(z, β)Λ(y)} . Let P0 correspond to the parameter values β0 , λ0 , h0 . We will use the additive property of efficient score functions (Proposition 9.3.4) and compute tangent spaces for each parameter with the other parameters fixed then combine these tangent spaces. First, with λ = λ0 and d h = h0 fixed, β t = β 0 + R yta, a ∈ R , p1t the density p corresponding to the parameters βt , λ0 , h0 , and Λ0 (t) = 0 λ0 (s)ds, we find ∂ Ṗ1 (P0 ) : log p1t ∂t t=0 =a T ṙ (z, β 0 ) − ṙ(z, β 0 )Λ0 (y) r (9.3.22) where ṙ is the gradient ∇r(·, β) with respect to β. Similarly, let β = β0 , h = h0 be fixed, let b(y) be a function with E[b(Y )] = 0, and set λt (y) = λ0 (y) exp{tb(y)}. If p2t is the density for these parameters, we find Z y ∂ Ṗ2 (P0 ) : b(s)dΛ0 (s) . log p2t t=0 = b(y) − r(z, β 0 ) ∂t 0 Next, with β = β0 , λ = λ0 fixed and log ht (z) = log h0 (z) + c(z)t where E[c(Z)] = 0; and with p3t the corresponding density, we find Ṗ3 (P0 ) = ∂ log p3t ∂t t=0 = c(Z) . It follows formally that the tangent set Ṗ(P0 ) = Ṗ1 (P0 ) + Ṗ2 (P0 ) + Ṗ3 (P0 ) is the class of functions of the form Z y ṙ aT (z, β 0 ) − ṙ(z, β 0 )Λ0 (y) + b(y) − r(z, β 0 ) b(s)dΛ0 (s) + c(z) r 0 (9.3.23) for functions b(·) and c(·) with E b(Y ) = E c(Z) = 0. The influence function of the Cox estimate as given by (9.2.15) can be shown to be of this form for the Cox model when r(z, β) = exp{βT z}. In this case ṙ = z exp(β T0 z) and (ṙ/r) = z, and with Y = failure time, T S0 (y, β, P ) = EP eβ Z 1(Y ≥ y) . Let Ṡ0 stand for the gradient with respect to β. We proceed with β ∈ R, then Ṡ0 (y, β, P ) = EP ZeβZ 1(Y ≥ y) . Now, in the case where 1 y ≤ τ = 1, differentiate (9.2.15) with respect to y and show that it may be written in the form (9.3.23). We can show that the derivatives are equal; see Problem 9.3.10. Efficiency follows. ✷ Section 9.4 175 Tests and Empirical Process Theory Example 9.3.6: Partial linear model tangent space (Examples 9.1.13 and 9.2.4). Assuming that ε ∼ N (0, σ 2 ) , we obtain as members of Ṗ(P0 ), ZT ε, and (ε2 /σ02 ) − 1. The nonparametric g, appearing as gt in a 1 dimensional submodel, yields ∂ 1 ε ∂gt (Y − β T0 Z − gt (U ))2 |t=0 = − 2 (U )|t=0 . 2 ∂t 2σ0 σ0 ∂t This gives us Ṗ(P0 ) = ε2 − 1, h(U )ε, w(Z), Zε σ02 b may be for all h, w such that E0 h2 (U ) < ∞, E0 w2 (Z) < ∞. The influence function of β shown to be [E Var0 (Z|U )]−1 (Z − E0 (Z|U ))ε whose components do indeed belong to Ṗ(P0 ). ✷ We leave further consideration of these examples and other models to the problems. 9.4 Tests and Empirical Process Theory It is natural, when entertaining a parametric model, P = {Pθ : θ ∈ Θ}, to consider its consistency with the data. If we have no strong a priori evidence of departure from the model in any particular direction, we naturally begin by considering H : P ∈ P versus K : P 6∈ P. As usual, with a formulation as general as this one, nothing can be done. But begin with a blanket hypothesis such as our usual one, X = (X1 , . . . , Xn ), Xj i.i.d F . Then we can identify P with F = {Fθ : θ ∈ Θ}, Θ ⊂ Rd Euclidean, a regular parametric model and test H : F ∈ F vs K : F 6∈ F , using measures of distance between F and F as we discussed in Section 4.1 and Example I.2. Here failure to reject the hypothesized model only suggests that a model may be an adequate approximation. Recall from Section 1.1.1 that “Models, of course, are never true but fortunately it is only necessary that they be useful,” Box (1979). We pursue model testing in more detail here. Strictly speaking, we are dealing with parametric hypotheses in nonparametric models. Example 9.4.1. Testing goodness-of-fit to the Gaussian distribution (Example 4.1.6 and Section I.2 continued). Consider the test statistic Tn introduced in Example 4.1.6 for testing for some µ, σ vs K : F not Gaussian, H : F (·) = Φ ·−µ σ x − X̄ b b Tn ≡ sup |F (X̄ + σ bx) − Φ(x) | = sup F (x) − Φ . σ b x x We claim that, under H, L(Tn ) → L(sup |Z(x)| ) (9.4.1) x where Z(·) is a Gaussian process with mean 0 and, if x ≤ y, 1 Cov(Z(x), Z(y)) = Φ(x)(1 − Φ(y)) − ϕ(x)ϕ(y) − xy ϕ(x)ϕ(y) 2 (9.4.2) 176 Inference in semiparametric models Chapter 9 As we have noted, to obtain critical values for this statistic and some others we study below, this result is unnecessary since we can always employ the Monte Carlo method described in Examples 4.1.6 and 10.1.1. However, the proof gives us a sense of how the behaviour of this statistic differs from the Kolmogorov statistic of Example 4.1.5 and suggests a general approach for constructing and analyzing goodness-of-fit statistics for composite hypotheses. To establish (9.4.1), assume without loss of generality that the true member of H under which we compute is N (0, 1). Then, √ √ x − X̄ b Zn (x) = n(F (x) − Φ(x)) + n Φ − Φ(x) σ b ≡ En (x) + ∆n (x), where En (·) is the empirical process. Expand ∆n (x) as a function of X̄ and σ b2 around 0 and 1 to get (Problem 9.4.1) √ √ x σ 2 − 1) + Rn (x) ∆n (x) = −ϕ(x) nX̄ − ϕ(x) n(b 2 (9.4.3) where supx |Rn (x)| = OP (n−1 ). Let 1x (y) = 1(y ≤ x) and fx (y) = 1x (y) − ϕ(x)y − xϕ(x) 2 (y − 1) . 2 By definition En fx (y) = Z √ xϕ(x) √ fx (y)dEn (y) = Zn (x) − ϕ(x) nX̄ n(b σ 2 − 1) . 2 en − Zn k∞ = oP (1). It follows that So if we define Zen (x) ≡ En (fx ), (9.4.3) yields kZ en . It is easy to show that {fx (·) : x ∈ R} satZn =⇒ Z0 where Z0 is the limit of Z isfies the conditions of Theorem 7.1.5 and hence that Z0 (x) = Z(x) = WΦ0 (fx (·)) ≡ 0 W Φ(x) where, in the notation of Section I.1, W 0 (·) is the Brownian bridge on [0, 1]. That Cov(Z(x), Z(y)) is as specified in (9.4.2) can be calculated directly (Problem 9.4.1). Formula (9.4.2) shows that the limit distribution of Tn is different from that (see 7.1.24) of the Kolmogorov-Smirnov statistic corresponding to H : F = Φ but this gives little further insight. However, here is another derivation of (9.4.2). We introduce some notation; see Appendix B.10. If H is a Hilbert space and G ⊂ H, let [G] be the smallest closed linear subspace of H containing G. If G = {h1 , . . . , hk }, then [G] = {c1 h1 + . . . + ck hk : c = (c1 , . . . , ck ) ∈ Rk }. Take H = L2 (Φ), L = [X, X 2 − 1] where X ∼ N (0, 1). Then [X] ⊥ [X 2 − 1] and for any g(X) ∈ L2 (Φ), if Π(·|L) denotes projection on the closed linear space L, Π(g|L) = Π(g|[X]) + Π(g|[X 2 − 1]) and hence Π(g|L)(y) = E(g(X)X)y + Eg(X)(X 2 − 1) 2 (y − 1) . 2 Section 9.4 177 Tests and Empirical Process Theory If g(y) = 1(y ≤ x) which we abbreviate as 1x (y), Z x Eg(X)X = yϕ(y)dy = −ϕ(x) −∞ Z x Eg(X)(X 2 − 1) = y 2 ϕ(y)dy − Φ(y) −∞ Z x yϕ(y)dy − Φ(y) = −xϕ(y) . = − −∞ Then fx (·) = 1x (·) − Π(1x |L)(·) and, by the usual properties of projections, Cov(WΦ0 (fx ), WΦ0 (fy )) = CovΦ (fx , fy ) = CovΦ (1x (X), 1y (X)) − CovΦ (Π(1x |L), Π(1y |L)) = CovΦ (Π⊥ (1x (X)), Π⊥ (1y (X)) (9.4.4) where Π⊥ is projection on the orthocomplement of Ṗ(P0 ) in L02 (P0 ) and we write CovΦ (g, h) for Cov(g(X), h(X)) with X ∼ N (0, 1). Thus, as we might expect in the limit, the process Zn (·) is more concentrated than En (·) in the sense that all linear functionals of Zn (·) have smaller variances than the same functional of En (·) in the limit. The implications of this calculation for power are intuitively correct. The tests based on Zn (·) are less powerful against the same alternative than the corresponding tests based on En (·). See Problem 9.4.5. We generalize this argument in the next example. ✷ Other test statistics’ limit distributions can be similarly derived. For instance, the Cramér-von Mises type statistic for this situation is given by (see Problem 4.1.1 and Example 7.1.6) Z ∞ x − X̄ 2 Sn ≡ n (Fb(x) − Φ( ) ) dΦ(x) (9.4.5) σ b −∞ R∞ Clearly, h → −∞ h2 dΦ, is a continuous map from L∞ (R) to R. Hence, Z LΦ (Sn ) −→ L( ∞ −∞ [WΦ0 ]2 (fx (·))dΦ(x) ). This limit law has an explicit form — see Shorack and Wellner (1986), for instance. Example 9.4.2. Goodness of fit to a general smooth parametric model. We assume that P ≡ {Pθ : θ ∈ Θ ⊂ Rd } is a smooth parametric model satisfying the assumptions of Theorem 6.2.2. Let T be a set of functions on X satisfying the assumptions of Theorem 7.1.5. Let, for f ∈ T , n 1 X (f (Xi ) − E b (f (X)) Zn (f ) = √ θ n i=1 (9.4.6) b is the MLE of θ. Here we compute Eθ f (X) then substitute θ = θb to get where θ Eθb f (X) . It is reasonable that suitable goodness-of-fit statistics for H : P ∈ P should 178 Inference in semiparametric models Chapter 9 be based on the magnitude of |Zn (·)|. For instance, if X = R, f (x) = 1x , the natural generalization of the Kolmogorov statistic is 1 Tn = sup |Fb(x) − F b (x)| = √ sup |Zn (1x )|. θ n x x We claim that, under H with the data X generated by P0 ≡ Pθ0 , Zn (·) =⇒ Zθ0 (·) . (9.4.7) 0 where Zθ0 (f ) = Wθ0 (f − Πθ0 (f )). Here Wθ0 = WP00 , Πθ0 (f ) ≡ Π(f | [∇l(X, θ0 )] ) 0 0 0 denotes projection in L2 (P0 ), and ∇l(X, θ0 ) = ∂l ∂l ∂θ1 (X, θ 0 ), . . . , ∂θd (X, θ 0 ) T . Then, Cov(Zθ0 (f ), Zθ0 (g) ) = Covθ0 (f (X), g(X) ) − Covθ 0 (Πθ 0 (f )(X), Πθ 0 (g)(X) ) 0 0 ⊥ = Covθ0 (Π⊥ θ (f ), Πθ (g)) . (9.4.8) 0 0 We have discussed the consequences of this formula in terms of power above. The proof of (9.4.7) is a straightforward extension of (9.4.4). Write √ Zn (f ) = En (f ) − n(E b f (x) − Eθ 0 f (x)). θ Write the last term as √ n Z b − p(x, θ0 ))dµ(x). f (x)(p(x, θ) We can justify Taylor expansion to get √ n(E b f (X) − Eθ 0 f (X)) θ Z T √ b − θ0 ) + op (1). = f (x)∇l(X, θ 0 )p(x, θ 0 )dµ(x) . n(θ (9.4.9) Now by (6.2.10), (9.4.9) is equal to n 1 X −1 [Eθ 0 f (x)∇l(X, θ 0 )]T √ I (θ 0 )∇l(Xi , θ0 ) + op (1) n = En ([Eθ 0 f (X)∇l(X, θ0 )] i=1 −1 −1 I (9.4.10) (θ0 )∇l(·, θ 0 ) ) . But the argument of En is just Π(f | [∇l(X, θ0 )]) by (B.10.20), and (9.2.7) follows. The validity of (9.4.8) comes from the general property (B.10.14) applied to H = L2 (P0 ) and L = [∇l(X, θ0 )], CovP (f − Π(f | L)(X), g(X)) = 0 for all g ∈ L. ✷ Section 9.4 Tests and Empirical Process Theory 179 The result of this example is implicit in the work of Darling (1955). See also Durbin (1973). An excellent and very complete treatment of goodness-of-fit tests such as these is Chapter 5 of Shorack and Wellner (1986). The parametric bootstrap How do we use the result (9.4.7)? Suppose that all the constants implicit in Theorem 7.1.5 and the conditions of Theorem 6.2.2 do not depend on Pθ for θ in small open sets. This implies that all stochastic convergence and approximations that we claim are uniform in θ on these open sets. We want to set an approximate critical value for a test statistic, Tn = q(Zn (·)) where q is continuous on l∞ (T ). It is natural to use the parametric bootstrap test which we define as follows. b Tn . (i) Compute θ, ∗ ∗ (ii) Generate B samples of size n, (Xb1 , . . . , Xbn ), b = 1, . . . , B from P b . θ ∗ ∗ ∗ (iii) Compute Tbn , b = 1, . . . , B, the test statistic Tn (Xb1 , . . . , Xbn ) for each of the B b samples. For instance, if Tn = supx |F (x) − F b (x)|, then θ ∗ ∗ ∗ Tbn = supx |Fbb∗ (x) − F b (x)| where Fbb∗ is the empirical df of (Xb1 , . . . , Xbn ) and θb ∗ ∗ b is the MLE computed on the basis of X , j = 1, . . . , n. θ b bj ∗ ∗ ∗ (iv) Reject H at level α iff Tn > T([b(1−α)]+1) where T(1) ≤ . . . ≤ T(B) are the ordered ∗ Tnb , 1 ≤ b ≤ B. The result (9.4.7) suggests that this procedure provides an approximate size α test if the true θ 0 belongs to the interior of Θ. Here is a sketch proof under the assumption that the limit law Lθ0 (q(Zθ 0 )) of Tn = q(Zn ) is continuous. Note first that B 1 X P ∗ 1(Tnb > Tn ) − P b [Tn∗ > Tn ] −→ 0 θ B (9.4.11) b=1 as B → ∞ by Chebyshev’s inequality. Both expressions in (9.4.11) are to be interpreted ∗ ∗ ∗ as conditional on X1 , . . . , Xn , that is, Tn∗ = Tn (X11 , . . . , X1n ) where the X1i are i.i.d Pb , given X1 , . . . , Xn . But under our assumptions, θ(X1 ,...,Xn ) sup{|Pθ [Tn > x] − Pθ [q(Zθ ) > x]| : x ∈ R, |θ − θ0 | ≤ ε} −→ 0 . (9.4.12) b → θ0 in P probability, (9.4.11) and (9.4.12) imply that if Tn and T ′ have Since θ n θ0 the same distribution but are independent, P b [Tn∗ > Tn ] − Pθ0 [Tn′ > Tn ] → 0 in Pθ 0 θ probability. This is what we needed to argue. This technique is the first example of bootstrap techniques, Monte Carlo in the service of inference, that we will investigate further in Chapter 10. To make matters concrete, we specialize to 180 Inference in semiparametric models Chapter 9 Example 9.4.3. Goodness of fit for the gamma family. Suppose P is the Γ(p, λ) family, a possible model for lifetime distributions (see Lawless (1982)). We specialize to T = {1(0, u) : u ∈ R} or equivalently to statistics based on √ Zn (u) = n(Fb (u) − Gpb,λb (u)) b is the MLE of (p, λ) and Gp,λ (u) is the c.d.f. of Γ(p, λ). In particular, consider where (b p, λ) Tn = supu |Zn (u)|, a Kolmogorov-Smirnov type statistic. Since Gp,λ (u) = Gp,1 (λu) √ u n sup |Fb ( ) − Gpb,1 (u)| b u λ so that we can simulate under λ = 1. However, no such simplification is possible for the parameter p. Thus, to get critical values for Tn we need to apply the parametric bootstrap and simulate from Gpb,1 (·) as we discussed. ✷ Tn = Remark 9.4.1. (a). Goodness of fit tests are often used as diagnostics. Thus, for instance, in the Gaussian goodness-of-fit problem we might want to assess where the deviations from Gaussianity are significant by taking into account that the standard deviation of the empirical process at a point depends on that point. Under normality, Fb (µ + σu) has variance Φ(1 − Φ)(u)/n. Thus it is natural to consider 1 1 Sn = n 2 sup{|Fb (X̄ + σ bu) − Φ(u)|[Φ(1 − Φ)]− 2 (u) : |u| ≤ M } . (9.4.13) Then, not only is a large value of Sn indicative of departure from Gaussianity, but where that departure occurs is weighted appropriately as in Example 4.4.3. We cannot take M = ∞ here (Problem 9.4.2). However, it is possible to consider statistics such as 2 Z 1 Fb (x) − Φ x−X̄ σ b n dΦ(x) Φ(x)(1 − Φ(x)) 0 which are asymptotically equivalent to the well known Shapiro-Wilk statistics — see Mecklin and Mundfrom (2004) for a review. It is possible, as in Section 6.3, to make power calculations for such tests. Consider 1 H : F = F0 . For a sequence of alternatives {Fn } with Fn (x) = F0 (x) + n− 2 (∆(x) + o(1) ) uniformly in x, it is possible to obtain a limit distribution for Zn (·) in terms of a Gaussian process. Unfortunately, the family of ∆ we need to consider is too large and the resulting expressions generally analytically not tractable. For some calculation of this type see Hajek and Sidak (1967). These results may also be put in the general framework of contiguity, which we discuss further in Section 9.5. It is possible to extend these goodnessof-fit tests we have described to hypotheses such as that of independence of U and V where, in general, X = (U, V ) has an arbitrary distribution. Such a hypothesis is semiparametric. This extension is discussed in the problems. A general approach to constructing tests of goodness-of-fit to semiparametric hypotheses is given in Bickel, Ritov and Stoker (2003). Section 9.5 9.5 Asymptotic Properties of Likelihoods. Contiguity 181 Asymptotic Properties of Likelihoods. Contiguity As we have seen in Chapters 5 and 6 the asymptotic behavior of maximum likelihood estimates and tests and related confidence regions corresponds to the exact behavior of the same procedures specialized to the Gaussian linear model with known variance. In this section we will show how these results follow heuristically from properties of the likelihood which are by no means limited to i.i.d. observations. The theory originated in early work of Wald (1943) but was brought to full generality and understanding largely by Le Cam in a series of papers starting in 1956 and culminating in his (1986) treatise. Hàjek (1972) also made important contributions. We shall derive results only under the strong conditions of Chapters 5 and 6 rather than under the elegant necessary and sufficient conditions of Le Cam. But we will state the general results in the appropriate Hilbert space context with references to their appearance and proofs in Le Cam and Yang (1990), Le Cam (1986), and Hàjek and Sidàk (1967). In this way we shall connect this theory with the semiparametric estimation theory of Sections 9.1–9.3. We begin by motivating the abstract definitions which follow by a closer examination of the i.i.d. case. We need to take the point of view of Section 9.3 in which we introduced local parametric models. That is, we fix Pθ 0 and then consider regular parametric submodels of the form Pε ≡ {Pθ : |θ − θ0 | ≤ ε}, where ε is arbitrarily small and θ belongs to Rd . We formalize this by reparameterizing the model using the sample size dependent scale of order n−1/2 in the i.i.d. case. That is, we consider PM,n ≡ {Pθ0 + √t : |t| ≤ M } (9.5.1) n for M < ∞ arbitrary. Note that Pθ 0 could be any P residing in a semiparametric model. What matters is that we construct a sequence of shrinking regular parametric models centered at Pθ 0 . This enables us, following Le Cam, to think about local approximations to models and all related decision theoretic problems rather than just to distributions of estimates or tests. In particular we can formulate all the optimality properties of Chapters 5 (Sections 5.4.3 and 5.4.4), 6 (6.2.2), and Section 9.3 in terms of the local t. If θ∗n is a regular √ parameter ∗ ∗ estimate of θ, then, by the plug-in principle, tn ≡ n(θn − θ0 ) is the corresponding estimate of t and we can state properties in terms of t∗n . For instance, optimality of the bn as defined in (9.3.1) and (9.3.2) now reads: Under regularity conditions, if t∗ is MLE θ n asymptotically Gaussian with mean 0 uniformly in t on all PM,n , that is, Lθ 0 + √t (t∗n ) → Nd (t, Σ∗ ) (9.5.2) n uniformly for |t| ≤ M , all M < ∞, then Σ∗ ≥ I −1 (θ 0 ) and, if btn is the MLE of t, Lθ0 + √t (btn ) → Nd (t, I −1 (θ 0 )) n 182 Inference in semiparametric models Chapter 9 uniformly for |t| ≤ M , all M < ∞. We also note that for the model Nd ≡ {Nd (t, I −1 (θ0 )) : t ∈ Rd } the various asymptotic optimality statements become exact for fixed n. We can read this as saying that Nd is an approximation in a strong sense to the set of local model {PM,n : M < ∞} as n → ∞. We begin by introducing the fundamental notion of Local asymptotic normality (LAN) (n) of general likelihoods. For Θn ⊂ Rd , let En ≡ {Pθ : θεΘn }, ; n ≥ 1, be a sequence of experiments (models). By this notation we mean that, for the nth experiment, we ob(n) serve X (n) ∈ X (n) , X (n) ∼ Pθ , θ ∈ Θn . We can think of X (n) as being a vector (X1 , . . . , Xn )T belonging to Rn . In the i.i.d. case, where the Xi are i.i.d. with common (n) density f (x, θ), Pθ has density pn (x, θ) = n Y f (xi , θ). i=1 The parametrization of Θn we consider usually depends on n. It contains open balls around a fixed point θ 0 so that we can talk about θn = (θ 0 + tγn ) ∈ Θn , where γn ↓ 0, and |t| ≤ M for some M > 0 and all t if n is large enough. Having γn ↓ 0 is what makes our 1 calculations local. In all “regular” i.i.d. cases, γn = n− 2 . Note that we have already made local calculations like this in Section 5.4.4, in connection with testing. The local likelihood ratio is defined as Ln (t) = pn (X (n) , θn ) pn (X (n) , θ0 ) θn ≡ θ0 + tγn . , (9.5.3) We suppress θ0 in the notation and we use the conventions that if p(n) (·, θ 0 ) = 0 and p(n) (·, θn ) > 0, then Ln (t) = ∞, and that 0/0 = 0. Finally let Λn (t) ≡ logLn (t). Definition 9.5.1. En is uniformly locally asymptotically normal (ULAN) at θ0 iff there exist d × 1 statistics {T n (X(n) , θ0 )}, n ≥ 1, such that we can write 1 Λn (t) = tT T n (X (n) , θ0 ) − tT Σ0 t + Rn (t) 2 where Pθ 0 (i) sup{|Rn (t)| : |t| ≤ M } −→ 0 for all M < ∞. (ii) Lθ 0 (T n (X (n) , θ0 )) −→ Nd (0, Σ0 ). Remark 9.5.1. If X 1 , . . . , X n are i.i.d. Nd (θ, Σ−1 0 ) and θ = 0, then 1 Λn (t) = tT Σ0 (n− 2 n X 1 X i ) − tT Σ0 t 2 i=1 Section 9.5 Asymptotic Properties of Likelihoods. Contiguity 183 1 Pn and n− 2 i=1 Σ0 X i ∼ Nd (0, Σ0 ). We have already encountered the ULAN property in Examples 7.1.1 and 7.1.2. We show there that if {Pθ : θ ∈ Θ} is a model satisfying the regularity condition of these (n) (n) examples, then En ≡ {Pθ : θ ∈ Θ}, where Pθ is the (product) joint distribution of X (n) ≡ (X1 , . . . , Xn ) i.i.d. Pθ , is ULAN at all θ0 with T n (X (n) , θ0 ) = n−1/2 n X l̇(Xi , θ0 ) , i=1 and Σ0 = I(θ 0 ), the information matrix. We shall also have use for the following general property. Let X (n) be a sequence of probability spaces. Typically X(n) ∈ X (n) with X (n) as a subset of Rn . Definition 9.5.2. Given two sequences of probabilities {P (n) }, {Q(n) }, n ≥ 1, on X (n) we say that {Q(n) } is contiguous to {P (n) } iff for any sequence of events {An }, n ≥ 1, such that P (n) (An ) → 0, Q(n) (An ) → 0 also.(2) Contiguity has a very simple statistical interpretation. Any test of H : P = P (n) vs K : Q = Q(n) which asymptotically has probability of type I error 0 must asymptotically have probability of type II error 1. Contiguity can be used to turn asymptotic results for a relatively simple P (n) into asymptotic results for a more complicated Q(n) because if an approximation Sn to a statistic Tn satisfies Tn − Sn = oP (1) under P (n) , then Tn − Sn = oP (1) under Q(n) also. For instance: Example 9.5.1. Consider the uniform score statistic 1 T U = n− 2 X Ri (zi − z̄) n+1 of Example 8.3.11. Under the hypothesis H, an approximation to TU is X 1 Ui (zi − z̄), Ui = F (Yi ) . SU = n − 2 Here Ui ∼ U(0, 1) under H. By ordering the Ui we can write 1 SU − T U = n − 2 X U(i) − i (z(i) − z̄) n+1 where z(i) is the predictor value of the sample point with response Y(i) . Because E(U(i) ) = i/(n + 1) we have, under H, Set s2 = n−1 P EH (TU − SU )2 = Var(TU − SU ) . (zi − z̄)2 ; then we can use Problem B.2.9 to show that 2 E (TU − SU )/s → 0 184 Inference in semiparametric models Chapter 9 provided zi , 1 ≤ i ≤ n, satisfy Lindeberg’s condition. See Problem 9.5.3. For contiguous alternatives Qn , (TU − SU )/s = oP (1) also. By Slutsky’s theorem we√can find the asymptotic power of TU for contiguous alternatives by computing limn→∞ P ( 12SU /s ≥ z1−α ). See Problem 9.5.4. ✷ Define L(n) (X (n) ) ≡ q (n) (X (n) )/p(n) (X (n) ) where p(n) and q (n) are densities of P (n) and Q(n) and 0/0 = 0, a/0 = ∞, for a > 0. A simple and powerful sufficient (and necessary) condition is embodied in Proposition 9.5.1. (Le Cam’s First Lemma). Suppose (i) LP (n) L(n) (X (n) ) ⇒ L(L) with L finite valued, and (ii) E(L) = 1. Then, Q(n) is contiguous to P (n) . We refer to Hàjek and Sidàk (1967), BKRW, and van der Vaart (2000) for a general proof of this lemma. We sketch a proof in Problem 9.5.1. We next consider L(n) (X (n) ) = Ln (t) as defined in (9.5.3) and, as a corollary, we obtain (n) Proposition 9.5.2. Suppose {Pθ (n) (n) : θ ∈ Θn }, n ≥ 1, is ULAN at θ0 . Let P (n) ≡ Pθ 0 1 and Q(n) ≡ Pθ n , where θn ≡ θ0 + tγn , γn = n− 2 , and |t| ≤ M < ∞ all n. Then Q(n) is contiguous to P (n) . Proof. By Examples 7.1.1 and 7.1.2, Ln ⇒ eZ where Z ∼ N (−σ 2 /2, σ 2 ) and σ 2 ≡ tT Σ0 t. But, by (A.13.20), EesZ = exp{−(σ 2 s)/2+(σ 2 s2 )/2} which equals one if s = 1, and the result follows from Proposition 9.5.1 with L = eZ . ✷ The importance of this proposition is that the likelihood ratio is asymptotically determined by T (n) whatever θ is in the local model which suggests that T (n) is, in a suitable sense, asymptotically sufficient, so that any decision procedure has risk equivalent in the local asymptotic sense to the experiment in which one observes only T (n) (X (n) , θ0 ). To see that this last experiment is very simple we state and prove the fundamental Proposition 9.5.3. (Le Cam’s Third Lemma). Suppose En is a sequence of ULAN experiments at θ 0 , T n ≡ T n (X(n) , θ0 ) is as in Definition 9.5.1, and S n ≡ {S n (X (n) )} is a sequence of p dimensional statistics such that !! Σd×d Cd×p 0d×1 T T T . , Lθ 0 (Tn , Sn ) → Nd+p T Kp×p Cd×p µp×1 1 Then, if θ n = θ 0 + tγn , with γn = n− 2 , Lθ n (Sn (X (n) )) → Np (µ + C T t, K). (9.5.4) Section 9.5 Asymptotic Properties of Likelihoods. Contiguity 185 An immediate consequence is Proposition 9.5.4. Suppose En is ULAN at θ 0 with T n (X (n) , θ0 ) and Σ0 as in Definition 1 9.5.1. Then, if θn = θ0 + tn− 2 (n) Lθ n (Σ−1 , θ0 )) → Nd (t, Σ−1 0 T n (X 0 ). Remark 9.5.2. We discuss the implications of Proposition 9.5.4 before giving the proof of both Propositions 9.5.3 and 9.5.4. Since observing Σ−1 0 T n and T n are equivalent this (n) proposition says that the experiment, observing Σ−1 T for the model Pn = {Pθ0 +tγn : n 0 t ∈ Rd }, is equivalent, in some approximate sense, to observing W = t + Z, t ∈ Rd 1 d 2 where Z ∼ Nk (0, Σ−1 0 ); or equivalently, Σ0 W = µ + Z, µ ∈ R , for Z ∼ Nd (0, J), J = d × d identity. Coming back to our i.i.d. example, observing (X1 , . . . , Xn ) for i.i.d. Pθ 0 +tn−1/2 is asymptotically equivalent, in the rough sense we have discussed, to observP ing I −1 (θ 0 )n−1/2 ni=1 l̇(Xi , θ0 ) which, under regularity conditions, is asymptotically bn − θ 0 ). Unfortunately, having op (1) approximations equivalent to observing tbn ≡ n1/2 (θ for the likelihood ratio statistic in the local model is far from enough to justify the asymptotic sufficiency properties they suggest. We refer to Le Cam (1956, 1972, 1986) for the difficult arguments involved. Proof of Proposition 9.5.3. The characteristic function of S n is Eθn exp{iwT S n } = Eθ 0 exp{iwT S n + Λn (t)} + O(Pθ 0 [p(n) (X (n) , θ 0 ) = 0]) . (9.5.5) The second term in (9.5.5) tends to 0 by contiguity. By Definition 9.5.1, 1 iwT S n + Λn (t) = iwT S n − tT Tn − tT Σ0 t + oPθ0 (1) . 2 (9.5.6) Let (U , V ) denote the limit in law of (Tn , Sn ). The right hand side of (9.5.6) converges in law to iwT V + tT U − (1/2)tT Σ0 t. It can be shown (Problem 9.5.2) that 1 E exp{iwT V + tT U } = exp{iwT µ − (w T KV + tT Σ0 t + itT Cw)}. (9.5.7) 2 By Hammersley’s theorem (B.7.3) we can find (Λ∗n (t), [S ∗n ]T ) with the same distribution as (Λn (t), S Tn ) which converge in probability to (tT U ∗ − 1/2 tT Σ0 t, V ∗ ) with (U ∗ , V ∗ ) having the same Gaussian distribution as (U , V ). Then, 1 P exp{iwT S ∗n + Λ∗n (t)} → exp{iwT V ∗ + tT U ∗ − tT Σ0 t} . 2 The limit variable has, by (9.5.6), expectation exp{iwT (µ + C T t) − 12 wT Kw} which is the characteristic function of the postulated Gaussian limit distribution. Finally, we can conclude that the limit of expectations is the expectation of the limit in (9.5.5) by Remark B.7.1. ✷ 186 Inference in semiparametric models Chapter 9 Proof of Proposition 9.5.4. Let S n = Σ−1 0 T n . Then the conditions of Proposition 9.5.3 −1 −1 are satisfied with µ = 0, K = Σ−1 0 Σ0 Σ0 = Σ0 , C = Jk×k where J is the identity. The result follows. ✷ We give an application of Le Cam’s Third Lemma which also serves to illustrate the asymptotic optimality properties of procedures that we associate with LAN. Example 9.5.2. Asymptotic efficiency of the Rao score test. Recall the Rao test in a 1 dimensional regular model given by (5.4.55): 1 “Reject H : θ = θ0 iff Sn ≡ [nI(θ0 )]− 2 n X ∂l (Xi , θ0 ) ≥ z1−α .” ∂θ i=1 (9.5.8) This test was shown to be asymptotically optimal by a direct calculation in Problem 5.4.8. We reprove this result under weaker assumptions. By Example 7.1.1., for a regular one diP ∂l (Xi , θ0 ). By the central limit theorem, mensional model, Tn (X (n) , θ0 ) = n−1/2 ni=1 ∂θ under H, (Tn , Sn ) ⇒ N (0, 0, I(θ0 ), 1, 1). √ By Le Cam’s Third Lemma, if θn = θ0 + t/ n, then p Lθn (Tn ) → N (t I(θ0 ), 1). p Thus the asymptotic power function of the Rao test is β(t) = 1 − Φ(z1−α − t I(θ0 )), which by Theorem 5.4.5 is optimal. Note that the result is valid under the conditions of Example 7.1.1 which are weaker than those of Theorem 5.4.5. In particular nothing is assumed about the MLE. To see this we need only note that, by Example 7.1.1, the most powerful test of H : θ = θ0 vs. K : θ = θn , “reject H iff Λn (θ0 , θn ) ≥ cn (α, θ0 ),” is asymptotically equivalent to the Rao test both under H and under K by ULAN and contiguity (Problem 9.5.9). We next turn to an extension of the Rao score test to models with nuisance parameters. Example 9.5.3. Neyman’s Cα test. Consider H : ν = ν0 vs. K : ν > ν0 for p(·, θ), θ ∈ Rd a regular model with θ0 = (ν0 , η T )T satisfying the conditions of the multivariate part of Example 7.1.1. Let n X 1 l∗ (Xi , ν0 , η) Sn (ν0 , η) = n− 2 i=1 where l is the efficient score function defined by l∗ = e l/ke lk2 with e l = l˙ − Π(l|˙ Ṗ2 (P0 )), ˙ 2 (P0 ) is Ṗ2 (P0 ) ≡ {P(ν0 ,η) : η ∈ Rd−1 }, and e l, Π defined as in (9.3.10). Here Π l|P the orthogonal projection of the ν score function l˙ on the tangent space generated by the nuisance parameter η. As shown in Section 1.4, such projections can often be computed as conditional expectations. See Remark 1.4.6. ∗ Section 9.5 Asymptotic Properties of Likelihoods. Contiguity 187 Let us assume that we can use the data to find out that we are in a neighborhood of the b, such that for any θ 0 , hypothesis, if it holds. That is, there exist η 1 n− 2 n X 1 l∗ (Xi , ν0 , η 0 ) = n− 2 i=1 n−1 n X n X i=1 [l∗ ]2 (Xi , ν0 , η 0 ) = n−1 n X i=1 i=1 b + oPθ 0 (1) , l∗ (Xi , ν0 , η) (9.5.9) b + oPθ0 (1) . [l∗ ]2 (Xi , ν0 , η) Maximum likelihood estimates of η under H satisfy (9.5.9) under A0-A6 of Theorem 6.2.2. See Problem 9.5.10 for a suitable semiparametric example where (9.5.9) holds. If (9.5.9) holds, we can construct the following Neyman Cα test of H: where b ≥ z1−α σ(η) b Reject H iff Sn (ν0 , η) σ 2 (η) ≡ n−1 n n 2 X 1X ∗ ∗ ∗ l∗ (Xi , ν0 , η) − l (ν0 , η) and l (ν, η) = l (Xi , ν, η). n i=1 i=1 Using (9.5.9) and Slutsky’s theorem we can show (Problem 9.5.12) that for all θ 0 satisfying H, b ⇒ N (0, [I 11 (θ0 )]−1 ) Sn (ν0 , η) (9.5.10) where I 11 (θ0 ) is the upper left entry of I −1 (θ0 ). Thus the test based on Sn is asymptotically level α for each θ0 . By contiguity it is asymptotically of level α uniformly for θ that 1 satisfy H and are in radius n− 2 balls of θ0 . What about the power function of the Cα test? To apply the third lemma, set We know that 1 1 Tn = n− 2 Σℓ̇(xi , θ0 ) ∈ Rd , Sn = I ′′ (θ0 ) 2 S(ν0 , η) ∈ R . L (TTn , STn )T → N I(θ 0 ) Cd×1 0d×1 , , T 1 0 Cd×1 1 where Cd×1 = ([I 11 (θ0 )]− 2 , 0T )T is obtained from the identities – see Problem 9.5.13. ∂l (X 1 , θ0 ) = E[l∗ ]2 (X 1 , θ0 ) = [I 11 (θ 0 )]−1 ∂ν ∂l (X 1 , θ0 ) = 0, 1 ≤ j ≤ d − 1 . El∗ (X 1 , θ 0 ) ∂ηj El∗ (X 1 , θ 0 ) (9.5.11) We conclude from Le Cam’s Third Lemma that if θn = θ0 + tn−1/2 , θ0 ∈ H, t = (t1 , . . . , td )T , 1 1 Lθ n ([I 11 (θ0 )] 2 Sn (θ 0 ) → N (t1 [I 11 (θ0 )]− 2 , 1) . 188 Inference in semiparametric models Chapter 9 Thus the asymptotic power function of the Cα test is 1 b ) ≥ z1−α σ(b β ∗ (θ) = lim Pν n [Sn (ν 0 , η η )] = 1 − Φ(z1−α − t1 [I 11 (θ0 )]− 2 ) n for θ n as given before. The convergence is uniform on n−1/2 neighborhoods of points θ0 in H. The details of this argument are given in the hint to Problem 9.5.14. The basic point is that, again, the generalization (9.5.10) of the Rao test (the Neyman Cα test) allowing for nuisance parameters, has, in the presence of ULAN, by Example 7.1.1 the same asymptotic optimality property as the standard test in the multivariate shift Gaussian model with known variance covariance matrix — see Example 8.3.10. See Problem 9.5.15 for an example. ✷ We conclude with two final applications of Le Cam’s Third Lemma pointing to its generality. Example 9.5.4. Testing for constant hazard rate. Recall that the exponential distribution is characterized by constant hazard rate and plays the role of default distribution in reliability theory and survival analysis. We will, following Bickel and Doksum (1969), analyze the power of a class of tests which are expected to have power against the plausible alternative of increasing failure rate r(x) = f (x)/[1 − f (x)]. Proschan and Pyke (1966) proposed the use of a statistic which is a U statistic in the normalized spacings of the data, defined as follows. Let X(1) < . . . < X(n) be the order statistics of a sample X1 , . . . , Xn i.i.d. P concentrating on R+ with P continuous. Define the normalized spacings as Di = (n − i + 1)(X(i) − X(i−1) ) with X(0) ≡ 0. It may be shown that if P = E(λ) then D1 , . . . , Dn are themselves i.i.d. E(λ). (See Problem B.2.14.) On the other hand, intuitively if P has an increasing failure rate we expect the Di to decrease in i stochastically — see Problems 1.1.14, 3.5.13, and P 9.5.5. Proschan and Pyke proposed T = i<j 1(Dni ≥ Dnj ) as a test statistic. Barlow and Proschan (1996), Bickel and Doksum (1969), and Bickel (1970) investigated statistics of the form U= n X i=1 ci ( Di − 1) S (9.5.12) Pn Pn where S = i=1 Di = i=1 Xi , and the ci are decreasing in i. The motivation is that these statistics are scale-free measures of the covariance between the normalized spacings Di /S and the ci . We expect the covariance to be positive if the failure rate is increasing. Moreover, as we shall see below, Bickel (1970) shows that for suitable ci the tests based on (9.5.12) are efficient against specific alternative families. Because U is scale invariant its distribution under H doesn’t depend on λ and we can set suitable critical values for U by Monte Carlo simulation under E(1). See Example 4.1.6. As we shall see, a simple Gaussian approximation is also valid. Suppose that we have a regular parametric model P ≡ {P(λ,η) : λ ∈ R+ , η ∈ R} with P(λ,0) = E(λ) for all λ — see Problem 9.5.6. We want to compute the power of the size α Section 9.5 Asymptotic Properties of Likelihoods. Contiguity 189 test for H : η = 0 vs K : η > 0 based on rejecting for large values of U and compare it to the asymptotically MP test. Let T ∂l ∂l l̇(·) = (·, λ0 , 0) , (·, λ0 , 0) . ∂λ ∂η To applyP Le Cam’s Third Lemma we need the asymptotic joint distribution of U and Tn ≡ n−1/2 ni=1 l̇(Xi , θ0 ) for θ0 = (λ0 , 0)T . We start by approximating Tn with a linear function of the Di . Without loss of generality set λ0 = 1, then T ≡ = n X l̇ (Xi , θ0 ) = i=1 n X i=1 G−1 n X l̇ X(i) , θ0 i=1 n X a i i X(i) − G−1 n+1 n+1 i=1 n 2 i i 1X + bn X(i) − G−1 ) (9.5.13) 2 i=1 n+1 n+1 i n+1 + where G is the df of Xi under Pθ0 , so that G−1 (t) = − log(1 − t), and ∂ l̇ i i = G−1 , θ0 a n+1 ∂x n+1 2 i i ∂ l̇ bn = , θ0 µn n+1 ∂x2 n+1 where µn lies between X(i) and G−1 (i/n + 1). Write i i X X i (D − 1) i j X(i) − G−1 = + (n − j + 1)−1 + log(1 − ) n+1 n − j + 1 n+1 j=1 j=1 i X 1 (Dj − 1) . + OP = (n − j + 1) n j=1 (9.5.14) The last statement follows from the well-known approximation, n X 1 1 , = log n + γ0 + O j n j=1 where γ0 is Euler’s constant. Substituting (9.5.14) into (9.5.13) we get n n n X X X 1 i 1 1 1 d O l̇ (Xi , 1, 0) = n− 2 (Di − 1) + n− 2 n− 2 n+1 n i=1 i=1 i=1 n 2 X i i 1 X(i) − G−1 (9.5.15) + n− 2 bn n + 1 n + 1 i=1 190 Inference in semiparametric models where d i n+1 = n X 1 1 k a n+1 n+1 1− k n+1 k=i+1 Chapter 9 . 2 i = OP (n−1 ) and thus, if bn is We can show (Problem 9.5.7) that X(i) − G−1 n+1 bounded, then 1 n− 2 n X 1 l̇(Xi , θ0 ) = n− 2 i=1 n X d i=1 i n+1 1 (Di − 1) + OP (n− 2 ). Suppose ci ≡ c(i/(n + 1)) with the function c(·) continuous, Z 0 1 |d(t)|2 dt + Z 0 1 R1 0 (9.5.16) c(u)du = 0, and c2 (t)dt < ∞ . Then, since the Di are i.i.d. E(1) under H we can apply the Lindeberg-Feller theorem (Appendix D.1) and the delta method to U . We obtain after some calculation (Problem 9.5.8) that Z (T , U ) ⇒ N3 0, 0, I(θ0 ), 0 1 2 c (t)dt, Z 1 T d (t)c(t)dt 0 . (9.5.17) From this we may use Lemma 9.5.3 to deduce the power function of U and relate it to that of the asymptotically most powerful test of H : P = E(λ) vs. K : P = P(λ,η) , η ≥ 0 (Problem R 9.5.8). Note that approximate critical values can be obtained because U =⇒ N (0, c2 (u)du). For more discussion and the relations of tests of type U to the ProschanPyke and other rank tests see Bickel and Doksum (1969) and Bickel (1970). ✷ Our final example needs central limit theorems for dependent random variables which we do not pursue in this book. We include it to illustrate the generality of LAN. Example 9.5.5. Autocorrelation. The most common model for temporal dependence of a sequence of observations is the autoregressive one — see Example 1.1.5, described by Xi = βXi−1 + εi i = 1, . . . , n where the εi are i.i.d. with common density f . This only specifies the conditional distribution of X2 , X3 , . . . given X1 . If f is N (0, σ 2 ) and |β| < 1 we can, by specifying X1 ∼ N (0, σ 2 (1 − β 2 )−1 ) , make the sequence X1 , . . . , Xn a stationary, homogeneous, Gaussian Markov Chain. That is, (X1 , . . . , Xk ) and (X1+m , . . . , Xk+m ) are identically distributed and the conditional distribution of Xi given X1 , . . . , Xi−1 depends only on Xi−1 and is the same for all i. In fact, if we add to the model specification on X1 a non-zero mean as in Example 1.1.5 we Section 9.5 Asymptotic Properties of Likelihoods. Contiguity 191 obtain the only examples of stationary, homogeneous Gaussian Markov Chains. We restrict ourselves to this simple case for most of this example. Suppose we are interested in testing H : β = 0 vs. K : β 6= 0; that is, independence vs. autoregression. It is easy to see, if p(x1 , . . . , xn , β) is as given in Example 1.1.5, for µ = 0 and σ 2 known, that n n X ∂ 1 X εi−1 εi . Xi−1 Xi = log p(X1 , . . . , Xn , β)|β=0 = 2 ∂β σ i=2 i=1 (9.5.18) Thus, (9.5.18) is the Rao score statistic. By letting X1 ∼ N (0, σ 2 /(1 − β 2 )) we make the process stationary. We examine the asymptotic distribution of 1 1 Λ(tn− 2 ) ≡ log p(X1 , . . . , Xn , tn− 2 ) . p(X1 , . . . , Xn , 0) (9.5.19) The usual Taylor expansion gives 1 1 Λ(tn− 2 ) = tn− 2 n X n εi−1 εi + i=2 t2 X 2 ε + oP (1). 2n i=2 i−1 (9.5.20) We show the model is ULAN. To establish LAN we need to show that 1 n− 2 n X i=2 L εi−1 εi → N (0, 1) (9.5.21) since we know, by the WLLN, that It is clear that n1/2 Var(n−1/2 n X i=2 Pn i=2 n−1 1X 2P ε → Eε21 = 1. n i=2 i Eεi−1 εi = 0 and εi−1 εi ) = Var(ε1 ε2 ) + 2 n X Cov(εi−1 εi , εj−1 εj ) = Var(ε1 ε2 ) = 1, i6=j since Cov(εi−1 εi , εj−1 εj ) = Eεi−1 εi εj−1 εj = 0 by the independence of the εi . That (9.5.20) holds under the hypothesis H of independence is a consequence of any one of the three following approaches: (i) Martingale central limit theorems (Hall and Heyde (1980)) (ii) Central limit theorems for functions of Markov chains (Prakasa Rao (1983)) (iii) Strongly mixing random variables (Ibragimov and Linnik (1971)) 192 Inference in semiparametric models Chapter 9 All of these essentially imply asymptotic normality of Λ(tn−1/2 ) with natural parameters (mean and variance). Unfortunately, none of the tools we have developed in the independence case can be applied here directly. To perform tests, we can use statistics which are more robust, for instance, 1 U ∗ ≡ n− 2 n X ψ(Xi−1 )ψ(Xi ) (9.5.22) i=2 with a function ψ which downweights large Xi . Thus, under the assumption that ε has a double exponential distribution we arrive at (9.5.22) for the Rao statistic with ψ(t) ≡ sgn(t) as the Rao score statistic. Le Cam’s Third Lemma and a multivariate central limit theorem for sums of dependent random variables yield the asymptotic distribution of U ∗ as N (0, π 2 /4). Using (i), (ii), or (iii) again yields L L(T, U ∗ ) → N 0, 0, 1, 1, E(sgn(εi εi−1 )εi εi−1 )/Eε21 = N (0, 0, 1, 1, (E|ε1 |)2 /Eε21 ) . Thus, as in the case of the ordinary one sample problem, if ε ∼ N (0, 1), the asymptotic distribution of U ∗ is N (0, π 2 /4). The comparison in power of T and this U ∗ leads to the same conclusions as the asymptotic comparison of the sign test to the t test or of the median to the mean in the one sample problem. For more on inference under dependence we refer to Prakasa Rao (1983) and Ibragimov and Hasminskii (1981). ✷ Summary. This section treats local approximations to models that can be used to find approximations to the properties of statistical decision procedures. For the regular model √ {Pθ : θ ∈ Θ}, the local model is {Pθn : θn = θ0 + t/ n ; |t| ≤ M } for some constant M > 0. More generally we consider sequences of models, called experiments, of the (n) : θ ∈ Θn } and define En to be uniformly locally asymptotically normal form En = {P θ (ULAN) at θ0 if the log likelihood ratio for testing θ0 vs θ0 + tγn , where |γn | ↓ and |t| ≤ M , can be closely approximated by an asymptotically normal sequence of statistics. We relate the ULAN concept to the concept of contiguity: The sequence of probabilities {Q(n) } is contiguous to the sequence of probabilities {P (n) } iff for any sequence of events An such that P (n) (An ) → 0, Q(n) (An ) → 0 also. This concept is useful because if we can P show that an approximation S n to a vector T n satisfies T n − S n −→ 0 under a hypothesis P (n) H, then T n − S n −→ 0 also for contiguous alternatives. In the case where P (n) = P θ0 (n) with θn = θ 0 + tγn , |t| ≤ M , Q(n) can be shown to be is ULAN and Q(n) = P θn contiguous to P (n) . We establish Le Cam’s Third Lemma which states that if T n denotes the ULAN approximation to the local log likelihood ratio and if the joint distribution of T n and some statistic S n tends to the multivariate normal distribution for P (n) , then the asymptotic distribution of S n is asymptotically multivariate normal for Q(n) . We illustrate the usefulness of these ideas by establishing the asymptotic efficiency of the Rao score and Neyman Cα tests and by developing asymptotic theory for a class of unbiased tests of exponentiality vs increasing failure rate based on normalized spacings. Section 9.6 9.6 193 Problems and Complements PROBLEMS AND COMPLEMENTS Problems for Section 9.1 1. In Example 9.1.1, (a) Show that β is identifiable in model (a). (b) Show that α is unidentifiable in model (a) and σ is unidentifiable in both (a) and (b). (c) Show that both α and β are identifiable in model (b). b as defined in Example 6.2.2 with f0 the logistic density is (d) Show that β for β in model (a). √ n consistent b and α b in model (b). (e) Give the influence functions of the estimates β b = Ȳ − Z̄β 2. In Example 9.1.2, (a) Establish (9.1.4). (b) Prove that F is identifiable in both (9.1.2) and (9.1.5). What conditions are needed? 3. Parametric biased sampling. Simpler models result in more efficient inference when the sample size is small; see Section I.7 (although if seriously wrong the biases they bring can be overwhelming). Thus comparisons of procedures derived from parametric and semiparametric models are of interest. In the length bias part of Example 9.1.2 let V be the number of strata with v elements and suppose that V has the Poisson (µ) density h(v) = hµ (x) ≡ e−µ µv , v = 0, 1, . . . . v! An element is drawn at random from the union of all strata S0 , S1 , where Sj has j elements and the size X of the strata is observed, where now X ≥ 1. Let Z be a variable with L(Z) = L(V |V ≥ 1). (a) Show that Z has density f (z) = fµ (z) ≡ e−µ µz , z = 1, 2, . . . . (1 − e−µ )z! (b) Show that W (F ) = E(Z) = µ/[1 − e−µ ] and thus pF (x) = That is L(X) = L(V + 1). e−µ µx−1 , x = 1, 2, . . . . (x − 1)! 194 Inference in semiparametric models Chapter 9 (c) Recall that V and Z are not observed; instead XPis. Show that the MLE of µ based n on a sample X1 , . . . , Xn from pF (x) is µ b = i=1 (Xi − 1)/n and thus, by the equivariance property of MLEs, the MLEs of h(v) and f (g) are b h(v) = hµb (v) and b f (g) = fµb (z). Remark 1. The Poisson model includes households with no children and µ = E(V ) is the expected number of children over all households. 4. Parametric stratified sampling. In the stratified sampling part of Example 9.1.2, suppose (I1 , Y1 ), . . . , (In , Yn ) is a sample from pF (x)(j, y). Based on this sample, find the MLE of µ when Z is Poisson (µ), k = 2, X1 = {0, 1, 2, . . .}, X2 = {1, 2, 3, . . .}, and (a) λ1 = λ2 = 21 , (b) λ1 and λ2 general but known; where λj = P (I = j), j = 1, 2. 5. Size biased class size. Suppose we are interested in the mean size of undergraduate classes at a certain university. We sample n students and ask which of the following categories their current classes fall in: [1, 20), [20, 40), [40.60), . . . , [800, ∞). These categories are converted to the class sizes k = {10, P 30, 50, . . . , 900}, and we define the mean from the perspective of the students as µS = k=K kpk , where pk is the proportion of students in classes of size k. Suppose we are interested in the mean class size µP from the perspective of the professors at the university. Explain why µP is smaller than µS and describe how to estimate µP . 6. (a) In Example 9.1.10, show that β → LP (r(Z, β)) is 1–1 and P [S0 (T, 0, P ) = c] < 1 for all c are necessary for identifiability of β in the Cox model. (b) Show that under these conditions, if β(P ) is defined by (9.1.42), β(Pβ ) = β0 . 0 c exists and is unique with probability tending to one 7. In Example 9.1.8, show that W provided 0 < P (δ = 0) < 1. Hint. Use exponential family theory. 8. In Example 9.1.9, prove (9.1.32) to (9.1.34) following the approach for censoring. 9. For a semiparametric example where maximum likelihood fails, consider the model where Y ∈ {0, 1}, Z ∈ B ⊂ Rd , B bounded, U ∈ [0, 1], and E(Y |Z, U ) = P (Y = 1|Z, U ) = L β T Z + η(u) , β ∈ Rd , with L(t) = 1/[1 + e−t ]. This is a partially linear logistic regression model. Assume that R (k) 2 η(·) is in the class of functions Fk with J (k) (η) < ∞, where J (k) (η) = η (t) dt, k ≥ 1. Let L(β, η) denote the likelihood based on (Yi , Zi , Ui ) i.i.d. as (Y, Z, U ). Write p(y, z, u) = p(y|z, u)p(u, z) and assume that p(u, z) does not involve β. Show that sup{L(β, η) : η ∈ Fk } = 1 . Thus any b ∈ B is a MLE of β. In this model efficient estimates can be obtained by max 2 2 imizing the penalized likelihood L(β, η) − λ2 J (k) (η) where J (k) (η) is a roughness penalty and λ2 is a tuning parameter. See Mammen and van der Geer (1997). 10. In Example 9.1.14, Section 9.6 Problems and Complements 195 (a) Show that without loss of generality we can take J to be the d × d identity matrix. (b) Fill in the details in the proof that if Var Zk = 1, EZk = 0, 1 ≤ k ≤ d and at most one of the Zk is Gaussian, then A is identifiable. (Assume that if EZk2 < ∞, γk (t) can be defined and is twice differentiable in a neighbourhood of 0.) Hint. If γk′′ ≡ c, then it takes on an infinite number of distinct values. 11. In Example 9.1.4, the estimate rbj,t depends on selecting t ≡ b t so that 0 < Sb0 (b t) < 1. P Thus t = b t is random. Find conditions under which rbj,bt −→ rj (β). 12. Show that in Example 9.1.13, if EP |Y | < ∞, 0 < EP (Z − E(Z|U ))2 < ∞, then (9.1.52) is valid and β is identifiable for the given parametrization. Hint. Y − EP (Y |U ) = β(Z − E(Z|U )) + ε. 13. Show in Example 9.1.9 that (a) If (9.1.25) is replaced by its continuous approximation (9.1.29), we arrive at the same estimating equation for λFj but the expression (9.1.30) for S. (b) The unique maximizer of K(P(F,G) , P ) is given by (9.1.31). Thus the parametrization sending (F, G) into P according to (9.1.24) is identifiable if 0 < P [δ = 0] < 1 by using (9.1.31). Hint for (b). Use Shannon’s inequality. 14. In the censored data framework of Example 9.1.9, show that the empirical likelihood bF i = estimate of the hazard rate λF i at y(i) based on maximizing on (9.1.26) or (9.1.29) is λ δ(i) /(n + 1 − i) when there are no ties. 15. Show that in the proportional hazard model, Ld of Example 9.1.12 is proportional to the approximate empirical likelihood (9.1.36). 16. In Example 9.1.12, show that the likelihoods Ld and La are different for the model with Λ(y|z) = aθΛ(y) + (1 − a)Λ(θy) , where θ = exp{β T z}, a ∈ (0, 1) is fixed, and Λ(y) = − log 1 − F (y) for some baseline df F on [0, ∞). 17. The odds ratio is defined as Γ(t|z) = F (t|z)/[1 − F (t|z)] and the proportional odds model is defined by Γ(t|z) = θΓ(t), θ > 0, t > 0 , where θ = r(z, β) and Γ(t) = F (t)/[1 − F (t)] for some baseline distribution F on [0, ∞). (a) Show that if G(u; θ) = u/[u + θ(1 − u)], 0 ≤ u ≤ 1, then G F (y); θ follow a proportional odds model. 196 Inference in semiparametric models Chapter 9 −t −1 (b) Show that if G(y; θ) = L(y − θ) where L(t) = [1 + e ] is the logistic df and if −1 F (y|z) = G η(y); θ , η(y) = L F (y) , then F (y|z) follow a proportional odds model. (c) (i) Write an expression for the likelihood Ld of Example 9.1.12 for this model. (ii) When θ = exp{βz}, z ∈ R, show that this likelihood is concave in η1 , . . . , ηn , where we write ηi for η(i) . Remark 2. Bell and Doksum (1966, Table 8.1, last row) considered optimal tests of H : θ = 1 in the proportional odds model (a). This paper gives several more examples of models of the form G η(y); θ . See Problems 8.3.14 and 9.1.25. 18. We noted that Ld and La give the same estimates of θ and β in model (9.1.46). However, they give different estimates of η ′ (t). For instance, in the Cox proportional hazard model, the La estimate of λ = Λ′ is given in (9.1.37). It is increasing in y ∈ {y(1) , . . . , y(n) } and converges to zero and thus is not consistent. On the other hand, the Breslov estimate n X b B (y) = b (k) ) 1 (y(k) ≤ y) , Λ λ(y k=1 b which is the counting measure integral of λ(·), is known to be consistent. (a) Show that the Ld estimate is bd (y(i) ) = λ (y(i) − (y(i−1) ) 1 P T b j≥i exp{β z(j) } , y(0) = 0, 1 ≤ i ≤ n . bd (y) to be λ bd (y(i) ) on [y(i) , y(i−1) ), 1 ≤ i ≤ n, and zero elsewhere, If we define λ then Z y bd (s)ds = Λ b B (y) λ 0 bd (·) is the ordinary right derivative of the is again the Breslov estimate. Thus λ Breslov estimate. bd (·) is not a consistent estimate of λ(·). For consistent (b) Show, using an example, that λ estimates of λ(·), see Chapter 11. Hint. If Y(i) is an exponential, E(λ), order statistic, then (n + 1 − i)(Y(i) − Y(i−1) ) has an E(λ) distribution (Problem B.2.14). 19. In the accelerated failure time model, show that the log likelihood is proportional to (9.1.60). 20. For the AFT model (9.1.58) the approximate log likelihood la is obtained by replacing Λ( ) by a step function with the jump size λi at each v = θi , 1 ≤ i ≤ n. Show that P P (a) la (β, λ) = n−1 ni=1 δi βT Zi (Yi ) + δi log λi − nj=1 λj 1[θj ≤ θi ] . Section 9.6 197 Problems and Complements (b) For fixed β, the maximizer is δk , 1(θ j ≥ θk ) j=1 bk (β) = Pn λ 1≤k≤n. (c) The approximate profile log likelihood is proportional to la (β) = n−1 n n n io hX X 1(θj ≥ θi ) . δi β T Zi (Yi ) − δi log i=1 j=1 Note that this objective function has an infinite maximum and does not P achieve its maximum for finite β. Thus the approximate likelihood La without the pi = 1 condition does not work for this model. 21. Models where the proportion hazard (PH) and accelerated failure time (AFT) models are equivalent. Let Ti ≥ 0 be a failure time whose continuous df Fi depends on a vector zi of fixed covariates. Let λi (t) = fi (t)/[1 − fi (t)] where fi (t) is the continuous case density of Ti . Consider the following models for independent failure times T1 , . . . , Tn : (i) PH: λi (t) = ∆i λ(t), some unknown baseline λ(t); ∆i = r(zi , β) > 0, some known r(·, ·), unknown β. (ii) AFT: Fi (t) = F (τi t), some unknown baseline F ; τi = s(zi , α) > 0, some known s(·, ·), unknown α. (iii) Periodic hazard rate (PHR): λi (t) = hi (log t)tγi −1 , some known non-negative period function hi (·) with period ci > 0; γi = q(zi , θi ) > −1, some known q(·, ·), unknown θ. (iv) Weibull: λi (t) = γi tγi −1 , γi = q(β i , θ) > −1, some known q(·, ·), unknown θ. In this case, Fi is said to be the Weibull distribution. (v) Trigonometric PHR: λi (t) = exp{sin 2π log t/ log τi }tγi −1 , τi = s(zi , α) > 0 and γi = q(zi , θi ) > −1, where s(·, ·) and q(·, ·) are known, α and θ are unknown. (a) Show that (i) and (ii) both hold iff (vi) λ(τi t) = ∆i λ(t)/τi , all t > 0, some τi 6= 1, τi > 0, some ∆i > 0. (b) Suppose limt↓0 [λ(t)/ta ] exists and is positive for some a > −1. Show that (i) and (ii) hold iff Fi is a Weibull distribution. (c) Show that if (iii) holds with ci = | log τi | and γi = log ∆i / log τi , then (vi) holds so that the PHR model is both PH and AFT. (d) Show that (iv) and (v) are examples of (iii). 198 Inference in semiparametric models Chapter 9 Remark. Doksum and Nabeya (1984) show that a model is both PH and AFT iff it is PHR. 22. The quantile hazard rate (QHR) of a random variable T ≥ 0 with hazard rate λ(·) is defined by q(α) = λ(tα ), 0 < α < 1, where tα = F −1 (α) is the αth quantile of the distribution F of T . Let q(α|z) denote the quantile hazard rate for (T |z). The proportional quantile hazard (PQH) model is defined by q(α|z) = θq(α) , where q(·) is a baseline QHR and θ = r(z, β) for some known function r(·, ·). Show that L (T |z) follow the PQH model iff (T |z) follow the AFT model where (T |z) = θ−1 T0 for some baseline variable T0 . 23. The Hoeffding Rank Likelihood. (a) Consider Example 9.2.12 where Fi (y) = G(η(y); θi ) for some parametric df G(v; θ) and increasing differentiable function η(·), where θi = r(zi , β) with zi a nonrandom covariate vector. Show that the Hoeffding rank likelihood of Problem 8.3.10(a) reduces to ! n . Y n! , g(V (ri ) ; θi ) g(V (ri ) ) LR (β) = EG i=1 where g(v; θ) is the continuous case density of G(v; θ) and g(v) = g(v; θ0 ) with θ0 corresponding to a null hypothesis (typically β = 0). Assume that g(v) > 0 Qn whenever i=1 g(v; θi ) > 0. (b) Show that if g(y; θ) = 1 − exp{−θy} and η(y) = − log[1 − F0 (y)], y > 0, for some continuous df F0 , then LR (β) is equivalent to the Cox likelihood (9.1.44). See Kalbfleish and Prentice (1973, 2002) who called LR (β) a marginal likelihood. Hint for (b): Note that if h(·) is decreasing, Rank h(Yi ) = n+1−Ri. For the proportional hazard model, transform Yi by Ui = 1 − F0 (Yi ), then fUi (u) = θi uθi −1 , 0 < u < 1. Hoeffding’s formula with f (u) = 1(0 < u < 1) shows LR (β) ∝ n Y i=1 θi Z ········· Z n Y 0<u1 <u2 <···<un <1 i=1 uiδi −1 du1 · · · dun , where δi = θbi and bi = index on the Y with rank n + 1 − i. Use this to show that LR (β) ∝ n Y θi . θ Σ k:Y k ≥Y(i) k i=1 Remark 3. Asymptotic properties of maximum rank likelihood estimates were obtained by Bickel and Ritov (1997) who show that in a general class of transformation models, estimates based on LR are asymptotically efficient in the sense of Section 9.3. Section 9.6 199 Problems and Complements 24. (a) Show that n−1 times the log likelihood for the Cox model in the uncensored case is l(ξ) = n−1 n X i=1 yi log pi , ξ = β, λ(·) where pi = θi λ(ti ) exp{−θi Λ(ti )}, θi = exp(β T zi ) . P (b) Set λi = Λ(ti ) − Λ(ti−1 ), Λ(t0 ) = 0, then Λ(ti ) = j≤i λj . Let ld be l with λ(ti ) replaced by λi /di , where di = ti − ti−1 . Let F and f = F ′ be the df and density of T . Assume that log λ(·) is uniformly continuous and that λ(·) is nondecreasing. Show that l(ξ) − ld (ξ) = OP n−1 log λ(Tn ) where Tn is the largest F order statistic, T ∼ F . Hint. By the mean value theorem, λi = di λ(Ti∗ ) where Ti ≤ Ti∗ ≤ Ti−1 . Here Ti−1 and Ti are order statistics. Next show that X log λ(Ti ) − log λ(Ti−1 ) = n−1 log λ(Tn ) . l(ξ) − ld (ξ) ≤ n−1 Remark. It is known that λ(Tn ) ∼ = Tn f (Tn ) in the sense that the ratio tends to one with probability one. See de Haan and Ferreira (2006), Theorem 5.4.1. Thus if tf (t) is bounded, then l(ξ) − ld (ξ) = O(n−1 ) with probability one. (c) Let la be l with λ(ti ) replaced by λi , that is, let Λ(t) be a step function with jumps of size λi . Show that in the exponential case where λ(t) = 1/µ is constant, E[l(ξ) − la (ξ)] does not tend to zero as n → ∞. P Hint. See Problem B.2.14 and recall that if T ∼ E×p(µ) then E(T(i) ) = nj=n+1−i (1/j). Remark. In this model ld and la are equivalent functions of ξ. However, ld is more convenient for showing closeness to the semiparametric likelihood l. 25. Two sample plug-in semiparametrics. Suppose X1 , . . . , Xn are i.i.d. as X ∼ F , Y1 , . . . , Yn are i.i.d. as Y ∼ H; the X’s and Y ’s are independent. Consider the model where H(y) = G F (y); θ with F unknown for some parametric model G(u; θ), 0 ≤ u ≤ 1. Assume temporarily that F is known; then if we set U = F (Y ), then V has distribution b ) of θ based on U1 , . . . , Un , Ui = F (Yi ). If F G(u; θ) and we can find the MLE θ(F b b b is unknown, set θ = θ(F ), where F is the empirical of the X’s. Because supx |Fb (x) − 1 b ) is a consistent estimate of θ(F ) and θ(F ) F (x)| = OP (n− 2 ), θb is consistent when θ(F is a continuous function of F in the sup norm. For the following models, find b ) assuming F is known. (a) The influence function of θ(F b Fb). (b) The influence function of θb = θ( Models: (i) Proportional hazards. G(u; θ) = 1 − [1 − u]θ , θ > 0. 200 Inference in semiparametric models Chapter 9 (ii) Lehmann. G(u; θ) = uθ , θ > 0. (iii) Proportional odds. G(u; θ) = u/[u + θ(1 − u)], θ > 0. (iv) Normal transformation model. G(u; θ) = Φ Φ−1 (u) − θ , θ ∈ R. (v) SINAMI. G(u; θ) = (eθu − 1)(eθ − 1)−1 , θ 6= 0, = u when θ = 0. (vi) Self-contamination. G(u; θ) = (1 − θ)u + θuc , 0 ≤ θ ≤ 1, c > 1 is a constant. Remark 4. These models were considered in Problem 8.3.14. Problems for Section 9.2 1. In Example 9.2.1, show that En (·) =⇒ Z(·). Hint. Apply the delta method and show that the remainder is bounded uniformly. 2. Derive (9.2.5) using Theorem 6.2.1 and Theorem 7.1.4. 3. Derive (9.2.7). 1 4. Establish that Rem 1 = oP (n− 2 ) in the proof of Theorem 9.2.1, using the equicontinuity of the empirical process and the consistency of the empirical quantile function. 5. Establish (9.2.11) by computing the covariance of the limiting process in part (ii) of Theorem 9.2.1. Hint. Use integration by parts. 6. Verify in detail that the conditions of Theorem 9.1.3 imply statements (a) and (b) of the proof of Theorem 9.2.2 showing how Theorem 1.6.3 applies. 7. Verify the details of the proof that c) of Proposition 9.1.1 applies in the proof of (iii) of Theorem 9.2.2. 1 Hint. Use the fact that n 2 (S0 (·, β, Pb ) − S0 (·, β, P )) can be thought of as an empirical process and then apply the delta method to Λ̇α . 8. Derive a result similar to Theorem 9.2.1 for the truncation estimates (9.1.34). 9. Give the details of the derivation of (9.2.17). 10. Show that (9.2.17) and (9.2.20)–(9.2.22) imply (9.2.19). 11. Verify (9.2.20)–(9.2.22) assuming the validity of (9.2.24). 12. Consider the Cox model with z fixed and T having hazard rate λ(t) exp{βT z}. Let A = {t : log S0 (t, β, P ) is strictly convex in β} . Using the proof of Theorem 1.6.3, show that log S0 (t, β, P ) is always strictly convex and analytic and P [T ∈ A] > 0 unless P [T = c(z)] = 1 , which is ruled out by our assumptions. 13. Suppose Z ∼ Uniform(0, 1) and that (T |z) has the exponential model constant failure rate λ exp{βz}, λ > 0, β ∈ R. Section 9.6 201 Problems and Complements (a) Find the asymptotic distribution of the MLE βbMLE of β in this model. (b) Find the asymptotic distribution of the Cox estimate βbCOX using Theorem 9.2.2. (c) Give the asymptotic relative efficiency (ARE) of βbCOX with respect to βbMLE in the exponential model (see Problem 5.4.1(f) for the definition of ARE). Give the value of the ARE when τ = ∞. (You may use MATLAB or any other software in this problem.) 14. Suppose that (Z, T ) ∼ Cox(β 0 , λ) and that condition C holds. Show that, given Z = z, V = Λ(T ) exp(β T0 z) has a standard exponential distribution. Hint. Set θ = exp(βT0 z). Note that P θΛ(T ) ≤ s|z = P T ≤ Λ−1 (s/θ)|z . 15. Suppose (Z1 , T1 ), . . . , (Zn , Tn ) are i.i.d. as (Z, T ), where T > 0 is a survival time and Z ∈ R, with 0 < E[Z 2 ] < ∞ and P (Z = 0) = 0, is a predictor random variable. Let H(z, t; β), β ∈ R, denote the distribution of (Z, T ) and assume that the marginal distribution of Z does not involve β. Consider the estimating equation n 1X ψ(zi , ti ; β) = 0 , n i=1 where ψ(z, t; β) = ṙ(z;β) r(z;β) − ṙ(z; β)t, and r(· ; β) > 0 is a 1-1 known function of β with derivative ṙ(z; β) = ∂r(z; β)/∂β. (a) Suppose (Z, T ) has distribution H1 where H1 is such that T given Z has the Weibull distribution with df 1 − exp{−r(z, β0 )tα }, α > 0 . For what values of α does EH1 [ψ(Z, T ; b)] = 0 have the solution b = β0 ? (b) Let r(z; β) = exp(−βz) and let α be the answer(s) to part (a). Is the solution b = β0 in part (a) unique? Hint. You may use the fact that if W has the Weibull distribution with df 1 − 1 exp{−r(wα / θ), then E[W ] = θ α Γ(α−1 + 1). Pn (c) Let D(b) = n1 i=1 ψ(Zi , Ti ; b). Suppose r(z, β) = exp(−βz). Show that D(b) = 0 has a unique solution. (d) Let β0 be the solution to EH [ψ(Z, T ; b)] = 0 when r(z; β) = exp(−βz) and let βb be the solution to D(b) = 0 as detailed in (c). Derive the asymptotic distribution of √ b n(β − β0 ). √ √ Hint. Note that P ( n(βb − β0 ) ≤ s) = P D(bn ) ≤ 0 , where bn = β0 + s/ n. 202 Inference in semiparametric models Chapter 9 Problems for Section 9.3 1. Show that (9.3.5) is implied by (9.3.3) and (9.3.4). 2. (a) Show that I −1 (P0 : ν, Q) given in (9.3.2) is valid and independent of parametrization. Hint. Let t → τ (t) be a reparametrization of Q. Let f ∗ (·, τ ) = p(·, t(τ )) where t(τ ) has inverse τ (t), t(0) = 0, and t′ (τ ) > 0. Let Pf,τ denote the probability corresponding to f (·, τ ) and compute ∂ log f ∗ (·, τ ) ∂τ τ =0 , ∂ ν(Pf,τ ) ∂τ τ =0 . (b) Show that ψ ∗ and Q̇ as defined in Definition 9.3.1 and (9.3.8) do not depend on the parametrization of Q. Hint. Note that l˙f = t(a)l˙p . 3. In Example 9.3.1, show that sup{I −1 (P0 : ν, Qa ) : a ∈ Rd } = [nE(ZZT )]−1 (1,1) . 4. Show that in parametric models P satisfying the conditions of Theorem 6.2.2, the MLE is efficient in the sense of Proposition 9.3.1. Hint. Take any a ∈ Rd , θ0 and consider the one dimensional model Q ≡ {Pθ 0 +λa : |λ| ≤ ε}. 5. Derive Theorem 3.4.3 from Proposition 9.3.2. 6. (a) Give the tangent spaces for the {G(p, λ) : p > 0, λ > 0}, {β(r, s) : r > 0, s > 0} models at a generic point of the parameter space. Pd (b) Suppose p(x, η) = exp{ j=1 ηj Tj (x)−A(η)}h(x), η ∈ E. Compute the tangent space of this model at η0 . 7. Suppose X ∼ P(θ,η) , θ ∈ R, η ∈ Rd , a regular parametric model satisfying A0 – A6 of Theorem 6.2.2 for l(X, θ, η). Let θb be the MLE of θ. Show that θb has influence function at (θ0 , η 0 ) given by where e l(X, P0 : θ, P) = l˙1 − Π(l˙1 |[l˙21 , . . . , l˙2d ]) kl˙1 − Π(l˙1 |[l˙21 , . . . , l̇2d ])k2 ∂l ∂l l˙1 (X) = (X, θ0 , η 0 ), (X, θ0 , η 0 ), l2j (X) = ∂θ ∂ηj 1≤j ≤d. Hint. e l is the first coordinate of I −1 (θ0 , η 0 )(l˙1 , l̇21 . . . . , l˙2d )T . Use B.10.20. 8. Suppose X = (Z, Y ), Y = βZ + ε, E(ε|Z) = 0 with the joint distribution of (Z, ε) otherwise arbitrary and Eε2 < ∞, 0 < EZ 2 < ∞. (a) Show formally that the tangent space of this model at (α0 , β0 , L0 ) where L0 is the Section 9.6 203 Problems and Complements joint distribution of (Z, ε) is given by n h(Z, ε) + c f0′ (Z, ε) : f0 the joint density of (Z, ε), Eh(Z, ε) = 0, f0 o Eh2 (Z, ε) < ∞, E(εh(Z, ε)|Z) = 0 and c ∈ R , and that the influence function of the LSE does not belong to the tangent space in general. (b) Suppose that Z is two valued, P [Z = 0] > 0, P [Z = 1] > 0. Show that the estimate which is the MLE for β in the model Yi = α + βZi + σ(Zi )εi where the εi are i.i.d. N (0, 1) and independent of the Zi , the Zi are treated as fixed, and σ(·) as known is efficient for β at P0 with the εi as above. (c) Suppose σ 2 (Z) is unknown but the εi are still Gaussian N (0, 1) independent of Zi . How would you construct an efficient estimate? R Hint. (a) If pt (z, ε) = exp{th(z, ε) − ∆(p0 , h)}p0 (z, ε) and εpt (z, ε)dε = 0 for all t, z, then E(εh(Z, ε)|Z) = √ 0. (c) The LSE is a n consistent estimate of β. 9. In the censoring example (Example 9.3.4), let ν(P ) = S(t0 ) for some fixed t0 . (a) Show (formally) that the tangent space at (F, G) in this example is {h(Y, δ) ≡ δ∆1 (Y ) − Z Y −∞ ∆1 (s)dΛF (s) + 1 − δ∆2 (Y ) − Eh(Y, δ) = 0, Eh2 (Y, δ) < ∞} . Z Y ∆2 (s)dΛG (s), −∞ (b) Show that the influence function of the empirical MLE of ν, exp{− of this form. RY 0 b 1 (s) dH } b¯ H(s) is 10. Show that in Example 9.3.5 the influence function of the Cox estimate (9.2.15) is of the form (9.3.23). Hint. Set up identity and differentiate to solve. 11. (a) Show that the influence function of Fe (x) for Fe as given in Example 9.1.2 belongs to Ṗ2 (P0 ) as given in Example 9.3.2. (b) Show that θ(Fb) ≡ Fb (x) is regular. You may assume wj ≥ ε > 0 for 1 ≤ j ≤ k if necessary. 12. In Example 9.2.4, show that βb∗ given by (9.2.23) is efficient. 13. Consider the nonparametric regression model Y = µ(X) + e, where µ(·) is unknown, E(e) = 0, Var(e) = σ 2 , X and e are independent with cin with continuous case densities 2 f and g. The nonparametric correlation is η 2 ≡ CORR µ(x), Y ). Assume g ′ and η 2 exist. Recall (Problem 7.2.25) that η 2 = E µ2 (x) − µ2Y /Var(Y ). Formally find the 204 Inference in semiparametric models Chapter 9 tangent space Ṗ(P0 ) and show that it is of the form “(function of x) + (function of x times a function of e).” Remark. By Problem 7.2.25 it follows that the function µ2 (x) + 2µ(x)e of Pninfluence 2 −1 2 ν(P ) = E µ (X) is in Ṗ(P0 ). Thus νb ≡ n b (Xi ) will be an efficient estimate i=1 µ of ν(P ) provided µ b(·) is an “accurate” estimate of µ(·). See Doksum andSamarov (1995) and Aı̈t-Sahalia, Bickel and Stoker (2001) for such estimates of E µ2 (X) for multivariate X. 14. Suppose (X, Y ) ∼ P ∈ P where P is the bivariate normal copula model with Φ−1 F (X) , Φ−1 G(Y ) ∼ N (0, 0, 1, 1, ρ). (a) Show that if F and G are known, then the tangent space at (0, 0, 1, 1, ρ0) is Ṗ1 (P0 ) = ρ0 + (1 + ρ20 )zw − ρ0 (z 2 + w2 ) /(1 − ρ0 )2 where z = Φ−1 F (x) and w = Φ−1 G(y) . Hint. P (X ≤ x, Y ≤ y) = H(z, w) where H = N (0, 0, 1, 1, ρ). (b) Suppose F and G are unknown, and that f = F ′ and g = G′ exist. Show that the tangent space Ṗ(P0 ) at (P0 , F0 , G0 ) is Ṗ1 (P0 ) + Ṗ2 (P0 ) + Ṗ3 (P0 ) where b(x) b(x)f0 (x) ρ0 w − z + z + f0 (x) ϕ(z) 1 − ρ20 c(y) c(y)g0 (y) ρ0 z − w Ṗ3 (P0 ) = + w + g0 (y) ϕ(w) 1 − ρ20 Ṗ2 (P0 ) = for some b(x) and c(y) in L2 (P0 ). Here ϕ, f0 , and g0 are the continuous case densities of Φ, F0 , and G0 . Remark. Klaassen and Wellner (1997) show that the normal scores correlation coefficient of Example 8.3.12 is efficient for the bivariate normal copula model. That is,they show that the influence function of ν(Pb ) is in Ṗ(P0 ), where ν(P ) = Corr Φ−1 F (X) , Φ−1 G(Y ) . Problems for Section 9.4 1. Validate (9.4.3). 2. (a) Show that Sn of (9.4.13) is well defined, i.e. finite even if M = ∞. (b) Show that weak convergence theory cannot yield anything here since 1 sup |Z(x)| [Φ(1 − Φ)]− 2 (x) = ∞ x where Z is given in Example 9.2.1. Hint. Use the Law of the Iterated Logarithm, 1 1 limt→0 |W (t)|t− 2 | log2 t|− 2 > 0 where log2 ≡ log log. 1 3. Suppose that Fn (x) = F0 (x) + ∆n (x)/n 2 where |∆n − ∆|∞ → 0 and let Pn be the distribution under which X1 , . . . , Xn are i.i.d. Fn . Section 9.6 205 Problems and Complements (a) Show that, under Pn , En0 (·) ≡ √ n(Fb (·) − F0 (·)) =⇒ W 0 (F0 (·)) + ∆(·). (b) Let F0 be the U(0, 1) df and let φ1 , φ2 , . . . be the eigenfunctions of the kernel where [Kh](s) = R1 0 h → Kh, h ∈ L2 (0, 1) h(u)(s ∧ u − su)du, i.e. Z 1 φa (u)φb (u)du = δab , Kφa = λa φa . 0 The φ and λ are well known to be √ φa (u) = 2 sin(aπu), 0 ≤ u ≤ 1, λa = (aπ)−2 . R1 P∞ Show that if ∆ ∈ L2 (0, 1), ∆ = k=1 (∆, φk )φk where (∆, φk ) ≡ 0 ∆(u)φk (u)du R ban ≡ φa (u)En0 (u). Then, for all k and Z bkn ) FIDI (Zb1n , . . . , Z =⇒ (Z1 , . . . , Zk ) where Zj are independent N (µj , λj ), 1 ≤ j ≤ k, and µj ≡ (∆, φj ). R1 2 P P∞ − 12 2 ′ 2 ′ (c) Show that 0 En0 (u)du =⇒ ∞ k=1 Zk = j=1 λj (Zj + ∆j λj ) where Zj , j ≥ 1, are i.i.d. N (0, 1). R1 P P∞ 2 2 Hint. (c) E 0 (En (u) − m k=1 Zkn φk (u)) du = k=m+1 (λk + (∆, φk ) ). 4. Extend the result of Problem 3 to the situation of Example 9.4.2. Suppose T = 1 {1(−∞, s] : s ∈ R}, let Pn correspond to Fθ 0 (x) + ∆n (x)/n 2 and |∆n − ∆|∞ → 0. Show that√if bn (·) = n(Fb − F b )(·), then, under Pn , Z θ0 bn (·) =⇒ Z 0 (·) + (∆ − Π (∆))(·). (a) Z θ0 θ 0 ∂l (X, θ0 ) : 1 ≤ j ≤ d], then any test based on (b) In particular, show that if ∆ ∈ [ ∂θ 0 bn (·) has no power for the alternative Pn . Why is this qualitatively reasonable? Z √ 5. Show, extending the result of Problem 9.2.4, that for any fixed S, Tn1 ≡ n(Fb −F b )(s) θ has no greater, and usually less, power as a test statistic for H : F = Fθ 0 vs K : Fn as in √ Problem 9.2.4 than does Tn2 ≡ n(Fb − Fθ0 )(s). Hint. Compute critical values under H using Problem 9.2.4 and then show as in Chapter 5 206 Inference in semiparametric models Chapter 9 that the power for Tnj is governed by a ratio of signal to noise, µj /σj . Show that µ22 /σ22 can be written as (a, b)2 /kbk2 for a suitable a, b while 2 µ21 (a, b − Π0 b)2 (a, Π⊥ 0 (b)) = = 2 σ12 kb − Π0 bk2 kΠ⊥ 0 (b)k where Π0 , Π⊥ 0 are projection operators. Use the identity Π(a|[b]) = Π(a|[Π⊥ 0 b]) + Π(a|[Π0 b]) 2 and hence kΠ(a|[b])k2 ≥ kΠ(a|[Π⊥ 0 b])k . 6. Suppose νb is an efficient estimate of ν(P ) for P ∈ P. Let e l(X, P0 : ν, P) be its influence function. Suppose that, for all P0 ∈ P, e l(·, Pb : ν, P) is well defined and E0 !2 n 1Xe (l(Xi , Pb : ν, P) − e l(Xi , P0 : ν, P) →0. n i=1 (a) Show how to construct asymptotic 1 − α confidence bounds for ν(P ), P ∈ P based on νb. (b) Suppose that P0 n Var∗ νb∗ −→ I −1 (P0 : ν, P) where the ∗ as usual indicate computation under the bootstrap distribution given Pb as defined in Section 9.4. Show how to construct the asymptotic 1 − α confidence bounds in this case. Problems for Section 9.5 1. Establish Proposition 9.5.1. Hint. Given An , by the Neyman–Pearson lemma, there exists cn < ∞ such that P (n) [Ln > cn ] ≤ P (n) (An ) ≤ P (n) [Ln ≥ cn ] and Q(n) [Ln ≥ cn ] ≥ Q(n) (An ). If P (n) (An ) → 0, suppose cn ↑ c ≤ ∞. Since P (n) [Ln > cn ] → 0, P [Λ ≥ c] = 0. For suitable d < c, 1 − Q(n) [Ln ≥ d] = Q(n) [Ln < d] = EP (n) (Λn 1(Ln < d)) → EΛ1(Λ < d). Since EΛ = 1, Q(n) [Ln ≥ d] → EΛ1(Λ ≥ d). But E(Λ1(Λ ≥ d)) → 0 as d ↑ c by the dominated convergence theorem. 2. Establish (9.5.7). 2 3. In Example 9.5.1, show that E (Tu − SU )/s → 0 under Lindeberg’s condition. Hint. Put bounds on Var(U(i) ) and Cov U(i) , U(j) using Problem B.2.9. Note that the covariance can be obtained from Var(Y − X) = VarX + VarY − 2Cov(X, Y ) . 4. Let X1 , . . . , Xn be i.i.d. with continuous df F and density f = F ′ ; and let Y1 , . . . , Yn be i.i.d. with df G(y) = F (y − θ). This is Example 9.5.1 with zi = 0, 1 ≤ i ≤ m, zi = 1, Section 9.6 207 Problems and Complements m + 1 ≤ i ≤ N , where N = m + n. In√this case TU of Example 9.5.1 is the two-sample Wilcoxon statistic. Show that for θ = t/ n, t > 0 and λ = lim (n/N ), if λ ∈ (0, 1), the N →∞ asymptotic power of TU is Z √ p lim P 12TU ≥ z1−α = 1 − Φ z1−α − λ(1 − λ) t/12σ f 2 (x)dx . N →∞ Hint. By contiguity it is enough to compute lim P N →∞ defined in Example 9.5.1. √ 12SU ≥ z1−α where SU is 5. Let G(t) = 1 − exp{−t}, t ≥ 0, and r(t) = f (t)/[1 − F (t)] where {t : f (t) > 0} = [0, ∞). Show that (a) If r(t) is increasing on [0, ∞) and i < j, then PF (Di ≤ Dj ) ≥ PG (Di ≤ Dj ) = 12 . (b) Define F to have more increasing failure rate (IFR) than F1 , written F > F1 , if F1−1 F c is convex (cf. van Zwet (1964)). Show that if F1 and F are increasing on [0, ∞), 0 elsewhere, F > F1 and i < j, then c PF (Di ≤ Dj ) ≥ PF1 (Di ≤ Dj ) . (c) A test φ(·) is said to have a monotone power for IFR testing if F > F1 =⇒ EF φ(X) ≥ EF1 φ(X) . c In Example 9.5.4, let D = (D1 , . . . , Dn ) and D′ = (D1′ , . . . , Dn′ ) be two vectors of normalized spacings and let δ(·) be a test function based on normalized spacings. Then δ(·) is said to be monotone wrt IFR testing if δ(D′ ) ≤ δ(D) for all D and D′ with Di′ /Di nondecreasing in i. Show that the tests δU,v (·) that reject exponentiality in favor of K :“F is IFR” when U ≥ v, v > 0, have monotone power and are unbiased. (U is given by (9.5.12).) Hint (a). Use (b) with F1 = G for the first inequality. ′ Hint (b). Since F1−1 F is increasing, X(i) = F1−1 F (X(i) ) is the ith order statistic in a ′ random sample from a population with distribution F1 . Let Di′ = (n − i + 1) X(i) − −1 ′ , i = 1, . . . , n. Since F1 F is convex, i < j and Di′ ≥ Dj′ implies Di ≥ Dj . X(i−1) Hint (c). Let Z be a discrete random variable takingPon the values 1, . . . , n and let PP 0 and P1 be probabilities such that P0 (Z = k) = Dk / Di and P1 (Z = k) = Dk′ / Di′ . Let D and D′ be such that Dk′ /Dk is nondecreasing in k, then P0 and P1 have monotone likelihood ratio in k. Set cn (i) = ci . It follows 4.3.1P(first established P from Theorem P P ′ by ′ Lehmann (1955)) that E − c (Z) = − c (i)D / D ≤ − c (i)D / Di = P0 n n i i n i EP1 − cn (Z) ; thus the tests δU,v (·) are monotone. Next use the hint to (b). 208 Inference in semiparametric models Chapter 9 j 6. Let G(t) = 1 − e−t ; t ≥ 0. Consider the models Pη,λ with densities λfj (λx; η), j = 1, 2, 3, x > 0, λ > 0, η ≥ 0, where, for t ≥ 0. f1 (t; η) = (1 + η)tη exp{−t(1+η) } (Weibull) 1 ( Linear FR) f2 (t; η) = (1 + ηt) exp{−(t + ηt2 )} 2 f3 (t; η) = 1 + ηG(t) exp − t + η t − G(t) (Exponential FR) (a) Show that the FRs for fj (t; η), j = 1, 2, 3, are (1 + η)tη , (1 + ηt), and 1 + ηG(t). Pn j i (b) Show that for the models Pη,λ , j = 1, 2, 3, the statistic i=1 d1 n+1 Di − 1 as defined by (9.5.15) and (9.5.16) has d1 (u) given by − log − log(1 − u) , log(1 − u), and −u, respectively (recall that λ = 1 in (9.5.16)). 7. Show that if X(i) is an E(1) order statistic, and if G(t) = 1 − exp(−t), t ≥ 0, then X(i) − G−1 i 2 = OP (n−1 ) . n+1 Hint. Write X(i) = G−1 (U(i) ) where U(i) is a uniform (0, 1) order statistic. Expand G−1 (U(i) ) around E(U(i) ) = i/(n + 1) and use Problem B.2.9. R R1 i 8. In Example 9.5.4, let ci = c n+1 with c(u) du = 0 and 0 c2 (u) du < ∞. (a) Establish (9.5.17). Pn 1 P (b) Set U ∗ = i=1 ci (n−1 λDi − 1). Show that n− 2 (U − U ∗ ) −→ 0 as n → ∞ under the exponential hypothesis. It follows that U and U ∗ have the same asymptotic distributions for contiguous alternatives. (c) Use LeCam’s Third Lemma and (9.5.17) to give the asymptotic power function for contiguous alternatives of the test based on U . (d) Use (c) above to find c(·) that maximizes the asymptotic power function of U . 9. In Example 9.5.2 assume the model √ satisfies the ULAN condition. Show that Λn = Sn + oP (1) under K : θn = θ0 + t/ n. 10. Consider the partial linear model Y = βZ + g(0) + ε, ε ∼ N (0, σ 2 ), of Examples 9.1.13, 9.2.4, and 9.3.5. (a) Establish formally analogues of (9.5.9). Identify ν0 with β and η with (µ1 , µ2 )T where µ1 (u) = E(Z|U = u) and µ2 (u) = E(Y |U = u) = βη1 (u) + g(u). Recall that we observe X1 , . . . , Xn i.i.d. as X = (Z, U, Y ). Assume n even, set k = n/2, and use X1 , . . . , Xk to estimate µ1 (·), µ2 (·) and Xk+1 , . . . , Xn to estimate the influence function. Use (11.6.1) to estimate µ1 (·) and µ2 (·). (b) Determine sufficient conditions for your estimates of (a) to satisfy (9.5.9). (c) Find the asymptotic power function of the Neyman cα test (9.5.10) for H : β = 0 vs K : β > 0. (d) Suppose (U, Z) has a N (0, 0, 1, 1, ρ) distribution. Discuss the behaviour of the power function in (c) as a function of ρ. Section 9.7 Notes 209 11. Consider the logistic copula regression model of Example 8.3.11 with θi = β0 + β1 zi . Assume that the zi , 1 ≤ i ≤ n, satisfy Lindeberg’s condition and that the baseline 1 P distribution F is known. We test H : β = 0 vs β > 0 using SU = n− 2 ni=1 √F (Xi )(zi − z̄). Find the asymptotic power of this test for contiguous alternatives β = t/ n. Remark. By Problem 9.5.3, the uniform scores rank statistic TU has the same asymptotic power as SU . 12. Establish (9.5.10). 13. Establish (9.5.11). ∂l ∂l − π0 ∂ν |Ṗ1 (P0 ) is orthogonal to Ṗ1 (P0 ). Hint. l∗ = ∂ν 0 0 1 14. Convergence of the power function βn (θ) of the Neyman Cα test is uniform on n− 2 neighbourhoods of θ 0 . b ) for sequences ν0 + √sn , η 0 + √tn . Hint. Apply Le Cam’s third lemma to Sn (ν0 , η 15. The Neyman-Scott model. Suppose Xij = Yij + σZij , i = 1, . . . , n, j = 1, . . . , K, K ≥ 2 where Yij ≡ Yi , independent of Zij and each other and distributed according to F unknown, and Zij are i.i.d. N (0, 1). Only Xij are observed. (a) Construct a test for H : σ 2 = 1 vs σ 2 > 1, F unknown, which is size α and has asymptotic power > α for σ 2 > 1 independent of F . Pk 1 2 2 2 Hint. Consider Ui ≡ K−1 j=1 (Xij − Xi· ) , i.i.d. σ χK−1 . (b) Show that this test is asymptotically most powerful among all asymptotically level α tests of F = N (µ, τ a ) for any µ, τ 2 . Hint. Compute l∗ . (c) 2. Neyman Scott Z. Give a heuristic argument for or against the test being asymptotically most powerful at other F . (d) Suppose you treat the Yij as unknown constants µi . Show that you arrive to the same test. (e) However, show that under the assumption of (b) the MLE of σ 2 is not consistent. 16. Suppose that Pn (x, θ), θ ⊂ R, satisfy the ULAN condition with Tn (X) is asymptoti2 1 (Sn ) + N (µ, σ ) cally N (0, I) under Pn (x, θ), Sn is asymptotically ancillary if L θ+tn− 2 independent of t. Show that Tn and Sn are then asymptotically independent. 9.7 Notes Notes for Section 9.1. (1) Owen (2001) makes his extension of empirical likelihood to (multiple) biased sampling Pk by considering k independent samples of size nj , with nj > 0, 1 ≤ j ≤ k, j=1 nj = n, and permitting an empirical likelihood model for each. (2) The Cox model was introduced by Cox (1972,1975) in connection with yet another modified likelihood, partial likelihood. See Example 9.1.11. Other modifications, conditional and marginal likelihoods, are discussed in Kalbfleisch and Sprott (1970) and Kalbfleich and Prentice (1973). See Problem 9.1.23. 210 Inference in semiparametric models Chapter 9 Notes for Section 9.2. (1) For an alternate proof of the concavity of Γτ (β, P ), see Problem 9.2.12. Note for Section 9.3. (1) It may turn out that even though we have a candidate efficient influence function, no regular estimate exists — see BKRW. Notes for Section 9.5. (1) Le Cam’s notion of LAN is uniform in a much weaker sense. He requires only Rn (tn ) P θ −→0 0 if tn −→ t. The notion we use is due to Hàjek (1970). It simplifies proofs and applies to all common situations of interest in which Le Cam’s weaker property applies. (2) The notion of contiguity is an asymptotic analogue of the familiar measure theoretic, finite n property that P (n) dominate Q(n) . Chapter 10 MONTE CARLO METHODS 10.1 The Nature of Monte Carlo Methods The importance of Monte Carlo methods as we have discussed before is that they enable us to construct estimates of features of probability distributions which are impossible or very difficult to compute analytically. Here “estimate” is not used in the sense we have before of a quantity depending on data (which is random through the data being used) to estimate a feature of an unknown probability distribution. Rather it is we who are generating random quantities which can be used to estimate features of a probability distribution specified in a completely known way. We can couple the plug-in principle (Section 2.1.2) with Monte Carlo methods to obtain statistical estimates as well. That is, if we have data from an unknown probability distribution belonging to a model we can estimate that distribution in some way and then, if we have a feature (parameter) of the true distribution we want to estimate, use Monte Carlo to estimate the feature by random quantities generated using the estimated distribution. We shall discuss both Monte Carlo and statistical applications of Monte Carlo in this chapter. Here are some typical examples of statistical situations where Monte Carlo is natural. Example 10.1.1. Goodness-of-fit tests. Consider testing H : F = F0 using a statistic Tn (X1 , . . . , Xn ) where X1 , . . . , Xn are i.i.d. with df F . We need an α critical value cn such that PF0 [Tn ≥ cn ] = α. Monte Carlo Solution: Generate, on the computer(1), B i.i.d. sets of n i.i.d. observations (X11 , . . . , X1n ), . . . , (XB1 , . . . , XBn ), from F0 and form Tnb ≡ Tn (X1b , . . . , Xnb ), 1 ≤ b ≤ B. Then, for large B, the empirical distribution of the Tnb , 1 ≤ b ≤ B, is arbitrarily close to the distribution of Tn (X1 , . . . , Xn ) under F0 by the law of large numbers. Thus if T(1) ≤ · · · ≤ T(B) are the ordered Tnb and [ ] is the greatest integer function, the natural estimate T([B(1−α)]+1) of cn tends to cn as B → ∞. (We assume cn uniquely defined here.) See also Example 4.1.6. Recall again that B can be made as large as we please. Here are two applications: (a) A simple application is to permutation tests. Consider the two sample problem where X1 , . . . , Xn are independent with X1 , . . . , Xm i.i.d. G1 , Xm+1 , . . . , Xn i.i.d. G2 and H : G1 = G2 . This hypothesis is not simple. However, as we have noted in Chapter 8, under H, if X(1) ≤ . . . X(n) are the ordered X1 , . . . , Xn , the conditional distribution of 211 212 Monte Carlo Methods Chapter 10 (X1 , . . . , Xn ) given X(1) = x(1) , . . . , X(n) = x(n) , x(1) ≤ · · · ≤ x(n) , the permutation distribution is simply the uniform distribution on {(x(i1 ) , . . . , x(in ) ) : (i1 , . . . , in ), a permutation of {1, . . . , n}}. This is our F0 . Sampling from this distribution is easy. One simply selects i1 uniformly from {1, . . . , n}, i2 uniformly from {1, . . . , n} − {i1 }, and so on. A routine for doing this is implemented in R for instance as an option of the “sample” function. The two-sample t-statistic r m(n − m) Y − X T = n s of Example 8.2.7 is an example of a statistic Tn . The permutation distribution of T is appropriate when G1 and G2 are not Gaussian and min(m, n − m) is not large. (b) A second application is to testing for normality. As in Example 4.1.6 and Section I.2, consider testing the goodness-of-fit hypothesis H : F = Φ ·−µ for some µ, σ, where σ Φ is the N (0, 1) distribution and say the statistic is an unorthodox one such as that of Shapiro–Wilk type, the residual sum of squares of a normal quantile plot for the data, 2 n X X(i) − X̄ i −1 , Z(i) = . Z(i) − Φ Tn (X1 , . . . , Xn ) = n + 1 σ b i=1 For this Tn , we can take F0 = N (0, 1) because Tn is invariant under Xi → (Xi − µ)/σ. Approximations to the distribution of Tn under F0 are known (Venter and de Wet (1972)) but the exact distribution is only sparsely tabulated. As in Example 4.1.6, to do Monte Carlo we need a set of B simple random samples of size n from N (0, 1). Most statistical packages including R can generate these. But for some F0 , it is not clear a priori how this is done, if we view as most primitive that one can generate U(0, 1) observations easily.(1) We shall discuss some methods in Section 10.2. Example 10.1.2. Estimating risks functions. Given a parametric model for X ∈ Rn , {Pθ : θ ∈ Θ}, a decision theoretic formulation with loss function l(θ, δ), and a decision rule δ(X), we want to compute the risk R(θ, δ) = Eθ l(θ, δ(X)) for selected θ. As discussed in Section 5.1, this can typically not be done analytically since if X, say, has density p(x, θ), Z R(θ, δ) ≡ Eθ l(θ, δ(X) = l(θ, δ(x))p(x, θ)dx is an n-dimensional integral. The Monte Carlo approach is just to generate X1 , . . . , XB i.i.d. Pθ and approximate R(θ, δ) by B 1 X l(θ, δ(Xb )) B b=1 and again appeal to the law of large numbers. For instance take X = (X1 , . . . , Xn ) i.i.d. with df F0 (x − θ) where F0 is symmetric with center of symmetry 0. Let, for 0 ≤ α < 12 , n−[nα] X̄α = (n − 2[nα])−1 X j=[nα]+1 X(j) Section 10.1 The Nature of Monte Carlo Methods 213 be the alpha trimmed mean. Then, if θb = X̄α and l(θ, d) = (θ − d)2 , Eθ (θb − θ)2 = E0 (θb2 ), and we can approximate the constant risk function arbitrarily closely by taking B samples of size n, from F0 , and computing the average of l(θ, δ(X1 )), . . . , l(θ, δ(XB )) over these samples. We evaluate the performance of X̄α and other estimates by repeating this for several F0 of various shapes such as the Gaussian, Laplace, and Cauchy df’s. See Problem 3.5.9 and Andrews et al (1972). In the case α = 12 , when X̄α is interpretable as the median, it is possible to use numerical integration in one dimension to evaluate its MSE by formula (5.1.2). For X̄0 ≡ X̄ one can compute the characteristic function numerically and then use Fourier inversion. Even for X̄α , by conditioning on X̄([nα]+1) and X̄(n−[nα]) it is possible to use one- and twodimensional numerical integrations (Problem 10.1.2). However, these methods become more and more elaborate and fail entirely, for instance, for general linear combinations of order statistics. Thus, Monte Carlo is appropriate. ✷ Both of these examples should make it clear that, so far, the use of Monte Carlo is an arbitrarily sharp approximation to a high-dimensional integral or sum. If the dimension (sample size) were 1 or 2 we could use more accurate numerical integration procedures. But even for modest sample sizes numerical integration doesn’t work. The great virtue of (simple random sampling) Monte Carlo methods is that their MSE E !2 B Var l(θ, δ(X)) 1 X l(θ, δ(Xb )) − Eθ l(θ, δ(X)) = B B b=1 is not dependent on the dimension n and the numerator is bounded uniformly if l is. We shall see that simple random sampling is not always applicable, but important variants can be used. Example 10.1.3. Joint posterior distributions. Consider a Bayesian framework where θ has prior density π on Rk and X given θ = θ has density p(x|θ). Here both π and p(x|θ) are given analytically. By Bayes’ rule, the posterior density is Z π(θ | X = x) = p(x|θ)π(θ) p(x|t)π(t)dt . (10.1.1) Rk The denominator of (10.1.1) presents a difficult high-dimensional integration problem if k is at all large (unless π is an analytically explicit conjugate prior as in Section 1.6.5). The problem becomes even more acute if we need, say, the marginal posterior density of θ1 which involves a k − 1-dimensional integration for each θ1 . It is often very useful to represent the posterior distribution of θ by the empirical distribution of i.i.d. observations θ1 , . . . , θB from the posterior distribution. In this case, obtaining an approximation to the (marginal) posterior distribution of, say, q(θ) is easy. We just use its representation by the empirical distribution of q(θ1 ), . . . , q(θ B ). Unfortunately it turns out that, if k is at all large, in general, it is very difficult to obtain a genuine simple 214 Monte Carlo Methods Chapter 10 random sample, θ1 , . . . , θB , as above. For instance, consider a generalization of Example 3.2.1. Let θ = (µ, σ −2 ) and let X1 , . . . , Xn be i.i.d. N (µ, σ 2 ) given θ. Put on θ the prior distribution making µ and σ independent with µ ∼ N (ν0 , τ02 ) and σ −2 ∼ Γ(p0 , λ0 ). It is, a priori, quite unclear how to generate independent observations from the resulting posterior distribution. In such situations, methods of a type we shall discuss in Section 10.4 known as Markov chain Monte Carlo (MCMC) are used to generate θ∗1 , . . . , θ∗B which are approximately independent and individually have approximately the correct posterior distribution. ✷ Summary. We illustrated the usefulness of Monte Carlo methods in three situations. The first is where for some completely specified distribution F0 , the distribution LF (Tn ) of a test statistic equals LF0 (Tn ) for all distributions F in a null-hypothesis class F0 . Suppose LF0 (Tn ) cannot be computed analytically. We can then generate B independent samples from F0 , compute the values Tn1 , . . . , TnB for these B Monte Carlo samples, and then approximate the null-hypothesis distribution of Tn by the empirical distribution of Tn1 , . . . , TnB . This case illustrates the need for algorithms for generating samples from a given F0 . The second situation is where we need to compute the risk Eθ,τ l(θ, δ(X)) of a decision procedure δ for a model with principle parameter θ and nuisance parameter τ . For instance, if we use a location equivariant estimate (Problem 3.5.6) to estimate the center of symmetry θ of a distribution Fθ (x) = F0 (x − θ), with squared error loss, the risk is not a function of θ, but it is a function of τ = F0 that can typically not be computed analytically. The risk can be estimated by generating B independent samples from F0 , and computing the average of l(θ, δ(X1 )), . . . , l(θ, δ(XB )) over these samples. Thus, to evaluate the performance of decision procedures δ, we need to generate Monte Carlo samples from various interesting distributional shapes F0 . The final example is where in a Bayesian framework, we want the posterior distribution of a random parameter θ given the data, but this distribution cannot be computed analytically. One fruitful approach is to generate i.i.d. observations θ1 , . . . , θB from the posterior distribution and to use the empirical distribution of q(θ1 ), . . . , q(θB ) to approximate the distribution of a quantity q(θ) of interest. In Section 10.4 we will introduce Markov chain Monte Carlo (MCMC) methods which are the most commonly used procedures for approximately generating such Bayesian Monte Carlo samples. 10.2 Three Basic Monte Carlo Methods Having seen some examples of the potential applicability of Monte Carlo methods we now turn to the basic problem of generating simple random samples from a distribution on Rk which is specified either by its density p0 or some other feature. We also consider the problem of using samples from p0 to estimate integrals involving p 6= p0 , or to obtain samples from p 6= p0 . The fundamental methods discussed in this section and a number of other topics are treated much more extensively in the classical book of Hammersley and Handscomb (1965) and the more recent books by Ripley (1987) and Liu (2001). Section 10.2 10.2.1 Three Basic Monte Carlo Metheds 215 Simple Monte Carlo We begin by considering the problem of generating X ∼ F , for X ∈ X . If X = R then the first basic approach is to use the probability integral transform. If F is continuous, F −1 (u) = inf{x : F (x) ≥ u}, and U ∼ U(0, 1), then F −1 (U ) ∼ F because P F −1 (U ) ≤ x = P U ≤ F (x) = F (x) . If F is specified analytically, this approach is natural since F −1 can always be computed, say, by bisection. Of course, it’s best if F −1 is itself analytically described. Example 10.2.1. Exponential and Cauchy variables. E(λ) variables can be generated as −λ−1 log U since F (x) = 1 − exp(−λx) and 1 − U ∼ U(0, 1). Cauchy variables (see Problem B.2.1) can be generated as tan π(U − 12 ). ✷ This method becomes more powerful by using known distribution results. Example 10.2.2. Box–Muller–Knuth method. We want to generate independent pairs of N (0, 1) variables (X1 , X2 ). This may be done as follows. Let X1 = R cos A, X2 = 2 R sin A, where (R, A) are the polar coordinates of (X1 , X2 ). Then by Theorem B.3.1, R 1 has an E 2 distribution and is independent of A which is U(0, 2π). Hence p p −2 log U1 cos(2πU2 ), −2 log U1 sin(2πU2 ) have the desired joint distribution if (U1 , U2 ) are independent U(0, 1). ✷ The Box–Muller–Knuth method is an example of a general principle. When a distribution F corresponds to a function of independent variables which can be generated already, e.g., U(0, 1) or E(λ), then observations from F can be generated. Example 10.2.3. Gamma and beta. By Theorem B.3.1, the gamma, λ), distribution, PΓ(p, m for p = m/2 where m is an integer, can be represented as (2λ)−1 j=1 Zi2 where the Zi are i.i.d. N (0, 1). Thus by generating a block of m i.i.d. N (0, 1) variables we can generate a Γ(p, λ) variable. If m is even, m = 2p, we can generate the variable more simply from a block of p i.i.d. E(λ) variables using the explicit form of F −1 for exponential variables. For positive integers m and n, set r = m/2 and s = n/2. Then we can go further since V = X1 /(X1 + X2 ) will have a beta, β(r, s), distribution if X1 , X2 are independent Γ(r, 1), Γ(s, 1). Thus using independent gamma variables we can generate β(r, s) variables (Theorem B.2.3). To obtain general gamma and beta variables, see Example 10.2.6. ✷ Example 10.2.4 Sampling from a distribution with finite support. Suppose X is finite {x1 , . . . , xk } and we want to simulate observations of an X valued random variable X Pk such that P [X = xj ] ≡ pj , j=1 pj = 1. Even though the xj need not be real, we can do this as follows. Form the partition of [0, 1] given by I1 ≡ [0, p1 ), I2 ≡ [p1 , p1 + p2 ), . . ., Ik ≡ [p1 + . . . + pk−1 , 1]. Suppose that U has a U(0, 1) distribution. Then, it is clear that if we define g(u) = xj iff u ∈ Ij , j = 1, . . . , k 216 Monte Carlo Methods Chapter 10 then g(U ) ∼ X. Since any distribution may be approximated by finite discrete distributions, this seems easy but, in fact, for selected x1 , . . . , xk , the P [X = xj ] are usually not given analytically. This method is basic in the more difficult problem of generating a sample S ≡ {xi1 , . . ., xin } of size n from a finite population x1 , . . . , xN with probabilities of inclusion π1 , . . ., πN , that is, as in Example 3.4.1, P [xi ∈ S] = πi , i = 1, . . . , N. 10.2.2 ✷ Importance Sampling R Suppose we wish to approximate I(h) = h(x)p(x)dx for a specific h, when we are given, for X ∈ Rk , an analytically presented density p(x). Thus if h(x) = 1(x1 ≤ t1 , . . . , xk ≤ tk ), I(h) would be the df F (t1 , . . . , tk ), and if h(x) = xr11 . . . xrkk , I(h) would give R us moments. There is a way of generating an unbiased Monte Carlo approximation to h(x)p(x) dx without in fact generating a sample from p by using the method of importance sampling. Theorem 10.2.1. Let p0 be any density from which X ∈ Rk can be easily generated. Suppose that P is absolutely continuous with respect to P0 , that is, p0 (x) = 0 =⇒ p(x) = 0. Then, if X1 , . . . , XB are i.i.d. according to p0 , B 1 X p(Xb ) Ib ≡ h(Xb ) B p0 (Xb ) (10.2.1) b=1 R is an unbiased estimate of I ≡ p(x)h(x)dx. Moreover, Ib has variance Z 2 1 p (x) 2 2 V ≡ h (x)dx − I . B p0 (x) (10.2.2) Proof. Unbiasedness follows from Z p(x) p(Xb ) h(Xb ) = h(x)p0 (x)dx. E p0 (Xb ) p [p0 >0] 0 (x) The variance argument is similar. ✷ The idea here is that one finds a density p0 which qualitatively behaves like p but from which simulation is easy. The extent to which the estimate is good is precisely gauged by the size of p/p0 , since this ratio, a natural measure of qualitative similarity, governs the variance. The following example is given to illustrate simply the qualitative features of importance sampling. It is evidently not a situation where importance sampling would be used. Section 10.2 Three Basic Monte Carlo Metheds 217 Example 10.2.5 Gaussian distributions. Suppose p0 is N (0, 1) and p is N (0, τ 2 ). Qualitatively both are symmetric densities but as τ 2 moves from 1, p becomes more and more peaked or diffuse. Take h(x) = x for simplicity. Then we get, for B = 1, 2 Z ∞ 1 x 2 2 2 V = √ x λ exp − (2λ − 1) dx 2 2π −∞ for λ = τ −1 . Thus V = ∞ if λ2 ≤ 12 , and by recognizing V as a multiple of the variance of a Gaussian distribution, V = λ2 (2λ2 −1)−3/2 if λ > 21 . So if λ = 1, V = 1 as expected, while if λ → ∞, V ∼ 2−3/2 λ−1 → 0. This last result is, in retrospect, not surprising, since Varλ (X) → 0 as λ → ∞. Note that it is not true that the best choice of p0 for all h is p0 = p but rather p0 proportional to |h|p (Problem 10.2.2). This incidentally leads to the important observation that importance sampling is directed at integrating a specific function with respect to p rather than the general purpose of “representing” P by the empirical distribution of B observations identically and independently distributed (approximately) as P . Some methods related to importance sampling such as antithetic variables will be considered in the problems. ✷ 10.2.3 Rejective Sampling The rejective sampling method due to von Neumann (1951) enables us, knowing the density p up to a constant, say p(x) = aq(x), to generate X with density p given that one can generate observations from a density p0 with the property that p sup (x) = c < ∞. x p0 Necessarily c > 1 unless p ≡ p0 . Let π(x) ≡ c−1 p q(x)/p0 (x) (x) = , 0 ≤ π(x) ≤ 1. p0 sup[q(x)/p0 (x)] (10.2.3) The method is to generate i.i.d. X1 , X2 , . . . consecutively from p0 . Then consecutively generate independent Bernoulli variables I1 , I2 , . . . such that P rob[Ij = 1 | Xj ] = π(Xj ) . Let τ be the first j such that Ij = 1. “Reject” all Xj , j < τ ; “accept” Xτ as an observation from p. Moreover, let P r denote the probability corresponding to this sampling scheme. Then Theorem 10.2.2. Under the given conditions, (a) τ has a geometric marginal distribution with parameter 1/c, that is j−1 1 1 P r[τ = j] = 1− , j ≥1, c c and hence E(τ ) = c. (10.2.4) 218 Monte Carlo Methods Chapter 10 (b) P r[τ < ∞] = 1. (c) Xτ has density p. Proof. (a) and (b) hold because the Ij are independent and Z 1 p 1 P r[Ij = 1] = (x) p0 (x)dx = . c p0 c To establish (c), note that P r(Xτ ∈ A) = = = ∞ X j=1 ∞ X P r(τ = j)P r(Xj ∈ A| τ = j) P r(τ = j)P (Xj ∈ A) j=1 ∞ X 1− j=1 = Z 1 c j−1 1 c Z A p(x) p0 (x)dx p0 (x) p(x)dx . A ✷ Remark 10.2.1. For rejective sampling, the loss one incurs by using p0 rather than p is naturally measured by how long it takes to generate Xτ , for instance, by the expected time to acceptance. Using (10.2.4) we can write E(τ ) = c(p, p0 ) ≡ sup x p(x) , p0 (x) (10.2.5) We illustrate the properties of this method: Example 10.2.6. Gaussian, gamma, and beta distributions by rejective sampling. By arguing as in Example 10.2.1, if U ∼ U(0, 1) then V ≡ sgn(U )(− log |U |) has a Laplace (double exponential) density, p0 (x) = 21 e−|x| . Suppose we want to generate X ∼ N (0, 1) by rejective sampling from the Laplace distribution. Then, r 2 r x 2 2e p exp − + |x| = = 1.32. c(p, p0 ) = sup (x) = sup π 2 π x x p0 Thus one obtains a Gaussian variable typically after 2 or 3 steps. This is clearly inefficient compared to the Box–Muller–Knuth method. As a second example suppose we wish to generate beta(r, s) variates. Suppose r, s > 1. If we take p0 ≡ 1, on (0, 1) we see that c(p, p0 ) = Γ(r + s) Γ(r)Γ(s) r r+s r−1 s r+s s−1 . Section 10.3 The Bootstrap It may be shown that (Problem 10.2.3), as r, s → ∞, r r r+s → t, 0 < t < 1, c(p, p0 ) ∼ . r+s 2πt(1 − t) 219 (10.2.6) Thus using the uniform distribution as p0 is a poor method for r and s large and doesn’t work for r, s < 1. This is not surprising since p(0) = p(1) = 0. On the other hand, using rejective sampling and p0 given by beta([2r]/2, [2s]/2) is reasonable (Problem 10.2.4). An alternative is to first generate independent Γ(r, 1) and Γ(s, 1) variables and use the method of Example 10.2.3. For the gamma distribution Γ(r, 1), 0 < r < 1, we can take p0 (x) = 1 (rxr−1 1(0 < x < 1) + e−x ), x > 0 2 (10.2.7) from which it is easy to simulate (Problem 10.2.5) and then note that by Theorem B.2.3 sup x 2 p (x) ≤ ≤ 2. p0 rΓ(r) ✷ This method is dependent on being able to find a p0 from which it is easy to generate and for which c(p, p0 ) is close to one. This issue becomes acute when we need to generate a large number of i.i.d. variables having the desired density p, as we do, for instance, in Bayesian inference. The most general way of attacking this problem is through the Markov Chain Monte Carlo (MCMC) methods discussed in Section 10.4. In the next section we study a statistically very important example of the use of Monte Carlo methods where generation of the variables is simple. Summary. We considered first the simple Monte Carlo method where desired samples from a given density p0 can be obtained by using transformations of basic variables. Thus normal variables can be obtained as transformations of uniform variables, and gamma, beta, t, F , and multivariate normal variables can be obtained from independent normal variables using Sections B.3, B.4, and B.6. Next we considered importance sampling where a sample from P0 can be used to estimate integrals of the form EP h(X) when P is absolutely continuous with respect to P0 . Finally, we considered rejective sampling where variables with density p0 can be used to generate a variable with density p 6= p0 provided c = supx [p(x)/p0 (x)] < ∞. Rejective sampling is useful when p is of the form p(x) = aq(x) with q known and a unknown; and when c is close to one. 10.3 The Bootstrap The “bootstrap,” formally introduced by Efron (1979) (see also Efron and Tibshirani (1993) and Shao and Tu (1995)), is the first broad application of Monte Carlo methods to inference. The three classical problems to which he applied the method remain excellent illustrations of the idea. The context throughout this section is simple random sampling, X1 , . . . , Xn i.i.d. as X ∼ P . To avoid confusion, in this section, we shall write Pbn for the empirical probability Pb to indicate its dependence on n. 220 10.3.1 Monte Carlo Methods Chapter 10 Bootstrap Samples and Bias Corrections We are given a parameter µ(P ) defined for all P ∈ P and its plug-in estimate µ bn ≡ µ(Pbn ), where we assume that P is large enough so that all P with finite support such as Pbn are included. We wish to estimate the bias of µ bn , BIASn (P ) ≡ EP µ bn − µ(P ) . If we assume for simplicity that µ(P ) is bounded, BIASn (P ) is itself a parameter defined for all P ∈ P. However, it has three special features. In general, a) It depends on n. b) It is computable from knowledge of P only through a multiple integral. If we write µ bn as µn (X1 , . . . , Xn ) for some known function µn , then Z Z EP (b µn ) = · · · µn (x1 , . . . , xn )dP (x1 ) . . . dP (xn ) . c) We expect BIASn (P ) to tend to 0 as n → ∞, but we expect nBIASn (P ) to tend to some limit B(P ) 6= 0. We have encountered this type of estimation problem for the special case Z Z µ(P ) = h g1 (x)dP (x), . . . , gd (x)dP (x) (10.3.1) in Theorem 5.3.2 and, in the general case, in Chapter 7. We applied the delta method for Euclidean and infinite dimensional parameters, respectively, to obtain approximations to BIASn (P ) of the form B(P )/n. We can then reasonably estimate BIASn (P ) by B(Pbn )/n. The difficulty with this program as we have discussed earlier is that even for d = 1 and certainly for d > 1, the approximation B(P )/n is both tedious to compute analytically and may give little insight. The alternative proposed by Efron is to estimate the parameter BIASn (P ) by the empirical plug-in method to obtain BIASn (Pbn ). Save for the dependence of the parameter on n, this is the plug-in idea discussed in Chapter 2. Unfortunately, computing the first term in Z BIASn (Pbn ) = µn (x1 , . . . , xn )dPbn (x1 ) . . . dPbn (xn ) − µ(Pbn ) = 1 nn X 1≤i1 ,...,in ≤n µn (Xi1 , . . . , Xin ) − µ(Pbn ) is, in general, unfeasible practically. However, we can apply Monte Carlo “resampling” as follows. Section 10.3 221 The Bootstrap Bootstrap Estimation of Bias a) Monte Carlo Step. Generate B i.i.d. samples of size n from Pbn ; call them ∗ ∗ X∗b = (Xb1 , . . . , Xbn ), b = 1, . . . , B . ∗ ∗ That is, given X1 , . . . , Xn , Xb1 , . . . , Xbn are i.i.d. as X ∗ with distribution P ∗ = Pbn , P ∗ [X ∗ = Xj ] = 1 , n j = 1, . . . , n. Here, the superscript * is used to denote conditioning on the data X = (X1 , . . . , Xn ). ∗ More informally, for fixed b, each Xbj is obtained by sampling with replacement from the ∗ b population {X1 , . . . , Xn }. Let Pnb denote the empirical distribution of the bth bootstrap ∗ ∗ sample X∗b = (Xb1 , . . . , Xbn ). b) Estimation Step. Estimate BIASn (Pbn ) by b BIAS(B) n (Pn ) ≡ = B 1 X b∗ µ(Pnb ) − µ(Pbn ) B b=1 B 1 X ∗ ∗ µn (Xb1 , . . . , Xbn ) − µ(Pbn ) . B b=1 Remark 10.3.1 a) Note that B is at our disposal and may be chosen as large as we wish. In particular, formally, B = ∞ yields BIASn (Pbn ) by the law of large numbers applied to the i.i.d. ∗ variables µ(Pbnb ), b = 1, . . . , B. b) This procedure is nothing else than Monte Carlo approximation of the integral of µn (x1 , . . . , xn ) with respect to the product measure Pbn × . . . × Pbn . c) Even if B = ∞, we have no guarantee that the plug-in estimate BIASn (Pbn ) is good. ✷ b b Given a biased estimate θn = θ(Pn ) of a parameter θ(P ), we may wish to try to eliminate or reduce its bias (see Section 3.4.1). A natural approach is to use the empirically bias corrected estimate θen∗ ≡ θbn − BIASn (Pbn ) or the bootstrap bias corrected estimate b θbnB = θbn − BIAS(B) n (Pn ) . We illustrate these issues with an example, in which exact calculation is clearly superior in one special case but, in another special case of much more common type, the bootstrap is much simpler to apply. 222 Monte Carlo Methods Chapter 10 Example 10.3.1. Estimating the bias of sample moments. To avoid technicalities, Rsuppose P ∈ M = {All probabilities on [0, 1] with 4 finite moments}. Suppose µ(P ) = x dP , the population mean. Then, µ bn = µ(Pbn ) = X̄. Now EP µ bn = EP X̄ = EP X = µ(P ). −1 b Therefore, BIASn (P ) = 0 = BIASn (Pbn ), but BIAS(B) n (Pn ) = B ∗ Var ∗ b BIAS(B) n (Pn ) b E ∗ BIAS(B) n (Pn ) = 0 . −1 −1 = (nB) n n X i=1 ∗ PB b=1 X̄b∗ − X̄. Then, (Xi − X̄)2 = OP ((nB)−1 ) where E and Var denote expected value and variance for the conditional probability P ∗ = L(X ∗ |X) defined in the Monte Carlo step. Note that P ∗ , E ∗ , and Var∗ integrate out the Monte Carlo randomness that has been introduced. Turn now to estimating the bias of σ 2 (Pbn ), where σ 2 (P ) ≡ VarP (X). Here, n 1X σ (Pbn ) = (Xi − X̄)2 . n i=1 2 From (3.4.2), n−1 2 σ (P ) EP σ 2 (Pbn ) = n 2 σ (P ) . BIASn (P ) = − n Now, n 1 X (Xi − X̄)2 . BIASn (Pbn ) = − 2 n i=1 Thus, the empirically bias corrected estimate is n n+1X σ (Pbn ) − BIASn (Pbn ) = (Xi − X̄)2 , n2 i=1 2 (10.3.2) whose bias is −σ 2 (P )/n2 . The bootstrap bias corrected estimate n n−1 X 2 b (Xi − X̄)2 σ bnB (Pbn ) = σ 2 (Pbn ) − BIAS(B) n (Pn ) = n2 i=1 agrees to order n−1 with the usual unbiased estimate (n − 1)−1 10.3.3). Going further, let K3 (P ) = EP (X − µ(P ))3 Pn i=1 (Xi − X̄)2 (Problem Section 10.3 223 The Bootstrap Pn be estimated by K3 (Pbn ) = n−1 i=1 (Xi − X̄)3 . Here 1 EP K3 (Pbn ) = K3 (P )(1 − 2 ) . n Hence, BIASn (Pbn ) = −n−2 K3 (P ) and again K3 (Pbn ) − BIASn (Pbn ) is unbiased. However, consider the numerator of the kurtosis (see A.11.10), K4 (P ) = EP X 4 − 3(EP X 2 )2 (10.3.3) Now (see Problem 10.3.4) EP K4 (Pbn ) = K4 (P )+ a 1 n + b1 a2 a3 b2 b3 + + + K (P )+ σ 4 (P ) (10.3.4) 4 n2 n3 n n2 n3 for appropriate constants aj , bj ; j = 1, 2, 3. Here, K4 (Pbn ) − BIASn (Pbn ) is biased. However, while 1 EK4 (Pbn ) = K4 (P ) + O( ) , n 1 (10.3.5) EK4 (Pbn ) − BIASn (Pbn ) = K4 (P ) + OP ( 2 ) . n Note that the example illustrates that using BIAS(B) n is easier than going through the exact calculation of BIASn (Pbn ). Moreover, computing B(P )/n by the delta method for this last example is almost as tedious as computing EP K4 (Pbn ), and using the approximation b BIAS(B) ✷ n (Pn ) is easier. A natural question to ask is how large does B have to be so that (10.3.5) holds when b BIASn (Pbn ) is replaced by BIAS(B) n (Pn ) or, more generally, when is 1 b Eθ(Pbn ) − BIAS(B) n (Pn ) = θ(P ) + OP ( 2 )? n Calculations that answer these questions are in Hall (1986). When θ(P ) is a smooth function of means as in Theorem 5.3.2, b b E ∗ BIAS(B) n (Pn ) = BIASn (Pn ) −1 −1 b Var∗ BIAS(B) B ). n (Pn ) = OP (n (10.3.6) (10.3.7) Here we again use Efron’s convenient * notation for the conditional distribution L(X ∗ |X) of observations resampled from the data X = (X1 , . . . , Xn ). Recall that P ∗ , E ∗ , etc. integrate out the Monte Carlo randomness we have introduced. Since, under the conditions of Corollary 5.3.1, 3 BIASn (Pbn ) = BIASn (P ) + OP (n− 2 ) 224 Monte Carlo Methods Chapter 10 b Hall’s and our conclusion is that for BIAS(B) n (Pn ) to be asymptotically equivalent to BIASn (P ) − 32 2 to order OP (n ), we need to take B ≍ n , where an ≍ bn as usual denotes an = O(bn ) and bn = O(an ). The proofs of these results are sketched in Problem 10.3.1. In bias correction the bootstrap makes an asymptotically appropriate higher order correction. Of greater importance is the use of the bootstrap in estimation of variances of complex estimates and in setting confidence bands and regions, which we turn to in the next section. 10.3.2 Bootstrap Variance and Confidence Bounds The standard formula for an upper asymptotic level 1 − α upper confidence bound (UCB) for a parameter θ on the basis of the MLE θbn in a regular one dimensional parametric model with Fisher information I(θ) is given in Section 5.4.5 and is of the form z1−α . θbn + q nI(θbn ) More generally, if we have a regular, as defined in Section 9.3, asymptotically normal estimate µ bn of a parameter µ(P ) such that √ µn − µ(P ))) → N (0, σ 2 (P )) Lp ( n(b for all P ∈ P√and σ bn is a (locally uniformly) consistent estimate of σ(P ) in P, then µ bn + z1−α σ bn / n is a natural asymptotic level (1 − α) UCB. Thus, for instance, if P = {Pθ : θ ∈ Rd } bn = (θbn1 , . . . , θbnd )T behaves regularly, then, letting kI ij k denote the inand the MLE θ verse of the information matrix I of Section 6.2.2, s bn ) I 11 (θ θbn1 + z1−α (10.3.8) n is an asymptotic level 1 − α UCB in P. As we have noted in the past, construction of σ bn 11 b and computation of I (θn ) is usually far from trivial. The bootstrap gives what appears to be an “automatic” solution. and Suppose that the standard deviation of µ bn , SDn (P ), exists and normalizes µ bn properly, µ bn − µ(P ) Lp → N (0, 1) (10.3.9) SDn (P ) √ nSDn (P ) → σ(P ) . (10.3.10) Section 10.3 225 The Bootstrap Then it is natural to try SDn (Pbn ) as a consistent estimate of SDn (P ), in the expectation that SDn (Pbn ) − SDn (P ) = OP (n−1 ) . (10.3.11) The resulting proposed confidence bound would, by the usual inversion argument, be µ bn + z1−α SDn (Pbn ) . Calculation of SDn is again a problem because Z Z SDn2 (P ) = ... µ b2n (x1 , . . . , xn ) dP (x1 ) . . . , dP (xn ) − Z ... Z 2 µ bn (x1 , . . . , xn ) dP (x1 ) . . . , dP (xn ) . However, getting a bootstrap approximation is simple. Bootstrap Estimation of Variance. ∗ (i) Let {Xbj : 1 ≤ b ≤ B, 1 ≤ j ≤ n} be bootstrap samples generated as in the Monte ∗ ∗ Carlo step of the estimation of BIASn (P ) and compute µ bn (Xb1 , . . . , Xbn ) ≡ µ b∗nb P B and µ b∗n· = B −1 b=1 µ b∗nb . (ii) Estimate SDn (P ) by (B) d SD n (B) dn Again, if B = ∞, SD ≡ B 1 X ∗ (b µnb − µ b∗n· )2 B b=1 ! 12 . = SDn (Pbn ). Example 10.3.2. Estimation of the variance of the median and trimmed means. Let Mn ≡ Median(X1 , . . . , Xn ). We have seen that (Problem 5.4.1), if X has density f = F ′ and if θ(F ) ≡ F −1 ( 12 ), f (θ) > 0, then √ 1 −1 1 LF n(Mn − F ( )) → N 0, 2 . 2 4f (θ) Consistent estimation of f , and thus of σ(P ) = 1/2f (θ), is possible in this case, see Chapter 11. So also is estimation of σ(P ) by plugging Pbn into the formulae (5.1.2) and (5.1.3). Moreover, application of the bootstrap is easy. Similarly for the trimmed means X̄α given by (3.5.3), we have in Example 7.2.6 the result √ n(X̄α − µα (P )) → N (0, σα2 (P )) 226 Monte Carlo Methods where X̄α = µα (Pbn ), µα (P ) = 1 1 − 2α Z x1−α xα Chapter 10 x dP (x), σα2 (P ) = VarP ψα (X) with ψα given by (7.2.42). Again σα2 (Pbn ) is easily computed. However, while there is no particularly attractive formula to plug into for SDn2 (Pbn ), the bootstrap is easy to apply in the estimation of variance step above. ✷ There are a number of questions we need to ask: (i) For estimation of variances per se, when does plugging in Pbn yield a consistent estimate? (ii) When is SDn (P ) an appropriate normalization to yield asymptotic normality for (b µn − µ(P ))? d (B) to approximate SDn (P ) to the appropriate (iii) How big does B need to be for SD n 1 order oP (n− 2 )? (iv) What is gained by using the bootstrap in cases where an alternative approximation exists? It turns out that for the median and trimmed means, the answers to (i) and (ii) are affirmative under the natural conditions considered in Example 7.2.6. We summarize the affirmative answers to the first three questions for the simple case where µ bn is a smooth function of a vector mean. Its proof is left to the Problem 10.3.1. Theorem 10.3.1. Under the conditions of Corollary 5.3.2 (a) and (b), d n = SDn (Pbn ) + OP (n− 12 B − 12 ) . SD Since, under the same conditions, SDn (Pbn ) = SDn (P ) + OP (n−1 ) . B ≍ n will give total errors of the same order as the error in SDn (Pbn ). The answer to question (iv), bias and variance estimation is unclear in terms of statistical performance. However, the computational simplification can be striking. The Jackknife The jackknife due to Quenouille (1949) and Tukey (1958) preceded the bootstrap as a “sampling” approach to estimating biases and variances of complex statistics. It is closely related to the sensitivity curve discussed in Section 3.5.3 (Volume I). Given an i.i.d. sample X(n) ≡ {X1 , . . . , Xn } with X ∼ F , a parameter θ(F ) and a sequence of estimates θ(m) (X1 , . . . , Xm ), 1 ≤ m ≤ n, of θ(F ), we want to estimate the bias Bn (F ) ≡ EF θb(n) − θ(F ) Section 10.3 227 The Bootstrap and the variance σn2 (F ) ≡ VarF θb(n) . The idea is to use the estimates θb(n−1) (X−i ) ≡ θb(i) obtained by evaluating θb(n−1) at the subsamples X−i ≡ {X1 , . . . , Xi−1 , Xi+1 , . . . , Xn } of size n − 1 obtained by deleting one Xi at the time, 1 ≤ i ≤ n. The estimates of bias and variance are b bn = (n − 1)(θb(·) − θ), B σ bn2 = n X i=1 (θb(i) − θb(·) )2 . θb(·) = n−1 n X i=1 θb(i) , The relation of the jackknife idea to the sensitivity curve defined in Section 3.5 and Problem 7.2.2 can be seen from SC(X(n) , θb(n) ) = n(θb(n) − θb(n−1) ) . Since if θ(F ) is smooth in F , the sensitivity curve is a crude approximation to n−1 times the influence function (Problem 7.2.2). Thus we can write σ bn2 ≈n −2 n X i=1 ψ(Xi , F ) − ψ̄ 2 ≈n −1 Z ψ 2 (x, F ) dF (x) (10.3.12) where ψ is the influence function of θb(n) as defined in Section 7.2 and ψ̄ = n−1 n X i=1 ψ(Xi , F ) ≈ 0 . This approximation to σ bn2 is exactly corect if θ(F ) is linear in F so that θb(n) is an average bn can also be jusand, as we have seen in Problem 7.2.2, approximately correct generally. B tified using higher order asymptotics. For more on the jackknife, see Efron and Tibshirani (1993), Chapter 11. 10.3.3 The General i.i.d. Nonparametric Bootstrap Efron (1979)(1) proposed a general bootstrap principle along the following lines: Let P denote a nonparametric class of probabilities on a set X . We assume that P contains all discrete distributions on X . Let X = (X1 , . . . , Xn ) denote i.i.d. observations from P ∈ P. We are interested in a functional Tn (X, P ) which is invariant under permutations π(X) of the elements of X, that is Tn (π(X), P ) = Tn (X, P ) for each π(X) = (Xi1 , . . . , Xin ) with ij 6= ik for j 6= k. By sufficiency of the empirical distribution, this restriction has 228 Monte Carlo Methods Chapter 10 no force as long as we are only considering i.i.d. observations. In this case we also write Tn (Pbn , P ) for Tn (X, P ), where we have identified Pbn with X. The parameter λn (P ) we are interested in is calculable from the distribution of Tn (Pbn , P ). That is, for some map θ from {LP (Tn (Pbn , P )) : P ∈ P} to some set Θ, we can write λn (P ) = θ(LP (Tn (Pbn , P ))). (10.3.13) Then, the plug-in principle leads to the estimate of λn (P ), where λn (Pbn ) = θ(L∗ (Tn (Pbn∗ , Pbn ))) Tn (Pbn∗ , Pbn ) = Tn (X1∗ , . . . , Xn∗ , Pbn ) and * as usual denotes that we are dealing with an i.i.d. sample X1∗ , . . . , Xn∗ from Pbn . We will use bootstrap samples as follows to approximate L∗ and λn (P ): ∗ ∗ a) Approximating L∗ (Tn (Pbn∗ , Pbn )). Generate bootstrap samples X∗b = (Xb1 , . . . , Xbn ), b = 1, . . . , B, as in the Monte Carlo Step of the bootstrap bias and variance approximation. ∗ ∗ ∗ ∗ b Then compute {Tnb , 1 ≤ b ≤ B} with Tnb = Tn (Xb1 , . . . , Xbn , Pn ). Let δx be pointmass PB −1 ∗ ∗ b ∗ b ∗ , the empirical distribution δ at x. Approximate L (Tn (Pn , Pn )) by LB ≡ B T b=1 nb ∗ of {Tnb , 1 ≤ b ≤ B}. (B) b) Bootstrap Estimation of λn (P ). The estimate λn (Pbn ) of λn (P ) is given by θ(L∗B ). It is easy to see that estimation of the bias BIASn (P ) corresponds to taking Tn (Pbn , P ) = µ bn − µ(P ) , Z θ(FT ) = z dFT (z) , where FT is the df of Tn (Pbn , P ). This gives θ(LP (Tn (Pbn , P ))) = EP (b µn ) − µ(P ) . With the same choice of Tn (Pbn , P ), estimation of SDn (P ) corresponds to θ(FT ) = Z 2 z dFT (z) − Z 2 ! 12 z dFT (z) . This formulation suggests an entirely different approach to setting confidence bounds and confidence regions. Consider again setting a confidence bound on µ(P ) using µ bn . In Examples 4.4.1 and 4.4.2, to obtain confidence bounds for the mean and variance of a normal distribution, we used the existence of pivots, functions T (b µn , µ(P )) whose distribution did not depend on P , with appropriate monotonicity properties. In fact, our general √construction of asymptotic µn − µ(P ))/b σn which UCBs for µ(P ) is based on an asymptotic pivot of the form n(b tends to N (0, 1) for all P ∈ P. Section 10.3 229 The Bootstrap Consider µ bn − µ(P ) as a potential pivot. Essentially only if µ(P ) is a translation parameter of a translation parameter family {F (x − µ) : µ ∈ R} is µ bn − µ(P ) in fact a pivot. However, note that if cnα (P ) is the unknown α quantile of the distribution of µ bn − µ(P ), then P [b µn − µ(P ) ≥ cnα (P )] = P [µ(P ) ≤ µ bn − cnα (P )] = 1 − α Thus estimating the parameter cnα (P ) to appropriate accuracy by b cn should yield µ bn − b cn (B) as an asymptotic (1 − α) UCB. The bootstrap yields an estimate b cn as follows: Let Tn (Pbn , P ) = µ bn − µ(P ), and let λn (P ) be the α quantile of L(b µn − µ(P )). The concrete ∗ application of the bootstrap is, having generated Tnb =µ bn (X∗b ) − µ(Pbn ), 1 ≤ b ≤ B, then the bootstrap approximation to cnα (P ) is the αth quantile of the empirical distribution of ∗ {Tnb ; 1 ≤ b ≤ B}, that is ∗ b c(B) = Tn([Bα]) n ∗ ∗ ∗ where Tn(1) ≤ . . . ≤ Tn(B) are the ordered {Tnb } and [ ] denotes the greatest integer function. If we specialize to µ bn = µn (Pbn ), then cb(B) =µ b∗n([Bα]) − µ bn n where µ b∗n(1) ≤ . . . ≤ µ b∗n(B) are the ordered {b µ∗nb }. Thus, µ b∗n([Bα]) is the Monte Carlo estimate of the α quantile of the bootstrap distribution of µ bn . We thus obtain as a potential asymptotic (1 − α) UCB, bn ) = 2b µn − µ b∗n([Bα]) . µ bn − (b µ∗n([Bα]) − µ (10.3.14) This is not Efron’s so called percentile method. Efron proposed the UCB µ b∗n([B(1−α)+1]) . √ It may be shown (Problem 10.3.6) that if n(b µn − µ(P )) tends to a Gaussian distribu1 tion and bootstrap quantiles converge to population quantiles at the n− 2 rate, then Efron’s method with B = ∞ yields an asymptotic (1−α) UCB. Efron’s motivation for his proposal here and for subsequent refinements is that µ b∗n([B(1−α)+1]) is equivariant under monotone transformations.That is, if we use it as an asymptotic level (1 − α) UCB for µ(P ) then g(b µ∗n([B(1−α)]+1) ) is the corresponding asymptotic level (1 − α) UCB for g(µ(P )) if g is increasing. For more on these topics, we refer to Efron and Tibshirani (1993) and Hall (1997). There is nothing special about the proposed pivot µ bn − µ(P ). If we have an estimate of the asymptotic variance of µ bn , call it σ bn2 , and √ (b µn − µ(P )) → N (0, 1) , L n σ bn then it is natural to apply the above process to Tn (Pbn , P ) = √ (b µn − µ(P )) n σ bn 230 Monte Carlo Methods Chapter 10 (B) ∗ ∗ ∗ = (b µ∗nb − µ bn )/b σnb , where σ bnb is to obtain a bootstrap estimate, say dbnα , using Tnb (B) σ bn (X∗b ), and invert to get µ bn − σ bn dbnα as an asymptotic UCB. We illustrate this process concretely in a classical example. Example 10.3.3. Bootstrap t. Suppose, P = {P : 0 < EP X 2 < ∞} and we want to put confidence bounds on µ(P ) = EP X. If we restrict P to Gaussian distributions, we are naturally (see Example 4.4.1 and Section 4.9.2) led to the t statistic defined, for s given in (10.3.2), by √ (X̄ − µ) Tn = n s √ and the level (1 − α) UCB of Example 4.4.1, X̄ + tn−1 (1 − α)s/ n, where tn−1 (1 − α) is the 1 − α quantile of the Tn−1 distribution. If P is not Gaussian but is unknown, the distribution of Tn is also unknown and it is natural to use the bootstrap. Form the bootstrap √ ∗ ∗ ∗ samples (Xb1 , . . . , Xbn ), the corresponding X̄b∗ and s∗b , Tnb = n(X̄b∗ − X̄)/s∗b , b = (B) ∗ 1, . . . , B, and the lower α quantile of the bootstrap distribution, Tn([Bα]) ≡ dbnα . The level (1 − α) bootstrap UCB is s X̄ − db(B) . nα √ n √ Applying the same principle to the pivot n(X̄ − µ), we arrive at the UCB, ∗ , 2X̄ − X̄([Bα]) ∗ ∗ where X̄(1) ≤ . . . ≤ X̄(B) are the ordered bootstrap sample means. Does either of these work in the sense of giving asymptotically correct coverage probability? Is one preferable to the other on theoretical grounds? We address questions such as these now. ✷ 10.3.4 Asymptotic Theory for the Bootstrap Since asymptotic theory for statistics eventually rests on the law of large numbers and Rthe 2central limit theorem, ∗we start ∗there as well. Suppose X1 , . . . , Xn are i.i.d. P with x dP (x) < ∞. Let (X1 , . . . , Xn ) be a nonparametric bootstrap sample. By B.1.3 the joint distribution of (X1 , . . . , Xn , X1∗ , . . . , Xn∗ ) is completely determined by the marginal distribution of (X1 . . . , Xn ) and the conditional distribution of X1∗ , . . . , Xn∗ given Xi = xi , P 1 ≤ i ≤ n, specified as Xi∗ i.i.d. with common distribution Pbn ≡ n−1 ni=1 δxi . As we have noted, for any statistic Tn (X1 , . . . , Xn ), L∗ (Tn (X1∗ , . . . , Xn∗ )), the conditional distribution of Tn (X1∗ , . . . , Xn∗ ) given X1 , . . . , Xn , is a random quantity. Suppose LP (Tn (X1 , . . . , Xn )) =⇒ L0 . It makes sense to ask if P [L∗ (Tn (X1∗ , . . . , Xn∗ )) =⇒ L0 ] = 1 which we define as almost sure (a.s.) convergence in law of the bootstrap distribution of Tn , or the weaker statement which is the one needed for statistics: For all ε > 0, as n → ∞ P [ρ(L∗ (Tn (X1∗ , . . . , Xn∗ )), L0 ) ≥ ε] → 0 (10.3.15) Section 10.3 231 The Bootstrap where ρ is a metric for weak convergence, i.e. Ln =⇒ L0 iff ρ(Ln , L0 ) → 0. See Problem 10.3.8 for Mallows’ metric which is an example of such a ρ. We call (10.3.15) convergence in law in probability of the bootstrap distribution of Tn . An elegant proof of (10.3.15) for certain Tn using Mallows’ metric can be found in Bickel and Freedman (1981). See Problem 10.3.8. The most basic result is Theorem 10.3.2. Suppose X1 , . . . , Xn are i.i.d. as X ∈ Rd . Let P denote the joint distribution of (X1 , . . . , Xn , X∗1 , . . . , X∗n ). (a) If E|X| < ∞ then, as n → ∞, P n h1 X n i=1 i X∗i −→ EX = 1 . (10.3.16) (b) If E|X|2 < ∞ and X has positive definite covariance matrix Σ then 1 P [L∗ (n− 2 n X i=1 (X∗i − X̄)) =⇒ N (0, Σ)] = 1 . (10.3.17) Proof. (a) Note that X∗1 , . . . , X∗n is a double array because the empirical probability Pbn depends on n. Thus we need to use the uniform strong law of large numbers (SLLN) given in Appendix D.1. That is, we need to show that n X |Xi |1(|Xi | ≥ M ) EP |X∗ |1(|X∗ | ≥ M ) = n−1 i=1 converges a.s. to zero as n → ∞. The right hand side converges a.s. to E[X|1(|X| ≥ M )] by the SLLN, and this expression tends to zero as M → ∞ by the dominated convergence theorem (Theorem B.7.5). Thus P [X̄∗ → X̄] = 1 by D.6 and D.7. Because P [X̄ → E(X)] = 1 by the SLLN, we can for each ε > 0 select N such that, a.s., for n ≥ N , |X̄∗ − X̄| ≤ ε/2 and |X̄ − E(X)| ≤ ε/2; then, a.s., |X̄∗ − E(X)| ≤ ε for n ≥ N and we have shown (a). To establish (b), we refer to the Lindeberg-Feller theorem for double arrays of independent variables as stated in Appendix D.1. According to this result (Theorem D.3), to establish (10.3.17) for d = 1, we need only check that E ∗ (Xi∗ ) = X̄ and that for all ε > 0, as n → ∞ n 1X ∗ ∗ 2 1 a.s. E (Xi ) 1(|Xi∗ | ≥ εn 2 ) −→ 0 n i=1 a.s. E ∗ (Xi∗ − X̄)2 −→ Var(X) . (10.3.18) (10.3.19) The left hand side of (10.3.18) is n ∆n (ε) ≡ 1X 1 (Xi )2 1(|Xi | ≥ εn 2 ) . n i=1 (10.3.20) 232 Monte Carlo Methods Chapter 10 1 Let M > 0 be arbitrary and select n so that εn 2 > M . Then ∆n (ε) ≤ Un , where n Un ≡ 1X (Xi )2 1(|Xi | > M ) . n i=1 (10.3.21) a.s. By the SLLN, Un −→ EP (Un ) = EP (X)2 1(|X| > M . By the dominated convergence theorem, we can make EP (Un ) arbitrarily small by selecting M sufficiently large, thus (10.3.18) follows. Since n 1X 2 1X (Xi − X̄)2 = Xi − (X̄)2 , E ∗ (Xi∗ − X̄)2 = n i=1 n (10.3.19) follows from the SLLN and the result follows for d = 1. The general result is argued in the same way (Problem 10.3.9). ✷ With this theorem, we can easily establish the conclusion we hoped for in Example 10.3.3 and the preceding discussion. Example 10.3.3. Bootstrap t (Continued). By Theorem 10.3.2 (b), if σ 2 (P ) = VarP (X), √ P [ n(X̄ ∗ − X̄) =⇒ N (0, σ 2 (P ))] = 1 . √ Thus the bootstrap approximation to the (1 − α) quantile of the distribution of n(X̄ − µ) converges to z1−α with probability one, which, by the argument leading to (10.3.14), ∗ establishes 2X̄ − X̄[∞α] as an asymptotic (1 − α) UCB. P ∗ Next consider (X̄ − X̄)/s∗ . By Theorem 10.3.2 (a), if [s∗ ]2 ≡ n−1 ni=1 (Xi∗ − X̄ ∗ )2 , then P s∗ −→ σ(P ) . (10.3.22) P P Here (10.3.22) follows from n−1 (Xi∗ − X̄ ∗ )2 = n−1 (Xi∗ − X̄)2 + (X̄ ∗ − X̄)2 , n 1X ∗ P (X − X̄)2 −→ σ 2 (P ) , n i=1 i and P (X̄ ∗ − X̄)2 −→ 0 . It follows from (10.3.22) and Slutsky’s theorem that √ (X̄ ∗ − X̄) n =⇒ N (0, 1) s∗ (∞) P in probability which yields that dbnα −→ zα , and hence establishes the asymptotic validity of √ (10.3.23) X̄ − db(∞) nα s/ n Section 10.3 233 The Bootstrap as a 1 − α UCB. Next let n → ∞ and let B temporarily be fixed. For each δ > 0 lim P ∗ [sup | n→∞ t B 1 X B 1 2 b=1 ∗ 1(Tnb ≤ t)− P ∗ [Tn∗ ≤ t] | ≥ δ] ≤ P [kEB k∞ ≥ δ] (10.3.24) where Tn∗ = Tn (X1∗ , . . . , Xn∗ ) and EB is the empirical process for U1 , . . . , UB i.i.d. U(0, 1). To establish (10.3.24) let Gn (t) denote the distribution of Tn∗ , and let U, U1 , . . . , UB be i.i.d. U nif (0, 1). Then the expression inside the sup in (10.3.24) has the same distribution as 1 B− 2 B X b=1 1 Ub ≤ Gn (t) − P U ≤ Gn (t) . The sup over t of this expression is bounded above by kξB k∞ . To show equality, let Hn denote the distribution of Tn . Then |Gn (t) − Φ(t)| ≤ |Gn (t) − Hn (t)| + |Hn (t) − Φ(t)| and the central limit theorem together with Polya’s theorem (Theorem B.7.7) implies that Gn (t) converges uniformly to Φ(t). The continuity of Φ shows that 1 sup B − 2 t B X b=1 ∗ 1(Tnb ≤ t) − P ∗ (Tn∗ ≤ t) converges weakly to kξB k∞ . By Donsker’s theorem, P n) db(B nα −→ zα where B = Bn → ∞ as n → ∞. We conclude that for such B the bootstrap t UCB s X̄ − db(B) nα √ n √ is asymptotically valid in the sense that P µ(P ) ≤ X̄ − dbB nα s/ n → (1 − α) (Problem 10.3.11). Which of these three is better and is either to be preferred to using s s X̄ − z(α) √ = X̄ + z(1 − α) √ n n as an approximate (1 − α) UCB? It turns out (see Hall (1997)), that 1 s P [X̄ + z(1 − α) √ ≥ µ(P )] = (1 − α) + Ω(n− 2 ) n where cn = Ω(an ) means cn = O(an ) and an = O(cn ). Moreover 1 ∗ ] = (1 − α) + Ω(n− 2 ) , P [µ(P ) ≤ X̄[∞(1−α)+1] 234 Monte Carlo Methods Chapter 10 if EP |X1 |3 < ∞, while (∞) s b P µ(P ) ≤ X̄ − dnα √ = (1 − α) + Ω(n−1 ) n if EP X14 < ∞. This establishes that this bootstrap t UCB method (10.3.23) is better than the bootstrap percentile method if only probability of coverage is considered. These results are based on Edgeworth expansions and are beyond the scope of this book. We refer to Hall (1986) and (1997), where the effect of choice of B is also discussed. ✷ The empirical process theory of Section 7.1.1 may also be generalized. The most general result, due to Giné and Zinn (1990), is the following: Theorem 10.3.3. Let (X1 , . . . , Xn , X1∗ , . . . , Xn∗ ) be defined as before, Z 1 f d(Pbn − P ) , En (f ) = n 2 and 1 En∗ (f ) = n 2 = n Z − 12 f d(Pbn∗ − Pbn ) n n X X f (Xj ) . f (Xi∗ ) − i=1 Then, if En =⇒ WP0 , j=1 P [En∗ =⇒ WP0 ] = 1 , and conversely. Note that since finite dimensional convergence has been established by Theorem 10.3.2, it is only tightness that is at issue. Some maximal inequalities also generalize to bootstrap samples. We refer to van der Vaart and Wellner (1996) for further discussion. Here is an example of an application of the Giné-Zinn result which also follows from an earlier result of Bickel and Freedman (1981). Example 10.3.4. Confidence bands for a distribution function. Suppose that X ∈ R with distribution function F which is completely unknown. We want to set a simultaneous confidence band for F . By the equivalence of tests and confidence regions, we consider the Kolmogorov statistic (see Example 4.4.6) Tn (F0 ) ≡ sup |Fb (x) − F0 (x)| . x If cn (F0 ) is the 1 − α quantile of the distribution of Tn (F0 ) under F0 , we can have as a 1 − α confidence region, √ Cα ≡ {F : sup n |Fb (x) − F (x)| ≤ cn (F )} . (10.3.25) x Section 10.3 235 The Bootstrap If we assume F0 is continuous, cn (F ) ≡ cn doesn’t depend on F and thus √ Cα0 = Fb (x) ± cn / n is a simultaneous 1 − α confidence band for F . It turns out, Problem 10.3.12, that Cα0 is, in fact, level 1 − α for all F . However, if F is discrete, Cα0 is too big, while Cα is difficult to compute and may not be a band. To remedy this, we can apply the bootstrap principle and estimate cn (F ) by cn (Fb ) or rather the usual approximation, the [B(1 − α)] + 1th order statistic cnB (Fb ) of ∗ Tnb ≡ √ n sup |Fb∗ (x) − Fb (x)|, 1 ≤ b ≤ B . x It follows from our discussion and the Giné-Zinn theorem that, indeed b bα ≡ Fb ± cnB√n (F ) C n is an asymptotic size 1 − α confidence band for F arbitrary. For further discussion and some Monte Carlo simulations, see Bickel and Krieger (1989). ✷ 10.3.5 Examples Where Efron’s Bootstrap Fails. The m out of n Bootstraps Our results so far have not established whether, for instance, Efron’s bootstrap which is drawn n times from {X1 , . . . , Xn } will consistently estimate features of more irregular Tn (Pbn , P ), such as nVarP {median(X1, . . . , Xn )}. In fact, the bootstrap is consistent in approximating the distribution of √ 1 n(med(X1 , . . . , Xn ) − F −1 ( )) 2 and its variance. However, it is fairly easy to generate examples of inconsistency of the bootstrap, as we next illustrate. Example 10.3.5. The bootstrap distribution of the maximum. Suppose X1 , . . . , Xn are i.i.d. real valued with df F such that F (x) = 1, x ≥ c. Assume that X1 has a density f with f (c) > 0. Then, (Problem 10.3.13), n(c − max(X1 , . . . , Xn )) =⇒ E(f (c)) , (10.3.26) in probability, where E(λ) denotes the exponential df with mean λ−1 . But, n(max(X1 , . . . , Xn ) − max(X1∗ , . . . , Xn∗ )) does not converge in law in probability — see Problem 10.3.14. For this and many other examples of bootstrap failures, see Bickel, Götze and van Zwet (1997). ✷ 236 Monte Carlo Methods Chapter 10 There is a fix for the bootstrap failures found independently by Götze (1993) and Politis ∗∗ and Romano (1994). For m ≤ n, let X∗∗ ≡ (X1∗∗ , . . . , Xm ) be a sample from X ≡ {X1 , . . . , Xn } taken without replacement. Suppose LP (Tn (X1 , . . . , Xn )) =⇒ L0 . ∗∗ Then, if L∗∗ indicates the conditional distribution L(X∗∗ |X) of (X1∗∗ , . . . , Xm ) given X1 , . . . , Xn , ∗∗ L∗∗ (Tm (X1∗∗ , . . . , Xm )) =⇒ L0 in probability, provided only that m → ∞, m/n → 0. This m out of n without replacement bootstrap, also referred to as subsampling, is studied in Politis, Romano and Wolf (1999). The analogous with replacement m out of n bootstrap which includes that of Efron is studied in Bickel, Götze and van Zwet (1997). The m out of n bootstraps can be seen as methods of regularization. Difficulties with the Efron bootstrap occur when a feature λn (Q) of LQ (Tn (Pbn , Q)), which includes Q = Pbn as a possible argument, is such that λn (P ) → λ(P ) , where λ(P ) is an irregular parameter. Then, while the “bias” λn (P ) − λ(P ) → 0, the “variance” λn (Pbn ) − λn (P ) can blow up. Taking m(n)/n → 0 and m(n) → ∞ still makes the “bias” λm(n) (P ) − λ(P ) → 0 but reduces the “variance” arbitrarily depending on how slowly m(n) → ∞. We illustrate this phenomena in Problem 10.13.15. Summary. We introduced the bootstrap, which is a method for approximating distributions, estimates, critical values, and confidence procedures by drawing i.i.d. Monte Carlo samples from the empirical probability distribution Pbn . Let P be a set of probabilities on a set X , and suppose P contains all discrete distributions. Let X = (X1 , . . . , Xn ) be i.i.d. observations from P ∈ P and let Tn (X, P ) be a functional which is invariant under permutations of X. We also write Tn ≡ Tn (Pbn , P ), where we have identified Pbn with X. We considered parameters that are features of L(Tn ) such as the distribution FTn (t) = P (Tn ≤ t), quantiles FT−1 (α), θ < α < 1, the expected value EP Tn , and the n variance VarP Tn . More generally, we considered λn (P ) = θ(LP (Tn (Pbn , P ))) where θ is a map from {LP (Tn (Pbn , P )) : P ∈ P} to a set Θ. The bootstrap estimate of λn (P ) is the empirical plug-in estimate λn (Pbn ). We can write λn (Pbn ) as θ(L∗ (Tn (X∗ , Pb)) where X∗ = (X1∗ , . . . , Xn∗ ) is an i.i.d. sample from Pb . This Monte Carlo sample, which we can think of as a sample drawn with replacement from {X1 , . . . , Xn }, is called a bootstrap sample. When L∗ and λn (Pn ) are difficult to compute, they can be approximated by the following procedure called the (Efron) bootstrap: Generate B independent boot∗ strap samples X∗1 , . . . , X∗B and compute Tnj = Tn (X∗j , Pbn ), j = 1, . . . , B. Now use ∗ ∗ the empirical distribution L∗B of Tn1 , . . . , TnB to approximate L∗ (Tn (X ∗ , Pbn )) and use Section 10.4 237 Markov Chain Monte Carlo θ(L∗B ) to approximate λn (Pbn ). We showed how the bootstrap can be used to estimate the bias and variance of estimates, how it can be used to find approximate critical values for tests, as well as approximate confidence bounds, intervals, and regions. We develop and discuss some asymptotic results that imply that in “regular” situations, bootstrap approximations are asymptotically valid. We also give an example of an “irregular” case where the bootstrap fails, and we introduce alternative bootstraps, the m out of n with and without replacement bootstraps. The latter can be used in all of the situations where the Efron bootstrap does not work. 10.4 10.4.1 Markov Chain Monte Carlo (MCMC) The Basic MCMC Framework We want to generate random variables X1 , X2 , . . . on a sample space X distributed according to p, but direct generation is not feasible. Rejective sampling procedures generate from a sequence of i.i.d. variables, Y1 , Y2 , Y3 , . . . distributed according to p0 , a variable X exactly distributed according to p. Evidently we obtain B variables, X1 , . . . , XB i.i.d. according to p by accepting X1 = Yτ1 and then continuing to sample Y ’s, till we get X2 = Yτ2 and so on. Rejective sampling is not always practicable because there may not be a clear choice of p0 such that supx p/p0 (x) is not too large. It turns out to be useful to simultaneously enlarge our field of action and our use of the term “sampling” in two directions: (i) We permit (Y1 , Y2 , . . .) to be generated according to a Markov sequence, a generalization of independent sampling. The required Markov chain theory is reviewed in Appendix D.5. (ii) We do not insist that the X1 , . . . , XB we obtain be either exactly marginally distributed according to p or be exactly independent. This generalization of rejective sampling introduced by Metropolis et al (1953) and developed by Hastings (1970) has become very important. The basic idea of MCMC is to construct a homogeneous positive recurrent Markov chain with transition kernel K(x1 , x2 ), the conditional density p(x2 |x1 ) of X2 evaluated at x2 , given X1 = x1 , such that p is the unique stationary density of K. Here the term “conditional density” refers to both the discrete and continuous case. Thus in the discrete case, K(x1 , x2 ) = P (X2 = x2 |X1 = x1 ), and in the continuous case moving from x1 to x2 means drawing x2 according to the distribution whose density is p(x2 |x1 ) = K(x1 , x2 ). In what follows we will use the continuous case notation and write integrals. In the discrete case, the integrals should be read as sums. For functions g on the sample space X define the total variation norm by kgk1 ≡ Z |g(x)|dx 238 Monte Carlo Methods for continuous models and kgk1 = n X i=1 Chapter 10 |g(xi )| for discrete models. Specifically, we seek K(x, y) from which, given Y1 = x, we can generate consecutive Y2 , Y3 , . . ., simply such that the following holds. R (a) For all y, p(x)K(x, y)dx = p(y), that is, stationarity of p with respect to K. (b) Let Y1 be distributed according to a selected p0 . Let Y2 , . . ., Ym , . . . be the Markov sequence obtained by using Y1 as an initial value and K as the transition kernel. Then, if pm (y) is the density of Ym , we require that there exist c > 0 and 0 < ρ < 1, such that ||pm − p||1 ≤ cρm . (10.4.1) The idea is to begin with a relatively simple Y1 , and for M large, run the chain to YM as in (b) above (burn in). Continue the chain and define Xj = YM+j , j = 1, . . . , n, . . .. Then X1 , . . . , Xn , . . . will be distributed approximately according to p by (a) and (b) (Problem 10.4.1). There are many variants, for instance, repeating the (b) step B times using independent Y1 ’s and using the B resulting approximately i.i.d. as p “X1 ’s”. Why should we use this extension of rejective sampling, rather than just stick to the former? The main reason is that, as we stated and we shall see in our examples, it is often the case that a p0 with a reasonably small c(p, p0 ) is not readily available. The “uniform” p0 almost invariably is very poor. MCMC in some sense allows us to adjust p0 as we move so as to get closer to p on the first iterate we choose to retain. 10.4.2 Metropolis Sampling Algorithms A general class of kernels producing a desired p was proposed by Metropolis et al (1953) and then generalized by Hastings (1970) and others. All have the following form: (1) Choose a proposal kernel K0 (x1 , x2 ) from which it is relatively easy to generate x2 for a given x1 . Given x1 , select a candidate new value x2 according to K0 (x1 , x2 ). (2) For a given function r : R2 → [0, 1], such that r(x, x) ≡ 1, move to x2 with probability r(x1 , x2 ), otherwise stay at x1 . Call the resulting value y1 . (3) Repeat with y1 in place of x1 to obtain y2 , etc. This leads to the new kernel K(x1 , x2 ) = r(x1 , x2 )K0 (x1 , x2 ), if x1 6= x2 X = K0 (x1 , x1 ) + (1 − r(x1 , y))K0 (x1 , y) = 1− X y6=x1 y6=x1 r(x1 , y)K0 (x1 , y), if x1 = x2 . (10.4.2) Section 10.4 239 Markov Chain Monte Carlo The original Metropolis algorithm has K0 (·, ·) symmetric and p(x2 ) . r(x1 , x2 ) ≡ min 1, p(x1 ) Intuitively, a symmetric K0 has the uniform distribution as stationary distribution (Problem 10.4.2). Then this r(x1 , x2 ) nudges the generated x’s towards values more probable under p, so that, in the end, the relative frequency with which x is visited over a long time is approximately p(x). Also note that, as for rejective sampling to compute r, we need only to know p(·) up to a constant. It is evident that if x1 6= x2 , for the Metropolis original proposed K, p(x1 )K(x1 , x2 ) = min(p(x1 ), p(x2 ))K0 (x1 , x2 ) = min(p(x1 ), p(x2 ))K0 (x2 , x1 ) (10.4.3) = p(x2 )K(x2 , x1 ). The condition p(x1 )K(x1 , x2 ) = p(x2 )K(x2 , x1 ), for all x1 , x2 (10.4.4) is called detailed balance and implies that p is the stationary distribution of K(·, ·); see Appendix D.5. The following result gives a general prescription for constructing K. Proposition 10.4.1 (Stein). Suppose K0 is a Markov kernel and K(x1 , x2 ) = r(x1 , x2 )K0 (x1 , x2 ), x1 6= x2 . (10.4.5) Then K is itself a Markov kernel satisfying detailed balance if and only if r(x1 , x2 ) can be written as δ(x1 , x2 ) p(x1 )K0 (x1 , x2 ) (10.4.6) where δ(x1 , x2 ) is symmetric and, for all x1 , X δ(x1 , x2 ) ≤ p(x1 ) . (10.4.7) x2 6=x1 Proof. Detailed balance holds by definition iff p(x1 )K(x1 , x2 ) is symmetric. To check symmetry note that, by (10.4.5) and (10.4.6), δ(x1 , x2 ) = p(x1 )r(x1 , x2 )K0 (x1 , x2 ) = p(x1 )K(x1 , x2 ) . Thus detailed balance holds for K if and only if δ is symmetric. To establish (10.4.7), note that because we need X 0 ≤ K(x1 , x1 ) = 1 − r(x1 , x2 )K0 (x1 , x2 ), x2 6=x1 240 Monte Carlo Methods Chapter 10 we must have X r(x1 , x2 )K0 (x1 , x2 ) = X δ(x1 , x2 ) ≤1 p(x1 ) x2 6=x1 x2 6=x1 and conversely. The proposition follows. ✷ Remark 10.4.1. (a) The inequality (10.4.7) is implied by r(x1 , x2 ) ≤ 1 for all x1 , x2 . This is necessary for our interpretation of r and of the implementation of step (2) of the algorithm but not for (10.4.4) and (10.4.5). (b) The general Metropolis proposal is to use p(x2 )K0 (x2 , x1 ) (10.4.8) r(x1 , x2 ) ≡ min 1, p(x1 )K0 (x1 , x2 ) which is evidently symmetric. (c) If K0 satisfies detailed balance with respect to p, then we may choose K = K0 . The Metropolis condition and a fortiori (10.4.7) give little guidance as to how K0 is to be chosen and, indeed, this is usually dictated by the structure of the particular situation we want to deal with and considerations which are both numerical and statistical. To illustrate the Metropolis algorithms, we consider an interesting application. We shall pursue this somewhat complicated example further in Section 10.5. Example 10.4.1. A prediction problem. Let, Yi = Zi + εi , i = 1, . . . , n where we assume that (Z1 , . . . , Zn ) are observed and, for initial simplicity, that ε1 , . . . , εn are i.i.d. with known N (0, 1) distribution. We do not, however, observe Y1 , . . . , Yn but rather the order statistics (Y(1) , . . . , Y(n) ) where Y(1) < . . . < Y(n) . Our goal is to predict Yi , i = 1, . . . , n on the basis of this data. An example (Petty et al (2002)) where this question arises (with the distribution of the εi unknown) is observation of a set of n cars on a highway passing the first of two checkpoints at times Z1 < . . . < Zn which label the cars, and again at a second checkpoint at times Y(1) < . . . < Y(n) . Observation is carried out by a device (loop detector) which can record the time of passing of a vehicle but not its identity. The model specifies that ε1 , . . . , εn , the travel times of the vehicles, are i.i.d. given (Z1 , . . . , Zn ) and we wish to predict εi = Yi − Zi given this data. But, since Zi are known, this is equivalent to predicting Yi . Let (R1 , . . . , Rn ) be the (unobserved) ranks of Y1 , . . . , Yn defined by Yi = Y(Ri ) . In what follows, all quantities depend on (Z1 , . . . , Zn ), but we suppress this dependence. The best squared error predictor of Yi is by Theorem 1.4.1, E(Yi |Y(1) , . . . , Y(n) ) = n X j=1 Y(j) P [Ri = j|Y(1) , . . . , Y(n) ] (10.4.9) Section 10.4 241 Markov Chain Monte Carlo and the P [Ri = j|Y(1) , . . . , Y(n) ] are known in principle. Unfortunately, the natural formula is computationally hopeless since it appears to involve n! multiplications of n terms each and n! additions, P [Ri = j|Y(1) , . . . , Y(n) ] = (10.4.10) P {A(Y(k1 ) , . . . , Y(kn ) , Z) : All permutations (k1 , . . . , kn ) of {1, . . . , n} with ki = j} P {A(Y(k1 ) , . . . , Y(kn ) , Z) : All permutations (k1 , . . . , kn )} where A(Y(k1 ) , . . . , Y(kn ) , Z) = n Y j=1 ϕ(Y(kj ) − Zj ). However, we claim that the expression (10.4.10) can be approximated using MCMC. The idea is to generate B (approximately) independent “observations” (R1b , . . . , Rnb ), 1 ≤ b ≤ B, from the conditional distribution of (R1 , . . . , Rn ) given Y(1) , . . ., Y(n) , using the Metropolis algorithm. We can then compute the natural estimate of the marginal distribution of Ri , i = 1, . . . , n, by #{Rib = j} Pb [Ri = j|Y(1) , . . . , Y(n) ] = B and the natural predictor of Yi , b i |Y(1) , . . . , Y(n) ) = Ybi ≡ E(Y n X j=1 Y(j) Pb [Ri = j|Y(1) , . . . , Y(n) ]. To see that we can apply the Metropolis sampler, note that P [R1 = k1 , . . . , Rn = kn |Y(1) , . . . , Y(n) ] ∝ n Y j=1 ϕ(Y(kj ) − Zj ) where the “unknown” proportionality constant is the marginal density of (Y(1) , . . ., Y(n) ), P {A(Y(k1 ) , . . . , Y(kn ) ), Z) : All permutations(k1 , . . . , kn )}. The state space for our Markov chain is thus {All permutations(k1 , . . . , kn )of{1, . . . , n}}. 1 What proposal distribution shall we use? One possibility is to take K(x1 , x2 ) = n! for all x1 , x2 , that is, pick the proposal permutation uniformly at random. This seem unreasonable in the car example, since, given that cars have approximately the same speed and Z1 < . . . < Zn , the identity permutation (1, . . . , n) should be favored. A possible type of proposal suggested by Ritov in Pasula et al (1999) and studied extensively by Ostland (1999) is to make only interchanges of a pair of cars possible on each step, that is, K0 {(k1 , . . . , kn ), (k1′ , . . . , kn′ )} > 0 ′ iff, for one and only one j, kj′ = kj+1 , kj+1 = kj , and ki′ = ki for i 6= j, j + 1. Call such moves interchanges. Further, it is assumed that all interchanges have equal probability (n − 1)−1 . Since K0 is symmetric, the Metropolis algorithm can be described by 242 Monte Carlo Methods Chapter 10 (1) Given k = (k1 , . . . , kn ), propose the interchange ′ (kj , kj+1 ) → (kj+1 , kj ) = (kj′ , kj+1 ). (2) Accept the interchange with probability r(k, k′ ) = min{exp{(Zj+1 − Zj )(Y(kj ) − Y(kj+1 ) )}, 1}. Note that since, in the car example, Zj+1 − Zj > 0, this procedure reasonably always accepts interchanges such that cars which follow each other at the beginning but do not follow each other at the end in the initial permutation do so after interchange. But, with probability depending (inversely) on the discrepancy between the arrival times at the beginning and at the end, we can also invert the natural order as given by the initial permutation. It may be shown that K is geometrically ergodic, that is, satisfies (10.4.1). Hints to proofs of the assertions of this example are given in the problems. A discussion and simulations of this and related models may be found in Ostland (1999). ✷ 10.4.3 The Gibbs Samplers An important and well-analyzed class of kernels K dictated by problem structure is that of the Gibbs samplers. The simplest case of these can be characterized as special cases of Metropolis samplers with K = K0 since this simple Gibbs sampler satisfies detailed balance. Suppose that X = (U, V ) has distribution with density p(·, ·). U and V can be vectors, U ∈ Rd1 , V ∈ Rd2 , or abstract. We assume that we can easily generate observations from the conditional distributions, L(V |U = u) for all u, and L(U |V = v) for all v, and want to generate observations (approximately) from the marginal distribution of U , respectively V , or (see Remark 10.4.2) generate observations from the joint distribution of (U, V ). The Gibbs sampler, introduced by Geman and Geman (1984), prescribes a Markov chain on Rd1 whose stationary distribution is the marginal distribution of U as follows: (1) Draw U1 = u1 from some distribution on Rd1 . (2) Generate V1 = v1 according to the conditional distribution of V given U = u1 . (3) Draw U2 = u2 from the conditional distribution of U given V = v1 and repeat (2) and (3) indefinitely using the output u generated in (3) as the input u generated in (2). We claim that Proposition 10.4.2. Let U1 , U2 , . . . be generated as in (2) and (3) above. Then (a) The desired marginal density of U , call it p(·), is known up to a constant. (b) The chain has p as stationary density. (c) The transition kernel defined by (2) and (3) satisfies detailed balance. Section 10.4 243 Markov Chain Monte Carlo Proof. To establish (a), note that if p(v|u) and q(u|v) denote the conditional densities of V given U and U given V and q is the marginal density of V , then p(v|u) = q(u|v)q(v) p(u) (10.4.11) by Bayes’ theorem. Hence p(u) = q(u|v0 ) q(v0 ) p(v0 |u) for any fixed v0 with p(v0 |u) > 0 and (a) holds with unknown constant q(v0 ). To establish (b), note that the Markov kernel K(u1 , u2 ) we have described is Z K(u1 , u2 ) = p(v|u1 )q(u2 |v)dv (10.4.12) where we assume that all variables considered have continuous type densities, but this is immaterial. Essentially, the procedure is valid whenever the formalism makes sense. The general condition for stationarity is easily checked: Z Z p(u1 ) p(v|u1 )q(u2 |v)dvdu1 (10.4.13) Z Z = q(u2 |v) p(u1 )p(v|u1 )du1 dv Z = q(u2 |v)q(v)dv = p(u2 ) . The proof of detailed balance is left to the problems (Problem 10.4.3). ✷ Remark 10.4.2 Note that similarly, we can generate V2 , V3 , . . . satisfying Proposition 10.4.2. For the Gibbs sampler, U ∼ p(·) and V ∼ q(·) are, by (2) and (3), equivalent to (U, V ) ∼ p(·, ·) because p(u, v) = p(u)q(v|u) = q(v)p(v|u). Thus (UM , VM ), (UM+1 , VM+1 ), . . . are approximately from L(U, V ). Here is an important example of the use of the Gibbs sampler. Example 10.4.2. Posterior distributions in the Gaussian model. We consider an extension of Example 1.6.12 where we observe X1 , . . . , Xn i.i.d. N (µ, σ 2 ) and then prescribe a conjugate prior distribution for θ = (µ, σ 2 ) by specifying that µ and σ 2 are independent with µ having a N (µ0 , τ02 ) distribution and σ −2 having a Γ(p0 , λ0 ) distribution. This model falls under the framework of Example 10.1.3. As we have seen in Example 1.6.12, the posterior distribution of µ given X1 , . . . , Xn and σ 2 is µ0 nX̄ + 2 2 1 σ τ0 . N n 1 , n 1 + 2 σ σ2 + τ 2 τ2 0 0 This is L(U |V = v) with U = µ and V = σ −2 in Gibbs sampler notation. The posterior distribution of σ −2 given µ and the data can be computed (Problem 10.4.5) and is the 244 Monte Carlo Methods Chapter 10 gamma distribution Γ(p0 + 12 n, λ(X, µ)), where λ(X, µ) = λ0 + 1X (Xi − µ)2 . 2 (10.4.14) Thus L(V |U = u) is also available. So we can implement the Gibbs sampler once we know how to draw from the normal and gamma distributions as we have learned in Section 10.2. In this way we get, for instance, (approximate) observations from the posterior distribution of σ 2 given X. As we noted we can use the Gibbs sampler to generate a bivariate Markov chain (Um , Vm ), with Um corresponding to µ and Vm to σ −2 yielding approximate observations from the posterior of (µ, σ −2 ) given X. We shall see how to use these methods in inference in Section 10.5. ✷ Example 10.4.3. A bivariate normal example. Let L(Y |X = x) = N ρx, (1 − ρ2 ) = L(X|Y = x) for |ρ| < 1. By Theorem B.4.2 in Volume I, these are the conditional distributions of the bivariate normal distribution (X, Y ) ∼ N2 (0, 0, 1, 1, ρ) and hence X ∼ Y ∼ N (0, 1). We will observe the workings of the Gibbs sampler. Suppose we initiate X (0) ∼ N (0, σ 2 ) . Then we can write 1 Y (0) = ρX (0) + (1 − ρ2 ) 2 Z , where Z ∼ N (0, 1) independent of X. Therefore, At the next round, Y (0) ∼ N 0, ρ2 σ 2 + (1 − ρ2 ) . 1 X (1) = ρY (0) − (1 − ρ2 ) 2 Z X (1) ∼ N 0, ρ4 σ 2 + ρ2 (1 − ρ2 ) + (1 − ρ2 ) = N 0, ρ4 σ 2 + (1 − ρ4 ) . Continuing, if at the kth stage X (k) ∼ N (0, σk2 ) then (Problem 10.4.8) and thus X (k+1) ∼ N 0, ρ4 σk2 + (1 − ρ4 ) , 2 σk+1 − 1 = ρ4 (σk2 − 1) . Then, σk2 − 1 is monotone decreasing or increasing according as σk2 ≷ 1 and converges to the limit 2 2 σ∞ − 1 = ρ4 (σ∞ − 1) . 2 So σ∞ = 1. Note that convergence is exponential. ✷ Section 10.4 245 Markov Chain Monte Carlo Example 10.4.4. The random effects model. We follow Example 3.2.4 and have Xij = µ + ∆i + εij , 1 ≤ i ≤ I, 1 ≤ j ≤ J , 2 with ∆i and εij i.i.d. and independent of each other, N (0, σ∆ ) and N (0, σe2 ), respectively. 2 2 T T Then, by (3.2.17), if τ ≡ (µ, σe , σ∆ ) , ∆ ≡ (∆1 , . . . , ∆l ) , respectively, p(x|∆) = Πi,j ϕσe (xij − µ − ∆i ) which depends on τ . Here the ∆i are latent (unobserved) variables with joint density Y 2 ϕσ∆ (∆i ) . p(∆|µ, σe2 , σ∆ )= i In the Bayesian framework we put a prior π(·) on τ and include ∆ in our list of parameters. The full posterior distribution is then π(τ, ∆|x) ∝ p(x|∆)p(∆|τ )π(τ ) , where the proportionality constant c(x) is such that Z π(τ, ∆|x) dτ d∆ = 1 . The value of the constant is irrelevant if we want the posterior mode but matters for credi2 ble Bayesian regions of τ . It is absolutely necessary if, say, we want the density of σ∆ |x. We can, in principle, obtain this constant by the Metropolis-Hastings method finding an e 1 , τ2 ) which has the posterior distribution of τ |x as its homogeneous Markov kernel K(τ stationary distribution in ways we have discussed. Note that since we are generating an empirical probability distribution, say that of τ b0 , τ b0 +t , . . . , τ b0 +nt , observations from the Markov chain after “burn in” spaced t apart (where t is “large”), this empirical distribution yields estimates of every plausible function of π(τ |x). 2 This approach also works for a non-Bayesian formulation of the model. Suppose µ, σ∆ , 2 2 2 σe are simply unknown parameters. Then we are interested in good estimates µ b, σ be , σ b∆ and confidence regions for ∆ just based on the conditional distribution of ∆ given X with 2 parameter values µ b, σ be2 , σ b∆ . We still need to calculate the conditional distributions. These are necessarily Gaussian in this case since (∆, X) have a joint Gaussian distribution which is easy to compute. This is not the case, however, if, for instance, we consider a robust alternative to model (3.2.16) and assume ∆i have, say, logistic distributions. Neglecting 2 what is now the difficulty of estimating σ∆ , σe2 , and µ, we are still left with a formidable integration problem finding the conditional density of ∆i given X. MCMC can come to the rescue as before. ✷ The natural generalization of the simple Gibbs samplers appropriate to Bayesian problems of this structure is to the case where we want to simulate from p, the density of U = (U1 , . . . , Uk ), where the Uj may themselves be vectors. We are given that the conditional distribution of Ui given {Uj : j 6= i}, call it L(Ui |U−i ), is easy to sample from. 246 Monte Carlo Methods Chapter 10 Then, the idea is to cycle, initializing by picking U = u0 arbitrarily, then U1 = u11 from L(U1 |U−i = u0−i ), then U2 from L(U2 |U−2 = (u11 , u03 , . . . , u0k )), and continuing till the kth coordinate when we have U1 = (u11 , . . . , u1k ) complete, and then repeat. Although this chain does not obey detailed balance, we can show that its stationary distribution is p by arguing as we did for the case k = 2. An example is the random effects model (Example 10.4.4) where Ui = ∆1 . 10.4.4 Speed of Convergence and Efficiency of MCMC As we have noted, a reasonable measure of the desirability of a Metropolis or Gibbs sampler {X1 , . . . , Xn } once simplicity has been taken into account is governed by the speed of convergence of the distribution of Xn to the desired stationary distribution with discrete or continuous case function p. But how is convergence to be measured in general? If the state space X = {x1 , . . . , xN } is finite and the transition matrix is K = kK( · , · )kN ×N then, as we note in Appendix D5, if a homogeneous Markov chain is irreducible, aperiodic, and satisfies detailed balance, then, as required in (10.4.1), for some ρ ∈ (0, 1) ∆n (K) ≡ N X k=1 |P [Xn = xk ] − p(xk )| ≍ ρn where “≍ ρn ” means that, as n → ∞, n−1 log ∆n (K) → log ρ. We next relate convergence to eigenanalysis. Consider an aperiodic irreducible reversible Markov Chain on the finite state space {1, 2, . . . , N }. Under these conditions it can be shown (Grimmett and Stirzaker (2001), Sections 6.6 and 6.14) that the transition matrix K has real eigenvalues λ1 , . . . , λN such that λ1 = 1 and |λj | < 1 for j = 2, . . . , N . Moreover the entries vrj in the corresponding eigenvectors vr are real. Define the inner product N X xj yj pj (x, y) = j=1 where pj is the probability of observing j under the stationary distribution. We can take the eigenvectors v1 , . . . , vN to be an orthornormal basis with respect to (·, ·). It follows that v1 = (1, . . . , 1)T ≡ 1. We can now give an exact expression for the n-step transition probability. Proposition 10.4.3. Set pij (n) = P (Xn+1 = j|X1 = i). Under the conditions in the previous paragraph, N X pij (n) − pj = pj λnr vrj vri . r=2 Proof. We can write the unit vector ej = (0, . . . , 1, . . . , 0)T as ej = N X r=1 (ej , vr )vr = N X r=1 vrj pj vr . (10.4.15) Section 10.4 247 Markov Chain Monte Carlo T Next note that Kn ej = p1j (n), . . . , pN j (n) and Kn vr = λnr vr . Multiply (10.4.15) on the left by Kn and find that pij (n) = pj N X λnr vrj vri . r=1 The result follows because λ1 = 1 and v1 = 1; so the first term in the sum is pj . ✷ Now we can establish convergence rates: Corrollary 10.4.1. Under the conditions of Theorem 10.4.3, |pij (n) − pj | ≤ pj (|λ|2 )n N X vrj vri r=2 where |λ|2 = max |λj | is less than one. Moreover, the total variation norm satisfies 2≤j≤N N X j=1 |pij (n) − pj | ≤ |vi |2 (N − 1)(|λ|2 )n where |vi |2 = max |vri |. 2≤r≤N Proof. The first part follows from the definition of P |λ|2 . The second part follows because P 2 p |v | ≤ 1 by Cauchy-Schwarz and because j pj vrj = 1 by the definition of the j j rj inner product for the eigenvectors. Remark. For more on the convergence of MCMC, see Diaconis and Strook (1991), Grimmett and Stirzaker (2001), and Diaconis (2009). Efficiency of MCMC We will measure the effectiveness of MCMC by comparing it to the approximations we would have if we could obtain an i.i.d. sample from p. First we need the following result which shows that the second largest eigenvalue λ2 can be interpreted as a maximal correlation. Theorem 10.4.1. Suppose the state space of a homogeneous chain is finite, and that K( · , · ) satisfies detailed balance. Then, if the eigenvalues of the transition matrix K are ordered so that 1 = λ1 ≥ λ2 ≥ . . . ≥ λN , λ2 = ρ+ ≡ max{Corr(g(X1 ), g(X2 )) : g arbitrary}. Proof. The pair (Xi , Xj ) has discrete case density K(xi , xj )pi , where pi = P (X1 = xi ). The optimization problem whose solution we claim to be λ2 is to maximize over vectors g= (g1 , . . . , gn )T , X gi gj K(xi , xj )pi (10.4.16) i,j 248 Monte Carlo Methods subject to X g i pi = 0 , (i) (ii) X Chapter 10 gi2 pi = 1 . i i P Regard (10.4.16) as a sum over j. Fix j and maximize each term gj i gi K(xi , xj )pi . The solution g0 , which necessarily exists, satisfies for all j, some γ1 , γ2 , X gi K(xi , xj )pi = γ2 pj gj + γ1 pj , (10.4.17) i since the method of Lagrange multipliers applies, (Apostol (1957), TheoremP 7.10, p.153; see Problem 10.4.9). The sum over j of the left side of (10.4.17) reduces to i gi pi = 0. Thus the sum over j of the right side is also 0. This implies that γ1 = 0 by (i). By detailed balance, pi K(xi , xj ) = K(xj , xi ) , pj which implies that γ2 is a right eigenvalue of K = K(xi , xj ) N ×N and g0 is a right eigenvector. Since all eigenvectors are vectors of real numbers and g ≡ 1 is not a candidate (by (i)) we conclude by the Courant-Fischer theorem (Problem 8.3.24), that is, γ2 is the second largest eigenvalues. Moreover multiplying (10.4.17) by gj , summing, and using (ii), we see that γ2 = max{Corr(g(X1 ), g(X2 ))} . ✷ Proposition 10.4.4. Under the assumptions of Theorem 10.4.1, max{Corr(g(X1 ), g(Xk+1 )) : g arbitrary} = (ρ+ )k . Proof. See Problem 10.4.10. Remark. Corr(X1 , Xk+1 ) is called the autocorrelation of the chain, and max{Corr(g(X1 ), g(Xk+1 )} , g the maximal autocorrelation. Next we show that ρ+ can be used to show the effectiveness of MCMC. PN Example 10.4.5. Consider the problem of estimating Ep g(X1 ) = j=1 g(xj )pj by PB −1 ḡ ≡ B b=1 g(Xb ), where X1 is initialized at some distribution and X2 , . . . , XB are obtained via K( · , · ). Then 2 E(ḡ − Ep g(X1 )) = 2 B 1 X (Eg(Xb ) − Ep g(X1 )) B b=1 Section 10.4 1 + B 249 Markov Chain Monte Carlo B B−1 B 1 X 2 X X Var g(Xb ) + Cov(g(Xa ), g(Xb )) . B B b=1 (10.4.18) b=1 a=b+1 ′ On the other hand, suppose that X1′ , . . . , XB are an i.i.d. sample from p and define ḡ ′ ≡ PB −1 ′ B b=1 g(Xb ). It is reasonable to define the effectiveness or “efficiency” of the Markov chain sampling scheme by E(ḡ ′ − Ep g(X1 ))2 e(K) ≡ lim inf : all g not constant . B→∞ E(ḡ − Ep g(X1 ))2 We claim that Theorem 10.4.2. Under the assumptions of Theorem 10.4.1 e(K) = 1 − ρ+ . 1 + ρ+ The proof is sketched in Problems 10.4.10 and 10.4.11. It is quite possible that ρ+ < 0 so that the Markov chain is more efficient than independent sampling — see Problem 10.4.12. Rosenthal (2003) shows, in the context of general chains, how to construct Metropolis samplers with good speed of convergence from samplers with ρ+ close to −1. Speed of convergence results are also available for countable or, more importantly, continuous state spaces. However, more conditions are required and computation is even more challenging. An introductory treatment is in Tierney’s article in Gilks et al (1995) and a thorough treatment is given in Meyn and Tweedie (1993). See also Rosenthal (2003). Unfortunately, computing or even bounding λ2 in cases of interest is difficult. A key question that remains is how to determine the length of the “burn in” period from initial simulations. We refer to Liu (2001), Diaconis (2009), and Brooks, Gelman, Jones, and Meng (2011), for instance, for more literature on MCMC. Summary. We introduced Markov Chain Monte Carlo methods which generate X1 , X2 , . . . that are approximately i.i.d. from p, where p is a density known up to a constant, and from which generation of random variables is difficult. The method entails generating observations from a homogeneous Markov chain with p as stationary distribution. In particular, we presented the Metropolis–Hastings method for constructing such Markov chains. We introduced the Gibbs sampler which is an algorithm where knowledge of L(V |U = u) and L(U |V = v) is used to generate variables U and V from the marginal distributions L(U ) and L(V ) as well as from the joint distribution L(U, V ). We showed, using an example, how in a Bayesian framework, the Gibbs sampler can be used to generate data from L(θ1 |X), L(θ2 |X), and L(θ1 , θ2 |X) when L(θ1 |X, θ2 ) and L(θ2 |X, θ1 ) are easy to generate from. Since variables which approximately have the desired p are only produced after the chain is run for some time — the so called “burn in” period, it is important to know how good the approximation is. We gave conditions under which the discrete case density pk of Xk is close to p. In particular, when X has a finite state space X , then under conditions on the Markov kernel K, the distance between p and pk is of order ρk , where ρ ∈ (0, R 1). We finally discussed how MCMC can be used to approximate integrals of the form g(x)p(x)dx and measured the effectiveness of such approximations by comparing them to Monte Carlo approximation based on (unavailable) i.i.d. samples from p. 250 10.5 Monte Carlo Methods Chapter 10 Applications of MCMC to Bayesian and Frequentist Inference Geyer makes the argument in his paper in Gilks et al (1996) that MCMC is as applicable in frequentist inference as it is in Bayesian inference, given that one is basically concerned with computing high dimensional integrals. We agree with this point of view but go even further. We argued in Section 5.5 that, at least from an asymptotic point of view, if the model we consider is regular parametric, the parameter θ belongs to Rd and the prior distribution put on θ is smooth, then optimal Bayes procedures coincide asymptotically with efficient frequentist procedures (Maximum Likelihood), from a frequentist point of view. This point of view runs into trouble when d is of the order of n or larger — see, for instance, Diaconis and Freedman (1986). These questions, as well as more basic issues of consistency of Bayes procedures continue to be vigorously investigated — see Ghosh and Ramamoorthi (2003). We begin with a continuation of Example 10.4.2. Example 10.5.1. The Gaussian model. We saw in Example 10.4.2 how we could use the Gibbs sampler to approximately generate observations θ∗1 , . . . , θ∗B from the posterior distribution of θ = (µ, σ −2 ) when the prior distribution has µ and σ −2 independent Gaussian and Gamma, respectively. Assume for simplicity that the observations were obtained by running a sampler from independent starting points so that the Xi∗ generated by the sampler are independent and that, although this is true only approximately, their distribution is correct. We have seen earlier that the Bayes estimates of µ for quadratic loss or other symmetric loss functions if σ is assumed known is just the posterior mean −1 n µo 1 nX̄ . + + σ2 τo2 σ2 τo2 However, if σ 2 is unknown and has a prior distribution as in Example 10.4.2, to get an estimate of µ, we have to compute the marginal posterior mean of µ, ( ) −1 nX̄ n µo 1 E | X , + 2 + 2 σ2 τo σ2 τo for which we need Z 0 ∞ nX̄w + µ0 τ02 −1 1 q(w|X)dw nw + 2 τo where q(w|X) is the posterior density of σ −2 , a quantity only determinable through numerical integration. On the other hand, we have, using θ ∗1 , . . . , θ ∗B , the simple approximation P ∗ ∗ ∗ ∗ T B −1 PB b=1 θ1b of the marginal posterior mean of µ, where θ b = (θ1b , θ2b ) . Similarly −1 ∗ 2 B θ2b provides an estimate of the marginal posterior mean of σ . Next, suppose we want the shortest possible Bayes level 1 − α credible interval for µ. We need to find [µ, µ̄] such that Section 10.5 (i) PB Applications of MCMC to Bayesian and Frequentist Inference b=1 251 1(θ∗1b ∈ [µ, µ̄]) = [B(1 − α)], (ii) µ̄ − µ minimizes µ2 − µ1 among all pairs (µ1 , µ2 ), µ1 < µ2 , satisfying (i). Finding a minimum volume Bayes credible region for θ, a generalization of the problem stated above, can be done as follows. Since the posterior density of θ is proportional to the joint density of (X, θ), which is completely known, the regions which need to be considered are just all Sc = {θ : p(X|θ) q(θ) ≥ c} where q is the prior density and p(X|θ) is the conditional density of X given θ = θ. Evidently we now choose c so that B 1 X ∼ 1(θ∗b ∈ Sc ) = 1 − α . B b=1 ✷ We pursue these ideas more generally. We observe X ∼ p(·|θ), θ ∈ Θ, where p(x|θ) is a density function, θ ∈ Θ, θ has a prior density π(θ), (θ, X) has joint density π(θ)p(x|θ), and we want to generate observations from the posterior density of θ given X. We took advantage of special features of Example 10.4.2 and we used the Gibbs sampler. How do we generate the θ∗b in general? The all purpose answer is a Metropolis algorithm since specification of p(x|θ) and q(θ) is necessary to define the model. That is, we obtain the θ∗b by the usual algorithm: Choose a positive recurrent aperiodic Markov kernel K(·, ·) on Θ × Θ with stationary density π(θ|x) and generate observations from it after a suitable burn-in from a number of starting points. As usual, one picks K0 (·, ·) positive recurrent, aperiodic Markov on Θ × Θ and then lets K(θ1 , θ2 ) = r(θ1 , θ2 )K0 (θ1 , θ2 ) p(x|θ2 )π(θ2 )K0 (θ2 , θ1 ) . if θ1 6= θ2 with r(θ1 , θ2 ) = min 1, p(x|θ1 )π(θ1 )K0 (θ1 , θ2 ) (10.5.1) All we are using here is that the posterior density of θ, π(θ|X), is proportional (up to a data dependent constant) to π(θ)p(x|θ). Let A be an action space and l : Θ × A → R+ be a loss function. The Bayes procedure for any such problem is, according to Section 3.2, given by δ ∗ (x) = arg min E(l(θ, a)|x) a where E(l(θ, a)|x) = Z l(θ, a) dΠ(θ|x) and Π(θ|x) is the posterior probability distribution with density π(θ|x). Suppose we are able to generate θ∗1 , . . . , θ∗B independent, each having (approximately) the posterior distribution of θ given X = x. Then it is natural to use the approximation δ ∗(B) B 1 X ∗ l(θb , a) . (x) = arg min a B b=1 (10.5.2) 252 Monte Carlo Methods Chapter 10 Thus, for instance, in Example 10.5.1, in general, without our particular choice of π(µ, σ −2 ), the conditional distributions of µ and σ −2 , respectively, given each other and X, would not have a closed form. But the Metropolis sampler is still implementable, although, of course, the usual questions of speed of convergence need to be asked. The Gibbs and Metropolis sampler can be applied in frequentist inference in a number of ways. One we have already seen in the prediction context in Example 10.4.1. We pursue that in Example 10.5.2. Maximum likelihood and travel time prediction. EM-MCMC. We assumed, in the prediction problem, that the travel time density was known to be N (0, 1). Suppose, more realistically, that the density is in fact unknown. We can approximate the density by a parametric model f (·, θ), such as the gamma family or more generally finite mixtures of gammas. How do we carry out maximum likelihood estimation of θ? One possible approach is via the EM algorithm and MCMC. Represent Y1 , . . . , Yn as (Y(1) , . . . , Y(n) , R1 , . . . , Rn ) with density f (Y, θ|Z) = Πni=1 f (Y(Ri ) − Zi , θ) . Viewing R1 , . . . , Rn as missing, we see that we can alternate bOLD ) ≡ E step : Compute Λ(θ, θ n X i=1 Eb {log f (Y(Ri ) − Zi , θ)|Y(1) , . . . , Y(n) } θOLD bOLD ) to get θ bNEW by solving M step : Maximize Λ(θ, θ bOLD ) = ∇Λ(θ, θ n X i=1 Eb {∇ log f (Y(Ri ) − Zi , θ)|Y(1) , . . . , Y(n) } θOLD = 0 (10.5.3) The first computation can be done, in principle, by MCMC since bOLD ) = Λ(θ, θ n n X X i=1 j=1 log f (Y(Ri ) − Zi , θ)P b [R = j|Y(1) , . . . , Y(n) ] (10.5.4) θ OLD i which is a problem of the same type we considered in Example 10.4.2. (Here, everything is conditional on Z.) Thus, using the same notation as before on the M step, we maximize bOLD ) ≡ Λ(θ, θ n X n X i=1 j=1 log f (Y(j) − Zi , θ)Pbθ OLD [Ri = j|Y(1) , . . . , Y(n) ] . bOLD , running MCMC given θ bOLD , then maximize Λ(θ, b That is, we alternate, for fixed θ bOLD ) to get θ bNEW . Implementation of this kind of algorithm is discussed in Ostland θ (1999). ✷ Section 10.5 Applications of MCMC to Bayesian and Frequentist Inference 253 We can think of this implementation of EM and a more flexible approach in a general context, following Geyer — see Chapter 14 in Gilks et al (1995). We are given f (x, v, θ), the joint density of (X, V, θ) in an analytically tractable form. We observe X so that the likelihood is, Z g(X, θ) = f (X, v, θ) dv and our goal is to maximize g(X, θ) with respect to θ. According to EM, we need, for the bOLD = θ0 E step, given θ Λ(θ, θ 0 ) ≡ Eθ0 (log f (X, V, θ) | X) For θ0 fixed, we construct a Metropolis or other sampler for the conditional distribution of V given X under θ0 and get V1 , . . . , VB . We estimate, b Λ(θ, θ0 ) = B −1 B X log f (X, Vb , θ0 ) b=1 b in the M step. and maximize Λ A more basic alternative is to use formula (2.4.8) which gives g(X, θ) f (X, V, θ) L(θ, θ0 ) ≡ = Eθ0 |X . g(X, θ 0 ) f (X, V, θ0 ) Again, we generate V1 , . . . , VB (approximately) i.i.d. from the conditional distribution of V given X under θ0 and estimate L(·, θ, θ 0 ) by b θ 0 ) = B −1 L(θ, B X f (X, Vb , θ) . f (X, Vb , θ 0 ) (10.5.5) b=1 b as the object to be maximized and thus we apparently do not need Naively, we can view L to run a new sampler for each θ. This is, however, illusory since the approximation of L by b is poor if θ is not close to θ 0 , as we illustrate in the next example. L But this approach does make it clear that we can vary strategies, for instance, by replacbOLD ) such as the Newton–Raphson b θ ing EM steps by other methods of optimizing L(θ, method. An important feature of these approaches is that since the sampler, and, hence the objective function, change at each iteration, we no longer have the guaranteed increase in the objective function that the classical EM and other good algorithms give us. Here is an application which illustrates the risks of naive application of this method and indicates the correct way of proceeding. Example 10.5.3. Transformation model. Many models including the Cox proportional hazard model, so called frailty models, and a semiparametric extension of the Box-Cox (1964) transformation model can be put in the following framework: We observe (Zi , Yi ), 1 ≤ i ≤ n, i.i.d. as (Z, Y ), Z ∈ Rd , Y ∈ R . 254 Monte Carlo Methods Chapter 10 We postulate that there is an unknown monotone strictly increasing transformation a : R → R such that for θ ∈ Rd , a(Y ) = θT Z + ε, 1≤i≤n (10.5.6) where ε is independent of Z and has known distribution F0 with density f0 . We can interpret this by saying that, on an unknown scale a, the responses Y follow an ordinary linear model. Thus, we have a semiparametric model {P(θ,a) : θ ∈ Rd , a increasing. It may be shown (Problem 10.5.3) that if −t f0 (t) = e−t e−e , t≥0, a Gumbel density, then the model is a reparametrization of the Cox proportional hazard model. Other distributions correspond to frailty models (Vaupel, Manton and Stallard (1979)) where a latent variable ξ is postulated for each individual and a conditional hazard rate for Y given by λ(y, θ|z, ξ) = ξr(θ, z)λ(y) . (10.5.7) If we assume ξ comes from a fixed known distribution, then this is again a transformation model. See Problem 10.5.6. From the structure of these models we see that, to test θ = θ0 , we can invoke invariance of the model under the group of monotone transformations as in Chapter 8, to conclude that it is reasonable to base tests on the likelihood pθ (r, z) of (Zi , Ri ), 1 ≤ i ≤ n, where R1 , . . . , Rn are the ranks of Y1 , . . . , Yn . Because Rank(a(Yi )) = Rank(Yi ), we can without loss of generality set a(y) = y in this likelihood. Moreover, because pθ (r, z) = pθ (r|z)h(z) and h(z) does not depend on θ, it is enough to consider the conditional density function of R given Zi = zi , 1 ≤ i ≤ n, which, for general fθ (y|z), is L(θ) ≡ Pθ [R = r|z1 , . . . , zn ] = Eθ ′ f (Yi |zi ) Πni=1 θ |R = r fθ′ (Yi |zi ) (10.5.8) where θ′ is a reference value of θ which is selected in what follows, and fθ (y|z) is the conditional density of Y given Z = z. In the linear model (10.5.6) a natural choice for θ′ is 0, in which case, 1 n fθ (Y(ri ) |zi ) L(θ) = Pθ [R = r|z1 , . . . , zn ] = E0 Πi=1 , (10.5.9) n! f0 (Y(ri ) ) which is Hoeffding’s formula. See Problem 9.1.23. Note that under θ = 0, the Yi are i.i.d. and (R1 , . . . , Rn ), (Z1 , . . . , Zn ) are independent with P0 [R = r] ≡ 1/n! . Example 10.5.2 also has this structure, save that (10.5.6) is replaced by a parametric model and we concentrated on (Y(1) , . . . , Y(n) ) rather than (R1 , . . . , Rn ). Now it seems reasonable to estimate θ by maximizing a Monte Carlo approximation to L(θ). From (10.5.9), this appears possible even without MCMC, and was proposed by Doksum (1987) for θ satisfying 1 |θ|/σ0 = O(n− 2 ) where σ02 is the variance of ǫ. In this approach: Section 10.5 Applications of MCMC to Bayesian and Frequentist Inference 255 ∗ ∗ Generate sets of n independent observations, Y1b , . . . , Ynb , 1 ≤ b ≤ B, with Yjb∗ having density f0 (·), 1 ≤ j ≤ n, and, then, given R = r, Z = z, as data, estimate L(θ) by ) ( B X fθ (Y(r∗ i )b |zi ) 1 n b L(θ) ≡ (10.5.10) Πi=1 Bn! f0 (Y(r∗ i )b ) b=1 ∗ ∗ where Y(1)b ≤ . . . Y(n)b are the ordered Yib∗ . This can be viewed as importance sampling or as an implementation of (10.5.5). Here 1 n fθ (Y(ri ) |zi ) b Var0 Πi=1 Var L(θ) = B(n!)2 f0 (Y(ri ) ) which for fixed n tends to zero as B → ∞. However, we face the usual importance sampling issue. For the linear model (10.5.6), ! fθ2 (Yi |zi ) n n fθ (Yi |zi ) −1 (10.5.11) = Πi=1 E0 Var0 Πi=1 f0 (Yi ) f02 (Yi ) ) ! Z ( 2 f0 (y − θT zi ) n = Πi=1 dy − 1 . f0 (y) Here, as n → ∞ for any fixed θ 6= 0, the variance goes to ∞ exponentially (Problem 1 10.5.1), although it is O(1) for the |θ|/σ0 = O(n− 2 ) case considered by Doksum (1987) (Problem 10.5.2). There are various alternative ways to proceed taking advantage of special features of particular models — see Murphy (1995) for gamma frailty for instance. However, for linear regression transformation models, a general way is to use (10.5.8) and MCMC ideas as follows: b0 denote the MLE for Y1 , . . . , Yn distributed as Πn f0 (yi − θT zi ). 0. Let θ i=1 1. At the mth step, m ≥ 0, generate B i.i.d. vectors (Y1b , . . . , Ynb ), 1 ≤ b ≤ B, bT zi ). Let Y ∗ , . . . , Y ∗ denote where Y1b , . . . , Ynb are distributed as Πni=1 f0 (yi −θ m (1)b (n)b Y1b , . . . , Ynb ordered. 2. Estimate L(θ) by L∗m (θ) = B f 1 X n Πi=1 θ (Y(r∗ i )b |zi ) , B fb b=1 θOLD (10.5.12) bOLD = θ bm . Maximize L∗ (θ) to obtain θ bm+1 and proceed to step m + 1. where θ We call this algorithm the likelihood sampler. It produces a semi-parametric likelihood estimate of θ for the transformation model where (a(Y )|z) ∼ fθ (y|z) for unknown increasing a(·). When f0 is N (0, 1), this is an extension of the Box-Cox model that has a(t) = (tλ − 1)/λ, λ ∈ R. ✷ 256 Monte Carlo Methods Chapter 10 Summary. We demonstrated how MCMC and Gibbs sampling algorithms can be used in Bayesian analysis to compute approximations to estimates and credible intervals by generating data approximately distributed as L(θ1 , θ2 |X) when L(θ1 |X, θ2 ) and L(θ2 |X, θ1 ) are known. We next showed, in examples, how to use MCMC to approximate maximum likelihood estimates in difficult situations. In particular we showed how a combination of the EM algorithm and MCMC can be used to compute maximum likelihood estimates in the travel time model; and we showed how MCMC can be used to compute maximum likelihood estimates in a transformation model with unknown monotone transformation. 10.6 Problems and Complements Problems for Section 10.1 1. In Example 10.1.1 (a), suppose both G1 and G2 are χ22 under H. Let m = n − m = 15. Use R or other software to generate the Monte Carlo and permutation distributions of the two-sample t-statistic. Use M = 1000 Monte Carlo trials and 1000 random permutations. Summarize the results in histograms and compare the results to the student t28 density of Section B.3.1. 2. In Example 10.1.2, suppose that F0 is known. (a) Show that by conditioning on X̄([nα]+1) and X̄(n−[nα]) it is possible to express E0 (X̄α2 ) in terms of one and two dimensional integrals. (b) Use (a) and MATLAB or some other software to evaluate E0 (X̄α2 ) when n = 10, α = 0.25, 0.50, 0.75 and (i) F0 is N (0, 1), (ii) F0 is Laplace, f0 (x) = 12 exp{−|x| }. Problems for Section 10.2 1. Let r = m/2 and s = n/2 for positive integers m and n. Describe how Theorem B.2.3 of Vol. I can be used to generate a random variable with a β(r, s) distribution. 2. Show that in Theorem 10.2.1, V is minimized by choosing p0 proportional to |h|p. 3. (a) Establish (10.2.6) by showing that for this example, as s, r → ∞, c(p, p0 ) 1 √ . −→ p r+s 2πt(1 − t) (b) Show that rejective sampling is not useful if p0 is uniform and r or s < 1. 4. Let p denote the β(r, s) density with r, s > 1. Set m = [2r] and n = [2s] where [·] is the greatest integer function. Let r0 = m/2 and s0 = n/2 and let p0 be the β(r0 , s0 ) density. Discuss the behavior of c(p, p0 ) as s and r varies, including the case where s, r → ∞. 5. Show how one can simulate a random variable with the density (10.2.7). Hint. Let Z be the Bernoulli ( 12 ). Generate a value of Z. If Z = 0, let X be generated by Section 10.6 257 Problems and Complements the density rxr−1 , 0 < x < 1 (show how). If Z = 1, let X be generated by the exponential, E(1) density. −1 6. The logistic distribution has df FL (x) = [1 + exp{−x}] fL (x) = and density e−x , −∞ < x < ∞ . (1 + e−x )2 (a) Describe how the simple Monte Carlo technique can be used to generate X1 , . . . , Xn with density fL . Hint. Find FL−1 (·). (b) Describe how rejective sampling and (a) preceding can be used to generate X1 , . . . , Xn with density afL (x) , −∞ < x < ∞, a > 0 . f (x) = 1 + e−|x| (c) Use MATLAB or other software to find a in (b) preceding. Then find E(τ ) where τ is as in Theorem 10.2.2. (d) Carry out rejective sampling as in (b) preceding to obtain an observed sample x1 , . . . , x100 from f . You may use existing software. How long did the procedure take? Draw a histogram of your results. 7. Suppose X has density f (x) = ae−|x| , a > 0, x ∈ R . 1 + e−|x| (a) Describe how rejective sampling can be used to generate X. (b) Use MATLAB or other software to find the constant a. 8. Let f and g be continuous case density functions. Assume there exists a finite M such that f (y) ≤ M g(y) for all y. To obtain a random sample of size n from the density function f (x) using the Rejection Method Algorithm, simulate Y1 , . . . , YN from the density g(x). Then record the accepted sample {X1 , . . . , Xn } and the rejected sample {Z1 , . . . , ZN −n }. Let Ef denote expectation with respect to f . PN Pn (a) Are δ1 = n−1 i=1 h(Xi ) and δ2 = N −1 i=1 h(Yi )f (Yi )/g(Yi ) unbiased estimators of Ef h(X)? (b) Find the marginal density of Zi . (c) Suppose N > n. Is the estimator δ3 = N −n X (M − 1)f (Zj )h(Zj ) 1 N − n j=1 M g(Zj ) − f (Zj ) and, unbiased estimator of Ef h(X)? Explain. 258 Monte Carlo Methods Chapter 10 9. Suppose X1 , . . . , Xn are i.i.d. P where P ∈ P = {Pθ : θ ∈ R}. (a) Assume P is regular in the sense that ψ(x, θ) ≡ ∂/∂θ √ log p(x, θ) satisfies conditions A1–A4, A6 in Section 5.4.2 of Volume I. Let θ = t/ n + θ0 and L(X1 , . . . , Xn ) ≡ n Pn (log p(X , θ ) − log p(X , θ )). Show that for X , . . . , Xn i.i.d. i n i 0 1 i=1 2 t 2 Lθ0 (L(X1 , . . . , Xn )) =⇒ N − I(θ0 ), t I(θ0 ) , 2 where I(·) denotes Fisher information. Hint. See Examples 7.1.1 and 7.1.2. (b) Suppose that |∂l/∂θj |ψ(X, θ)| are uniformly bounded for θ in a compact set and 1 ≤ j ≤ 3. Given a bounded function gn : X n → R, |gn (X1 , . . . , Xn )| ≤ M < ∞, we wish to estimate En ≡ Eθn gn (X1 , . . . , Xn ). Suppose it is easy to generate i.i.d. observations from Pθ0 and we use the importance sampling estimate B X ∗ ∗ ∗ ∗ bn ≡ 1 E gn (X1b , . . . , Xnb ) exp{L(X1b , . . . , Xnb )} , B b=1 ∗ where {Xib : 1 ≤ i ≤ n} are i.i.d. from Pθ0 , 1 ≤ b ≤ B. P bn = En −→ 0 as n and B tend to infinity. Show that E bn ) ≤ C/B for some constant C independent of n. (In (b), if necessary, you Hint. Var(E may assume that the given boundedness condition implies the conditions for part (a).) Problems for Section 10.3 R 1. Establish (10.3.6), (10.3.7), and Theorem 10.3.1 for the parameter θ(P ) = g( xdP (x)), where we assume g ′′′ exists and |g ′′′ | ≤ M , E|X1 |4 ≤ M for some constant M > 0. Formally show that if B ≍ n2 , then − 21 b b ) BIAS(B) n (Pn ) = BIASn (Pn ) + OP ((Bn) B b − 32 BIASn (Pn ) = BIASn (P ) + OP (n ) (B) d SD = SDn (Pbn ) + OP (n−1 B −1 ) n 3 SDn (Pbn ) = SDn (P ) + OP (n− 2 ) . Hint. For (10.3.6), use the definition of the bootstrap based on B replications. For (10.3.7), use the expansion g ′′ (X̄) ∗ g ′′′ (X̄) ∗ 3 θ(P ∗ ) = θ(Pbn )+g ′ (X̄)(X̄ ∗ − X̄)+ (X̄ − X̄)2 + (X̄ − X̄)3 +OP (n− 2 ) . 2 6 σ 3 b2 ∗ ∗ b b BIAS(B) + OP (n− 2 ) . n (Pn ) = E θ(P ) − θ(Pn ) + n Section 10.6 259 Problems and Complements 2. Suppose g : Rd → Rp has total differential g (1) (x) = k(∂gi /∂xj )(x)kp×d at x = µ ≡ EX1 . Generalize Theorem 10.3.2 (b) to √ L∗ ( n(g(X̄∗ ) − g(X̄))) → N (0, g(1) (µ)Var(X1 )g(1) (µ)) . Hint. See Theorem 5.3.4. 2 3. Assume X1 , . . . , Xn i.i.d. P . In Example 10.3.1, find the bias of σ bB = σ 2 (Pbn ) − (B) b 2 BIASn (Pn ). Assume that 0 < σ (P ) < ∞. Compare your result to the bias −σ 2 (p)/n of the empirically bias corrected estimate. 4. Establish (10.3.4). 5. Establish (10.3.5). √ 6. In Section 10.3.3, show that if n[b µn − µ(P )] tends to a Gaussian distribution and the 1 difference between the bootstrap quantiles and population quantiles are of order OP (n− 2 ), then Efron’s percentile bootstrap µ∗n[B(1−α)+1] with B = ∞ is an asymptotic (1−α) UCB. 2 7. Let K4 (P ) = EP X 4 − 3 EP (X) . Show that EK4 (Pbn ) = K4 (P ) + O(n−1 ). R R8. 2The Mallows’ metric. Let F be the class of df’s with x dF (x) = 0 and 0 < x dF (x) < ∞. The Mallows’ distance ρ between F , G ∈ F is defined by ρ2 (F, G) = inf EP (X − Z)2 , P where the inf is over bivariate probabilities P for (X, Z) with marginals F and G. A relevant result is Fréchet’s theorem: Z 1 2 ρ (F, G) = [F −1 (u) − G−1 (u)]2 du , 0 w Let −→ denote weak convergence. Mallows’ (1972) lemmas: R R w 1. ρ(Fk , G) −→ 0 iff F −→ G and x2 dFk (x) −→ x2 dG(x). Pk 2. X1 , . . . , X PIf 2Fk = L( i=1 ai Xi ) where Pn are independent with L(Xi ) = Fi ∈ F and ai = 1, then Fk ∈ F and ρ2 (F k , Φ) ≤ a2i ρ2 (Fi , Φ), where Φ is the N (0, 1) df. (a) Show that ρ is a metric. (b) Use Mallows’ Lemma 2 above to show that if X1 , . . . , Xk and Z1 , . . . , Zk are i.i.d. from F and G, respectively, and if Fk , Gk are the df’s of 1 Sk = k − 2 k X i=1 [Xi − E(Xi )], 1 Tk = k − 2 k X i=1 [Zi − E(Zi )] , then ρ(Fk , Gk ) ≤ ρ(F, G). (c) Suppose X1 , . . . , Xn are i.i.d. as X ∼ F , where σ 2 = Var(X) < ∞. Show that 1 n− 2 (X̄ ∗ − X̄) converge in law in probability to N (0, σ 2 ). 260 Monte Carlo Methods Chapter 10 1 1 Hint: By (b), the distance between L∗ n 2 (X̄ ∗ − X̄) and L n 2 (X̄ − µ) is bounded by 1 ρ(F, Fbn ). We know that L n 2 (X̄ − µ) −→ N (0, σ 2 ) by the central limit theorem. Remark. Let | · | denote the Euclidean metric. Bickel and Freedman (1981) define the Mallows’ metric on the class P of probabilities P on Rd with EP |X|2 < ∞ as follows: 1 ρα (P, Q) = inf E{|X − Z|α } α , where 1 ≤ α < ∞ and the inf is over all joint distributions of (X, Z) with marginals P w and Q. They show that ρ22 (Pk , P ) −→ 0 is equivalent to “Pk −→ P and EPk |X|2 −→ 2 E R P |X| ” and they show that if d = 1 and (X, Z) has marginals (F, G), then ρ1 (F, G) = |F (t) − G(t)| dt. 9. Establish Theorem 10.3.2 for d > 1. Hint: Use the uniform SLLN and the multivariate Lindeberg-Feller theorem. See Appendix D.1. P 10. Suppose X1 , . . . , Xn are i.i.d. N (µ, σ 2 ). Set T = ni=1 (Xi − X̄)2 . Use available software to plot the exact and bootstrap distribution of T when n = 10 and σ = 1. Use B = 100 and B = 500. (B ) 11. Suppose Bn → ∞ as n → ∞. In Example 10.3.3, show that X̄−dbnαn is an asymptotic (1 − α) UCB. √ 12. In Example 10.3.4, show that the simultaneous confidence region Fb (x) ± cn / n has level (1 − α) for all distributions F . √ Hint. By example 7.1.6, n|Fb(x) − F (x)| converges in law to W 0 F (x) . Note that supx W 0 F (x) ≤ supu W 0 (u). 13. Establish (10.3.26). 14. For the model of Example 10.3.5, show that n[max(X1 , . . . , Xn )−max(X1∗ , . . . , Xn∗ )] does not converge in law in probability. → 0 and 15. In Example 10.3.5 show how an appropriate choice of m(n) with m(n) n m(n) → ∞ in the m out of n bootstrap makes λm(n) (P ) = λ(P )+0(1) and the “variance” λn (Pn ) − λn (P ) arbitrarily small. 16. Let X1 , . . . , Xn be i.i.d. as F and let Fbn denote the empirical distribution. Let Xn∗ = (X1∗ , . . . , Xn∗ ) be a bootstrap sample drawn with replacement from Fbn . (a) Suppose F has mean µ, unit variance, finite third absolute moment. Define √ Gn (x, F ) = P [ n(X̄ − µ) ≤ x] . √ Similarly, define the bootstrap counterpart G∗n (x, Fn ) = P ∗ [ n(X̄ ∗ − X̄) ≤ x], where P ∗ denotes bootstrap probability given Fbn . Prove that as n → ∞, P sup |G∗n (x, Fn ) − Gn (x, F )| → 0 = 1 . x Hint: You may use the Berry-Esséen theorem. See Section A.15. Section 10.6 261 Problems and Complements b is unbiased. 17. Show that when θb = X̄, the jackknife estimate of Var(θ) Pn 18. Show that when θb = n−1 i=1 (Xi − X̄)2 , the expected value of the jackknife bias estimate is E(BIASn(J) ) = −(n − 1)−1 σ 2 . 19. Show that for linear statistics of the form T (X) = a + n−1 n X h(Xi ) , i=1 The jackknife values T (x−i ), 1 ≤ i ≤ n, can be used to determine the value of T (x∗ ) for any bootstrap sample x∗ . Hint: Set ti = T (x−i ) and solve the equations X ti = a + n−1 hj , 1 ≤ i ≤ n , j6=i for h1 , . . . , hn . Use this to find the value of T (x∗ ) for any bootstrap sample x∗ = (x∗1 , . . . , x∗n )T . 20. Establish (10.3.12) formally. Problems for Section 10.4 1. Assume the conditions to Appendix D.5.D. Show that X1 , . . . , Xn of Section 10.4.1 will be approximately distributed as p for M → ∞ provided (a) and (b) of Section 10.4.1 hold. Hint: See Appendix D.5. 2. Show that a symmetric kernel K0 has the uniform distribution as its stationary distribution. 3. Under the conditions of Proposition 10.4.2, show that the joint distribution of (Um , Vm ) in a step of Gibbs sampler obeys detailed balance. 4. Consider the Markov chain with state space {1, 2} and kernel a 1−a K= . 1−b b Compute the upper bounds in Corollary 10.4.1. 5. In Example 10.4.2, show that the conditional distribution of σ −2 given µ and X1 , . . . , Xn is Γ(p + 12 n, λ), where λ is given by (10.4.14). 6. (a) Assume the conditions of Theorem 10.4.1. Suppose (X1 , X2 ) is drawn from the distribution with joint probability K(x1 , x2 )p(x1 ). Let λ1 , . . . , λN denote the eigenvalues of K ordered so that λ1 ≥ λ2 ≥ . . . ≥ λN . Show that min{Corr (g(X1 ), g(X2 )) : g arbitrary} = min λj = λN ≡ ρ− . j≥2 262 Monte Carlo Methods Chapter 10 (b) Deduce that max{Corr (g(X1 ), h(X2 )) : g, h arbitrary} = max{ρ+ , −ρ− } . Hint: E[g(X1 )h(X2 )] = E[g(X1 )E(h(X2 )|X1 )]. 7. Show that if the state space is finite and the transition matrix K satisfies detailed balance, then K has equal right and left eigenvalues, all being real. 8. In Example 10.4.3, show that X (k+1) ∼ N 0, ρ4 σk2 + (1 − ρ4 ) . 9. Establish (10.4.17) using Lagrange multipliers. 10. Assume the conditions of Theorem 10.4.1. Show that max corr g(X1 ), g(Xk+1 ) = (ρ+ )k . Hint. Note that by Theorem B.1.1, E g(X1 )g(Xk+1 ) = E g(X1 )E g(Xk+1 )|X1 . (10.6.1) Let K be the transition matrix of a chain with state space {1, 2, . . . , N } and let pij (n) be the (i, j) element of Kn . Then use (10.6.1) to show that the eigenvalues of Kn are the nth powers of eigenvalues of K. 11. Prove Theorem 10.4.2. Hint. Using Problem 10.4.10, Var B X b=1 B B X X X g(Xb ) = Var g(Xb ) + 2 Cov g(Xa ), g(Xb ) b=1 = B Var g(X1 ) + 2 a=1 b<a B a−1 X X a=1 b=1 a−b ρ+ ≍ B Var g(X1 ) 12. Show that the Markov Chain transition matrix K = largest eigenvalue λ2 < 0 iff (a + b) < 1. a 1−b 1 + ρ+ . 1 − ρ+ 1−a b has second Problems for Section 10.5 1. Show that the expression (10.5.11) tends to infinity at an exponential rate as n → ∞. That is, let ∆n denote the right hand side of (10.5.11). Then show that n−1 log ∆n → c for some c > 0. 2. In the transformation model (10.5.6) with f0 the N (0, σ02 ) density, set µi = θT zi , −2 Pn 2 2 σ0 = Var(ǫ). Assume that max |µi |/σ0 → 0 as n → ∞ and that σ0 i=1 µi ≤ M for some fixed M > 0. Show that the variance (10.5.11) is O(1). Section 10.7 Notes 263 3. Show that the model (10.5.6) with f0 a Gumbel density is equivalent to the Cox proportional hazard model. Hint. See Problem 1.1.13. 4. Implement the likelihood sampler (10.5.12) of Example 10.5.3 on the computer for the transformation model (10.5.6) when f0 is the N (0, 1) density. Describe the algorithm. Generate Y1 , . . . , Y50 from the model where a(t) = log t, E(a(Y )) = 1 + Zθ, θ = 1, Z ∼ U (0, 1). Compute L1 (θ) and L2 (θ) for θ = .5 + δ, δ = 0, 0.1, 0.2, . . . , 1. Plot L1 (θ) and L2 (θ) at these points. 5. Use the EM-MCMC approach of Example 10.5.2 to estimate θ in the transformation model of Example 10.5.3. Describe an algorithm. 6. Suppose that Y and ξ are independent continuous random variables, and that (10.5.7) holds. Show that if ξ ∼ G, then the model for Y can be written as (10.5.6) for some F0 . Hint. Λ(y|z, ξ) = ξr(θ, z)Λ(y) where Λ(y|z, ξ) = − log[1 − P (Y ≤ y|z, ξ)] and Λ(y) = − log[1 − H(y)] for some continuous df H. Solve for P (Y ≤ y|z, ξ). 10.7 Notes Note for Section 10.1 (1) We speak of generating independent observations by machine. This is not possible since computers are deterministic. But many methods have been devised for generating “pseudo random” numbers (U(0, 1) observations) which behave as if they were random for most usual purposes. For more on this topic see, for instance, Ripley (1987). Note for Section 10.3 (1) Efron’s presentation in Efron (1979) is much less abstract and focussed on the estimation of bias and variance, while this presentation follows more that of Bickel and Freedman (1981). But the basic idea is in Efron’s paper. See also Efron and Tibshirani (1993) and Bickel, Götze and van Zwet (1997). Note for Section 10.4 Another generalization of rejective sampling which uses (Y1 , Y2 , . . .) to generate X1 , X2 , . . . , XB , which, though not necessarily independent, does have distribution p, called perfect sampling, is also coming into use. Chapter 11 NONPARAMETRIC INFERENCE FOR FUNCTIONS OF ONE VARIABLE 11.1 Introduction We saw in Chapter 9 that to construct estimates of low dimensional Euclidean parameters in semiparametric models we need to estimate infinite dimensional parameters such as curves. Examples are densities and derivatives of densities on the one hand and conditional expectations on the other. These quantities, which will be considered in this chapter and the next, are not defined for discrete distributions and, so, as we saw in Section I.5 we need to consider regularized approximations to these parameters before we can construct empirical plug-in estimates. This is not a consequence of our needing to estimate functions but rather of the irregular nature of these parameters. For instance, we can estimate distribution functions using plug-in estimates without regularizing. To indicate the usefulness of the estimates in this chapter, we cite Example 9.3.1 where efficiency (and even consistency for arbitrary underlying distributions) requires us to estimate the derivatives of the logs of the densities. Similarly, in Example 9.1.10 we need estimates of the regression functions u → E(Y |U = u) and u → E(Z|U = u). A very important class of situations where we are led to function estimation is nonparametric classification or, more generally, prediction. This class of problems is also loosely categorized under the topic “ machine learning.” Such topics, including the estimation of the nonparametric regression µ(z) = E(Y |Z = z) and some of its semiparametric versions when z is d-dimensional, will be covered in Chapter 12. We introduce and analyze in this chapter nonparametric estimates of a continuous case density f (x), x ∈ R, and estimates of the nonparametric regression curve µ(x) = E(Y |X = x) for a continuous bivariate pair (X, Y ) ∈ R2 . Key methods and asymptotic expressions are introduced. We assume all distributions are continuous since discrete cases are “easy.” 265 266 11.2 nonparametric inference for functions of one variable Chapter 11 Convolution Kernel Estimates on R There are many uses for density estimates in visualization and, more generally, exploratory data analysis. For instance suppose we are contemplating using a gamma distribution to model the distribution of the failure times of a new type of computer chip. Comparing a nonparametric density estimate visually with a fitted gamma density is one way of exploring possible models. For more details, see Silverman (1986), Scott (1992), Wand and Jones (1995), and Loader (1999). The methods of this chapter can also be used to estimate the hazard rate discussed in Section 9.1. We turn to the simplest methods of density estimation for real valued random variables and more generally vector valued variables: histogram and kernel density estimation. Consider first the problem we discussed in Example 7.1.5, estimation of the density (in the continuous sense) f of a real valued random variable X on the basis of X1 , . . . , Xn , which are i.i.d. as X. The models P we shall consider are nonparametric, in the sense that any probability distribution may be approximated weakly by members of P, but members of P have densities which range over a set F which is restricted in some way. For instance, we may consider Fr = {f : kf (r) k∞ < ∞} for r ≥ 1. More generally we can consider Sobolev spaces, Z F = {f : |f (j) (t)|dt < ∞, 0 ≤ j ≤ r} , and, more generally still, spaces which permit discontinuities (Besov spaces; see Daubechies (1993) for instance). When f is not continuous it is not unambiguously defined. For simplicity we shall assume in this section that f is continuous and positive on intervals of the form (−∞, ∞), (−∞, b], [a, ∞) or a closed interval [a, b] with f (x) = 0 for x 6∈ [a, b]. This interval S(f ) is called the support of f and the upper and lower bounds are called boundary points. It is relatively difficult to find good estimates of f (x) when x is at or near one of the boundary points. Within this context we consider local problems such as estimating f (x) at points x and global ones such as estimating f (·) as a function. We will occasionally use the notation f (P ; x) for f (x) and f (P ; ·) for the function to remind ourselves that these are parameters defined on P. Recall that the histogram estimator of Section I.5 is obtained by approximating f (P ; x) by h−1 P (I(x)) where for integers j, Ij = (jh, (j + 1)h], −∞ < j < ∞, and I(x) = Ij(x) is the unique interval containing x; and then plugging in the empirical probability Pb for P . Now fh (Pb ; ·) is a discontinuous function and intuitively should be improvable if we assume f is smooth. To demonstrate this we introduce a famous easily analyzed family of estimates called the kernel estimates. These can be approached by taking histogram estimates and recentering them at each x. Specifically, approximate f (x) by the continuous function fh (x) ≡ fh (P ; x) ≡ 1 P (x − h, x + h] 2h and plug-in Pb to get the density estimate fbh (x) = fh (Pb ; x). (11.2.1) Section 11.2 Convolution Kernel Estimates on R 267 Let F denote the distribution function of X and set Jh (u) = h−1 J(u/h) with J(u) = ≤ u ≤ 1]. Then it is easy to see that, if S(f ) = R (Problem 11.2.1), Z Z b b fh (P ; x) = Jh (x − z)dF (z), fh (x) = fh (P ; x) = Jh (x − z)dFb(z). 1 2 1[−1 That is, fh (·) is the convolution of a uniform density Jh (·) on [−h, h] with f (·) and fbh (x) is the empirical plug-in estimate of fh (x). Convoluting f (·) with a density K gives a function with at least the smoothness of K, and the resulting estimate can be expected R to be better if f itself is as smooth as K. More generally, for any function K(·) with K(u)du = 1, we consider regularizations of f of the form Z fh (x) = fh (F ; x) = Kh (x − z)dF (z) = EKh (x − Z) (11.2.2) where Z ∼ F , Kh (u) = h−1 K(u/h) and we identify P with the distribution function F . If K(·) is a density and S(f ) = R, then (11.2.2) is the density of X + hV where V ∼ K(·) and V is independent of X. See Problem 11.2.2. The plug-in estimate of f is then fbh (x) = fh (Fb ; x) = n−1 n X i=1 Kh (x − Xi ) (11.2.3) where Fb is the empirical dfR based on X1 , . . . , Xn i.i.d. from F . A function K(u) with K(u)du = 1 is called a kernel function. Thus all densities are kernels, but kernels need not be non-negative. We shall sometimes refer to the estimates fh (Fb , x) with fh (F ; x) of the form (11.2.2) as convolution kernel estimates to distinguish them from generalized kernel estimates of the form Z f (Pb , x) = K(x, z)dPb (z) . (11.2.4) The condition Z K(x, z)dx = 1 and K(x, z) ≥ 0 for all z is required for f (Pb, ·) to be a density. In addition to the uniform kernel J(u), common kernels are the N (0, 1) density ϕ(u), the Epanechnikov kernel, KE (u) = 3 (1 − u2 )1[−1 ≤ u ≤ 1], 4 and the quartic kernel, KQ (u) = 15(1 − u2 )2 1[−1 ≤ u ≤ 1]/16 . 268 nonparametric inference for functions of one variable Chapter 11 This can be put in the general regularization framework of Chapter 9 by considering the sequence of parameters fh (P ; x) which tend to f (x) as h ↓ 0 but which for fixed h are estimated by the “plug-in” method where P is replaced by the empirical probability Pb. Let the support of K be the closure of the set {u : K(u) 6= 0}, then Lemma 11.2.1. Let fh (·) be as defined in (11.2.2). If either of the following holds: (a) f is continuous at x, x is in interior of S(f ), and K has compact support, (b) f is continuous at x and bounded on R, then, as h → 0, fh (x) → f (x). Proof. Note that R K(u)du = 1 permits us to write Z 1 x−z fh (x) − f (x) = [f (z) − f (x)]dz K h h Z = K(u)[f (x − hu) − f (x)]du (11.2.5) by changing variable to u = (x − z)/h. Since, for all M , sup{|f (x − hu) − f (x)| : |u| ≤ M } → 0 as h → 0 and because, under (a), K(u) = 0, |u| > M , for some M > 0, our claim follows for (a). For (b), see Problem 11.2.3. ✷ We proceed to study the plug-in estimate fbh (x) given by fbh (x) ≡ Z n 1 X K Kh (x − z)dFb (z) = nh i=1 x − Xi h . (11.2.6) Lemma 11.2.2. If K is bounded and has compact support, then if h = hn → 0 and nhn → ∞ as n → ∞, P fbh (x) → f (x) at all x which are points of continuity of f . Proof. We can show that E fbh (x) → f (x) and Varfbh (x) → 0 as n → ∞. Consistency follows. See Problem 11.2.4. Section 11.2 11.2.1 Convolution Kernel Estimates on R 269 Uniform Local Behavior of Kernel Density Estimates We analyze the bias and variance of fbh (x) at a point x under the assumption of a bounded third derivative f ′′′ . This strong assumption gives uniformity of our convergence statements in f ∈ F0 ⊂ F and in x restricted to some finite interval [c, d] contained in the support of f . Without such uniformity our asymptotic analyses comparing the performance of estimates are suspect since depending on f , the n for which an approximation to precision ε is valid might be arbitrarily great. Let mj (K) = Z uj K(u)du, (11.2.7) and for some constant M ∈ (0, ∞), let F3 = {f : S(f ) = [a, b], f three times differentiable on (a, b), kf ′′′ k∞ ≤ M } . Proposition 11.2.1. Let K be a kernel with m1 (K) = 0, mj (K) finite, 1 ≤ j ≤ 3. Then, as h → 0, uniformly in x ∈ [c, d] ⊂ (a, b), f ∈ F3 , and n, 1 E fbh (x) = fh (x) = f (x) + m2 (K)f ′′ (x)h2 + O(h3 ). 2 (11.2.8) Proof. Evidently, E fbh (x) = fh (x). For x ∈ [c, d], Taylor expand f (x − hu) about h = 0, to get, if |u| ≤ M1 for some constant M1 ε(0, ∞), 1 1 f (x − hu) = f (x) − huf ′ (x) + h2 u2 f ′′ (x) − h3 u3 f ′′′ (x∗ ) 2 6 for |x − x∗ | ≤ |hu| ≤ hM1 . Next substitute this expression into (11.2.5) to obtain the result. ✷ Note that mj (K) is finite for all j if K has compact support, and that m1 (K) = 0 if K is symmetric about 0. Also note that the bias does not depend on n, but the result holds if, in particular, we let h = hn → 0 as n → ∞. As with histogram estimates, the bias tends to 0 as h → 0, but we now have a faster rate and uniformity of convergence of the bias based on the assumption of a bounded third derivative. If we assume more derivatives we can control the bias further if we do not restrict to nonnegative K. See Problem 11.2.5. Next we turn to the variance of fbh . Let Z νj (K) = uj K 2 (u)du, ν(K) = ν0 (K), (11.2.9) and for M > 0, let F1 = {f : S(f ) = [a, b], f is differentiable on(a, b) and kf ′ k∞ ≤ M }. 270 nonparametric inference for functions of one variable Chapter 11 Proposition 11.2.2. Assume that νj (K) < ∞, j = 0, 1. Then, as n → ∞, uniformly for x ∈ [c, d] ⊂ (a, b), f ∈ F1 , Var fbh (x) = (nh)−1 ν(K)f (x) + O(n−1 ). (11.2.10) Proof. By (11.2.6), x−X −2 b Var fh (x) = (nh) n Var K h Z x−z −1 −2 2 2 2 = n h K f (z)dz − h fh (x) . h Change variable to u = (x − z)/h and rewrite the first term as (nh)−1 Z K 2 (u)f (x − hu)du = (nh)−1 f (x)ν(K) (11.2.11) Z + (nh)−1 K 2 (u)(f (x − hu) − f (x))du . Argue as in Proposition 11.2.1 to complete the proof (Problem 11.2.6). ✷ As with histogram estimates, Varfbh (x) is proportional to f (x) and is small if the density is small. Further, the variance is proportional to (nh)−1 and tends to 0 as n → ∞ for fixed h and to ∞ as h → 0 for fixed n. Since, by Proposition 1.3.1, MSE = VAR + (BIAS)2 , we have established Corollary 11.2.1. Under the conditions of Propositions 11.2.1 and 11.2.2, uniformly for x ∈ [c, d] ⊂ (a, b), f ∈ F3 , as n → ∞ and h → 0 1 MSE[fbh (x)] = (nh)−1 ν(K)f (x) + h4 m22 (K)[f ′′ (x)]2 + O n−1 + h5 ). (11.2.12) 4 The bias-variance tradeoff is clear: Small h leads to small bias and large variance while large h has the opposite effect. By choosing h = hn such that hn → 0 and nhn → ∞, we have consistency of fbh (x). The sum of the first two terms is sometimes called the asymptotic MSE[fbh (x)] (AMSE[fbh (x)]). Set A(h) = AMSE[fbh (x)]. Then by solving A′ (h) = 0 for h we find that when f (x) > 0 and 0 < [f ′′ (x)]2 < ∞, the minimizer of 1 A(h) is of the form hopt = cn− 5 for some c(f ) > 0 (Problem 11.2.10). Substituting hopt in (11.2.12) gives 4 inf MSE[fbh (x)] ≍ n− 5 , h a rate that is slower than the rate n−1 for estimation of regular parameters. Section 11.2 11.2.2 Convolution Kernel Estimates on R 271 Global Behavior of Convolution Kernel Estimates We can measure the distance, ρ(f, g), between two curves f and g in many ways, e.g. R R 1 1 kf − gk2 ≡ ( (f − g)2 ) 2 , kf − gkp ≡ ( |f − g|p ) p , p ≥ 1, and kf − gk∞ ≡ sup |f (x) − g(x)|. For most types of density estimates, all of these, other than kf − gk∞ , behave in the same way qualitatively in terms of rate of convergence R ∞ to 0. We focus on kf − gk22 . Then the loss, using fb, is the Integrated Squared Error −∞ (fb − f )2 (x)dx and the risk is the Integrated MSE, Z ∞ E(fbh (x) − f (x))2 dx (11.2.13) IMSE(fb, f ) = −∞ Z ∞ Z ∞ = [E fbh (x) − f (x)]2 dx + Varfbh (x)dx. −∞ −∞ For IMSE we invoke slightly stronger conditions than for MSE. For some constant M > 0, let Z b Z b I ′′ 2 F3 = f 3 ∈ F3 : [f ] (x)dx < ∞, |f ′′′ |2 (x)dx ≤ M . a a Theorem 11.2.1. Suppose the conditions of Corollary 11.2.1 hold save that F3 is replaced by F3I . Then, uniformly for f ∈ F3I , Z ∞ h4 2 −1 b [f ′′ ]2 (x)dx + O(n−1 + h5 ). (11.2.14) IMSE(fh , f ) = (nh) ν(K) + m2 (K) 4 −∞ Proof. We use Laplace’s form of the remainder in the Taylor expansion of f (x − hu): Z (hu)3 1 (hu)2 ′′ ′ f (x) − (1 − λ)2 f ′′′ (x − λhu)dλ. f (x − hu) = f (x) − huf (x) + 2 2 0 R Then, by (11.2.5) and uK(u) = 0, E fbh (x) − f (x) = 1 2 h m2 (K)f ′′ (x) (11.2.15) 2 Z Z h3 ∞ 1 3 − u K(u)(1 − λ)2 f ′′′ (x − λhu)dλdu. 2 −∞ 0 The square of the integral term in (11.2.15) is bounded by Z 1 nZ ∞ o2 2 3 u K(u) (1 − λ)2 f ′′′ (x − λhu) dλdu −∞ 0 1 which, by the inequality EU ≤ E 2 (U 2 ) applied to U = |f ′′′ (x − Λhu)|, with Λ having density 3(1 − λ)2 on (0, 1), is bounded by Z hZ 1 i 21 o2 1n ∞ 3 |u K(u)| 3(1 − λ)2 [f ′′′ (x − λhu)]2 dλ du (11.2.16) 3 −∞ 0 272 nonparametric inference for functions of one variable which, because 1 3 Z ∞ −∞ R1 0 2 (1 − λ)2 f ′′′ (x − λhu) dλ ≤ |u3 K(u)|du 2 Z ∞ −∞ R∞ −∞ Chapter 11 2 f ′′′ (t) dt, in turn is bounded by |f ′′′ (x)|2 dx . (11.2.17) R 2 It follows from (11.2.15) that the squared bias is 41 h4 m2 f ′′ (x) dx + O(h5 ). Next, using (11.2.11) (Problem 11.2.11), Z ∞ Z ∞ Varfbh (x)dx = (nh)−1 K 2 (u)du + O(n−1 ) (11.2.18) −∞ −∞ and the result follows. ✷ Note that the bias-variance tradeoff for IMSE is the same as for MSE and that the first two terms of (11.2.14) are of the form A(h) = c1 (nh)−1 + c2 h4 . To minimize A(h), we set the derivative equal to zero and find (Problem 11.2.12) 1 1 hopt = {ν(K)/m2 (K)ν(f ′′ )} 5 n− 5 R provided 0 < ν(f ′′ ) = [f ′′ (x)]2 dx < ∞. Substituting this in (11.2.14) gives inf IMSE[fb(·)] = h>0 1 5 4 4 {m2 (K)ν(K)ν(f ′′ )} 5 n− 5 + o(n− 5 ). 4 (11.2.19) (11.2.20) That is, as with MSE, assuming a well behaved f ′′′ , the order of uniform convergence over 4 the family F3 is n− 5 . 11.2.3 Performance and Bandwidth Choice As we have noted, an immediate observation from the preceding results is that the order of convergence of MSE and IMSE is strictly worse than the n−1 we have seen in the case of regular parameters. This is not a feature of a local vs. global focus nor of curve vs. Euclidean parameter estimation since the distribution function is a regular parameter. In fact, performance depends quite specifically on the type of f we consider. We have already seen (Section I.5) that with histogram estimates if we only assume 0 < |f ′ (z)| ≤ M < ∞ for z in a neighborhood of x the fastest pointwise convergence locally is n−2/3 and this continues to be true locally with kernel estimates. Moreover the optimal bandwidth is of the form cn−1/3 . On the other hand, if we specify greater smoothness for f , for instance, 0 < |f (2k+1) (z)| ≤ M < ∞, k > 2, for z in a neighborhood of x we can use K which are sometimes negative to achieve the rate n−2k/(2k+1) using bandwidth cn−1/(2k+1) . See Problem 11.2.9. The major difficulty that has to be faced is the choice of the bandwidth hn . The trouble is that even if we are willing to accept the order of magnitude n−1/5 (see (11.2.19)) dictated by the amount of smoothness for f that we are willing to assume, the optimal c in 1 hopt = cn− 5 for both MSE and IMSE depends on f ′′ . For instance, from (11.2.19), hopt Section 11.2 Convolution Kernel Estimates on R 273 R in this case depends on [f ′′ (x)]2 dx which, unfortunately, is harder to estimate than f and requires more smoothness assumptions. We are caught in a no win situation. If we believe that more than two derivatives of f are available we should use bandwidth of larger order than n−1/5 and obtain a final rate better than n−4/5 , but that requires the estimation of R (4) 2 f (x) dx. However, if f (4) (·) exists, we should use an even larger order bandwidth, and so on. There are a number of ways out. The one we consider soundest is cross validation which does not require a smoothness assumption. We discuss this method further in Section 12.5. A simple bandwidth choice if we are willing to assume a bounded f ′′ is to employ the hopt appropriate for a reference distribution (Bickel and Doksum (1977)), such as the Gaussian distribution and replacing σ by the sample standard deviation s. For F = √ 1 N (µ, σ 2 ), we find ν(f ′′ ) = 3/(8 πσ 5 ) and hopt is c(K)n− 5 σ with c(K) equal to 1.95, 2.34, and 1.06 for the U[−1, 1], Epanechnikov, and N (0, 1) kernels, respectively. With this choice fbh (x) will achieve the optimal rate of convergence for f ∈ F3 provided 0 < f ′′ (x) < c and 0 < σ 2 < ∞. 11.2.4 Discussion of Convolution Kernel Estimates Perhaps the strongest point in favor of convolution kernel estimates is that they are easy to analyze theoretically and that they can easily be extended to the multivariate case. However, the convolution kernel estimate (11.2.6) has a weakness if f has compact support [a, b] because for x near the boundary points a and b, (11.2.6) with a symmetric kernel will lead to biases by trying to account for impossible observations. A skew kernel which reduces the bias near a will worsen the bias near b, and vice versa. A possible solution is to permit the shape of the kernel to vary with x, that is, use a generalized kernel K(x, z) as introduced in (11.2.4). Put another way, in replacing the density f of X with that of X + hV , we will not limit ourselves to X, V independent. A simple method leading to such solutions is based on using a local parametric approximation to f (x) and using the plug-in method to estimate the parameters in this approximation. We consider this approach in Section 11.3.1. Other methods such as penalization can also lead to estimates insensitive to the support. Summary. We consider estimation of a continuous case density f (·) and because this is an irregular parameter where empirical plug-in estimates are not directly available, we first approximate f (·) by a regular parameter fh (·), h > 0, that converges to f (·) as h ↓ 0. This regular parameter is theR convolution of f (·) with a function Kh (u) = h−1 K(u/h) where the kernel K satisfies K(t)dt = 1. The constant h is called the bandwidth and is an example of a tuning parameter. By applying the empirical plug-in principle to fh (·) we obtain the convolution kernel estimate fbh (·). We develop uniform asymptotic properties of the bias, variance, and mean squared error (MSE) of fbh (x) first for fixed x and next for the mean integrated squared error (MISE). We find that the MSE and IMSE tend to zero and fbh (·) is consistent provided nh → ∞ as n → ∞. Assuming a bounded third derivative for MSE and a bounded integrated absolute third derivative for IMSE, we find that the fastest 4 possible rate for uniform convergence of MSE and IMSE to zero is n− 5 , which is obtained 274 nonparametric inference for functions of one variable Chapter 11 1 when h = O(n− 5 ). In this section we discuss a simple approach which asks for optimality at a reference density. We postpone an effective data based method for selecting h called cross validation until Section 12.5. 11.3 Minimum Contrast Estimates: Reducing Boundary Bias In this section we extend the minimum contrast approach of Volume I to density estimation. This extended contrast approach also applies to other curves and to a variety of classes of regular approximating functions to these curves. Here we show that this approach produces density estimates with desirable properties. Let F be a nonparametric class of densities (e.g. densities that satisfy the conditions of Corollary 11.2.1), and let Fs denote a class of regular approximating functions fs (·, β), β ∈ Rp , which contains the constant functions. Here s ∈ S is a tuning constant such as h whose choice was discussed above and we will return to later. We introduce a discrepancy Dt,x (f (·), fs (· ; β)) > 0 between f (x) and fs (x, β) where t ∈ T is another tuning constant. Typically, either S or T is empty. For fixed s and t, we assume that there is a unique minimizer βs,t (F, x) of D, and we assume that for each x in the interior (a, b) of the support S(f ) of f inf{Dt,x (f (·), fs (· ; βs,t (F, x))) : s ∈ S, t ∈ T } = 0. We also assume that βs,t (F, x) depends on F only, that is, not on f . Our estimate of β is b Note that we will get the same solution βb = βs,t (Fb, x) and the estimate of f (x) is fs (x, β). if we use the global discrepancy Dt (f (·), fs (· ; β)) = EF Dt,X (f (·), fs (· ; β)) because local minimization implies global minimization, cf. Proposition 3.2.1. Example 11.3.1. Linear kernel estimates as minimum contrast estimates. We first consider x fixed and assuming f is continuous we want an approximation f (z; β) that is close to f (z) for z in a neighborhood of x, where β = β(x) depends on x. One way is to select a kernel K ≥ 0 and to find β(x) that minimizes the local discrepancy Z 1 Dh,x (f (·), f (· ; β)) = {[f (z) − f (z; β)]2 Kh (z − x)}dz (11.3.1) W (x) S(f ) = EQh {[f (Z) − f (Z; β(X))]2 |X = x} where (X, Z) has joint df Qh with density qh (x, z) = f (x)qh (z|x), for x, z ∈ S(f ), and qh (z|x) = Kh (z − x)1[a ≤ z ≤ b] W (x) Section 11.3 Minimum Contrast Estimates: Reducing Boundary Bias 275 with W (x) = Z a b Kh (z − x)dz = Z (b−x)/h K(u)du. (a−x)/h That is, qh (z|x) is the conditional density of Z = X + hV given that X = x and that X +hV is in [a, b], where V ∼ K. Note that the global Dh measures how well f (Z; β(X)) predicts f (Z). By equating the derivative of Dh,x with respect to β to zero, we formally obtain a minimizer βh (F, x) and then the estimators βbh (x) = βh (Fb, x) and fbh (x) = f (x; βbh (x)). We can use the empirical plug-in principle because minimizing Dh,x is equivalent to minimizing Z Z f 2 (z; β)Kh (z − x)dz . f (z; β)Kh (z − x)dF (z) + R(F ; β) ≡ −2 S(f ) S(f ) We call R(Fb ; β) a contrast function and fbh (x) a minimum contrast estimate, cf. Section 2.1.1. As a simple illustration, take {f (z; β), β ∈ Rp } to be the class of constants {β, β ∈ R}. In this case we find, provided EF Kh (X − x) > 0 and x ± h ∈ S(f ), that the unique minimizer of D is βh (F, x) = EF Kh (X − x), x ± h ∈ S(f ) . When K is symmetric, this coincides with the convolution kernel approach (11.2.2) when x ± h ∈ S(f ), but when x ± h is not in S(f ), that is, x is in the boundary regions of S(f ), then the minimum contrast estimate is consistent (Problem 11.3.2) and the convolution estimate is not (Problem 11.3.1). A key feature of the minimum discrepancy approach is that it automatically leads to boundary corrections. We next elaborate on this remark. ✷ Boundary points Suppose S(f ) = [a, b] with a and b finite and 2h < b−a. We say that x is in the interior region of S(f ) if x − h and x + h both are in [a, b]. In the asymptotic theory of Section 11.2, we could argue that for any x in (a, b) we can choose h = hn small enough so that this condition holds. However, for finite n as well as when n → ∞ the kernel estimator (11.2.6) may be severely biased for x in the boundary regions [a, a + h) and (b − h, b]. In fact for xh of the form a + λh or b − λh, 0 < λ < 1, the kernel estimator (11.2.6) is asymptotically biased when K has compact support (Problem 11.3.1). However, if we use the minimizer of (11.3.1) with f (z; β) a constant, and we assume that K is a density with support [−1, 1], then this asymptotic bias of the plug-in estimate disappears. We can write the unique solution to dD/dβ = 0 for x ∈ (a, b) as (Problem 11.3.3) Z b Z (b−x)/h fh (x) = EQ [f (Z)|X = x] = Kh (z − x)dF (z)/ K(u)du, (11.3.2) a (a−x)/h 276 nonparametric inference for functions of one variable Chapter 11 which leads to a generalized kernel estimate of the form (11.2.4). For symmetric K, it corresponds to the density of a random variable of the form X + hV , V ∼ K, where V depends on X, i.e., we no longer have a convolution as in (11.2.2). Let fbh (x) be the plug-in estimate using (11.3.2). It can be shown (Problem 11.3.2) that for xh in the boundary region, Bias fbh (xh ) = O(h). The rate O(h) at which the bias tends to zero is slower than the rate O(h2 ) obtained for interior points. A remedy is to use a locally linear approximation. Example 11.3.1 continued: Locally linear and polynomial estimates. Suppose S(f ) = [a, b] with a and b finite. Let K(u) be a nonnegative symmetric kernel with support [−1, 1]. We find α = α(x, F ) and β = β(x, F ) that minimize the local discrepancy Z a b {f (z) − [α + β(z − x)]}2 Kh (z − x)dz (11.3.3) from the locally linear fit g(z −x) = α+β(z −x) to f (z). The locally linear approximation to f (x) is fh (x) = g(0) = α(x, F ) + β(x, F )(x − x) = α(x, F ), Setting the derivative of the quadratic in (11.3.3) with respect to α and β equal to zero and simplifying, α(x, F ) and β(x, F ) will, for x ∈ (a, b), be the solutions to the normal equations Z a b Z b a Kh (z − x)dF (z) = αm0 (x, h) + βhm1 (x, h) (11.3.4) (z − x)Kh (z − x)dF (z) = αhm1 (x, h) + βh2 m2 (x, h) where mj (x, h) ≡ Z (b−x)/h uj K(u)du, j = 0, 1, 2. (a−x)/h By the argument above fh (x, Fb ) ≡ α(x, Fb) is an estimate of f (x). If a + h ≤ x ≤ b − h, or, equivalently, a − x ≤ −h, b − x ≥ h, then m1 (x, h) = 0 by the symmetry of K and K vanishing outside [−1, 1]. Thus, on this range, Z fh (x, Fb ) = Kh (x − z)dFb (z), (11.3.5) the usual convolution kernel estimate. However, for a ≤ x < a + h and b − h < x ≤ b we obtain, solving (11.3.4) (see Problem 11.3.5), Z b fh (x, F ) = Kh (x, z)dFb(z) (11.3.6) Section 11.3 Minimum Contrast Estimates: Reducing Boundary Bias 277 where Kh (x, z) = m2 (x, h) − m1 (x, h)(z − x)/h Kh (z − x). m0 (x, h)m2 (x, h) − m21 (x, h) (11.3.7) Note that Kh (x, z) is not a convolution kernel, and is sometimes negative. For M > 0, let f ∈ F2 ≡ {f on [a, b] : |f ′′ (x)| ≤ M for all xǫ[a, b]}, where f ′′ is one sided at a and b. Proposition 11.3.1. Suppose that the conditions of Propositions 11.2.1 and 11.2.2 hold. Then, uniformly for f ∈ F2 and x ∈ [c, d] ⊂ (a, b), Efh (x, Fb ) = f (x) + O(h2 ), as h → 0 Varfh (x, Fb ) = O((nh)−1 ), as n → ∞ . (11.3.8) (11.3.9) The proof of these assertions and the asymptotic MSE of fh (x, Fb ) are left to Problem 11.3.5. Note also that since β(x) = f ′ (x), β(x, Fb ) gives us an estimate of f ′ (x) which is given in Problem 11.3.5 as well. We give extensions to estimates based on local polynomial approximations in Problem 11.3.6. Remark 11.3.1 The conditional expectation version of (11.3.1), and Theorem 1.4.3 with (f (Z), Z − X) in place of (Y, Z) and with L(f (Z), Z − X|X = x) in place of L(Y, Z), give, α(x, F ) = E[f (Z)|X = x] − β(x, F )E(Z − X|X = x), Cov(f (Z), (Z − X)|X = x) β(x, F ) = . Var(Z − X|X = x) That is, α(x; F ) is the value of the optimal MSPE linear predictor of f (Z) at z = x when L(Z|X = x) is qh (z|x). This representation gives a different derivation of fh (x; Fb ) (Problem 11.3.7) and shows the connection to prediction. Remark 11.3.2. The locally linear estimate of Example 11.3.1 can be computed as follows: Let {x(1) , . . . , x(g) } denote a set of grid points in [b a, bb] where b a = x(1) and bb = x(n) . (k) For a given x determine whether it is in the left or right boundary region or the interior region of [b a, bb]. Thus if x(k) is in the left boundary region, x(k) = b a + λh, 0 < λ < 1, determine λ(k) = (x(k) − b a)/h, and use Kh (x, z) with mj (x, h) = Z λ(k) uj K(u)du . −1 A point x(k) in the central part uses the kernel K(u), and the right boundary point x(k) uses R1 Kh (x, z) with mj (x, h) = −γ (k) uj K(u)du and γ (k) = (bb − x(k) )/h. This procedure 278 nonparametric inference for functions of one variable Chapter 11 yields {(x(k) , fb(x(k) )); k = 1, . . . , g}. These points are then connected using a smoothing routine and the final output is a smooth curve. Example 11.3.2. Series expansions. Suppose f is in L2 (a, b), for −∞ ≤ a ≤ b ≤ ∞. Rb Thus, a f 2 (x)dx < ∞. Let {Bj (·) : j ≥ 1} be a complete orthonormal basis of L2 , Z b Bj (x)Bk (x)dx = 1(j = k) , a and let βj (F ) = Z ∞ f (x)Bj (x) dx . −∞ Then, f (x) = ∞ X βj (F )Bj (x) (11.3.10) j=1 Rb 2 Pm in the sense that a f (x) − j=1 βj (F )Bj (x) dx → 0 as m → ∞ (L2 convergence). For example, if a = −π, b = π, we can consider the Fourier basis, 1 Bk (x) = π − 2 cos kx, = − 21 (2π) = π − 12 , sin kx, k>0 k=0 k<0 Rπ since −π cos2 kxdx = π. Another example is the Legendre polynomials. If a = −∞, b = ∞, we can consider the Hermite functions defined by Bk (x) = ϕ(x)Hk (x) = (−1)k dk ϕ(x) . ck dxk Here ϕ is the N (0, 1) density, Hk are the Hermite polynomials, and ck makes Z Hk2 (x)ϕ2 (x)dx = 1 (see also Section 5.3). Other important bases are the wavelet bases — see Donoho and Johnstone (1995) for instance. For all these cases, we can estimate βk (F ) by empirical plug-in, n 1X βbk = βk (Fb ) = Bk (Xi ) . n i=1 (11.3.11) Unfortunately, while, by Plancharel’s theorem, Z 2 f (x)dx = ∞ X k=1 βk2 (F ) < ∞ , Section 11.3 Minimum Contrast Estimates: Reducing Boundary Bias 279 βk (Fb ) 6→ 0 as k → ∞ in general. Thus, the direct plug-in approach for f (x) as in (11.3.10) is impossible. On the other hand, minimizing the discrepancy D(f (·), fm (·, β)) = Z [f (x) − m X βj Bj (x)]2 dx j=0 yields βk (Fb) as in (11.3.11) and the preceding plug-in estimates βb1 , βb2 , . . ., as well as the density estimate where fbm (x) = m X j=0 Km (x, z) = βbj Bj (x) = m X Z Km (x, z)dFb (z) Bj (x)Bj (z). j=0 R It may be shown that if [a, b] is compact and B0 is constant, then Km (x, z) dx = 1 (Problem 11.3.8). Analysis of such estimates is not as simple as that of convolution kernels and they are necessarily sometimes negative. We refer to Silverman (1986) and Viollaz (1989) for further discussion. This idea becomes much more important and attractive in connection with regression. Note that, as with convolution kernel estimates, a crucial “tuning parameter,” m in this instance, needs to be chosen. ✷ Kernel estimates, despite their formal simplicity, have been computationally unattractive since a separate summation has to be carried out for each x in a grid of x-values. However, advances in computer technology and the availability of software (Loader, 1999, Chapter 5) have reduced this disadvantage. Orthogonal series estimates do not share this problem since the Bj (·) can be stored once and for all and m << n but, as we have indicated, are hard to analyze and can be negative. There are many nonlinear (nongeneralized kernel) methods of density estimation which are competitive with kernel density estimates. We consider these in a general discussion of regularization in the next section. Summary. We consider a systematic approach to constructing regular approximations to f (·) which consists of minimizing a discrepancy between f (·) and a parametric approximation f (·; β), β ∈ Rp , to f (·). The minimizer will be regular if a local quadratic discrepancy is used. The empirical version of the minimizer is called a minimum contrast estimate. We illustrate this approach using f (·; β) ≡ β, β ∈ R, and linear approximations f (·; β) to f (·) over local neighborhoods of x. We show that this approach yields estimates of f with reduced bias for boundary points xh of the form a + λh and b − λh where 0 < λ < 1 and a and b are the left and right boundaries of the support S(f ) of f . In particular, by plugging the empirical distribution into the linear minimizer of a local quadratic discrepancy based on bandwidth h, we obtain an estimate with bias of order O(h2 ) for all x ∈ (a, b) including boundary points. The variance of this estimate is of order O (nh)−1 . We also 280 nonparametric inference for functions of one variable Chapter 11 briefly consider estimates based on approximating f (·) by a truncated series expansion and show how minimizing a quadratic discrepancy leads to a natural series expansion estimate of f (·). 11.4 11.4.1 Regularization and Nonlinear Density Estimates Regularization and Roughness Penalties What we call regularization is related in spirit to the notion of regularization introduced by Tikhonov (1963) in applied mathematics. His problem was, essentially, given an integral equation of the form Kf = g with g ∈ K(F0 ) and F0 a “nice” set of functions (not necessarily densities), to find numerical approximations fN ∈ F0 to f which converged to f as N → ∞. F0 being “nice” corresponded to smoothness, for instance f twice R1 differentiable on (0, 1) with J(f ) ≡ 0 [f ′′ (x)]2 dx < ∞. Simply discretizing K to a matrix KN and gN to a vector g, defining fN = KN gN , and extrapolating fN to fN ∈ F0 would not work because the matrix KN was too ill conditioned. Tikhonov proposed minimizing |KN f − g|2 + λJN (f ) for λ = λN → 0 as N → ∞, where JN (f ) penalizes f for an extrapolation with J(f ) too large. In particular, Tikhonov considered the natural 2 PN numerical approximation to J(f ), N −1 i=1 f i+2 − 2f i+1 + f Ni . Passing to N N the limit, as N → ∞, for fixed λ > 0 we see that we have defined f as the solution of λ R1 R 1 ′′ 2 2 the problem: Minimize 0 (Kf − g) + λ 0 [f (x)] dx. Under weak conditions on K, discretizations fλN of this problem converge to fλ in strong senses for fixed λ > 0, and fλ → f as λ → 0. By letting λN → 0 sufficiently slowly, we have fλN → f . As we indicated in Sections I.5 and 9.1, we think of regularization as follows: We are given a parameter θ(P ) such as θ(P ) ≡ f (·), the continuous case density, which is on a model P0 , for instance, all P with twice differentiable densities f such that Rdefined |f ′′ (x)2 |dx < ∞, but not on M ≡ all distributions. Since we cannot plug the empirical distribution Pb into θ(P ), we construct a sequence θs (P ), s ∈ R such that (i) θs is defined on M. (ii) θs (P ) → θ(P ) on P0 as s → ∞. Then use θbs (Pb) with the tuning parameter sb chosen using the data. We have already noted that if θ(P ) = f (·), then the parameter corresponding to convolution kernel estimation is Z fs (x) ≡ s K(s[z − x]) dP (z) R where s ≡ 1/h, h → 0. As we also noted the log likelihood log f (x)dPb (x) is not maximized for P ∈ P0 sinceP the supremum is infinite and is formally achieved as f (x) looks n more and more like n−1 i=1 δ(x − Xi ), where δ is the Dirac function. What Tikhonov regularization, as he introduced it, naturally corresponds to in this case is penalized maximum likelihood introduced by Good and Gaskins (1978) and Tapia and Thompson (1978). Section 11.4 Regularization and Nonlinear Density Estimates 281 Penalized maximum likelihood (PLM) Let F be a space of densities and let J : F → R+ be a “roughness penalty.” In analogy to Tikhonov we define a population parameter f (·, F ) as the minimizer fλ of Z log fλ (x)dF (x) + λJ(fλ ). (11.4.1) The corresponding estimate is just fλ (·, Fb ). Good and Gaskins (1978) took, for f on R ′ ) = {[f (x)]2 /f (x)}ds. A computationally simpler proposal, to take J(f ) = RR, J(f ′′ [f (x)]2 dx, consistent with the Tikhonov penalty and with proposals made by Wahba (1969) for nonparametric regression, was made by de Montricher, TapiaRand Thompson (1975). See also Wahba (1990). The solution to (11.4.1) for J(f ) = [f ′′ (x)]2 dx, if we restrict f to densities on [a, b] with f (j) (a) = f (j) (b) = 0, j = 0, 1, is a quadratic spline with one continuous derivative and knots at a, x(1) < · · · < x(n) , b. Here a spline is a continuous piecewise polynomial function with the knots t1 < · · · < tm being the points where the polynomial changes. That is, the spline f is a different polynomial of the (j) + same degree over each (tk , tk+1 ], and f (j) (t− (tk ), k = 1, . . . , m, some j ≥ 0. k) = f See de Boor (1978) for an extensive discussion. For proofs of these claims see Tapia and Thompson (1978), p.106. We do not pursue this subject further here. 11.4.2 Sieves. Machine Learning. Log Density Estimation A general method of regularization, introduced in Section 9.1, is the following: Let F be a general (NP) class of functions. Then, (i) Find a sequence of parametric models (usually but not necessarily nested), Fk ≡ {fk (·, θ k ) : θ k ∈ Rdk } such that every f ∈ F is the limit as k → ∞ in some norm of fk (·, θk (f )) (for some θk (f ) ∈ Rdk ), where fk ∈ Fk . (ii) Select a method of estimating θ k from the data, under the assumption that fk ∈ Fk holds. Typically this method is maximum likelihood. If specifying fk does not completely specify a model, such as a regression framework where the distribution of the error is not assumed to be of specified parametric form, least squares or some other contrast function criteria may be used. (iii) Determine, using a data based criterion or otherwise, a suitable sequence kn which tends to ∞ as n → ∞. bk ), where θ bk is the estimate (iv) Act as if Fkn is true. That is, estimate f (·) by fk (·, θ from (ii). When fk determines a model and the estimation method is maximum likelihood then this method, for fixed x, reduces to replacing f (x) by the regular parameter fk (x, θ k (f )) where Z θ k (f ) = arg max log fk (x, θ k )dF (x). (11.4.2) θ 282 nonparametric inference for functions of one variable Chapter 11 Recall Section 2.2.2 where we showed that in the i.i.d. case the MLE is the plug-in estimate of the θ that minimizes the Kullback-Leibler divergence between f (x) and fk (·, θ). In our bk ) is a minimum discrepancy estimate in the sense of Section new terminology, fk (x, θ 11.3. There are two critical choices involved, that of {Fk } and that of kn . We shall discuss the latter choice further in Section 12.5. The choice of {Fk } is clearly problem dependent. We want Fk to be a good approximation for the type of f we expect. The method of estimation is also dependent on the problem, and two other considerations. (a) Ease of computation Methods which are in closed form or involve optimization of convex functions over convex sets are preferred for obvious reasons. (b) Efficiency for the family Fk . The effectiveness of this criterion depends on how closely f can be approximated by a member of Fk since estimates may have bias of order much greater than √1n for f ∈ / Fk . Since no density of interest typically belongs to any Fk this criterion is secondary. Vapnik (1999) has argued for a third consideration. (c) Empirical optimization of a loss function: Machine Learning Vapnik (1998,1999), as did Wald (1950) before him, has argued that one should start with a loss function, attached to some feature of P , such as classification with 0-1 loss or estimation of E(Y |X) with L2 loss. Vapnik then requires us to specify a class of decision procedures. Given this framework, he insists one should minimize a natural empirical estimate of the risk. This point of view can be viewed as a modification of the method of sieves with model selection and fitting method combined. It differs from the classical Wald (1950) framework in two ways. First no model is specified but only a class of decision procedures. Second, empirical risk minimization has to be followed. For instance, having our loss function for choosing g when f is true as the Kullback-Leibler divergence and specifying Fk would, to Vapnik, be equivalent to choosing the class f is R of procedures in which estimated by f (·, θk ), θk ∈ Rk . The risk would be just − log f (x, θk )/f (x) f (x)dx. R The empirical risk which is to be minimized is − log f (x, θk )/f (x) dPb (x), which is R equivalent to maximum likelihood since log f (x)dPb(x) is fixed. If, however, we specify loss as Z Z Z Z 2 f (x) − f (x, θ k ) dx = f 2 (x)dx − 2 f (x, θk )f (x)dx + f 2 (x, θ k )dx R R we would be led to minimize −2 f (x, θk )dPb(x) + f 2 (x, θ k )dx, a method suggested by Rudemo (1982). Note that Vapnik’s criterion contradicts (b) unless Kullback-Leibler loss is used because the theory of Section 6.2 tells us that if P ∈ Fk (or presumably if it is close), then maximum likelihood should be strictly better than the Rudemo method even using Rudemo’s loss! See Problem 11.4.1. Log density estimation Section 11.4 283 Regularization and Nonlinear Density Estimates One sieve {Fk } that satisfies both efficiency and relative computational ease is based on expanding the log of the density. Let B1 , B2 , . . ., be a sequence of functions on (a, b) ⊂ R. Suppose that the Bj are such that finite linear combinations can approximate an “arbitrary” function. Examples of B1 , B2 , . . . include orthogonal polynomials with respect to different measures such as the Hermite, Laguerre, or Legendre systems, splines, trigonometric polynomials, and wavelets. Define fk (x, η) = exp k nX j=1 o ηj Bj (x) − A(η) h(x) (11.4.3) where η ∈ Ek , the natural parameter space of this canonical exponential family, which we assume open. Assume that we can approximate any f ∈ F , a “nonparametric” set of densities on (a, b) as follows, inf{ρ f (·, η), f : η ∈ Ek } → 0 (11.4.4) as k → ∞, where ρ is a “distance” such as the Kullback-Leibler discrepancy. P Let X1 , . . . , Xn be i.i.d. as X ∼ F , f = F ′ , and set B̄j (X) = n−1 nj=1 Bj (Xi ), pk (x, η) = n Y fk (xi , η). i=1 By Corollary 1.6.2 and Theorem 2.3.1, we have Proposition 11.4.1. If Ek is open and 1, B1 (x), . . . , Bk (x) are linearly independent, then b which maximizes log pk (x, η) exists and satislog pk (x, η) is strictly concave. A unique η fies Ȧ(b η ) = B̄(x), where B̄(x) = (B̄1 (x), . . . , B̄k (x))T . Recall from Section 2.2.2 that this ηb is the plug-in estimate of the minimizer of the Kullback-Leibler divergence between f (·) and fk (·, η). The estimate of f (x) is now b ). fbk (x) = fk (x, η A key property of fbk not satisfied by many of the methods we have considered so far is that fbk ≥ 0 by definition. Here η can be computed using standard optimization algorithms; see Section 2.6, for example. Also note that [log fbk (x)]′ , which has a simple expression, is an estimate of [log fk (x)]′ as required in Example 9.3.1. If the Bj are a splines basis Stone and Koo (1986) and Stone, Hansen, Kooperberg and Truong (2007) have established further properties of fbk (x). We are left here too with a choice of tuning parameters, in this case the number and location of the knots. We turn to this in Section 12.5. 284 nonparametric inference for functions of one variable 11.4.3 Chapter 11 Nearest Neighbor Density Estimates We close with a simple nonlinear method that plays an important role in classification. Consider the convolution kernel estimate with K corresponding to U(−1, 1). Then Fb (x + h) − Fb(x − h) fbh (x) = . 2h (11.4.5) It seems reasonable to use larger h if f (x) is small. One way of achieving this is to note that for h close to 0 and n large, Fb(x + h) − Fb (x − h) ≃ 2f (x)h , and then choose h so that Fb (x + h) − Fb (x − h)− is the same for each x. This notion corresponds to finding h such that Fb (x + h) − Fb (x − h)− = k and choosing b h = b h(k, x) as the smallest such h. Let |x − X|(1) < · · · < |x − X|(n) denote |x − Xi |, 1 ≤ i ≤ n, ordered, then where ik is defined by b h(k, x) = |x − Xih | |x − Xih | = |x − X|(k) . The kth -nearest neighbour (kNN) density estimate is Fb x + b h(k, x) − Fb (x − b h(k, x)) − k e fk (x) = = . b b 2h(k, x) 2h(k, x) (11.4.6) Evidently, {b h(k, x)} can also be used as a family of bandwidth choices for the other kernels. Unfortunately, it is easy to see that if S(f ) = R, p lim|x|→∞ fek (x) 6= 0 and thus kNN estimates are properly defined and renormalizable as densities only when S(f ) is a finite interval (Problems 11.4.2 and 11.4.3). Using the arguments used to establish Corollary 11.2.1, we can show (see Problem 11.4.4) Theorem 11.4.1. If f has support [a, b] for finite a and b, if k = kn → ∞, n−1 kn → 0 as n → ∞, if x ∈ (a, b), and if sup |f ′′ (x) : x ∈ [a, b]| ≤ M for some M > 0, then, uniformly in f , M SE fek (x) = O(k −1 ). Thus fek is consistent. Summary. We compare our approach to regularization with regularization based on roughness penalties. Our approach involves approximating a parameter θ(P ) where θ(Pb ) is undefined with a parameter θs (P ) where θs (Pb) is defined, and using θs (Pb) to estimate θ(P ). Section 11.5 Confidence Regions 285 This approach applies in particular when θ(P ) is a smooth function such as f (·). In statistics, the roughness penalty approach to estimating a smooth function f (·) that does not allow a plug-in estimate involves finding the minimizer fλ (·) = fλ (·; λ) of an expression of the form Z ρ fλ (x), f (x) dx + λJ(fλ ) where the R first term measures the discrepancy between f , and J(fλ ) is a roughness penalty such as |fλ′′ (x)|2 dx that increases with the “roughness” of fλ , and λ > 0 is a tuning parameter. The estimate fλ (·; Fb ) of f (·) is typically a piecewise polynomial called a spline. This approachincludes the penalized maximum likelihood approach obtained by setting ρ fλ (x), f (x) = f (x) log fλ (x). Next we discuss the application of the method of sieves to the regularization and estimation of an irregular function f (·) and discuss its properties in terms of Wald decision theory and Vapnik learning theory. We apply the sieve approach to log density estimation and find that by using results from Section 1.6 on convexity in exponential family models, we obtain simple estimates of the log density and its derivative. Finally we introduce kth nearest neighbor estimates that are obtained from convolution kernel estimates with K(u) = 12 1[|u| ≤ h] by choosing h so that there are k x’s in the interval [−h, h]. 11.5 Confidence Regions We have until now focussed on estimation. For testing the hypothesis that f is a certain parametric shape it is, in some ways (see Section 9.5), better to use goodness-of-fit tests based on the empirical df Fb , rather than on fbh . However, by plotting confidence regions for f based on fbh , we get insights into the possible shapes for f provided the sample size is large. We will indicate how to find such regions using an asymptotic result of Bickel and Rosenblatt (1973) and the bootstrap. We motivate this approach by developing the regions from student t-type intervals. First we consider confidence intervals for the density f (x) with x a fixed point of interest, then we show how to widen these intervals to obtain confidence regions valid for an interval of x’s. We use ideas from Section 5.3. Set Zi = Kh (x − Xi ), θ = E(Zi ), τ = Var(Zi ). Then fbh (x) = Z, θ = fh (x), and to get a confidence interval for fh (x) we can use the pivot √ n(Z − θ) (11.5.1) sz where s2z is the sample variance of Z1 , . . . , Zn . The asymptotic distribution of this pivot is N (0, 1) (see (5.3.18)). Moreover, Examples 4.4.1 and 5.3.3 suggest that its distribution can be closely approximated by the Tn−1 distribution. Thus √ fbh (x) ± |tn−1 |1−α sz / n (11.5.2) 286 nonparametric inference for functions of one variable Chapter 11 defines an approximate level (1 − α) confidence interval for fh (x). If h = cn−a with 1 5 < a < 1, then (11.2.8) implies that the interval is asymptotically valid for f (x) as well under the conditions of Corollary 11.2.1, because √ √ nh[fbh (x) − f (x)] = nh{[fbh (x) − fh (x)] + [fh (x) − f (x)]} √ and nh[fh (x) − f (x)] → 0 by (11.2.8) as n → ∞. However, h = cn−a , 51 < a < 1, 1 excludes the asymptotically optimal choice h = cn− 5 . If instead of estimating τ we approximate it using (11.2.10), we obtain the pivot √ nh[fbh (x) − fh (x)] p T (fh (x), fbh (x)) = (11.5.3) ν(K) fh (x) which by the Central limit theorem and Slutsky’s theorem has an asymptotic N (0, 1) distribution (Problem 11.5.1). Solving |T (fh (x), fbh (x))| ≤ |z|1−α 2 2 b b2 b for fh (x) is equivalent to solving √ θ − (2θ + cα )θ + θ ≤ 0 for θ where θ = fh (x), θ = fbh (x), and cα = |z|1−α ν(K)/ nh. The solution (see Example 4.4.3) gives the approximate level (1 − α) confidence interval fb− (x) ≤ fh (x) ≤ fb+ (x) where q (11.5.4) fb± (x) = fbh (x) + c2α /2 ± cα fbh (x) + c2α /4. Under the conditions of Proposition 11.2.2, this band is valid asymptotically for fh (x), and for f (x) if h = cn−a , 15 < a < 1. Asymptotic regions To modify the above intervals to have probability (1 − α) of containing f (x) for all x in an interval I = [r, s], one approach is to widen the preceding intervals to attain asymptotic level (1 − α). Thus we could use the α quantile kα of the asymptotic distribution (Bickel and Rosenblatt (1973)) of T = sup |T (fh (x), fbh (x))|. x∈I This would lead to a simultaneous confidence region fb± (x), x ∈ [r, s], that covers fh (x) (and, under certain conditions, f (x)) with probability tending to (1 − α) as n → √ ∞ when cα in (11.5.4) is replaced by e cα = e kα ν(K)/ nh, where e kα is the asymptotic Bickel-Rosenblatt (1973) approximation to kα . Here e kα can be a poor approximation to kα for small and moderate n. Unfortunately, the ordinary bootstrap is, we believe, not an alternative but the m out of n bootstrap may be — see Bickel and Sakov (2008). Summary. Confidence regions for a curve provide a visualization of the possible shapes of that curve provided the sample size is large. We first give approximate confidence Section 11.6 Nonparametric Regression for One Covariate 287 intervals for fh (x) and f (x) for a fixed x by using pivots based on treating fbh (x) as a sample average. Then we show how to construct (1 − α) confidence regions for the curves fh (·) and f (·) by taking the maximum over x of the fixed x pivot and using the asymptotic distribution of this new pivot to construct the boundaries of the (1 − α) confidence region. 11.6 Nonparametric Regression for One Covariate 11.6.1 Estimation Principles This book has considered many experiments where the task is to model and analyze statistically the relationship between a response variable Y and a vector of covariates X. In this section we consider nonparametric regression in the case of a one-dimensional covariate X. The one dimensional procedures are often used as building blocks for the d-dimensional case, and some of the one dimensional methods we consider will have straightforward ddimensional extensions. Moreover in some applications there is one variable X of principle interest while the others are confounding variables. In this case Y will be the response after the effect of the other variables have been subtracted out or controlled for. Throughout the book we have considered various models connecting Y with X. In Section 2.2.1 we argued that for x restricted to a narrow range, a model linear in x may provide a good fit to µ(x) = E(Y |X = x). We now will allow µ(x) to take a general shape over the range X of X. One very effective technique is the one suggested in Section 2.2.1: Do linear fits for X restricted to local neighborhoods of multiple x’s, then piece together the results from the multiple fits. The result is local linear regression. Principles for construction of estimated nonparametric regression include local averaging, local parametric modelling such as local linear regression, global modelling, and penalized modelling. Or we can make a somewhat different breakdown: (i) Methods based on plugging the empirical probability Pb into a regular approximation to µ(x) in the formula, Z µ(x) = E(Y |X = x) = y f (y|x) dy where f (· | ·) is the density of (Y |X = x). (ii) Methods based on viewing E(Y |X) as the minimizer of E(Y −g(X))2 over g(·) (see Section 1.4) and obtaining a regular approximation to µ(·) by restricting g to a locally parametric class of functions and replacing quadratic loss with locally weighted quadratic loss. Under each of these, there are local methods and global methods. We begin with two examples of paradigm (i) approaches. We assume that we observe (Xi , Yi ), 1 ≤ i ≤ n, i.i.d. as (X, Y ) with (X, Y ) having a continuous case density f ∈ F . We let PeX (·) and Pb (·, ·) denote the empirical distributions of X1 , . . . , Xn and (Xi , Yi ), 1 ≤ i ≤ n. 288 nonparametric inference for functions of one variable Chapter 11 Example 11.6.1. The regressogram and Nadaraya-Watson estimates. If we divide the x axis into intervals Ij = (jh, (j + 1)h], −∞ < j < ∞, the natural approximation to the parameter f (·|x) is the conditional density fh (·|x) of Y given X ∈ I(x), the Ij to which x belongs. This leads to approximating the parameter µ(x) by Eh (Y |X = x) = ∞ Z X j=−∞ y f (y|x) 1 (x ∈ Ij ) dy . The empirical plug-in estimate is µ bR h (x) ≡ ∞ X j=−∞ Ȳjh 1(x ∈ Ij ) where Ȳjh is the average of the Yi with Xi ∈ Ij . This is called the regressogram and like the histogram density estimate doesn’t take enough advantage of possible smoothness of µ(x). Let Kh (t) = h−1 K(t/h) where K is a kernel as in Section 11.2. A smooth weighted average is Pn Yi Kh (Xi − x) µ bNW (x) = Pi=1 , (11.6.1) n i=1 Kh (Xi − x) the Nadaraya-Watson (NW) estimate. By selecting a kernel with enough derivatives we can intuitively obtain a version of NW which is as smooth as we please. ✷ We can obtain the NW estimate by viewing it from the optimization viewpoint as well and, in that way, also generalize itP to estimate smoother µ(x). Both are examples of lon cal averaging estimates, µ b(x) = i=1 Yi wn (x, Xi ) where the data-determined weight, wn (x, Xi ), decrease as |Xi − x| increases. Example 11.6.2. Minimum contrast and local polynomial estimates. Here is a class of procedures which falls under paradigm (ii). Let x be fixed and let µ(z, β) be a parametric family of functions such that for suitable β(x), µ(x, β(x)) = µ(x) and µ(z, β(x)) approximates µ(x) well locally for z close to x. Mainly we consider µ(z, β(x)) a polynomial in z − x. Suppose K ≥ 0. We obtain a regular parameter by replacing minimization of E(Y − g(X))2 by that of the discrepancy Z 2 y − µ z, β(x) Kh (z − x)dzf (x, y)dxdy (11.6.2) D β(·) = with respect to β = β(x), where β(x) is in Rp for some p ≥ 0, and then approximating µ(x) by µ(x, β(x)). As h → 0 we approach paradigm (ii). To simplify our notation, we introduce (X ′ , Y, Z) with density proportional to Kh (z − x)f (x, y)1 x ∈ S(f ) . Conditioning on X ′ = x, we obtain (Problem 11.6.1) Z 2 y − µ(z, β) q(z, y|x)dzdy β(x) = arg min β Section 11.6 Nonparametric Regression for One Covariate 289 where q(z, y|x) is the conditional density of (Z, Y )|X ′ = x, that is q(z, y|x) = f (z, y)Kh(z − x)1(x ∈ S(f )) fh (x) with fh (·) the marginal density of X ′ given by Z fh (x) = Kh (z − x)f (z, y)dzdy1 x ∈ S(f ) Z = Kh (z − x)fX (z)dz1 x ∈ S(f ) . Note that β(x) = β(P ; x) ≡ arg min β R 2 y − µ(z, β) Kh (z − x)1 x ∈ S(f ) dP (z, y) R . (11.6.3) Kh (z − x)1 x ∈ S(f ) dPX (z) The empirical plug-in estimate β(Pb ; x) is a minimum contrast and local modelling estimate in the same spirit as (11.3.1); we approximate µ(x) locally by µ(z, β), where β can be expressed as β(P ; x). If we specialize to µ(z, β) = β ∈ R we obtain the Nadaraya-Watson estimate (Problem 11.6.2). If we take µ(z, β) = β0 + β1 (z − x) we get β1 (x, P ) = EQ (Y (Z − EQ Z)) , VarQ Z β0 (x, P ) = EQ (Y )−β1 (x, P )(EQ (Z)−x) (11.6.4) where Q is the distribution corresponding to q(z, y|x). See Remark 11.3.1. We can use empirical plug in because dependence on P is purely through dP (z, y) and dPX (z). Specifically, Pn Yi (Xi − X̄(x))Kh (Xi − x) β1 (x, Pb ) = Pi=1 (11.6.5) n 2 i=1 (Xi − X̄(x)) Kh (Xi − x) where Pn i=1 Xi Kh (Xi − x) X̄(x) = P . n i=1 Kh (Xi − x) Set Ȳ (x) = µ bNW (x), then the estimate of µ(x) is the locally linear estimate µ bh (x) = β0 (x, Pb ) = Ȳ (x) − β1 (x, Pb )(X̄(x) − x) . (11.6.6) (11.6.7) Evidently, we may replace β0 + β1 (z − x) by a higher order polynomial in (z − x) and use Theorem 1.4.4 to obtain locally polynomial estimates. ✷ Remark 11.6.1. We obtain a robust minimum contrast estimate if we replace the squared error e2 , e = y −u, in (11.6.2) by the Huber function ρ(e) = e2 , |e| ≤ k, ρ(e) = k|e|, |e| > k. With this alteration, our estimates resemble the popular LOESS estimate of Cleveland (1979) and Cleveland and Devlin (1988). 290 nonparametric inference for functions of one variable 11.6.2 Chapter 11 Asymptotic Bias and Variance Calculations Our calculations are kept simple by first computing the bias and variance conditional on X = (X1 , . . . , Xn )T and then computing limits in probability as n → ∞. Thus, because E(Yi |X) = µ(Xi ), Pn n−1 i=1 Kh (Xi − x)µ(Xi ) Pn E{b µNW (x)|X} = . (11.6.8) n−1 i=1 Kh (Xi − x) Moreover, if kµ′′′ k∞ ≤ M < ∞, then uniformly in x, 1 µ(Xi ) = µ(x) + µ′ (x)(Xi − x) + µ′′ (x)(Xi − x)2 + OP (Xi − x)3 2 which makes E µ bNW (x)|X a simple function of the i.i.d. sums Snj = n X i=1 (11.6.9) j Xi − x Kh Xi − x , j = 0, 1, 2. Recall that [a, b] is the support of the density f (x) of X and recall the notation Z Z j mj = mj (K) = u K(u)du, ν = ν(K) = ν0 (K), νj = νj (K) = uj K 2 (u)du. Theorem 11.6.1. Suppose that [c, d] ⊂ (a, b), inf{f (x) : x ∈ [c, d]} > 0, and that µ′′′ (·) and f ′′ (·) are bounded on [c, d]. If K is symmetric, m2 (K) < ∞, and ν5 (K) exists, then, for the estimate (11.6.1), uniformly for x ∈ [c, d] as n → ∞, h = hn → 0, 1 ′′ 2µ′ (x)f ′ (x) Bias[b µNW (x)|X ] = µ (x) + m2 (K)h2 + OP (h3 ) . (11.6.10) 2 f (x) Proof. Using (11.6.8) and (11.6.9), we have Bias[b µNW (x)|X] = µ′ (x)Sn1 + 21 µ′′ (x)Sn2 + OP (Sn3 ) . Sno (11.6.11) The rest of the proof uses Chebychev’s inequality, the law of large numbers, and Taylor expansions. See Appendix D.6. Remark 11.6.2. We see from the expansion used to derive (11.6.10) that under the conditions of Theorem 11.6.1, if m2 (K) = 0, and if m4 (K) < ∞, then the conditional bias of µ bNW is of the order OP (h4 ). In general, if r is even, f ′′ and µ(r+3) are bounded, K is symmetric, mj (K) = 0, j = 1, . . . , r, and mr+3 (K) < ∞, then the conditional bias of µNW (K) is of order OP (hr+2 ). Kernels with mj (K) = 0, j = 1, . . . , r, are called higher order kernels of order r. See Problem 11.2.8 for an example. If r is odd, the preceding with r replaced by r − 1 applies. ✷ Section 11.6 291 Nonparametric Regression for One Covariate We turn to Var(b µNW (x)|X) = n X wi2 (x)σ 2 (Xi ) (11.6.12) i=1 where X σ 2 (x) = Var(Y |X = x), wi (x) = Kh (Xi − x) Kh (Xi − x) . (11.6.13) Theorem 11.6.2. Suppose that for [c, d] ⊂ (a, b), inf f (x) : x ∈ [c, d] > 0 and that f ′ (x) and [σ 2 (x)]′ are bounded on [c, d]. If νj (K) < ∞, j = 0, 1, then, uniformly for x ∈ [c, d], as n → ∞, h → 0, nh → ∞, ν(K)σ 2 (x) 1 Var(b µNW (x)|X) = . (11.6.14) + oP nhf (x) nh Proof. See Appendix D.6. Remark 11.6.3 The bias-variance tradeoff is again clear. Let B(x)h4 and V (x)(nh)−1 denote the leading terms of the squared conditional bias (11.6.10) and variance (11.6.14), then the asymptotic conditional MSE is Ah (x) = B(x)h4 + V (x)(nh)−1 . By solving (d/dh)Ah (x) = 0 for h, we find the minimizer h = hoptimal = V (x) 4B(x) 51 1 n− 5 . It follows that the conditional MSE of µ bNW (x) satisfies 4 1 4 4 4 inf MSE(b µNW (x)|X) = 5 · 4− 5 B 5 (x)V 5 (x)n− 5 + oP (n− 5 ). h>0 1 4 Thus µ bNW (x) with h of the form cn− 5 converges to µ(x) at the rate OP (n− 5 ). ✷ If S(f ) = [a, b] and xℓ = a + λh, xr = b − λh are left and right boundary points, then µ bNW (xℓ ) is still consistent, but the bias at these points is of order OP (h) rather than the OP (h2 ) rate of Theorem 11.6.1. To improve on this rate we turn to other estimates. Local polynomial estimates In Example 11.6.2 consider the polynomial µ(z; β) = p X k=0 βk (z − x)k . 292 nonparametric inference for functions of one variable Chapter 11 The partial derivative with respect to βj is ∇j µ(z − x, β) = (z − x)j and the minimizer of the discrepancy (11.6.2) satisfies the set of linear equations (j = 0, . . . , p) EP {Y (X − x)j Kh (X − x)} = p X k=0 EP {(X − x)j Kh (X − x)(X − x)k }βk . (11.6.15) Let L(X − x) = (1, X − x, . . . , (X − x)p )T , then the solution to (11.6.15) is β = β(x; P ) = [EP {L(X −x)Kh (X −x)LT (X −x)}]−1 EP {Y Kh (X −x)L(X −x)} provided the indicated matrix inverse exists. Thus the empirical plug-in estimate of β is b β(x) = (Z T W Z)−1 Z T W Y (11.6.16) where W = Diagonal(Kh (X1 − x), . . . , Kh (Xn − x)),Y = (Y1 , . . . , Yn )T i, and Z = ((Xi − x)j ), i = 1, . . . , n, j = 0, . . . , p. Note that (11.6.16) is also the solution to the weighted least squares problem with weight matrix W ; see (2.2.20). Using the notation b β(x) = (βb0 (x), βb1 (x), . . . , βbp (x))T the estimate of µ(x) is b b µ b(x) = µ(z; β(x))| z=x = β0 (x). For the local linear case where p = 1, this reduces to (11.6.7). We next show that when µ is smooth the rate at which the bias tends to zero as h → 0 can be made arbitrarily fast by selecting p large. Note that (11.6.16) and E(Yi |X i ) = µ(Xi ) yields b E(β(x)|X) = (Z T W Z)−1 Z T W µ where µ = (µ(X1 ), . . . , µ(Xn ))T . Set β = (µ(x), µ′ (x), . . . , µ(p) (x)/p!)T and r = b µ − Zβ; then the conditional bias of β(x) as an estimate of β is b Bias(β(x)|X) = (Z T W Z)−1 Z T W r. (11.6.17) Here is a key lemma. Lemma 11.6.1. Assume that inf{f (x) : x ∈ [c, d]} > 0 for [c, d] ⊂ (a, b). Also assume that f ′ (·) and µ(p+2) (·) are bounded on [c, d]. Then, uniformly for x ∈ [c, d], r = (βp+1 (Xi − x)p+1 + oP (Xi − x)p+1 )1≤i≤n . (11.6.18) Proof. The result follows because by Taylor expansion µ(Xi ) = p X j=0 βj (Xi − x)j + βp+1 (Xi − x)p+1 + oP (Xi − x)p+1 , (11.6.19) Section 11.6 Nonparametric Regression for One Covariate that is, the first (p + 1) terms in the Taylor expansion of r are identically zero! It follows that 293 ✷ b Proposition 11.6.1. If µ b(x) is a polynomial of order p, then β(x) is unbiased. In particular, b µ b(x) = β0 (x) is unbiased. Lemma 11.6.1 also yields the asymptotic conditional bias of µ b(x) in general: For the b proof and the conditional bias of the vector β(x), see Appendix D.6. Theorem 11.6.3. Suppose that K is symmetric with m2p+1 (K) < ∞. Under the conditions of Lemma 11.6.1, uniformly for x ∈ [c, d], Bias [b µ(x)|X] = ηp+1 (K) (p+1) µ (x)hp+1 + oP (hp+1 ) (p + 1)! (11.6.20) where ηp+1 (K) is a constant defined in Appendix D.6. When p is even, then ηp+1 (K) = 0 (Appendix D.6), so the leading term in (11.6.20) is zero. In the case of p = 0, we have shown that the next term in the expansion gives (11.6.10) as the leading term, which is seen to be OP (h2 ). For the leading term when p ≥ 2 is even, see Fan and Gijbels (1996, Theorem 3.1). When p = 1, we show in Appendix D.6 that η2 (K) = m2 (K), thus 1 Bias µ b(x)|X = m2 (K)µ′′ (x)h2 + oP (h2 ), p = 1 . 2 This bias is of same order, OP (h2 ), as the bias (11.6.10) of the local constant estimate. However, at boundary points we can show that the local linear estimate retains the order OP (h2 ) while the bias of the local constant estimator converges to zero at the slower rate R1 OP (h) (Problem D.6.5). Let mℓ,λ (K) = −λ uℓ K(u)du and Qλ (K) = m22λ (K) − m1λ (K)m3λ (K) . m2λ (K)m0λ (K) − m21λ (K) Proposition 11.6.2. Suppose the support of f is S(f ) = (a, b) with a and b finite, K is symmetric with K(u) = 0 for |u| > 1, µ(p+1) (a+ ) exists, f (a+ ) > 0, and f (·) and µ(p+1) (·) are right continuous at a, then for x = a + λh, (a) for p = 0, Bias [b µ(x)|X] = m2λ (K)µ′ (a+ )h + oP (h); (b) for p = 1, Bias [b µ(x)|X] = 1 Qλ (K)µ′′ (a+ )h2 + oP (h2 ) . 2 The variance (11.6.21) (11.6.22) 294 nonparametric inference for functions of one variable Chapter 11 Theorem 11.6.4. Suppose K is a symmetric density with ν2p+1 (K) < ∞, inf{f (x) : x ∈ [c, d]} > 0 for [c, d] ⊂ (a, b) and that f ′ (x), µ′p+1 (x) and [σ 2 (x)]′ exist and are bounded on [c, d]. If h → 0 and nh → ∞ as n → ∞, then, uniformly for x ∈ [c, d], Var(b µ(x)|X) = νp (K) σ 2 (x) (nh)−1 {1 + op (1)} , f (x) where νp (K) is a constant given in Appendix D.6. It follows from Appendix D.6 that ν1 (K) = ν0 (K), and the asymptotic variances for the p = 0 and p = 1 cases coincide. Also note that the order OP {(nh)−1 } of Varb µ((x)|X) does not depend on p. Thus by increasing p, assuming smoothness, we can make the asymptotic conditional bias arbitrarily small without increasing the order of the asymptotic conditional variance. We next turn to conditional MSE, its asymptotic and integrated versions, and its minimizers. MSE and IMSE Corollary 11.6.1. Under the conditions of Theorems 11.6.3 and 11.6.4, uniformly for x ∈ [c, d], 2 σ 2 (x) ηp+1 (K) (p+1) µ (x)hp+1 + νp (K) (nh)−1 M SE(b µ(x)|X) = (p + 1)! f (x) oP {hp+1 + (nh)−1 } . (11.6.23) We denote the sum of the first two terms in (11.6.23) by A(x, h). Note that A(x, h) is of the form B(x)h2(p+1) + V (x)n−1 h−1 , where B(x) and V (x) are given in (11.6.23). By solving (d/dh)A(x, h) = 0 for h we find the minimizer of A(x, h) V (x) h0 (x) = arg min A(x, h) = h>0 2(p + 1)B(x) 1 2p+3 1 n− 2p+3 . By substituting h0 (x) in A(x, h) we find 2p+2 min A(x, h) = c′ n− 2p+3 h>0 (11.6.24) where c′ does not depend on n. The same order of convergence holds for the IMSE. Remark 11.6.4. Note that when µ(x) is infinitely differentiable the rate of convergence of the MSE and IMSE to zero can be made arbitrarily close to the regular parametric rate n−1 . ✷ The (asymptotic) integrated mean squared error (IMSE) is Z IM SE(b µ) = A(x, h)w(x)dx Section 11.6 295 Nonparametric Regression for One Covariate for an appropriate weight function w(x). If f (·) has finite support [a, b], then w(x) = f (x) is a natural choice. In this case, if σ 2 (x) is constant and E [µ(p+1) (X)]2 < ∞, the optimal h is (Problem 11.6.2.1) 1 ( ) 2p+3 1 [b − a]σ 2 h0 = Cp (11.6.25) n− 2p+3 . (p+1) 2 E [µ (X)] 2p+2 With this h0 , IMSE tends to zero at the rate n− 2p+3 . Regression curves So far our main focus has been on estimating the regression level µ(x) = E(Y |X = x). However, in parametric analysis the emphasis is on estimating regression slopes (often called regression coefficients) because they give the expected change in the response Y as the covariate x is increased (or decreased). In a nonparametric framework the regression curve b(x) ≡ µ′ (x) has a similar interpretation, and is also of interest. Recall that our b = β(x) b b (x), . . . , β b (x))T provides an estimate µ local polynomial estimate β = (β b(z) = 0 p Pp b j b(z) for z close to x. It follows that j=0 β j (x)(z − x) of µ bb = bb(x) ≡ µ b (x) b′ (z)|z=x = β 1 is an estimate of b(x). The asymptotic properties of bb are contained in the results in Appendix D.6. We state these results when p = 2, which is a good choice of p because it gives conditional bias of order h2 at boundary as well as interior points. Theorem 11.6.5. Suppose the conditions of Theorems 11.6.3 and 11.6.4 are satisfied with p = 2. Then n o2 m4 (K) (3) ν2 (K) σ2 (x) −1 −3 2 MSE(bb(x)|X) = 61 m µ (x)h +m n h + oP (h2 + n−1 h−3 ) 2 (K) 2 (K) f (x) = B(x)h4 + V (x)n−1 h−3 + Rn , say. The asymptotically optimal bandwidth h is obtained by setting the derivative of Bh4 + 1 V n−1 h−3 equal to zero, which with C(K) = {27m2 (K)ν2 (K)/m24 (K)} 7 (Problem 11.6.7) yields h0 = 3V 4B 17 n − 71 = C(K) σ 2 (x) [µ(3) (x)]2 f (x) 71 1 n− 7 . (11.6.26) Smoothing, variance estimation, and confidence regions Our estimates of µ(·) are often referred to as linear smoothers of the data (Xi , Yi ), ′ b= µ 1 ≤ i ≤ n. More precisely, let µ = µ(X1 ), . . . , µ(Xn ) and µ b(X1 ), . . . , µ b(Xn ) be an estimator (predictor) of µ. Then we can write b = SY µ (11.6.27) 296 nonparametric inference for functions of one variable Chapter 11 for some n × n smoother matrix S. In the case of the NW estimate, (11.6.1), S = wi (Xj ) n×n (see (11.6.13)). Estimates of the form SY include regression splines; see e.g. Hastie and Tibshirani (1990). For variance estimation, we consider the model Yi = µ(Xi ) + εi , i = 1, . . . , n where Xi and εi are independent and v(Xi ) = Var(Y |Xi ) = Var(εi ). An estimator of T v = v(X1 ), . . . , v(Xn ) is obtained by considering TR2 for some n × n smoother matrix T, where R = (I − S)Y is the vector of residuals and R2 is the vector of squared residuals. To obtain a bias correction for TR2 , we note that when v(x) ≡ σ 2 , 2 E(R2 |X) = E(b µ − µ|X) (1 + A)σ 2 where A = diag(SST − 2S) and 1 = (1, . . . , 1)Tn×1 . To reduce bias, we write A = (aij ) and set Ri , r2 = (r12 , . . . , rn2 )T . ri = √ 1 + aii Our estimate of v is (see Ruppert, Wand, Holst and Hössjer (1997)) b = Tr2 . v b is conditionally unbiased (see Proposition 11.6.1) and v(x) ≡ σ 2 , vb is condiWhen µ tionally unbiased. Moreover, for local polynomial estimates, by Theorem 11.6.3, the conditional bias is of order oP (h2(p+1) ). In the homeoscedastic case when v(x) ≡ σ 2 , we use σ b2 = n−1 n X i=1 vb(Xi ) . (11.6.28) We now turn to the estimation of the variances Pnof our estimates. The estimates of µ(j) (·), j = 0, . . . , p, can be written in the form i=1 wij (x)Yi for appropriate wij (x); see (11.6.1), (11.6.5), and (11.6.7). It follows that in the homeoscedastic case n X 2 wij (x) , Var µ b(j) (x)|X = σ 2 i=1 where we use σ b2 from (11.6.28) to estimate σ 2 . In this case, under conditions on h where (j) µ b (x) has bias of lower order than the order of its standard deviation, the pivot Zj (x) = µ b(j) (x) − µ(j) (x) Pn 2 (x)} 21 σ b{ i=1 wij will approximately have a N (0, 1) distribution and yield approximate (1 − α) confidence intervals for µ(j) (x) with x in a grid G = {x(1) , . . . , x(g) }. We expect 100(1−α)% of these intervals to contain the true values of µ(j) (x), x ∈ G. Simultaneous level (1 − α) intervals can be obtained from the asymptotic distribution of supx |Zj (x)| (Nadaraya (1989)). Section 11.7 297 Problems and Complements Multiple testing in matched pair experiments In Section 4.9.2 we considered experiments where subjects serve as their own control and the difference Y between a treatment response and a control response is recorded. Now in addition to Y we include a covariate such as age or BMI (body mass index). In this case we may want to investigate for which covariate values the treatment has an effect and turn to multiple testing where we test Hj : µ(xj ) = 0, j = 1, . . . , g, vs one-sided or two-sided alternatives using Zj (x) or |Zj (x)|. The simultaneous confidence intervals discussed above provide possible solutions to the multiple testing problem. If we test H0 : µ(x) = 0 for all x, µ b(j) (x) will not have a bias problem under H0 . Bandwidth choice The discussion of Section 11.2.3 also applies to the local polynomial estimates of µ(·). Again the reference distribution approach to bandwidth selection yields the optimal rate of convergence but not the optimal constant when p = 1 if we assume a well behaved µ′′ R ′′ 2 but are not willing to estimate µ (x) f (x)dx nonparametrically. For instance, we can use the parametric model where (Y |X = x) is N (α0 + α1 x + α2 x2 , σ 2 ) and substitute the parametric estimates of σ 2 and µ′′ (x) = α2 in (11.6.25) with (b − a) replaced by (X(n) − X(1) ). However, the soundest approach is cross validation, which adapts to the degree of smoothness present. See Section 12.5. Summary. Let (X1 , Y1 ), . . . , (Xn , Yn ) be i.i.d. as (X, Y ) ∼ P where (X, Y ) is continuous. We consider kernel estimates µ bh (x) of the curve µ(x) = EP (Y |X = x) based on weighted local fits to µ(x) over intervals [x − h, x + h]. The Nadaraya-Watson (NW) estimate µ bNW (x) is a kernel estimate of the form Σwi (x)Yi , where wi (x) = Kh (Xi − x)/ n X i=1 Kh (Xi − x), and can be viewed as giving a locally constant fit to µ(x) over [x − h, x + h]. We also consider locally linear and locally polynomial fits and give their bias and variance properties for n → ∞ and h → 0. The NW estimate has conditional bias of order OP (h2 ) for interior points x and order OP (h) for boundary points while the locally linear estimate has conditional bias of order OP (h2 ) for all points x in the support of X. These estimates have the same asymptotic variance. 11.7 Problems and Complements Problems for Section 11.2 1. Let X1 , . . . , Xn be i.i.d. with density f , where f has support S(f ). (a) Suppose S(f ) = R. Show that (2h)−1 Pb (x − h, x + h] = Jh (u) = h−1 J(u/k) and J(t) = 21 1[−1 ≤ t ≤ 1]. R Jh (x − z) where 298 nonparametric inference for functions of one variable Chapter 11 (b) Suppose that S(f ) = [a, b] and 0 < 2h < b − a. Specify Kh (x, z) so that Z Pb(x − h, x + h] = Kh (x, z) dPb(z) . 2h Consider (i) x ∈ [a, a + h), (ii) x ∈ [a + h, b − h], and (iii) x ∈ (b − h, b] 2. Suppose K is a kernel with K ≥ 0 and support S(K) = [−c, c], 0 < c ≤ ∞. Let Xh = Z + hV where Z ∼ f , V ∼ K, Z and V are independent, S(f ) = [a, b], −∞ ≤ a < b ≤ ∞, 2h < b − a. (a) Show that Xh has density fh (x) = Z b a Kh (x − z)f (z)dz, x ∈ [a − ch, b + ch] = 0, otherwise. Hint. Z and V have joint density f (z)K(v), z ∈ [a, b], v ∈ [−c, c]. By the Jacobian transformation theorem (Theorem B.2.2), Xh and Z have joint density f (z)h−1 K (x − z)/h , z ∈ [a, b], x ∈ [a − ch, b + ch] . Rb fh (x)dx = 1. Thus a fh (x)dx may not be one and Rb fh (x) may not be a density on [a, b]. Find a fh (x)dx when V = U ∼ U[−1, 1] and (i) f is N (0, 1), (ii) f is U[0, 1]. Rb (c) We can think of fh (x) = a Kh (x − z)f (z)dz as a weighted average of f (z) with weights Kh (x − z), z ∈ [a, b]. However, the weights may not integrate to one. Let Rb Wh (x) = a Kh (x − z)dz. Show that Wh (x) = L (x − a)/h − L (x − b)/h , Rt where L(t) = −∞ K(t)dt. (b) It follows from (a) that R b+ch a−ch (d) Suppose −∞ < a < b < ∞. For x = a + λh and x = b − λh, 0 < λ < 1, find Wh (x) as defined in (c) when (i) K is U[−1, 1] and (ii) K = KE . (e) We can get a closer approximation Rto f (x) than fh (x) by rescaling the weights Kh (x − z); that is, use f (x; h) = Kh (x, z)f (z)dz with Kh (x, z) = Kh (x − Rb z)/Wh (x) and Wh (x) as in (c). Show that a f (x; h)dx = 1. 3. Show that part (b) of Lemma 11.2.1 holds using the dominated convergence theorem. 4. Establish Lemma 11.2.2 using Lemma 11.2.1. Hint. By (A.15.2), P |fbh (x) − f (x)| ≥ ε ε2 ≤ MSE fbh (x) = Bias fbh (x) 2 + Var fbh (x) . Section 11.7 299 Problems and Complements 5. Show that a symmetric kernel with m2 (K) = 0 will yield a bias term of uniform order h4 for fbh (x) if kf (5) k < ∞ and m5 (K) < ∞. Show that K(t) = 12 (3 − x2 )φ(x) is such a kernel when φ is the N (0, 1) density. Do the same for K(t) = (15/32)(7t4 − 10t2 + 3)1(|t| ≤ 1). 6. Complete the proof of Proposition 11.2.2. 7. Show that the conclusions of Propositions 11.2.2 and Corollary 11.2.1 remain valid if K doesn’t have compact support. 8. Kernels K such that Z ∞ K (j) (x) dx = 0 −∞ for 2 ≤ j ≤ r (11.7.1) are called higher order kernels of order r. Show that K(x) = Hj (x) exp(−x2 /2), where Hj is a Hermite polynomial (Section 5.3), satisfies (11.7.1). 9. By repeating the arguments of Propositions 11.2.1 and 11.2.2, show that if we assume that kf (r+3) k∞ < ∞, mr+3 (K) < ∞, and r is even, then, by using K as in (11.3.7) and 2r bandwidth cn−1/(2r+1) , we achieve the uniform rate of convergence n− 2r+1 of MSE fb(x) to zero. Show that if K is symmetric and r is odd, the result holds with r replaced by r − 1. 10. Assume the conditions of Corollary 11.2.1, that f (x) > 0 and 0 < [f ′′ (x)]2 < ∞. (a) Show that the sum AMSE(fbh (x)) of the first two terms in the expansion (11.2.12) of 1 MSE[fb(x)] is minimized when h = cn− 5 for some c 6= 0. Find the expression for c. Hint. Note that the expression is of the form a(nh)−1 + bh2 . Set the derivative equal to zero and solve for h. Check that this gives the minimum. 1 4 (b) Show that using h = cn− 2 leads to the convergence rate n− 5 for MSE [fbh (x)]. 11. Establish (11.2.18). 12. Establish (11.2.19) and (11.2.20). Verify that when F = N (µ, σ 2 ) and K = U [−1, 1], 1 then fopt = 1.95σn− 5 . See also Chapter 12. 13. Generate a sample of size n = 100 from a chi-square distribution with 5 degrees of freedom. Plot the density of f and the estimate fbh (·) given in (11.2.6) based on 1 (a) N (0, 1) kernel and bandwidth 1.06n− 5 s based on a N (0, 1) reference distribution. 1 (b) U [−1, 1] kernel and h = 1.95n− 5 s based on a N (0, 1) reference disribution. 14. Smooth distribution function estimates. When F is a continuous distribution function, we may want a continuous estimate of F . Consider the two estimates (see Problem 11.2.1). Z x Z b Z x fbh (t)dt = Kh (t − z)dFb (z)dt Fbh (x) = −∞ −∞ a 300 nonparametric inference for functions of one variable Fb(x; h) = Z x −∞ fb(t; h)dt = Z x −∞ where K(t − z; h) = Kh (t − z)/wh (t), wh (t) = Z a Rb b Chapter 11 K(t − z; h)dFb (z)dt Kh (t − z)dz. Suppose S(f ) = [a, b]. Rx Pn h (a) Let L(x) = −∞ K(t)dt. Show that Fbh (x) = n−1 i=1 L (x − Xi )/h − L (a − i Xi )/h . a (b) When K = U[−1, 1] and a and b are finite, find Fb (x; h) for xε[a + h, b − h], x = a + λh and x = b − λh, 0 < λ < 1. (c) Show that E fbh (x) = L (x − b)/h + Z (x−a)/h (x+b)/h F (x − uh)K(u)du Z h − L (a − b)/h) + 0 (a−b)/h Hint. Use integration by parts: R LdF = LF − R i F (a − uh)K(u)du . F dL. (d) When K = U[−1, 1] and a and b are finite, find E Fb(x; h) for x as in (b) preceding. Compare your answer to E Fbh (x) with K = U[−1, 1]. (e) Find VarFbh (x). Show that we can choose h = hn so that MSE Fbh (x) = O(n−1 ). What smoothness conditions are needed on F ? What conditions are needed on K? Hint. Use (c) preceding. (f) A good choice for L is the logistic df L0 . For K = L0 , write a program that computes Fbh (x), Fb (x; h), and Fb (x) and plot the graphs for a sample of size n = 20 from F (x) = 1 − e−x , x > 0. Use h = 0.5s. Repeat with h = 0.25s and h = s. Include the true f (x) on the figures. Hint. See Remark 11.2.2. Problems for Section 11.3 1. Suppose that K has support [−1, 1]. Under the conditions of Proposition 11.2.1, show that forfbh (x) defined by (11.2.6), if S(f ) = [a, b] with a and b finite and x = a + λh, 0 < λ < 1, then 1 E fbh (x) = µ0,λ (K)f (x) − hµ1,λ (K)f ′ (x) + h2 f ′′ (x)µ2,λ (K) + O(h3 ) 2 Rλ j where µj,λ (K) = −1 u K(u)du. Section 11.7 301 Problems and Complements 2. Suppose K(·) is a symmetric density on [−1, 1]. Assume the conditions of Problem 11.3.1. Show that the estimate fbh (x) based on (11.3.2) has bias O(h) for the boundary points a + λh and b − λh, 0 < λ < 1. 3. Establish (11.3.2). 4. Let fbh (x) be defined by (11.2.6). Suppose f is the U[0, 1] density, 0 < h < h ≤ x ≤ 1 − h. When K = U[−1, 1] find 1 2, and (a) MSE[fbh (x)] and give the rate at which MSE[fbh (x)] tends to zero as n → ∞. (b) hopt which minimize MSE[fbh (x)]. (c) Solve (a) and (b) above when K(u) = 34 (1 − u2 )1(|u| ≤ 1). (d) Solve (c) when x = λh, 0 < λ < 1. 5. In Example 11.3.1, (see (11.3.4)) set mj = mj (x, h), j = 0, 1, 2, and Z b sj = (z − x)j Kh (z − x)dF (z), j = 0, 1. a (a) Show that the solution to (11.3.4) is s0 h2 m2 − s1 hm1 1 . (α, β)T = h2 (m0 m2 − m21 ) s1 m0 − s0 hm0 (b) Use (a) preceding to show that an estimate of f ′ (x) is Z b f ′ (x, Fb ) ≡ β(x, Fb) = Kh (x, z)dFb (z) a where Kh (x, z) = m0 (x, h)[(z − x)/h − 1]/h Kh (z − x) . m0 (x, h)m2 (x, h) − m21 (x, h) (c) Establish (11.3.8) and (11.3.9). Hint: For x ∈ [a + h, b − h], by (11.3.5), this follows from Propositions 11.2.1 and 11.2.2. For xh = a + λh, 0 < λ < 1, follow the proofs of these propositions. 6. Local polynomial estimates. Suppose that for z close to x, in Example 11.2.1 continued, we use a local polynomial approximation f (z − x; β) ∼ = β0 + p X j=1 βj (z − x)j . That is, we approximate f (x) by fh (x) = f (0; β(x)) = β0 (x) where Z β(x) = β(F ; x) = arg min [f (z) − f (z − x; β)]2 Kh (z − x)dz. β (11.7.2) 302 nonparametric inference for functions of one variable Chapter 11 (a) Show that β(x) = β(F ; x) is the functional β(F ; x) = H −1 C −1 a(F ) where H = diag(1, h, . . . , hp ), C = (sjk ),a(F ) = (a0 (F ), . . . , ap (F ))T , Z (b−x)/h [1−(j+k)] sjk = h uj+k K(u)du, j, k = 0, 1, . . . , p, (x−a)/h aj (F ) = Z z−x h j Kh (z − x)dF (z), j = 0, . . . , p. Hint. This β(x) satisfies the linear equations, for k = 0, . . . , p, Z Z p X k βj (z − x)j+k Kh (z − x)dz. (z − x) Kh (z − x)dF (z) = j=0 Our approximation to f (x) is the functional β0 (F ; x) with plug-in estimate fbh (x) = β0 (Fb; x). (b) Show that if f (x) is a polynomial of order p and if fbh (x) is the empirical plug-in estimate of f (x) obtained from the local polynomial approximation to f (x) of order p, then fbh (x) is an unbiased estimate of f (x). If f (x) has a continuous (p + 2)th derivative at the point x in its support, if p is odd, and if K is continuous and symmetric with support [−1, 1], then it can be shown (Jiang and Doksum (2003)) that this fbh (x) converges in probability to f (x) at the rate OP (n−a ) with a = (p + 1)/(2p + 3) provided the optimal (the one minimizing asymptotic MSE) −1 h ≍ (n(−2p+3) ) is employed. This holds for boundary as well as interior points. Note that if f (x) is infinitely differentiable, the rate of convergence of fbh (x) to f (x) can be 1 made arbitrarily close to the regular rate n− 2 by choosing a polynomial approximation with p large. Suppose we assume that f (x) has a continuous (p+ 2)th derivative. Are there estimates that achieve better rates of convergence than OP n−(p+1)/(2p+1) ? The answer is no; see e.g. van der Vaart (1998). R 7. Show that α(x; F ) as given in Remark 11.3.1 reduces to fh (x; F ) = Kh (x, z)dF (z) where Kh (·, ·) is defined by (11.3.7). R (z−x)Kh (z−x)dF (z) , Hint. E[(Z − x)f (Z)|X = x) = m0 (x;F ) R (z − x)Kh (z − x)dz hm1 (x, h) E(Z − x|X = x) = = , etc. m0 (x; h) m0 (x, h) R 8. In Example 11.3.2, show that Km (x, z) dx = 1 when B0 is a constant and a and b are finite. 9. Let X1 , . . . , X200 be a sample from a beta (2,3) distribution. Consider the grid {0, 0.02, . . . , 0.98, 1}. Section 11.7 Problems and Complements 303 (a) Plot the locally linear density estimate (11.3.6) using the algorithm in Remark 11.3.2 using h = 0.1. Also plot the beta (2,3) density on the same figure. (b) For comparison, repeat (a) with (11.3.6) replaced by (11.2.6). Problems for Section 11.4 1. Consider a sample X1 , . . . , Xn from f2 (x, η), x ∈ R, as given in (11.4.3) with h(x) ≡ 1, B1 (x) = x and B2 (x) = x2 − 1, the first two Hermite polynomials. b of η. (a) Give the asymptotic covariance matrix of the MLE η (b) Suppose we estimate η using the Rudemo criteria. We obtain Z Z b R = arg min −2 f2 (x, η)dPb (x) + f22 (x, η)dx . η bR . Give the asymptotic covariace matrix of η (c) Compare your answers to (a) and (b). Because the MLE of a function of a parameter is the function of the MLE of this parameter, this comparison has implications for the comparison of f2 (x, ηb) and f2 (x, ηbR ). Give this implication. Hint. f2 (x, η) is a normal density. Remark. If X1 , . . . , Xn is from f (·) 6= f2 (·, η), it could be that f (x, ηbR ) has smaller IMSE than f (x, ηb). 2. Show that if f (x) > 0, x ∈ R, then for the nearest neighbor estimate (11.4.6), p lim|x|→∞ fek (x) > 0. Hint. 2b h(k, x) is the difference X(l) − X(j) of two order statistics where (X(j) , X(l) ] contains k order statistics. Let r = [nF (x)]; then l ≤ r + 2k, j ≥ r − 2k, (l/n) → F (x), and (j/n) → F (x). 3. Suppose S(f ) = [a, b] with a and b finite. Show that lim|x|→∞ fek (x) = 0. Hint. See Problem 11.4.2. To contain k order statistics, the length of the interval (X(j) , X(l) ] must tend to ∞ as |x| → ∞. 4. Let fek (x) be the kth nearest neighbor density estimate. Assume the conditions of Theorem 11.4.3.1. (a) Derive an expression of the form (11.2.12) for M SE[fek (x)]. (b) Compare the solution to (a) preceding to M SE[fek (x)] as given by (11.2.12) for the case K(w) = 1[|u| ≤ 1]/2. (c) Derive an expression of the form (11.2.14) for IM SE(fek , f ). (d) Compare the solution to (c) preceding to IM SE(feh , f ) as given by (11.2.14) for K(u) = 1[|u| ≤ 1]/2. 304 nonparametric inference for functions of one variable Chapter 11 (e) Justify informally the approximation k ∼ = 2nhf (x) for n large. (f) Use (e) to informally find the k that minimizes IMSE (see (11.2.19)). (g) Use (f) to informally find the k that minimizes IMSE for the N (µ, σ 2 ) reference distribution (see Section 11.2.3). Problems for Section 11.5 1. Argue that T as defined by (11.5.3) has an asymptotic N (0, 1) distribution. 2. Let X1 , . . . , X1000 be a sample from a χ25 distribution. Plot fbh and the confidence intervals (11.5.2) at the grid points {0.5, 1.0, 1.5, 2, . . . , 8} using the U (−1, 1) kernel and h = 0.2. Problems for Section 11.6 1. Let Dx (µ(·), µ(·; β)) = 1 fh (x) Z [µ(z) − µ(z; β)]2 Kh (z − x)dF (z) = E{[µ(Z) − µ(Z; β(X))]2 |X = x} (11.7.3) where (X, Z) has joint density f (x)qh (z|x), (x, z) ∈ [a, b], with qh (z|x) = f (z)Kh (z − x)1(x ∈ [a, b]) . fh (x) Show that minimizing (11.7.3) is equivalent to minimizing (11.6.2). 2. Show that if we select µ(z, β) ≡ β ∈ R in (12.6.13), then (12.6.13) yields µ b NW (·). 3. Establish (11.6.4). 4. Establish (11.6.25). 5. In Remark 11.6.1, give a type (11.6.10) expansion of Bias(b µNW (x)|X) when K is a kernel of order 3. 6. In Remark 11.6.2, justify the formula for (a) hoptimal ; (b) the conditional MSE of 4 µ bNW (x); (c) the optimal convergence rate n− 5 for the MSE in (b). 7. Establish (11.6.26). 8. Consider the model Yni = µ(xni ) + εni , where xn1 ≤ · · · ≤ xnn , E(εni ) = 0, Var(εni ) = σ 2 (xni ). Let Ink (x) denote the indices of the k values of xn1 , . . . , xnn closest to x, where ties are broken by inducing only the smallest index. Define the nearest neighbor estimate X Yi /k . µ b(x) = i∈Ink (x) Section 11.7 305 Problems and Complements (a) Show that −2 P (sup |b µ(x) − E µ b(x)| ≤ t) ≥ 1 − 8(kt) x h n−k X 2 σ (xni ) + n X i=1 i=1 i σ 2 (xni ) . Hint: Use Kolmogorov’s inequality. (b) Find bounds on supx |E µ b(x) − µ(x)|. (Bjerve, Doksum and Yandell (1985) bound supx |b µ(x) − µ(x)| in probability.) 9. Nonparametric binary regression. Suppose we want to study the effect of a certain treatment. Consider the following approach. Dosages x1 < . . . < xn are fixed. Dosage xi is administered to the ith member of this sample and a binary response Yi is recorded; e.g., Yi = 0 if the treatment does not have a beneficial effect, Yi = 1 if the treatment has a beneficial effect. Assume that the Yi are independent and that Yi has the Bernoulli distribution with success probability θ(xi ), i = 1, . . . , n. Let x be some fixed dosage level. Consider the following nearest neighbor estimate of θ(x): i+k 1 X b θ(x) = Yj , k j=i+1 where x ∈ Ii , i = 0, . . . , n − k 1 I0 = − ∞, (x1 + x1+k ) 2 1 1 Ii = (xi + xi+k ), (xi+1 + xi+k+1 ) , i = 1, . . . , n − k − 1 2 2 1 In−k = (xn−k + xn ), ∞ . 2 Show that Pr sup −∞<x<∞ b − E θ(x) b θ(x) ≤ε ≥1− Hint: Use Kolmogorov’s inequality. n . ε2 k 2 Chapter 12 PREDICTION AND MACHINE LEARNING 12.1 Introduction In this chapter we face the major challenge of modern statistics, devising and analyzing methods of inference for high dimensional complex data. The issues we discuss in this chapter are: 1. Prediction with high dimensional covariates Under this heading we will consider classification, regression and prediction problems now associated with machine learning, and statistical learning. The methods we shall briefly discuss include ones easily related to statistical methods such as the univariate curve estimation in Chapter 11 and the logistic regression in Sections 6.4.3 and 6.5; and ones more closely related to machine learning such as support vector machines and boosting. 2. Regularization A central theme as in the last chapter is regularization. How do we control overcomplexification of models and procedures which both theoretically and practically lead to poor performance? We will discuss major semiparametric and parametric methods such as those based on sieves (Section 9.1.4), shrinkage (Section I.6), and penalized least squares. 3. Dimension reduction and model selection While prediction and classification are the primary goals in many engineering and marketing applications, the goal in science is usually to try to isolate key factors which can be interpreted. Factors with subject matter interpretations, if available, can lead to the construction of better predictors and scientific insights. Isolating key factors is usually identified in statistics with model selection and variable selection — see Section I.7. 4. Nonparametric prediction and classification. The Bayes classifier. We were presented in Section 1.4 with the classical prediction problem: Given a d dimensional vector Z = (Z1 , . . . , Zd )T of covariates, predict the value of a response Y . In that section we showed that for quadratic loss, the best predictor is µ(Z) = E(Y |Z). 307 308 Prediction and Machine Learning Chapter 12 In the “known P ” case, the joint distribution of (Z, Y ) is known and µ can be computed. In practice, when P is unknown, we can construct an estimate µ bn (·) based on a sample (Z1 , Y1 ), . . . , (Zn , Yn ) i.i.d. as (Z, Y ) ∼ P . When k > n and Zk is known but Yk is unknown, we can use µ b(Zk ) to predict Yk from Zk where (Zk , Yk ) ∼ P . In this context, (Z1 , Y1 ), . . . , (Zn , Yn ) is referred to as a training sample. b |Z = z) where E b is We may try to estimate µ(z) ≡ E(Y |Z = z) by µ b(z) ≡ E(Y −1 the expected value under the empirical probability Pb which assign probability n to each observed data vector (zi , yi ), and 0 elsewhere. If Z is discrete with a small number of values, this is a parametric problem and our choice of µ b(·) is reasonable if we observe all possible values of Z. However, if as is typically the case, Z is continuous, then the estimate µ b(z) is undefined for unobserved z. To go further we need to assume some special properties of µ(·), most naturally, smoothness. In some parts of this chapter we will consider generalizations of the problem of estimating a nonparametric regression that we discussed in Section 11.6. The major differences are: i) The Y to be predicted can be categorical so that 0 − 1 loss as discussed in Section 1.3 is appropriate. That is, we have a problem of classification where Y can take on values 1, . . . , C corresponding to C distinct categories. Throughout we consider the loss function l(j, d) = 0 if d = j = 1 otherwise where d is the predicted value (class) and j is the actual value. Then, if pj (z) is the conditional density of Z given Y = j, and the “prior” probability is P [Y = j] = πj , the conditional Bayes risk of a classifier d given Z = z is E l(Y, d)|z = 1 − P (Y = d|z) = 1 − πd pd (z)/q(z) where q(z) is the density of Z. Thus the conditional Bayes risk is minimized by δB (Z) = arg max P [Y = j|Z] j = arg max πj pj (Z) . (12.1.1) j This δB is called the Bayes classifier. We also see that the minimum Bayes risk is RB ≡ 1 − E max{πj pj (Z)}/q(Z) . (12.1.2) j Since P [Y = j|Z] = E l(Y, d)|Z , we see that, in some ways, the classification problem can be viewed as the same as estimating the regression of 1(Y = j) on Z from a sample, where a sample (Z1 , Y1 ), . . . , (Zn , Yn ) is i.i.d. as (Z, Y ) ∼ P with Y ∈ {1, . . . , C}. The sample is used to construct a method that can be used to classify Yk from Zk for k > n, where (Zk , Yk ) ∼ P . This version of classification as estimation is somewhat misleading in some frameworks, as we shall see in Section 12.2.4 when we consider boosting. Section 12.1 Introduction 309 ii) The predictor Z is potentially very high dimensional. The coordinates Z1 , . . . , Zd can be categorical or ordinal as well as real but what matters is that there are a lot of them. The high dimensional case is important because of new technologies that generate vast amounts of data. It is worth noting that high dimensionality is implicit in the case of one dimensional regression as well. As is well known in prediction, covariates can always be added to a regression. Thus, if Z ∈ [0, 1], say, then E(Y |Z) = E(Y |Z, Z 2 , . . . , Z p ), and estimation of the second form leads naturally to linear regression of Y on the p + 1 vector (1, Z, . . . , Z p )T , a method we considered in Section 11.6.2, for smooth µ(Z). Similarly, if Z takes on a large number of known categories, the categories can, for instance, be coded by e1 , . . . , ed , the standard basis vectors of Rd . That is, e1 = (1, 0, 0, . . . , 0)T , e2 = (0, 1, 0, . . . , 0)T ,. . . , ek = (0, 0, . . . , 0, 1). Now if Cj denotes the jth category, E(Y |Z ∈ Cj ) = E(Y |ej ). iii) In addition to estimation of the regression E(Y |Z), the prediction of Y is of major concern. In particular, prediction error plays a key role in model and tuning parameter selection based on cross validation. ✷ Remark 12.1.1. Prediction versus estimation. This chapter examines ways of constructing good predictors; however the discussion is sometimes in terms of estimation. This is because the problem of selecting a good predictor is sometimes equivalent to selecting b (·) is based on X = a good estimator. For squared error loss this is clear because if µ (Z1 , Y1 ), . . . , (Zn , Yn ) and we are trying to predict Yn+1 from Zn+1 , where (Zn+1 , Yn+1 ) is independent of X, then, for µ(Z) = E(Y |Z), b (Zn+1 )]2 Prediction MSE = E [Yn+1 − µ b (Zn+1 )]2 = E [Yn+1 − µ(Zn+1 )]2 + E [µ(Zn+1 ) − µ = constant + estimation IMSE because the cross product term is zero by the iterated expectation theorem. Thus optimal prediction is equivalent to optimal estimation in this case. (Here the constant does not b , but it contributes a substantial amount to the prediction MSE depend on the procedures µ of any predictor. Thus while estimation IMSE typically tends to zero, prediction MSE does not.) ✷ The formulation given so far is the center of our discussion in this chapter which examines most of the major approaches to regularization, dimension reduction, model selection and to the construction of prediction rules, regression procedures, and classifiers. We next add more introductory details. 12.1.1 Statistical Approaches to Modeling and Analyzing Multidimensional data. Sieves As in Section 12.1, let (Z1 , Y1 ), . . . , (Zn , Yn ) be i.i.d. as (Z, Y ) ∼ P with Z = (Z1 , . . ., Zd )T and Y ∈ R. Consider the problems of estimating µ(z) = E(Y |Z = z), predicting Y from Z, or using Z to classify Y . Two methods are: 310 Prediction and Machine Learning Chapter 12 a) Generalizations of kernel regression methods. These methods for the estimation of E(Y |z) use product kernels to replace the univariate ones of Section 11.6.2. See Section 12.2.1. b) Regularization methods based on sieves. We pursue the regularization ideas introduced in Section 9.1.4. Thus for P in a general class of models, we approximate µ(Z) = EP (Y |Z) by postulating a sequence of parametric models {Pj } for (Z, Y ) with members Pj of Pj identified by a kj dimensional parameter θ(j) (P ) which is defined for all P . We assume that each Pj is regular as defined in Section 1.1.3. Define µj (Z) = EPj (Z). We require that for any P in the nonparametric class we consider, as j → ∞, µj (Z) → µ(Z) , where “→” denotes convergence in an appropriate metric. The parameter θ(j) (P ) is defined by minimum contrast; e.g. if pj denotes the density of Pj , we could use the KullbackLeibler contrast and set Z θ (j) (P ) = arg min log pj (z, y))dP (z, y) . θ(j) We can then generate a sequence {b µj (·)} of estimates or predictors with µ bj (·) ≡ EPbj (Y |Z) (j) b b b and Pj ∈ Pj corresponding to θ (P ), where P denotes the empirical distribution function based on (Z1 , Y1 ), . . . , (Zn , Yn ). Here regularization is achieved by using a sieve made up of regular parametric models. The index j is called a regularization or tuning parameter and is chosen to minimize an estimate of risk. We postpone the issue of choosing b j ≡ j(Pb ) to obtain our final estimate µ bbj (·). See Section 12.5. There is a huge literature on regularization; see Bickel and Li (2006), for discussions and a bibliography. We now turn to several special cases of sieves. (1) Gaussian Sieves A natural special case, if, say, Z ∈ I d where I ≡ [0, 1], is to choose an orthonormal basis {fk (·)}k≥1 with f1 ≡ 1, Z fa (z)fb (z)dz = 1(a = b) Id and postulate as model Pj with dimension kj Y = kj X (j) θk fk (Z) + ε (12.1.3) k=1 (j) (j) where ε ∼ N (0, σ 2 ) independent of Z, and θ(j) = (θ1 , . . . , θkj )T is defined as the minimizer of mean squared prediction error as in Theorem 1.4.4. (2) Penalized Least Squares and Ridge Regression Section 12.1 311 Introduction Having chosen {fk } as in (12.1.3), it is natural to consider a Lagrangian version of the method of sieves which simplifies computation and makes the need for regularization explicit. Choose a function J(θ), J : R × Rkj → R+ and λ > 0; and in model Pj minimize, over θ(j) , k j n X 2 1X (j) θk fk (Zi ) + λJ(θ (j) ) . Yi − n i=1 k=1 (j) The population parameter θλ (P ) in model Pj is defined by (j) θ λ (P ) = arg min Z (y − [θ(j) ]T f (j) (z))2 dP (z, y) + λJ(θ (j) ) : θ(j) , T where f (j) (·) = f1 (·), . . . , fkj (·) and P (·, ·) is the probability distribution of (Z, Y ). b(j) = θ(j) (Pb). Note that this approach still falls The empirical plug-in estimate is now θ λ λ under our general formulation of empirical plug-in estimation. A choice we pursue further is ridge regression which corresponds to kj J(θ (j) 1 X (j) 2 1 )= [θk ] = |θ(j) |2 . 2 2 k=1 Another interesting choice is J(θ (j) )= kj X k=1 (j) |θi | which corresponds to the Lasso. (3) Bayes and Penalized Least Squares Ridge regression has a natural Bayesian interpretation. Consider the sequence of models {Pj } as a dense subset of the grand model, Y = ∞ X θk fk (Z) + ε (12.1.4) k=1 where ε ∼ N (0, σ 2 ). Now assume that θ1 , . . . , θj are i.i.d. N (0, 1/λ), θk = 0, k > j. b(j) is just the mode of the posterior Then, for fixed j and λ, the ridge regression estimate θ λ distribution of θ(j) given the data. This formulation suggests that other choices of prior distribution might also give interesting results. A class of possible penalties corresponding to priors in this way is Jr (θ) = j X k=1 |θk |r , 0<r<∞, 312 Prediction and Machine Learning Chapter 12 with J0 (θ) = j X k=1 1(|θk | 6= 0) defined by a limiting process as r → 0 and not corresponding to a proper prior. Among Jr (θ), r > 0, the penalty J1 (θ) which corresponds to θ1 , . . . , θj i.i.d. with prior π(θ) = 1 −|θ| , the Laplace or double exponential distribution, has particularly interesting proper2e ties. See Section 12.2.3. (4) Logistic Sieves. Linear Classifiers Another important sieve suitable for classification problems is based on logistic regression. Here Y takes on the values 1, . . . , C. The functions to be estimated in the Bayes classifier (12.1.1) are πa pa (Z), a = 1, . . . , C, where πa = P (Y = a) and pa (Z) is the density P of Z given Y = a. Of these, if we have a training sample, πa is estimated by n−1 ni=1 1(Yi = a), the fraction of class a in the training sample, while estimating pa (Z) is the problem of estimating a general multivariate density from {Zi : Yi = a}. Alternatively we can think of, from the beginning, estimating π(1|Z), . . . , π(C|Z) , where C X π(a|Z) ≡ P [Y = a|Z] = πa pa (Z) πk pk (Z) , k=1 using a generalized linear logistic regression model (see Sections 1.6.2, 6.4.3, 6.5 and Hastie, Tibshirani and Friedman (2001, 2009)). That is, given functions f1 (Z), . . . , fkj (Z) (j) and 1 ≤ a ≤ C, specify conditionally, with π ≡ (π1 , . . . , πC ), θ ≡ {θa,k : 1 ≤ k ≤ kj , 1 ≤ a ≤ C}, the jth model as Pkj (j) k=1 θa,k fk (Z)} Pkj (j) l=1 πl exp{ k=1 θl,k fk (Z)} πa exp{ π(a|Z, θ, π) = PC , (12.1.5) the logistic regression model. In the jth model Pj , the classifier using training data based on (12.1.5) classifies an unknown Y corresponding to covariate vector Z by b = a if δ(Z, θ) kj X k=1 (j) (l) (θba,k − θbl,k )fk (Z) ≥ log π bl /b πa (12.1.6) b and π b are obtained by maximum likelihood for all l, with ties broken arbitrarily. Here θ from εia L(θ, π) = Πni=1 ΠC (a|Zi , θ, π) , a=1 π P b is the MLE for a canonical exponenwhere εia = 1(Yi = a). Thus π ba = i εia /n and θ tial family — see Example 1.6.7 and Section 2.3, for instance. The classifier (12.1.6) is an example of a linear classifier, that is, one such that log π b(a|Z)/b π (l|Z) is a linear combination of functions fk (Z), 1 ≤ k ≤ kj . Geometrically, we can think of such classifiers as Section 12.1 313 Introduction T classifying a new point Y with given Z in class l if f1 (Z), . . . , fk (Z) is on one side of a hyperplane in Rk and “not l” if it is on the other side. We have already seen one example of such a classifier, Fisher’s linear discriminant function in Section 4.2. 12.1.2 Machine Learning Approaches The machine learning point of view as discussed in Section 11.4.2 is to specify a parametric family D of possible decision procedures, based on predictors in our case, and select a member by some optimization criteria which may or may not include a penalty for complexity (overfitting). The focus is, in particular, on classification. a) Neural Nets The motivation behind neural nets is a primitive notion of how human stimulus response learning proceeds. Abstracted it is relatively simple. If Zd×1 is the vector of predictors, let the classes that a corresponding Y can fall in be {1, . . . , C}. By definition, a neural net with one hidden layer of M interior nodes, based on {(Z1 , Y1 ), . . . , (Zn , Yn )}, outputs a PC b (k|Z) = 1}, of estimated posterior probabilities vector, {0 ≤ π b (j|Z) : 1 ≤ j ≤ C, k=1 π j, 1 ≤ j ≤ n, satisfying log π b(j|Z) b −β b )T σ(AZ b ∗) . = (β j 1 π b (1|Z) (12.1.7) b are (M + 1) × 1, A b is (M + 1) × (d + 1), Z∗ = (1, ZT )T , and σ(ω) = Here the β j T σ(ω1 ), . . . , σ(ωM+1 ) , where σ(ω) = (1 + e−ω )−1 and ω T = (ω1 , . . . , ωM+1 ). Essentially the classifier is obtained by a fitting process from all classifiers based on arg max π(j|Z, A, B) j where A is (M + 1) × (d + 1), B = (β 1 , . . . , βC ) is (M + 1) × C, and π(j|Z, A, B) is given by (12.1.7) with the “hats” removed. Neural nets essentially use logistic regression applied to a nonlinear coordinatewise transformation of an affine transformation of Z into RM+1 . The second optimization problem is not convex. The tuning parameter here is M , the number of nodes in the hidden layer parametrized by A. It may be shown that as M → ∞ the functions log π(j|Z, A, B)/π(1|Z, A, B) can approximate any log π(j, Z)/π(1, Z) . Neural nets may also be applied to the estimation of the regression mean µ(z) = E(Y |Z = z). See Problem 2.3.41. We do not pursue this method. A full discussion may be found, for instance, in Ripley (1996) and Hastie et al (2001, 2009). b) Support Vector Machines This method introduced by Vapnik (1996) is limited to classification. We discuss it initially for the case of C = 2 classes here denoted by “−1” and “1”. These are, in fact, linear 314 Prediction and Machine Learning Chapter 12 classifiers, but as with neural nets, follow a nonlinear transformation of Z from Rd to a typically much higher dimensional space RM . As we remarked earlier, given any Z, we can always choose an orthonormal basis in L2 (P ), {fk (Z), k ≥ 1}, approximating any L2 (P ) function g of Z in the L2 sense. That is, we can formally write g(Z) = ∞ X ak (g)fk (Z) (12.1.8) k=1 where ak (g) = Eg(Z)fk (Z). Now consider a two class {−1, 1} classification problem, where the Bayes rule is δ(Z) = 21 π(Z) ≥ 21 − 1 with π(Z) ≡ P [Y = 1|Z|]. Then, if we represent log π(·) as in (12.1.8), P the Bayes rule is a linear classifier in R∞ determined by the hyperplane {(a1 , a2 , . . .) : k ak fk (Z) = log 12 }. All points on one side of the hyperplane are classified as Y = 1 and the others as Y = −1. In fact, it is possible to show (Problem 12.1.2) that if we map Z1 , . . . , Zn , Zj ∈ Rd , in a 1–1 fashion to RM , M >> max(d, n), and if Z∗i are the RM images of the Zi , then {Z∗i : Yi = 1} and {Z∗i : Yi = −1} can be separated by a hyperplane. That is, we can construct a linear classifier that classifies the Y ’s in a training sample (Z1 , Y1 ), . . . , (Zn , Yn ), 1 ≤ i ≤ n, perfectly. There are evidently many such hyperplanes if one exists. Vapnik (1996) proposed using that separating hyperplane which maximizes the minimum distance between points of the training sample and the hyperplane. This problem it turns out can be cast as one in quadratic programming and is then rapidly and uniquely solvable. The original form of support vector machines, for C = 2, is valid only in a context where there is a perfect separation of the two classes in the training sample. However, the introduction of suitable slack variables in the program permit regularization, i.e. lead to the choice of a hyperplane which permits a specified minimal number of misclassification of the training sample. We shall go further into the theory of this linear classifier in Section 12.2.4. c) Boosting Schapire (1990), Freund and Schapire (1997), and others proposed a method of classifying using larger and larger linear combinations of so called weak classifiers. This method has since been identified (Friedman, Hastie and Tibshirani (2000)) as an algorithm of a known type (Gauss-Southwell) for minimizing a data-based convex objective function. See Section 12.2.4. d) Tree-based Methods These are hybrid methods introduced independently by Breiman, Olshen, Friedman, Stone (1984), Quinlan (1987), and earlier by Morgan and Sonquist (1963). They may be viewed as constructing iteratively a partition of the covariate space Rd into blocks and then, for regression, estimating conditional expectations of Y in each block or, for classification, estimating the conditional probabilities of belonging to class j ∈ {1, . . . , C} given that Z belongs in a block. Again we shall discuss this method in Section 12.2.4. Section 12.2 12.1.3 Classification and Prediction 315 Outline Here is the rest of our outline for this chapter. In Section 12.2. we describe methods mentioned in Section 12.1 in more detail and start to give their properties. In Sections 12.3 and 12.4 we shall focus on asymptotic theory both from a statistical (minimax theorems) and machine learning (oracle inequalities) point of view and focus on the critical implication that “sparsity” in some sense is needed for any statistical successes when the covariate space Rd is high dimensional. Sparsity in this context means roughly that either the population regression function depends on a small subset {Zj .j ∈ S} of covariates, or that the function takes on a simple form, e.g. is a sum of functions of one variable Zj only. In other words, the regression fuction is much less complex than a general function on Rd . In Section 12.5 we introduce cross validation as a tool for tuning parameter choice and in Section 12.6, we will discuss model selection and dimension reduction. In this context we shall discuss Bayesian model selection and principal component analysis. Finally, Section 12.7 discusses multiple testing and has pointers to work not covered in this book. Summary. In this section, we have introduced the prediction framework both in the context of classification and regression, and i) Statistical approaches to these problems, in particular, sieves, penalization, and empirical Bayes methods. ii) Machine learning approaches to these problems, in particular, support vector machines, boosting, and tree-based methods. 12.2 Classification and Prediction In this section, we give more detailed descriptions and properties of the approaches outlined in Section 12.1. However, first we introduce, in Section 12.2.1, tools to be used for classification and prediction. These include a multivariate density estimate that is used in classification procedures and a nonparametric statistical method based on multivariate kernel regression that can be used to predict a response Y from a vector x of observed covariate values. In Sections 12.2.2, 12.2.3, and 12.2.4 we describe classification procedures based on nonparametric methods, sieves, and machine learning approaches. 12.2.1 Multivariate Density and Regression Estimation The statistical methods in Chapter 11 for univariate nonparametric estimation will be generalized straightforwardly in this section to the multivariate case with continuous X. (a) Density Estimates Let X1 , . . . , Xn be i.i.d. as X ∼ F , where X ∈ Rd . We assume that X has a density f (x) with convex support S(f ) equal to the closure of {x : f (x) > 0}. All the methods we have considered for estimating f when d = 1 generalize naturally. Here are two of the generalizations. 316 Prediction and Machine Learning Chapter 12 (1) Convolution kernel estimates Let K(u) be a multivariate kernel, that is, a map Rd → R with Z K(u)du = 1. Rd A kernel is said to be of order r if Z Rd d Y j=1 i ujj K(u)du = 0, for all ij with 1 ≤ i1 + · · · id ≤ r. For simplicity, we consider a product kernel K(u) = d Y Kj (uj ), j=1 where each Kj is a 1 dimensional kernel. A product kernel is of order r iff each of its components is. Since scale may vary by coordinate, we specify a vector of bandwidths h = (h1 , . . . , hd )T . Thus, as for d = 1, we define Kh (u) = d Y h−1 j Kj (uj /hj ). j=1 Often Kj (u) = K1 (u), j = 1, . . . , d, and K1 (u) is chosen as the normal density ϕ, or if compact support is desired, K1 (u) = 12 1[|u| ≤ 1] or K1 (u) = 34 (1 − u2 )1(|u| ≤ 1). The empirical distribution is Fb(x) = n−1 n X i=1 1[Xi ≤ x] where X ≤ x is componentwise inequality, x ∈ Rd . We define Z fh (x, F ) = Kh (x − z)dF (z) (12.2.1) and the (plug-in) convolution kernel density estimate, fbh (x) ≡ fh (x, Fb ) = Z n 1X Kh (x − Xi ). Kh (x − z)dFb (z) = n i=1 (12.2.2) Bias, variance, MSE, and IMSE results are completely analogous to the d = 1 case; see Silverman (1986), Scott (1992), and the problems. Briefly, let D2 f (x) = ∂ 2 f (x) ∂xi ∂xj , d×d F = {f : D2 f (x) ≤ M all x ∈ Rd }. Section 12.2 317 Classification and Prediction Then, if K has compact support and is of order 1, EF fbh (x) = f (x) + O(|h|2 ) VarF fbh (x) = (n|h|d )−1 f (x) (12.2.3) Z K 2 (u)du(1 + o(1)) (12.2.4) as n → ∞, h → 0; uniformly for f ∈ F . Note that the dimension d inflates R the variance by the factor |h|−d . If Kj ≡ K1 , j = 1, . . . , d, with K1 of order 1, if 0 < |D2 f (x)|2 dx < ∞, and we select h = h(1, . . . , 1)T to minimize the approximate IMSE Z Z Z 1 AIMSEfbh (x) ∼ = ( |t|2 K(t)dt)2 ( |D2 f (x)|2 dx)h4 + ( K 2 (u)du)(nhd )−1 4 (12.2.5) then we obtain the optimal asymptotic rate 1 h ≍ n− d+4 . For this h, the minimum IMSE is of order n−4/(d+4) (Problem 12.2.2). Remark 12.2.1. For d large, say d ≥ 4, the rate of convergence of fbh (x) to f (x) is no better than n−1/4 . Thus if it takes 100 observations to achieve desired precision in a parametric setting this suggests it will take at least 10,000 in this high dimensional nonparametric setting. We can do no better with any other method (van der Vaart (1998), Section 24.3) unless we use stronger regularity conditions. As in the one dimensional case, the optimal rates become better if one assumes f has bounded derivatives of order p > 2, but there is no data-dependent way of checking such assumptions. This gloomy picture is relieved by the observation that data distributions tend to stay close to low dimensional manifolds and thus the worst case analyses (nonparametric for d large) are much too conservative. Cross validation methods to estimate h are given in Section 12.5 and the reference distribution approach is given in Problem 12.2.2. As in the one dimensional case, we can eliminate difficulties in estimation at the boundary of f if S(f ) is a convex set by using locally polynomial estimates or more generally plugging into the minimizer of Dh,x (f (·), f (·, β)) defined as in Example 11.3.1, with the univariate integral over (a, b) replaced by integrating over S(f ) ⊂ Rd , and Kh now a multivariate kernel. Details are given in Problem 12.2.3. (2) Multivariate nearest neighbor estimates Let K(u) = 1(|u| ≤ 1)/V (d) where V (d) is the volume of the unit sphere in Rd (not a product kernel) and set u Kh (u) = h−d K . h Then, if fbh (x) is defined by (12.2.2), fbh (x) = Fb (z : |z − x| ≤ h)/V (d)hd . 318 Prediction and Machine Learning Chapter 12 If we take h = b h(k, x) to be the smallest h such that there are k sample points xi in the set {z : |z − x| ≤ h} then we obtain the kth nearest neighbour estimate fek (x) = k V (d)[b h(k, x)]d = k V (d)[|x − X|(k) ]d (12.2.6) with |x − X|(1) < · · · < |x − X|(n) denoting |x − Xi |, 1 ≤ i ≤ n, ordered. (b) Kernel Regression Estimates We introduce a nonparametric product kernel estimate of E(Y |X = x) and give mean squared error properties of this estimate. Suppose the observations (Xi , Yi ), i = 1, . . . , n are, i.i.d. as (X, Y ), where Xi ≡ (Xi1 , . . . , Xid )T , Yi ∈ R, with the Xi having continuous case density f and loss is mean square error. We assume that S 0 ≡ {x : f (x) > 0}, the interior of the support Sf of f (·) is an ellipse or a rectangle with possibly one or more infinite boundaries. We want to estimate µ(x) = E(Y |X = x), where L(Y |X = x) is defined to be pointmass at zero for x 6∈ S 0 , so Rthat µ(x) = 0, x 6∈ S 0 . Suppose K : Rd → R is a kernel, that is, a function such that Rd K(t)dt = 1. Then, define Kh (t) ≡ (h1 h2 . . . hd )−1 K t1 td ,..., , h1 hd with the tj , hj being the coordinates of t, h, where hj > 0. Evidently Kh is also a kernel. Assume that K is symmetric, that is K(−t) = K(t). Then fbh (x) = n−1 n X i=1 Kh (Xi − x) is the multivariate convolution kernel density estimate of f (x) of part (a) of this section. The generalized multivariate Nadaraya-Watson estimate of the nonparametric regression P function µ(x) ≡ E(Y |X = x) is gbh (x)/fbh (x) where b gh (x) = n−1 Yi Kh (Xi − x), that is, Pn Yi Kh (Xi − x) , (12.2.7) µ b NW (x) = Pi=1 n i=1 Kh (Xi − x) where (0/0) = 0. In the asymptotic analysis, we assume the product kernel K(t) ≡ Πdj=1 K0 (tj ) , hj = h, 1 ≤ j ≤ d, where K0 is a univariate, non-negative, symmetric kernel. We will develop asymptotic results for the multivariate µ bNW : Let S be a compact set contained in S 0 = {x : f (x) > 0}. We select S so that it has a “simple” and “smooth” boundary which we for simplicity take to be an ellipse. Let |·| be the Euclidean norm, let G denote the distribution of (X, Y ), and let σ 2 (x) = Var(Y |X = x). Moreover, let M > 0 and ε > 0 be generic constants. We will use the assumptions: A1 : µ(x) = EG (Y |x) is Lipschitz; |µ(x) − µ(z)| ≤ M |x − z|, all x, z ∈ S, and µ(x) ≤ M < ∞ all x ∈ S. Section 12.2 Classification and Prediction 319 A2 : f (x) is Lipschitz; |f (x) − f (z)| ≤ M |x − z| all x, z ∈ S, and f (x) ≤ M , all x ∈ S. A3 : K0 ≥ 0 has compact support, is continuous, symmetric, and is bounded above; K0 (t) ≤ M for all t ∈ R. A4 : f (x) is bounded below on S; f (x) ≥ ε all x ∈ S. A5 : The conditional distribution of Y |X = x has a bounded moment generating function; EG {etY |X = x} ≤ M for |t| ≤ δ < ∞, some δ > 0, all x ∈ S. Let G be the class of G where A1 , . . . , A5 hold uniformly in G and x. Note that by A5 , b NW (X) 1(X ∈ S) , then σ 2 (x) ≤ M < ∞ for all x. Let IMSE (b µ NW ) = E M SE µ rates of convergence of the Nadaraya-Watson estimates are Theorem 12.2.1. Under assumptions A1 , . . . , A5 , 2 b NW (x) = O n− 2+d unif ormly f or x ∈ S and G ∈ G. (i) inf h>0 MSE µ 2 (ii) inf h>0 IMSE(b µ NW ) = O n− 2+d unif ormly f or G ∈ G, f or S with smooth boundaries. (iii) The minimizers in (i) and (ii) are of the form h = cn−1/(d+2) . Proof. The proof is in Appendix D.7. Remark 12.2.2. (1) Assumptions A1 , A2 , and A3 can be considerably weakened. Moreover A4 can be replaced by a condition on how quickly f (x) → 0 as x → δS and A5 can be replaced by moment conditions. The technical cost of weakening the assumptions does not seem worthwhile. (2) If G is defined by Lipschitz assumptions on partial derivatives of µ or more generally, µ(·) lies in a Sobolev ball, the rate n−2/(2+d) can be replaced by n−2s/(2s+d) , where s is some measure of smoothness determined by G provided that we use local polynomial regression estimates of a sufficiently high order, such as linear, quadratic, etc. (3) For other results of this type see Stone (1984), Ruppert and Wand (1994), Fan and Gijbels (1996), Jiang and Doksum (2003), and Problem 12.2.4. ✷ Remark 12.2.3. (1) As in the case of density estimation (Remark 12.2.1), we face slow convergence rates unless the distribution of X concentrates near a low dimesional manifold (sparsity). (2) We also face the choice of bandwidth h, or number of nearest neighbours k, if we use a kernel with support [−1, 1]d and use the h that yields a fixed number k of xi ’s in [x − h, x + h]. The asymptotically optimal choice of h given in Problem 12.2.4 depends on an unknown constant c. Prediction suggests the use of cross validatory choice, that is, fitting of the rule using a fraction of the sample and selection of the regularization parameter h by optimizing prediction or classification performance on the rest of the sample. We shall discuss these issues further in Section 12.5. 320 Prediction and Machine Learning 12.2.2 Chapter 12 Bayes Rule and Nonparametric Classification We begin with a result relating the Bayes classification rule to nonparametric density estimation. We observe a training sample (X1 , I1 ), . . . , (Xn , In ) i.i.d. as (X, I) ∼ P , where I ∈ {1, . . . , C}, X ∈ X , P ∈ P. Let (Xn+1 , In+1 ) ∼ (X, I) independent of the other (Xi , Ii ). On the basis of Xn+1 we are to predict In+1 using 0-1 loss: l(I, a) = 0 if a = I, 1 otherwise. Suppose pbj (·), 1 ≤ j ≤ C, are estimates of pj (·), the conditional densities with respect to ν(x) of X given I = j, and π bjn are estimates of πj , 1 ≤ j ≤ C. The Bayes rule is, from (12.1.1), δB (x) = j iff πj pj (x) = max{πk pk (x) : 1 ≤ k ≤ C} , with ties broken arbitrarily if more than one j achieves the max. Let δbn (x : X1 , . . . , Xn ) = j iff π bj pbj (x) = max{b πk pbk (x) : 1 ≤ k ≤ C} (12.2.8) with the same rule for ties. We show that δbn has risk close to the minimum Bayes risk. This minimum Bayes risk is from (12.1.2) Z RB = 1 − max πj pj (x) : 1 ≤ j ≤ C}dν(x) . Let P be a general class of P ’s satisfying the conditions specified in what follows. Theorem 12.2.2. Let δbn be as defined in (12.2.8). Suppose that as n → ∞, Z P P pbj (x) − pj (x) dν(x) −→ 0 and π bj −→ πj , 1 ≤ j ≤ C. Then, as n → ∞, the risk of δbn for 0 − 1 loss satisfies R(P, δbn ) ≡ P [δbn (Xn+1 : X1 , . . . , Xn ) 6= In+1 ] = RB + o(1) . (12.2.9) (12.2.10) If (12.2.9) holds uniformly for P ∈ P, then o(1) in (12.2.10) is also uniform. Proof. For δ(·) : X → {1, . . . , C}, we note the following useful formula, Z max πj pj (x) − πδ(x) pδ(x) (x) dν(x) . R(P, δ) − RB = j Therefore, for b a = arg max π bj pbj (x), Z R(P, δbn ) − RB = E max πj pj (x) − πba pba (x) dν(x) j Z max πj pj (x) − min πj pj (x) 1 ∪C ≤E a=1 Aa dν(x) j j (12.2.11) Section 12.2 321 Classification and Prediction where Aa = {x : π bba pbba (x) > π ba pba (x), πba pba (x) < πa pa (x)}. By hypothesis, P [Aa ] → 0, P since max |b πa pba (X) − πa pa (X)| −→ 0. The result follows from the dominated convera gence theorem (Theorem B.7.5). ✷ Definition 12.2.1. A classifier δbn that satisfies (12.2.10) is called Bayes consistent. We next give examples of rules that satisfy the conditions of Theorem 12.2.2. Example 12.2.1. Kernel and nearest neighbour rules. (a) Kernels. It is evident that we can plug in kernel estimates pbj (·) for pj (·) when X is 1 continuous. For instance, suppose PnX = R, K(u) = 2 1(−1, 1), we use the same bandwidth −1 h for all pbj , and the π bj = n i=1 1(Ii = j) in the sample are unbiased and consistent estimates of the population πj . Then, a reasonable classification rule is π bm pbm (Xn+1 ) for all m δbn (Xn+1 : X1 , . . . , Xn ) = j if pbj (Xn+1 ) > π bj where pbj (·) is the kernel density estimate based on {Xi : Ii = j}. This rule is equivalent to preferring j over m if the ratio of the number Nj (h) of X observations from category j in the training set at distance ≤ h from Xn+1 to the corresponding number Nm (h) from category m is larger than π bm /b πj . (b) Nearest Neighbours. If we use the kth nearest neighbour density estimate of Sections 11.4.3, and 12.2.1(a), and if πj is uniform, πj = C −1 , we order |Xn+1 − Xj | to get |Xn+1 −Xj |(1) < . . . < |Xn+1 −Xj |(n) , the ordered distances between the Xj , 1 ≤ j ≤ n, and Xn+1 . The arg maxj {b pj (xn+1 )} rule leads to (Problem 12.2.5): Predict In+1 = Ibjk , where b jk is defined by |Xn+1 − Xbjk | = |Xn+1 − X|(k) . That is, b jk is the subscript of the kth nearest neighbour to Xn+1 . If k = 1, this leads to the highly plausible rule: Predict In+1 = Ibi , where |Xn+1 − Xbi | = min{|Xn+1 − Xm | : 1 ≤ m ≤ n} . That is, bi is the subscript of the Xi , 1 ≤ i ≤ n, closest to Xn+1 . We call this the nearest neighbour classifier. Corollary 12.2.1. Assume the framework of Example 12.2.2 (a). If hn → 0, nhn → ∞, if C = 2, if pj (·) concentrates on (0,1), if kpj k∞ ≤ M for j = 0, 1, and if δB is the Bayes rule, then P [δbn (Xn+1 : X1 , . . . , Xn ) 6= In+1 ] = P [δB (Xn+1 ) 6= In+1 ] + o(1) . (12.2.12) Moreover, (12.2.12) holds uniformly over F = {(p0 , p1 ) : kp′′j k∞ ≤ M }. Proof. Note that, using Theorem 11.2.1, Z 1 Z 1 2 E |b pj (x) − pj (x)| dx ≤ E (b pj (x) − pj (x))2 dx = O (nh)−1 + O(h4 ) 0 0 322 Prediction and Machine Learning uniformly on F . The result follows from Theorem 12.2.2. Chapter 12 ✷ Remark 12.2.4 (1) The conditions of Corollary 12.2.1 are much too strong. If no uniformity is required, then (12.2.12) holds for kernel density estimates for any f — see Devroye and Lugosi (1996) for instance. (2) If d > 1, the condition nhn → ∞ is replaced by nhdn → ∞. (3) The kth nearest neighbour classifier is consistent under regularity conditions when d = 1 if k = kn → ∞, n−1 kn → 0 as n→ ∞. See Theorem 11.4.1 and Problem 12.2.6. (4) Define a (poor) estimate based on the training sample of the misclassification probability by C X n X b = n−1 1 Ii = j, δbn (Xi : X1 , . . . , Xn ) 6= j . R j=1 i=1 b = 0, a clear underestimate because the Note that the nearest neighbour classifier has R nearest neighbour classifier, which corresponds to a kernel estimate with bandwidth h = OP (n−1 ), has a bandwidth that is too small to give a good density estimate. To obtain Bayes consistency we have seen that we have to take k = kn → ∞, kn /n → 0. On the other hand, if δbn (·) is the 1-nearest neighbour rule and R(P, δbn ) is its classification risk corresponding to the 0 − 1 loss function, then limn R(P, δbn ) ≤2 RB (P ) where RB (P ) is the minimum Bayes risk (Cover and Hart (1967)). Thus, a terrible density estimate can lead to a perfectly reasonable, though in general, suboptimal classification rule. (5) Kernel and kth nearest neighbour methods were first advocated by Fix and Hodges (1951). The latter particularly are still in wide use. (6) Properties of classifiers will be discussed in more detail in Section 12.2.4. 12.2.3 Sieve Methods Logistic regression (LR) As we have discussed one of the most widely used classification method is based on logistic regression. Fitting is particularly easy since it involves just the calculation of the MLE for a canonical exponential family. The issue of regularization appears in this context through the choice of the number of functions fi (Z) appearing in the logistic regression. Of even greater importance is the nature of the sieve determined by the class F = {f1 , f2 , . . .}. If fj (Z) = Zj ∈ R, then logistic regression may be viewed as a competitor to the rule based on Fisher’s linear discriminant analysis (LDA) which is constructed on the assumption that Z has an Nd (µj , Σ) distribution for j = 1, . . . , C. It can be shown (Problem 12.2.8 and 12.2.9) that LDA can never do better than LR asymptotically at least to 0th order, that is converge to the Bayes rule. Section 12.2 323 Classification and Prediction Penalized least squares Recall the framework of Section 12.1.1. Again the choice of sieve is of first importance. Consider {fj (Z) : 1 ≤ j ≤ k} as the kth member of a sieve, that is, Y = FT θ + e , (12.2.13) T where F = f1 (Z), . . . , fk (Z) , θ = (θ1 , . . . , θk )T ∈ Rk . Let ej be i.i.d. N (0, σ 2 ), so that the model for n observations is Yn×1 = Fn×k θ + e . (12.2.14) F ≡ fj (Zi ) , 1 ≤ j ≤ k, 1 ≤ i ≤ n, e ≡ (e1 , . . . , en )T , Y ≡ (Y1 , . . . , Yn )T . Then, penalizing by J(θ) = λ2 |θ|2 is equivalent to minimizing |Y − F θ|2 + λ 2 |θ| 2 = YT Y − 2θT F T Y + θ T (F T F + λ I)θ 2 (12.2.15) where I is the k × k identity matrix. The minimizer is easily seen to be bR = (F T F + λI)−1 F T Y . θ (12.2.16) bR . µ b(Z) = F θ (12.2.17) The minimizer is always uniquely defined if λ > 0 even if rank(F ) < k. The ridge bR leads to the estimate of the regression µ(Z), regression estimator θ Ridge regression was first advanced (Hoerl and Kennard (1970)) for numerical stability purposes in situations where F was ill-conditioned, but has important statistical properties bR can be interpreted as a Bayes estimate. As such, as well. As is noted in Section 12.1.1, θ it shrinks the (unpenalized) least squares predictor bLS (Z) = F [F T F ]−1 F T Y towards the µ mean 0 of the prior distribution Nn 0, F F T /λ for F θ. The choice of regularization parameter λ will be discussed further in Section 12.5. However, it is straightforward to make a connection to Stein shrinkage as in Section I.6. Suppose F T F = I, so that the vectors {fj (Zi ) : i = 1, . . . , n} are orthonormal. Then the MLE of θ is b ≡ F T Y ∼ θ + ek×1 θ where ek+1 ∼ N (0, I) and θ ∼ N (0, I/λ) are independent under our (empirical) Bayesian b ∼ N (0, (λ + 1)I/λ) and the MLE of model. Thus, marginally, θ (1 + λ)−1 = 1 − is −1 \ (1 + λ) d , = 1− b2 + |θ| λ + 1 −1 λ (12.2.18) 324 Prediction and Machine Learning Chapter 12 b 2 /d is replaced by |θ| b 2 /(d − 2). which is almost the Stein positive part estimate in which |θ| It can also be interpreted as the estimate which for some c(λ) minimizes |Y − F θ|2 subject to |θ|2 ≤ c(λ). Pk r Recall the notation Jr (θ) = l=1 |θl | . The penalty J1 (θ) is the closest convex penalty of the form Jr to J0 . It is also the only convex member of the Jr family which permits making some coordinates of the optimizing θ to be 0 (Problem 12.2.10). Thus if bj is a regression coefficient and is set to zero, this corresponds to leaving the jth predictor θ Zj out of the model. As mentioned earlier, this method is known as the “Lasso” (least absolute shrinkage and selection operator) (Tibshirani (1996)). The properties of the Lasso are discussed extensively by Bühlmann and van de Geer (2011), and Hastie, Tibshirani and Wainwright (2015). See also Problem 12.2.10. Properties of extensions of the Lasso to grouped variables (the grouped Lasso) can be found in Yuan and Lin (2007) and Zhao, Rocha and Yu (2009). 12.2.4 Machine Learning Approaches We return to the methods discussed in Section 12.1.2 but now with Y ∈ {−1, 1}. Recall that (Zi , Yi ), 1 ≤ i ≤ n, are i.i.d. as (Z, Y ), where (Z, Y ) is independent of (Z, Yi ), 1 ≤ i ≤ n . The goal is to build a classifier δn (Z) based on a training sample (Zi , Yi ), 1 ≤ i ≤ n, that can be used to classify an individual or object with unknown Y and known Z ≡ (Zi , . . . , Zd )T as being either in the “Y = −1” or “Y = 1” category. a) Neural Nets We do not further discuss these methods but refer to Hastie et al (2001, 2009) and Ripley (1996). b) Support Vector Machines As we indicated, support vector machines, were proposed by Vapnik as a classification algorithm for 2 classes which finds the hyperplane of “maximum margin” separating the members of the training sample perfectly according to class. Of course, there may be no hyperplane which can separate at all — for instance, imagine one class distributed uniformly on a spherical shell and the other limited to the interior of the sphere. But if separation is possible then we can define the margin for any separating hyperplane H by M (H) = min |Zi − Π(Zi |H)| i where | · | is the Euclidean norm and Π is projection on H. This is just the minimum distance of the training set from H. The hyperplane sought is the one which maximizes M (H). Formally, for the hyperplane H = {z : β T z + β0 = 0}, |β| = 1 , Section 12.2 325 Classification and Prediction the problem may be stated as (Problem 12.2.13) max min |βT Zi + β0 | : |β| = 1, Yi (β T Zi + β0 ) > 0, 1 ≤ i ≤ n . (12.2.19) β,β0 i We refer to Hastie et al (2009), Section 12.2, for a discussion showing how this convex optimization problem may be transformed upon dropping the restriction that |β| = 1 into min |β| (12.2.20) β,β0 subject to Yi (ZTi β + β0 ) ≥ 1 (12.2.21) ∗ ∗ for all i. Here the maximum margin is 1/|β |, where β is the minimizer in (12.2.20). If separation is impossible we can modify (12.2.21) to Yi (ZTi β + β0 ) ≥ 1 − ξi (12.2.22) for all i where ξi ≥ 0. Note that if ξi0 > 1, then any hyperplane misclassifying Yi0 but correctly classifying all Yi such that ξi = 0 becomes a candidate for separating all {Zi : ξi = 0}. The reason is that we can write (12.2.22) as Yi ZT β i |β| + β0 1 − ξi ≥ . |β| |β| Then, if ξi0 > 1, and there is a hyperplane H = {z : zT β ∗ + β0 = 0} which misclassifies only Yi0 and Yi0 (ZTi0 β ∗ + β0 ) = −δ < 0, we see that, for some 0 < C < ∞, −δ/C ≥ 1 − ξi0 and so |β ∗ |/C is feasible, i.e. a candidate for the minimizing |β| as in (12.2.20) subject to (12.2.22). Evidently, if we permit P ξi ≥ 0 to be arbitrary we can make the inf in (12.2.20) to be 0. Thus, the restriction ni=1 ξi ≤ K is added, and we arrive at 1 min |β|2 2 subject to ξi ≥ 0, Yi (ZTi β + β0 ) ≥ 1 − ξi , i = 1, . . . , n, n X i=1 ξi ≤ K . (12.2.23) By Lagrangian duality (see, for instance, Boyd and Vandenberghe (2004)), we note that the problem is equivalent to max n nX n o 1 X αi − αi Yi Zi 2 i=1 i=1 subject to 0 ≤ αi ≤ γ, (12.2.24) n X i=1 αi Yi = 0 326 Prediction and Machine Learning Chapter 12 for some γ depending on K and with the relation between the optimizing α∗ and β ∗ , β0∗ , ξ = (ξ1 , . . . , ξn ), and γ, given by (12.2.25) α∗i Yi (ZTi β ∗ + β0∗ ) − (1 − ξi ) = 0 , ξi (α∗i − γ) = 0 , (12.2.26) for i = 1, . . . , n and ∗ β = n X α∗i Yi Zi . (12.2.27) i=1 Note that α∗i > 0 iff Yi (ZTi β ∗ + β0∗ ) = 1 − ξi . The corresponding Zi are called support vectors. Note also that all α∗i with ξi > 0 equal γ. Any 0 < α∗i < γ will correspond to a correctly classified Zi at minimum distance from the hyperplane corresponding to (β ∗ , β0∗ ). For further discussion of the implementation of these methods via reproducing kernel Hilbert spaces, we refer to Hastie et al (2009). c) Boosting The initial form of this method, as limited to classification into 2 classes was in terms of reweighting. We are given (i) A set of “weak” classifiers, hk : Z → {−1, 1}, k = 1, 2, . . . , M , M ≤ ∞, where {−1, 1} correspond to the classification decisions. “Weak” loosely means that the classfiers are expected to have P [Misclassification] < 12 , but barely so. (ii) A training sample (Zi , Yi ), i = 1, . . . , n, where Zi ∈ Rd and Yi ∈ {−1, 1}. Boosting then produces a sequences of linear classifiers of the form Gj (Z) = sgn M X k=1 αkj hk (Z) , j = 1, 2, . . . , by selecting the weights so as to emphasize the classifiers doing particularly well on the hard to classify observations. There are many flavors of boosting; see Meir and Rätsch (2003) for an overview. One division can be made between the algorithms in which (1) the hk are considered in a prespecified order, so that αkj = 0, k > min(j, M ), and (2) the other more common approach where αkj 6= 0 for at most min(j, M ) indices, but the h to be considered at stage j is chosen optimally. The basic form of the original AdaBoost algorithm describes a process of generating the αkj by iterative reweighting of the training sample. Specifically, suppose we introduce hm at step m, AdaBoost 1. 1. Initialize G1 (Z) wi1 err1 α1 = = = = h1 (Z) 1 n , i = 1, . . . , n P n i=1 wi1 1 Yi 6= G1 (Zi ) log (1 − err1 )/err1 . Section 12.2 327 Classification and Prediction Gm wi(m+1) = = errm+1 αm+1 = = 2. Given define Gm+1 Pm sgn k=1 αk hk wim exp αm 1(Yi 6= hm (Zi ) i = 1, . . . , n Pn i=1 wi(m+1) 1 Yi 6=hm+1 (Zi ) Pn i=1 wi(m+1) log (1 − errm+1 )/errm+1 . Pm+1 = sgn k=1 αk hk . 3. Iterate for m = 1, . . . , M, . . . , Note. Iterations need not stop once hM has been considered but we can restart with GM replacing G1 . AdaBoost 2. Instead of hk being presented in the original order, in the more common version of the algorithm, hi1 , hi2 , . . . are determined in data driven order. Thus, set h(k) ≡ hik and initialize as before. At stage m m X Gm = sgn k=1 αk h(k) , then at stage m + 1, him+1 = arg min k where Pn i=1 wi(m+1) 1 Yi 6= h(k) (Zi ) Pn i=1 wi(m+1) wi(m+1) ≡ wim exp αm 1 Yi 6= h(m) (Zi ) . Then Gm+1 = sgn m+1 X k=1 with αk h(k) αm+1 = log (1 − errm+1 )/errm+1 and errm+1 = m X i=1 wi(m+1) 1 Yi 6= h(m+1) (Zi ) (12.2.28) ✷ As was discovered by a number of authors (see Meir and Rätsch (2003)) the first of these algorithms is just coordinate ascent as described in Section 2.4.2 and the more common version is a variant usually called the Gauss-Southwell algorithm for a particular optimization problem. We next establish this result. Given P , a distribution of (Z, Y ), let Q(α, P ) ≡ Z exp − M n X j=1 o αj hj (z) y dP (z, y) . 328 Prediction and Machine Learning Chapter 12 This is evidently a convex function of α ∈ RM and if Pbn is the empirical distribution of the training sample, we note that M n o n X 1X αj hj (Zi ) . Qn (α) ≡ Q(α, Pbn ) = exp − n i=1 j=1 If we fix αjm = αj(m−1) , 1 ≤ j ≤ m − 1, and αjm = 0, j > m, then, n iff ∂ 1X exp − Gm−1 (Z)Yi e−αm 1(hm (Z)Yi = 1) Qn (α) = ∂αm n i=1 − eαm 1(hm (Z)Yi = −1) = 0 (e or 2αm Pn e−Gm−1 (Z)Yi −Gm−1 (Z)Yi 1(h (Z)Y = 1) m i i=1 e + 1) = Pn αm = i=1 (1 − errm−1 ) 1 . log 2 errm−1 The correspondence to the first version of AdaBoost we described is complete when we notice that (Problem 12.2.12) n−1 −αj hj (Z)Yi e−Gm−1 (Z)Yi = Πj=1 e Pm 2αj 1(hj (Z)Yi =−1) e− j=1 αj . = Πm j=1 e (12.2.29) Define logit(u) = log[u/(1 − u)], 0 < u < 1 , and suppose that M X αj hj (Z) logit P (Y = 1|Z) = (12.2.30) j=1 for some α ≡ (α1 , . . . , αM )T . If M is fixed and if the sample optimizer α b of Q(α, Pbn ) exists, then if we iterate either version of AdaBoost, the algorithms converge to δ(Z) = 1 iff M X j=1 α bj hj (Z) > 0 with probability tending to 1 for fixed n > M (Problem 12.2.14). In turn, α b converges as n → ∞ in probability to the minimizer of Q(α, P ) (Problem 12.2.15). However, suppose logit P (Y = 1|Z) is of the form (12.2.30) with M = ∞ and it is approximable by P functions of the form m j=1 αj hj (Z). This is, for instance, the case if the [hj (Z) + 1]/2 Section 12.2 329 Classification and Prediction are taken as indicators of arbitrary hyper rectangles in Rd . Then, it may be shown that the population version of the algorithm of type 2 where the “best” direction hj is chosen on each iteration converges to the Bayes classifier, but the sample version may not unless it is regularized in some way by stopping the algorithm early. See Problem 12.2.16. Other convex objective functions than Q(α, P ) can also be considered and these ideas can also be applied to nonparametric regression and even density estimation — see Bickel, Ritov, Zakkai (2001), Friedman, Hastie, Tibshirani (2000), and references therein for further discussions of boosting from a statistical point of view. See also Meir and Rätsch (2003) and Rudin, Daubechies and Schapire (2004) for a discussion from a machine learning/dynamical systems point of view. We note that support vector machines, as defined by (12.2.24), may be put in the boosting framework (Problem 12.2.17). In particular, solving (12.2.24) is equivalent to minimizing the convex objective function, R(α, P ) = Z W (β0 + n X j=1 αj hj (z))y dP (z, y) + λ|α|2 (12.2.31) where λ = (2γ)−1 and the function W is given by W (t) = (1−t)+ , with x+ = x1(x ≥ 0). W may be thought of as the closest convex approximation to 1(t ≤ 1). Note (Problem 12.2.17) that, if λ = 0, and the Bayes rule is of the form δB (Z) = sgn M X j=1 α∗j hj (Z) , then δB minimizes R(α, P ), as well as the Bayes risk, M X α∗j hj (Z) < 0 . P Y j=1 For a fuller discussion of these relations, see Hastie et al (2001, 2009). d) Tree-structured Methods We define trees informally. We ask that the reader examine the captions of Figure 12.1 for the definition of the terms we use in our discussion. We consider the two-class case, that is, Y = 0 or 1, say. The general multiclass case is developed in Problem 12.2.17. We will consider tree-based classifiers based on classification regions derived from a training sample (Z1 , Y1 ), . . . , (Zn , Yn ). At level 1, a classification rule δ (1) based on a predictor Z ∈ Z can be viewed as specifying a “critical” region C1 = {z : δ (1) (z) = 1}, where Y is classified as 1, and its complement, C0 ≡ {z : δ (1) (z) = 0}, where Y is classified as 0. We can represent δ (1) as a two-branch tree as follows. We ask the question: Does the observed Z belong to C1 ? If so, send Z to node 1 , on level 1 in Figure 12.1, and predict Y = 1, or, if not, then send Z to node 0 and predict Y = 0. For level 2, we view C1 and C0 as separate universes with classifiers δ (2) and δ̄ (2) where δ (2) takes values 330 Prediction and Machine Learning Chapter 12 = = = Figure 12.1. An example of a classification tree. The nodes at the final level are called terminal nodes. Y is classified to equal the last digit in the terminal node. 0 and 1 on C1 while δ̄ (2) does so on C0 . Let C11 , C10 , C01 , C00 be the sets defined by {δ (1) = 1, δ (2) = 1}, {δ (1) = 1, δ (2) = 0}, {δ (1) = 0, δ̄ (2) = 1}, {δ (1) = 0, δ̄ (2) = 0}. These form a partition of Z, even as C0 and C1 did. Classify as follows: If δ (1) sends Z to C1 , ask if Z ∈ C11 and if yes, send Z to node 11 on level 2 and classify Y as 1; if not send Z to node 10 and classify Y as 0, etc. We have in this way constructed a more complex classifier taking values 1 on C11 ∪ C01 and 0 on C01 ∪ C00 . Given a sequence of such classifiers, we can evidently grow as large a tree as we wish. The classification we make for a given Z = z always corresponds to the value of the last rule, for the terminal node we send z to. We are immediately faced with the questions: 1) How do we select the consecutive rules for the different levels of the tree? 2) How deeply do we grow the tree? Our goals in 1) are to have each member of the sequence easily computable but exhibiting the feature that the complex rules corresponding to growing the tree to a large depth can approximate the Bayes rule. Question 2) is more complex and can be viewed as a regularization question as discussed in Section 12.5. To understand how Question 1 is answered, we begin with the unrealistic assumption that we know P , that is, π, p0 , and p1 . Then, A: We specify a class of simple rules. If Z is real the class of rules for classification and Section 12.2 Classification and Prediction 331 regression trees (CART)(Breiman, Friedman, Olshen and Stone (1984)) is {δc , δ̄c : δc ≡ 1[c, ∞), δ̄c ≡ 1 − δc } . That is, δ(z) = 1 if z ≥ c, 0 otherwise; and δ̄c (z) = 1 if z < c, 0 otherwise. This means that we consider all possible ways of classifying Y according as z is larger or smaller than a threshold. B: Construct level 1. We begin by examining how well the populations would be split if we used δc or δ̄c . A perfect or “pure” split would occur if P [Y = 1|Z ≥ c] = 1 − P [Y = 0|Z < c] or conversely. A way to measure the “goodness of split,” proposed by Breiman et al. (1984), is to use τ 2 (c) ≡ E Var Y |δc (Z) which vanishes iff the split is “pure” as an impurity measure of the split. In the two-class case this means: Choose c so that, if Ac ≡ {δc (Z) = 1}, τ 2 (c) ≡ P (Ac )P [Y = 1|Ac ]P [Y = 0|Ac ] + P (Āc )P [Y = 1|Āc ]P [Y = 0|Āc ] is minimal. Once the optimal c, cOPT is chosen we use δc or δ̄c accordingly as πP [Y = 1|Ac ] > (1 − π)P [Y = 1|Āc ] . An alternative to τ 2 , proposed by Quinlan (1993), is to use the entropy (see Problem 12.2.18) −E E(log P [Y = 1|δc (Z)]) . Here is an algorithm. The T algorithm. Suppose Z = (Z1 , . . . , Zk )T ∈ Rk is the vector of predictors. For each, 1 ≤ j ≤ k, compute cOPT (j), the split of variable j which gives highest purity, i.e. smallest τ 2 (c). Then, for the first level of the tree, choose that variable Xj which minimizes τ 2 cOPT (j) , the impurity, over all possible variable choices. Call the index on this variable j1 and the optimal splitting point c(1) . This gives a rule δ (1) for the first level of the tree. For the second level, consider splitting separately [c(1) , ∞) and (−∞, c(1) ) into [c(1) , c′ ), [c′ , ∞) on the right and (−∞, c), [c, c(1) ) on the left for c < c(1) and c′ ≥ c(1) , respectively. That is, use δ (2) (z) = δc (z) or δ̄c (z) defined for z ≥ c(1) , and c ≥ c(1) , and similarly, δ̄ (2) (z) = δc′ (z) or δ̄c′ (z) for z < c(1) , c < c(1) , and proceed as in level 1 to obtain (c(21) , j21 ), (c(20) , j20 ). Here, c(21) , c(20) are the optimal cut points for variables Zj21 , Zj20 . Two nodes now arise from each of the nodes of level 1 with predictions carried out as described earlier — see Figure 1. We can continue in this fashion indefinitely if P ↔ (π, p0 , p1 ) where p0 , p1 are densities. ✷ It is intuitively clear and can be shown rigorously (Problem 12.2.24), that if each member of the partion defined by level m of the tree and has positive probability and 332 Prediction and Machine Learning Chapter 12 Cm (z) is the member of the partition containing z, then Cm (z) ⊃ Cm+1 (z) for all m and ∩ Cm (z) = {z}. Hence, if |Cm (z)| denotes the length of Cm (z) m P [Z ∈ Cm (z)|Y = y]/|Cm (z)| → py (z) as m → ∞ where py (z) is the conditional density of Z given Y = y. Thus, the decision rule δm (z) = 1 iff πP [Y = 1|Z ∈ Cm (z)] ≥ (1 − π)P [Y = 0|Z ∈ Cm (z)] converges to the Bayes rule for each z, as m → ∞. If we have a training sample Z1 , . . . , Zn and replace P by the empirical probability bm based on Pb , and π with π Pb , Cm with the version C b = Pb [Y = 1], we have a way of growing empirical classification trees. We claim that if we stop at any level m and define bm (z)|Y = j], the ratio of the number of Yi = j which appear in the pbmj (z) = Pb[Z ∈ C node, to which z belongs, to the total number of Yi = j. Then, the pbmj (z) are estimates of the pj (z) and the rule b = 1 iff π δ(z) b pbm1 (z) ≥ (1 − π b )b pm0 (z) converges to the Bayes rule provided that n → ∞, mn → ∞ slowly — see Breiman, Friedman, Olshen and Stone (1984) for rules for selecting the depth m. CART and Sieves It should be clear that, if we stop at stage m, the estimates of pmj are just histograms but with bin size adapted to the data. They can be related to a sieve as follows. Take without loss of generality the support of P to be the cube I k . Let Pm = { All histograms on I k with up to m break points in each coordinate }. That is, if pj ↔ P ∈ P, j = 0, 1 m pj (z) = 2 X i=1 aij 1(z ∈ Amij ) where {Amij } is a partition of I k into k-cubes {z : bi ≤ zi ≤ c1 , 0 ≤ i ≤ m + 1, b0 = 0, bm+1 = 1}. So Pm is parametrizable by {aij , 1 ≤ i ≤ 2m − 1, j = 0, 1}, the 2m−1 left hand vertices of k-cubes, and the indices (j1 , . . . , jm ) of the variables. If we call this parameter θ, then the tree gives us an estimate νbm of a parameter νm (P ) such that νm (Pθ ) = θ. However, the method of estimation is not maximum likelihood or more generally of minimum contrast form. As we have noted, if P ∈ / P, then Pνm (P ) , the member of P which is “closest” in the appropriate sense to P converges to P as m → ∞. Then, again, regularization, which in this case is choice of the tuning parameter m, is called for. For an extensive discussion of tree-structured methods, see Breiman et al (1984). ✷ Summary. In this section we describe and analyze some of the procedures we mentioned previously in detail. In Section 12.2.1 we establish consistency and determine the rates of uniform convergence of the Nadaraya-Watson over suitable nonparametric models in regression. In Section 12.2.2 we study consistency of kth nearest neighbour rules in the classification. In Section 12.2.3 we also touch on logistic regressions, ridge regressions as examples of sieve methods, and their connections to empirical Bayes. Finally, in Section 12.2.4 we describe support vector machines, boosting, and the tree-based method CART. Section 12.3 12.3 333 Asymptotic Risk Criteria Asymptotic Risk Criteria In what sense are the various procedures we have discussed best or even good? The classical decision theory point of view we have discussed in Section 1.3 of Volume I is to first consider a probability model P for X ∈ X , specify a decision space D, a loss function l : P × D → R+ , and the class of all possible (possibly randomized) decision procedures δ : X → D. We measure the performance of a decision procedure δ by R(P, δ) ≡ EP l P, δ(X) , P ∈ P . (12.3.1) The principle focussed on in the analysis of nonparametric classification and regression procedures is minimax. That is, we compute R(δ) ≡ max R(P, δ) , P and consider δ ∗ minimax optimal if R(δ ∗ ) = arg min R(δ) . D A key result which we extend is Theorem 3.3.3 which states that δ ∗ is minimax if we can find a sequence of priors πk such that inf r(πk , δ) → R(δ ∗ ) , δ where r(πk , δ) denotes the Bayes risk of δ. Recall from Section 12.1 that classification, regression, and other prediction problems are naturally split into two parts: (i) “P known”: Here we envisage observing a vector of predictors Z ∈ Z and know P , where P is the distribution of X = (Z, Y ). Y is unknown and we want to predict Y for each given Z = z. A decision rule is a prediction rule d(·) : Z → Y where Y is the set of values or classes to be predicted. The risk of d(·) is Z R P, d(·) = l y, d(z) dP (z, y) = EP l Y, d(Z) , where l(y, d) is the loss associated with predicting y to be d(z). If we think of Y as a random “parameter,” there is, for any such problem, an optimal rule, the Bayes rule of Proposition 3.2.1, δP∗ (z) = arg min EP l(Y, d(z))|Z = z d which yields the minimum Bayes risk, RP∗ ≡ EP R P, δP∗ (Z) . (ii) The “real” problem. P unknown. We are given training data Xi ≡ (Zi , Yi ), i = 1, . . . , n, i.i.d. as X ∼ P . A decision procedure is a mapping δ(· : Xi , 1 ≤ i ≤ n) : Z → D 334 Prediction and Machine Learning Chapter 12 and has risk R(P, δ) = EP Z l y, δ(z : Xi , 1 ≤ i ≤ n) dP (z, y) , where l is the loss when predicting y to be δ(z : Xi , 1 ≤ i ≤ n). It’s useful in this framework to equivalently measure performance of δ as its regret e R(P, δ) ≡ R(P, δ) − RP∗ , (12.3.2) and ask first that, asymptotically, we have, for a sequence {δn }, e Consistency: R(P, δn ) → 0 as n → ∞ for all P , uniformly on P. And second, given consistency, that we have Asymptotic minimax regret on P: e n ) ≡ sup R(P, e e en . (12.3.3) R(δ δn ) : P ∈ P ≍ inf sup R(P, δ) : P ∈ P ≡ R δ Here ≍ can mean strongly. e n) R(δ →1 en R (12.3.4) or weakly e n ) = O(R en ) and R en = O R(δ e n) , R(δ (12.3.5) en → 0. Typically, unless P is regular parametric, noting that we are in a situation where R (12.3.5) is the best that can be hoped for. When proving minimaxity the essential argument always has the same nature. Exhibit, or at least show, the existence of a sequence of Bayes priors and corresponding Bayes rules whose Bayes regret, that is, the regret of the Bayes rules, is asymptotically the same or no e n ) of the candidate minimax procedures. We begin better than the maximum regret R(δ with the parametric case. 12.3.1 Optimal Prediction in Parametric Regression Models We establish the consistency and strong asymptotic minimax regret property of an empirical plug-in Bayes predictor for squared error loss when P is a regular parametric model. Let X = (Z, Y ) be distributed according to Pθ with density p(z, y|θ) : θ ∈ Rp such that the model {Pθ : θ ∈ Θ}, Θ ⊂ Rp , is regular and obeys the conditions of Theorem 6.2.2. Suppose also that, as usual, the marginal density q(z) of Z does not depend on θ, and that (i) Y is bounded above: |Y | ≤ M for some M > 0. Section 12.3 Asymptotic Risk Criteria 335 (ii) q(·) is bounded below: P q(Z) ≥ δ = 1 for some δ > 0. nR o ∇Tθ e(z, θ)I −1 (θ)∇θ e(z, θ)q(z)dz < ∞, (iii) max θ ∈Θ where, by (1.4.5), the Bayes regret of δ at Pθ for squared error loss is Z 2 Rθ (δ) ≡ Eθ δ(z : X1 , . . . , Xn ) − e(z, θ) q(z)dz . Here, Eθ is expectation with respect to Pθ , and by (1.4.4), the optimal predictor of Y given θ and Z = z is Z ∞ e(z, θ) = Eθ (Y |z) = yp(z, y|θ)dy/q(z) . (12.3.6) −∞ b where When θ is unknown, the natural estimate of e(z, θ) is the plug-in estimate e(z, θ) b θ is the MLE of θ, which we assume behaves regularly. Thus, we take b . δ ∗ (z : X1 , . . . , Xn ) ≡ e(z, θ) By the expansion b = e(z, θ) + ∇T e(z, θ)(θ b − θ) + oP (θ b − θ) , e(z, θ) θ formally, ∗ Rθ (δ ) = Eθ = Eθ Z Z = n−1 b − e(z, θ) 2 q(z)dz e(z, θ) T b − θ) 2 q(z)dz 1 + oP (1) ∇θ e(z, θ)(θ Z 1 + o(1) ∇Tθ e(z, θ)I −1 (θ)∇θ e(z, θ)q(z)dz (12.3.7) where I(θ) is the Fisher information matrix. These formal calculations can be justified under our conditions (Problem 12.3.1) and establish Bayes consistency of the decision rule δ∗. Proposition 12.3.1. Under (i), (ii), (iii), and the assumptions of Theorem 6.2.2, the predicb is asymptotically Bayes consistent for squared error loss. tor δ ∗ = e(z, θ) To show that the asymptotic minimax regret property (12.3.4) holds we will examine what the Bayes solution of this problem is if we assign to θ a positive continuous bounded prior density π. Then, for Z independent of X1 , . . . , Xn and θ, the Bayes risk of a decision rule δ is Z 2 Rθ (δ)π(θ)dθ = Eπ Eθ δ(Z : X1 , . . . , Xn ) − e(Z, θ) |θ 2 = Eθ Eπ δ(Z : X1 , . . . , Xn ) − e(Z, θ) |Z, X1 , . . . , Xn , (12.3.8) 336 Prediction and Machine Learning Chapter 12 and Theorem 1.4.1 leads to the Bayes rule, δπ (Z : X1 , . . . , Xn ) = Eπ [e(Z, θ)|Z, X1 , . . . , Xn ] = Eπ [e(Z, θ)|X] , where now X denotes (X1 , . . . , Xn ), and the Bayes regret for δπ , the minimizer of Bayes risk, is Z 2 Rπ = Eθ Eπ e(Z, θ)|X − e(Z, θ) π(θ)dθ . (12.3.9) Next we argue that we get a close approximation to Rπ if in (12.3.9) we replace b By assumptions (i), (ii), and (iii), Eπ [e(Z, θ)|X] with e(Z, θ). 1 1 −2 b + vn− 2 ) − e(z, θ) b = ∇T e(z, θ)vn b e(z, θ + OP (n−1 ) , θ (12.3.10) b + ∇θ e(z, θ)(θ b b , we have uniformly for v in a compact set. Writing e(z, θ) ∼ − θ) = e z, θ) b |X) . b Eπ (θ − θ) b ∼ Lθ Eπ [e(Z, θ)|X] − e(z, θ) = Lθ ∇θ (e(Z, θ) Noting that e is uniformly bounded, and applying the Bernstein–von Mises theorem to √ b n(θ − θ)|X (Theorems 5.5.2 and 6.2.3), we get that Z b 2 π(θ)dθ Eθ Eπ e(Z, θ)|X − e(Z, θ) = 1 n Z Eθ Z b vT ϕ(v, I(θ))dv b ∇Tθ e(Z, θ) 2 π(θ)dθ 1 + o(1) , (12.3.11) where ϕ(v, Σ) is the N (0, Σ) density. By the symmetry of the N (0, Σ) distribution the inside integral is zero. Then, (12.3.11) is o(n−1 ) and using (1.4.5), we can conclude that Z b − e(Z, θ) 2 π(θ)dθ + o(n−1 ) . Rπ = Eθ e(Z, θ) (12.3.12) An application of the argument leading to the Bernstein–von Mises theorem similarly yields Z Rθ (δ ∗ )π(θ)dθ = Rπ 1 + o(1) . (12.3.13) What we have shown is that, by (12.3.12) and the expansion leading to (12.3.7), Proposition 12.3.2. Under the conditions of Proposition 12.3.1, (i) the Bayes regret is Rπ = where c(π) = and thus Z c(π) 1 + o(1) n Eθ ∇Tθ e(Z, θ)I −1 (θ)∇θ e(Z, θ) π(θ)dθ Section 12.3 Asymptotic Risk Criteria 337 b is an asymptotically efficient estimate of e(·, θ). (ii) e(·, θ) We next show that δ ∗ has the strong asymptotic minimax property. Proposition 12.3.3. Under the assumptions of Proposition 12.3.1, δ ∗ is asymptotically minimax regret in the sense of (12.3.4). Proof. Take π = πt with πt converging weakly to point mass at θ∗ , as t → 0, where θ ∗ ≡ arg max Eθ ∇Tθ e(Z, θ)I −1 (θ)∇θ e(Z, θ) . Since the integral of a function is less than its maximum, max c(π) = max Eθ ∇Tθ e(Z, θ)I −1 (θ)∇θ e(Z, θ) . π θ It follows that the minimum Bayes regret is 1 max Eθ ∇Tθ e(Z, θ)I −1 (θ)∇θ e(Z, θ) 1 + o(1) n θ and hence we conclude that δ ∗ is asymptotically minimax regret in the strong sense of (12.3.4) by the arguments establishing Theorem 3.3.3. That is, we have shown that there are priors whose Bayes regret has the same asymptotic behavior as the maximum regret of our candidate rule δ ∗ . ✷ 12.3.2 Optimal Rates of Convergence for Estimation and Prediction in Nonparametric Models In this section we will develop methods for showing weak regret minimaxity in general nonparametric frameworks. Again, this involves introducing a candidate procedure and finding a “least favorable” prior where the Bayes risk converges to the maximum risk of the candidate rule. Our main application is Example 12.3.1. Nonparametric regression. We show that with proper bandwidth choice, the Nadaraya-Watson kernel of Section 12.2.1 is asymptotically minimax regret in the weak sense for the nonparametric problem of predicting µ(z) where µ(z) ≡ E(Y |Z = z) satisfies a first order Lipschitz condition, P = P : |µ(z) − µ(z′ )| ≤ M |z − z′ | for all z, z′ . (12.3.14) We recall first that the rate n−2/(2+d) is achievable in nonparametric regression for the problem of predicting µ(Z) = E(Y |Z) with squared error loss on the basis of (Z, Y ), Z ∈ Rd , when the joint distribution P of (Z, Y ) belongs to P given by (12.3.14). The following theorem, which is easily generalized to more general Z, Y , follows from the proof of Theorem 12.2.1 in Appendix D.7. Theorem 12.3.1. Let P0 = {P ∈ P : P on I d }, Z uniform on I d . Let µ bNW be the NW estimate given in (12.2.7) with product kernel Πdj=1 K0 (tj ), where K0 is non-negative, continuous, bounded, and has compact support. Then, using bandwidth h = cn−1/(2+d) , c > 0, 338 Prediction and Machine Learning Chapter 12 2 (i) For fixed z in the interior of I d , sup | µ bNW (z) − µ(z)| : P ∈ P0 = O(n− 2+d ) . 2 (ii) sup EP |b µNW (Z) − µ(Z)|2 : P ∈ P0 = O(n− 2+d ) . (12.3.15) Remark 12.3.1(a). In the problems we will show that rate n−2s/(2s+d) can be obtained if we generalize (12.3.14) to the Hölder class Ps ≡ {P : |Ds−1 (µ(z) − µ(z′ ))| ≤ M |z − z′ | for all z, z′ } where Ds−1 is the s − 1 differential of µ, D0 µ = µ, provided we replace Nadaraya-Watson by a suitable local polynomial estimate. (b) These rates for Hölder classes or similar Sobolev classes can be achieved by a variety of predictors — see Tsybakov (2008) and Györfi et al (2002) for examples. This is also true for classes which permit discontinuous functions. In this case an example of appropriate procedures is methods based on wavelet expansions of the members of P in (12.3.14). See Donoho and Johnstone (1994) for basic results in this direction. ✷ Lower bounds 2 Using lower bounds on risk, we will show that for P as in (12.3.14), indeed, n− 2+d is the best minimax rate one can hope for. We also reveal something equally significant. For IMSE, the “least favorable” prior distributions are quite unlike those in Section 12.3.1. The former can be thought of as smooth densities concentrating around a least favorable point θ∗ . The latter concentrate around probabilities P whose corresponding µ(·) is as wiggly as possible but still obeys the constraints on P . Such µ seem rather implausible in actual practice and the minimax theorems are not as compelling as in the low dimensional parametric case. As Einstein said (in English translation), “God is subtle but not malevolent.” ✷ In general, to prove lower bound results, we necessarily want to use asymptotic versions of the minimax theory of Chapter 8. There are a number of approaches to lower bounds. Some of the ones we present and others are discussed in full detail in Tsybakov (2008), Chapter 2. The testing approach to lower risk bounds Suppose that our decision problem is of the form: We observe X ∈ X , X ∼ P , P ∈ P and want to estimate a parameter θ : P → T where we are given a metric ρ on T and our goal is to construct θe : X → T so that θe has the minimax property e b max EP ρ θ(X), θ(P ) = min max EP ρ θ(X), θ(P ) . P θb P Evidently prediction in regression is a problem of this type with θ(P )(·) = EP (Y |·) = µ(·). Here X = (Z, Y ), T is a collection of functions of Z and a possible metric to use as a loss function is Z 12 2 ρ(t1 , t2 ) ≡ t1 (z) − t2 (z) dν(z) Section 12.3 339 Asymptotic Risk Criteria for some probability ν and t1 , t2 ∈ T . For this θ(P ) and ρ, if ν = Q is the distribution of a Z independent of the data vector X, b b(Z) = IM SE(b µ) θ(P ) = EQ M SE µ E ρ2 θ(X), by Fubini’s theorem. Another example is ν = point mass at a given point z. Returning to the general case, let θ0 ≡ θ(P0 )(·) where P0 is fixed and for r > 0 let Sr ≡ P : ρ θ(P ), θ0 ≥ r . Our strategy is to select a subset Ωr = {Qλ ∈ S(r) : λ ∈ Λ ⊂ R} of S(r) parametrized by a parameter λ ∈ Λ and to select a prior πr (λ) on Λ such that the Bayes test generates lower bounds in risk. We will identify P ∈ Ωr with λ and write Πr (dP ) for Πr (λ)dλ as well as Πr (Ωr ) for Πr (Λ). Theorem 12.3.2. Suppose all P ∈ Pr ≡ {P0 } ∪ Ωr have discrete or continuous case density functions p. Consider the testing problem, H : P = P0 vsK : P ∈ Ωr with 0−1 loss. Let Πr be a prior distribution on Pr such that 12 = Πr {P0 } = 1 − Πr Ωr and let ϕB be a corresponding Bayes test, Z [p(x)/p0 (x)]Πr (dP ) > c ϕB (x) = 1 if L ≡ Ωr = 0 if L<c. Let 1 RB (Πr ) ≡ r(Πr , ϕB ) = E0 ϕB (X) + 2 Z EP 1 − ϕB (X) Πr (dP ) (12.3.16) be the minimum Bayes risk, and assume 0 < RB (Πr ) < 1. Then, for any θb and any metric ρ, b θ(P ) ≥ r ≥ RB (Πr ) . max P ρ θ, Pr 2 Proof: Consider the test ϕ(x) = 1 if =0 if r 2 b θ0 ) ≤ r . ρ(θ, 2 b θ0 ) > ρ(θ, Then ϕ has Bayes risk no lower than the Bayes test, thus 1 E0 ϕ(X) + max EP 1 − ϕ(X) ≥ RB (Πr ) . Ωr 2 (12.3.17) 340 Prediction and Machine Learning So, either or Chapter 12 b θ0 ) > r ≥ RB (Πr ) P0 ρ(θ, 2 b θ0 ) ≤ r ≥ RB (Πr ) . max P ρ(θ, Ωr 2 (12.3.18) In the second case, writing θ for θ(P ) we have, by the triangle inequality, Then, b θ0 ) ≥ ρ(θ, θ0 ) − ρ(θ, b θ) ≥ r − ρ(θ, b θ) . ρ(θ, r b θ) ≥ r . =⇒ ρ(θ, 2 2 Using (12.3.18), we see there exists P ∈ Ωr such that b θ0 ) ≤ r ≤ P ρ(θ, b θ) ≥ r , RB (Πr ) = P ρ(θ, 2 2 b θ0 ) ≤ ρ(θ, and the result follows. ✷ Further, by Markov’s inequality we obtain a minmax risk bound, Corollary 12.3.1. Under the conditions of Theorem 12.3.2 b θ(P ) ≥ 2RB (Πr ) . min max EP ρ θ, P b r r θ Bounds from asymptotic likelihoods in the Gaussian contiguous case. The main weakness of the bound based on Theorem 12.3.2 is that we are left with the problem of lower bounding the Bayes risk RB (Πr ) itself. However, inQthe case that n (X1 , . . . , Xn )T is a vector of independent random variables with p(x) = j=1 p(j) (xj ) densities, there is a natural approach to asymptotic bounds using the contiguity ideas of Section 9.5. Let r = rn depend on n and write Πn for Πrn . We are looking for πn (·) on S(rn ) ∪ {P0 } which make the testing problem as hard as possible, i.e. an (approximately) least favorable prior distribution πn . Let, Ωn denote Ωrn and Ln (P ) = log n Y p(j) (j) j=1 p0 (Xj ) , where P ∈ Ωn . Suppose that we can choose Ωn and Πn such that πn (Ωn ) = 12 = πn (P0 ) and, as n → ∞, n σ 2 2 o P0 ,σ →1. πn P0 : Ln =⇒ N − 2 It readily follows that, asymptotically, we should take c = 12 in the Bayes test ϕB and then (Problem 12.3.7), as n → ∞, σ RB (Πr ) → 1 − Φ( ) . 2 Section 12.3 341 Asymptotic Risk Criteria Thus, a general approach to getting lower bounds is to construct Ωn and Πn as above. We can then conclude by Corollary 12.3.1 that it will not be possible to estimate θ(P )(·) at a rate faster than rn ↓ 0. Example 12.3.1. (Continued) Nonparametric regression. We next apply the likelihood idea to the Lipschitz model P0 of Theorem 12.3.1 and show minimaxity of the Nadaraya-Watson estimate. As before, for simplicity, assume Z is uniform on the unit cube I d = [0, 1]d . (a) Estimation of µ(·) at a Point We begin with the easier problem of estimating µ(z) at a point, say, z0 ≡ ( 21 , . . . , 21 )T with loss function |b µ(z0 ) − µ(z0 )|. For our submodel, we consider Yi = µh (Zi ) + Ni , i = 1, . . . , n (12.3.19) where N1 , . . . , Nn are i.i.d. N (0, 1) and we assume, without loss of generality, that µ0 (·) ≡ µP0 (·) = 0 . Given that µ bNW is the candidate rule, it is natural to consider a prior πn putting all its mass on µh (·) given by µh (z) ≡ hg h z − z0 1(|z − z0 | ≤ ) h 2 (12.3.20) where h = hn depends on n, and the kernel g is of the form g(z) ≡ d Y g1 (zj ) j=1 with g1 differentiable and, for some M0 > 0, g1 : [0, 1] → R, g1 6= 0, g1 (0) = g1 (1) = 0, |g1′ | ≤ M0 . (12.3.21) Note that, for z(1) , z(2) ∈ I d , (i) |g(z(k) )| ≤ M0d , k = 1, 2 . (ii) |g(z(1) ) − g(z(2) )| ≤ M0d−1 ≤ M0d d X j=1 (1) (2) Pd (2) (1) j=1 |g1 (zj ) − g1 (zj )| 1 |zj − zj | ≤ M0d d 2 |z(1) − z(2) | . 1 (12.3.22) Thus if M0 ≡ (M d− 2 )−d , for some M > 0 and |z(k) − z0 | ≤ k = 1, 2, then |µh (z(1) )−µh (z(2) )| ≤ h g z(1) − z 0 h −g z(2) − z 0 h 1 2 h, |z(k) − z0 | ≤ 1 2 h, ≤ M |z(1) −z(2) | . (12.3.23) 342 Prediction and Machine Learning Chapter 12 In view of (12.3.21), hg1 [(z− 12 )/h]1(|z− 12 | ≤ h) is differentiable on (0, 1) with derivative bounded by M0 in absolute value. Thus the first order Lipschitz condition (12.3.14) holds for all z1 , z2 . Moreover, by construction, µh (z) ≤ hM0d , (12.3.24) and Ln (P ) = n X i=1 µh (Zi )Ni − µ2h (Zi ) , 2 (12.3.25) where Ni are i.i.d. N (0, 1). So, given Z1 , . . . , Zn , n n X 1X 2 µ2h (Zi ) . µ (Zi ), Ln ∼ N − 2 i=1 h i=1 Now, µh (Zi ) = 0 if |Zi − z0 | > h Z − z i 0 = hg otherwise. h Evidently, and given |Zi − z0 | ≤ h, E n X i=1 Var n X i=1 Zi −z0 h P |Zi − z0 | ≤ h ≍ hd is uniform on I d . Thus, µ2h (Zi ) ≍ nhd+2 µ2h (Zi ) ≤ nEµ4h (Z) ≍ nhd+4 1 and if we take h = n− d+2 , then Pn P i=1 µ2h (Zi ) −→ σ 2 . Hence, 1 (12.3.26) Ln (P ) =⇒ N (− σ 2 , σ 2 ) , 2 R1 2 where σ 2 = g (z)dz)d . In view of (12.3.23), (12.3.24) and (12.3.26) we can apply 0 2 Corollary 12.3.1 and obtain that the lower bound on the risk has the rate of n− d+2 we got for the Nadaraya-Watson estimate in Theorem 12.2.1. That is, the Nadaraya-Watson estimate achieves the optimal rate of convergence. (b) Estimating µ(·) as a Function. We approach this problem from the estimation point of view with a prior suggested by our analysis of estimation at a point. Let µh (z, ε) ≡ h N X j=1 εj g Z − z j h 1(z ∈ Cj ) , (12.3.27) Section 12.3 343 Asymptotic Risk Criteria where the εj are i.i.d. P [εj = 0] = P [εj = 1] = 12 , ε ≡ (ε1 , . . . , εN )T , and g(z) = d Y g1 (zj ) j=1 where g1 (z) = z1(0 ≤ z ≤ 12 ) + (1 − z)1( 21 < z ≤ 1) for 0 ≤ z ≤ 1, C1 , . . . , CN are a partition of I d into N = h−d cubes of side length h, and the zj are the centers of the Cj . It can be shown (Problem 12.3.4) that if N1 , . . . , Nn are i.i.d. N (0, 1) and Yi = µh (Zi , ε) + Ni , i = 1, . . . , n , (12.3.28) 1 then, for M = (d− 4 ), (12.3.27) and (12.3.28) define a prior distribution π on P with |µ(z1 , ε) − µ(z2 , ε)| ≤ M |z1 − z2 | . We now apply (12.3.6) and (12.3.9) treating ε ≡ (ε1 , . . . , εN )T as θ. Then, e(z, ε) = µh (z, ε) and E e(z, ε)|X1 , . . . , Xn = h N X P [εj = 1|X1 , . . . , Xn ]g j=1 (z − zj ) 1(z ∈ Cj ) . (12.3.29) h Using the fact that the 1j ≡ 1(z ∈ Cj ) are orthogonal we see from (12.3.9) and (12.3.29) that the Bayes risk is Rπ = E h2 = h2 E Z N X j=1 N X j=1 (P [εj = 1|X1 , . . . , Xn ] − εj )g Eε (P [εj = 1|X1 , . . . , Xn ] − εj )2 z − z 2 j 1j (z) dz h Z g2 z − z j h 1j (z)dz , (12.3.30) where Eε indicates expectation over ε, while the second E is over X1 , . . . , Xn . But, Z Z z − zj g2 1j (z)dz = g 2 (z)dz hd , (12.3.31) h and 2 Eε P [εj = 1|X1 , . . . , Xn ] − εj 2 1 1 = 1 − P [εj = 1|X1 , . . . , Xn ] + P 2 [εj = 1|X1 , . . . , Xn ] 2 2 1 = − P [εj = 1|X1 , . . . , Xn ] 1 − P [εj = 1|X1 , . . . , Xn ] 2 1 1 1 ≥ − = . 2 4 4 (12.3.32) 344 Prediction and Machine Learning Chapter 12 1 Put h = n− d+2 and combine (12.3.30)–(12.3.32) to get that, for some universal C, Rπ ≥ h2 2 M 4 0 Z 2 g 2 (z)dz = Cn− 2+d Id since N hd = 1. Thus the rate of convergence to zero of the lower bound on risk is no faster than the rate of convergence of µ bNW (·), and we have established the asymptotic rate optimality of the Nadaraya-Watson estimate for IMSE. ✷ As we have seen in our examples establishing the required lower bound on Bayes risks even with plausible least favorable priors can be subtle. For instance, computing the natural bound on integrated absolute error is not easy using the method we developed for IMSE in the second part of Example 12.3.2. The multiple decision approach to lower bounds There are a number of general purpose lower bounds for the Bayes risk in problems with finite parameter spaces and corresponding finite multiple decision problems which generalize Lemma 12.3.1 and can be similarly applied. Chief among these are lemmas due to Fano (1952) and Assouad (1983). See also Tsybakov (2008), Theorem 2.4 for a more direct generalization of Lemma 12.3.1 to multiple decision procedures rather than just tests as we did in Corollary 12.3.1. We discuss Fano’s Lemma and refer to Tsybakov (2008) for a general form of Assouad’s and other lemmas. Let P = Pθ : θ ∈ {1, . . . , m} be probability distributions on some X and consider the decision problem of deciding, given an observation X, which is the true Pj with 0-1 loss. Identify Pj with j. A possible rule δ : X → {1, . . . , m} has risk R(j, δ) = Pj [δ(X) 6= j], 1≤j ≤m. (12.3.33) We introduce two quantities familiar in information theory; see Cover and Thomas (1991), Chapter 2. For (X, Y ), a pair of random variables, with X taking on m values {1, . . . , m} only, define m X H(X) = − P [X = j] log P [X = j] , j=1 the entropy of X. It is immediate that H(X) = log m − K(PX , U ) , (12.3.34) where PX is the distribution of X, U is the uniform distribution on {1, . . . , m}, and K is the Kulback-Leibler divergence defined in (2.2.24). Next define the mutual information of (X, Y ) as I(Y, X) = I(X, Y ) ≡ K(P(X,Y ) , PX PY ) Z Z p(x, y) = pX (x)pY |X (y|x) log dx dy , pX (x)pY (y) (12.3.35) Section 12.3 345 Asymptotic Risk Criteria the Kullback-Leibler divergence between the joint distribution of (X, Y ) and the distribution with independent coordinates having pX , pY , as marginals. Identifying Θ with Y ∼ U , L(X|Θ = j) with Pj as above, Fano’s inequality states Theorem (Fano). For any δ as in (12.3.33), m ≥ 2, m 1 X I(Θ, X) R(j, δ) ≥ 1 − m j=1 log m h K(P , P ) + log 2 i (m − 1) k l . ≥ 1 − max k,l log m m max R(j, δ) ≥ j (12.3.36) (12.3.37) The interpretation of (12.3.36) is natural in information theory; see Cover and Thomas (1991) for a discussion and proof of (12.3.36). The modification (12.3.37) is not the best possible — see Le Cam (1986) for a better but somewhat harder to prove version. To prove (12.3.37) we argue as follows. If Y ∼ U and X takes values in {1, . . . , m}, log h m X −1 p(x, y) i = log p(x|y) m−1 p(x|k) pX (x)pY (y) k=1 m X −1 = log p(x|y) − log m p(x|k) (12.3.38) k=1 m m k=1 k=1 1 X 1 X p(x|y) ≤ log p(x|y) − , log p(x|k) = log m m p(x|k) since − log z is convex in z. Substituting (12.3.38) back in (12.3.35) we get m Z m m−1 1 X p(x|j) 1 X dx ≤ max K(Pj , Pk ) p(x|j) log I(Θ, X) ≤ j,k m j=1 m p(x|k) m k=1 and (12.3.37) follows. ✷ Fano’s inequality can be utilized as a generalization of the testing argument of Corollary 12.3.1. We look for m points P1 , . . . , Pm in P such that ρ θ(Pk ), θ(Pl ) ≥ rn for all k 6= l, rn and ∪m = P, where S(θ, r) is the ρ sphere of radius r around θ, and define k=1 S θ(Pk ), 2 b b θ(Pj ) < rn /2. Note that j is uniquely defined (Problem 12.3.8). the rule δ(x) = j iff ρ(θ, Then, by Markov’s inequality, we have max K(Pl , Pk ) b θ(P0 ) ≥ rn P ρ θ, b θ(P0 ) ≥ rn ≥ rn 1 − l6=k . EP ρ θ, log m Therefore b θ(P ) ≥ rn δ max EP ρ θ, (12.3.39) P for some δ > 0, all n, provided that max K(Pl , Pk )/ log m is bounded above by 1 − δ. l6=k 346 Prediction and Machine Learning Chapter 12 We apply this method to Example 12.3.1 with integrated squared error loss. It is natural to consider the m = 2N functions µε (z) = N X εj hg j=1 z − zj h as ε varies. Any two such functions µε1 , µε2 differ on at least one of the Cj and, by the 2 R calculation made for part (i) of the example, if ρ2 (µ1 , µ2 ) = I d µ1 (z) − µ2 (z) dz, then 2 ρ (µε1 , µε2 ) ≥ h 2 Z g 2 (z)dz . (12.3.40) Id On the other hand, the maximum divergence corresponds to the functions µ2 (z) = N X hg j=1 z − zj 1(z ∈ Cj ) , h µ1 (z) ≡ 0 . If P1n , P2n are two product probabilities for which (Zi , Yi ), 1 ≤ i ≤ n are i.i.d., then K(P1n , P2n ) = nK(P11 , P21 ) , (12.3.41) and if we let W ∼ N (0, 1), then 1 ϕ(Y1 − µ2 (Z)) = E µ2 (ZW ) + Eµ22 (Z) K(P11 , P21 ) = −E1 log ϕ(Y1 ) 2 Z 1 = h2 g 2 (z)dz . (12.3.42) 2 Id 1 d If we put rn = n− 2+d , h = rn so that N = n 2+d , we see from (12.3.40)–(12.3.42) that (12.3.39) with δ = 21 yields 2 r2 b(Z) − µ(Z) ≥ n max EP µ P 2 in accordance with our results in part (ii) of Example 12.3.1. Remark 12.3.2. (a) This method based R on Fano’s inequality permits us to consider other loss functions ρ such as ρ(µ1 , µ2 ) = I d |µ1 (z) − µ2 (z)|dz. (b) We can also consider other P than the ones corresponding to Lipschitz µ. For instance, suppose P corresponds to {µ : |Ds µ| ≤ M < ∞} where Ds is the collection of sth partial derivatives of µ, s ≥ 1. Then, if we use integrated squared error, we can show (Problem 12.3.9) that the minmax risk of the Nadaraya-Watson estimate is no smaller in 2s order than n− 2s+d . It is also fairly easy to show (Problem 12.3.10) that this rate can be achieved. This type of result was first obtained by Stone (1982). Section 12.3 12.3.3 347 Asymptotic Risk Criteria The Gaussian White Noise (GWN) Model Donoho and Johnstone (1994) andKerkyacharian and Picard (1995) initiated the study of behaviour we expect in nonparametric formulations generally. µ ≡ (µ1 , . . . , µp , . . .)T be an infinite dimensional vector lying in l2 ≡ {µ : P Let 2 2 2 j µj < ∞}. Suppose σ = σ0 is known and that we observe Y1 , . . . , Yn i.i.d. as Y with Y =µ+ε, where µ ∈ l2 and ε is a vector of independent N (0, σ02 ) variables. Note that Y ∈ / l2 since P ε2i = ∞ with probability 1. By sufficiency of Ȳ this model reduces to the model Ȳ = µ + ε̄ , (12.3.43) where ε̄ is a vector of i.i.d. N (0, σ02 /n) variables. This, (12.3.43), is the Gaussian white noise model (GWN). Note that if the vector were p rather than ∞ dimensional this would be a known variance linear model. Just as we did in Remark 9.5.4, we will argue that, at least formally, the GWN is the limit of nonparametric models in analyzing decison procedures just as the canonical linear model is the limit of regular parametric models. As we shall see doing so not only suggests convergence rates and the difficulty of problems, but also procedures. We illustrate by considering again Example 12.3.1 (Continued). Nonparametric regression. Using the same notation as in Example 11.3.2, suppose that we estimate by orthogonal series. For instance, consider first one dimensional Fourier series on I ≡ [0, 1]. That is, µ(x) = ∞ X θk ϕk (x) k=1 where ϕ1 (x) ≡ 1 , ϕ2k (x) = √ √ 2 cos(2πkx) , ϕ2k+1 (x) = 2 sin(2πkx) , is the Fourier basis. Note that this basis is orthonormal in L2 (0, 1) because √ ϕ2k (x) + iϕ2k+1 (x) = 2e2πikx . We can promote these functions by tensor products, to orthogonal series for µ(·) on I d . We consider H = {h : h(x) = Πdj=1 ϕij (xj ) : 1 ≤ ij < ∞} . Then, if µ ∈ L2 (I d ), we can write, for hl , hk ∈ H, µ(Z) = ∞ X l=1 θl hl (Z) , θk (P ) = Z Id µ(z)hk (z)dz . (12.3.44) 348 Prediction and Machine Learning Chapter 12 Consider the sieve of approximating regression models, Yi = p X θk hk (Zi ) + εi i = 1, . . . , n k=1 T where the εi are i.i.d. N (0, σ02 ). By Theorem 6.1.4, the MLE of θ(p) ≡ θ1 (P ), . . .,θp (P ) is b(p) = G−1 (θ b∗ , . . . , θ b∗ )T , θ n 1 p where n Gn ≡ k 1X hj (Zi )hk (Zi )kd×d n i=1 and n X b∗ = 1 θ Yi hj (Zi ), j n i=1 1≤j ≤p. Given Z1 , . . . , Zn , by Section 6.1, b(p) ∼ Np θ(p) , σ 2 G−1 /n . θ 0 n (12.3.45) As n → ∞, for fixed p, √ b(p) − θ(p) ) =⇒ N (0, σ 2 Jp ) n(θ 0 P since Gn −→ Jp by our choice of h1 , . . . , hp , . . .. If we now let p → ∞ we see that we have arrived at the Gaussian white noise model (12.3.43). Then, estimation of µ(·) and estimation of θ(P ) = θ1 (P ), . . . , θp (P ), . . . ∈ l2 are formally equivalent, and for IMSE and squared error on θ rigorously so, since Z Id X 2 (θbj − θj )2 µ b(z) − µ(z) d(z) = j by the Parseval identity. That suggests that if we can carry over descriptions in µ(·) space into ones on θ in l2 we can gain some insight into the behaviour of procedures, even as we did from the correspondence between “limits” of finite dimensional parametric models and the canonical Gaussian linear model in Chapter 6. We will only proceed formally but want to note that the rigorous theory of asymptotic approximation of experiments, developed by Le Cam, which fully justifies the equivalence of results such as Wilks’ theorem in the parametric case has been developed by Brown et al, Nussbaum, and others in an important series of papers (1996–2004) to apply to regression, density estimation, and other nonparametric methods. ✷ Section 12.3 12.3.4 Asymptotic Risk Criteria 349 Minimax Bounds on IMSE for Subsets of the GWN Model In Example 12.3.1 (continued) we restrict θ to subsets of l2 of the form, ∞ X Θβ = θ : (k β θk )2 ≤ M , β > 0 . (12.3.46) k=1 These sets are, if θk are Fourier coefficients, sets that specify some degrees of smoothness on µ(·). For instance, Θ1 corresponds to the set, {µ(·) : µ(·) is absolutely continuous, and R1 ′ 2 0 µ (s) ds ≤ M }. To see this we can use Parseval’s identity since the Fourier coefficients of µ′ are (Problem 12.3.5) θe1 = 0, θe2k = 2πk, θe2k+1 = −2πk . b ∈ R∞ with loss We consider estimation of θ by θ b θ) ≡ l(θ, ∞ X j=1 (θbj − θj )2 (12.3.47) (12.3.48) and return to the problem of setting lower bounds in the GWN model with IMSE risk. We use a direct approach. Note that because Σk −α converges iff α > 1, e λ ≡ {θ : |θk | ≤ Mλ k −λ , λ > β + 1 , k ≥ 1} Θβ ⊃ Θ 2 for suitable Mλ . We restrict, without loss of generality, to priors making the θk independent. See Problem 12.3.6. It turns out that such priors are asymptotically least favorable. In that case, it seems reasonable to consider θk ∼ Fk where Fk is least favorable for the univariate model {θk : |θk | ≤ M k −λ }. It has been shown (see Casella and Strawderman (1981), Bickel (1981), and Levit (1981)), that the least favorable priors, πδ , for estimating µ, given X ∼ N (µ, 1), |µ| ≤ m > 0, have finite support and as m ↓ 0 are symmetric on points {−c, 0, c} and in the limit are point mass at 0. That is, there exists m1 √ > 0 such that the least favorable prior is of the above form for m ≤ m1 . Replacing Yi by nYi ∼ N (µi , 1) 1 e λ corresponds to {µ : |µk | ≤ Mλ k −λ n 12 } so that for with µi ≡ n 2 θi we see that Θ 1 1 k ≥ [Mλ n 2 mo ] λ ≡ k0 the least favorable prior is pointmass at 0. In general, let δ{t} denote point mass at t, consider the prior hδ i {−c} + δ{c} πδ = εδ{0} + (1 − ε) . 2 The Bayes estimate is (Problem 12.3.12) 350 Prediction and Machine Learning √ √ ϕ (t − m) n − ϕ (t + m) n δm (t) = (1 − ε)m √ √ ϕ(t − m n) + ϕ (t + m) n √ 1 − e−2m nt √ = (1 − ε)m 1 + e−2m nt Chapter 12 (12.3.49) and its Bayes risk is 2 Rm ≡ (1 − ε)Em δm (Y ) − m + εm2 √ √ = m2 (1 − ε)E e−4m nY (1 + e−2m nY )−2 + ε . (12.3.50) 1 Put m = n− 2 , ε = 0, and Y = Z + m where Z ∼ N (0, 1) and get Rm = 4 4 −4mZ Ee (1 + e−2mZ )−2 1 + o(1) = 1 + o(1) . n n 1 It is easy to see that putting θk ∼ πm , m = n− 2 , k < k0 , and m = 0, k ≥ k0 , yields a Bayes risk for the loss (12.3.48) that is bounded by ∞ ∞ X X 1 4k0 4 2 2 θk = k −2λ ≍ n−(1− 2λ ) . (k0 − 1) + + Mλ n n n k=k0 k=k0 −1 Since this holds for all λ > β + 12 we expect that n−2β(1+2β) is a correct minmax order 2 bound for Θβ . In particular, β = 1 give us the familiar Lipschitz n− 3 rate. This is, in fact, established in Theorem 2.7 of Tsybakov (2008). The Bayes estimate achieves minimaxity in terms of rate but the simpler estimate, θbk = Yk , = 0, 1 ≤ k ≤ k0 k > k0 , (12.3.51) does as well. One can, in this and related cases, do better and determine the actual value of the minmax bound asymptotically by using Pinsker’s theorem (1980). See also Theorem 3.1 in Tysbakov (2008). By taking θbk of the form (1 − αk )Yk 1(k ≤ k0 ), it is possible to obtain the bound asymptotically. 12.3.5 Sparse Submodels The GWN also accommodates subsets which are “sparse” but where the small {θk } can occur anywhere, and not just for k large. These types of models arise quite generally. An example arises with covariance matrices of vectors (X1 , . . . , Xd )T which represent gene expression in a microarray. Then we expect that most genes are independent corresponding to 0’s in the off diagonal part of the covariance matrix, but, if the genes are dependent, we Section 12.3 351 Asymptotic Risk Criteria expect the dependence is strong. Donoho and Johnstone (1994) suggested a class of GWN subsets exhibiting this type of sparsity Θq,r ≡ {θ : θk = 0, k > r, r X k=1 |θk |q ≤ M } (12.3.52) for 0 ≤ q < 1. The case q = 0 where the sum is simply counting nonzero elements in {θk : 1 ≤ k ≤ r} is the prototype. The minmax risk of squared error estimation for such models is discussed extensively in Donoho and Johnstone (1994), Kerkyacharian and Picard (1995), and more recently in Abramovich, Benjamini, Donoho and Johnstone (2006) and Johnstone and Silverman (2005). A key emphasis is on adaptive procedures, procedures defined here as achieving uniform minimax rates. We will only derive minmax rates for Θ0,n and exhibit procedures which achieve them. The principle that we should consider priors with independent components is still legitimate, as is the principle that least favorable priors asymptotically concentrate on {−c, 0.c}. However, we do not know which are likely to be the 0 coordinates other than θj , j > r. On the other hand there is no reason to behave differently for different nonzero coordinates. Fano’s inequality suggests considering the points θS where S ranges over all subsets of size M of {1, . . . , r} given by θj = 0, j ∈ / S, θj = ±c, j ∈ S. The l2 distance between two such points is ≥ c. The Kullback-Leibler divergence between the probability distribution corresponding to two such points is K(Pθ1 , Pθ2 ) = n r c2 X 1(θ1j 6= θj2 ) , 2 i=1 so that max K(Pθ1 , Pθ2 ) = nM c2 . r−M M M r r 1− M ≤ 1, we obtain from There are m = 2M M such sets. Since M r r (12.3.39), for M/r → 0, the bound if b ≥ c2 min max Eθ l(θ, θ) θb Θ0,r nM c2 r ≥δ . M log M For r = n and M bounded we obtain the bound M δn−1 log(n/M ) for c2 . We have shown that Proposition 12.3.4. For squared error loss, the asymptotic minimax rate for Θ0,n is n−1 log n. The natural way to achieve this rate is by hard or soft thresholding (Problem 12.3.11). Here δch (y) ≡ y1(|y| ≥ c) 352 Prediction and Machine Learning Chapter 12 and δcS (y) = (y − c sgn y)1(|y| ≥ c) are the hard and soft thresholded estimates, respectively. Essentially for hard thresholding, we test H : θj = 0 for each j and use a critical value 1 of the form K(n−1 / log n) 2 . If we accept, we take θbj = 0. Else, we leave θbj = Yj . On theoretical grounds, there is no reason to prefer one type of thresholding over the other. Summary. In Section 12.3.1 we go more fully into asymptotic theory for classification and regression, introducing optimal rates of convergence in various models, in particular, regular parametric ones. In Section 12.3.2 we introduce asymptotic minimax lower bounds and develop ways of obtaining these explicitly, in particular via testing, Theorem 12.3.2, and Fano’s inequality, and apply these to obtain minimax rates for nonparametric regression. We show the asymptotic rate optimality of the Nadaraya-Watson estimate under a first order Lipschitz condition. In Section 12.3.3 we introduce the Gaussian white noise model (GWN), which enables us to put nonparametric problems in a canonical context just as we did for parametric ones in Chapter 6. Section 12.3.4 discusses minimax rate bounds on IMSE for subsets of the GWN model and Section 12.3.5 establishes such a bound for a sparse submodel of the GWN model. 12.4 Oracle Inequalities As we mentioned in Section I.1 oracle inequalities are another approach to measuring the performance of prediction methods. In the minimax approach we consider, for a model P, all possible prediction rules from the point of view of their worst behaviour on P. The oracle inequality point of view is instead to consider a limited class of rules D. We compare the statistician’s performance to what an “oracle” who knows P but is restricted to using δ ∈ D could achieve, at each P ∈ P. Abstractly, suppose D = {δt }, t ∈ T , and we define, Rt (P ) = R(P, δt ) = EP ℓ(P, δt ) , where ℓ is the loss we consider. An oracle then picks t∗ (P ) such that, R∗ (P ) ≡ Rt∗ (P ) (P ) = min Rt (P ) . t Suppose the statistician has a data determined rule for choosing t, call it b t. We would like R(P, δbt ) ≈ R∗ (P ) . (12.4.1) An oracle inequality makes (12.4.1) precise through a statement such as R(P, δbt ) ≤ R∗ (P ) + ∆(P ) , (12.4.2) Section 12.4 353 Oracle Inequalities where ∆(P ) can be explicitly bounded independent of P . Example 12.4.1. As in Section I.4, we consider X = (Z, Y ) where Z is a vector of predictors and Y the response variable to be predicted. Let λ(y, d) denote the loss when d is the predicted value of y. Given a predictor δ : Z −→ Y we define the prediction loss of δ as the expected loss under P , Z ℓ(P, δ) = λ y, δ(z) dP (x) . (12.4.3) With this formulation we can also think of the oracle as choosing t∗ = arg min ℓ(P, δt ) . t Suppose we have a training sample X = X1 , . . . , Xn = (Z1 , Y1 ), . . . , (Zn , Yn ) from P and use the b t that minimizes a criterion for how closely δt (Zi ) predicts Yi , 1 ≤ in . In this case we want an inequality of the form, P ℓ(P, δbt ) ≤ ℓ(P, δt∗ ) + ∆(P, Pb , ε) ≥ 1 − ε (12.4.4) where Pb is the empirical probability and ∆(P, Pb , ε) can be bounded independent of P . An inequality of the type (12.4.4) implies one of type (12.4.2) under integrability conditions if ∆(P, Pb , ε) is boundable independent of ε. ✷ Oracle inequalities have three important features. (i) They provide information at each P (ii) They are “nonasymptotic” since the bounds ∆(·) are more or less explicit. (iii) They exhibit explicit important features of the problem such as the role of dimension on the performance of δt . The first two features are limited, however, since the first requires one to know how well the best of class D can do against a particular P (the oracle’s performance) and the second can be thought of as merely making asymptotic calculations carefully since the bounds are only of interest if the bound on ∆ tends to 0 as sample size n tends to infinity. Also, the bound when made independent of P corresponds to worst case behaviour, and is very conservative in practice. How do we choose b t? At first sight we might try b t = arg min ℓ(Pb, δt ) , (12.4.5) t Unfortunately, as we have seen in Section I.7, this will not work unless P is regular parametric. We are typically considering P = ∪t Pt where Pt is a sequence of nested parametric models with δt based on a minimum contrast estimate of the parameters in Pt . Then, (12.4.5) inevitably leads to choosing the largest model in P, i.e. “overfitting”. 354 Prediction and Machine Learning Chapter 12 As we have noted in Sections I.7, 9.1 and Chapter 11, we need to regularize, typically using a penalty pen(t) which can also depend on the data. That is we choose b t as, b t = arg min ℓ(Pb , δt ) + pen(t) , t and ∆(P ) as defined in (12.4.2) reflects the penalty. We will follow Tsybakov (2008) in first clarifying these notions in the context of the GWN model. Example 12.4.2. Nested linear models. Consider data (X1 , Y1 )T , . . . , (Xn , Yn )T i.i.d. as (X, Y ) where X is an N vector and Y is scalar, and consider the model (j) Yi = Xi β i + εi for some 1 ≤ j ≤ N, i = 1, . . . , n , where {εi } are i.i.d. N (0, σ02 ), and β(j) = (β1 , . . . , βj , 0, . . . , 0)TN ×1 . This is a sequence of nested regresssion models, Pj . As we did in Section 6.1.1 we can by an orthogonal transformation A(X1 , . . . , Xn ) transform the model to one involving its sufficient statistics only given by ε′ ηbk = ηk + √k , 1≤k≤N , n where the model Pj now corresponds to ηk = 0, k > j, and {ε′k } are i.i.d. N (0, σ02 ), 1 ≤ k ≤ N . This is just a GWN model with coordinates equal to zero after the N th entry. 12.4.1 Stein’s Unbiased Risk Estimate A key tool for studying and constructing estimates in the GWN model is Stein’s unbiased risk estimate (Stein (1981)). We follow our notation in Section 12.3.3 identifying ηbi with Yi and ηi with µi , where Yi = µi + εi , with ε1 , . . . , εn independent N (0, σ02 ). Let δ(Y1 , . . . , Yn ) be an estimate of µ ≡ (µ1 , . . . , µN )T . Using (8.3.33) we find that if δ satisfies the conditions of Theorem 8.3 and ∆(y) ≡ δ(y) − y , then, Eµ |δ(Y) − µ|2 = N X ∂∆j σ2 N σ02 + 2 0 Eµ (Y) + Eµ |∆(Y)|2 . n n ∂y j=1 If we define (12.4.6) √ σn ≡ σ0 / n we arrive at Stein’s unbiased risk estimate of R(µ, δ) − N σ02 /n, bU (δ) = 2σ 2 R n N X ∂∆j j=1 ∂µj (Y) + |∆(Y)|2 . (12.4.7) Section 12.4 355 Oracle Inequalities Following our general prescription, given a family {δ t }, t ∈ T , of estimates of η with δ t ≡ (δt1 , . . . , δtN , . . .)T , such that δtj = 0, j > N , it seems reasonable to pick b t by bU (δ t ). Consider the classes of estimates minimizing R D0 ≡ {δ λ (b η ) = (λb η1 , . . . , λb ηN )T , 0 ≤ λ ≤ 1} D1 ≡ {δ m (b η ) = (b η1 , . . . , ηbm , 0, . . . , 0, . . .)T , 1 ≤ m ≤ N } indexed by t = λ and t = m, respectively. The members of D0 are Stein-type shrinkage estimators. They “shrink” ηbj = Yj towards zero when 0 ≤ λ < 1, uniformly for all coordinates. The members of D1 are truncated estimators. 12.4.2 Oracle Inequality for Shrinkage Estimators We first compute the MSE of δ ∈ D0 for η ∈ PN where PN is the GWN model with mean vector (η1 , . . . , ηN , 0, . . .). This MSE is evidently rN,n (λ) ≡ N X j=1 E(λb ηj − ηj )2 = N λ2 N X σ02 ηj2 . + (1 − λ)2 n j=1 (12.4.8) Using calculus, for a given η the oracle risk is obtained for λ(η) = |η|2 , N σn2 + |η|2 where σn2 ≡ σ02 /n. By substitution in (12.4.8), the oracle risk is given by RO (N, n, η) = N σn2 |η|2 . N σn2 + |η|2 (12.4.9) To consider this problem in a different form we note that every δ ∈ D0 can for some γ be identified with the penalty estimator e δ γ ≡ arg min η N nX i=1 (Yi − ηi )2 + γ N X i=1 o ηi2 , which we recognize as the posterior mode if the ηi are considered as i.i.d. N (0, 1/γ), and also as ridge regression (Section 12.2.3) for the canonical Gaussian regression model of Section 6.1.1. For δ ∈ D0 and η ∈ PN , Stein’s unbiased estimate of risk is (Problem 12.4.1) bO = 2N σn2 (1 − λ) + (1 − λ)2 R N X j=1 ηbj2 , (12.4.10) 356 Prediction and Machine Learning Chapter 12 which when minimized on λ ∈ [0, 1] gives Stein’s positive part type estimate, δλb (b η) = N σn2 b. 1− η |b η |2 + (12.4.11) What can we say about the risk of this estimate? Theorem 12.4.1. For N ≥ 4 and η ∈ PN , the risk of δλb (b η ) for squared error loss satisfies R(η, δ λb ) = Eη |δ λb (b η ) − η|2 ≤ RO (N, n, η) + 4σn2 . Proof. (After Tsybakov (2008)). By (8.3.39) if δ ∗ (b η) = 1 − is the Stein estimator, then N σn2 b η |b η |2 R(η, δ λb ) ≤ R(η, δ ∗ ) . Now, by (8.3.31) R(η, δ ∗ ) = N σn2 − (N 2 − 4N )σn4 Eη |b η |−2 . But, by Jensen’s inequality, E|b η |−2 ≥ (E|b η |2 )−1 = |η|2 + σn2 N )−1 . Substituting in, after some arithmetic (Problem 12.4.1), R(η, δ λb ) ≤ 4σn4 N N σn2 |η|2 + ≤ RO (N, n, η) + 4σn2 . 2 2 |η| + N σn |η|2 + σn2 N ✷ Note. The oracle inequality penalty term for not knowing η does not depend on η. Remark 12.4.1. 2β (a) If we know β and can choose N we can achieve the minimax rate of n− 2β+1 over Θβ . 1 The oracle risk is achieved by taking N = n 2β+1 , since |η| is uniformly bounded over Θβ (Problem 12.4.1). (b) In the absence of the possibility of varying N , an asymptotically minimax rule is not possible using D0 . (c) On the other hand, as we shall see in the next subsection, using D1 , the estimate selected using Stein’s unbiased risk estimate is what we shall define as adaptively minimax over all β > 0. Section 12.4 12.4.3 357 Oracle Inequalities Oracle Inequality and Adaptive Minimax Rate for Truncated Estimates We first establish that the oracle minimax rate for η ∈ Θβ and D1 agrees with the general minimax rate of Section 12.3.4, and then show we can obtain this rate adaptively. Here, for δm ∈ D1 and σn2 = σ02 /n, ∞ X R(η, δm ) = mσn2 + ηi2 . (12.4.12) i=m+1 Let m0 (η) = arg min R(η, δm ) . m P∞ Then, if η ∈ Θβ = η : j=1 (j β ηj )2 ≤ M , some M > 0, since P∞ j=N +1 R(η, δm0 (η) ) ≤ N σn2 + N −2β M , ηj2 ≤ N −2β P∞ j=N +1 (j β ηj )2 . Minimizing over N we obtain 1 N = {2βM n/σ02 } 2β+1 and the oracle minimax rate 2β R(η, δm0 (β)) ≍ n− 2β+1 . (12.4.13) We proceed to show using an oracle inequality that for procedures in D1 , if we select m using Stein’s unbiased estimator of risk to obtain m, b then δ m b has the correct minimax rate for all β. This is an example of adaptive (asymptotic) minimaxity which we next define in the context of i.i.d. sampling. Generally, for a statistical procedure δbn with risk R, δbn (X1 , . . . , Xn ) is asymptotically minimax over a model P iff sup R(P, δbn ) : P ∈ P ≍ inf sup R(P, δ) : P ∈ P δ as n → 0 where δ ranges over all decision procedures. Now suppose P1 , P2 , . . . is a b sieve of models with ∪∞ m=1 Pm = M (with closure in law). Then δn is asymptotically b adaptive minimax over M iff δn is asymptotically minimax for each m. Note that δbn has no knowledge of M. Here is our procedure based on Stein’s unbiased risk estimator. For rules in D1 ∆ ≡ (∆1 , . . . , ∆n )T is given by ∆j = 0, j ≤ m; ∆j = −yj , m + 1 ≤ j ≤ n. It follows that Stein’s unbiased estimate of risk is R(m) = −2σn2 (n− m)+ n X j=m+1 b 2j = −(2ρ2 n+ η n X j=1 ηbj2 )+ (2ρ2 m− m X j=1 ηbj2 ) . (12.4.14) 358 Prediction and Machine Learning Chapter 12 The second term is just “Mallows’ Cp ” and since this is the only term that depends on m, minimizing Stein’s unbiased risk estimate leads to Mallows’ selector m b = arg min{2σn2 m − m X j=1 ηbj2 : 1 ≤ m ≤ n} . Theorem 12.4.2. For all η ∈ Θβ , all M , 0 < β < 1 2 2β − 2β+1 2 . sup Eη |b ηm b − η| : η ∈ Θβ = O n (12.4.15) The technical proof of this result is sketched in Problems 12.4.2–12.4.6. Theorem 12.4.2 does establish rate optimality as n → ∞, defined by limn sup P EP ℓ(P, δbt ) ≤ K0 . ℓ(P, δt∗ (P ) ) (12.4.16) Thus, as n → ∞, the statistician’s method, in this case Mallows’, achieves the same rate of convergence of the risk to 0 as does the oracle, uniformly on P. More difficult to achieve but evidently even more satisfactory is asymptotic equivalence defined by lim sup E n→∞ P ℓ(P, δbt ) =1, ℓ(P, δt∗ ) (12.4.17) uniformly on P. More sophisticated methods are able to achieve (12.4.17) for a class of rules. See Tsybakov (2008), Theorem 3.6. Remark 12.4.2. 1) Other Models: Although, as we have seen, the methods discussed apply to regression with Gaussian errors, the Gaussian white noise model can not be appealed to for density estimation or regression with non-Gaussian errors, not to speak of nonlinear semiparametric models such as logistic regression. However, deep results of Nussbaum (1996) and Brown and collaborators (1997, 2004) show that it is possible to generalize these conclusions to models which can be approximated in the strong sense of Le Cam’s distance between experiments and include the models cited. 2) Bayesian Selection: If we assume that P ∈ Pm0 for some m0 and if m0 is known we can estimate η efficiently using δ m0 and obtain the optimal minimax rate of n−1 . This is possible without knowing m0 if one uses Schwarz’s Bayes criterion (SBC) (Schwarz(1978) ), also called Bayes information criterion (BIC). (See also Rissanen (1983) for the equivalent (in this case), Minimum Descriptive Length criterion.) For SBC, for some constant c > 0, we select (see Section 12.6.1) m b = arg max m X j=1 ηbj2 − 2c2 m log n : 1 ≤ m ≤ n . Section 12.4 359 Oracle Inequalities If, however, η is in the closure of ∪∞ m=1 Pm , then the rate of convergence of the risk of bm bm the SBC η b is worse than that of Mallows’ η b by a factor of log n (Problem 12.6.1). Note that the Mallows’ rule has the same formulation as SBC with 2 log n replaced by 2. For the SBC m b (Problem 12.4.7), p σn−1 |b ηm 2 log n . b| ≤ 12.4.4 An Oracle Inequality for Classification The oracle inequality formulation in the forms (12.4.2) and (12.4.4) arose early on in the work of Vapnik (1998) on classification. Classification is a problem with a more complex loss function than quadratic loss and the difficulty of the problem is determined not only by that of estimating the regression, P [Y = 1|Z] , but also the contours of the regression, a task governed by what is called the margin (Mammen and Tsybakov (1995)). On the other hand, the boundedness of all variables involved makes inequalities come out as simple consequences of the empirical process bounds of Section 7.1 and Appendix D.2. Specifically consider a two-class problem so that the training samples (Zi , Yi ) are i.i.d. as (Z, Y ) ∼ P with Y = ±1 and Z ∈ H. In this case, δt : H → {−1, 1} . We use 0 − 1 loss so that R(P, δt ) = P [Y δ(Z) = −1] . Let λ(y, d) denote the loss when Y is classified as y ∈ {−1, 1}. An oracle would use t∗ (P ) that minimizes the classification loss ℓ(P, δ) = EP λ(Y ), δ(Z) , ℓ(P, δt∗ ) = min ℓ(P, δt ) . t Since we don’t know P but have a training sample X = {(Zi , Yi ) : 1 ≤ i ≤ n}, we want to construct δbt (·) such that R(P, δbt ) = E P Yn+1 δbt (Zn+1 ) = −1 X bt is small. A natural approach in the context of this section is to find an unbiased estimate R of R(P, δt ) and minimize it. If we know nothing about P a natural approach is to use n and use δbt where X bt ≡ 1 R 1 Yi δt (Zi ) = −1 n i=1 bt . b t = arg min R t 360 Prediction and Machine Learning Chapter 12 bb is an underestimate of R(P, δb). We use an oracle inequality to relate We expect that R t t R(P, δbt ) to R(P, δt∗ ) although this does not actually give us an idea of what R(P, δbt ) is. What an oracle inequality requires is ∆P such that Eℓ(P, δbt ) ≤ Eℓ(P, δt∗ ) + ∆P . Before going to an oracle inequality we establish the fundamental, Theorem 12.4.3 bt − ℓ(P, δt )| . R(P, δt∗ ) ≤ R(P, δbt ) ≤ R(P, δt∗ ) + 2E sup |R (12.4.18) t Proof. By definition, Hence, bb ≤ R bt∗ , R t R(P, δt∗ ) ≤ R(P, δbt ) . bb + sup |R bt − ℓ(P, δt )| ≤ R bt∗ + sup |R bt − ℓ(P, δt )| ≤ ℓ(P, δt∗ ) ℓ(P, δbt ) ≤ R t t t bt − ℓ(P, δt )| . + 2 sup |R t The theorem follows. ✷ We now go to empirical process theory. Suppose that the bracketing number of FT ≡ {ℓ(P, δt ) : t ∈ T } satisfies N[ ] δ, FT , L2 (P ) ≤ cδ −d , (12.4.19) for all P , δ sufficiently small, c independent of P . Then, we can use a simplified version of Theorem 7.1.3, given below, which is appropriate since |ℓ(P, δt )| ≤ 1, and we do not need to bound the variance of |ℓ(P, δt )| separately. Recall the notation En (ft ) = n−1 n X i=1 ft (Xi ) − EP ft (Xi ) . Theorem 12.4.4. (van der Vaart and Wellner (1996) Theorem 2.14.29). Given FT : {ft : t ∈ T } with ft satisfying, |ft |∞ ≤ 1, for all t ∈ T , and N δ, FT , L2 (P ) ≤ cδ −d for all P, δ, and c as in (12.4.19), then, for a universal constant D ≡ D(c), 1 D d −2a2 P n 2 sup |En (ft )|∞ > a ≤ √ e . d t∈T (12.4.20) Theorem 12.4.5. If FT satisfies (12.4.19) then √ − 21 D d 2π √ . R(P, δbt ) ≤ inf R(P, δt ) + n t 4 d (12.4.21) We can now give an oracle inequality: Section 12.5 Performance and Tuning via Cross Validation 361 Proof. Use Theorem 12.4.3 and (12.4.20) and the standard (see Example 4.4.7) formula, Z ∞ P [U ≥ v]dv if U ≥ 0 . EU = 0 ✷ Remark 12.4.3. 1. The bound (12.4.21) does not rely on P belonging to any family P and as such may be poor. For instance, suppose δt is the Bayes rule if P = Qt where {Qt : t ∈ Rp } is a regular parametric model. Then, although δbt is not chosen optimally, as it is in Example 12.3.1, it may still be shown that its Bayes regret is O(n−1 ) whenever 1 P = Qt0 for some t0 , rather than O(n− 2 ) given by (12.4.21). 2. On the other hand (12.4.21) is indeed correct for ascertaining minmax bounds over large families P — see Marron (1983) where the first such results were presented. 3. The bounds can give qualitative information on where ordinary asymptotic theory may not hold well such as the behaviour of the best one can do as a function of the bracketing number “dimension” d. For regular parametric families FT , this is just the dimension of the parameter whose effect on performance is not apparent in results such as Theorem 6.2.2. This additional type of information is a valuable feature of oracle inequalities. 4. This particular oracle inequality does not capture the phenomenon of “margin.” Roughly speaking, Bayes classification requires that we estimate, not necessarily P [Y = 1|Z = z] as a function of z well but rather its contours C = {z : P [Y |Z = z] = 21 }. This may be easy even if the function is estimated badly if the set of z’s near C has small probability. On the other hand the contours of a polynomial in several dimensions may be very hard to estimate. The correct combination of conditions on P which determine minmax classification risk was determined by Mammen and Tsybakov (1995). Summary. In this section we introduce the notion of oracle inequalities, the principal tool in the machine learning literature for characterizing the performance of classification rules both for restricted and unrestricted model classes P. They are of value not only in their own right but also as principal tools in establishing the performance of potentially minmax rules. We consider the first, oracle inequalities, for submodels of the Gaussian white noise model, in which we also study the behaviour of Stein’s unbiased risk estimate in constructing adaptive estimates. We next consider the basic type of such inequalities valid for all P considered first by Vapnik (1998) in the context of classification. The literature and uses of such inequalities is perpetually growing. For many more examples see Györfi et al (2002), Bishop (2006), and van de Geer (2000). A very thorough treatment is in Buhlmann and van der Geer (2010). 12.5 Performance and Tuning via Cross Validation In the previous section we have discussed some methods of choosing the tuning parameter t that determines the level of regularization. We continue this theme in Section 12.5 for a highly important approach referred to as cross validation. Not only does this approach 362 Prediction and Machine Learning Chapter 12 yield good prediction but it also provides methods of assessing prediction performance. We have considered prediction performance from the point of view of oracle inequalities. These bounds are worst case lower bounds on performance and as such are much too conservative. On the other hand we have, in the past, considered data-based measures such as confidence intervals which though not giving the guarantees of oracle bounds are asymptotically correct and do well in practice. The analogue in the “machine learning context” is to give reasonable estimates of misclassification probabilities and confidence bounds in regression. To illustrate the difference between oracle bounds and data-based methods consider the simple problem of estimating the mean µ of a distribution concentrated on [−1, 1] using a sample X1 , . . . , Xn i.i.d. as X. Our usual data-based approach outside this chapter would be to use 1 X̄ ± z1−α/2 σ bn n− 2 P as an approximate confidence interval, with σ b2 = n1 ni=1 (Xi − X̄)2 . On the other hand, we could give an oracle type guaranteed bound on µ. By Hoeffding’s inequality, (7.1.10), since |X| ≤ 1, √ v2 P [ n|X̄ − µ| ≥ v] ≤ 2e− 2 . This leads to guaranteed level (1 − α) confidence bounds for µ, p 1 X̄ ± − log(α/2)n− 2 . Evidently, if α and Var(X) are small, this is a grossly conservative bound and Hoeffding’s inequality gives a poor estimate of the performance of X̄. Thus the data-based bound is preferable. Cross validation is a data-based approach. Cross validation will be used to estimate criteria functions. Let δt denote a decision rule with tuning parameter t ∈ T . Crude estimates of performance such as the empirical misclassification rate n 1X 1 δt (Zi ) 6= Yi n i=1 (12.5.1) are poor, as oracle inequalities make clear. Since regularization is based on optimizing an estimate of performance over a family of procedures, using a good estimate of performance is of importance for choosing the tuning parameter that corresponds to an appropriate amount of regularization; see Arlot and Celisse (2010). We pursue this aspect of data determined selection of the tuning parameter that yields the optimal estimated risk, returning to measurement of performance at the end. 12.5.1 Cross Validation for Tuning Parameter Choice A number of methods involving sample splitting or approximations to such methods have come under the heading of cross validation. The one most widely studied in the statistical literature is Section 12.5 Performance and Tuning via Cross Validation 363 Leave one out cross validation and empirical risk minimization The cross validation idea works for any prediction problem and is closely related to the jackknife — see Section 10.3. Let S ≡ {Xi = (Zi , Yi ), i = 1, . . . , n} with Y1 ∈ R, Zi ∈ Rp , be an i.i.d. sample, let t be a tuning parameter, and {δt (·, X1 , . . . , Xn ) : t ∈ T } be a family of predictors which has been determined, for example, by the fitting of a sieve. Let l(y, d) be the prediction loss corresponding to the response y to be predicted and decision d. Thus, we have the examples l(y, d) = 1(y 6= d) 2 l(y, d) = (y − d) 0-1 loss squared error loss The naive estimate of the Bayes risk of δt (·) is, as we have noted, the empirical risk, n X b t) = 1 R(δ l Yi , δt (Zi ) . n i=1 Although typically asymptotically correct for any fixed t, this is, as we have noted, overoptimistic, since the performance of the rule is being judged on the sample used to create it. One way to reduce this negative bias is to use leave one out cross validation (jackknife) with risk defined as n X (−i) bJ (δt ) = 1 l Yi , δt (Zi ) , R n i=1 (12.5.2) (−i) where δt is a rule based on X (−i) ≡ {Xj , j 6= i}. Implicit in this is the assumption that δt (·|X (−i) ) is defined. We consider δt that can be written ′ δt (·|X1′ , . . . , Xm ) = δt (·|Pbm ) ′ where Pbm is the empirical distribution of a subsample X1′ , . . . , Xm , of S and δt (·|P ) is assumed to be defined for all P used in the following discussion. An indicator of the bJ (δ) is an unbiased estimate of the Bayes promise of leave one out cross validation is that R risk of δ for sample size n − 1, that is, Z bJ (δ) = E l(y, δ(z|Pbn−1 )dP (z, y) . ER (12.5.3) We define the leave one out cross validation choice of t as bJ (δt ) . b tJ = arg min R Barron, Birgé and Massart (1999) exhibit oracle inequalities for δbtJ and show a close relation between b tJ and Mallows’ Cp . V fold Cross Validation 364 Prediction and Machine Learning Chapter 12 A more general, computationally faster and much more broadly applicable method is V fold cross validation. We consider t ∈ T ≡ {1, . . . , Mn } . The sample S is split into V disjoint parts and each part has m = n/V observations. Call the V subsamples S1 , . . . , SV . V = 5 and V = 10 are customary choices. If n = 500 and V = 10, we will have 10 subsamples of S, each with m = 50 observations. The bV of the Bayes risk of a rule δt is defined as follows. Let S (−j) = S − Sj and let estimate R (j) δt = δt (·|P (j) ), where P (j) is the empirical distribution of {Xi : Xi ∈ S (−j) }. Then, for loss function ℓ, X (j) bV (δt ) ≡ 1 R l(Yi , δt (Zi )) : i ∈ Sj , j = 1, . . . , V mV . (12.5.4) (j) That is, leave out the first m X’s, use the remaining m(V − 1) X’s to compute δt , and average the loss on the m X’s left out. Then do the same for the next m observations, and so on for the V subsamples, then average over the subsamples. In practice, the division into bV ’s averaged. V pieces may be made several times and the resulting R Define the V fold cross validation tuning parameter selector as bV (δt ) . b t ≡ arg min R We will consider properties of δbt for the problem of estimating µ(z) = E(Y |Z = z) using the squared error loss function, ℓ(y, d) = (y − d)2 . In particular, we shall state a set of assumptions under which V fold cross validation for quadratic loss is optimal in a sense we shall define. We give the argument for sample splitting cross validation. Our proof will make it clear that V fold cross validation is argued in the same way. We split the sample (Z1 , Y1 ), . . . , (Zn , Yn ) into a test sample of cardinality m and a training sample of cardinality n − m. The predictor is computed for the training set and its loss is evaluated at the test set. Without loss of generality let the test set be S1 = (Z1 , Y1 ), . . . , (Zm , Ym ). Use the notation, for a ≡ a(Z, Y ), Z kak2P ≡ a2 (z, y)dP (z, y) m kak2m ≡ 1 X 2 a (Zi , Yi ) . m i=1 We introduce the following assumptions and further notation. A0: Y is bounded: |Y | ≤ C < ∞. A1: Write δt (·) for δt (·|P (1) ), where P (1) is the empirical distribution of the training set {Xi : Xi ∈ S1 }. Section 12.5 Performance and Tuning via Cross Validation 365 Let t∗ = arg min kY − δt (Z)k2P = arg min kµ(Z) − δt (Z)k2P . t t Assume there exists a best oracle rule defined by δO (z) ≡ δt∗ (z) which for fixed P ∈ P, and some sequence rn , satisfies: kδO (Z) − µ(Z)k2P = rn2 1 + o(1) . A2: There exists a best “CV selected rule” defined by where b t = arg min kY − δt (Z)k2m . δb ≡ δbt t A3: Let ∆t (Z) = |δt (Z) − µ(Z)|, and assume that for fixed P ∈ P, max t n k∆ (Z)k2 o t m − 1 : t ∈ {1, . . . , M } = oP (1) . n k∆t (Z)k2P If we assume |δt (Z)| ≤ |Y | ≤ Cn < ∞ for all t ∈ {1, . . . , Mn } and n−1 log Mn = o(1), then A3 holds by Hoeffding’s inequality. 1 A4: log 2 Mn 2 mrn = o(1). Let Theorem 12.5.1. Under A0–A4, rbn = rn 1 + oP (1) . rbn = kδbt − µkP . (12.5.5) Remark 12.5.1. For Theorem 12.5.1, 1) A0 can be weakened with the result still holding. 2) The conditions can be strengthened to yield an oracle inequality form of the Theorem. 3) The difference between this sample splitting result and the result for V > 2 is that sample splitting is only carried out once to form the risk estimate. This is inessential because the average of V such risk estimates is more accurate although the summands are highly correlated. 366 Prediction and Machine Learning Chapter 12 4) It is, as we have seen, common to have for P ∈ P, nonparametric, rn ∼ cn−1+δ , δ>0. If Mn = nK , K < ∞, and m = n/ω(n), where ω(n) ↑ ∞ slowly enough and A3 and A4 hold, then the theorem holds. The take-home story is that V fold cross validated choice where V = ω(n), i.e. is almost bounded, yields oracle prediction for quadratic loss. Proof of Theorem 12.5.1 By definition, b b) ≤ R(δ b t∗ ) R(δ t (12.5.6) kY − δt∗ k2P ≤ kY − δbt k2P . (12.5.7) Rewrite (12.5.6) as m m 2 X 2 X 2 b b µ(Zi )− δ(Zi ) εi +kδ−µkm ≤ µ(Zi )−δ0 (Zi ) εi +kδ0 −µk2m (12.5.8) m i=1 m i=1 where εi ≡ Yi − µ(Zi ). Note that m 1 X b i ) εi µ(Zi ) − δ(Z m i=1 2 m n 1 X µ(Zi ) − δt (Zi ) εi 2 o b 2 . ≤ max kµ − δk m 1≤s≤Mn m i=1 kµ − δt km (12.5.9) The first factor of the RHS of (12.5.9) is of the form m n 1 X Un ≡ max ait εi m i=1 2 : t ∈ {1, . . . , Mn } o where {ait } and εi are independent given Z1 , . . . , Zm , E(εi |Z1 , . . . , Zm ) = 0 and P m 2 i=1 ait = 1. By an inequality of Pinelis which is an application of Hoeffding’s inequality (7.2.9), and the boundedness of the εi , |εi | ≤ 2C, we have EUn ≤ 16C 2 log Mn /m . (12.5.10) Conditioning on Xm+1 , . . . , Xn we obtain from (12.5.8)–(12.5.10) that 1 Next 1 b 2 ≤ 4Cm− 2 (log Mn ) 2 kµ − δk b m + kµ − δ0 k2 . kµ − δk m m k∆s k2m k∆bs k2m − 1 ≤ max − 1 = oP (1) , 2 t k∆bs kP k∆s k2P (12.5.11) (12.5.12) Section 12.6 Model Selection and Dimension Reduction 367 by A3, and the same holds for ∆s0 . From (12.5.11) and A4, we obtain b 2 ≤ 1 + oP (1) kµ − δ0 k2 , kµ − δk m m and applying (12.5.12) the theorem follows. ✷ Remark 12.5.2. This V fold criterion is the most widely used one for tuning parameter choice. It can be shown (Barron, Birgé and Massart (1999), Propositions 1, 2 and Theorem 3) that if we consider multiple regression with all 2p − 1 submodels possible as a sieve and p ∼ log n, then leave one out cross validation does not regularize least squares sufficiently while V fold, for suitable V, n, p, does. We now turn briefly to assessment of performance. 12.5.2 Cross Validation for Measuring Performance b is typically an underestimate of misclassification As we have noted, the naive estimate R risk. Efron (1983) shows both theoretically, through higher order expansion, and experibJ is a better estimate mentally, by Monte Carlo, that the leave one out jacknife estimate R b of the Bayes risk than R. He develops an even better procedure using a combination of 2 b In this case, 2 fold cross validation itself is not very satisfacfold cross validation and R. tory since one is essentially unbiasedly estimating the behaviour of the rule using a training b sample of size n/2, thus overestimating the Bayes risk for a sample of size n. Since R is an underestimate it is reasonable to combine the two and Efron shows that the weights b are best. It seems plausible, in view .632 for the cross validation estimate and .368 for R of Theorem 12.5.1, that using log n fold cross validation to estimate the Bayes risk of say δbbt should behave well. Unfortunately, using log n cross validation for t, itself chosen using log n cross validation, is apt to yield an underestimate. Arlot has shown that using Efron’s .632 estimate to select s has given excellent results; see Arlot and Celisse (2010). But again the goals of performance assessment and optimization clash although we could apply the .632 approach to the optimized rule. Summary. In this section we first study cross validation, the chief method for selecting tuning parameters in prediction methods, and show the wide applicability of V fold cross validation as opposed to leave one out cross validation with a theorem on the asymptotic optimality of V fold cross validation for squared error loss. Actually estimating performance of a final data determined choice of decision rule is difficult although Efron’s .632 rule may be generally helpful. 12.6 Model Selection and Dimension Reduction The first of these two topics starts with a model point of view more akin to Chapters 1–5 where we focussed on parametric models. We still, as in the previous sections, assume we have a nested sieve of models whose closure is nonparametric, but add the assumption that one of the finite dimensional members of the sieve is generating the data. Specifically, if 368 Prediction and Machine Learning Chapter 12 we are, say, considering regression models with a fixed number k of real factors Z1 , . . . , Zk and we can expand E(Y |Z) by means of tensor bases to a function of an infinite number of regression covariates, as in Example 12.3.1, we assume that one of these, say a polynomial of order d in the factors, is, in fact, generating the data. Our second topic is the more fundamental but the hardest to quantify. The point of departure is a set of data, say a sample from a population, which is very high dimensional and has a very complex representation. The object is to find a low dimensional, simple representation which can give subject matter insights and lead to efficient prediction and classification procedures. There is no commitment to a particular model a priori. We are dealing, for example, with a sample of small size in relation to the dimension of the data so that there is no hope of nonparametric estimation of the density. We seek to find a representation of the data and/or the model which is sufficiently low dimensional to be worked with, but sufficiently accurate to enable us to find the essential features of the situation we are analyzing. In this context we will introduce Principal Component Analysis (PCA), a classical method for dimension reduction. We will discuss some difficulties of this and related methods in modern applications. We will also discuss the relationship between the importance of factors and sparsity of models. 12.6.1 A Bayesian Criterion for Model Selection We consider the problem of selecting a model M from a sieve of regular parametric models, M0 ⊂ · · · ⊂ Mp , where Mk ≡ {Pθ(k) (·) : θ(k) ∈ Θk open in RNk , N0 < . . . < Np } . For a given P in the sieve let k0 = k0 (P ) be the smallest k such that P ∈ Mk , and let (k ) θ0 = θ0 0 ∈ Θk0 denote the subscript on this P . Then P is also in the models with k > k0 . (k (P )) If P is in the sieve and generates the data, k0 (P ) and θ 0 0 are assumed identifiable. (k ) We will discuss the asymptotic determination of k0 and estimation of θ0 0 within Θk0 . Here we consider the simplest model Mk0 to be the most parsimonious or “best” model and we want to have high probability of selecting Mk0 for large sample sizes. To focus ideas, first consider a simple sieve with data Xi = (Zi , Yi ) i.i.d. and the familiar regression model, Yi = β0 + d X βj Zij + εi , i = 1, . . . , n j=1 where εi are i.i.d. N (0, σ 2 ), Zi = (Zii , . . . , Zid )T . Suppose we can order the factors, Z1 , . . ., Zd , in terms of importance to Y . Then, Mk is parametrized by θ(k) ≡ (β0 , . . . , βk , σ 2 )T , Θk ⊂ Rk+2 , Np = d + 2, and P ∈ Mk iff βk+1 = · · · = βd = 0. Determining k0 , the number of factors Zj with non-zero beta coefficients, is the subject of much research. We could try the following: “Consecutively test Hj : βj = 0 vs βj 6= 0, Section 12.6 Model Selection and Dimension Reduction 369 j = 1, . . . , d, stopping as soon as we do not reject Hj , and set k0 = j − 1”. The main issue with this approach is what critical value to use, that is, how to balance type I and II error probabilities. Here we turn to the simplest approach, which is based on a Bayesian point of view. Pp Returning to the general case, we place prior mass αk on Mk , k=0 αk = 1, αk > 0. That is, αk is the probability that k0 = k. Given k0 = k and Pθ ∈ Mk , we assume θ = θ(k) has a prior density πk (·) on Θk where πk > 0 and continuous. A natural rule for choosing k0 (P ) is “Select b k which maximizes the posterior probability α bk of Mk ”. That is, b k = arg max α bk ≡ arg max Prob(k0 = k|X) k k T where X = (X1 , . . . , Xn ) with X1 , . . . , Xn i.i.d. as X ∈ Rq , q ≥ 1. We prove, under suitable regularity conditions, a theorem which gives a general asymptotically optimal b Bayes rule b k. Schwarz (1978) gave this result for the special case of exponential families. In our Bayesian framework, we assume that given k and θ(k) , Xi has density p(·|θ(k) ), θ(k) ∈ Θk . Then the density of X given k is Z Πni=1 p(xi |θ(k) )πk (θ (k) )dθk . p(x|k) = Θk Let Ik (θ (k) ) = Eθ (k) ∇∇T log p(X|θ(k) ), θ(k) ∈ Θk , be the Fisher information matrix and let K(θ(k) , θ0 ) = −Eθ0 log p(X|θ(k) )/p(X|θ0 ) be the Kullback-Leibler divergence (k ) where θ 0 ≡ (θ01 , . . . , θ0Nk0 )T ≡ θ0 0 and k0 = k0 (P ) corresponds to the true P . We assume (i) {p(·|θ(k) )}, k ≥ 0, satisfy the conditions of Theorem 6.2.3, for each Mk . (k) (ii) θ∗ ≡ arg min{K(θ(k) , θ0 ) : θ(k) ∈ Θk } > 0 for k < k0 . (iii) If k < k0 , and Pθ 0 . generates the data, then the assumptions of Theorem 6.2.1 apply (k) with ψ = log p(x|θ(k) ) and θ(P ) = θ∗ Conditions that imply (iii) can be obtained from the M -estimate consistency results of b(k) maximizes n−1 Pn log[p(Xi |θ(k) )/p(Xi |θ 0 )] so that the Chapters 6 and 9 because θ i=1 corresponding population parameter minimizes K(θ(k) , θ0 ) over θ(k) ∈ Θk , k < k0 . By Bayes formula, for the appropriate constant c > 0, the posterior probability of model Mk is Z α bk = αk Πni=1 p(xi |θ(k) )Πk (θ (k) )dθ(k) /c . We next give approximations to the maximizer b k of α bk and establish their consistency in a frequentist context. 370 Prediction and Machine Learning Chapter 12 Qn b(k) ) where θ b(k) is the Theorem 12.6.1. Under (i), (ii), and (iii), let Lk = i=1 p(Xi |θ MLE of θ (k) in model Mk , 0 ≤ k ≤ p. Suppose the data is generated by P ≡ Pθ 0 with (k0 ) θ0 = θ0 . Then, if | · | denotes determinant, (a) For k > k0 , as n → ∞, log α bk n 1 Lk k − k0 |Ik (θ (k0 ) )| αk = log − log − log 0 (k) + log + oP (1) . α bk0 Lk0 2 2π 2 αk0 |Ik (θ )| 0 (12.6.1) (b) For k < k0 , α bk P −→ 0 as n → ∞ . α bk0 b (c) If b k is the maximizer of α bk and b k maximizes b 2 log Lk − k(log n − log 2π) + log αk + log |Ik (θ (12.6.2) (k) )| , b then b k is consistent in the sense that b b P [b k = k = k0 (P )] −→ 1 as n → ∞ . (12.6.3) (12.6.4) (d) The universal criterion 1 k ∗ = arg max{log Lk − k log n} 2 satisfies P k∗ = b k = k0 (P ) → 1 as n → ∞ . Proof. We begin with (12.6.1). Lettting k > k0 , write Z Y n p(Xi |θ(k0 ) ) α bk Lk πk0 (θ (k0 ) ) dθ (k0 ) log = log − log (k0 ) α bk0 Lk0 b ) i=1 p(Xi |θ Z Y n (k) p(Xi |θ ) αk + log πk (θ (k) ) dθ (k) + log . (k) αk0 b ) i=1 p(Xi |θ (12.6.5) (12.6.6) Arguing as in the proofs of Theorems 5.5.2 and 6.2.3, reasoning as for (5.5.12), noting that (k) (k ) Pθ(k) = Pθ(k0 ) , and both θ0 and θ 0 0 are interior points of their respective parameter 0 0 spaces, we find that for n → ∞ the two integrals are Z 1 (k0 +1) k0 +1 (k ) exp − tT Ik0 (θ 0 0 )t dt 1 + oP (1) A = (2π) 2 n− 2 2 Z 1 T (k+1) (k+1) (k) − B = (2π) 2 n 2 exp − t Ik (θ 0 )t dt 1 + oP (1) 2 Section 12.6 Model Selection and Dimension Reduction 371 where the A integral is over RNk0 and B is over RNk . Evaluating A and B, using the fact that the integral of a density is 1, we obtain (12.6.1). Note that even though the information in Mk , Mk0 is evaluated at the point P , they are typically different. Now, (12.6.1) follows. To prove (12.6.2) note that, for k < k0 , Z Y n h p Xi |θ(k0 ) i (k ) αk ) = log log(b αk0 /b πk0 (θ0 0 ) dθ(k0 ) (k0 ) b i=1 p Xi ||θ Z Y n h p(Xi |θ (k) ) i αk0 (k) (k) − log π (θ ) dθ + log k αk b(k0 ) ) i=1 p(Xi |θ ≡ C1 + C2 + C3 . The first term is identical to the corresponding k > k0 term and thus converges to − log A, which is negative and of order log n. We establish in (12.6.2) by showing that the second term is positive and of order n. Note that, by Jensen’s inequality, Z Y n h b(k0 ) ) i p(Xi |θ πk (θ (k) ) dθ (k) (k) p(X |θ ) i i=1 Z X n h b(k0 ) ) i p(Xi |θ log ≥ πk (θ (k) ) dθ (k) . (k) p(X |θ ) i i=1 C2 = log (12.6.7) b(k) is a MLE, p(x|θ b(k) ) ≥ p(x|θ (k) ). After replacing θ (k) with θ b(k) only the πk Because θ part of the integrand depends on θ (k) . Because the integral of a density is one, we obtain n b(k0 ) )/p(Xi |θ0 ) o p(Xi |θ 1 Xn 1 log C2 ≥ n n i=1 b(k) )/p(X |θ ) p(X |θ i i 0 n n b(k0 ) ) p(Xi |θ0 ) 1X p(Xi |θ 1X log + log = n i=1 n i=1 p(Xi |θ0 ) p(Xi |θ (k) ∗ ) n b(k) ) p(Xi |θ 1X log ≡ D1 + D2 + D3 . − n i=1 p(Xi |θ(k) ∗ ) The first term D1 converges by the law of large numbers almost surely to K(θ(k) ∗ , θ 0 ) > 0. For the second term, note that 2nD2 is the likelihood ratio statistic for testing H : θ = θ0 , and thus converges in law to a χ2k0 random variable by Theorem 6.3.2. It follows that D2 = Op (n−1 ). We can also show that 2nD3 converges in law to a χ2k random variable by arguing as we did to obtain Theorem 6.3.2 from Theorem 6.2.2. Thus D3 = Op (n−1 ) also. We conclude that C2 is positive and of order n. Thus αk /αk0 → 0 as n → ∞ when k < k0 . To show consistency of b k, note that maximizing α bh is equivalent to maximizing ρbk ≡ k 6= log[b αk /b αk0 ]. Because ρbk0 = 0, then max ρbk ≥ 0. We next use this to show that P (b k0 ) → 0. 372 Prediction and Machine Learning Chapter 12 a) If k < k0 , then P (b k = k) ≤ P (b ρh ≥ 0) → 0 by (12.6.2). b) If k > k0 , then, by (12.6.1), P (b k = k) ≤ P ρbk ≥ 0| = P (2 log[Lk /Lko ]) ≥ (k − k0 ) log n + oP (1) . k = k) → 0. By Theorem 6.3.2, 2 log[Lk /Lk0 ] converges to a X 2 variable. Thus, P (b b b Next consider the consistency of k. It is of the form arg max{b ρk + ak } where ak = b b Op (1), and thus if k 6= k0 , then P (k = k) ≤ P ρbk ≥ Op (1) → 0 as before. Similar b b and k ∗ . arguments apply to k ✷ Remark 12.6.1 (a) As we have noted in Section I.7, in the context of regression models, the asymptotic Bayes criteria k ∗ in (12.6.5), (called SBC and BIC) has the same structure as Mallows’ Cp but with log n replacing 2. So log n corresponding to a test with significance level tending to 0 gives the correct threshold according to this criterion. The equivalence up to threshold continues to hold for any sequence of smooth nested models if Cp is replaced by the Akaike AIC criterion. (b) As we have noted in the Gaussian white noise (GWN) framework, Section 12.4, if the models are not nested, for instance, if we consider all regression models with d or fewer variables, the correct threshold for prediction changes to something close to BC. For instance, this is the case if we modify Section I.7 slightly and suppose we observe Xij = µj + εij , j = 1, . . . , d, i = 1, . . . , n, with εij i.i.d. N (0, 1), that is, one way ANOVA. Let any subset {µj1 , . . . , µjk }, k ≤ d, correspond to the model where µj1 = · · · = µjk = 0, all√other µ’s vary freely. Then, for fixed j, the best test is to reject H : µj = 0 if |X j | ≥ c/ n. If we want to choose the correct model with probability tending to 1 we need to take c of order √ 2 log d. But if we consider the sparse set d X 1(µj 6= 0) ≤ M ΘM ≡ µ : j=1 1 we saw in Section 12.3.3 that, if M is bounded, hard thresholding with c of order (log d) 2 1 yields the minimax MSE order (n−1 log d) 2 . (c) Software for the SBC rule can be found under “BIC” in “R” and other software packages. (d) The prior αk does not appear in the rule k ∗ . ✷ 12.6.2 Inference after Model Selection b denote a consistent estimate of k0 . Again consider the sieve of Section 12.6.1 and let k b Given consistency of k it follows that acting as if Mbk is the true model gives asymptotically Section 12.6 Model Selection and Dimension Reduction 373 correct results for testing and estimation of the parmeters θ1 , . . . , θk0 . To see this suppose L Tn is a standardized estimate or test statistic such that Tn −→ T as n → ∞ for model Mk0 , then P0 (Tn ≤ t) = P0 (Tn ≤ t|b k = k0 )P0 (b k = k0 ) + P0 (Tn ≤ t|b k 6= k0 )P0 (b k 6= k0 ) which implies P0 (Tn ≤ t|b k = k0 ) → P0 (T ≤ t) as n → ∞. Consider our regression example with d coefficients β1 , . . . , βd and p = d + 2. It seems rather unlikely in most cases that the number of factors p − k0 which have zero beta coefficients bears no relation to the selection of k and asymptotic inference. Put another way, if p − ko → ∞, it is unlikely that there exists a k0 = k0 (P ) that is bounded independent of p. We expect indeterminancy of k0 to affect inference procedures. In this case, the regression example makes clear what happens. Suppose Zj = hj (Z) are a basis of L2 (Z) where Z is a univariate regressor and consider the model Yi = d X βj hj (Zi ) + εi , i = 1, . . . , n j=1 R where {εi } are i.i.d. N (0, σ 2 ) and Ehj (Zi ) = 0. If the hj are orthogonal, i.e. hj (z)hk (z) dP (z) = 0, j 6= k, the MLE βbj of βj stabilizes no matter how big p fixed is since the vecT tors hj (Z1 ), . . . , hj (Zn ) are asymptotically orthogonal. Using Example 6.2.1, we can show that βbj = n X i=1 n X h2j (Zi ) + oP (1) Yi hj (Zi ) (12.6.8) i=1 and arguing formally E(βbj ) = E E(βbj |Z) = βj + o(1) because E(βbj |Z) = and E(Y |Z) = E(Y hj (Z)|Z) + oP (1) = βj + oP (1) , Eh2j (Z) Pd j=1 Varβbj = n−1 (12.6.9) βj hj (Z). Similarly, by (1.4.6), E[h2 (Z)Var(Y |Z)] + o(n−1 ) [Eh2 (Z)]2 (12.6.10) since E(βbj |Z) is nearly constant. Thus inference does not depend on the number of nonzero coefficients to first order. However if the hj (Z) are not orthogonal, Example 6.2.1 shows how βbj depends on h1 (Z), . . . , hd (Z) in general. For fixed 1 ≤ k0 < d E(βbj |Z) = E(Y hjk0 (Z)|Z) + oP (1) Eh2jk0 (Z) (12.6.11) 374 Prediction and Machine Learning Chapter 12 where hjk (Z) = hj (Z) − Π hj (Z)|hl (Z), 1 ≤ l ≤ k0 , l 6= j and Π is the projection of Z on the linear span of h1 (Z), . . . , hk0 (Z). This reduces to n Pd hk0 (Z)hjk0 (Z)βl o + o(1) 2 (Z) Ehk 0 Pd l=k0 +1 E hl (Z)hjk0 (Z) βl + 0(1) . = βj + 2 (Z) Ehk 0 E(βbj ) = E l=1 (12.6.12) Thus, βbj is biased with the bias depending on d. Its true variance is not the same as that obtained by acting as if b k = k0 is correct. If we let d → ∞ which makes b k → ∞, the bias goes away but at a rate lower than n−1 . This is the same phenomenon we have observed with nonparametric regression and kernel density estimates. The minimax root MSE rate is 1 of order smaller than n− 2 and both bias and variance are of that order. Although asymptotic normality still holds, mean and variance depend on the smoothness assumptions made on the nonparametric function being estimated. 12.6.3 Dimension Reduction via Principal Component Analysis Approximating nonnparametric models by finite dimensional parametric models is one way of reducing the dimension of the problem and in Sections 9.1.4, 11.4.2, 12.1.1 and 12.2.3 we have seen how sieves play a role in this approach. Techniques which also play a major role involve approximating high dimensional data by lower dimensional data. A very important method in this regard is principal component analysis (PCA) which is applied to samples {Xi : 1 ≤ i ≤ n} of p dimensional vectors where p is large and can exceed n. It is a way of reducing the number of covariates by replacing them by weighted sums of covariates thereby reducing the number of covariates. The theory is developed in Rao (1973), Seber (1984) and Anderson (2003) while applications can be found in Mardia et al (1979); see also Examples 8.3.11(continued), 9.1.5, Problem 8.3.24, Johnson and Wichern (2003), and Hastie, Tibshirani and Friedman (2009). The focus is on the covariance matrix Σ(P ) of X = (X1 , . . . , Xp )T , where the X’s have been centered so that EX = 0. By spectral theory (see Appendix B.10), Σ(P ) = K X λj (P )ξ j (P )ξ Tj (P ) j=1 where K ≤ p, the λj are (not necessarily distinct) eigenvalues of Σ(P ), and the p dimensional eigenvectors {ξj } are orthonormal. The linear combination Zj = ξTj X, 1 ≤ j ≤ K is called the jth principal component (PC). A small K corresponds to a sparse subspace of the parameter space of Σ. Section 12.6 375 Model Selection and Dimension Reduction Although the eigenvalues up to this point appear as artificial constructs they are quite natural parameters. If λ1 (P ) > · · · > λp (P ) are distinct, the eigenvector ξ 1 corresponding to λ1 (P ) solves the problem ξ 1 = arg max{Var(aT X) : |a| = 1} . That is, the first principal component Z1 is the linear combination P a21j = 1 with the maximum variance. Moreover (Problem 12.6.3) λ1 (P ) = Var p X j=1 P ξ1j Xj = ξ 1 Σ ξ1 . a1j Xj subject to (12.6.13) That is, ξ1 is the maximizer of aT Σa. P second principal component Z2 is the linear combination a2j Xj subject to P The a22j = 1, aT1 a2 = 0, with the maximum variance, and so on. Also p X a2j Xj ) λ2 (P ) = Var( j=1 Pp Pp j=1 VarXj . l=1 λl (P ) = Even more, ξ1 equivalently gives the direction of the line in Rp through the origin closest, on average, to X. More precisely, let x = αa, α ∈ R, be the line through the origin in direction a, where aT a = 1. The distance between an arbitrary point y ∈ Rp and the line is yT y − (aT y)2 (Problem 12.6.4). Next consider an unobserved random vector X in Rp with E(X) = 0 and Cov(X) = Σ. Finding the line αa such that the expected squared distance of X from the line is a minimum means maximizing and E(aT X)2 = aT E(XXT )a = aT Σa . Or in other words, ξ 1 solves the minimization problem (α1 , ξ1 ) = arg min{E|X − αξ|2 : α ∈ R, |ξ| = 1} α,ξ We have shown that ξ1 defines the one dimensional (dependent on L(X)) linear subspace of Rp which is, on average, closest to X. Similarly ξ 2 defines the line in Rp through the origin perpendicular to αξ 1 closest to X, etc. The eigenvectors and eigenvalues can be estimated from the empirical covariance mab using established methods such as singular value decomposition, yielding λ bj and trix Σ b ξj = (ξb1j , . . . , ξbpj )T , 1 ≤ j ≤ p. The estimated eigenvalues are arranged in decreasing order and thresholded to 0 when they become too small. The resulting first d < p eigenvectors and corresponding eigenvalues give a dimension reduction by replacing the original b i ; 1 ≤ i ≤ n}, where {Xi ; 1 ≤ i ≤ n} by the sample principal components {Z bij = Z p X k=1 b ξkj Xik , 1 ≤ j ≤ d, 1 ≤ i ≤ n . 376 Prediction and Machine Learning Chapter 12 These principal components are uncorrelated. It follows that if we regress {Yi } on bij } using least squares, the fitted regression is {Z b = Ȳ 1 + Y d X j=1 bj ηbj Z P P b j = (Z b ij −Z b1j , . . . , Z bnj )T and ηbj = n (Yi −Ȳ )(Zbij −Z b̄j )/ n (Z b̄j )2 . In the where Z i=1 i=1 context of the regression model where we observe (X1 , Y1 ), . . . , (Xn , Yn ) i.i.d., Xi ∈ Rp , Y ∈ R with p large, sparse modeling with small d makes it possible to do classification b 1 , Y1 ), . . . , (Z b n , Yn ). and prediction using (Z Association studies In the preceding PCA approach the effects of the individual original variables on the response in not apparent. In genome-wide association studies (GWAS) this is fixed by bij }, 1 ≤ i ≤ n, singling out one variable Xk one at a time and regressing Yi on {Xik , Z 1 ≤ j ≤ d. When the number of X’s p is larger than n, this approach is difficult to bij is unstable. However, for testing the hypothesis that Xk and Y implement because Z are uncorrelated, the software package “eigensoft” give tests that produce approximately correct Bonferroni p-values for balanced case-control studies where the number of cases and the number of controls are nearly the same, and other assumptions are satisfied (see Yang (2013), and Yang, Doksum, and Tsui (2014)). b depend on the scale of the X’s, the corresponding Remark 12.6.2. Because Σ and Σ correlation matrices are often used in their place. ✷ Remark 12.6.3. One approach to sparse PCA is to use only the first 10 principal compobj , j > 10, are nearly zero. When p is nents which often can be justified by checking that λ large, using d = 10 PC’s often produce nearly the same results as d chosen P by model selection procedures, e.g. cross-validation. Another approach is to replace pj=1 a2kj = 1 in the Pp optimization PC problem with adjusted versions of the Lasso type condition j=1 |akj | = 1. Adjustments are made to obtain convex optimization problems. See Hastie, Tibshirani, and Wainwright (2015) and Remark 8.3.2. We do not discuss this further here but note again that for large p, the empirical eigenvectors and eigenvalues can be very poor estimates of the population quantities of interest. Here too, as in regression, we typically regularize reflecting our beliefs that the data nearly live in a much lower dimensional subspace. For instance, λk+1 = · · · = λp = 0 correspond to a sparse subspace where p − k of the predictors are linearly dependent on the others and X is in a k dimensional subspace of Rp . We will discuss such issues even more briefly in our last Section 12.7 in which we point to enormous areas, barely touched by this book, and current areas of research interest. Summary In this section we have briefly focused on the huge body of research on model selection and dimension reduction of data, one of the most important issue in modern statistical practice. The reduction in model dimension centered on an asymptotic Bayes criteria while Section 12.7 Topics Briefly Touched and Current Frontiers 377 dimension reduction of data briefly discussed principal component analysis. We had already considered, in a slightly different context, dimension reduction via Mallows’ Cp and cross validation in Section 12.4. 12.7 Topics Briefly Touched and Current Frontiers Complex data structures As we have noted in the past, modern statistical data are full of structures. To the old types of temporal data, time series, have now been added spatio-temporal data, matrix time series, images which can be thought of as spatial data, but have a special complexity and structure, networks as in genomics or the World Wide Web, dynamical computer models, hierarchical structures such as text, and so on. Beyond a few examples in time series such as autoregressive processes we have not discussed modelling such situations parametrically or nonparametrically mainly because models are evolving for these complex situations. For stationary Gaussian time series, the spectrum defines the most general semiparametric model and its estimation and its analysis has been highly developed (see, for example, Brillinger (2001)). However, just as with linear regression, the tool of choice in the Gaussian case, while the spectral techniques have been of great value even in the absence of Gaussianity, robustness and other considerations have led to time domain modelling. Here semiparametric models such as general autoregressive and more general Markov and hidden Markov models have been used and analyzed. Asymptotics as the time series lengthens have been the major theoretical tools. Bootstrap and other Monte Carlo techniques have been developed for some of these models (Politis, Romano and Wolf (1999), Bühlmann and Künsch (1995), Künsch (1989), and Bickel, Götze and van Zwet (1997)). Hierarchical, latent variable and Bayesian graphical models There has been an enormous revival of Bayesian modelling. It has taken the form of hierarchical models, or, more generally, Bayesian graphical models of the type also familiar in the frequentist world, such as factor analysis, random effects, and mixed models, i.e. models where the observed variable structure is postulated to be due to unobservable or unknown hidden variables. A mixed model, for example, might be modelling genetic factors in Type II diabetes. To what extent do genetic factors account for Type II diabetes? This development has been driven by the possibility of incorporating partial scientific knowledge into the statistical model in this way and also because, with the advent of the MCMC techniques discussed in Section 10.4, computation of posterior distributions becomes, in principle, feasible in complex situations. Accounts may be found in Berger (1985) and Gelman, Carlin, Stern, Dunson, Vehtari and Rubin (2014). These approaches have proved successful in the context of prediction and machine learning and are favored by many Bayesian, frequentists, or the most common type of statistician, these days, pragmatists. See Wainwright and Jordan (2008). The reason for this, as we indicated earlier, is that no matter how a prediction algorithm is constructed its value is judged nonparametrically through checking predictions from cross validation based on training samples on known cases. 378 Prediction and Machine Learning Chapter 12 From a scientific point of view there is an interest in establishing causal relationships, importance of variables, etc. Equivalently, we may start with a graph of variables (nodes) and edges (associative relations). We want to end with a directed graph where the directions represent causation. This can be problematic. We typically observe association between variables which can be causally explained in many ways. For instance, if we observe (X, Y ) ∼ N2 (0, 0, 1, 1, ρ) we can’t tell whether Y was generated from Y = ρX +ε1 , ε1 ⊥ X, ε1 ∼ N (0, 1 − ρ2 ), or from X = ρY + ε2 with ε2 ⊥ Y , ε2 ∼ ε1 . However, some causative relations can be ruled out as being inconsistent with others. For instance, a loop in the graph, A → B → C → A is problematic since either A causes B or B causes A. Important applications of these notions have appeared in statistics and biostatistics — see Pearl (2009), Rubin (1990), and van der Laan and Robins (2003). Not infrequently, the hierarchical Bayesian approach appears dangerous. Distributional and causal assumptions are thrown in at various points of the hierachy only for convenience of computation and if final posterior probabilities are taken seriously, it is difficult to disentangle what is coming from the data, what is coming from reasonable scientific assumptions, and what is coming from convenience assumptions. This is a problem shared by frequentists building hierarchical models as much as Bayesians. It seems plausible that what matters in a complex hierarchical model are the parametric assumptions made along the way rather than whether one puts a prior on at the root of a hierarchical model step or uses maximum likelihood. This last remark is supported by the Bernstein-von Mises theorem (Theorems 5.5.2 and 6.2.3). Modelling Deterministic Situations Computer models of great complexity for traffic, wildfires, and atmospheric science are now common; see Sacks et al (1989). Although, in principle, these situations are deterministic and can, again in principle, be modelled only by subjective Bayesians, in fact probabilistic models are very useful and important tools in their analysis. For instance, numerical PDE computations running on super computers for climate prediction are “assimilated” with data (with stochastic error structure) to produce far more accurate predictions than could be done using data alone. Computational questions An issue that high dimensionality and possibly sample size have made paramount is computation time. Despite the ever increasing speed and capabilities of computers, running all possible regressions on a set of n observations with p covariates requires on the order of 2p n operations, an impossibly large number if p is even in the hundreds. Things are, of course, much worse for optimization algorithms for functions on Rp arising routinely in maximum likelihood or for that matter for Markov Chains Monte Carlo simulations in high dimensional spaces. It has been clear that to get good statistical performance, we need to add computational performance, to get so called Polynomial (P) rather than NonPolynomial (NP) algorithms although even that may be excessive. That is, if p characterizes the number of parameters, numbers, objects that the algorithm needs to deal with, then, for some α, a (P ) algorithm takes at most on the order of pα operations to yield an answer whatever be the values of the parameters. An (NP) algorithm either will need more than pα for any α for some situations or have an unknown worst case performance. The issues are, Section 12.7 Topics Briefly Touched and Current Frontiers 379 in fact, rather subtle since as for decision procedures, this is based on a worst case analysis and an algorithm may perform well in practice even though it is known to be NP. A famous example is the simplex method for solving linear frequency problems. These issues are only now starting to be treated systematically; see Chandrasekaran and Jordan (2013) for instance. Behaviour of classical procedures when p, n are comparable or p ≫ n It has been known for some time in the statiscal physics and probability communities that if p/n → c > 0, then the behaviours of various statistical functions of the empirical covariance matrix are very different than if p/n → 0. A striking example studied by Wachter (1978), Geman(1984), and more recently by El Karoui (2005) following classical work in statistical physics is that of the empirical covariance matrix of n i.i.d vectors which are multivariate normal with mean 0 and variance covariance matrix the p × p identity. The distribution of the eigenvalues of this matrix does not tend to point mass at 1, as they should and do for fixed p. Moreover, for slight perturbations to the identity covariance, the empirical eigenvectors have strange properties. Thus, principal component analysis (PCA) becomes suspect. Similar difficulties arise for the classical Gaussian linear discriminant function (Bickel and Levina (2008a,b)). These can be studied by using models with different forms of sparsity and regularization such as those we have analysed earlier in this chapter for regression. See Remark 8.3.2 for references to these studies of sparse PCA. It is worth considering situations where the vector of parameters is unstructured a priori such as the p × p population covariance matrix when sampling from a p-dimensional population and situations where some of the population parameters are structured, for instance, the Fourier coefficients of a smooth density function appearing in a semiparametric model Here, smoothness corresponds to a sparsity assumption i.e. most of the coefficients vanish. Examples of this type of large p situation have been dealt with in Sections 12.3, 12.4, and 12.6. Another example of situations where many parameters appear but sparsity is assumed is the following: Studying many parameters simultaneously from testing and related points of view In genomics and other fields, it has become routine to test many null hypotheses at the same time, in the expectation that only some small fraction of these if any will not be null. The classical approach here has been to use the Bonferroni inequality (see Section 4.4). That is, if there are N hypotheses, to ensure that the probability of type I error, in the sense of making any mistake whatsoever, is no more than α, one tests at level α/N . The probability of at least one Type I error when testing N hypotheses is called the family-wise error rate (FWER). That is, if V denotes the number of true null hypotheses rejected, then F W ER = P (V ≥ 1). If the model distribution P assumes that all N null hypotheses are true, then F W ER ≤ α is called weak control of FWER, while if P is not required to satisfy this assumption, F W ER ≤ α is called strong control of FWER. The popularity of the Bonferroni multiple testing method is in part due to: Proposition 12.7.1. Suppose we have available tests ϕ1 , . . . , ϕN for testing N null hypotheses H1 , . . . , HN . Let pbj denote the p-value of ϕj for testing Hj , j = 1, . . . , N . Then the multiple testing procedure that rejects the Hj with pbj ≤ α/N strongly controls FWER. 380 Prediction and Machine Learning Chapter 12 Proof. Let J0 = {j : Hj is true} and N0 = cardinality of J0 . Then F W ER = P (V ≥ 1) = P ( ∪ {b pj ≤ α/N }) ≤ j∈J0 X j∈J0 P (b pj ≤ α/N ) ≤ N0 α≤α. N ✷ Holm (1979) constructed a simultaneous testing procedure that strictly improves on the Bonferroni method: Let p(1) ≤ . . . ≤ p(N ) denote the ordered p-values for N tests. Reject Hj if α . p(j) ≤ N −j+1 A proof of strong control of FWER for the Holm method is outlined in Problem 12.7.1. When the number of hypotheses is large and α ≤ 0.1, the Bonferroni and Holm methods rule out all but the very strongest signals and then makes discovery of strong but less extreme signals impossible. To remedy this failing, Benjamini and Hochberg (1995) proposed replacing the Bonferroni criteria by bounding the False Discovery Rate (FDR), the expected fraction of falsely rejected hypotheses to the total number of rejections, and derived a simple rule for doing so with desirable properties. This rule rejects Hj if p(j) ≤ p(bj) , where b j = arg max{j : p(j) ≤ α/N } . If the p-values are independent for j with Hj true, the rule controls the FDR at level α for such j. An overview may be found in papers of Benjamini and Hochberg (1995), Lehmann, Romano and Shaffer (2003), Dudoit, Shaffer and Boldrick (2003), and others. The monograph of Efron (2010) gives an important overview of issues in modern large data contexts from an empirical Bayesian point of view. This is an area of continuing research — see, for example, the problems and a related solution to a different problem in Q. Li et al (2011) as well as the study of other features of the distribution of p values which underlies the work of Benjamini and Hochberg. Dimension reduction, clustering, and related concepts Sufficiency is not usually a useful concept in the high dimensional world we are now dealing with. But dimension reduction and the related concepts, clustering, or, as the information theorists call it, “lossy compression,” are essential for analysis and understanding. Such methodologies interact intimately with regularization methods we have mentioned. If we do not regularize or, more specifically, set some parameters to 0, the analysis we then give may end up giving us an incorrect qualitative picture of the mechanics behind the data. We end with an augmentation of a famous quotation from Box (1979): “Models, of course, are never true but fortunately it is only necessary that they be useful” to which we would like to add “Useful models generate conclusions which can point to new verifiable questions.” Randomized methods as part of statistics A number of procedures have arisen in recent years in addition to the bootstrap, MCMC and similar methods, which provide new methodology for high dimensional data. An intriguing example now called “sketching” is a class of methods for dimension reduction. Section 12.8 Problems and Complements 381 Here random projections of the data points into lower dimensional subspaces are analyzed, yielding both computational and possibly inferential advantages at a possible cost of information loss. We see this as an emerging frontier on the borders of statistics and computation. See Baraniuk, Cevher, and Wakin (2010) as a recent nonstatistical reference. A method using randomization intrinsically that we discuss more fully is Random Forests (RF). This method for classification and regression, or supervised learning, and clustering or unsupervised learning, was introduced by Breiman (2001), (see also Amit and Geman (1997)), as a follow up to CART. In this approach, ensembles of random classification or regression trees are grown on randomly chosen training sets. Classification is determined by majority voting, and regression by averaging empirical predictors produced the trees. The randomness is introduced in two ways. 1) Each tree is grown on a bootstrap sample of the same sample size as the training sample. Each bootstrap sample, because of sampling with replacement, leaves out about 30% of the data vectors. This leaves, as a complement of the bootstrap training sample, an independent subsample for assessing importance of predictors (also called features). But the trees are typically dependent since the training samples overlap. 2) Even more significantly, when the vector of features/predictors is large, each tree uses only a small random subset of features on which to determine a split. That is, the CART splits are each made by optimizing over their own small subset of variables, of the order of 5, with sets varying randomly from split to split. For classification, although this is not necessary, the trees are grown to purity. Each leaf contains only cases of one data type. Various advantages accrue. The methods are much more stable than CART since they smooth by averaging, even though the trees are not independent. More significantly, the use of different variables at each split and the use of subsamples enables efficient exploration of high dimensional spaces. The details of RF make analysis of the method difficult. However, its empirical performance is superb, and recently research and understanding have advanced. See Hastie, Tibshirani and Friedman (2009) for recent references, and Scornet, Biau, and Vert (2015). Summary In this final section we have listed a number of topics which are of active interest at least to us. Many equally workable directions have not been mentioned. But this is a book of “Selected Topics.” 12.8 Problems and Complements Problems for Section 12.1 1. With the notation of Section 12.1, suppose, initially, Z = (1, Z1 )T where Z1 ∈ [0, 1]. Show that if we enlarge Z to Z∗ = (1, Z1 , Z12 , . . . , Z1d )T , then for any continuous µ(Z1 ), and ε > 0, there exists d, a ≡ (a0 , . . . , ad ) such that, E(µ(Z1 ) − aT Z∗ )2 ≤ ε2 . 382 Prediction and Machine Learning Chapter 12 Hint: Use the Weierstrass approximation theorem. 2. Suppose x1 , . . . , xN are N points in general position in Rd and (y1 , . . . , yN ) are associated signs y ∈ {−1, 1}. Show that there exists a 1–1 map Φ : Rd → RM for some M ≥ d and a vector v ∈ RM such that sgn(vT xi ) = yi , i = 1, . . . , N . Hint: Let e1 , . . . , eN be the coordinate axes in RN , map xi onto ei , and let v = (y1 , . . . , yN )T . Problems for Section 12.2 1. Let fbh (x) denote the multivariate convolution kernel estimate. Assume that K has convex compact support and is of order 1; and assume that D2 f (x) ≤ M for all x ∈ S(f ) and all f , some M > 0. Show that for x in the interior S 0 of S(f ) EF fbh (x) = f (x) + O(|h|2 ) Z VarF fh (x) = (n|h|d )−1 f (x) K 2 (u)du 1 + o(1) . Hint: Follow the steps in the proofs of Propositions 11.2.1 and 11.2.2. R 2. Assume that 0 < |D2 f (x)|2 dx < ∞. Let fbh (x) be as defined by (12.2.1). (a) Show that AIMSE expression (12.2.5) for fbh (x) is minimized by 1 h = af n− d+4 for some af > 0. Give an expression for af . (b) Show that the minimum of the AMISE expression (12.2.5) is 4 bf n− d+4 for some bf > 0. Give an expression for bf . (c) Evaluate the constant af ≡ a(Σ) in part (a) of this problem when X is Nd (0, Σ) b be the sample and Kh (u) is a product kernel with Kj (u) = 1[|u| ≤ 1]/2. Let Σ covariance matrix. Then 1 b b − d+4 h = a(Σ)n is an estimate of the h that yields the optimal asymptotic IMSE if the true density is N (µ, Σ). Here Nd (0, Σ) is called a reference distribution. See Section 11.2.3. 3. Boundary kernel density estimates. Consider h = (h, . . . , h). Suppose the support of f is a rectangle Πdj=1 [aj , bj ] and that D2 f (x) ≤ M for all xεS(f ) and some M > 0. Also assume that K(u) = Πdj=1 K0 (uj ), where K0 is symmetric and has support [−1, 1]. Let Bλ = Πdj=1 [aj , aj + λh], 0 < λ < 1, denote the left boundary region, and let xh = (a1 + λh, . . . , ad + λh) denote a point in Bλ . Section 12.8 383 Problems and Complements (a) Show that the kernel estimate is asymptotically biased in Bλ . That is, as h → 0 (b) Let W (x) = R S(f ) |fbh (xh ) − fh (xh )| = OP (1) . Kh (z − x)dz and define feh (xh ) = Pn Kh (Xi − x) 1 x ∈ S(f ) . nW (x) i=1 Show that feh (xh ) is asymptotically unbiased in Bλ , That is, as h → 0, feh (xh ) − fh (xh ) = oP (1) . Hint: See Problem 11.3.1. Jiang and Doksum (2003) give further asymptotic properties of feh (x). (c) Show that feh (x) = β0 (x, Fb), where β0 (x, F ) = arg min E [f (Z) − β0 ]2 X = x and (Z|X = x) has density qh (z|x) = Kh (z − x)Πdj=1 1(aj ≤ zj ≤ bj ) . W (x) Hint: See Section 11.3. 4. Let Dk be the differential operator defined in Section B.8 and let k k∞ denote sup norm over S, and let mj and νj be as defined in Section 11.6.2. Suppose that inf{f (x) : x ∈ S} > 0, kDf k∞ < ∞, kDJ+2 µk∞ < ∞, kDσ 2 k∞ < ∞, mJ+2 (K) < ∞, ν5 (K) < ∞, mj (K) = 0, j = 1, . . . , J, and J is even. Show that (a) as n → ∞, h → 0, nh3 → ∞, uniformly for x ∈ S, b NW (x)|X(n) ≡ E |b µ NW (x)−µ(x)|2 X(n) = OP (hJ+2 )+OP (n−1 h−d ) . MSE µ b NW (x)|X(n) is of the form (b) The minimizer of the asymptotic version of MSE µ h=c 1 J+2+d 1 (J + 2)n b NW (x)|X(n) of order which leads to MSE µ J+2 n− J+2+d . 384 Prediction and Machine Learning Chapter 12 Hint: See Section 11.6.2. b j = C −1 , 1 ≤ j ≤ C, and if pbjk (·) denotes the kth 5. In Example 12.2.2, show that if Π nearest neighbour estimate, then the classifier based on replacing pj (·) with pbjk (·) in the Bayes rule classifies In+1 as Ibjk where b jk is the subscript on the kth nearest neighbour to Xn+1 . 6. Assume that d = 1, that the support of f is [a, b] with a and b finite, that sup{|f ′′ (x)| : x ∈ [a, b]} ≤ M , and that k = kn → ∞, n−1 kn → 0 as n → ∞. Then the kth nearest neighbour classifier is consistent. Hint: See Section 11.2 and Problem 11.4.3.1 7. Assume that f satisfies the conditions of Problem 6 above. Show that the bandwidth b h that corresponds to the nearest neighbour classifier satisfies b h = OP ( n1 ). 8. Suppose (Xi , , Yi ) are i.i.d. where Yi = 1 with probability π0 , Yi = −1 with probability 1 − π0 , and Xi ∼ N (µ0 Yi , Σ0 ) given Yi . Assume 0-1 loss and that µ0 , Σ0 and π0 are known. (a) Show the Bayes classification rule for Yn+1 given Xn+1 is given by Ybn+1 = 1 if (Xn+1 − µ0 )T Σ−1 0 (Xn+1 − µ0 ) − (Xn+1 + µ0 )T Σ−1 0 (Xn+1 + µ0 ) > 2 log = −1 otherwise or equivalently π0 1 − π0 1 − π0 . Ybn+1 = 1 iff µT0 Σ−1 0 Xn+1 > log π0 (b) Show that the Bayes risk if π0 = 1 2 is Rθ = 1 − Φ(µT0 Σ−1 0 µ0 ). 9. Assume (Xi , Yi ), i = 1, . . . , n are i.i.d. as (X, Y ) with Y = 1 or −1 and d X αj Xj + α0 logit P (Y = 1|X) = j=1 where logit(u) = log u/(1 − u) , 0 < u < 1. This is the logistic regression model. (a) Show that the Bayes classification rule given Xn+1 = (X1,n+1 , . . . , Xd,n+1 )T is Ybn+1 = 1 iff d X j=1 αj Xi,n+1 > −α0 . Section 12.8 385 Problems and Complements (b) Suppose that α is unknown. Show that as n → ∞ the rule d X b b Y n+1 = 1 iff α bj Xj,n+1 + α b0 > 0 j=1 nP n b = arg max where α i=1 converges to the Bayes rule. Pd j=1 (αj Xji d + α0 ) − log(1 + eΣj=1 αj Xji +α0 o (c) Show that if the density of X is (1−π0 )N (µ0 , Σ0 )+π0 N (−µ0 , Σ0 ), then the Bayes rule of (a) coincides with that of Problem 12.2.8. (d) Assume 0-1 classification loss. If we replace Σ0 , µ0 , π0 by their MLEs under the Gaussian model mentioned in Section 12.2.3, show that the resulting classifier (LDA) is, as n → ∞, no better than that of the logistic regression model classifier and, in general, is worse. 10.(a) Show that the minimizer of L(x, µ) ≡ (x − µ)2 + λ|µ| in µ is given by soft thresholding µ b = x, |x| ≤ c. = x − c(sgn x), |x| > c (b) Suppose Y is n × 1 and X is n × p. Deduce that if λ > 0 and n > d the Lasso: Minimize |Y − Xβ|2 + λ|β|1 b = (βb1 , . . . , βbd )T where for some S(λ) ⊂ {1, . . . , d}, has a unique sparse solution β βbj = 0, j ∈ S(λ) . Pd Pd (c) Show that if we replace = j=1 |βj | by |β|c = j=1 |βj |c , c > 1, then if Y has a b is never sparse with probability 1. density β Hint. Use Lagrange multipliers or equivalently the Karush-Kuhn-Tucker conditions (KKT). P Remark. The adaptive Lasso (Zou (2006)) replaces λ|β|1 with λ wj |βj |, where wj are weights of the form 1/|βj∗ | with βj∗ consistent estimates of βj , 1 ≤ j ≤ d. It has desirable consistency and oracle properties. Software is available online. 11. Let M (H) denote the margin of the separating hyperplane H. Show that the hyperplane that maximizes M (H) can be found by solving (12.2.19). Hint. H = {x : β T (x − x0 ) = 0} where β T x0 = β0 and π(x|β) = (β T x)β. 12. Verify (12.2.29). 386 Prediction and Machine Learning Chapter 12 13. Argue geometrically that the support vector machine optimization problem, may be stated as in (12.2.19). 14. Assume model (12.2.30). For M fixed show that the first Adaboost algorithm converges for fixed n > M to the classifier specified by the sample version of (12.2.30) i.e. δ(Z) = 1 iff M X j=1 α bj hj (Z) > 0 b = arg min Q(α, Pbn ) is assumed to exist. where α Hint. Apply the methods used to prove Theorem 12.2.2. 15. Assume model (12.2.30). Show that as n → ∞, (a) the minimizer of Q(α, Pbn ) converges to the minimizer of Q(α, P ), and (b) if the true distribution P (z, y) is such that Z has finite support, then the first Adaboost algorithm is Bayes consistent in the sense of (12.2.10). P∞ 16. Suppose logit P (Y = 1|Z) = j=1 αj hj (Z), αj 6= 0 for all j and Adaboost 1 is iterated forever using the function h1 , h2 , h3 , . . .. Show that Adaboost 1 doesn’t converge. 17. (a) Show that minimizing R(α, P ) in (12.2.31) with λ = (2γ)−1 is equivalent to solving (12.2.24). (b) Show that if λP= 0 in (12.2.31) and the Bayes classifier for model (12.2.30) is of the form δB = sgn α∗j hj (Z) , then δB minimizes both R(α, P ) with λ = 0 and the PM minimum Bayes risk P Y j=1 α∗j hj (Z) < 0 . (c) Show that under the assumptions of (b) support vector machines can be put in the framework of boosting. 18. (Breiman et al. (1984)). Define a measure of impurity of a finite probability distribution, {p1 , . . . , pK } ∈ SK , the K simplex, as a function φ : SK → R such that (i) φ is maximized uniquely for pj = 1 K, 1≤j≤K (ii) φ is minimized for any ej ≡ {δij : i = 1, . . . , K} where δij = 1(i = j). (iii) φ is symmetric. (a) Show that with 0 log 0 = 0 the following are both impurity measures with minimums zero: K K X X 2 pj log pj . pj ; φ2 = − φ1 = j=1 j=1 (b) Show that both φ1 and φ2 are strictly concave functions of p, that is, φ(αp1 + (1 − α)p2 ) ≥ αφ(p1 ) + (1 − α)φ(p2 ) for all α ∈ [0, 1] with equality iff p1 = p2 . Section 12.8 387 Problems and Complements Hint. Because the sum is concave, it is enough to check concavity for k = 1. 19. Suppose Y ∈ {1, . . . , K} so that there are K classes. Let t be a node with class fractions (p1 , . . . , pK ). Split the node into K pieces with corresponding compositions PK πj (p1j , . . . , pKj ), j = 1, . . . , K, i=1 pij = 1 where πj is the fraction assigned to descendant node j. (a) Show that for φ1 , φ2 as in Problem 18, for any such split S, if we define the total purity gain of the descendant nodes, by ∆(S) ≡ K X j=1 πj φ(p1j , . . . , pKj ) − φ(p1 , . . . , pK ) , then, ∆(S) ≥ 0. Hint. Use concavity. (b) Show that for K = 2, φ = φ2 , the splitting rule described in the text as the T algorithm is Sb ≡ arg max ∆(S) . S (c) Show that ∆(S) = 0 iff pij = pi for all i, j. 20. (a) Show that if f, g are densities of P and Q with respect to µ, Z sup |P (A) − Q(A)| = |f − g| dµ . A (b) Deduce that if X = {x1 , . . . , xm } and (p1 , . . . , pn ), (q1 , . . . , qm ) are corresponding vectors of probabilities, X X X {pj : j ∈ A} − {qj : j ∈ A} = |pj − qj | max A Hint. |P (A) − Q(A)| = R A (f − g) dµ . Take A = {x : fg (x) ≥ 1}. 21. Consider the following splitting scheme, called twicing, for a node t with fractions (p1 , . . . , pK ). Let C = {j1 , . . . , jm }, C ⊂ {1, . . . , K}, C¯ = {1, . . . , K} − C. Split the node t into tL , tR such that the node tL corresponds to (p1L , . .P . , pKL ) and similarly for tR . If πL , πR are the fractions of the two nodes, let p(1|t) = {pj : j ∈ C}. Define p(1|tL ), p(1|tR ) similarly. Then p(1|t) = πL p(1|tL ) + πR p(1|tR ) . Show that the purity gain from a split into two classes C and C¯ is ∆(S, C) = − 2p(1|t) 1 − p(1|t) − πL p(1|tL ) 1 − p(1|tL ) − πR p(1|tR ) 1 − p(1|tR ) . 388 Prediction and Machine Learning Chapter 12 22. Suppose K = 2 and φ = φ2 in the previous problem. Given a split S, let t, tL , tR be the parent left and right nodes with compositions p(1|t), 1 − p(1|t) and πL p(1|tL ), 1 − p(1|tL ) , πR p(1|tR ), 1 − p(1|tR ) where p(1|t), p(1|tL ), p(1|tR ) are the (conditional) probability masses for class 1 and πL = 1 − πR is the marginal probability of being assigned to tL . Show that 2 ∆(S, C) = 2πL πR p(1|tL ) − p(1|tR ) . Hint. Write p(1|tL ) = 21 + ∆1 , p(1|tR ) = 21 + ∆2 , p(1|t) = 21 + πL ∆1 + πR ∆2 . 2 Note that p2 (1|t) + 1 − p(1|t) = 1 − 2p(1|t) 1 − p(1|t) . 1 1 1 − (πL ∆1 + πR ∆2 )2 . ∆(S, C) = −2 πL ( − ∆21 ) + πR ( − ∆22 ) − 4 4 4 23. Let p(C|t) = X {pj : j ∈ C}, p(C|tL ) = X {pjL : j ∈ C}, p(C|tR ) = X {pjR : j ∈ C} and let ∆(S, C) be as in Problem 21. Show that the split S(C ∗ ) which maximizes ∆(S, C) over all S, C may be obtained by maximizing X 2 {|pjL − pjR |} πL πR j over all splits, yielding s∗ , and the optimal C ∗ = {j : p(j|t∗L ) ≥ p(j|t∗R )} where t∗L , t∗R are the results of s∗ . 24. Show that in the T algorithm for classification trees, if pi > 0, i = 0, 1, . . . , Cjm , j = 1, . . . , 2m is the partition of Rm defined by level m of the tree and Cm (z) is the member of the partition containing z, then Cm (z) ⊃ Cm+1 (z) for all m and ∩ Cm (z) = {z}. m Problems for Section 12.3 1. Justify the calculations in (12.3.7) using the conditions of Theorem 6.2.2. 2. Suppose (Z, I) has joint density p defined by the logistic regression model log where z is d × 1, θ ∈ Rd and p(z, θ) = θT z 1 − p(z, θ) p(z, θ) ≡ P [I = 1| Z = z] = 1 − P [I = 0| Z = z] . Section 12.8 389 Problems and Complements (a) Show that if Xi = (Zi , Ii ), 1 ≤ i ≤ n, are i.i.d., the joint conditional density of the Ii given Zi ≡ (Zi1,...,Zid )T follows a canonical experimental family with statistics Tj (X) = n X Zij Ii . i=1 b is asymptotically Gaus(b) Using classical MLE theory show that the conditional MLE θ sian and give its limiting mean and variance covariance matrix. 3. Let (Z, I) be as in Problem 12.3.2. Suppose the conditional density of Z given I = j is N (µj , Σ), j = 0, 1, and that P [I = j] = πj , j = 0, 1. (a) Show that the joint distribution of (Z, I) can be written, for suitable η = (µ, Σ), g, g(Z, θ, η) exp(Iθ T Z)(1 + eθ T Z −1 ) . (b) Assume 0-1 classification loss. Deduce that if Z ∼ g(·, θ, η) then the linear discriminant analysis (LDA) classification rule is better than the logistic regression classification rule in terms of regret, but that the asymptotic Bayes risk of LDA is the same as that of logistic regression. (c) Show that if g is not of the form in (a), LDA can be arbitrarily worse than logistic regression and even not asymptotically Bayes. Hint. See Problem 12.2.9. 4. Show that in the nonparametric regression example, (12.3.27) and (12.3.28) define a prior distribution with 1 |µ(z1 , ε) − µ(z2 , ε)| ≤ d− 2 |z1 − z2 | . 5. Establish the Fourier identities (12.3.47). 6. For the framework of Section 12.3.4 show that for any prior π and quadratic loss there is a prior with independent components which has the same Bayes risk. Hint. Given π consider the prior with independent components and the same marginals as π. 7. Show that in the Gaussian contiguous case, c = 12 is the best asymptotic choice to put in the Bayes test ϕB and then the Bayes risk (12.3.16) converges to 1 − Φ( σ2 ) as n → ∞. 2 Hint. Λn =⇒ N ( σ2 , σ 2 ) under P (n) ∈ Ωn and t = 0. See Example 3.3.2. 1 2 2 2 Φ̄( t+ σ2 σ +Φ t− σ2 σ ) is maximized for 8. Consider the notation in the proof of Fano’s inequality. Establish the existence and b b Pj ) < rm /2. uniqueness of the decision rule δ(x) = j iff ρ(θ, 9. For P as in Remark 12.3.2(b), show that the minmax risk of the Nadaraya-Watson 2s estimate is no smaller in order than n− 2s+d . 390 Prediction and Machine Learning Chapter 12 Hint: Apply Fano’s inequality for suitable P1 , . . . , Pm . 10. Exhibit an estimate achieving the rate of Problem 12.3.8. Hint: Use Nadaraya-Watson with kernel having vanishing moments. 11. In the framework of Proposition 12.3.2, show that by using the hard thresholding rule δch (y) with suitable c = cn one can construct estimates minmax on Θ0,n . 12. Establish (12.3.49) and (12.3.50). 13. Consider the Gaussian shift model xi = θi + εi , i = 1, 2, . . ., where xi ’s are the observations and θi ’s are the parameters P∞ to be estimated. Assume that the parameter vector θ = (θ1 , θ2 , . . .) lies in an ellipsoid i=1 i2m θi2 ≤ M , where M is a fixed positive constant and m is a known positive integer. The εi ’s are independent normal random variables with mean 0 and variance n−1 . In the P following problems, you may use this fact: there exists a positive constant C1 −2 such that ∞ ≤ C1 λ−1/2m for all λ > 0, where ρi = i2m − 1. You may i=1 (1 + λρi ) also assume the existence of the solutions to the problems below. (a) The penalized likelihood estimator θbi is the solution to the optimization problem: min θ ∞ X i=1 (xi − θi )2 + λ ∞ X ρi θi2 . i=1 Show that if λ is of order O(n−2m/(2m+1) ), then there exists C2 > 0 such that P∞ E (θbi − θi )2 ≤ C2 n−2m/(2m+1) . i=1 (b) The projection estimator θei is defined by xi , i ≤ n1/(2m+1) e θi = 0, otherwise. Show that there exists C3 > 0 such that P∞ i=1 E(θei − θi )2 ≤ C3 n−2m/(2m+1) . (c) Show that the estimator resulting from L1 penalty estimation: ) (∞ ∞ X X 2 |θi | (xi − θi ) + 2λ min θ i=1 i=1 has the form sgn(xi )[|xi | − λ]+ . (d) Show that the estimator resulting from L0 penalty estimation: ) (∞ X 2 2 (xi − θi ) + λ {# of θi that is not 0} min θ i=1 has the form xi 1{|xi |>λ} . Section 12.8 391 Problems and Complements (e) Consider now a rectangular subset of the original ellipsoid P∞ set of parameters: θi ≤ ξi , where ξi ’s are given positive numbers satisfying i=1 (1 + ρi )ξi2 ≤ M . Find the minimax linear estimator of the form θi = αi xi for all i, in terms of mean squared error. Problems for Section 12.4 1. (a) Verify (12.4.10) and (12.4.11). (b) Complete the proof of Theorem 12.4.1 by establishing the last inequality in the proof. (c) Derive the results of Remark 12.4.1 (a). P 2 2. χ2 inequality. Suppose X ∼ χ2m , X = m i=1 Zi , Zi ∼ N (0, 1) independent. Show that √ √ mt t2 t P |X − m| > t m ≤ 2 exp − t2 (1 + √ )−1 ≤ 4 e− 2 + e− 2 . m √ √ −1 Hint: P X ≥ m − √tm ≤ inf e−s( m+t) 1 − √2sm for s < 2m . Use, s √ t t −1 1 m 2s = √ 1+ √ < , arg min − s m − st − log 1 − √ 2 2 m 2 m m and − log(1 − x) ≤ x + x2 , 0 ≤ x < 21 . Similarly, P X < m + √tm = P [m − X > √ − m √ −t m] ≤ inf es( m+t) 1 + √2sm 2 . s 2 3. Deduce, if Xm ∼ Xm , that for some universal c, D √ E sup |Xm − m| − c m log n + : 1 ≤ m ≤ n ≤ D . Hint: If supm ∆m,n is the random variable above Z ∞ n Z X E sup ∆m ≤ K + P [sup ∆m,n > t] dt ≤ K + K m=1 ∞ P [∆m,n > t] dt. (12.8.1) K By Problem 12.4.2, Z ∞ h t i |Xm − m| √ dt P > c log n + √ m m K Z ∞ Z ∞ √ −(c log n+ √tm )2 /2 ≤4 e e−(t+c m log n)/2 dt dt + K K Z ∞ √ 2 √ − v2 −c m e ≤4 m dv + 4n . c log n The claimed bound follows for c > 1 by substituting this bound in (12.8.1). 4. Show that if ηbi are as in the GWN model E sup n n X i=m+1 ηi (b ηi − ηi ) − cρ n X i=m+1 ηi2 21 o ≤D p log n . 392 Prediction and Machine Learning Hint: ρ2 Pn i=m+1 ηi2 = Var e 5. If R(m) is defined by Pn i=m+1 e (a) E R(m) = σn2 m − Pm e (b) VarR(m) = ηi (b ηi − ηi ). e R(m) = 2σn2 m − show that j=1 Pm m X j=1 ηbj2 , ηj2 j=1 Var ηbj2 = 2σn4 m + σn2 2β e (c) max min E R(m) = O(n− 2β+1 ). Θβ 1≤m≤n Chapter 12 Pm j=1 ηj2 . 1 Hint: Plug in m = n 2β+1 . 6. Show that if m b is defined as in Theorem 12.4.2 m b n i n hX o X 2β (b ηi − ηi )2 + ηi2 : Θβ ≤ Kn− 2β+1 log n for K = K(β) . max E i=1 i=m+1 b Hint: By Problem 12.4.6 (using C, D, etc. generically) E m b X i=1 Therefore, √ 2 (b ηi − ηi )2 − Eσn2 m b ≤ D + C log nE m b σn ≡ ∆ . m b m b m b X X X ηbi2 + ∆ (b ηi − ηi )2 − b +E ηbi2 = E σn2 m E 2σn2 m b− = E σn2 m b −2 =E =E σn2 m b m b X i=1 =E −2 m b hX i=1 i=1 i=1 i=1 m b X i=1 m b X i=1 ηbi ηi + m b X ηi2 + ∆ i=1 (b ηi − ηi )ηi − (b ηi − ηi )2 + (b ηi − ηi )2 + n X i=m+1 b n X i=m+1 b m b X i=1 ηi2 + ∆ n m b X X ηi2 + 2∆ (b ηi − ηi )ηi − ηi2 − 2E i=1 i=1 ηi2 i n n n i h X X X ηi2 + 2∆ . (b ηi − ηi )ηi − + 2 (b ηi − ηi )ηi − 2 i=m+1 b i=1 i=1 Section 12.8 393 Problems and Complements Using Problem 12.4.6, E n X (b ηi − ηi )ηi ≤ Cσn E i=m+1 b n X ηi2 i=m+1 b 21 p log n . Finally, E m b X i=1 (b ηi − ηi )2 + ≤ 2∆ + σn E( n X ηi2 i=m+1 b n X i=m+1 b ηi2 m0 12 p X (b ηi − ηi )2 + log n + E i=1 i=m0 +1 P m0 2 ηi . But since E 2σn2 m b − i=1 ηbi2 ≤ E 2σn2 m0 − i=1 √ 1 Eσn2 m b ≤ σn2 (E m) b 2 n n X 12 X 1 2 ≤ σn E σn + E ηi2 ηi2 2 Pm b n X ηi2 i=m+1 b i=m+1 b and the results follows. 7. Show that in Remark 12.4.1, if m b is selected by SBC (BIC), then p −1 |b ηm 2 log n . b | ≤ σ0 n 2 Problems for Section 12.5 1. Establish the sufficient conditions for A3 leading to Theorem 12.5.1. (Z) Hint. Apply the law of large numbers using (1) to reduce to the case k∆S k2P = 1. Apply Hoeffding’s or a similar inequality. 2. (The Efron–Stein Inequality). Let T (x1 , . . . , xn−1 ) be a symmetric function of n − 1 variables and let X1 , . . . , Xn be i.i.d. as X ∈ R. Set Ti = T (X (−i) ) and T· = Pn −1 n i=1 Ti , where X(−i) = (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )T . (a) Show that, Var Tn (X1 , . . . , Xn−1 ) ≤ E n X i=1 (Ti − T· )2 . Hint. Use Var U = 12 E(U − U ′ )2 , where U , U ′ are i.i.d., as U . Also, condition on X(−i) . (b) Use (a) to show that the leave one out cross validation estimate of variance tends to overestimate the variance. 394 Prediction and Machine Learning Chapter 12 3. Why crossvalidate? Consistent estimation of prediction error (PE). Let (X1 , Y1 ), . . . , (Xn , Yn ), where Xi = (Xi1 , . . . , Xip )T , be i.i.d. as X ∈ Rp , Y ∈ R. Assume the model Yi = XTi β + εi , εi ∼ i.i.d. N (0, σ 2 ) with E|X|4 < ∞, Var(X) = Σ0 , where Σ0 is nonsingular and known. Let XD ≡ b = (X T XD )−1 Y be the least squares (and maxi(Xij )n×p be the design matrix and let β D mum likelihood) estimate of β (see (6.1.14)). Show that if β = β0 is known, (a) The average prediction error when we use XTi β0 to predict Yi , 1 ≤ i ≤ n, is 1X PE = E (Yi − XTi β 0 )2 = σ 2 . n b in the formula in (a) is (b) The estimate of PE obtained by replacing Xi β0 by Xi β biased, in fact, n X p . (Yi − XTi β)2 = σ 2 1 − E n−1 n i=1 (c) Show that if p is fixed, b = β + n−1 Σ−1 β 0 0 n X Xi εi + OP (n−1 ) . i=1 b based on (Xi , Yi ), i = 1, . . . , m (d) Show that if we use the least squares estimate β m to estimate β, and we use (Xi , Yi ); i = m + 1, . . . , n to estimate prediction error, then n X 1 b )2 = σ 2 + OP ( 1 ) d PE ≡ (Yi − XTi β m n − m i=m+1 n provided (m/n) → λ for some λ ∈ (0, 1). T Hint. (b), condition on XD . For (c) and (d) use the delta method and E|Xi Σ−1 0 Xi | = −1 E|Xi Σ0 2 |2 . Problems for Section 12.6 1. Show that if we apply the Bayesian criterion (12.6.5) to choose m in the sieve method for estimating θ in Θβ of the GWN model as in Theorem 12.4.2, and if η is in the closure 2β 2β −( 2β+1 ) of ∪∞ rather than n−( 2β+1 ) n=1 Pm , then we obtain the minimax risk rate log n n obtained by using Stein’s unbiased risk estimate (Mallows’ Cp ), as shown in Problems 12.4.2–12.4.6. 2. Consider the one-sample model Yi = µ + εi , i = 1, . . . , n with {εi } i.i.d. N (0, 1). Let M0 be the model with µ = 0 and let M1 be the model with µ ∈ R arbitrary. Section 12.8 395 Problems and Complements (a) Find the model selection rule based on the Bayesian criterion (12.6.5). (b) Show that the rule in (a) selects the correct model with probability tending to one as n → ∞. (c) The AIC criterion with known σ 2 = 1 takes the form AIC(Mk ) = 2 n X i=1 log p(yi |θbk ) − 2k . Find the model selection rule based on AIC. (d) Show that the rule in (c) satisfies Pµ=0 (Decide model M1 ) = P (χ21 > 2). That is, it provides a test wth level P (χ21 > 2). It does not select the correct model with probability tending to one as n → ∞. 3. Show that the maximal eigenvalue of the covariance matrix Σ is given by λ1 (P ) = Var p X j=1 a∗j Xj Pp where |a∗ | = 1 and a∗ = arg max Var j=1 aj Xj : |a| = 1 . 4. In Section 12.6.3, show that the distance between a point y ∈ Rp and the line x = αa is yT y − (aT y)2 . Hint. Consider the point t ∈ Rp where the perpendicular from y to the line intersects the line. Then t = αa where α must satisfy aT (y − αa) = 0. Because aT a = 1, this gives α = aT y. Thus |y − t|2 = |y − (aT y)a|2 . Next expand the right hand side. Problems for Section 12.7 1. Let p(1) ≤ . . . ≤ p(N ) be the ordered p-values and let H (j) denote the null hypothesis corresponding to p(j) , 1 ≤ j ≤ N . Set p(N +1) = 1 and define j∗ = min 1≤j≤N +1 {j : p(j) > α/(N + 1 − j)} . The Holm (1979) multiple testing procedure rejects H (1) , . . . , H (j method strongly controls FWER at level α. ∗ −1) . Show that this Hint. Let J0 = {j : H (j) is true}, N0 = cardinality of J0 , and j0 = N − N0 + 1. Let A be the event “p(j ∗ ) < α/N0 ” and B = “p(j) > α/N0 for all j ∈ J0 .” Show that B =⇒ “j ∗ < j0 ” =⇒ A, and that P (B) ≥ 1 − α. 2. Optimal multiple testing. 396 Prediction and Machine Learning Chapter 12 (a) (Spjótvoll, 1972). Let g01 , . . . , g0N and g1 , . . . , gN be integrable functions of a data vector X and let S(γ) be all multiple tests ψ = {ψ1 , . . . , ψN } of the null hypotheses H1 , . . . , HN that satisfy N Z X ψj (x)g0j (x)dν(x) = γ (12.8.2) j=1 for a given probability ν. Show that for an appropriate c, the multiple test ψ = (ψ1 , . . . , ψN ) = 1 gj (x) > cg0j : 1 ≤ j ≤ N , c > 0 maximizes N Z X ψj (x)gj (x)dν(x) (12.8.3) (12.8.4) j=1 among all tests ψ ∈ S(γ). Hint. Extend the proof of the Neyman-Pearson Lemma. Remark. The functions gj and g0j need not be densities. They are general integrable functions. PN (b) Let R = j=1 1 ψj (x) = 1 be the number of “discoveries,” f0j (x) and fj (x) densities of X under the hypothesis Hj and alternative Aj , respectively, g0j ν = [f0j /R]1(R > 0) if Hj is false, 0 otherwise, and gj ν = [fj /R]1(R > 0) if Hj is false, 0 otherwise. With this notation, by definition, (12.8.2) is the False Discovery Rate (FDR) and (12.8.4) is the Correct Discovery Rate (CDR). Give the rule that maximizes the CDR subject to FDR = γ. (c) (Spjótvoll (1972), Storey (2007), Sun and Cai (2007)). Give the multiple test that maximizes the expected number of true positives for a fixed level γ of the expected number of false positives. (d) Suppose we express ψj in a multiple test rule as ψj = ψ(b pj ) = 1(b pj ≤ a) where pbj is the p-value for the hypothesis Hj . Express the solutions to (b) and (c) in terms of p-values. Hint. When Hj is true, pbj ∼ U nif [0, 1]. (e) Bayes. Let πj denote the prior probability that Hj is true; and let f0 and fj denote the densities of X when Hj is true and false, respectively. The posterior probability of Hj is πj f0 (x) . P (Hj |x) = πj f0 (x) + (1 − πj )fj (x) ′ N Consider the Bayes rule ψ B = (ψB , . . . , ψB ) that rejects the Hj with P (Hj |x) ≤ q for some preassigned q ∈ (01, ). Then ψ B = {1(fj (x) > (1 − q)πj f0 (x))/(1 − πj ) : 1 ≤ j ≤ N } . Thus, if πj = π, ψ B is optimal in the sense of problem (a). Section 12.8 397 Problems and Complements (f) Empirical Bayes (Efron 2008, 2010). Suppose we write ψj in the multiple test rule as ψj (Z) = 1(Zj ≤ bj ) where Zj = Φ−1 (b pj ) and Φ = N (0, 1) df. Then Zj ∼ N (0, 1) when Hj is true. Let f0 denote the the N (0, 1) density, let πj denote the prior probability that Hj is true, and let fj denote the density of Zj when Hj is false. Then the posterior probability of the null hypothesis Hj given Zj = zj is P (Hj |zj ) = πj f0 (zj ) . πj f0 (zj ) + (1 − πj )fj (zj ) Now reject Hj if P (Hj |zj ) ≤ q, q ∈ (0, 1). Efron (2010) operationalizes this rule by constructing estimates of Πj and fj , thereby obtaining an empirical Bayes rule which can be found in the software R under “loc FDR” (local False Discovery Rate). 3. False Discovery Distributions. Let T1 , . . . , TN be test statistics for testing Hj : µj = 0, 1 ≤ j ≤ N . Each is based on a sample of size n. Assume that the joint distribution of T1 , . . . , TN is defined by the equation Tj = √ 1 nµj + ρZ + (1 − ρ2 ) 2 εj where Z, ε1 , . . . , εN are i.i.d. N (0, 1). Let pj be the p-value of the test that rejects Hj for |Tj | large, let N N X X 1(|Tj | > −zu/2 ) 1(pj ≤ u) = R(u) = j=1 j=1 be the number of discoveries, let J0 = {j : µj = 0}, and let the number of false discoveries be X X V (u) = 1(pj ≤ n) = 1(|Tj | > −zu/2 ) . j∈J0 j∈J0 (a) Let N0 = {#j : µj = 0}. Show that the conditional distribution of V (u) given Z can be approximated by the distribution of z o n z u/2 − ρZ u/2 + ρZ √ + Φ , W (µ) ≡ N0 Φ √ 1 − ρz 1 − ρz P Z ∼ N (0, 1) . P (b) Show that (i) W (u) −→ N0 u as ρ → 0; and (ii) W (u) −→ N0 as ρ → 1. Find the limit as ρ → −1. Hint for (ii). Condition on I = 1(Z > 0). (c) Give the range of values of the probability P V (u) ≥ 1 of at least one false rejection as ρ ranges from −1 to 0 and from 0 to 1. 398 Prediction and Machine Learning Chapter 12 (d) Let J1 = {j : µj 6= 0} and N1 = {#j : µj 6= 0}. (i) Find a variable S(u) whose distribution approximates the conditional distribution √ of R(u) given Z when µi = µ/ n and (ii) N1 /N0 → 0, (ii) N1 /N0 → λ ∈ (0, 1). (e) In (d) above, find the probability limit of S(u) as ρ → 0 and as ρ → ±1 for the cases (i) and (ii). Fan, Han and Gu (2012), among others, have given approaches to FDR for dependent p-values. Appendix D SOME AUXILIARY RESULTS D.1 Probability Results A collection {Yni ; 1 ≤ i ≤ n} of random variables is called a double array if the probability distribution Pni of Yni depends on both i and n. An example is the regresssion model Yni = α + βn xni + εi P √ where βn = c/ n, n−1 (xni − x̄n )2 → τ ∈ (0, 1) and ε1 , . . . , εn are i.i.d. Such models are used when investigating asymptotic power. See Section 6.3.2. Another example is the “bootstrap” asymptotic theory of Section 10.3.4. A key result for double arrays is 1. The Lindeberg-Feller Central Limit Theorem. Theorem D.1 (Lindeberg-Feller) Suppose that for Pneach 2fixed n, Yn1 , . . . , Ynn are indepen2 . Then , for n → ∞, dent with 0 < σni ≡ Var(Yni ) < ∞. Set σn2 = i=1 σni max 1≤i≤n 2 σni →0 σn2 (D.1.1) and n 1 X L [Yni − EYni ] → N (0, 1) σn i=1 (D.1.2) iff for each ε > 0, n 1 X E (Yni − EYni )2 1(|Yni − EYni | > εσn ) = 0 . 2 n→∞ σn i=1 lim (D.1.3) Corollary D.1 Suppose Y1 , . . . , YP n are i.i.d. as Y ∼ P , where P does not depend on n. Let µ = EY, σ 2 = VarY , and Tn = ni=1 ci Yi , then for n → ∞ the following are equivalent Pn Tn − µ i=1 ci L → N (0, 1) , (D.1.4) Pn 1 (σ 2 i=1 c2i ) 2 Pn ( i=1 c2i ) 2 → ∞. (D.1.5) vn ≡ max c2i 1≤i≤n 399 400 Supplements to Text Appendix D Proof. See Problem D.1.2. Theorem D.2 The condition (D.1.3) holds for all ε > 0 if for some δ > 0 kn X i=1 E|Yni − EYni |2+δ = o(σn2+δ ) . (D.1.6) Remark D.1 (D.1.3) is called Lindeberg’s Condition and (D.1.6) is called Liapunov’s Condition. Remark D.2 Suppose n−1 σn2 → c for some c > 0. Then (D.1.3) becomes 1X E[(Yni − EYni )2 1(|Yni − EYni | > εσn )] → 0 . n (D.1.7) A multivariate version is Theorem D.3. (Multivariate Lindeberg-Feller.) Suppose for each n, Yn1 , . . . , Ynkn are independent random vectors in Rd with Cov(Yni ), 1 ≤ i ≤ kn , finite. Let | · | denote Euclidean norm. If for ε > 0, kn X E |Yni |2 1 (|Yni | > ε) = 0 lim n→∞ i=1 and kn X i=1 for Σ positive definite, then kn X i=1 Cov(Yni ) → Σ L (Yni − EYni ) −→ N (0, Σ) . Remark D.3. In Theorem D.3, the Lindeberg condition is for uncentered Yni . This turns out to be convenient for the asymptotics of bootstrap methods. Clearly Theorem D.3 also holds if Lindeberg’s condition is for centered Yni . 2. Almost Sure Convergence and the Borel–Cantelli Lemma Almost sure convergence of a sequence of random variables and Kolmogorov’s Strong Law of Large Numbers (SLLN) appear in Section B.7 of Volume I. Here are some useful additional facts about a.s. convergences. P D.4 Convergence in probability and a.s. convergence. Z n → Z as n → ∞ iff every subsequence {Z nk }, k ≥ 1, has a subsubsequence {Z n′k }, {n′k } ⊂ {nk }, k ≥ 1 such that a.s. Z n′k → Z as n′k → ∞. a.s. P D.5 If Z n −→ Z, then Z n → Z. Section D.2 401 Supplement to Section 7.1 a.s. D.6 If Zn −→ Z iff lim P ( sup |Zn − Z| > ε) = 0. n→∞ m≥n d D.7 Uniform SLLN (Chung (1951)). Let ZP 1 , Z2 , . . . be i.i.d. as Z ∈ R , Z ∼ P ∈ P. Let n −1 | · | be Euclidean norm and let Z̄n = n i=1 Zi . If E|Z| < ∞, and lim sup EP |Z|1(|Z| ≥ M ) = 0 , M→∞ P ∈P then the SLLN holds uniformly, that is, for ε > 0, lim sup P sup |Z̄n − EP (Z)| ≥ ε = 0 . n→∞ P ∈P m≥n D.8 The Borel–Cantelli Lemma. Suppose ∞ X n=1 P [ |Z n − Z| ≥ ε] < ∞ a.s. for all ε > 0. Then, Z n → Z as n → ∞. 3. Kolmogorov’s Inequality The following result is useful for establishing uniform convergence in probability of nonparametric regression estimates. See Problems 11.6.8 and 11.6.9. Theorem D.4. Let X1 , X2 , . . . be independent random variables with E(Xj ) = 0, E(Xj2 ) < Pj ∞, and let Sj = i=1 Xi . Then, for every ε > 0, P D.2 max |Sj | > ε ≤ Var(Sn )/ε2 . 1≤j≤n Weak Convergence of Functions. Supplement to Section 7.1 Proof of Theorem 7.1.1. The full proof of this theorem may be found in van der Vaart and Wellner (1996), for instance. However, it is easier and less technical to see that if q is Lipschitz continuous, then (i) and (ii) imply that L(q(Zn (·))) → L(q(Z(·))). Here’s the argument for q Lipschitz. Let tmj , 1 ≤ j ≤ km , m ≥ 1, be points such that tmj ∈ Tmj . Pkm (m) (m) Let Zn (t) = similarly. Comparing the j=1 1(t ∈ Tmj )Zn (tmj ) and define Z characteristic functions of q(Zn ) and q(Z), we compute |E exp{isq(Zn )} − E exp{isq(Z)}| ≤ |E exp{isq(Zn )} − E exp{isq(Zn(m) )}| + |E exp{isq(Zn(m) )} − E exp{isq(Z (m) )}| + |E exp{isq(Z (m) )} − E exp{isq(Z)}| . (D.2.1) 402 Supplements to Text Appendix D Let ∆m (Zn ) = max{sup[|Zn (s) − Zn (t)| : s, t ∈ Tmj ] : 1 ≤ j ≤ km } and let (m) Amn = {|∆m (Zn )| ≤ εm }. Note that, |Zn − Zn |∞ ≤ ∆m . We claim that |E exp{isq(Zn )} − E exp{isq(Zn(m) )}| ≤ |s|E|q(Zn ) − q(Zn(m) )|1(Amn )) + (D.2.2) 2P [Acmn ]. This follows from |eia − eib | ≤ |a − b| and |eia | ≤ 1. But by the Lipschitz nature of q, E|q(Zn ) − q(Zn(m) )|1(Amn ) ≤ M εm and by (7.1.5), P [Acmn ] ≤ δm . By the separability of Z and the FIDI convergence of Zn (·), we have P [∆m (Z) ≥ εm ] ≤ δm also (Problem 7.1.1), and the third term in (D.2.1) can be bounded just as the first term has been. Combining these bounds, we have from (D.2.1), lim sup |E exp{isq(Zn )} − E exp{isq(Z)}| (D.2.3) n ≤ 2(M |s|εm + δm ) + lim sup |E exp{isq(Zn(m))} − E exp{isq(Z (m) )}|. n (m) The second term in (D.2.3) is 0 by the FIDI convergence of Zn (·) to Z (m) (·). Now, let m → ∞ to complete the proof. Note that Z (m) as defined yield tightness of Z(·). ✷ The general proof requires showing that (D.2.3) implies that, for every ǫ > 0, there exists a compact set Kǫ such that P [Zn (·) ∈ Kǫ ] ≥ 1 − ǫ and P [Z(·) ∈ Kǫ ] ≥ 1 − ǫ. By a classical result, functions which are continuous on a compact set are uniformly continuous. Proof of Theorem 7.1.2. Consider the bracket set collection, Tm , corresponding to δ = m−α , α > 1 to be chosen later, m = 1, 2, . . .. Without loss of generality, we can suppose that the bracket sets Tmj form a partition of T in the sense that if Tmj = (f mj , f¯mj ) ∈ Tm , j = 1, . . . , Jm , then Tmj ∩ Tmk = ∅ unless j = k. This can be achieved in such a way (Problem D.2.2) that (ii) will still hold for some constant c. Further, let gmj be a representative member of Tmj for each j, m. It follows that we can identify each f ∈ T by a sequence (j1 (f ), j2 (f ), . . . , ) where f ∈ Tm,jm (f ) , m = 1, 2, . . .. That such jm (f ) exist follows from the partition property. Let gm = gm,jm (f ) , and define f¯m , f m similarly. That (j1 (f ), j2 (f ), . . .) characterize 0 WP (f ) (and f ) (up to sets of probability 0) follows from WP0 (f ) = WP0 (g1 ) + ∞ X (WP0 (gm+1 ) − WP0 (gm )) (D.2.4) m=1 in the sense that WP0 (g1 ) + m X j=1 (WP0 (gj+1 ) − WP0 (gj )) = WP0 (gm+1 ) (D.2.5) and EP (f − gm+1 )2 (X) ≤ EP (f¯m+1 − f m+1 )2 (X) ≤ m−α → 0 (D.2.6) Section D.2 403 Supplement to Section 7.1 as mP → ∞. Let b, w1 , w2 , . . . be non-negative weights which will in fact depend on λ with ∞ b + m=1 wm = 1. Pick m0 which will also depend on λ and write WP0 (f ) = WP0 (gm0 ) + ∞ X m=m0 (WP0 (gm+1 ) − WP0 (gm )). Note that [|WP0 (f )| ≥ λ] ⊂ [|WP0 (gm0 )| ≥ bλ] ∪ ∞ [ m=m0 [|WP0 (gm+1 ) − WP0 (gm )| ≥ wm λ]. Hence, using (D.2.5), P [sup{|WP0 (f )| ≥ λ : f ∈ T }] ≤ P [max{|WP0 (gm0 ,j )| : 1 ≤ j ≤ Jm0 } ≥ bλ] ∞ X + P [max{|WP0 (gm+1,j ) − WP0 (gmj ′ )| : 1 ≤ j ≤ Jm+1 , (D.2.7) m=m0 +1 1 ≤ j ′ ≤ Jm } ≥ wm λ]1(Tm+1,j ∩ Tmj ′ 6= ∅) . The bound (D.2.7) is fundamental. It comes from a generalization of the classical P [|X + Y | ≥ ǫ] ≤ P [|X| ≥ 2ǫ ] + P [|Y | ≥ 2ǫ ], bounding |WP0 (gm )| by max{|WP0 (gmj )| : 1 ≤ j ≤ Jm } and using a similar bound for the differences. Go further and replace the terms on the right hand side of (D.2.7) by J m0 X j=1 + P [|WP0 (gmj )| ≥ bλ] ∞ X Jm+1 Jm X X m=m0 +1 j=1 j ′ =1 (D.2.8) P [|WP0 (gm+1,j ) − WP0 (gmj ′ )| ≥ wm λ]1(Tm+1,j ∩ Tmj ′ 6= ∅) . Although gm+1,j may not belong to Tm,jm (f ) , because f ∈ Tm,jm (f ) ∩ Tm+1,jm+1 (f ) , it is still true that E(gm+1,j − gmj )2 ≤ 2{E(gm+1,j − f )2 + E(gmj − f )2 } ≤ 4m−2α . Therefore, we can use the bounds (7.2.8) and (D.2.6) to bound (D.2.8) by 2 2 −2 2 ∞ X λ b γ λ 2 2α 2Jm0 exp − +2 Jm Jm+1 exp − (wm m ) . 2 8 m=m +1 (D.2.9) 0 Finally, use (ii) to bound (D.2.9) by 2 2 2 −2 ∞ X λ λ b γ 2αd 2 2α + C m exp − C1 mαd exp − (w m ) , (D.2.10) 2 0 2 2 m m=m +1 0 404 Supplements to Text Appendix D ′ where C1 and C2 are generic constants. By taking wm = cm−α , α′ > 1, it is not hard to show (Problem D.2.4) that we can bound (D.2.10) as specified in the statement of the theorem. ✷ Discussion: This argument uses the idea of “chaining” introduced by Kolmogorov. It exploits the representation of WP0 (f ) given by (D.2.4) which in turn reflects the “decimal” representation of f . The key observation is that the increments in (D.2.4) have variances which go down and, as a consequence of the bound (7.2.10), even if the deviation λ from the mean 0 is weighted by a small wm , the probability of that deviation drops much more rapidly than wm does. Note that this argument shows that the order of magnitude of the probability of deviation by λ or more of the maximum is essentially of the same order exp{−λ2 /2γ 2} as that of a single term. D.3 Functional Derivatives. Supplement to Section 7.2 Recall that if f : O → R where O is open ⊂ Rd then, at a point x ∈ Rd , f is partially ∂f (x) iff the limits differentiable with partial derivatives ∂x j f (x + hej ) − f (x) ∂f (x) = lim h→0 ∂xj h exist, where ej is the jth basis vector. The function f has a total differential iff a linear approximation to f holds. That is, suppose Df (x)(t) = lim h→0 f (x + ht) − f (x) h (D.3.1) Pp ∂f holds uniformly for |t| bounded. Then Df (x) is the linear map t → j=1 ∂xj (x)tj , T t = (t1 , . . . , tp ) given by (D.3.1). As is well known, functions with partial derivatives but no total differential are easy to exhibit. If f : B → R where B is a linear space and DG f (x)(t) = lim h→0 f (x + ht) − f (x) , h x, t ∈ B (D.3.2) exists and is well defined, DG is called the Gâteaux derivative of f at x in the direction t. The Gâteaux derivative is the analogue of the partial differential. The more useful and stronger notions correspond to uniformity in t statements about the limit and B, a normal linear space (Banach). That is, we want D such that sup |f (x + ht) − f (x) − hDf (x)(t)| : t ∈ T = o(h) . (D.3.3) If T is any bounded set in B it follows that Df (x)(·) is a bounded linear functional on B and Df (x)(·) is called the Fréchet derivative while if T is any compact set, Df (xT )(·) Section D.4 405 Supplement to Section 9.2.2 is called the Hadamard derivative and Df (x) is a linear functional. If B = Rd the two notions produce the same total differential. Our interest typically focusses on B = {All finite signed µ measures on Rd with total variation norm, kµk = sup |µ(A)| : A ∈ A where A is the class of all sets in Rd for which we can assign probability. Here “all signed measures µ on Rd ” is the class of all µ on A of the form {aP − bQ : a, b ∈ R, P and Q are probabilities on Rp }. For this B, a natural subspace of linear functionals can be identified with the space ofRall bounded continuous functions q on Rd , with corresponding linear functional, Lq (µ) = q(v)dµ(v) where dµ(v) = adF (v) − bdG(v) with F and G equal to the df’s corresponding to P and Q, v ∈ Rd . If Df (x) is of the form Lq and Df is Hadamard then if t is a signed measure, Z Df (x)(t) = q(v)dt(v) (D.3.4) and q(x) = Df (x)(δx ) where δx is point mass at x. Identifying these quantities with objects in Rd we see that (D.3.4) corresponds to the usual formula for the derivative in a particular direction and q(x) corresponds to the partial derivatives in the directions of the coordinate axes. See Example 7.2.1, for these analogies. If B is not the space of all signed measures but just the convex cone of probabilities, we can still carry out limit (D.3.2) in a direction t = G − F at x = F where F and G are df’s since, then, F + h(G − F ) = (1 −R h)F + hG, a member of the cone if 0 ≤ h ≤ 1. The q R we obtain is then not unique since q(v) + c d(G − F )(v) = q(v)d(G − F0 )(v), but 0 R if we require q(v)dF0 (v) = 0 we obtain the derivative as a unique influence function. The Hadamard sense is what is needed to carry out the delta method while its computation is via Gateaux. See van der Vaart (1998) for a careful treatment using this notion of derivative and the preceding treatment of Huber (1974) and Reeds (1976). D.4 Asymptotics for the Cox Estimate. Supplement to Section 9.2.2 The Influence Function of the Cox Partial Likelihood Estimate 1. Derivation of the Gâteaux derivative of Γ̇τ (β, P ). To find the Gâteaux derivative set Pε = (1 − ε)P + εQ, Pjε = (1 − ε)Pj + εQj ; pjε = dPjε , j = 1, 2 . Then we seek ∂ h ∂ Γ̇τ (β, Pε ) = ∂ε ∂ε Z 0 τ − Ṡ0 (t, β, Pε )dP2ε (t) + S0 By taking Q1 to be pointmass at Z, we find Z i 1(t ≤ τ )zdPε (z, t) II = Z1(T ≤ τ ) − E Z1(T ≤ τ ) . ε=0 ≡ I + II . Write S0 (ε), Ṡ0 (ε) for S0 (t, β, Pε ), Ṡ(t, β, Pε ) and S0 , S0 for S0 (0), Ṡ0 (0). For I, we need ∂ Ṡ0 ∂ ∂ (ε) = Ṡ0 (ε)S0−1 − Ṡ0 S0−2 S0 (ε) , ∂ε S0 ∂ε ∂ε 406 Supplements to Text Appendix D where T T ∂ ∂ S0 (ε) = EPε eβ Z 1(T ≥ t) = eβ Z 1(T ≥ t) − S0 ∂ε ∂ε is obtained by letting Q be pointmass at X = (Z, T ). Similarly, T ∂ Ṡ0 (ε) = Zeβ Z 1(T ≥ t) − Ṡ0 . ∂ε Thus T T ∂ Ṡ0 (ε) = S0−1 Zeβ Z 1(T ≥ t) − Ṡ0 − S0−2 Ṡ0 eβ Z 1(T ≥ t) − S0 ∂ε S0 T = S −1 eβ Z 1(T ≥ t) Z − (Ṡ /S ) . 0 0 0 Because p2ε |ε=0 = p2 and ∂p2ε (t)/∂ε = q2 − p2 for T ≤ τ , this gives Z T Z τ T Ṡ0 Ṡ0 Ṡ0 I = −eβ Z S0−1 Z − (t, β, P ) dP2 (t) − (T, β, P ) + (t, β, P )dP2 (t) . S0 S0 0 0 S0 Now I + II = γ(X, P ). 2. Proof of Theorem 9.2.2. Suppose we can show that γ(X, P ) is an influence function for b Γ̇−1 τ (β, P ) in the sense of (7.2.2). Then n X √ √ b − β) = Γ̈−1 (β ∗ , Pb ) n n−1 γ(Xi , P ) + oP (1) . n(β τ i=1 ∗ b By Slutsky’s theorem (Corollary 7.1.2) for processes, we can replace Γ̈−1 τ (β , P ) with −1 Γ̈τ (β, P ) and the results (ii) and (iii) follow. We leave details to the reader in Problem D.4.1. ✷ D.5 Markov Dependent Data. Supplement to Section 10.4 A. Markov Sequences Let X1 , X2 , . . . , be a sequence of X valued random variables. The i.i.d. sequences we have considered so far have two critical properties that we will generalize in this section. (i) Independence L(Xm |Xm−1 , . . . , X1 ) = L(Xm ), m ≥ 2 . (ii) Identical distribution L(Xm ) = L(X1 ), m≥2. The natural generalization of the first property is the Markov property defined by L(Xm |Xm−1 , . . . , X1 ) = L(Xm |Xm−1 ), m≥2. (D.5.1) Section D.5 407 Supplement to Section 10.4 The most important generalization of the second property is stationarity defined by L(X1 , . . . , Xm ) = L(X1+k , . . . , Xm+k ), m ≥ 1, k ≥ 0 . (D.5.2) An intermediate property (Problem D.5.1) implied by (D.5.1) and (D.5.2), but not by either one singly, is called (e.g. Grimmett and Stirzaker, Section 6.1) homogeneous Markov. It is defined by (D.5.1) and “L(Xm |Xm−1 ) does not depend on m,” that is, L(Xm |Xm−1 ) = L(X2 |X1 ), m≥2. A sequence is stationary and Markov iff it is homogeneous Markov and L(Xm ) = L(X1 ), m≥2. (D.5.3) Thus, an i.i.d. sequence is stationary Markov. To study Markov sequences {Xn , n ≥ 1} it is best to think of n as time and Xn as the state of a dynamical system initiated at X1 . Thus, we refer to X as the state space and its members as states. We think of the evolution mechanism, L(X2 |X1 ), as being given by a Markov kernel K : X × X → R defined by K(x, y) = P X2 = y|X1 = x , x, y ∈ X if X is discrete, countable. If X is Euclidean and X2 |X1 has a continuous case density PX2 |X1 (x2 |x1 ), we define the Markov kernel by K(x, y) = PX2 |X1 (y|x), x, y ∈ X . We call L(X1 ) the initial distribution and if the Markov sequence is stationary, L(X1 ) is a stationary distribution because in this case L(Xn ) = L(X1 ) for all n ≥ 1. In what follows, we will define stationary distributions more generally and describe how they, under certain conditions, approximate the distribution of Xn for n large. B. Homogeneous Finite Markov Chains We sketch the theory for X = {1, . . . , N }, that is for finite Markov chains. In fact, for Markov Chain Monte Carlo (MCMC), and many other applications, we need Markov sequences for X countable or X general Euclidean. Such generalizations hold only under additional conditions which we shall give appropriate references to, as needed. Homogenous chains satisfy P (Xm+n = y|Xm = x) = P (Xn+1 = y|X1 = x), so we can call pn (x, y) ≡ P (Xn+1 = y|X1 = x) n-step transition probabilities. The n-step transition matrix Pn is defined by Pn = pn (i, j) N ×N . A key property is Lemma D.5.1. If X1 , X2 , . . . is a finite homogeneous Markov chain, then Pn = Kn , n = 1, 2, . . . 408 Supplements to Text Appendix D where Kn is the nth power of the transition matrix K = K(i, j) N ×N . In particular P (Xn+1 = j|X1 = i) = (Kn )i,j , the (i, j) entry of Kn . Proof. Argue by induction: P1 = K1 by definition. By A.4.4, the Markov property, homogeneity, and induction, P (Xn = j|X1 = i) = N X P (Xn = j|Xn−1 = k, X1 = i)P (Xn−1 = k|X1 = i) k=1 = N X P (Xn = j|Xn−1 = k)P (Xn−1 = k|X1 = i) k=1 n−1 = (K K)i,j = (Kn )i,j . ✷ Next, consider the marginal probabilities (n) (n) (n) q(n) = (p1 , . . . , pN )T , pi = P (Xn = i), i ∈ {1, . . . , N } . Lemma D.5.2. Under the conditions of Lemma D.5.1, (q(n) )T = (q(1) )T Kn . P (n) Proof. By A.4.4 and Lemma D.5.1, pj = i P (Xn = j|X1 = i)P (X1 = i) = P (1) (1) (1) T n Pn )j = (q ) K )j . ✷ i pj pn (i, j) = (q Thus the behavior of the chain is determined by the initial transition matrix K and the initial distribution q(1) . Moreover, we can determine the asymptotic distribution of Xn by examining the limit of Kn . C. Some Definitions Important features of homogeneous Markov chains include (i) Irreducibility: All points in X are accessible from any other point. That is, P (Xm = j|X1 = i) > 0 for some m, all i, j. For finite chains this is equivalent to “Km has all elements positive for some m on.” (ii) Recurrence: A chain is recurrent if for all x ∈ X . P Xk = x for some k X1 = x = 1 (iii) Positive recurrent: This refers to recurrent chains with Eτx < ∞ for all x ∈ X , where τx = min{n ≥ 1 : Xn = x} is the first visit time to x. (iv) Aperiodicity: The period k of a state x is the largest integer k for which returns to state x occurs in multiples of k steps. Formally, k = gcd{n : P (Xn = i|X1 = i) > 0}, where “gcd” denotes greatest common divisor. A Markov chain is aperiodic if k = 1 for all x ∈ X . Section D.5 409 Supplement to Section 10.4 T (v) A stationary probability distribution on X is a vector π = π(1), . . . , π(N ) such that π(i) ≥ 0, Σπ(i) = 1, Σi π(i)K(i, j) = π(j), that is, π T K = πT . Equivalently, if π is the distribution of Xn and π T K = π T , then this sequence is stationary Markov. A stationary distribution may not exist. (vi) A stationary Markov chain with kernel K and pi = P (Xn = xi ) satisfies detailed balance if pi K(i, j) = pj K(j, i) . Summing over j we obtain pi = N X pj K(j, i) j=1 so that (p1 , . . . , pN )T = π is a stationary distribution. Such chains are also called reversible because detailed balance is equivalent to (X1 , X2 ) ∼ (X2 , X1 ). Algebraically, if D = diag(p1 , . . . , pN ), then KD = DKT . (vii) If π = (π(1), . . . , π(N ))T is a stationary distribution, the stationary transition matrix is π(1), . . . , π(N ) .. Π = ... . . π(1), . . . , π(N ) If Y1 , Y2 , . . ., is a homogeneous Markov chain with transition matrix Π, then Y1 , Y2 , . . . satisfy detailed balance and Y1 , Y2 , . . . are i.i.d. as π. (viii) The operator norm on the class of N × N matrices is defined by kM k = sup{|M x| : |x| ≤ 1} where x is N × 1 and | · | denotes Euclidean norm. The Frobenius norm for M = 1 P 2 2 (mij )N ×N is . It is equivlent to k · k for finite matrices. i,j mij D. Fundamental Theorem of Finite Markov Chains. Let X1 , X2 , . . . be an irreducible, aperiodic, homogeneous finite Markov chain. Then there exists a unique stationary distribution π, and if Pn is the n-step transition matrix, then kPn − Πk → 0 . (D.5.4) If π satisfies detailed balance, then more is true: there exists c < ∞ and ρ < 1 such that for all n, kPn − Πk ≤ cρn . (D.5.5) 410 Supplements to Text Appendix D Note that (D.5.4) implies that any initial distribution q(1) , |(q(1) )T Pn − πT | → 0 . (D.5.6) Moreover, (D.5.5) is equivalent to (Problem D.5.4) N X i=1 |P (Xn = i) − π(i)| ≤ M ρn , (D.5.7) for some M < ∞. That is we can compute the stationary distribution arbitrarily closely by starting at any distribution and running the Markov chain. This is sometimes called the forgetting property. The proof of the theorem and bounds on c are in Diaconis and Strook (1991). See also Grimmett and Stirzaker (2001, page 295). Remark D.5.1. As we noted, such results are needed for countable state spaces. What modifications are needed for this case? See Grimmett and Stirzaker (2001, Chapter 6, page 232) for details. The conditions needed are irreducibility, aperiodicity, and positive recurrence. Unfortunately only (D.5.4) can be established under these conditions for countable X. The potential slowness of convergence if |X | = ∞ is not only a weakness of the proof. In fact, (see Gilks et al (1995)) arbitrarily slowly converging Markov chains can be constructed. The situation when X = Rd is even worse. Conditions for (D.5.4) with k · k the variational norm are more difficult to formulate and convergence in high dimensional situations is known empirically and theoretically to typically be very slow. See Meyn and Tweedie (2009). D.6 Asymptotics for Locally Polynomial Estimates. Supplement to Section 11.6 1. Proof of Theorem 11.6.1. We will use (11.6.11). Write Snj = hj−1 n X Uij K(Ui ), j = 0, 1, . . . (D.6.1) i=1 where Ui = (Xi − x)/h with density hf (x + hu). By Chebychev’s inequality (A.15.2) and Taylor expansion, Snj p = ESnj + OP ( Var Snj ) = nhj R uj K(u)f (x + hu)du + OP (hj−1 p nE[U 2j K 2 (U )]) = nhj {f (x)mj + hf ′ (x)mj+1 + OP (cn )} (D.6.2) Section D.6 411 Supplement to Section 11.6 1 where cn = (nh)− 2 (1 + h2 ). The last equality follows because if K is symmetric, so is K 2 , ν5 (K) = 0, and R E[U 2j K 2 (U )] = h u2j K 2 (u)f (x + hu)du R u2j K 2 (u){f (x) + huf ′ (x) + 21 f ′′ (x + hu∗ )h2 }du, |u∗ | ≤ |u| = h = h[f (x)ν2j (K) + O(h2 )]. Next, for ǫ = ǫn = OP (cn ) and any constant c, we have (c + ǫ)−1 = c−1 − c−2 ǫ + OP (ǫ2 ) as ǫ→0 which gives −1 Sno = n−1 {[f (x)]−1 + OP (cn )}. By combining (11.6.11), (D.6.1), (D.6.2), and (D.6.3), the result follows. (D.6.3) ✷ 2. Proof of Theorem 11.6.2. For Xi close to x consider the approximation σ 2 (Xi ) = σ 2 (x) + OP (Xi − x). Using this and the arguments leading to (D.6.2) with j = 0, we have n X i=1 Similarly, −2 Sno σ 2 (Xi )Kh2 (Xi − x) = h−1 nf (x)σ 2 (x)ν(K){1 + oP (1)}. = n−2 {f −2 (x) + oP (1)}, and the result follows from (11.6.12). ✷ 3. The Bias of Local Polynomial Estimates. Let Sn = ZT WZ = (Sn,j+ℓ )0≤j≤p,1≤ℓ≤p , Tn = (Sn,p+1 , . . . , Sn,2p+1 ), S = (mj+ℓ )0≤j≤p,0≤ℓ≤p , S−1 = (S jℓ )0≤j≤p,0≤ℓ≤p , mp = (mp+1 , . . . , m2p+1 ), and H = diag(1, h, . . . , hp ). Theorem D.6.1 Under the conditions of Theorem 11.6.3, h i b (i) Bias β(x)|X = H−1 S−1 mp β p+1 hp+1 {1 + oP (1)} . Proof. Using (11.6.17), (11.6.18), and (D.6.2), the conditional bias is T p+1 S−1 + oP (Xi − x)p+1 1≤i≤n n Z W βp+1 (Xi − x) p+1 = S−1 ) n βp+1 Tn + oP (nh (D.6.4) (D.6.5) p+1 f (x)Hmp + o(h) + oP (nhp+1 ). = S−1 n βp+1 nh Next note that (D.6.2) also gives Sn = nf (x)HSH{1 + oP (1)} (D.6.6) 412 Supplements to Text Appendix D which together with (D.6.5) yields Theorem D.6.1. Theorem 11.6.3 follows by using H−1 = diag(1, h−1 , . . . , h−p ) and a little algebra (Problem D.6.2). ✷ Let ηp+1 ≡ ηp+1 (K) be the first entry of the vector S−1 mp . A useful result is (see Problem D.6.1): Lemma D.6.1. Under the conditions of Theorem 11.6.3, if p is even, then ηp+1 (K) = 0. Proof. Recall that S = (mj+l )0≤j≤p,0≤l≤p and mp = (mp+1 , . . . , m2p+1 ). When K is symmetric, then mj = 0 for j odd. It follows that the vector S−1 mp has zero as its first entry, that is ηp+1 (K) = 0. ✷ The local linear case Lemma D.6.2. Assume the conditions of Theorem 11.6.3. If p = 1, then η2 (K) = m2 (K). Proof. The symmetry of K implies that mj = mj (K) = 0 for j odd. Thus, in the local linear case where p = 1, we have S−1 = diag(1, m−1 2 ) and m1 = (m2 , 0), which implies η2 (K) = m2 (K). ✷ We have shown: Theorem D.6.2. Under the conditions of Theorem 11.6.3, if p = 1, then Bias [b µ(x)|X] = 1 m2 (K)µ′′ (x)h2 + oP (h2 ) . 2 4. The Variance of Local Polynomial Estimates. Let σ 2 (x) = Var(Y |X = x), Σ = diag[Kh2 (Xi − x)σ 2 (Xi )]1≤i≤n , then from (11.6.16), b Var(β(x)|X) = (ZT WZ)−1 (ZT ΣZ)(ZT WZ)−1 . (D.6.7) ∗ Let S∗n = ZT ΣZ = (Sn,j+ℓ )0≤j≤p,0≤ℓ≤p , where ∗ Sn,j = n X i=1 (Xi − x)j Kh2 (Xi − x)σ 2 (Xi ) , R and let S∗ = (νj+ℓ )0≤j≤p,0≤ℓ≤p , where νp (K) = [K(u)]2 du. Theorem D.6.3. Under the conditions of Theorem 11.6.4, σ 2 (x) −1 −1 ∗ −1 −1 b = Var β(x)|X H S S S H {1 + oP (1)} . f (x)nh Proof: By using the expansion leading to (D.6.2), we find Sn∗ = nh−1 f (x)σ 2 (x)HS ∗ H[1 + oP (1)] . Theorems D.6.3 and 11.6.4 follow from this and (D.6.7) (Problem D.6.3). (D.6.8) Section D.7 D.7 413 Supplement to Section 12.2.2 Supplement to Section 12.2.2 Proof of Theorem 12.2.1. b NW (x) slightly for proving that the IMSE conWe temporarily modify the estimate µ verges to zero at an appropriate rate. Consider the modified estimate, µ b NW (x, an ) ≡ µ b NW (x)1 fb(x) ≥ an where an ↓ 0 will be chosen in the proof. We begin by establishing (i) and (ii) for µ b NW (x, an ). Write Ki∗ = Kh (Xi − x), J = b b NW (X, an ) = E(D2 ) where 1 f (x) ≥ an , and (0/0) = 0. We want MSE µ D = (nfb)−1 J n X i=1 Ki∗ Yi − µ(x) . Next write µ(x) = Jµ(x) + (1 − J)µ(x) and note that Jµ(x) = J(nfb)−1 It follows that D = (nfb)−1 J n X i=1 n X µ(x)Ki∗ . i=1 Ki∗ [Yi − µ(x)] − (1 − J)µ(x) . (D.7.1) Using fb−1 = f −1 + (fb−1 − f −1 ), we have D = n−1 f −1 J n X i=1 Ki∗ [Yi − µ(x)] − (1 − J)µ(x) + n−1 (fb−1 − f −1 )J ≡ D1 + R1 + R2 . n X i=1 Ki∗ [Yi − µ(x)] (D.7.2) We will show that D1 is the main term with E(D12 ) of order O(h2 ) + O (nhd )−1 and that R1 and R2 are remainder terms with E(R12 ) and E(R22 ) of smaller order than E(D12 ). We will use the inequality E(D2 ) ≤ 3 E(D12 ) + E(R12 ) + E(R22 ) . Note that because 0 ≤ J ≤ 1, E(D