AN ABSTRACT OF THE DISSERTATION OF
Shuping Jiang for the degree of Doctor of Philosophy in Statistics presented on
June 5, 2013.
Title: Variable Selection in Semi-parametric Models.
Abstract approved:
Lan Xue
We consider two semiparametric regression models for data analysis: the stochastic additive model (SAM) for nonlinear time series data and the additive coefficient model (ACM) for randomly sampled data with nonparametric structure. We employ the SCAD-penalized polynomial spline method for estimation and simultaneous variable selection in both models. It approximates the nonparametric functions by polynomial splines and minimizes the sum of squared errors subject to an additive penalty on the norms of the spline functions. A coordinate-wise algorithm is developed for solving the penalized polynomial spline problem. For SAM, we establish that, under geometric α-mixing, the resulting estimator enjoys the optimal rate of convergence for estimating the nonparametric functions. It also selects the correct model with probability approaching one as the sample size increases. For ACM, we investigate the asymptotic properties of the global solution of the non-convex objective function. We establish explicitly that the oracle estimator is the global solution with probability approaching one. Therefore, the global solution enjoys both estimation and selection consistency. In the literature, the asymptotic properties of local solutions, rather than global solutions, are well established for non-convex penalty functions. Our theoretical results broaden the traditional understanding of the penalized polynomial spline method. For both models, extensive Monte Carlo studies show that the proposed procedures work effectively even with moderate sample sizes. We also illustrate the use of the proposed methods by analyzing the US unemployment time series under SAM and the Tucson housing price data under ACM.
©
Copyright by Shuping Jiang
June 5, 2013
All Rights Reserved
Variable Selection in Semi-parametric Models
by
Shuping Jiang
A DISSERTATION
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented June 5, 2013
Commencement June 2014
Doctor of Philosophy dissertation of Shuping Jiang presented on June 5, 2013
APPROVED:
Major Professor, representing Statistics
Chair of the Department of Statistics
Dean of the Graduate School
I understand that my dissertation will become part of the permanent collection of Oregon
State University libraries. My signature below authorizes release of my dissertation to
any reader upon request.
Shuping Jiang, Author
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my deepest appreciation to the many people who have helped me over the past five years.
I first want to thank my advisor, Professor Lan Xue, for her continuous and unreserved guidance, support and encouragement during the past five years. I was lucky to have her as my advisor. With endless patience and full responsibility, she taught me every single step of how to do research. She was always helpful and insightful when I needed guidance, completely understanding and considerate when I came across difficulties, and especially proactive and supportive about any possible opportunities for me. As my advisor, she has not only helped me in my Ph.D. research, but also, by her personal example, set a high standard for me to pursue in my future life.
I would also like to thank Professors Yanming Di, Virginia Lesser, Sinisa Todorovic,
and Bo Zhang for serving on my committee. Professor Lesser helped me a great deal with my conference attendance last year, as well as during my job search; I am truly grateful for that. Professor Zhang provided me a year of valuable research experience as his research assistant, which gave me a precious chance to study an area entirely different from my Ph.D. research. He also gave me much advice on looking for an internship, which I deeply appreciate. Professor Di has been supportive in serving on both my Ph.D. and master's committees.
I would like to thank all other professors in the department for teaching me statistics:
Professors David Birkes, Alix Gitelman, Lisa Madsen, Paul Murtaugh, Cliff Pereira, Dan
Schafer, and Bob Smythe. Your teaching opened the door of statistics to me and showed me a beautiful world. I would also like to thank Professor Yuan Jiang for his help when I had theoretical questions. Your passion for statistics is impressive and influential.
Finally, I would like to thank all of my friends, especially Meian, Zoey, Xuan, Gu and
Yan, for helping me, being with me, and trusting me. It’s my honor to know you all.
TABLE OF CONTENTS

1  INTRODUCTION

2  LAG SELECTION IN STOCHASTIC ADDITIVE MODELS
   2.1  Model and Introduction
   2.2  Polynomial spline estimation
        2.2.1  An initial estimator
        2.2.2  Penalized polynomial spline estimation
   2.3  Algorithm
        2.3.1  Tuning Parameter and Knots Selection
   2.4  Simulation studies
   2.5  Real data analysis
   2.6  Assumptions and Proofs
        2.6.1  Notation and Definitions
        2.6.2  Assumptions
        2.6.3  Proof of Preliminary Lemmas
        2.6.4  Proof of Theorems

3  CONSISTENT MODEL SELECTION IN ADDITIVE COEFFICIENT MODELS WITH GLOBAL OPTIMALITY
   3.1  The Model
   3.2  Penalized Polynomial Spline Estimation
   3.3  Optimal Properties
   3.4  Implementation
        3.4.1  The local linear approximation algorithm
        3.4.2  Selection of tuning parameters
   3.5  Simulation studies
        3.5.1  Example 1: Low dimensional case
        3.5.2  Example 2: High dimensional case with intercept
   3.6  Real data analysis
   3.7  Proof of Lemmas and Theorems
        3.7.1  Preliminary Lemmas
        3.7.2  Proof of Theorem 3.3.1
        3.7.3  Proof of Theorem 3.3.2

4  CONCLUSIONS

BIBLIOGRAPHY
LIST OF FIGURES

2.1  The empirical norms of all estimated additive components plotted against the tuning parameter $\lambda_n$. Data were simulated from the linear AR3 model (left) and the nonlinear NLAR1 model (right) in Table 2.1 for one run with $n = 500$. The location of the optimal parameter $\hat\lambda_n$ selected by BIC is marked by the dashed line.

2.2  The estimated relevant component functions for models NLAR1 ((a) and (b)) and NLAR2 ((c) and (d)) using three approaches. Model NLAR1 is fitted in the linear spline space and model NLAR2 in the cubic spline space. The dash-dotted and dotted lines represent the polynomial spline estimates of the oracle and full models, respectively. The dashed lines represent the penalized polynomial spline estimates. The true component functions are plotted as solid lines.

3.1  Fitted curves for each $\alpha_{ls}(\cdot)$, $l = 1, 2$, $s = 1, 2$ in Example 1. Plots (a1)-(a4) show the 100 fitted curves for $\alpha_{11}(x) = \sin(x)$, $\alpha_{12}(x) = x$, $\alpha_{21}(x) = \sin(x)$, $\alpha_{22}(x) = 0$, respectively, when the sample size is $n = 100$. Plots (b1)-(b4) and (c1)-(c4) are for $n = 250$ and $n = 500$, respectively. Plots (d1)-(d4) show the true component functions (solid) together with typical estimated curves for different sample sizes: dashed ($n = 100$), dotted ($n = 250$), and dot-dashed ($n = 500$).

3.2  Fitted curves for $\alpha_{11}(\cdot)$, $\alpha_{12}(\cdot)$, and $\alpha_{21}(\cdot)$ in Example 2. Plotted are the true functions (solid), typical oracle estimates (dashed), and typical SCAD estimates (dotted). The typical estimated curve is the one whose ISE is the median of the 100 ISEs from the replications.

3.3  Estimated curves of all fifteen additive coefficient components $\alpha_{ls}$, $l = 1, \ldots, 5$, $s = 1, 2, 3$. SCAD estimators are plotted as solid curves and full estimators as dashed curves. The non-zero components selected by SCAD are $\alpha_{11}$, $\alpha_{12}$, $\alpha_{13}$, $\alpha_{21}$, $\alpha_{41}$.

3.4  Actual housing prices plotted against predicted values from SCAD, FULL, and a linear regression model. To reduce crowding of all 891 points, 80 randomly selected points are plotted. In each plot, the dotted line is the axis of symmetry $y = x$, and the two dashed lines represent $y = x + 0.1|x|$ and $y = x - 0.1|x|$. Any point enclosed between the two dashed lines represents a "good" predicted price.
LIST OF TABLES

2.1  Autoregressive models in the simulation study.

2.2  Lag selection results for the simulation study. The columns U, C, and O give, respectively, the percentages of under-fitting, correct-fitting, and over-fitting over 500 replications.

2.3  The Penalty columns give MISEs from the penalized polynomial spline method. The Oracle and Full columns give MISEs of the polynomial spline estimates of the oracle and full models, respectively.

2.4  Analysis results for the US unemployment data. The Lags column gives the selected significant lags of the quarterly US unemployment series.

3.1  Variable selection results for Example 1. The columns U, C, and O give, respectively, the numbers of under-fitting, correct-fitting, and over-fitting fits out of 100 replications.

3.2  Estimation accuracy for Example 1. The Mean (SE) columns give the means and standard errors of $\hat\alpha_{10}$ and $\hat\alpha_{20}$. The AISE columns give the AISEs of $\hat\alpha_{11}$, $\hat\alpha_{12}$, $\hat\alpha_{21}$, and $\hat\alpha_{22}$, respectively.

3.3  Estimation accuracy for Example 2. The AISE columns give the AISEs of $\hat\alpha_{11}$, $\hat\alpha_{12}$, and $\hat\alpha_{21}$, respectively.

3.4  Analysis results for the Tucson housing price data.
FOR GRANDMOM.
VARIABLE SELECTION IN SEMI-PARAMETRIC MODELS
1  INTRODUCTION
Linear regression models are widely used in statistical analysis. They are classical tools for analyzing many types of data, including independent data, such as randomly sampled data, and dependent data, such as time series. However, like other parametric regression models, they rely on an explicit parametric form of the regression function. This restricts the applicability of the model and often requires prior knowledge of the data structure for correct modeling. An inappropriate application of parametric models may result in estimation bias and erroneous inferences. In contrast, nonparametric models provide great flexibility for exploring hidden data structures without any prior knowledge of the structure of the data. In recent years, nonparametric models have therefore gained increasing popularity as an alternative tool that avoids the problems caused by using inappropriate parametric models.
While general nonparametric models are useful for exploring data structures, they suffer from the "curse of dimensionality". When there are a large number of variables, the volume of the predictor space grows so quickly that the data become sparse in the space, and estimation becomes unreliable. In this case general nonparametric models can rarely be used to make effective inferences. Therefore, a class of semiparametric models was developed. Semiparametric models impose certain structures on the general nonparametric model while retaining some nonparametric components. As a result, semiparametric models can avoid the "curse of dimensionality" while still being more flexible than parametric models. This class of models includes the partially linear model (Härdle, Liang and Gao 2000), which allows part of the additive components in the linear regression to be nonparametric; the additive model (Stone 1985, Hastie and Tibshirani 1990), which relaxes the strict linearity assumption but retains the interpretable additive form; the single index model (Härdle and Stoker 1989, Ichimura 1993), which allows the mean response to be a nonparametric function of a linear combination of the predictors; the varying coefficient model (Hastie and Tibshirani 1993), which replaces the constant coefficient of each linear predictor by a nonparametric function of another predictor; and the additive coefficient model (Xue and Yang 2006 a & b), whose coefficients of the linear predictors are additive nonparametric functions of multiple predictors.
To estimate semiparametric/nonparametric models, we use one classical tool, polynomial splines. Compared with the local polynomial method, the polynomial spline method enables global smoothing and estimation of all nonparametric components through a single least squares fit, and therefore reduces computational complexity. General theories of polynomial spline smoothing were established in Stone (1985), de Boor (2001), and Huang (1998 a & b, 2000).
Besides estimation, model/variable selection is also crucial in semiparametric modeling. A large number of predictors are often introduced at the beginning of the modeling process to avoid missing any potentially important explanatory variables. Consequently, selection of significant predictors becomes necessary. The variable selection process not only reduces model complexity and eases interpretation, but also greatly enhances predictive ability. However, according to Breiman (1996), traditional approaches to variable selection, including stepwise and subset selection methods, are computationally intensive and unstable, and their sampling properties are difficult to summarize. Such manual selection also introduces stochastic errors at the selection stage that are difficult to account for in subsequent inference. Therefore, since the 1990s, a number of penalization methods have emerged as an alternative approach to variable selection in linear regression models. These penalization methods perform estimation and variable selection in the same modeling process, in which the coefficients are estimated and the significant variables are simultaneously and automatically selected. These penalized estimation methods include the bridge regression method proposed by Frank and Friedman (1993), the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996, 1997) and Efron, Hastie, Johnstone, and Tibshirani (2004), the smoothly clipped absolute deviation (SCAD) penalty method proposed by Fan and Li (2001), the adaptive LASSO proposed by Zou (2006), the Dantzig selector by Candès and Tao (2007), the minimax concave penalty by Zhang (2010), and the truncated L1 penalty by Shen, Pan, and Zhu (2012). All of these works assume a linear or parametric form for the regression model. More recent work extends the aforementioned penalization methods to nonparametric additive models. In particular, Lin and Zhang (2006) proposed the component selection and smoothing operator for model selection in a more general nonparametric ANOVA model. Meier, van de Geer, and Bühlmann (2009), Xue (2009), Huang, Horowitz, and Wei (2010), Xue, Qu and Zhou (2010), and Wang, Liu, Liang, and Carroll (2011) proposed to approximate the nonparametric components using B-spline bases and conduct variable selection using a group-wise model selection approach (Yuan and Lin 2006).
Among the penalization methods, the one with the SCAD penalty (Fan and Li 2001) is our focus in this thesis. Fan and Li (2001) showed that the SCAD penalty has the three properties of a desirable penalty function: sparsity, unbiasedness and continuity. That is, for a given tuning parameter, the SCAD penalty shrinks insignificant variables to zero while leaving relatively large significant variables unbiased after the shrinkage; in addition, the shrinkage is continuous under the SCAD penalty. Combining the SCAD penalty with polynomial spline estimation results in a penalized polynomial spline (PPS) method. In this method, for a given tuning parameter, we define the PPS estimator as the global minimizer of the objective function, which equals the sum of squared errors plus the sum of component-wise SCAD penalty functions. In the literature, the PPS method has recently been used for variable/model selection in various semiparametric models. For example, Wang, Li and Huang (2008), Wei, Huang and Li (2011), Lian (2012) and Xue and Qu (2012) considered PPS for variable selection in varying-coefficient models. Xue (2009), Huang, Horowitz and Wei (2010), and Xue, Qu and Zhou (2010) applied PPS for variable selection in (generalized) additive models. Ma, Song and Wang (2013) used PPS for variable selection in single index models.
In the PPS method, minimizing the objective function is challenging because the SCAD penalty is non-convex, and an approximation method is required to solve the minimization problem. A local quadratic approximation (LQA) method in the nonparametric framework was proposed by Xue (2009). A local linear approximation (LLA) method for linear regression was provided by Zou and Li (2008), who showed that the method is computationally efficient, since the one-step approximation is already asymptotically equivalent to the oracle estimator. In our research, we extend this LLA method to semiparametric models.
Finally, the tuning parameter involved in the PPS method needs to be determined via a model selection criterion such as the final prediction error (FPE, Akaike 1969), the Akaike information criterion (AIC, Akaike 1974), the Bayesian information criterion (BIC, Schwarz 1978), or a model selection method such as cross-validation (CV) or generalized cross-validation (GCV). These criteria and methods were originally proposed for the selection of parametric models for independent and identically distributed data. Recently, some of them have been proved consistent for the selection of nonparametric models for weakly dependent time series data. For example, Huang and Yang (2004) provided theoretical justification of the consistency of BIC for variable selection in nonlinear stochastic regression models, provided the data come from a strongly mixing (i.e., α-mixing) and strictly stationary stochastic process. The consistency of the cross-validation and FPE methods has been shown under stronger mixing conditions (i.e., β-mixing) elsewhere in the literature (Tschernig and Yang 2000).
The specific contribution of our research is the study of the theoretical properties and numerical performance of the PPS method for variable selection in two semiparametric models: the stochastic additive model (SAM, Chen and Tsay 1993) and the additive coefficient model (ACM, Xue and Yang 2006 a & b). The SAM is similar to the additive model (Stone 1985, Hastie and Tibshirani 1990); the difference is that the SAM assumes both responses and predictors come from a stationary stochastic process. We use the SAM to analyze weakly dependent time series data, and the ACM to analyze independent, randomly sampled data. For both models, in addition to numerical studies, we establish the asymptotic oracle properties in theory.

For the stochastic additive model (SAM), estimation methods have been studied extensively in the literature. For example, Chen and Tsay (1993) considered two backfitting techniques: the alternating conditional expectation algorithm of Breiman and Friedman (1985) and the BRUTO algorithm of Hastie and Tibshirani (1990). Huang and Yang (2004) estimated the additive components in the SAM using an efficient regression spline procedure. Wang and Yang (2007, 2009) proposed a spline-backfitted kernel estimator, which was shown to be both computationally expedient and theoretically reliable. However, the important issue of variable/lag selection in the SAM has not been well studied in general. Our PPS method performs estimation and variable selection in the same process. Consequently, consistency in both estimation and model selection is established theoretically under the more general assumption of weakly dependent data, rather than the commonly used assumption of independent data.
The second model we consider, the additive coefficient model (ACM) proposed by Xue and Yang (2006a), extends the varying coefficient model to be more flexible. It models the dynamic changes of regression coefficients by allowing the coefficient functions to vary with multiple variables in an additive functional form, and is particularly useful in modeling temporal and spatial data. As discussed in Xue and Yang (2006a), the additive coefficient model includes the aforementioned additive model, varying coefficient model, partially linear model, and linear regression model as special cases. Estimation of the additive coefficient model was studied in Xue and Yang (2006 a & b), using local polynomial and polynomial spline methods, respectively. The model selection problem for the ACM had not been studied prior to this research.

In our theoretical results, we prove a set of strong global oracle properties for the ACM. Recent literature, including Xue (2009) and Xue, Qu and Zhou (2010), considers the SCAD penalty but only guarantees that there exists a sequence of local minima that is consistent in recovering the sparse model and consistent in estimating the non-zero function components. However, when the objective function is non-convex, as in our case, a local minimum is not necessarily the global minimum. Therefore, there is a discrepancy between the theory and the fact that the PPS estimator is by definition a global solution. Previous literature has partially addressed this issue. For example, Xue and Qu (2012) proved the oracle properties of the global minimum under the varying coefficient model, a special case of the additive coefficient model, for the truncated L1 penalty. Kim, Choi and Oh (2008) discussed the oracle properties of the global minimum for the SCAD penalty, but only in linear regression models. Our research in this thesis is the first to study the global minimum under the general ACM with the SCAD penalty.
The remainder of the thesis contains two parts. Section 2 discusses lag selection in stochastic additive models. Subsection 2.1 defines the stochastic additive model and introduces the necessary notation. In Subsection 2.2, we describe an estimation procedure for the stochastic additive model and establish the asymptotic properties of the proposed estimator. We then introduce a penalized polynomial spline method for simultaneous estimation and variable selection in the stochastic additive model, and develop the model selection consistency and oracle properties of the proposed method. In Subsection 2.3, we describe an algorithm based on the local linear approximation of the nonconcave penalty function, and discuss how to select the tuning parameter via the Bayesian information criterion. Subsection 2.4 illustrates the numerical performance of the proposed methods through simulation studies, and Subsection 2.5 analyzes the US unemployment time series. Technical proofs, relevant definitions and assumptions are contained in Subsection 2.6.

Section 3 focuses on model selection in additive coefficient models. In Subsection 3.1, we introduce the additive coefficient model. In Subsection 3.2, we propose the penalized polynomial spline method for simultaneous model selection and estimation in the additive coefficient model. The asymptotic properties of the proposed method are established in Subsection 3.3. We describe an algorithm based on the local linear approximation to solve the non-convex minimization problem and discuss tuning parameter selection by the Bayesian information criterion (BIC) in Subsection 3.4. Subsections 3.5 and 3.6 contain simulation studies and an analysis of the Tucson housing price data. Technical proofs are relegated to Subsection 3.7.
2  LAG SELECTION IN STOCHASTIC ADDITIVE MODELS

2.1  Model and Introduction
Linear models such as the ARMA, ARIMA and SARIMA models (Brockwell and Davis 1991), and the VAR, VARMA and PAR models (Lütkepohl 1993), have been popular tools for modeling time series data. Their simple linear structures allow for easy estimation and inference about the data generating process. However, for many important time series, there is little a priori justification for assuming such linearity. Tong (1990, 1995) and Tjøstheim (1994) gave interesting examples of data with asymmetric cycles and nonlinear relationships between lagged variables. To model such nonlinear time series, non- and semi-parametric models are useful alternatives. In this paper, we consider a stochastic additive model. Let $(X_t, Y_t)_{t=-\infty}^{+\infty}$ with $X_t = (X_{t1}, \ldots, X_{td})^T$ be a stationary stochastic process satisfying
$$Y_t = \mu_0 + \mu_1(X_{t1}) + \mu_2(X_{t2}) + \cdots + \mu_d(X_{td}) + \varepsilon_t, \qquad (2.1)$$
where $\mu_0$ is an unknown constant, $\{\mu_l(\cdot)\}_{l=1}^d$ are unknown nonparametric functions, and $\{\varepsilon_t\}$ is white noise conditional on $X_t$. The covariates $X_t$ can contain both variables from an exogenous time series and lagged values of $Y_t$. When it contains only lagged values of $Y_t$, the model reduces to the nonlinear additive autoregressive model considered in Chen and Tsay (1993).
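To fix ideas, the following Python sketch generates data of the form (2.1) from a toy nonlinear additive autoregressive process and builds the lagged-covariate design used later in Sections 2.4 and 2.5. The specific component functions, burn-in length and noise level are illustrative choices of ours, not the models of Table 2.1, and the function names are hypothetical.

```python
import numpy as np

def simulate_nlar(n, burn_in=200, noise_var=0.1, seed=0):
    """Toy nonlinear additive autoregressive series of the form (2.1):
    Y_t = mu_1(Y_{t-1}) + mu_2(Y_{t-2}) + eps_t, with illustrative components."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        y[t] = (0.5 * np.sin(np.pi * y[t - 1])   # mu_1: nonlinear in lag 1
                - 0.3 * y[t - 2]                 # mu_2: linear in lag 2
                + rng.normal(0.0, np.sqrt(noise_var)))
    return y[burn_in:]                           # drop burn-in to approximate stationarity

def lag_design(y, d=10):
    """Covariate matrix with X_{tl} = Y_{t-l}, l = 1, ..., d, and the aligned response."""
    n = len(y)
    X = np.column_stack([y[d - l:n - l] for l in range(1, d + 1)])
    return X, y[d:]

y = simulate_nlar(500)
X, Y = lag_design(y, d=10)    # X.shape == (490, 10)
```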
2.2  Polynomial spline estimation
Let $(X_t, Y_t)_{t=1}^n$ be a sample of size $n$ generated from the stochastic additive model (2.1). The additive functions in model (2.1) are identified only up to additive constants. One practical solution is to impose the identification condition $E[\mu_l(X_{tl})] = 0$ for each $l = 1, \ldots, d$. Furthermore, since nonparametric functions are often estimated only on a compact set, we assume without loss of generality that each $X_{tl}$ has compact support $[0, 1]$. Note that the identification condition entails that $\mu_0 = E(Y)$. Therefore $\mu_0$ can be consistently estimated by the sample average $\bar{Y}$ at the root-$n$ rate, which is faster than any rate of convergence for nonparametric function estimation. For notational simplicity, we therefore assume $\mu_0 = 0$ and that the response variable $Y$ is centered with $\bar{Y} = 0$. Now let $\mu(X_t) = \sum_{l=1}^d \mu_l(X_{tl})$ be the additive function in (2.1) with $E[\mu_l(X_{tl})] = 0$ for each $l = 1, \ldots, d$. The main objectives of this paper are (a) to establish the rate of convergence of the polynomial spline estimator of the nonparametric functions in the stochastic additive model (2.1) for weakly dependent data, and (b) to propose a penalized polynomial spline method for simultaneous lag selection and nonparametric function estimation in model (2.1).
Let $H_{[0,1]}^l = \{f : E[f(X_l)] = 0,\ E[f(X_l)]^2 < \infty\}$ be the space of all square integrable functions that are centered with respect to the probability measure of $X_l$. Here $X_l$ is the $l$-th component of $X = (X_1, \ldots, X_d)^T$. Assume $\mu_l \in H_{[0,1]}^l$ for $l = 1, \ldots, d$. Then $H = \{h : h(X) = \sum_{l=1}^d h_l(X_l),\ h_l \in H_{[0,1]}^l \text{ for } l = 1, \ldots, d\}$ is the space of all square integrable additive functions on $[0,1]^d$, with $\mu \in H$. For any functions $f, g \in H$, define the theoretical and empirical inner products as $(f, g) = E[f(X)g(X)]$ and $(f, g)_n = \frac{1}{n}\sum_{t=1}^n f(X_t)g(X_t)$, respectively. The induced theoretical and empirical norms are $\|g\|^2 = (g, g)$ and $\|g\|_n^2 = (g, g)_n$, respectively. Then Lemma 2.6.1 in the Appendix entails that, under some reasonable assumptions on the joint density of $X$, $H$ is theoretically identifiable in the sense that for any $h \in H$, $\|h\| = 0$ implies $h_l = 0$ a.s. for $l = 1, \ldots, d$.
For each $l = 1, \ldots, d$, we estimate the unknown function $\mu_l$ in (2.1) by polynomial splines. A polynomial spline is a piecewise polynomial that is connected smoothly over a set of interior knots. Let $\{0 = x_{l,0} < x_{l,1} < \cdots < x_{l,N_n} < x_{l,N_n+1} = 1\}$ be a set of $N_n$ interior knots. Then for any integer $p > 0$, let $G_l$ be the space of polynomial spline functions that are polynomials of order $p$ (or less) on each interval $[x_{l,j}, x_{l,j+1}]$, $j = 0, \ldots, N_n$, and are overall $p - 1$ times continuously differentiable on $[0, 1]$. The polynomial spline has been widely used in nonparametric function estimation because it often provides a good approximation to smooth functions with only a small number of knots. Since each $\mu_l$ in model (2.1) is theoretically centered, we consider $G_l^n = \{g : g \in G_l,\ \sum_{t=1}^n g(X_{tl})/n = 0\}$, the space of empirically centered polynomial splines. Let $G^n = \{g : g(X) = g_1(X_1) + \cdots + g_d(X_d),\ g_l \in G_l^n \text{ for } l = 1, \ldots, d\}$, which provides an approximation for functions in the model space $H$. The following lemma shows that the theoretical and empirical norms are asymptotically equivalent on $G^n$.
Lemma 2.2.1. Under Assumptions (A1)-(A3) in the Appendix, one has
$$\sup_{g \in G^n} \left| \frac{\|g\|_n^2}{\|g\|^2} - 1 \right| = o_P(1),$$
that is, for any $\varepsilon > 0$,
$$\limsup_{n \to \infty} P\left( \sup_{g \in G^n} \left| \frac{\|g\|_n^2}{\|g\|^2} - 1 \right| > \varepsilon \right) = 0. \qquad (2.2)$$

Lemma 2.2.1 indicates that, except on a set whose probability goes to zero, the theoretical norm can be approximated by the empirical norm. Together with Lemma 2.6.1, this implies that the approximation space $G^n$ is identifiable under both the theoretical and the empirical norm.

Lemma 2.2.2. Under Assumption (A2), for any $g \in G^n$, $\|g\|_n = 0$ or $\|g\| = 0$ implies $g_l = 0$ a.s. for $l = 1, \ldots, d$.
2.2.1  An initial estimator
Suppose that each $\mu_l$ in model (2.1) is smooth and can be approximated by a polynomial spline function $g_l \in G_l^n$ for $l = 1, \ldots, d$. Then one has
$$\mu(X_t) \approx \sum_{l=1}^d g_l(X_{tl}), \qquad t = 1, \ldots, n. \qquad (2.3)$$
Let $b_l^n(x) = b_l(x) - \frac{1}{n}\sum_{t=1}^n b_l(X_{tl})$, where $b_l(x) = \big(x, x^2, \ldots, x^p, (x - x_{l,1})_+^p, \ldots, (x - x_{l,N})_+^p\big)^T$ and $(x)_+^p = (x_+)^p$. Here $b_l^n(x)$ is the empirically centered truncated power basis for $G_l^n$. Denote $\boldsymbol{\mu} = (\mu(X_1), \ldots, \mu(X_n))^T$ and $B_l = (b_l^n(X_{1l}), \ldots, b_l^n(X_{nl}))^T$. Then an equivalent matrix expression of (2.3) is
$$\boldsymbol{\mu} \approx \sum_{l=1}^d B_l \beta_l, \qquad (2.4)$$
where the coefficients $\beta = (\beta_1^T, \ldots, \beta_d^T)^T$ can be estimated by minimizing the sum of squares
$$\hat\beta = \arg\min_{\beta} \left\| Y - \sum_{l=1}^d B_l \beta_l \right\|_2^2. \qquad (2.5)$$
Here $Y = (Y_1, \ldots, Y_n)^T$ and $\|\cdot\|_2$ is the vector $L_2$-norm. Equivalently, $\hat\mu = \sum_{l=1}^d \hat\mu_l$ with $\hat\mu_l = (b_l^n)^T \hat\beta_l$ satisfies
$$\hat\mu = \arg\min_{g_l \in G_l^n} \left\| Y - \sum_{l=1}^d g_l \right\|_n^2, \qquad (2.6)$$
where, with a slight abuse of notation, $Y$ denotes a random function that interpolates the values $Y_1, \ldots, Y_n$ at the data points. For independent and identically distributed data, it has been shown in the literature (Stone 1985 and Huang 1998) that the polynomial spline estimator $\hat\mu$ in (2.6) is $L_2$-consistent. However, the asymptotic behavior of $\hat\mu$ for time series data is less well understood in general. Theorem 2.2.1 below establishes that $\hat\mu$ in (2.6) is also $L_2$-consistent with an optimal rate of convergence for weakly dependent data that are geometrically $\alpha$-mixing.

Theorem 2.2.1. Under Assumptions (A1)-(A3) in the Appendix, $\|\hat\mu - \mu\|^2 = O_P(N_n/n + \rho_n^2)$ and $\|\hat\mu - \mu\|_n^2 = O_P(N_n/n + \rho_n^2)$, with $\rho_n = 1/N_n^{p+1}$.

The proof of Theorem 2.2.1 is given in the Appendix. Theorem 2.2.1 demonstrates the consistency of the polynomial spline estimator; however, it fails to produce a parsimonious model when there are redundant variables. This result is useful for providing an initial consistent estimator for the subsequent development of simultaneous lag/variable selection and estimation in the stochastic additive model.
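As a concrete illustration of (2.3)-(2.6), the sketch below builds the empirically centered truncated power basis for each covariate, using equally spaced interior knots, and computes the unpenalized least squares fit. The helper names (`truncated_power_basis`, `centered_blocks`, `initial_spline_fit`) are ours, and the knot number is a placeholder.

```python
import numpy as np

def truncated_power_basis(x, knots, p=3):
    """b_l(x) = (x, ..., x^p, (x - k_1)_+^p, ..., (x - k_N)_+^p) for one covariate."""
    powers = np.column_stack([x ** j for j in range(1, p + 1)])
    trunc = np.column_stack([np.clip(x - k, 0.0, None) ** p for k in knots])
    return np.hstack([powers, trunc])

def centered_blocks(X, n_knots=5, p=3):
    """Empirically centered design blocks B_1, ..., B_d with equally spaced interior knots."""
    blocks = []
    for l in range(X.shape[1]):
        lo, hi = X[:, l].min(), X[:, l].max()
        knots = np.linspace(lo, hi, n_knots + 2)[1:-1]
        B = truncated_power_basis(X[:, l], knots, p)
        blocks.append(B - B.mean(axis=0))       # centering, as in the definition of b_l^n
    return blocks

def initial_spline_fit(X, Y, n_knots=5, p=3):
    """Unpenalized least squares estimate of the spline coefficients, as in (2.5)."""
    blocks = centered_blocks(X, n_knots, p)
    B = np.hstack(blocks)
    beta, *_ = np.linalg.lstsq(B, Y - Y.mean(), rcond=None)
    sizes = np.cumsum([blk.shape[1] for blk in blocks])[:-1]
    return np.split(beta, sizes), blocks        # per-component coefficients and blocks
```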
2.2.2  Penalized polynomial spline estimation
For variable selection in the stochastic additive model (2.1), we propose a penalized polynomial spline method. It minimizes the sum of squared errors subject to an additive penalty on the $L_2$ norms of the spline functions. To be more specific, let $\hat\mu^{PL}$ be the penalized polynomial spline (PPS) estimator defined as
$$\hat\mu^{PL} = \arg\min_{g \in G^n} l(g) = \arg\min_{g \in G^n} \left\{ \frac{1}{2} \left\| Y - \sum_{l=1}^d g_l \right\|_n^2 + \sum_{l=1}^d p_\lambda(\|g_l\|_n) \right\}. \qquad (2.7)$$
Here the penalty function $p_\lambda(\cdot)$ is the smoothly clipped absolute deviation (SCAD) penalty proposed in Fan and Li (2001), whose derivative is
$$p'_\lambda(|\beta|) = \lambda I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{a - 1} I(|\beta| > \lambda) \qquad (2.8)$$
with the constant $a = 3.7$ as in Fan and Li (2001) and a positive tuning parameter $\lambda$ whose selection is discussed in Subsection 2.3.1. The SCAD penalty, rather than other penalty functions, is used here because of its desirable properties of unbiasedness, sparsity and continuity (Fan and Li 2001).

The following two theorems show that, for geometrically $\alpha$-mixing time series data, the proposed penalized polynomial spline estimator with the SCAD penalty has nice asymptotic properties in terms of both estimation consistency and sparsity of the variable selection. Without loss of generality, assume that only the first $r$ components of $X_t$ contribute to explaining $Y_t$ in model (2.1). Correspondingly, $\mu = \sum_{l=1}^r \mu_l$, where $r = \min\{s : P[\mu_{s+1}(X_{s+1}) = 0] = \cdots = P[\mu_d(X_d) = 0] = 1\}$.

Theorem 2.2.2. (Estimation consistency) Under Assumptions (A1)-(A5) in the Appendix, for $n$ sufficiently large, there exists a local minimizer $\hat\mu^{PL}$ of the criterion function $l(\cdot)$ in $G^n$ such that $\|\hat\mu^{PL} - \mu\|^2 = O_P(N_n/n + \rho_n^2)$. Furthermore, $\|\hat\mu_l^{PL} - \mu_l\|^2 = O_P(N_n/n + \rho_n^2)$ for $l = 1, \ldots, d$.

Theorem 2.2.3. (Sparsity) Under Assumptions (A1)-(A5) in the Appendix, except on a set whose probability tends to zero as $n \to \infty$, $\hat\mu_l^{PL} = 0$ a.s. for $l = r + 1, \ldots, d$.

Theorem 2.2.3 shows that, unlike the initial polynomial spline estimator, the penalized polynomial spline estimator correctly shrinks the nonparametric components of redundant variables to zero and provides a parsimonious fitted model with probability approaching one. Theorem 2.2.2 further indicates that $\hat\mu^{PL}$ converges to $\mu$ at the same optimal rate as the initial polynomial spline estimator. Therefore, adding the SCAD penalty term does not change the accuracy of estimation. These two facts together establish the advantages of the penalized polynomial spline estimator with the SCAD penalty.
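For reference, the sketch below writes out the SCAD penalty value obtained by integrating the derivative in (2.8) with $a = 3.7$, and evaluates the penalized criterion (2.7) on the basis blocks from the previous sketch; the function names are ours.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) for t >= 0, obtained by integrating the derivative in (2.8)."""
    t = np.asarray(t, dtype=float)
    small = lam * t
    middle = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))

def pps_objective(blocks, Y, betas, lam, a=3.7):
    """Penalized criterion (2.7): 0.5 * ||Y - sum_l B_l beta_l||_n^2 + sum_l p_lambda(||g_l||_n)."""
    n = len(Y)
    fit = sum(B @ b for B, b in zip(blocks, betas))
    sse = 0.5 * np.sum((Y - fit) ** 2) / n                          # empirical squared norm
    comp_norms = np.array([np.linalg.norm(B @ b) / np.sqrt(n) for B, b in zip(blocks, betas)])
    return sse + float(np.sum(scad_penalty(comp_norms, lam, a)))
```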
2.3  Algorithm
Note that the Newton-Raphson algorithm cannot be used directly to minimize the penalized polynomial spline criterion in (2.7), since the SCAD penalty function does not have a continuous second-order derivative. Instead, we extend the local linear approximation (LLA) algorithm of Zou and Li (2008) to solve the penalized polynomial spline problem. In the LLA algorithm, we substitute the SCAD penalty in the criterion function $l(g)$ by its local linear approximation, and then compute the minimizer of the adjusted criterion function using a coordinate-wise descent algorithm (CWD, Yuan and Lin 2006).

Motivated by Zou and Li (2008), we first approximate the SCAD penalty function by a local linear function. To be more specific, for a given initial estimator $g_l^{(0)} = \hat\mu_l$, one can write $p_\lambda(\|g_l\|_n) \approx p_\lambda(\|g_l^{(0)}\|_n) + p'_\lambda(\|g_l^{(0)}\|_n)(\|g_l\|_n - \|g_l^{(0)}\|_n)$ for $l = 1, \ldots, d$. As a result, the objective function in (2.7) can be approximated, up to a constant, by
$$S(g \mid g^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d g_l \right\|_n^2 + \sum_{l=1}^d p'_\lambda(\|g_l^{(0)}\|_n)\, \|g_l\|_n. \qquad (2.9)$$
Therefore the minimizer of $S(g \mid g^{(0)})$ is an approximate solution to the original minimization problem (2.7). Equivalently, in terms of the spline basis representation, (2.9) can be written as
$$S(\beta \mid \beta^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d B_l \beta_l \right\|_n^2 + \sum_{l=1}^d q_{\lambda,l} \|\beta_l\|_{K_l}, \qquad (2.10)$$
where $\|\beta_l\|_{K_l} = \sqrt{\beta_l^T K_l \beta_l}$ with $K_l = B_l^T B_l / n$, and $q_{\lambda,l} = p'_\lambda(\|\beta_l^{(0)}\|_{K_l})$ is a constant depending only on the initial value $\beta_l^{(0)}$. This reduces to a group lasso problem with component-specific tuning parameters $q_{\lambda,l}$, which can be solved by applying the coordinate-wise descent (CWD) algorithm as in Yuan and Lin (2006). Notice that by letting $\beta_l^* = D_l \beta_l$ and $B_l^* = B_l D_l^{-1}$ with $D_l = (B_l^T B_l)^{1/2}$, the objective function in (2.10) becomes
$$n\, S(\beta \mid \beta^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d B_l^* \beta_l^* \right\|_2^2 + \sum_{l=1}^d \sqrt{n}\, q_{\lambda,l} \|\beta_l^*\|_2. \qquad (2.11)$$
Proposition 1 of Yuan and Lin (2006) shows that the solution to (2.11) can be found iteratively by
$$\beta_l^{*(k)} = \left( 1 - \frac{\sqrt{n}\, q_{\lambda,l}}{\|S_{l,k}^*\|_2} \right)_+ S_{l,k}^*, \qquad \text{where } S_{l,k}^* = B_l^{*T}\Big( Y - \sum_{j \neq l} B_j^* \beta_j^{*(k)} \Big).$$
Let $\hat\beta^* = (\hat\beta_1^{*T}, \ldots, \hat\beta_d^{*T})^T$ be the value of $\beta^*$ at convergence. Then the minimizer of (2.10), $\hat\beta = (\hat\beta_1^T, \ldots, \hat\beta_d^T)^T$, is recovered by $\hat\beta_l = D_l^{-1} \hat\beta_l^*$ for $l = 1, \ldots, d$. Our experience shows that the CWD algorithm converges in just a few iterations.
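A minimal sketch of one LLA step for (2.7) followed by the coordinate-wise descent updates for (2.11) is given below, assuming the centered design blocks and an initial unpenalized estimate (for example, from the earlier sketch). `scad_deriv` transcribes (2.8); the remaining names, the convergence tolerance and the Cholesky-based choice of $D_l$ are our own implementation details.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) from (2.8), for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + (np.clip(a * lam - t, 0.0, None) / (a - 1)) * (t > lam)

def lla_cwd(blocks, Y, lam, beta0, a=3.7, n_iter=100, tol=1e-8):
    """One LLA step for (2.7), solved by coordinate-wise descent on the
    orthonormalized representation (2.11) with group soft-thresholding."""
    n = len(Y)
    # D_l with D_l^T D_l = B_l^T B_l, so that B_l* = B_l D_l^{-1} has orthonormal columns.
    Ds = [np.linalg.cholesky(B.T @ B).T for B in blocks]
    Bstar = [B @ np.linalg.inv(D) for B, D in zip(blocks, Ds)]
    # Component-specific weights q_{lambda,l} = p'_lambda(||beta_l^(0)||_{K_l}).
    norms0 = np.array([np.linalg.norm(B @ b) / np.sqrt(n) for B, b in zip(blocks, beta0)])
    q = scad_deriv(norms0, lam, a)
    bstar = [D @ b for D, b in zip(Ds, beta0)]
    for _ in range(n_iter):
        max_change = 0.0
        for l, Bs in enumerate(Bstar):
            partial = Y - sum(Bstar[j] @ bstar[j] for j in range(len(Bstar)) if j != l)
            s = Bs.T @ partial
            s_norm = np.linalg.norm(s)
            shrink = max(0.0, 1.0 - np.sqrt(n) * q[l] / s_norm) if s_norm > 0 else 0.0
            new = shrink * s                     # group soft-thresholding update
            max_change = max(max_change, np.linalg.norm(new - bstar[l]))
            bstar[l] = new
        if max_change < tol:
            break
    return [np.linalg.solve(D, b) for D, b in zip(Ds, bstar)]   # beta_l = D_l^{-1} beta_l*
```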
2.3.1  Tuning Parameter and Knots Selection
In this section we discuss how to choose the smoothing and tuning parameters. For the polynomial spline estimation, the polynomial spline spaces $\{G_l\}_{l=1}^d$ depend on the knot sequence $\{x_{l,1}, \ldots, x_{l,N}\}$ with $N$ interior knots. We use an equally spaced knot sequence with the number of knots $N$ selected by the Bayesian information criterion (BIC), defined as $\mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k_n \ln(n)$, where RSS is the residual sum of squares, $k_n$ is the number of free parameters estimated in (2.6), and $n$ is the sample size. This strategy was also used in Xue (2009) and Huang et al. (2004). In general the number of interior knots can differ across the spline spaces $G_l$; here we use the same knot number $N$ for all $d$ spline spaces for computational simplicity.

For lag selection, we use the same optimal number of interior knots $N$ selected by the BIC in the polynomial spline approach. In addition, the definition of the SCAD penalty function in equation (2.8) involves two tuning parameters, $a$ and $\lambda_n$. Following Fan and Li (2001), we take $a = 3.7$. The parameter $\lambda_n$ plays an important role in the lag selection results: a larger value of $\lambda_n$ leads to a simpler model with fewer variables. We select the optimal $\lambda_n$ by the BIC, considering $\lambda_n$ on a grid from $\min_l \|\beta_l^{(0)}\|/a$ up to a value large enough to shrink all components of $\hat\mu^{PL}$ to zero.
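One possible implementation of the BIC-based choice of $\lambda_n$ described above is sketched here, reusing `lla_cwd` from the previous sketch. Counting the coefficients of the non-zero components as the degrees of freedom is our simple proxy for $k_n$, not necessarily the exact choice used in the thesis, and the grid construction is left to the caller.

```python
import numpy as np

def bic(Y, fitted, df):
    """BIC = n * ln(RSS / n) + k_n * ln(n), as defined in Subsection 2.3.1."""
    n = len(Y)
    rss = np.sum((Y - fitted) ** 2)
    return n * np.log(rss / n) + df * np.log(n)

def select_lambda(blocks, Y, beta0, lam_grid):
    """Refit by LLA/CWD over a grid of lambda values and keep the BIC minimizer."""
    Yc = Y - Y.mean()
    best_score, best_lam, best_beta = np.inf, None, None
    for lam in lam_grid:
        beta = lla_cwd(blocks, Yc, lam, beta0)                  # from the previous sketch
        fitted = sum(B @ b for B, b in zip(blocks, beta))
        df = sum(b.size for b in beta if np.linalg.norm(b) > 1e-10)
        score = bic(Yc, fitted, df)
        if score < best_score:
            best_score, best_lam, best_beta = score, lam, beta
    return best_lam, best_beta, best_score
```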
2.4  Simulation studies
In this section we use simulation to study the finite-sample performance of the proposed penalized polynomial spline method. Denote by $S_0$ the set of relevant variables in the true model (2.1). Following Huang and Yang (2004), we say that the variable set of the fitted model, denoted $S$, is an overfit of $S_0$ if $S$ is a proper superset of $S_0$, and a correct fit if $S$ is exactly $S_0$. In all other cases, we say $S$ is an underfit of $S_0$. Furthermore, we use the median integrated squared error (MISE) to evaluate the estimation accuracy of a fitted model, defined as
$$\mathrm{MISE} = \sum_{l=1}^d \mathop{\mathrm{Median}}_{1 \le r \le n_{rep}} \left\{ \frac{1}{n_{grid}} \sum_{i=1}^{n_{grid}} \left( \hat\mu_l^{(r)}(x_{i,l}^G) - \mu_l(x_{i,l}^G) \right)^2 \right\}.$$
Here $\hat\mu_l^{(r)}$ is the $l$-th component of the fitted function estimated from the $r$-th replication, and $x_{i,l}^G$, $i = 1, \ldots, n_{grid}$, are the grid points at which $\hat\mu_l^{(r)}$ is evaluated.
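The fit classification and the MISE above can be computed as in the sketch below; the array layout (replications by components by grid points) and the function names are our assumptions.

```python
import numpy as np

def classify_fit(selected, true_set):
    """Label a selected lag set as 'correct', 'overfit' (proper superset), or 'underfit'."""
    selected, true_set = set(selected), set(true_set)
    if selected == true_set:
        return "correct"
    return "overfit" if selected > true_set else "underfit"

def mise(estimates, truth):
    """MISE: for each component, the median over replications of the grid-averaged
    squared error, summed over components.
    estimates: shape (n_rep, d, n_grid); truth: shape (d, n_grid)."""
    ise = np.mean((estimates - truth[None, :, :]) ** 2, axis=2)   # (n_rep, d)
    return float(np.sum(np.median(ise, axis=0)))
```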
We simulate 500 random samples of size $n = 100$, 200 or 500 from each of the additive autoregressive models given in Table 2.1. The same models were also used in Huang and Yang (2004). The models in Table 2.1 contain only one or two relevant lags, with either linear or nonlinear functional forms. In addition, the error terms $\varepsilon_t$ in (2.1) are independent and identically distributed $N(0, 0.1)$ random variables. We consider the selection of relevant lags from a set of ten possible lags; that is, we take $X_t = (X_{t1}, \ldots, X_{td})^T$ in model (2.1) to be the lagged values of $Y_t$, with $X_{tl} = Y_{t-l}$ for $l = 1, \ldots, 10$. We also compare the estimation accuracy of our method with the polynomial spline estimates of the oracle and the full models. The full model is an additive autoregressive model (2.1) containing all 10 possible lags, while the oracle model contains only the relevant lags in model (2.1).

We applied the penalized spline regression with both linear ($p = 1$) and cubic ($p = 3$) splines. The BIC criterion was used to select the optimal tuning parameter $\hat\lambda_n$. For one run with sample size $n = 500$, Figure 2.1 plots the empirical norms of the estimated additive components $\hat\mu_l^{PL}(\lambda_n)$ from the penalized cubic spline ($p = 3$) against $\lambda_n$ for the linear AR3 model and the nonlinear NLAR1 model. One can see that most component norms shrink to zero once $\lambda_n$ is large enough; therefore, larger values of $\lambda_n$ lead to a simpler model with fewer selected variables. The dashed line in Figure 2.1 marks the location of the optimal $\hat\lambda_n$. It clearly indicates that the BIC works reasonably well for this run, since it chooses a $\hat\lambda_n$ that provides the correct fit. The lag selection results for each autoregressive model are presented in Table 2.2, in terms of under-fitting, correct-fitting and over-fitting percentages. One can see that, for all additive autoregressive models, the percentage of correct fitting increases quickly to 100%, or close to 100%, as the sample size increases to 500. This numerically verifies Theorem 2.2.3, namely that the penalized polynomial spline method is consistent for variable selection.
Besides the lag selection results, we also compared the estimation accuracy of the penalized polynomial spline estimator with the polynomial spline estimates of the oracle and full models. Table 2.3 summarizes the MISEs of the three estimators for all autoregressive models in Table 2.1. In almost all cases, the full estimator has the largest MISEs while the oracle estimator has the smallest. This is not surprising, since the oracle estimator uses information about the data generating process that is not available in real data analysis. The MISEs of the penalized spline estimator are much smaller than those of the full estimator and very close to those of the oracle estimator. Furthermore, for one run with sample size $n = 500$, Figure 2.2 plots the estimated component curves from the three approaches for model NLAR1 with $p = 1$ and model NLAR2 with $p = 3$. Figure 2.2 graphically confirms the results in Table 2.3 and clearly shows that the proposed method estimates the unknown functions reasonably well. Both Figure 2.2 and Table 2.3 numerically support Theorem 2.2.2: the penalized polynomial spline estimator can estimate the model as accurately as the oracle when the sample size is large enough. The performance of the cubic and linear spline fits is comparable, except that the cubic fit gives smoother estimated component curves, as can be seen in Figure 2.2.
2.5  Real data analysis
We applied the proposed method to the quarterly US unemployment rate data from the first quarter of 1948 to the last quarter of 1978. Denote this time series by $\{R_t\}_{t=1}^{120}$. The data cover unemployed people in the labor force who are at least 16 years old, of all ethnic origins, races and sexes, without distinction between industries and occupations. We deseasonalized the series by taking the fourth difference of the data, obtaining the new series $\{Y_t\}_{t=1}^{116}$ with $Y_t = R_{t+4} - R_t$ for $t = 1, \ldots, 116$. In our analysis, the last 16 observations of $Y_t$ were set aside for prediction, and the rest were used for model fitting.

As in the simulation study, we considered the last 10 lags of $Y_t$ as possible predictor variables. We then applied the penalized polynomial spline method for lag selection, as well as the unpenalized spline estimation of the full model consisting of all 10 lags. We used cubic spline functions ($p = 3$) for both approaches. For each model fit, the coefficient of determination $R^2$, the mean squared estimation error (MSEE) and the mean squared prediction error (MSPE) were calculated. Denoting by $\hat Y_t$ the estimated or predicted value of $Y_t$, and $\bar Y = \frac{1}{90}\sum_{t=11}^{100} Y_t$, we write
$$R^2 = 1 - \frac{\sum_{t=11}^{100} (Y_t - \hat Y_t)^2}{\sum_{t=11}^{100} (Y_t - \bar Y)^2}, \qquad \mathrm{MSEE} = \frac{1}{90}\sum_{t=11}^{100} (Y_t - \hat Y_t)^2, \qquad \mathrm{MSPE} = \frac{1}{16}\sum_{t=101}^{116} (Y_t - \hat Y_t)^2.$$
We also compute the mean absolute estimation error (MAEE) and the mean absolute prediction error (MAPE),
$$\mathrm{MAEE} = \frac{1}{90}\sum_{t=11}^{100} \big|Y_t - \hat Y_t\big|, \qquad \mathrm{MAPE} = \frac{1}{16}\sum_{t=101}^{116} \big|Y_t - \hat Y_t\big|.$$
The $R^2$, MSEE and MAEE measure how well the models fit the data, while MSPE and MAPE compare the prediction performance of the models. The results are reported in Table 2.4. The full model with all ten lags gives smaller estimation errors (MSEE and MAEE) than the penalized one, owing to its larger model size. However, the penalized method not only gave a parsimonious model that is much easier to interpret, but also had better prediction performance, with smaller prediction errors (MSPE and MAPE). Finally, we also considered a linear autoregressive model with the two lags $Y_{t-1}$ and $Y_{t-2}$ selected by the penalized polynomial spline method, that is, $Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \epsilon_t$. It gives much larger estimation and prediction errors, suggesting that the nonlinear additive model better describes the underlying data generating structure of the quarterly US unemployment data.
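A minimal sketch of the preprocessing and the fit/prediction metrics used above follows; the split sizes (90 observations for fitting, 16 held out) follow the text, and the function names are ours.

```python
import numpy as np

def fourth_difference(R):
    """Deseasonalize the quarterly series: Y_t = R_{t+4} - R_t."""
    R = np.asarray(R, dtype=float)
    return R[4:] - R[:-4]

def fit_and_prediction_metrics(y_fit, yhat_fit, y_hold, yhat_hold):
    """R^2, MSEE, MAEE on the fitting sample; MSPE, MAPE on the hold-out sample."""
    resid = y_fit - yhat_fit
    return {
        "R2": 1.0 - np.sum(resid ** 2) / np.sum((y_fit - y_fit.mean()) ** 2),
        "MSEE": np.mean(resid ** 2),
        "MAEE": np.mean(np.abs(resid)),
        "MSPE": np.mean((y_hold - yhat_hold) ** 2),
        "MAPE": np.mean(np.abs(y_hold - yhat_hold)),
    }
```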
2.6  Assumptions and Proofs

2.6.1  Notation and Definitions
∞
First, suppose we have two sequences {an }∞
n=1 and {bn }n=1 .
Define an � bn if
lim an
n→∞
bn
= 0, an ≍ bn if an � bn and bn � an . For Gnl , the spline space of polynomial
functions on [0, 1] with order p (or less), we denote its dimension as Jn . Clearly Jn =
Nn + p + 1. On the approximation space Gn , we define two important constants, An =
{
}
1g1∞
1/2
sup
and ρn = infn
lg − µl∞ . Note that according to Huang(1998), An ≍ Jn ≍
1g1
g∈Gn
1/2
Nn ,
g∈G
1
≍ p+1
for polynomial spline space under Assumption (A3).
Nn
�
Recall that µ̂ = dl=1 µ
jl (Xl ) is the least squared estimator of µ in Gn based on the
�
sample. Furthermore, we denote µ̃ = dl=1 µ
Jl (Xl ) to be the best approximation of µ in
and ρn ≍
1
Jnp+1
Gn with respect to the empirical norm. Let Q be the projection operator onto Gn with
respect to the empirical inner product. We can write µ̃ = Qµ and µ̂ = QY . Consequently,
the error can be decomposed as µ̂ − µ = (µ̂ − µ̃) + (µ̃ − µ). By Triangular Inequality, one
has lµ̂ − µl � lµ̂ − µ̃l + lµ̃ − µl.
As mentioned in Section 2.2.2, when there are redundant variables, without loss
�
of generality, we can write the true regression function as µ(X) = rl=1 µl (Xl ) for some
r � d. Correspondingly denote G(0) = {g : g(X) = g1 (X1 ) + · · · + gr (Xr ),where gl ∈ Gnl
for l = 1, . . . , r}. Then G(0) is an approximation space of µ other than Gn . Similar to
Q, one can also define a projection operator Q(0) on to G(0) with respect to the empirical
inner product. Let µ̂(0) = Q(0) Y , which is the least square estimation of µ in the space
G(0) instead.
At last, we introduce the α-mixing coefficient α(s) of a stationary stochastic process
20
∞
(Yt , Xt )+
t=−∞ , which measures the strength of dependence for any two data points that
are at least s time units apart. To be more specific,
{
}
α(s) = sup |P (A)P (B) − P (A ∩ B)| :A∈σ({Xt′ , Yt′ , t′ � t}), B∈σ({Xt′ , Yt′ , t′ ; t + s}) .
(2.12)
Here for any index set Γ, σ({Xt , Yt , t ∈ Γ}) is the σ-field generated by {Xt , Yt , t ∈ Γ}.
Note that in stationary stochastic process, α(s) does not vary with t in Equation (2.12).
2.6.2  Assumptions
To establish the asymptotic theory, we need the following assumptions.

(A1) The stochastic process $(X_t, Y_t)$ is stationary and $\alpha$-mixing, with $\alpha$-mixing coefficient $\alpha(s) \le C_1 e^{-C_2 s}$ for some constants $C_1, C_2 > 0$.

(A2) The joint density of $X_t$, denoted $f_{X_t}$, is bounded away from $0$ and $\infty$ on the compact support $[0, 1]^d$.

(A3) For each spline space $G_l^n$, $l = 1, \ldots, d$, the number of interior knots $N_n$ satisfies $N_n \asymp n^\theta$ with $0 < \theta < \frac{1}{3}$, and the knots $t_{l,1}, \ldots, t_{l,N_n}$ satisfy
$$\frac{\max_{1 \le j \le N_n} |t_{l,j+1} - t_{l,j}|}{\min_{1 \le j \le N_n} |t_{l,j+1} - t_{l,j}|} < \eta$$
for a positive constant $\eta$.

(A4) The tuning parameter $\lambda_n$ in the SCAD penalty function satisfies $\lim_{n \to \infty} \lambda_n = 0$.

(A5) The tuning parameter $\lambda_n$ in the SCAD penalty function satisfies $\lim_{n \to \infty} \sqrt{N_n/n + \rho_n^2}\,\big/\,\lambda_n = 0$.
2.6.3  Proof of Preliminary Lemmas
Proof of Lemma 2.2.1. Let $G^n_{UB} = \{g \in G^n : \|g\| \le 1\} \subseteq G^n$ be the subset of functions in $G^n$ lying in the unit ball under the theoretical norm $\|\cdot\|$ defined before. Note that, to prove Lemma 2.2.1, it is sufficient to prove Equation (2.2) for all $g \in G^n_{UB}$.

For any $\varepsilon_1, \varepsilon_2 > 0$, let $f_1, f_2, g_1, g_2 \in G^n_{UB}$ with $\|f_1 - f_2\| \le \varepsilon_1$ and $\|g_1 - g_2\| \le \varepsilon_2$. One has
$$\|f_1 g_1 - f_2 g_2\|_\infty \le \|(f_1 - f_2) g_1\|_\infty + \|f_2 (g_1 - g_2)\|_\infty \le \|f_1 - f_2\|_\infty \|g_1\|_\infty + \|f_2\|_\infty \|g_1 - g_2\|_\infty \le A_n^2 \big[ \|f_1 - f_2\|\,\|g_1\| + \|f_2\|\,\|g_1 - g_2\| \big] \le A_n^2 (\varepsilon_1 + \varepsilon_2) \qquad (2.13)$$
and
$$\mathrm{Var}(f_1 g_1 - f_2 g_2) \le 2\,\mathrm{Var}[(f_1 - f_2) g_1] + 2\,\mathrm{Var}[f_2 (g_1 - g_2)] \le 2\|f_1 - f_2\|^2 \|g_1\|_\infty^2 + 2\|f_2\|_\infty^2 \|g_1 - g_2\|^2 \le 2 A_n^2 (\varepsilon_1^2 + \varepsilon_2^2).$$
Let $h = f_1 g_1 - f_2 g_2$. Then for any integer $r \ge 3$, one has
$$E|h - E(h)|^r = E\big\{ |h - E(h)|^2 |h - E(h)|^{r-2} \big\} \le E\big\{ |h - E(h)|^2 \big\}\, 2^{r-2} \|h\|_\infty^{r-2} \le 2^{r-2} \big[ A_n^2(\varepsilon_1 + \varepsilon_2) \big]^{r-2} E|h - E(h)|^2 \le r!\, \big[ A_n^2(\varepsilon_1 + \varepsilon_2) \big]^{r-2} E|h - E(h)|^2 = r!\, c^{r-2}\, E|h - E(h)|^2,$$
where $c = A_n^2(\varepsilon_1 + \varepsilon_2)$. Letting $m_2^2 = \max_{1 \le t \le n} E|h - E(h)|^2$ and $m_r^r = \max_{1 \le t \le n} E|h - E(h)|^r$, one observes that $m_2^2 \le 2 A_n^2(\varepsilon_1^2 + \varepsilon_2^2)$ and $m_r^r \le c^{r-2} r!\, m_2^2$. Consequently $25 m_2^2 + 5 c \varepsilon \asymp A_n^2$. Recall that $A_n \asymp N_n^{1/2}$, so that under Assumption (A3) one has $A_n^2/n = o(1)$. Therefore, for any integer $q$ between $1$ and $n/2$, Theorem 1.4 of Bosq (1998) gives
$$P\big((E_n - E)(h) > \varepsilon\big) \le 2\left( \frac{n}{q} + 1 + \frac{\varepsilon^2}{25 m_2^2 + 5c\varepsilon} \right) \exp\left( -\frac{q \varepsilon^2}{25 m_2^2 + 5c\varepsilon} \right) + 11 n \left( 1 + \frac{5\, m_r^{2r/(2r+1)}}{\varepsilon} \right) \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
$$\le 2\left( \frac{n}{q} + 2 \right) \exp\left( -\frac{q \varepsilon^2}{50 A_n^2(\varepsilon_1^2 + \varepsilon_2^2) + 5 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) + 11 n \left( 1 + \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \right) \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
$$\le \frac{4n}{q} \exp\left( -\frac{q \varepsilon^2}{50 A_n^2(\varepsilon_1^2 + \varepsilon_2^2) + 5 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) + 22 n \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
for $n$ large enough. Furthermore, by the convexity of the function $e^{-1/x}$, one has
$$P\big((E_n - E)(h) > \varepsilon\big) \le \frac{2n}{q} \left[ \exp\left( -\frac{q \varepsilon^2}{100 A_n^2(\varepsilon_1^2 + \varepsilon_2^2)} \right) + \exp\left( -\frac{q \varepsilon^2}{10 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) \right] + 22 n \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}. \qquad (2.14)$$
For any $g \in G^n_{UB}$, consider a sequence of subsets $\{g \equiv 0\} = \Im_0 \subset \Im_1 \subset \cdots \subset \Im_k \subset \Im_{k+1} \subset \cdots$ satisfying $\min_{g^* \in \Im_k} \|g - g^*\| \le \delta_k$, where $\delta_k = \frac{1}{3^k}$. Note that the cardinality of $\Im_k$ satisfies $\#(\Im_k) \le \left( \frac{1 + \delta_k/2}{\delta_k/2} \right)^{d J_n} \le 3^{(k+1) d J_n}$. Furthermore, for any given $t > 0$, choose $K$ to be the smallest nonnegative integer such that $\left(\frac{2}{3}\right)^K \le \frac{t}{4 A_n^2}$. Then for each $g \in G^n_{UB}$ one can find $g_K^* \in \Im_K$ such that $\|g - g_K^*\| \le \frac{1}{3^K}$. So for each fixed positive integer $k \le K$ and the corresponding $g_k^* \in \Im_k$, we can choose $g_{k-1}^* \in \Im_{k-1}$ such that $\|g_k^* - g_{k-1}^*\| \le \delta_{k-1} = \frac{1}{3^{k-1}}$. For any $f \in G^n_{UB}$, define $\{f_k^*\}_{k=0}^K$ in a similar way. From the definitions of $f_K^*$ and $g_K^*$, and (2.13), one has
$$\big|(E_n - E)(fg - f_K^* g_K^*)\big| < 2 \|fg - f_K^* g_K^*\|_\infty \le \frac{4 A_n^2}{3^K} \le \frac{t}{2^K}. \qquad (2.15)$$
Using (2.14) and (2.15), we now prove Lemma 2.2.1. First, the triangle inequality gives
$$\sup_{f, g \in G^n_{UB}} |(E_n - E)(fg)| \le \sup_{f, g \in G^n_{UB}} \big|(E_n - E)(fg - f_K^* g_K^*)\big| + \sum_{k=1}^K \sup_{f_k, g_k \in \Im_k} \big| (E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*) \big|.$$
Therefore, by (2.15), one has
$$P\left( \sup_{f,g \in G^n_{UB}} |(E_n - E)(fg)| > t \right) \le P\left( \sup_{f,g \in G^n_{UB}} \big|(E_n - E)(fg - f_K^* g_K^*)\big| > \frac{t}{2^K} \right) + \sum_{k=1}^K P\left( \sup_{f_k, g_k \in \Im_k} \big|(E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*)\big| > \frac{t}{2^k} \right)$$
$$\le \sum_{k=1}^{\infty} \#(\Im_k)^2 \sup_{f_k, g_k \in \Im_k} P\left( \big|(E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*)\big| > \frac{t}{2^k} \right) \qquad (2.16)$$
for $n$ sufficiently large. Plugging (2.14) with $\varepsilon = t/2^k$ and $\varepsilon_1 = \varepsilon_2 = 1/3^{k-1}$ into the last term above, and letting $q = n^{2/3}$ (recall that $A_n \asymp N_n^{1/2}$ and $J_n \asymp N_n$, so Assumption (A3) gives $A_n^2 J_n/q \asymp n^{2(\theta - 1/3)} \to 0$ as $n \to \infty$), one has
$$P\left( \sup_{f,g \in G^n_{UB}} |(E_n - E)(fg)| > t \right) \le \sum_{k=1}^{\infty} \left\{ \frac{4n}{q} \exp\left( -\frac{q\, t/2^{k}}{20 A_n^2/3^{k-1}} \right) + \frac{4n}{q} \exp\left( -\frac{q\, t^2/2^{2k}}{200 A_n^2/3^{2(k-1)}} \right) + C_3\, n A_n^{\frac{2(r-1)}{2r+1}}\, e^{-C_2 \frac{2r}{2r+1}\frac{n}{q}} \left(\frac{2}{3}\right)^{\frac{r(k-1)}{2r+1}} \right\} 3^{(k+1) d J_n},$$
where $C_3$ is a constant depending only on $r$, $t$ and $C_1$. Then, by $e^{-x} \le x^{-1} e^{-1}$ and $\theta < 1/3$, the contribution of the first two terms is bounded, for $n$ large enough, by
$$\sum_{k=1}^{\infty} \left\{ 240\, \frac{n A_n^2}{q^2 t} \left(\frac{2}{3}\right)^{k} + 7200\, \frac{n A_n^2}{q^2 t^2} \left(\frac{2}{3}\right)^{2k} \right\},$$
which converges to $0$ as $n \to \infty$, since $n A_n^2 / q^2 \asymp n^{\theta - 1/3} \to 0$.

Let $r = 3$. Since $J_n q / n \asymp n^{\theta - 1/3} \to 0$ as $n \to \infty$, and only indices $k \le K = O(\log n)$ actually appear in (2.16), for $n$ large enough the remaining contribution satisfies
$$\sum_{k} C_3\, n A_n^{4/7}\, e^{-C_2 \frac{6n}{7q}} \left(\frac{2}{3}\right)^{\frac{3(k-1)}{7}} 3^{(k+1) d J_n} = \sum_{k} C_3\, n A_n^{4/7} \exp\left\{ -C_2 \frac{6n}{7q} + d(k+1)(\log 3) J_n + \frac{3(k-1)}{7}\log\frac{2}{3} \right\}$$
$$\le C_3\, n A_n^{4/7} \exp\left\{ -\frac{C_2}{2}\,\frac{6n}{7q} \right\} \sum_{k=1}^{\infty} \left(\frac{2}{3}\right)^{\frac{3(k-1)}{7}},$$
which also goes to $0$ as $n \to \infty$. Therefore
$$\limsup_{n\to\infty} P\left( \sup_{f,g \in G^n_{UB}} \big| (f, g)_n - (f, g) \big| > t \right) = 0,$$
and taking $f = g$ yields (2.2).
Lemma 2.6.1. Under Assumption (A2), for any function $h = \sum_{l=1}^d h_l \in H$, there exists a constant $0 < C \le 1$ such that $C\sum_{l=1}^d \|h_l\| \le \|h\| \le \sum_{l=1}^d \|h_l\|$.
Proof. Under Assumption (A2), one can assume that there exist constants $0 < b \le B$ such that the density satisfies $b \le f_{X_t} \le B$ on $[0, 1]^d$. Let $W_l = (X_1, \ldots, X_l)$ and $S_l = h_1 + \cdots + h_l$ for $l = 1, \ldots, d$. We show by induction that, for each $l = 1, \ldots, d$,
$$\left( \frac{1 - \delta}{2} \right)^{\frac{l-1}{2}} \big( \|h_1\| + \cdots + \|h_l\| \big) \le \|S_l\|, \qquad (2.17)$$
where $\delta = \sqrt{1 - \frac{b}{B}}$. For $l = 1$, (2.17) is trivial. Suppose (2.17) is true for some $l < d$; we show that (2.17) holds for $l + 1$. When $\|h_{l+1}\| = 0$ or $\|S_l\| = 0$, since $H$ is theoretically identifiable, the claim is trivial. Therefore one may assume that $\|h_{l+1}\| > 0$ and $\|S_l\| > 0$. Denote $\rho = \mathrm{corr}\big(S_l, h_{l+1}(X_{l+1})\big)$. Then the part of the variance of $h_{l+1}(X_{l+1})$ that cannot be explained linearly by $S_l(W_l)$ can be written as
$$(1 - \rho^2)\|h_{l+1}\|^2 = \min_{\gamma} E\big[h_{l+1}(X_{l+1}) - \gamma S_l(W_l)\big]^2 = \min_{\gamma} \int_{[0,1]^l} \int_0^1 \big[h_{l+1}(x_{l+1}) - \gamma S_l(w_l)\big]^2 f_{W_l, X_{l+1}}(w_l, x_{l+1})\, dx_{l+1}\, dw_l$$
$$\ge \frac{b}{B} \min_{\gamma} \int_{[0,1]^l} \int_0^1 \big(h_{l+1}(x_{l+1}) - \gamma S_l(w_l)\big)^2 f_{X_{l+1}}(x_{l+1})\, dx_{l+1}\, dw_l = \frac{b}{B} \min_{\gamma} \int_{[0,1]^l} E\big[h_{l+1}^2(X_{l+1}) + \gamma^2 S_l^2(w_l)\big]\, dw_l \ge \frac{b}{B}\, \|h_{l+1}\|^2.$$
Therefore $1 - \rho^2 \ge \frac{b}{B}$, and hence $-\delta \le \rho \le \delta$. Consequently,
$$\|S_{l+1}\|^2 = \|S_l\|^2 + 2\rho\|S_l\|\,\|h_{l+1}\| + \|h_{l+1}\|^2 \ge \frac{1 + \rho}{2}\big( \|S_l\| + \|h_{l+1}\| \big)^2 \ge \frac{1 - \delta}{2}\left[ \left(\frac{1-\delta}{2}\right)^{\frac{l-1}{2}}\big(\|h_1\| + \cdots + \|h_l\|\big) + \|h_{l+1}\| \right]^2 \ge \left( \frac{1 - \delta}{2} \right)^{l} \big( \|h_1\| + \cdots + \|h_{l+1}\| \big)^2.$$
Proof of Lemma 2.2.2. Given $h = \sum_{l=1}^d h_l \in H$ with $\|h\| = 0$, the definition of the theoretical inner product implies that $h = 0$ a.s. Moreover, Lemma 2.6.1 gives $0 \le C\sum_{l=1}^d \|h_l\| \le \|h\| = 0$, so $\|h_l\| = 0$ for $l = 1, \ldots, d$.

Given $g \in G^n$ with $\|g\|_n = 0$, Lemma 2.2.1 entails that, except on a set whose probability tends to zero as $n \to \infty$,
$$\frac{1}{2}\|g\|^2 \le \|g\|_n^2 \le 2\|g\|^2, \qquad g \in G^n. \qquad (2.18)$$
Thus $\|g\| = 0$, and since $g \in G^n \subseteq H$, one has $g = 0$ a.s.

For the last identifiability claim, the properties of centered spline functions ensure that there exist points $x_l^*$ with $g_l(x_l^*) = 0$, $l \in \{1, \ldots, d\}$. Given $l_0$, since the joint density of $X$ is bounded away from $0$, one has $P(X_l = x_l^*,\ l \neq l_0) > 0$. Therefore,
$$P\big(g_{l_0}(X_{l_0}) = 0\big) = P\Big( \sum_{l \neq l_0} g_l(x_l^*) + g_{l_0}(X_{l_0}) = 0 \Big) \ge P\Big( \sum_{l=1}^d g_l(X_l) = 0 \Big) = 1.$$

2.6.4  Proof of Theorems
Proof of Theorem 2.2.1. We divide the norms of estimation error into two parts
by Triangular Inequality. To be specific, lµ̂ − µl � lµ̂ − µ̃l + lµ̃ − µl and lµ̂ − µln �
−(p+1)
lµ̂ − µ̃ln + lµ̃ − µln . Recall that ρn ≍ Nn
. Theorem 2.2.1 can be proved by showing
(i) lµ̂ − µ̃l2n = OP ( Nnn ), lµ̂ − µ̃l2 = OP ( Nnn );
(ii) lµ̃ − µl2n = OP (ρn ), lµ̃ − µl2 = OP (ρn ).
n
n
To prove (i), denote {φj }dJ
j=1 as a set of orthonormal basis of the additive space G
�dJn
with respect to the empirical inner product. Note that µ̂ − µ̃ = QY − Qµ = j=1
(Q(Y −
�dJn
µ), φj )n φj = j=1 (Y − µ, φj )n φj . Therefore, with εt = Yt − µ(Xt ), one has
(
E lµ̂ −
µ̃l2n
)
=
dJn
L
j=1
=
E(Y −
dJn
L
E
j=1
�
dJn
L
µ, φj )n2
1
n2
n
L
=
dJn
L
j=1
2
n
1L
E
εt φj (Xt )
n t=1
ε2t φ2j (Xt ) + E
t=1
1
n2
L
εs εt φj (Xs )φj (Xt )
1�s<t�n
(Ij1 + Ij2 ) .
j=1
For the first part,
Ij1 =
n
n
r}
r
1 L [ 2 2
1 L { [ 2
φ
(X
)
=
E E φj (Xt )εt2 |Xt
E
ε
t
t j
2
2
n
n
t=1
=
1
E
n2
t=1
n
L
φ2j (Xt )σ 2 =
t=1
σ2
n
.
As in Ij2 , given each pair of (s, t), one has
E [εs εt φj (Xs )φj (Xt )]
= E {E [εs εt φj (Xs )φj (Xt )|σ {X1 , . . . , Xt }]}
= E {εs φj (Xs )φj (Xt )E [εt |σ {X1 , . . . , Xt }]} = 0
27
since E [εt |σ {X1 , . . . , Xt }] = 0. Therefore lµ̂ − µ̃l2n = OP ( Jnn )
=
OP ( Nnn ). The inequality
(2.18) further gives that lµ̂ − µ̃l2 = OP ( Nnn ).
To prove (ii), Theorem 6 on Page 149 of de Boor (2001) entails that, for ev­
ery l = 1, . . . , d, there exists a constant C ∗ and a spline function gl∗ ∈ Gnl
satisfying
�d
∗
lgl∗ − µl l � C ∗ ρn . Denote g∗ =
l=1 gl . Then through Triangular Inequality, one
gets lg∗ − µl = OP (ρn ). Inequality (2.18) entails that lg∗ − µln = OP (ρn ). As we
mentioned before, µ̃ is the best approximation of µ in Gn with respect to the empiri­
cal norm. Thus one has lµ̃ − µln � lg∗ − µln = OP (ρn ). Furthermore, lµ̃ − g∗ ln �
lµ̃ − µln +lg∗ − µln = OP (ρn ). Again inequality (2.18) gives lµ̃ − g∗ l = OP (ρn ). Therefore lµ̃ − µl � lµ̃ − g ∗ l + lg∗ − µl = OP (ρn ).
Corollary 2.6.4.1. Denote µ̂ =
Then
�d
jl ,
l=1 µ
µ̃ =
�d
Jl ,
l=1 µ
where µ
jl , µ
Jl ∈ Gnl for l = 1, . . . , d,
J
J
(i) lµ
jl − µ
Jl l = OP ( Nn /n) and lµ
jl − µ
Jl ln = OP ( Nn /n);
Jl − µl ln = OP (ρn );
(ii) lµ
Jl − µl l = OP (ρn ) and lµ
J
J
jl − µl ln = OP ( Nn /n + ρn ).
Consequently, lµ
jl − µl l = OP ( Nn /n + ρn ) and lµ
�
Proof. Lemma 2.6.1 and conclusion (i) in Theorem 2.2.1 gives dl=1 lµ
jl − µ
Jl l =
J
�d
�d
OP ( Nn /n) and l=1 lJ
µl − µl l = OP (ρn ). Inequality (2.18) further proves l=1 lµ
jl − µ
Jl ln =
J
OP ( Nn /n). Since µ̃ is the best approximation of µ in G with respect to the empirical
norm, Lemma 2.2.2 entails that µ
Jl is the best approximation of µl in G with respect to
the empirical norm for l = 1, · · · , d as well. Recall gl∗ in proof of Theorem 2.2.1 (ii) such
�
�d
that lgl∗ − µl l = OP (ρn ), then by (2.18) one has dl=1 lµ
Jl − µl ln � l=1
lgl∗ − µl ln �
�
2 dl=1 lgl∗ − µl l = OP (ρn ).
28
Corollary 2.6.4.2. For µ = µ1 + · · · + µr , replace the approximation space Gn with G(0) ,
(0)
(0)
one has that the least squared estimator µ
j(0) = Q(0) Y = g1 + · · · + gr
µ̂(0) − µ
2
= OP (Nn /n + ρ2n ) and µ̂(0) − µ
2
n
= OP (Nn /n + ρ2n ).
Proof of Theorem 2.2.2. It is sufficient to prove µ
jP L − µ
µ
jPl L − µl
also satisfies
2
= OP (Nn /n + ρ2n ) and
2
= OP (Nn /n + ρn2 ) for l = 1, · · · , d. For µ = µ1 + · · · + µr , we have two
�
(0)
least-square estimators µ
j ∈ Gn and µ
j(0) = rl=1 µ
jl ∈ G(0) . Again from Theorem 6 on
�
Page 149 of de Boor (2001), for arbitrary positive C4 and any g ∗ = dl=1 gl∗ ∈ Gn such
J
that lg ∗ − µl = C4 Nn /n + ρ2n . Since pλn (·) ; 0, pλn (0) = 0, one has
∗
(0)
j
l(g ) − l(µ
) ;
=
1
2
1
2
(
(
lY −
g ∗ l2n
− Y −µ
j
j−µ
j(0)
lµ
j − g ∗ l2n − µ
= I + II
Denote ǫn = sup
g∈G
{
n sufficiently large,
2I
1g12n
1g12
2
(0)
n
2
n
)
)
+
+
r [
L
l=1
r [
L
l=1
(0)
pλn (lgl∗ ln ) − pλn ( µ
jl
n
(0)
jl
pλn (lgl∗ ln ) − pλn ( µ
n
]
)
]
)
}
− 1 . Lemma 2.2.1 tells that ǫn → 0 as n → ∞. Therefore, for
2
; lj
µ − g ∗ l2 (1 − ǫn ) − µ
j−µ
j
(0) (1 + ǫn )
(
)
2
2
(0)
∗ 2
(0)
∗ 2
= lj
µ−g l − µ
j−µ
j
− ǫn lj
j−µ
j
µ−g l + µ
(
)
2
1
(0)
∗ 2
;
lj
µ−g l − µ
j−µ
j
2
(
)
2
1
2
∗
(0)
∗
(0)
;
lg − µl − 2 lµ
j − µl lg − µl −
µ
j − µ + 2 lµ
j − µl µ
j −µ
2
J
2
(
)
1
=
C42 Nn /n + ρ2n − 2C4 Nn /n + ρ2n lµ
j − µl − µ
j(0) − µ
2
]
(0)
−2 lµ
j − µl µ
j −µ .
The last three terms are all of OP (Nn /n + ρ2n ) by Theorem 2.2.1 and Corollary 2.6.4.2.
Therefore by choosing C4 large enough, one can assure that the first positive term dom­
inates the rest three, indicating that I ; 0 as for n large enough. For II, consider
29
any l ∈ {1, · · · , r}.
One has that
(0)
(0)
µ
jl
(0)
; lµl l − µ
jl
− µl .
Lemma 2.6.1 im­
plies that µ
jl − µl � C −1 µ
j(0) − µ , and Corollary 2.6.4.2 gives that µ
j(0) − µ =
J
J
(0)
−(p+1)
jl
; lµl l − OP ( Nn /n + ρn2 ). Again with ρn ≍ Nn
OP ( Nn /n + ρ2n ). Thus µ
,
J
θ−1
(0)
Nn
2
2 ) = OP (1). So
µ
jl
; 21 lµl l ; aλn
Assumption (A3) indicates
n + ρn = oP (n
(0)
since λn → 0. Consequently, pλn ( µ
jl
n
)=
a+1 2
2 λn
for n large enough. Similarly, since
lgl∗ l ; lµl l − lgl∗ − µl l ; lµl l − C ∗ ρn , one gets pλn (lgl∗ ln ) =
a+1
2
2 λn
for n large enough.
Therefore, II → 0 as n → ∞. In all, l(g∗ ) − l(µ
j(0) ) ; I + II > 0. Since for any g∗
J
J
that lg∗ − µl = C4 Nn /n + ρ2n , one has l(g ∗ ) > l(µ
j(0) ) with µ
j(0) < C4 Nn /n + ρn2 for
some sufficiently large C4 , one can conclude that there exists a local minimizer µ
jP L of the
{
}
J
criterion function l(g) in the subset of G, g : lg − µl � C4 Nn /n + ρ2n . This further
assures that µ
jP L − µ
2
= OP (Nn /n + ρ2n ).
Proof of Theorem 2.2.3. As in the proof of Theorem 2.2.2, given l ∈ {r + 1, · · · , d},
J
for arbitrary gl ∈ Gnl such that lgl l = OP ( Nn /n + ρ2n ) and arbitrary g(0) ∈ G(0) such
J
that g(0) − µ = OP ( Nn /n + ρ2n ), one can see that,
l(g
(0)
) − l(g
(0)
+ gl ) =
=
(
1
Y − g (0)
2
(
1
µ
j − g (0)
2
2
n
2
n
− Y −g
(0)
2
− gl
− µ
j − g (0) − gl
n
2
n
)
)
− pλn (lgl ln )
− pλn (lgl ln ).
30
By Lemma 2.2.1, the empirical norm can be switched to the theoretical norm, which is
l(g
(0)
) − l(g
(0)
+ gl ) �
(
(0)
2
2
(0)
)
µ
j−g
− µ
j − g − gl
− pλn (lgl l)
)
(
� lgl l µ
j − g(0) + µ
j − g (0) − gl − pλn (lgl l)
(
) p (lg l)
1
λ
l
lgl l µ
j − g (0) + µ
j− g(0) − gl − n
= λn
λn
λn
= λn lgl l
p′ (ω)
j − g(0) + µ
j − g(0) − gl
µ
− λn
λn
λn
2 µ
j− g (0) + lgl l p′λn (ω)
� λn lgl l
−
λn
λn
(
)
2 lµ
j − µl + µ − g (0) + lgl l pλ′ n (ω)
� λn lgl l
−
λn
λn
� λn lgl l
Rn p′λn (ω)
−
,
λn
λn
where 0 � ω � lgl l. The last term is derived from Taylor Expansion.
(√
)
Nn /n+ρ2n
n
=
O
= OP (1).
Theorem 2.2.1 and restrictions on g (0) and gl give R
P
λn
λn
′ (ω)
pλ
n
λn , since
p′ (ω)
Therefore λλnn
For
any g(0) ∈ G(0)
ω � lgl l → 0 as n → ∞, one has that
Rn
λn ,
l(g (0) ) � l(g(0) + gl ) −
J
with g (0) − µ = OP ( Nn /n + ρ2n ),
dominates
′ (ω)
pλ
n
λn
λn 1gl 1
2
= 1 for n large enough.
< l(g(0) + gl ). That is, for
min√
gl ∈Gl ,1gl 1=OP (
�d
jlP L , whose
l=1 µ
l(g (0) + gl ) =
Nn /n+ρ2n )
l(g(0) ). Therefore, for the local minimizer µ
jP L =
existence is assured by
[ PL
r
Theorem 2.2.2, one has that limn→∞ P µ
jl = 0 = 1 for l = r + 1, · · · , d.
31
TABLE 2.1: Autoregressive models in the simulation study
Model
Function
AR1
Yt =0.5Yt−1 + 0.4Yt−2 + 0.1εt
AR2
Yt =−0.5Yt−1 + 0.4Yt−2 + 0.1εt
AR3
Yt =−0.5Yt−6 + 0.5Yt−10 + 0.1εt
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
} {
}
{
3 / 1 + (Y
4
2 )/(1 + Y 2 ) + 0.6 3 − (Y
Yt =−0.4(3 − Yt−1
t−2 − 0.5)
t−2 − 0.5)
t−1
+0.1εt
{
}
{
}
2 ) Y
2
Yt = 0.4 − 2 exp(−50Yt−6
t−6 + 0.5 − 0.5 exp(−50Yt−10 ) Yt−10 + 0.1εt
{
}
2 ) Y
Yt = 0.4 − 2 cos (40Yt−6 ) exp(−30Yt−6
t−6
{
}
2
+ 0.55 − 0.55 sin (40Yt−10 ) exp(−10Yt−10
) × Yt−10 + 0.1εt
2 )/(1 + Y 2 ) + 0.1ε
Yt =−0.4(3 − Yt−1
t
t−1
{
}
{
}
Yt =0.6 3 − (Yt−2 − 0.5)3 / 1 + (Yt−2 − 0.5)4 + 0.1εt
32
TABLE 2.2: Lag selection results for the simulation study. The columns of U, C, O
give respectively the percentages of under-fitting, correct-fitting and over-fitting over 500
replications.
Model
AR1
AR2
AR3
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
n
p=1
p=3
U
C
O
U
C
O
100
0.212
0.628
0.116
0.356
0.544
0.100
200
0.030
0.898
0.072
0.146
0.834
0.020
500
0
0.992
0.008
0
1
0
100
0.272
0.570
0.158
0.424
0.468
0.108
200
0.044
0.892
0.064
0.198
0.770
0.032
500
0
0.994
0.006
0
0.998
0.002
100
0.008
0.836
0.156
0.018
0.856
0.126
200
0
0.972
0.028
0
0.982
0.018
500
0
0.994
0.006
0
1
0
100
0.002
0.940
0.058
0
0.966
0.034
200
0
1
0
0
0.994
0.006
500
0
1
0
0
0.998
0.002
100
0.766
0.172
0.062
0.554
0.402
0.044
200
0.622
0.336
0.042
0.060
0.888
0.052
500
0.004
0.974
0.022
0
0.990
0.010
100
0.048
0.628
0.324
0.204
0.572
0.224
200
0
0.906
0.094
0.010
0.948
0.042
500
0
0.988
0.012
0
1
0
100
0
0.984
0.016
0.002
0.996
0.002
200
0
0.994
0.006
0
0.998
0.002
500
0
1
0
0
1
0
100
0.070
0.552
0.378
0.064
0.66
0.276
200
0.024
0.680
0.296
0.066
0.784
0.0150
500
0
0.920
0.080
0
0.962
0.038
33
TABLE 2.3: The Penalty columns give MISEs from the penalized polynomial spline
method. The Oracle and Full columns give MISEs of the polynomial spline estimation of
the oracle and full models respectively.
Model
AR1
AR2
AR3
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
n
p=1
p = 3
Penalty
Oracle
Full
Penalty
Oracle
Full
100
0.0138
0.0124
0.0168
0.0504
0.0016
1.1498
200
0.0024
0.0020
0.0047
0.0223
0.0053
0.1530
500
0.0015
0.0014
0.0028
0.0245
0.0021
0.1433
100
0.0029
0.0005
0.0085
0.0872
0.0004
2.7515
200
0.0015
0.0005
0.0073
0.0747
0.0007
0.9650
500
0.0046
0.0046
0.0061
0.0534
0.0042
0.3213
100
0.0031
0.0025
0.0086
0.2572
0.0007
2.6233
200
0.0007
0.0006
0.0024
0.1117
0.0009
0.6983
500
0.0010
0.0010
0.0019
0.0367
0.0004
0.2134
100
3.5657
3.3146
3.6709
3.7874
3.2259
7.7016
200
3.4851
3.3389
3.5019
3.5770
3.4044
5.3536
500
3.7110
3.6349
3.7129
3.8708
3.2838
4.8348
100
0.0044
0.0034
0.0075
0.0180
0.0031
1.5706
200
0.0051
0.0051
0.0077
0.2566
0.0044
0.8909
500
0.0333
0.0056
0.0349
0.2926
0.0030
0.5308
100
0.0119
0.0109
0.0161
0.2935
0.0150
6.8577
200
0.0112
0.0108
0.0144
0.5081
0.0265
3.8070
500
0.0086
0.0085
0.0096
0.0842
0.0064
0.4677
100
0.4995
0.4993
0.5103
0.7834
0.5043
6.7995
200
0.4690
0.4687
0.4779
0.5454
0.4929
1.2789
500
0.4643
0.4639
0.4685
0.5400
0.4991
0.8170
100
2.0327
2.0317
2.1543
4.9415
2.0724
18.9815
200
2.0348
2.0329
2.0869
5.5130
2.3109
9.9782
500
2.0424
2.0239
2.0626
5.9310
2.7549
7.1970
34
TABLE 2.4: Analysis result of the US unemployment data. The Lags column gives the
selected significant lags of the quarterly US unemployment data.
R2
MSEE
MAEE
MSPE
MAPE
{1, 2}
0.9544
0.0382
0.1478
0.0531
0.2022
Full
{1, · · · , 10}
0.9724
0.0231
0.1220
0.1759
0.3417
AR(2)
{1, 2}
0.8466
0.1314
0.3232
0.2246
0.4322
Model
Lags
Penalty
35
NLAR1
0.05
0.04
0.03
0.02
0.00
0.01
The empirical norm of the components
0.006
0.004
0.002
0.000
The empirical norm of the components
0.008
AR3
0.01
0.02
0.03
lambda
0.04
0.05
0.02
0.06
0.10
0.14
lambda
FIGURE 2.1: The empirical norms of all estimated additive components are plotted
against the tuning parameter λn . We simulated data from both linear AR3 model (left)
and a nonlinear NLAR1 model (right side) in Table 2.1 for one run with n = 500. The
location of the optimal parameter λ̂n selected by BIC are marked by the dashed line.
0.0
−0.5
0.0
−1.5
−1.0
−1.0
−0.5
Yt
0.5
0.5
1.0
36
0.0
0.5
1.0
1.5
0.0
0.5
1.5
Yt−2
(b)
−0.6
−0.6
−0.2
−0.2
Yt
0.2
0.2
0.6
0.6
Yt−1
(a)
1.0
−0.3 −0.2 −0.1
0.0
Yt−6
(c)
0.1
0.2
−0.3 −0.2 −0.1
0.0
0.1
0.2
Yt−10
(d)
FIGURE 2.2: The estimated relevant component functions for model NLAR1 ((a) and
(b)) and NLAR2 ((c) and (d)) using three approaches. Model NLAR1 is fitted in linear
spline space while model NLAR2 is fitted in cubic spline space. The dash-dotted lines
and the dotted lines represent polynomial spline estimation of the oracle and full models
respectively. The dashed lines represent the penalized polynomial spline estimators. The
true component functions are also plotted in solid lines.
37
3 CONSISTENT MODEL SELECTION IN ADDITIVE
COEFFICIENT MODELS WITH GLOBAL OPTIMALITY
3.1
The Model
Let {(Yi , Xi , Ti )}ni=1 be a sequence of i.i.d. random vectors, where Yi is a response
variable and Ti = (Ti1 , . . . , Tid1 )T and Xi = (Xi1 , . . . , Xid2 )T are explanatory variables.
The additive coefficient model in Xue and Yang (2006 a & b) assumes that
Yi = m(Xi , Ti ) + εi , i = 1, · · · , n.
where
m(Xi , Ti ) =
d1
L
l=1
α(Xi )Til , with α(Xi ) = αl0 +
(3.1)
d2
L
αls (Xis ).
s=1
Similar to the additive model, the coefficients functions αls are not uniquely identified
up to a constant. Therefore, for model identification, we assume E (αls (Xs )) = 0 for
l = 1, . . . , d1 , s = 1, . . . , d2 .
Estimation of model (3.1) was studied in Xue and Yang (2006a), Xue and Yang
(2006b), and Liu and Yang (2010). In this paper, we are particularly interested in model
selection in the additive coefficient model. With the advance of technology, one is able to
collect massive data with high dimensions. However, in many applications, only a small
set of those variables are relevant. In our additive coefficient model, it is possible that
only some of the additive coefficients components αls (·) are relevant. We say, αls (·) is
irrelevant if αls (xs ) = 0 with probability one for s = 0, or αls = 0 for s = 0. Otherwise,
we say αls (·) is relevant. To distinguish these two kinds of terms, we define a subset S (0)
of the full index set S = {(l, s) : l = 1, · · · , d1 , s = 1, · · · , d2 } to identify all nonzero terms.
That is, for any index pair (l, s) ∈
/ S (0) , we have P (αls (Xs ) = 0) = 1 for s = 1 and αls = 0
for s = 0. Here Xs is a random variable having the same distribution as Xis , i = 1, · · · , n.
38
3.2
Penalized Polynomial Spline Estimation
Our method utilizes the polynomial spline functions to approximate unknown coeffi­
cient functions. Similarly with Section 2.2, or each s = 1, . . . , d2 , let ks,n = {0 = νs,0 < νs,1
}
< · · · < νs,Nn < νs,Nn+1 = 1 be a knot sequence on [0, 1]. For some integer p ≥ 0, a poly­
nomial spline with degree p on knot sequences ks,n are functions that are polynomials of
degree p or less on each of the intervals [νs,i , νs,i+1 ), i = 0, . . . , Nn − 1 and [νs,Nn , νs,Nn +1 ],
and are p − 1-times continuously differentiable on [0, 1]. Let ϕs = ϕ (ks,n , p) the space of
such polynomial spline functions. The success of the polynomial spline functions is that
it often provides good approximations to smooth functions with a small number of knots.
To consistently estimate the centered coefficient functions αl,s , we define a subspace of ϕs which consists of empirically centered polynomial spline functions. Let ϕ0s =
�
{g ∈ ϕs , En (g) = 0}, where En (g) = ni=1 g (Xis ). Let Jn = N +p and Bs = (Bs,1 , · · · , Bs,Jn )
be a set of basis of ϕ0s . For example, one can set Bs,j = bs,j − En (bs,j ) for j = 1, · · · , Jn ,
n
and {bs,j }Jj=1
be the truncated power basis
{
}
p
p
, . . . , (xs − νs,Nn )+
,
xs , . . . , xps , (xs − νs,1 )+
with (x)p+ = (x+ )p . Let B = {1, BT1 , · · · , BTd2 }T be the basis of the additive spline function.
One can approximate the regression function in (3.1) by
m (x, t) ≈
d1
L
γl0 +
d2
L
gls (xs ) tl =
s=1
l=1
T
d1
L
γl0 +
d2
L
T
γls
Bs (xs ) tl ,
s=1
l=1
where γls = (γls,1 , . . . , γls,Jn ) , and Bs (xs ) = (Bs1 (xs ) , . . . , BsJn (xs ))T . The standard
polynomial spline method in Xue and Yang (2006b) minimizes the sum of squares to
estimate the unknown coefficients γ = {γl0 , γls , 1 ≤ l ≤ d1 , 1 ≤ s ≤ d2 },
γ = argmin
J
γ
= argmin
γ
n
L
i=1
n
L
i=1
Yi −
Yi −
d1
L
l=1
d1
L
l=1
γl0 +
d2
L
T
γls
Bs (Xis )
2
Til
s=1
γl0 Zi,l0 +
d2
L
s=1
2
T
Zi,ls
γls
,
(3.2)
39
where Zi,l0 = Til and for s > 0, Zi,ls = Bs (Xis ) Til = (Bs1 (Xis )Til , · · · , BsJn (Xis )Til )T .
Then the resulting estimator of the coefficient functions are given by
T
Jls
α
Bs (xs ) =
J
ls (xs )
=
γ
Jn
L
j=1
Jls,j Bsj (xs )
.
γ
Xue and Yang (2006b) established that the standard polynomial spline estimator is con­
sistent and converges to the true function in the optimal L2 rate of convergence. However,
when there exist redundant terms in (3.1), the standard polynomial spline method is unable to produce a parsimonious estimate and deteriorates the estimation accuracies of the
nonzero coefficient functions. Therefore, in this paper, we propose a penalized polynomial
method for model selection in the additive coefficient model. We consider


2
d2
d1 L
d2
d1
n
 1
L
)
(
L
L
L
T
γ̂ = argmin
Yi −
γl0 Zi,l0 +
γls
Zi,ls
+
pλn lγls lWls
,


2n
γ
s=1
s=0
i=1
l=1
l=1
(3.3)
J
�n
T
T W γ , with W
=
γls
ls ls
ls =
i=1 Zi,ls Zi,ls /n. Therefore, for s > 0,
where lγls lWls
�
n
�
lγls lWls =
(gs (Xis ) Til )2 /n is the empirical norm of gs (xs ) tl . In (3.3), pλn (·) is
i=1
a penalty function depending on the tunning parameter λn . Although different penalty
functions can potentially be used, we focus on the smoothly clipped absolute deviance
(SCAD) penalty in Fan and Li (2001), which has the form



0 � |β| < λn ;
λn |β|



2
2
pλn (|β|)
=
aλn (|β|−λn )−(|β| −λn )/2 λn � |β| < aλn ;
a−1




(a−1)λ2n

+ λ2n
|β| ; aλn .
2
As pointed in Fan and Li (2001), the non-convex penalty function SCAD penalty achieves
three desirable properties of model selection: unbiasedness, sparsity and continuity. Model
selection result crucially depends on the choice of λn . Generally a larger λn shrinks more
additive coefficient components to zero, and results in a more parsimonious model. The
selection of λn will be discussed in Section 3.4 in details. Following Fan and Li (2001), we
set a = 3.7.
40
3.3
Optimal Properties
In this section we discuss the asymptotic properties of the global solution of the
penalized polynomial estimator with a non-convex SCAD penalty. When d1 = 1 and
T = 1, Model (3.1) reduces to an additive model. For additive models, Xue (2009)
established that there exist a sequence of local minimizer that correctly shrink the zero
function components to zero with probability approaching to one and estimate the nonzero
components at the same L2 rate as the standard polynomial spline. Xue, Qu and Zhou
(2010) and Jiang and Xue (2013) extended this local asymptotic result to (generalized)
additive models for longitudinal data and weekly dependent time series data respectively.
However, the penalized polynomial spline (PPS) estimator is the global optimal solution
in (3.3) by the definition. Therefore, there is still a gap between the theory of the local
optimality and the definition of PPS estimator.
In Xue and Qu (2012), the global optimal property of a penalized estimator was
established for the varying coefficient models, which also is a special case of the additive
coefficient model with d2 = 1. However the truncated L1 -penalty (TLP) was used instead
of the SCAD penalty. Different from SCAD, the TLP is a piecewise linear non-convex
penalty. The required techniques to establish the global optimal of SCAD is very different
from those for TLP in Xue and Qu (2012). For SCAD, the asymptotic property of the
global optimality was proved in the linear regression in Kim, Choi, Oh (2008). In this
paper, we extend the global optimality of SCAD estimator for a semi-parametric additive
coefficient model.
To prove our theoretical results, we need the following assumptions.
(C.1) There exists a positive constant c such that minl,s∈S (0) lαls l2 ; c.
(C.2) The tuning variables X = (X1 , · · · , Xd2 )T are compactly supported. Without the
loss of generality, we assume that its support is [0, 1]d2 . Furthermore, we assume the
41
density function of X is bounded away from 0 and infinity on its support.
r
[
(C.3) The eigenvalues of E TTT |X = x are bounded away from 0 and infinity uniformly
for all x ∈ [0, 1]d2 .
(C.4) There exists a constant c > 0 such that |Tl | < c with probability 1 for l = 1, · · · , d1 .
−(p+1)
(C.5) The tuning parameter λn satisfies (i) limn→∞
Nn
λn
= 0, (ii) limn→∞
log(Nn )
nλ2n
= 0,
(iii) limn→∞ λn = 0.
(C.6) For each s = 1, · · · , d2 , the set of interior knots ks,n = {0 = xs,0 < xs,1 < · · · < xs,Nn
< xs,Nn +1 = 1} satisfies, for a constant c,
max(xs,j+1 − xs,j , j = 0, · · · , Nn )
� c.
min(xs,j+1 − xs,j , j = 0, · · · , Nn )
Assumptions (C.1)-(C.6) are commonly used assumptions in literatures on polyno­
mial splines and variable selection. Similar conditions as (C.1)-(C.6) can found in Huang,
Wu and Zhou (2002), Xue and Yang (2006b), Xue, Qu and Zhou (2010), Xue and Qu
(2012). Assumption (C.4) can be relaxed for Tl to have a support on the entire real line.
The proofs of Theorems 1 and 2 can go through if the tail probability of each Tl vanishes
sufficiently fast. We impose assumption (C.4) only for simplicity of proof.
We define an oracle model as a sub-model of (3.1) which contains exactly those
nonzero additive coefficient components. Let S (0) be the index set of the oracle model. Let
the oracle estimator γ
j(0) be the standard polynomial spline estimation of spline coefficients
{
}
(0)
(0)
of the oracle model. That is, γ
jls = 0 for (l, s) ∈S (0) , and γ
jls , (l, s) ∈ S (0) minimizes
(3.2) with an oracle model.
Theorem 3.3.1. (Local Optimality) Under Assumptions (C.1) to (C.6), the oracle esti­
( (0)
)
mator γ
j(0) is a local minimizer with probability tending to 1. That is, P γ
j ∈ An (λn ) →
1 as n → ∞, where An (λn ) is the set of local minima of (3.3).
42
Theorem 3.3.2. (Global Optimality) Let j
γ = (γ
jls , l = 1, · · · , d1 , s = 1, · · · , d2 ) be the
global minima of (3.3). Under Assumptions (C.1) to (C.6), the estimator by minimizing
)
(
(3.3) enjoys the oracle property, that is P γ
j=γ
j(0) → 1 as n → ∞.
Remark 1. Rather than stating only the existence of a sequence of consistent local
minima of (3.3), Theorem 3.3.1 points out that with probability converging to one, the
oracle estimator is one of such sequence of local minima that is consistent. It is a stronger
conclusion than Theorem 2 in Xue (2009).
Remark 2. Theorem 3.3.2 further concludes the oracle estimator is not only a local
minima, but also the global minima of (3.3) with probability converging to one. Therefore,
the global solution of (3.3) enjoys the orale properties. That is, it can correctly identify
the non-zero terms of the true model and estimate the non-zero coefficients as well as if
the true model was known for a large sample size.
3.4
Implementation
In this section we extend the local linear approximation (LLA) algorithm proposed
by Zou and Li (2008) for linear regression models to our semi-parametric additive coef­
ficient models. In the LLA algorithm, a non-convex penalty function is approximated
locally by a linear function. Then one iteratively solves a sequence of upper convex ap­
proximation of the non-convex objective function for the final solution. Zou and Li (2008)
pointed that LLA not only provides better approximation of the original non-convex ob­
jective function, it is also more numerically stable and computationally efficient, compared
with other algorithms such as those based on the local quadratic approximation. In this
section, we also discussed the choices for tuning parameters involved in the proposed
method.
43
3.4.1
The local linear approximation algorithm
Solving the optimization problem in (3.3) with a given tuning parameter λn is
challenging. The SCAD penalty function is non-convex and is singular at the origin.
The traditional Newton-Raphson algorithm can not be applied directly to solve (3.3).
For linear models, Zou and Li (2008) developed a unified algorithm through local linear
approximation(LLA) to solve non-convex penalized problems. Inspired by Zou and Li
(2008), we extend the LLA algorithm to our semi-parametric additive coefficient models
as follows.
We first approximate the non-convex SCAD penalty locally by a linear function.
Specifically, given λn and an initial point γ 0 , the SCAD penalty function in (3.3) can be
approximated as
)
(
(
0
pλn lγls lWls ≈ pλn γls
)
+ p′λn
(
0
γls
)(
0
lγls lWls − γls
)
.
(3.4)
By removing all the constants irrelevant to γ, we have a one-step approximation


2
d1 L
d2
d1 L
d2
 1

L
L
Y−
Zls (xs , tl )γ ls +
λn,l lγls lWls .
γ (1) = argmin

γ∈Rd1 d2 Jn  2n
(3.5)
Wls
l=1 s=1
Wls
2
Wls
l=1 s=0
Therefore, the original optimization in (3.3) reduces to a group LASSO (Yuan and Lin,
(
)
0
2006) with component specific tuning λn,l = p′λn γls
. We can then use the
W
ls
coordinate-wise descent(CWD) algorithm (Yuan and Lin 2006) to get γ (1) by iteratively
applying

γls = 1 −
Here Sls = ZTls Y −
√ ′ ( 0
npλn γls
lSls l2
�
Wls
)
 Sls for l = 1, · · · , d1 , s = 0, · · · , d2 .
+
Zl′ s′ γl′ s′ .
(l′ ,s′ )i=(l,s)
The LLA is an efficient optimization algorithm. One does not need to apply the
local linear approximation (3.4) iteratively to get a closer and closer approximation, as
required by local quadratic approximation (LQA) algorithm (Fan and Li 2001). It was
44
shown in Zou and Li (2008) that, with a good initial point γ 0 , the one step estimator γ (1)
enjoys the same asymptotic property as the fully iterated solution. Therefore as in Zou
and Li (2008), we use the one-step LLA algorithm to reduce the computational complexity.
For any give λn , we only calculate the one-step approximation in (3.5) and use γ (1) as the
final solution. Our simulation study indicates that the one-step LLA works reasonably
well.
3.4.2
Selection of tuning parameters
In this subsection we discuss the selection of tuning parameters involved in our
estimation. There are two kinds of tuning parameters, those determine the polynomial
spline space, and those involved in the SCAD penalty function.
According to the theory of polynomial spline approximation, a polynomial spline
space is determined by (i) the degree p restricting the maximal degree of polynomials
within the space, and (ii) the set of interior nodes ks,n = {0 = νs,0 < νs,1 < · · · < νs,Nn
}
< νs,Nn+1 = 1 . The degree p determines the smoothness of the estimated curves. In
most applications, cubic polynomial splines with p = 3 often provide sufficiently smooth
fit. Therefore, for simplicity, we approximate the true model using cubic polynomial spline
space with p = 3 in the examples given in sections 3.5 and 3.6. The selection of interior
nodes is critical to the quality of the polynomial spline approximation. Following Huang,
Wu and Zhou (2002), Xue, Qu and Zhou (2010) and Jiang and Xue (2013), we use the
interior nodes that are equally spaced within the support of each Xs and select only the
number of interior nodes Nn using the Bayesian Information Criterion (BIC). That is, for
each given Nn , the unpenalized estimator from (3.2) can be calculated. Denote the RSS
as the corresponding residual sum of squares, and kn = (Nn + p)d1 d2 + d1 as the total
number of parameters in (3.2). Then the BIC is defined as
BIC (Nn ) = n ln(RSS) + kn ln(n).
45
We select the optimal N̂n,opt which has the smallest BIC value.
The SCAD penalty function involves two tuning parameters a and λn . For simplicity,
we set a = 3.7 as in Fan and Li (2001). However, the tuning parameter λn crucially
determines our model selection results. One can observed that a larger value of λn results
in a simpler model with more zero additive components in the estimated coefficients. In
particular, with λn = 0, the penalized estimator in (3.3) reduces to the unpenalized one
in (3.2). On the other hand, an improper large λn can mistakenly shrink all coefficients
to zero. Following Xue (2009) and Xue, Qu and Zhou (2010), we use the BIC to select
the optimal λn from the interval [0, λn,max ], where λn,max shrinks all components to zero.
That is, let j
γλn be the solution of the penalized polynomial spline (3.3) for a given λn .
jn,opt is defined by
Then the optimal λ
jn,opt =
λ
argmin
λn ∈[0,λn,max ]
{BIC(γ
jλn )} =
argmin
λn ∈[0,λn,max ]
where kn,;γλn is the number of nonzero terms in γ
jλn .
3.5
{
}
n ln(RSS;γλn ) + kn,;γλn ln(n) ,
Simulation studies
We studied the numerical performance of our proposed method through simulations
in both low and high dimensional cases. We are mostly interested in validating the model
selection consistency and the estimation accuracy through simulations.
We evaluated the model selection results by comparing additive terms selected by our
penalized method with those contained in the true model. Recall that S is the index pair
jls Bs l
set of all additive terms in the full model. We further define Sj = {(l, s)| lα
jls l = lγ
= 0, (l, s) ∈ S} as the index pair set of nonzero additive terms in our penalized estimator.
We say that it is an overfitting if S0 ⊆ Ŝ, a correct fitting if S0 = Ŝ, and an underfitting
otherwise.
We introduce the averaged integrated squared error (AISE) to evaluate the esti­
46
mation accuracy of the coefficient functions. Suppose for each αls (·), the estimator from
the k-th generated sample is α̂k,ls (·) for k = 1, . . . , nrep. Here nrep is total number of
n
grid
replications. Then given a set of grid points {xm }m=1
, the integrated squared error (ISE)
is given by
and AISE(α
jls ) =
ISE(α
jk,ls ) =
1
nrep
�nrep
k=1
1
ngrid
ngrid
L
m=1
ISE(α
jk,ls ).
{α
jk,ls (xm ) − αls (xm )}2 ,
In both simulation studies, three estimators are considered: the proposed estimator
using SCAD penalty (SCAD), the least squared estimators of the oracle model (ORACLE)
and the full model (FULL) respectively. The oracle model contains only nonzero additive
terms in the true model. The oracle estimator is not available in real data analysis where
the true model is unavailable. In our simulation study, we use the oracle estimator as a
benchmark to evaluate the estimation accuracies of other estimators.
3.5.1
Example 1: Low dimensional case
We generated 100 samples of size n = 100, 250, 500 respectively from the model
Y =
d1
L
l=1
αl0 +
d2
L
αls (Xs ) Tl + ε,
(3.6)
s=1
where d1 = 8, d2 = 2, α10 = 2, α20 = 1. In the model, there are only three rele­
vant coefficient functions α11 (x) = α21 (x) = sin(x), α12 (x) = x, and the rest coeffi­
cient functions are all zero. The true model index S0 = {(1, 0), (1, 1), (1, 2), (2, 0), (2, 1)},
and the full model index S = {(l, s), l = 1, . . . , 8, s = 0, 1, 2}. The explanatory variables
{Xi = (Xi1 , . . . , Xid2 )}ni=1 are uniformly distributed on [−π, π]d2 . The linear covariates
{
}n
Ti = (Ti1 , . . . , Tid1 )T i=1 have an i.i.d standard d1 dimensional multivariate normal dis­
tribution and the errors {εi }ni=1 follow i.i.d N (0, 1). Here {Xi }ni=1 , {Ti }ni=1 , and {εi }ni=1
are mutually independent.
Tables 3.1 and 3.2 report the performance of our penalized method on model selec­
tion and model estimation respectively. Table 3.1 clearly shows that as the sample size
47
increases, the rate of correct fitting increases. It reaches 100% when the sample size in­
creases to 250, showing that our method is consistent in model selection. In Table 3.2, we
reported the AISEs to compare the accuracy of SCAD, ORACLE and FULL in estimating
coefficient functions. To assess the estimation accuracy for the two interceptors α10 = 2
and α20 = 1, we report their empirical means and empirical standard errors from 100
replications. One can see that, as the sample size gets larger, AISEs for each estimator
get smaller. Furthermore, Table 3.2 also shows that the estimation accuracy of SCAD
estimator is almost as good as ORACLE, and performs better than FULL. Therefore, the
proposed SCAD method not only provides a parsimonious model which often is easier to
interpret, but also gives more accurate estimate of the model than the FULL. Table 3.1 and
3.2 support the asymptotic results given in in Section 3.3. Figure 3.1 plots the estimated
coefficient functions by SCAD from all 100 replications for sample size n = 100, 250 and
500 respectively. It also plots the typically estimated coefficient functions (whose ISE is
the median of the 100 ISEs from the replications) for different sample sizes. It graphically
verifies the results in Section 3.3.
3.5.2
Example 2: High dimensional case with intercept.
In this example we consider a similar model as (3.6) in Example 1, but with d1 = 50
and d2 = 2. It is a model with much higher dimension of linear covariates. But as
in Example 1, we only three nonzero coefficient functions α11 (x) = α21 (x) = sin(x),
α12 (x) = x. That is, only the first two linear covariates are relevant and the rest linear
covarites are redundant. For simplification, we consider αl0 = 0 for l = 1, . . . , 50. We
generated 100 samples of size 250 from the model.
Note that in such a high dimensional case, the least square estimator for the full
model is not feasible. Instead, we used zero as the initial value when applying the LLA
algorithm in Section 3.4 when find SCAD estimator. The SCAD method selects the
correct model 77 times out of the 100 replications, selects an overfitted model 23 times
48
and no underfit. Therefore, no important covariate is missed by the SCAD procedure. It
indicates that our method performances reasonably well even in high dimensional case.
Table 3.3 reports AISEs of SCAD estimators and the Oracle estimators of the three
nonzero coefficient functions. Figure 3.2 plots the typical fitted curves. It shows that the
SCAD estimates the nonzero coefficient functions reasonably well. But the SCAD shrinks
the nonzero function a little bit towards zero due to larger tuning parameter needed in
the high dimensional case and the fact that we started from a much worse starting point
for LLA algorithm when the FULL estimator is not available.
3.6
Real data analysis
In this section we analyze the Tucson housing price data in (Fik et al. 2003) using
the proposed penalized estimation method. Tucson housing data contains information of
the 2971 geocoded housing units sold during year 1998 in selected districts of Tucson,
Arizona. Fik et al. (2003) analyzed this data by performing their interactive variable
approach to explain the variation of housing price of an urban residential housing market.
Six variables describing the houses are considered,
• x: the latitude coordinate of the house referenced to the southern most house in
record.
• y: the longitude coordinate of the house referenced to the western most house in
record.
• AGE: the age of dwelling in years.
• LOT: the lot size of the house.
• SQFT: the square footage of the house.
• PRICE: the price at which the house was sold.
49
Fik et al. (2003) considered a linear regression model with the logarithm of PRICE
as the response and the rest variables with possible polynomial interactions as explana­
tory variables. In Fik et al. (2003), eleven polynomial interactive effects between ex­
planatory variables were tested to be significant. They are: AGE, AGE2 , LOT2 , SQFT2 ,
AGE∗SQFT, LOT∗x, LOT∗y 2 , LOT∗y 3 , SQFT∗x2 , SQFT∗y 3 , SQFT∗x2 ∗ y. Except for
the three-way interaction SQFT∗x2 ∗ y, all other interactions can be absorbed by the
following additive coefficient model:
log(PRICE) = [α10 + α11 (x) + α12 (y) + α13 (AGE)]
+ [α20 + α21 (x) + α22 (y) + α23 (AGE)] SQFT
+ [α30 + α31 (x) + α32 (y) + α33 (AGE)] SQFT2
+ [α40 + α41 (x) + α42 (y) + α43 (AGE)] LOT
+ [α50 + α51 (x) + α52 (y) + α53 (AGE)] LOT2 .
(3.7)
We consider (3.7) as the full model to start variable selection in the penalized estimation.
Therefore, in the notation of model (3.1), the linear covariates T consists of five compo­
nents: the constant 1, SQFT, SQFT2 , LOT, and LOT2 , and the covariates X consists of
x, y, and AGE. The model (3.7) effectively describes the spatial and temporal patterns of
the linear regression coefficients.
The proposed penalized polynomial spline with SCAD penalty (SCAD) is applied
and the tunning parameters are selected by the BIC as described in Section 3.4. For
comparison, we also consider a standard polynomial estimation of the full model (3.7) and
the parametric model in Fik et al (2003) of form,
log(PRICE) = β1 AGE + β2 AGE2 + β3 LOT2 + β4 SQFT2 + β5 AGE ∗ SQFT
+ β6 x2 ∗ SQFT + β7 x3 ∗ SQFT + β8 y 3 ∗ SQFT + β9 x ∗ LOT
+ β10 y 2 ∗ LOT + β11 y 3 ∗ LOT + β12 x2 ∗ y ∗ SQFT.
50
In all of our analyses, the variables are centered by the sample means and scaled
by the sample standard deviations. We exclude 29 data points with extreme values of
housing prices that had standard deviations four times larger than the mean response.
We only consider the remaining 2942 data points. We then randomly select 2059 points
for model estimation, and use the rest 883 data points for prediction. For each method,
we report the R2 statistics and the mean absolute estimation error (MAEE) to quantify
its estimation accuracy. For prediction, in addition to the mean absolute prediction error
(MAPE), we also calculate the percentage of “good” predicted prices. Here we define a
“good” predicted price as the predicted price within 10% of the actual price. For each
data point i, denote the actual price as Pi , the estimated or predicted value as Pji , and
the mean of the first 2059 Pi as P . Then R2 , MAEE and MAPE are calculated by
2
R =1−
2059
�
(Pi − Pji )2
i=1
2059
�
i=1
(Pi − P )2
, MAEE =
2059
2942
1 L
1 L
Pi − Pji .
Pi − Pji , and MAPE =
2059
883
i=1
i=2060
We carried out the whole process for 10 times with a random permutation on all
2942 data points at each time. The averaged model size selected by SCAD is 7.8 for the
10 replicates, which is much smaller than the size of 20 for the full model. The additive
terms α10 , α11 (x), α12 (y), α13 (AGE), and α21 (x)∗SQFT in (3.7) were always selected; and
α41 (x) ∗ LOT was selected 8 out of 10 times and α31 (x) ∗ SQFT2 was selected 5 times. The
fitted curves of the SCAD estimator for all additive coefficient components from one run
are plotted in Figure 3.3. Corresponding fitted curves of the unpenalized estimator under
the full model are also plotted for comparison. This model selection result is consistent
with the findings in Fik et al. (2003). It indicates that the absolute location has an
unique effect on the price of a housing unit. In particular, the plots of α11 (x), α12 (y),
and α21 (x) indicate that the housing price generally decreases from south to north, and
increases from west to east. Furthermore, houses from south are more sensitive to effect
of the square footage. For each unit of decreasing in SQFT, the housing price is likely to
51
decrease more in south town. And the plot α13 (AGE) also indicates that a newer house
is more expensive than a older one with similar condition.
Table 3.4 reports the averaged R2 , MAEE, MAPE and the percentage of ‘good”
predicted price over 10 replicates for SCAD, FULL and parametric model respectively.
From Table 3.4, FULL gives a slightly better estimation than SCAD at the expense
of a more complex model structure. However for prediction, SCAD outperforms FULL.
Therefore, SCAD not only gives a simpler and more interpretable model, but also improves
prediction accuracy. Furthermore, the parametric model gives the worst results in both
estimation and prediction, indicating that the data contains a nonlinear structure that can
not be fully explained by the parametric model. Figure 3.4 plots the randomly selected
actual prices against corresponding predicted prices under three models separately, with
criteria band enclosing “good” estimates. Again, it suggests both penalize model and full
model do much better than the parametric model.
3.7
Proof of Lemmas and Theorems
(0)
Based on the oracle estimator γ
j(0) = (γ
jls , l = 1, · · · , d1 , s = 0, · · · , d2 ) in Section
(0)
(0)
(0)
3.3, we further define α
jls for notation convenience. For (l, 0) ∈ S (0) , α
jl0 = γ
jl0 . For
(0)
(0)
s > 0 and (l, s) ∈ S (0) , α
jls = γ
jls Bs . Here Bs is the vector of the empirically centered
(0)
spline basis on xs defined in Section 3.2. For (l, s) ∈
/ S (0) , α
jls = 0.
For any square matrix U, we denote ρmin (U) and ρmax (U) as the minimal and
maximal eigenvalues of U respectively. For notation simplicity, we use the same c, c1 , c2
as general notations for positive constants with not necessarily the same value.
3.7.1
Preliminary Lemmas
Lemma 3.7.1. Under Assumptions (C.3) - (C.4), for each pair of (l, s) ∈ S, the eigen­
values of Wls are bounded by two positive constants with probability approaching to 1.
52
That is, let ρmin (Wls ) and ρmax (Wls ) be the minimal and maximal eigenvalues of Wls
respectively. Then there exist 0 < c1 < c2 , such that
P (c1 � ρmin (Wls ) � ρmax (Wls ) � c2 ) → 1 as n → ∞.
(3.8)
Proof. From Lemma 5 of Xue and Qu (2012), we know that (3.8) is true for
( T )
t t
(l, s) ∈ S and s = 0. Now, for s = 0, ρ (Wl0 ) = ρ ln l =
lTl l2n .
Therefore, according to
Assumption (C.4), one has that lTl l2n � c2 .
Furthermore, the Central Limit Theorem gives
( )
that, lTl l2n ; E Tl2 − oP (1) ; Var(Tl ). Therefore, by letting c1 = minl=1,··· ,d1 Var(Tl )
and c2 = c2 , Lemma 3.7.1 is proved.
Lemma 3.7.2. Let Z = (Zls , (l, s) ∈ S), W =
ZT Z
n ,
then under Assumptions (C.3) ­
(C.4), the eigenvalues of W are bounded within two positive constants. That is, there
exist c2 > c1 > 0 such that, except on an event whose probability tends to zero, as n → ∞,
c1 � ρmin (W) � ρmax (W) � c2 .
(3.9)
T , (l, s) ∈ S)T ∈ Rd1 (1+d2 Jn ) , one has
Proof. For any coefficients γ = (γls
 2

d1
d2 L
Jn
L
L
γl0 +
γ T Wγ =
γls,j Bs,j Tl 
.
l=1
s=1 j=1
n
Lemma A.5 in Xue and Yang (2006b) ensures that there exists a positive constants c1 such
d1
d2 �
Jn
�
�
2 +
2
γl0
γls,j
= c1 lγl2 . On the other hand, Cauchy-Schwarz
that γ T Wγ ; c1
s=1 j=1
l=1
inequality gives that there exists a constant c > 0 such that
 2

d1
d2 L
d1
d2
Jn
T
T
L
L
L
L
T Zl0 Zl0
T Zls Zls
γl0 Tl +
γ T Wγ
=
γls,j Bs,j Tl 
� c
γl0
γls
γl0 +
γls
n
n
l=1
s=1 j=1
n
l=1
.
s=1
Therefore, from Lemma 3.7.1, there exists a positive c2 such that γ T Wγ � c2 lγl2 .
As a consequence of Lemmas 3.7.1 and 3.7.2, one has the following corollary.
Corollary 3.7.1.1. Let A be a subset of the full index pair set S. Denote ZA = (Zls , (l, s)
∈ A) and WA =
ZT
A ZA
n ,
then under Assumption (C.3) - (C.4), there exist two positive
53
constants c1 , c2 , such that
P (c1 � ρmin (WA ) � ρmax (WA ) � c2 ) → 1 as n → +∞.
Lemma 3.7.3. Under Assumptions (C.3) - (C.4) and (C.6), each additive term of oracle
(0)
estimators, α
jls , converges to the corresponding true function αls in probability. Specifi­
cally, we have
d1
L
l=1
(0)
jl0 Tl
α
− αl0 Tl +
d1 L
d2
L
l=1 s=1
(0)
α
jls (Xs )Tl
− αls (Xs )Tl
2
= OP
+
�
Nn
n
.
+
�
Nn
n
.
Nn−(p+1)
Proof. Theorem 1 in Xue and Yang (2006) entails that
max
1�l�d1
(0)
α
jl0
− αl0 +
(0)
α
jls (Xs ) −
max
1�l�d1 ,1�s�d2
αls (Xs )
2
= OP
Nn−(p+1)
Therefore, Lemma follows from Assumption (C.4).
3.7.2
Proof of Theorem 3.3.1
For notation simplicity, denote
d
d
1 L
2
L
1
Y−
Zls (xs , tl )γls
Ln (γ) =
2n
l=1 s=0
2
+
2
d1 L
d2
L
l=1 s=0
)
(
pλn lγls lWls .
− 1
∗ = W γ . Consequently one has W∗ =
Now, let Z∗ls = Zls Wls 2 and γls
ls ls
ls
∗
Z∗T
ls Zls
n
= IJn for
∗ = 1. Therefore, (3.3) can be rewritten as
s = 0, and Wl0
γ
j∗ = argminLn (γ ∗ )
γ∗

d1 L
d2
 1
L
= argmin
Y−
Z∗ls (xs , tl )γ ∗ls
2n
γ∗
l=1 s=0

d1 L
d2
 1
L
∗
= argmin
Y−
Z∗ls (xs , tl )γls
 2n
γ∗
l=1 s=0
2
+
l=1 s=0
2
2
+
2
d1 L
d2
L
d1 L
d2
L
l=1 s=1

(
)
∗
pλn lγls
lW∗
ls 
pλn (lγls l2 ) +
d1
L
l=1
pλn (|γl∗0 |)
Therefore, for any given index pair (l, s) ∈ S, the partial derivative
d
d
1 L
2
L
∂Ln (γ ∗ )
1 ∗T
∗
=
−
Y
−
l2 ) ,
Z
Z∗l′ s′ (xs′ , tl′ )γl∗′ s′ + ∂pλn (lγls
ls
∗
∂γls
n
′
′
l =1 s =0



.
54
( ∗ )
( ∗ )
where ∂pλn lγls l2 is the subgradient of pλn lγls l2 .
[
]
�d1 �d2
∗ (x ′ , t ′ )γ ∗
Z
Y
−
We denote C∗ls (γ ∗ ) = − n1 Z∗T
l
ls
l′ =1
s′ =0 l′ s′ s
l′ s′ . Then the KKT
local optimality condition suggests that, any γ ∗ satisfying the following two conditions
must be a local minimum of our penalized objective function,
(i)
∗
l2 > aλn , for (l, s) ∈ S (0) ;
C∗ls (γ ∗ ) = 0, lγls
(ii)
∗
lC∗ls (γ ∗ )l2 � λn , lγls
l2 < λn , for (l, s) ∈
/ S (0) .
Equivalently in terms of γ, the vector of untransformed coefficients, the sufficient condi­
tions for a solution to be a local minimum are
(i)
Cls (γ) = 0, lγls lWls > aλn , for (l, s) ∈ S (0) ;
(3.10)
(ii)
lCls (γ)lW−1 � λn , lγls lWls < λn , for (l, s) ∈
/ S (0) .
(3.11)
ls
Therefore, to prove γ
j(0) ∈ An (λn ), one only needs to prove (3.10) and (3.11) for γ = γ
j(0) .
j(0) , the first equation in (3.10) and the second inequality in (3.11)
When γ = γ
naturally hold by the definition of γ
j(0) . So one only needs to prove that, except on an
event whose probability tends to 0, as n → ∞,
(i)
(ii)
(0)
j
γls
Wls
> aλn , for (l, s) ∈ S (0) ;
(0)
Cls (γ
jls )
−1
Wls
� λn , for (l, s) ∈
/ S (0) .
(0) 2
We first prove (3.12). Note that γ
jls
(3.12)
(0)
Wls
= α
jls (Xs )Tl
2
n
(3.13)
is the empirical norm
of an non-zero additive term of the oracle estimator. Lemma 3.7.3 implies that,
(0)
(0)
γls
j
Wls
=
(0)
jls (Xs )Tl
α
α
jls (Xs )Tl
n
(0)
α
jls (Xs )Tl
; [lαls (Xs )Tl l2 − oP (1)]
(0)
α
jls (Xs )Tl
(0)
α
jls (Xs )Tl
n
.
55
Furthermore, Assumptions (C.2), (C.6) and Lemma A.4 in Xue and Yang (2006b) imply
that the second multiplicative term converges to 1 in probability. Together with Assump­
tions (C.1), (C.3), and (iii) of Assumption (C.5), one has that, with probability goes to
1,
(0)
γls
j
; lαls (Xs )Tl l2 − oP (1)
J [
r
=
E α2ls (Xs )Tl2 − oP (1)
J [
}r
{
=
E α2ls (Xs )E Tl2 |Xs − oP (1)
J [
r
; c E α2ls (Xs ) − oP (1)
Wls
= c − oP (1) ; aλn .
(3.14)
Therefore (3.12) is proved. Now for (3.13), we define Z(1) = (Zls , (l, s) ∈ S (0) ) as the
column-wise combination of all Zls matrices corresponding to all “nonzero” components,
and Z(2) = (Zls , (l, s) ∈
/ S (0) ) as the column-wise combination of all Zls matrices corre­
�1 �
sponding to all “redundant” components. Recall that Y = dl=1
s∈Sl αls (Xs )Tl +ε. One
has
j(0) ) =
Cls (γ
=
)−1
(
1 T
Zls In − Z(1) ZT(1) Z(1)
ZT(1) Y
n
1 T
Z Hn [δ + ε] ,
n ls
(
)−1
�
T and δ =
T
where Hn = In − Z(1) ZT(1) Z(1)
Z(1)
(l,s)∈S (0) δls with δls = (δ1,ls , · · · , δn,ls )
(0)
and δi,ls = αls (xi,s )ti,l − α
jls (xi,s )ti,l . Therefore,
P
(
Cls (γ
j
(0)
)
−1
Wls
> λn , ∃(l, s) ∈
/S
(0)
)
� P
+ P
(
(
max
1
ZTls Hn δ
n
max
1
ZTls Hn ε
n
(l,s)∈S
/ (0)
(l,s)∈S
/ (0)
λn
−1 >
Wls
2
)
λn
(3.15)
−1 >
Wls
2 .
According to Lemma 3.7.1, one has
max
(l,s)∈
/S (0)
1
ZTls Hn δ
n
−1
Wls
�
max
(l,s)∈
/S (0)
c T
Z Hn δ
n ls
2
)
c
� √ lHn δl2 .
n
56
Note that Hn � In . That is, In − Hn is semi-positive definite. The approximation theory
√
−(p+1)
in de Boor (2001) (p.149) gives that lδl2 � c nNn
. Therefore, one has
max
(l,s)∈S
/ (0)
1
ZTls Hn δ
n
−1
Wls
�√
1
lδl2 � cNn−(p+1) .
nc1
Consequently, (i) of Assumption (C.5) entails that
(
)
1
λn
T
= 0.
lim P
max
Zls Hn δ W−1 >
n→∞
ls
2
(l,s)∈
/S (0) n
Similarly, one has max(l,s)∈/S (0)
1
n
other hand, denote H(2)T = ZT(2)
TH ε
Zls
n
(3.16)
1
c1 n
T H ε . On the
max(l,s)∈/S (0) Zls
n 2
)−1
(
T Z
T
T and let H(2) be the J
− Z(2)
Z(1)
n
(1) Z(1) Z(1)
ls
−1
Wls
�
(2)
columns of H(2) corresponding to Zls with Hls = ZTls Hn for (l, s) ∈ S (0) . Note that
H(2) H(2)T = Z(2) Hn ZT(2) � Z(2) ZT(2) . Therefore, one has ZTls Hn ε
2
� ZTls ε 2 . Conse­
quently,
max
(l,s)∈
/S (0)
1
ZTls Hn ε
n
−1
Wls
�
1
T
ε 2.
max Zls
c1 n (l,s)∈S
/ (0)
According to Lemma 7 in Xue and Qu (2012),
)
(
(J
)
1
T
=O
log (Nn d1 d2 ) .
E
max √ Zls ε
n
(l,s)∈
/S ( 0)
2
Then Markov inequality implies
(
1
P
max
ZTls Hn ε
(l,s)∈
/S (0) n
−1
Wls
λn
>
2
)
�C
�
log (Nn )
.
nλ2n
(3.17)
Then (3.13) follows from (3.16), (3.17) and (ii) of Assumption (C.5).
3.7.3
Proof of Theorem 3.3.2
To prove γ
j(0) converges to the global minimum in (3.3), it suffices to show that
(
)
P Ln (γ) ; Ln (γ
j(0)) for all γ ∈ Rd1 d2 Jn → 1 as n → ∞.
(3.18)
We first define several subset of the index pair set S = {(l, s)|l = 1, · · · , d1 , s = 1, · · · ,
}
{
/ S (0) . For any given γ ∈ Rd1 d2 Jn , let
d2 }. Let P = S (0) and N = S/S (0) = (l, s)|(l, s) ∈
{
}
(l, s) ∈ P = S (0) , lγls lWls > aλn , P − = P/P + ;
P+ =
{
}
N+ =
(l, s) ∈ N = S/S (0) , lγls lWls > λn , N − = N /N + .
57
Recall that Z(1) is the columnwise combination of Zls satisfying (l, s) ∈ S (0) = P, and
Z(2) is the columnwise combination of Zls satisfying (l, s) ∈ N = S (0) . We further define
j (2) as the column-wise projection of Z(2) onto Z(1) . That is, the j th column of Z
j (2) is the
Z
projection of the j th column of Z(2) onto the linear space spanned by the column vectors of
]−1
[
T .
Furthermore, we denote
j (2) = K(1) Z(2) , where K(1) = Z(1) ZT Z(1)
Z(1) with Z
Z(1)
(1)
J (2) = Z(2) − Z
j (2) and Y
j (0) = Zj
Z
γ (0)
. Then through orthogonal decomposition, one has
Ln (γ) =
1 j (0)
j (2) γ(2)
Y − Z(1) γ(1) − Z
2n
2
2
+
1
j (0) − Z
J (2) γ(2)
Y−Y
2n
2
2
+
L
(l,s)∈S
pλn (lγls lWls ).
Here γ(1) and γ(2) are coefficients corresponding to Z(1) and Z(2) . Note that
j (0) − Z
J (2) γ(2)
Y−Y
Then one has
2
2
j (0)
= Y−Y
2
2
T JT J
j (0) , Z
J (2) γ(2) ).
Z(2) Z(2) γ(2) − 2(Y − Y
+ γ(2)
Ln (γ) − Ln (j
γ (0) )
2
1 j (0)
JT Z
J γ
j (2) γ(2) + 1
γ T Z
Y − Z(1) γ(1) − Z
2n
2n (2) (2) (2) (2)
2
L
1
(0)
j (0) , Z
J (2) γ(2) ) +
jls
pλn (lγls lWls ) − pλn ( γ
− (Y − Y
n
;
(l,s)∈S
Consider the linear space A spanned by
Wls
) .
{
}
j ls : (l, s) ∈ P + ∪ N , where we let Z
j ls =
Z
j (0) be the projection of Y
j (0) onto A. Let Z
j A = (Z
j ls , (l, s) ∈
H(1) Zls for (l, s) ∈ S (0) . Let Y
A
j AC = (Z
j ls , (l, s) ∈ P − ) and
P + ∪ N ) and γ A = (γls , (l, s) ∈ P + ∪ N ). Similarly, we let Z
58
C
γ A = (γls , (l, s) ∈ P − ). Then we have
Ln (γ) − Ln (j
γ (0) )
1 j (0) j (0) j (0) j A j
1 T JT J
C 2
Y − YA + YA − ZA γ − ZAC γ A
+
γ Z Z γ
2n
2n (2) (2) (2) (2)
2
L
1
(0)
j (0) , Z
J (2) γ(2) ) +
pλn (lγls lWls ) − pλn ( γ
jls
)
− (Y − Y
n
Wls
(l,s)∈S
(
)
3 1 j (0) j (0) 2
1 j (0) j A 2
3 j
C 2
−
ZAC γ A
;
Y − YA
+
YA − Z A γ
4 2n
2n
2n
2
2
2
1
1
j (0) , Z
J (2) γ(2) ) +
J γ
JT Z
− (Y − Y
γT Z
n
2n (2) (2) (2) (2)
L
(0)
+
pλn (lγls lWls ) − pλn ( γ
jls
)
;
Wls
(l,s)∈S
;
c j (0) j (0) 2 c j
1 T JT J
C 2
Y − YA
−
ZAC γ A
+
γ Z Z γ
n
n
2n (2) (2) (2) (2)
2
2
L
1
(0)
j,Z
J (2) γ(2) ) +
pλn (lγls lWls ) − pλn ( γ
jls
− (Y − Y
n
(l,s)∈S
Wls
)
= I − II + III − IV + V.
(3.19)
In the following, we will discuss all the five additive terms in (3.19) separately. We first
J (2,2) =
discuss the term III. Let C
- (2)T Z
- (2)
Z
.
n
One has
(
)−1
J (2,2) = Z(2)T In − Z(1) Z(1)T Z(1)
C
Z(1)T Z(2)
=
(
−Z
(2)T
Z
(1)
(
Z
(1)T
Z
(1)
)−1
, I(2)
)(
ZT Z
n
)(
−Z
(2)T
Z
(1)
)T
)−1
(
(1)T (1)
Z
, I(2)
Z
.
Here I(2) is the identity matrix and has the same number of columns as Z(2) . Lemma 3.7.2
entails that, for any γ(2) ,
}
{(
}T ( ZT Z ) {(
)T
)T
D, I(2) γ(2)
D, I(2) γ(2)
n
( )
2
(
)T
(
) DT
T
γ
; c D, I(2) γ(2) = c γ(2) D, I(2)
I(2) (2)
2
T J (2,2)
γ(2)
C
γ(2) =
T
T
= c γ(2)
DDT γ(2) + γ(2)
γ(2) ; c γ(2)
2
,
2
59
(
)−1
where D = −Z(2)T Z(1) Z(1)T Z(1)
. Therefore, Corollary 3.7.1.1 gives that
III ; c γ(2)
; c
2
2
(l,s)∈N +
L
(l,s)∈N +
lγls l22
lγls l2Wls ; cλn
L
; cλn
L
;c
L
(l,s)∈N +
lγls lWls
lγls l2 .
(l,s)∈N +
(3.20)
Now without loss of generality, let ZA = (Zls : (l, s) ∈ P + ∪ N ). Similarly we have
r (0)
c j (0) j (0) 2
c (0) [
T
j Z In − ZA (ZA
= γ
γ
Y − YA
ZA )−1 ZA Zj
n
n
2
L
2
(0) 2
j(0) ; c
.
jls
; c γ
γ
I =
2
(3.21)
Wls
(l,s)∈P
Furthermore, using exactly the same argument, one can prove that
; (2)
; (2)T Z
Z
n
has the
largest eigenvalue bounded by c > 0. Thus for II,
II =
C
c j
ZAC γ A
n
j,Z
J (2) γ(2) ) = �
For IV = n1 (Y −Y
(l,s)∈N
2
2
1
n
C
� c γA
2
(3.22)
2
I
)
T
j , Zls γls = �
Y−Y
γ (0) ),
(l,s)∈N γls Cls (j
T C (j
(0) ) � C (j
(0) )
the Cauchy-Schwarz inequality gives that, for (l, s) ∈ N , γls
ls γ
ls γ
2
lγls l2 .
Furthermore, according to conclusion (3.13) proved in Theorem 3.3.1, we have Cls (j
γ (0) )
2
lγls l2 =
oP (λn ) lγls l2 . Thus,
IV =
L
1
j,Z
J (2) γ(2) ) = oP (λn )
(Y − Y
lγls l2 .
n
(3.23)
(l,s)∈N
For the last additive term of (3.19), we have
L
(l,s)∈S
L
=
(l,s)∈P +
+
(0)
pλn (lγls lWls ) − pλn ( γ
jls
L
Wls
(0)
pλn (lγls lWls ) − pλn ( γ
jls
(l,s)∈N +
)
(
pλn lγls lWls +
)
L
(l,s)∈N −
Wls
) +
L
(l,s)∈P −
)
(
pλn lγls lWls .
(0)
pλn (lγls lWls ) − pλn ( γ
jls
Wls
)
60
For (l, s) ∈ P + , one notes that lγls lWls > aλn by the definition of P + . Therefore
�
�
(0)
(3.12) entails that (l,s)∈P + pλn (lγls lWls ) − pλn ( γ
jls
) = 0 and (l,s)∈P − pλn (lγls lWls ) − pλn
Wls
(
)
�
−
−
2
; 0−|P | OP (λn ). For (l, s) ∈ N , we have lγls lWls < λn . Thus (l,s)∈N − pλn lγls lWls =
�
λn (l,s)∈N − lγls lWls . Therefore one has
V
=
L
(0)
(l,s)∈S
pλn (lγls lWls ) − pλn ( γ
jls
[
r
; 0 + 0 − P − OP (λ2n ) + λn
; cλn
L
(l,s)∈N −
Wls
L
(l,s)∈N −
)
lγls lWls
lγls l2 − P − OP (λ2n ).
(3.24)
Finally, (3.20), (3.21), (3.22), (3.23) and (3.24) together entail that
Ln (γ) − Ln (j
γ (0) ) ; c
L
(l,s)∈P
+cλn
(0) 2
γ
jls
L
(l,s)∈N −
Wls
C
− c γA
2
L
+ cλn
2
(l,s)∈N +
lγls l2 − P − OP (λ2n ).
lγls l2 −
L
(l,s)∈N
oP (λn ) lγls l2
(3.25)
Note that according to the definition of A which is the combination of P + and N ,
we have
C
γA
2
2
� |P − | a2 λ2n . For the same reason,
λn lγls , (l, s) ∈ N + l2 . Therefore, one has

L
Ln (γ) − Ln (j
γ (0) ) ; c
(l,s)∈P
(0) 2
γ
jls
Wls
+ [cλn − oP (λn )]
− P−
L
(l,s)∈N
Note that, according to (3.14) in proof of Theorem 3.3.1,
2
2
γ(2)
(
2
; l(γls , (l, s) ∈ N + )l2 ;

)
cλ2n + OP (λ2n ) 
lγls l2 .
�
(0) 2
(l,s)∈P
γ
jls
Wls
is great than a
positive constant with probability to 1. Therefore, (3.18) follows from (iii) of Assumption
(C.5). It completes the proof of Theorem 2.
61
TABLE 3.1: Variable selection results for Example 1. The columns of U, C, O give respec­
tively the numbers of under-fitting, correct-fitting and over-fitting from 100 replications.
n
U
C
O
100
4
28
68
250
0
100
0
500
0
100
0
TABLE 3.2: Estimation accuracy for Example 1. The Mean(SE) columns give the mean
and standard errors of α̂10 and α̂20 . The AISE columns give AISEs of α̂11 , α̂12 , α̂21 and
α̂22 respectively.
n
100
250
500
Method
Mean(SE)
AISE
α10 = 2
α20 = 1
α11
α12
α21
SCAD
1.866(0.0458)
0.849(0.0379)
0.9510
0.6230
1.7236
ORACLE
1.965(0.0241)
1.016(0.0132)
0.0935
0.0994
0.1193
FULL
1.868(0.0560)
1.023(0.0415)
1.9228
1.4381
2.7271
SCAD
2.004(0.0082)
1.016(0.0055)
0.0278
0.0241
0.0183
ORACLE
2.004(0.0083)
1.017(0.0053)
0.0242
0.0209
0.0177
FULL
2.006(0.0084)
1.021(0.0057)
0.0331
0.0305
0.0247
SCAD
2.000(0.0045)
1.002(0.0025)
0.0122
0.0133
0.0079
ORACLE
2.000(0.0045)
1.002(0.0025)
0.0121
0.0132
0.0079
FULL
2.001(0.0046)
1.002(0.0025)
0.0137
0.0141
0.0095
62
TABLE 3.3: Estimation accuracy for Example 2. The AISE columns give AISEs of α̂11 ,
α̂12 and α̂21 respectively.
Method
AISE
α11
α12
α21
SCAD
0.0871
0.0819
0.0735
Oracle
0.0218
0.0173
0.0202
TABLE 3.4: Analysis result of Tucson housing price data.
Method
R2
MAEE
MAPE
Percentage of “good” prediction
SCAD
0.846
13419.1
13801.7
61.3%
Full
0.861
12975.7
15385.0
61.9%
Linear
0.555
19904.2
19274.6
45.5%
2
1.5
1
1.0
0.5
−2
−1
0
1
2
3
−3
−2
0
1
2
3
−1
−2
−3
−2
−1
0
1
2
3
−3
−2
(c) (a3)
0
1
2
3
2
3
2
3
2
3
−2
−1
0
1
2
3
−2
0
1
2
3
−1
−3
−2
−1
0
1
2
3
−3
−2
(g) (b3)
0
1
−2
−1
0
1
2
3
−2
0
1
2
3
−1
−3
−2
−1
0
1
2
3
−3
−2
(k) (c3)
0
1
−3
−2
−1
0
1
(m) (d1)
2
3
−3
−2
−1
0
1
(n) (d2)
2
3
−2
−1.5
−1
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(l) (c4)
1.5
(j) (c2)
1.5
(i) (c1)
−1
−2
−1.5
−3
2
−3
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(h) (b4)
1.5
(f ) (b2)
1.5
(e) (b1)
−1
−2
−1.5
−3
2
−3
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(d) (a4)
1.5
(b) (a2)
1.5
(a) (a1)
−1
2
−3
0
−0.5 0.0
−1.5
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
4
1.5
63
−3
−2
−1
0
1
(o) (d3)
2
3
−3
−2
−1
0
1
(p) (d4)
FIGURE 3.1: Fitted curves for each αls (·), l = 1, 2, s = 1, 2 in Example 1. Plots (a1)-(a4)
show the 100 fitted curves for α11 (x) = sin(x), α12 (x) = x, α21 (x) = sin(x), α22 (x) = 0
respectively when sample size n = 100. Plots (b1)-(b4) and (c1)-(c4) are for n = 250
and n = 500 respectively. (d1)-(d4) plot the true model functions (solid) as well as the
typically estimated curves of different sample sizes: dashed lines (n = 100), dotted lines
(n = 250), and the dot-dashed lines (n = 500).
4
2
0
−2
−4
−2
−1
0
1
2
64
−3
−2
−1
0
1
2
3
−3
−2
−1
1
2
3
(b) (b)
−2
−1
0
1
2
(a) (a)
0
−3
−2
−1
0
1
2
3
(c) (c)
FIGURE 3.2: Fitted curves for α11 (·), α12 (·), and α21 (·) in Example 2. Plotted are
the true functions (solid), typical oracle estimates (dashed) and typical SCAD estimates
(dotted). The typical estimated curve is the one whose ISE is the median among the 100
ISEs from replications.
FIGURE 3.3: Graphs of estimated curves of all fifteen additive coefficient components αls ,
l = 1, · · · , 5, s = 1, 2, 3. SCAD estimators are plotted in solid curves, and full estimators
are plotted in dashed curves. The non-zero components selected by SCAD are α11 , α12 ,
α13 , α21 , α41 .
FIGURE 3.4: Plots of actual housing prices against predicted values from SCAD, FULL and a linear regression model. To reduce overplotting of the 891 points, we randomly selected 80 points for plotting. In each plot, the dotted line is the reference line y = x, and the two dashed lines represent y = x + 0.1 |x| and y = x − 0.1 |x| respectively. Any point enclosed within the two dashed lines represents a “good” predicted price.
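The percentage of “good” predictions reported in Table 3.4 follows directly from this criterion: a prediction is counted as good when it lies within 10% of the actual price. A minimal sketch of the computation is given below; the variable names actual_price and predicted_price are illustrative, not objects from the analysis code.

    import numpy as np

    def percent_good_predictions(actual, predicted, tol=0.10):
        """Percentage of points with |predicted - actual| <= tol * |actual|,
        i.e. points lying between y = x - 0.1|x| and y = x + 0.1|x|."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        good = np.abs(predicted - actual) <= tol * np.abs(actual)
        return 100.0 * np.mean(good)

    # e.g. percent_good_predictions(actual_price, predicted_price) would be
    # comparable to the 61.3% reported for SCAD in Table 3.4.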
4 CONCLUSIONS
In this dissertation, we proposed the SCAD-penalized polynomial spline estimation method for simultaneous estimation and variable selection with nonlinear time series data and with randomly sampled data of nonparametric structure, using a stochastic additive model and the more general additive coefficient model, respectively. Both semi-parametric models can capture very general data structures while avoiding the “curse of dimensionality” through their additive formulation.
The proposed method requires minimizing an objective function that is the sum of squared errors plus a non-convex penalty. To solve this non-convex optimization problem, we adopted the local linear approximation (LLA) method, originally introduced for linear regression models, and extended it to both the SAM and the ACM. The LLA method greatly reduces the computational cost while still yielding estimators whose consistency we have established theoretically. We demonstrated the effectiveness of the proposed method through Monte Carlo studies and real data analyses.
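To make the LLA idea concrete, the sketch below shows its two ingredients: the derivative of the SCAD penalty of Fan and Li (2001), and a single LLA step that replaces each non-convex penalty term by a linear majorizer whose weight is that derivative evaluated at the current group norm, so that the update reduces to a convex, weighted group-penalized least squares problem. The group structure and the solver solve_weighted_problem are placeholders for illustration; this is a schematic sketch under those assumptions, not the code used for the reported results.

    import numpy as np

    def scad_derivative(t, lam, a=3.7):
        """Derivative p'_lam(t) of the SCAD penalty for t >= 0 (Fan and Li, 2001)."""
        t = np.asarray(t, dtype=float)
        return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

    def lla_step(gamma_groups, lam, solve_weighted_problem):
        """One local linear approximation (LLA) step.

        gamma_groups: dict mapping a component index (l, s) to its current
            spline coefficient vector (illustrative data structure).
        solve_weighted_problem: placeholder for a solver that minimizes the sum
            of squared errors plus the sum over (l, s) of weight * ||gamma_ls||,
            for example by a coordinate-wise algorithm.
        """
        # Linearize each SCAD term at the current group norm; the weight is
        # p'_lam evaluated at ||gamma_ls|| from the previous iteration.
        weights = {key: float(scad_derivative(np.linalg.norm(g), lam))
                   for key, g in gamma_groups.items()}
        # The resulting weighted problem is convex and is handed to the solver.
        return solve_weighted_problem(weights)

Iterating such steps until the group norms stabilize gives an approximate minimizer of the penalized objective.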
We further proved the oracle properties of the penalized estimators in both models. For the stochastic additive model, the proofs assume only weak dependence of the time series data, which is more general than assuming independent observations and therefore broadens the theoretical conclusions. For the additive coefficient model, we proved directly that the proposed estimator, the global minimizer of the objective function, possesses the oracle properties. This is a stronger conclusion than in previous work, which only established the existence of a sequence of consistent local minimizers.
Our research can be advanced in at least two directions. First, the analysis of time series data is an important problem. We proposed the SAM to analyze such data in this research, but the additive coefficient model can also be used for time series data; autoregressive models are one example. Unlike the SAM, the ACM allows interactions between lags and can therefore model more general data structures. The global optimality of the SCAD estimator could be proven for the ACM under appropriate assumptions on the time series data. Second, statistical genetics has been a growing area in recent years, and the analysis of genetic data often generates new research questions involving high-dimensional data. According to Fan and Li (2001), the SCAD penalty is able to handle high-dimensional data, so it would be interesting to extend the proposed method to the high-dimensional case where the number of predictors is no longer fixed but increases with the sample size. High-dimensional versions of the SAM and ACM could then contribute to solving problems in statistical genetics. Although new theory remains to be developed, a high-dimensional example was included in the numerical studies in subsection 3.5, and the proposed method performed satisfactorily.
BIBLIOGRAPHY
Ang, A. and Chen, J. (2002). Asymmetric correlations of equity portfolios. Journal of Financial Economics, 63, 443–494.
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, 2nd Ed. New York: Springer-Verlag.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals
of Statistics, 24, 2350–2383.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple
regression and correlation. Journal of the American Statistical Association, 80, 580–619.
Brockwell, P. J. and Davis, R. A. (1991). Time series: theory and methods. New York:
Springer-Verlag.
Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313–2404.
Chen, R. and Tsay, R. S. (1993). Nonlinear additive ARX models. Journal of the American
Statistical Association, 88, 955–967.
de Boor, C. (2001). A Practical Guide to Splines. New York: Springer.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression.
Annals of Statistics, 32, 407–499.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of
parameters. Annals of Statistics, 32, 928–961.
Fik, T. J., Ling, D. C. and Mulligan G. F. (2003). Modeling spatial variation in housing
prices: A variable interaction approach. Real Estate Economics, V31, 623–646.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics tools. Technometrics, 35, 109–135.
Härdle, W., Liang, H. and Gao, J. (2000). Partially Linear Models, Physica-Verlag.
Härdle, W. and Stoker, T. (1989). Investigating Smooth Multiple Regression by the
Method of Average Derivatives. Journal of the American Statistical Association, 84,
986–995.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models. London: Chapman
and Hall.
Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. J. R. Stat. Soc. Ser. B, 55, 757–796.
Huang, J. Z. (1998a). Projection estimation in multiple regression with application to
functional ANOVA models. Annals of Statistics, 26, 242–272.
Huang, J. Z. (1998b). Functional ANOVA models for generalized regression. Journal of
Multivariate Analysis, 67, 49–71.
Huang, J. Z., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric
additive models. Annals of Statistics, 38, 2282–2313.
Huang, J. Z., Kooperberg, C., Stone, C. J., and Truong, Y. K. (2000). Functional ANOVA modeling for proportional hazards regression. Annals of Statistics, 28, 961–999.
Huang, J. Z., Wu, C. O. and Zhou, L. (2002). Polynomial spline estimation and inference
for varying coefficient models with longitudinal data. Statistica Sinica, 14, 763–788.
Huang, J. Z. and Yang, L. (2004). Identification of nonlinear additive autoregressive models. J. R. Stat. Soc. Ser. B, 66, 463–477.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of
single-index models. Journal of Econometrics, 58, 71–120.
Jiang, S. and Xue, L. (2013). Lag selection in stochastic additive models. Journal of
Nonparametric Statistics. 25, 129–146.
Kim, Y., Choi, H., and Oh, H. S. (2008). Smoothly clipped absolute deviation on high
dimensions. Journal of the American Statistical Association, 103, 1665–1673.
Lian, H. (2012). Variable selection for high-dimensional generalized varying-coefficient
models. Statistica Sinica, 22, 1563–1588.
Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika, 73, 13–22.
Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in smoothing spline
analysis of variance models. Annals of Statistics, 34, 2272–2297.
Linton, O. B. and Härdle, W. (1996). Estimation of additive regression models with known
links. Biometrika, 83, 529–540.
Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient
model. Econometric Theory, 12, 29–59.
Lütkepohl, H. (1993). Introduction to multiple time series analysis. Berlin: Springer-Verlag.
Ma, S., Song, Q. and Wang, L. (2013). Simultaneous variable selection and estimation in
semiparametric modeling of longitudinal/clustered data. Bernoulli, 19, 252–274.
Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling.
Annals of Statistics, 37, 3779–3821.
Shen, X., Pan, W. and Zhu, Y. (2011). Likelihood-based selection and sharp parameter
estimation. Journal of the American Statistical Association, 107, 223–232.
Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics, 6, 461–464.
Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 689–705.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
Ser. B, 58, 267–288.
Tjøstheim, D. (1994). Non-linear time series: a selective review. Scand. J. Statist., 21, 97–130.
Tong, H. (1990). Nonlinear time series. A dynamical system approach. New York: Oxford
University Press.
Tong, H. (1995). A personal overview of non-linear time series analysis from a chaos
perspective. Scand. J. Statist., 22, 399–445.
Tschernig, R. and Yang, L. (2000). Nonparametric estimation of generalized impulse response function. Discussion Papers, Interdisciplinary Research Project 373: Quantification and Simulation of Economic Processes, 2000.
Wang, L., Li, H. and Huang, J. Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association, 103, 1556–1569.
Wang, H., Li, R. and Tsai, C. (2008). Tuning parameter selectors for the smoothly clipped
absolute deviation method. Biometrika, 94, 553–568.
Wang, L., Liu, X., Liang, H. and Carroll, R. (2011). Estimation and variable selection for generalized additive partial linear models. Annals of Statistics, 39, 1827–1851.
Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive
autoregression model. Annals of Statistics, 35, 2474–2503.
Wei, F., Huang, J., and Li, H. (2011). Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica, 21, 1515–1540.
Xue, L. (2009). Variable selection in additive models. Statistica Sinica, 19, 1281–1296.
Xue, L., and Qu, A. (2012). Variable selection in high-dimensional varying-coefficient models with global optimality. Journal of Machine Learning Research, 13, 1973–1998.
Xue, L., Qu, A. and Zhou, J. (2010). Consistent model selection for marginal generalized
additive model for correlated data. Journal of the American Statistical Association,
105, 1518–1530.
Xue, L., and Yang, L. (2006a). Estimation of semiparametric additive coefficient model.
J. Statist. Plann. Inference, 136, 2506–2534.
Xue, L., and Yang, L. (2006b). Additive coefficient modeling via polynomial spline. Statistica Sinica, 16, 1423–1446.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B, 68, 49–67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American
Statistical Association, 101, 1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Annals of Statistics, 36, 1509–1533.