AN ABSTRACT OF THE DISSERTATION OF
Shuping Jiang for the degree of Doctor of Philosophy in Statistics presented on
June 5, 2013.
Title: Variable Selection in Semi-parametric Models.
Abstract approved:
Lan Xue
We consider two semiparametric regression models for data analysis: the stochastic additive model (SAM) for nonlinear time series data and the additive coefficient model (ACM) for randomly sampled data with nonparametric structure. We employ the SCAD-penalized polynomial spline method for estimation and simultaneous variable selection in both models. It approximates the nonparametric functions by polynomial splines and minimizes the sum of squared errors subject to an additive penalty on the norms of the spline functions. A coordinate-wise algorithm is developed for solving the penalized polynomial spline problem. For SAM, we establish that, under geometric α-mixing, the resulting estimator enjoys the optimal rate of convergence for estimating the nonparametric functions. It also selects the correct model with probability approaching one as the sample size increases. For ACM, we investigate the asymptotic properties of the global solution of the non-convex objective function. We establish explicitly that the oracle estimator is the global solution with probability approaching one. Therefore, the global solution enjoys both estimation and selection consistency. In the literature, the asymptotic properties of local solutions, rather than global solutions, are well established for non-convex penalty functions. Our theoretical results broaden the traditional understanding of the penalized polynomial spline method. For both models, extensive Monte Carlo studies show that the proposed procedures work effectively even with moderate sample sizes. We also illustrate the use of the proposed methods by analyzing the US unemployment time series under SAM and the Tucson housing price data under ACM.
©
Copyright by Shuping Jiang
June 5, 2013
All Rights Reserved
Variable Selection in Semi-parametric Models
by
Shuping Jiang
A DISSERTATION
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented June 5, 2013
Commencement June 2014
Doctor of Philosophy dissertation of Shuping Jiang presented on June 5, 2013
APPROVED:
Major Professor, representing Statistics
Chair of the Department of Statistics
Dean of the Graduate School
I understand that my dissertation will become part of the permanent collection of Oregon
State University libraries. My signature below authorizes release of my dissertation to
any reader upon request.
Shuping Jiang, Author
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my deepest appreciation to the many people who have helped me over the past five years.
I first want to thank my advisor, Professor Lan Xue, for her continuous and unreserved guidance, support and encouragement during the past five years. I was lucky to have her as my advisor. With endless patience and full responsibility, she taught me every single step of how to do research. She was always helpful and insightful when I needed guidance, completely understanding and considerate when I came across difficulties, and especially proactive and supportive about any possible opportunities for me. As my advisor, she has not only helped me in my Ph.D. research, but also, by her personal example, set a high standard for me to pursue in my future life.
I would also like to thank Professors Yanming Di, Virginia Lesser, Sinisa Todorovic,
and Bo Zhang for serving on my committee. Professor Lesser helped me a great deal with my conference attendance last year, as well as during my job search; I am truly grateful for that. Professor Zhang provided me a year of valuable research experience as his research assistant, which gave me a precious chance to study an area entirely different from my Ph.D. research. He also gave me much advice on looking for an internship, which I deeply appreciate. Professor Di has been supportive in serving on both my Ph.D. and master's committees.
I would like to thank all other professors in the department for teaching me statistics:
Professors David Birkes, Alix Gitelman, Lisa Madsen, Paul Murtaugh, Cliff Pereira, Dan
Schafer, and Bob Smythe. Your teaching opened the door of statistics to me and showed me a beautiful world. I would also like to thank Professor Yuan Jiang for his help when I had theoretical questions. Your passion for statistics is impressive and influential.
Finally, I would like to thank all of my friends, especially Meian, Zoey, Xuan, Gu and
Yan, for helping me, being with me, and trusting me. It’s my honor to know you all.
TABLE OF CONTENTS

1  INTRODUCTION

2  LAG SELECTION IN STOCHASTIC ADDITIVE MODELS
   2.1  Model and Introduction
   2.2  Polynomial spline estimation
        2.2.1  An initial estimator
        2.2.2  Penalized polynomial spline estimation
   2.3  Algorithm
        2.3.1  Tuning Parameter and Knots Selection
   2.4  Simulation studies
   2.5  Real data analysis
   2.6  Assumptions and Proofs
        2.6.1  Notation and Definitions
        2.6.2  Assumptions
        2.6.3  Proof of Preliminary Lemmas
        2.6.4  Proof of Theorems

3  CONSISTENT MODEL SELECTION IN ADDITIVE COEFFICIENT MODELS WITH GLOBAL OPTIMALITY
   3.1  The Model
   3.2  Penalized Polynomial Spline Estimation
   3.3  Optimal Properties
   3.4  Implementation
        3.4.1  The local linear approximation algorithm
        3.4.2  Selection of tuning parameters
   3.5  Simulation studies
        3.5.1  Example 1: Low dimensional case
        3.5.2  Example 2: High dimensional case with intercept
   3.6  Real data analysis
   3.7  Proof of Lemmas and Theorems
        3.7.1  Preliminary Lemmas
        3.7.2  Proof of Theorem 3.3.1
        3.7.3  Proof of Theorem 3.3.2

4  CONCLUSIONS

BIBLIOGRAPHY
LIST OF FIGURES

2.1  The empirical norms of all estimated additive components plotted against the tuning parameter $\lambda_n$. Data were simulated from the linear AR3 model (left) and the nonlinear NLAR1 model (right) in Table 2.1 for one run with $n = 500$. The location of the optimal parameter $\hat\lambda_n$ selected by BIC is marked by the dashed line.

2.2  The estimated relevant component functions for models NLAR1 ((a) and (b)) and NLAR2 ((c) and (d)) using three approaches. Model NLAR1 is fitted in the linear spline space and model NLAR2 in the cubic spline space. The dash-dotted and dotted lines represent the polynomial spline estimates of the oracle and full models, respectively. The dashed lines represent the penalized polynomial spline estimates. The true component functions are plotted as solid lines.

3.1  Fitted curves for each $\alpha_{ls}(\cdot)$, $l = 1, 2$, $s = 1, 2$ in Example 1. Plots (a1)-(a4) show the 100 fitted curves for $\alpha_{11}(x) = \sin(x)$, $\alpha_{12}(x) = x$, $\alpha_{21}(x) = \sin(x)$, $\alpha_{22}(x) = 0$, respectively, when the sample size is $n = 100$. Plots (b1)-(b4) and (c1)-(c4) are for $n = 250$ and $n = 500$, respectively. Plots (d1)-(d4) show the true component functions (solid) together with typical estimated curves for different sample sizes: dashed ($n = 100$), dotted ($n = 250$), and dot-dashed ($n = 500$).

3.2  Fitted curves for $\alpha_{11}(\cdot)$, $\alpha_{12}(\cdot)$, and $\alpha_{21}(\cdot)$ in Example 2. Plotted are the true functions (solid), typical oracle estimates (dashed), and typical SCAD estimates (dotted). The typical estimated curve is the one whose ISE is the median of the 100 ISEs from the replications.

3.3  Estimated curves of all fifteen additive coefficient components $\alpha_{ls}$, $l = 1, \ldots, 5$, $s = 1, 2, 3$. SCAD estimators are plotted as solid curves and full estimators as dashed curves. The non-zero components selected by SCAD are $\alpha_{11}$, $\alpha_{12}$, $\alpha_{13}$, $\alpha_{21}$, $\alpha_{41}$.

3.4  Actual housing prices plotted against predicted values from SCAD, FULL, and a linear regression model. To reduce crowding of all 891 points, 80 randomly selected points are plotted. In each plot, the dotted line is the axis of symmetry $y = x$, and the two dashed lines represent $y = x + 0.1|x|$ and $y = x - 0.1|x|$. Any point enclosed between the two dashed lines represents a "good" predicted price.
LIST OF TABLES

2.1  Autoregressive models in the simulation study.

2.2  Lag selection results for the simulation study. The columns U, C, and O give, respectively, the percentages of under-fitting, correct-fitting, and over-fitting over 500 replications.

2.3  The Penalty columns give MISEs from the penalized polynomial spline method. The Oracle and Full columns give MISEs of the polynomial spline estimates of the oracle and full models, respectively.

2.4  Analysis results for the US unemployment data. The Lags column gives the selected significant lags of the quarterly US unemployment series.

3.1  Variable selection results for Example 1. The columns U, C, and O give, respectively, the numbers of under-fitting, correct-fitting, and over-fitting fits out of 100 replications.

3.2  Estimation accuracy for Example 1. The Mean (SE) columns give the means and standard errors of $\hat\alpha_{10}$ and $\hat\alpha_{20}$. The AISE columns give the AISEs of $\hat\alpha_{11}$, $\hat\alpha_{12}$, $\hat\alpha_{21}$, and $\hat\alpha_{22}$, respectively.

3.3  Estimation accuracy for Example 2. The AISE columns give the AISEs of $\hat\alpha_{11}$, $\hat\alpha_{12}$, and $\hat\alpha_{21}$, respectively.

3.4  Analysis results for the Tucson housing price data.
FOR GRANDMOM.
VARIABLE SELECTION IN SEMI-PARAMETRIC MODELS
1  INTRODUCTION
Linear regression models are widely used in statistical analysis. They are classical tools for analyzing many types of data, including independent data, such as randomly sampled data, and dependent data, such as time series. However, like other parametric regression models, they rely on an explicit parametric form of the regression function. This restricts the applicability of the model and often requires prior knowledge of the data structure for correct modeling. An inappropriate application of parametric models may result in estimation bias and erroneous inferences. In contrast, nonparametric models provide great flexibility for exploring hidden data structures without any prior knowledge of the structure of the data. In recent years, nonparametric models have therefore gained increasing popularity as an alternative tool that avoids the problems caused by using inappropriate parametric models.
While general nonparametric models are useful for exploring data structures, they suffer from the "curse of dimensionality". When there are a large number of variables, the volume of the predictor space grows so quickly that the data become sparse in the space, and estimation becomes unreliable. In this case general nonparametric models can rarely be used to make effective inferences. Therefore, a class of semiparametric models was developed. Semiparametric models impose certain structures on the general nonparametric model while retaining some nonparametric components. As a result, semiparametric models can avoid the "curse of dimensionality" while still being more flexible than parametric models. This class of models includes the partially linear model (Härdle, Liang and Gao 2000), which allows part of the additive components in the linear regression to be nonparametric; the additive model (Stone 1985, Hastie and Tibshirani 1990), which relaxes the strict linearity assumption but retains the interpretable additive form; the single index model (Härdle and Stoker 1989, Ichimura 1993), which allows the mean response to be a nonparametric function of a linear combination of the predictors; the varying coefficient model (Hastie and Tibshirani 1993), which replaces the constant coefficient of each linear predictor by a nonparametric function of another predictor; and the additive coefficient model (Xue and Yang 2006 a & b), whose coefficients of the linear predictors are additive nonparametric functions of multiple predictors.
To estimate semiparametric/nonparametric models, we use one classical tool, polynomial splines. Compared with the local polynomial method, the polynomial spline method enables global smoothing and estimation of all nonparametric components through a single least squares fit, and therefore reduces computational complexity. General theories of polynomial spline smoothing were established in Stone (1985), de Boor (2001), and Huang (1998 a & b, 2000).
Besides estimation, model/variable selection is also crucial in semiparametric modeling. A large number of predictors are often introduced at the beginning of the modeling process to avoid missing any potentially important explanatory variables. Consequently, selection of significant predictors becomes necessary. The variable selection process not only reduces model complexity and eases interpretation, but also greatly enhances predictive ability. However, according to Breiman (1996), traditional approaches to variable selection, including stepwise and subset selection methods, are computationally intensive and unstable, and their sampling properties are difficult to summarize. Such manual selection also introduces stochastic errors at the selection stage that are difficult to account for in subsequent inference. Therefore, since the 1990s, a number of penalization methods have emerged as an alternative approach to variable selection in linear regression models. These penalization methods perform estimation and variable selection in the same modeling process, in which the coefficients are estimated and the significant variables are simultaneously and automatically selected. These penalized estimation methods include the bridge regression method proposed by Frank and Friedman (1993), the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996, 1997) and Efron, Hastie, Johnstone, and Tibshirani (2004), the smoothly clipped absolute deviation (SCAD) penalty method proposed by Fan and Li (2001), the adaptive LASSO proposed by Zou (2006), the Dantzig selector by Candès and Tao (2007), the minimax concave penalty by Zhang (2010), and the truncated L1 penalty by Shen, Pan, and Zhu (2012). All of these works assume a linear or parametric form for the regression model. More recent work extends the aforementioned penalization methods to nonparametric additive models. In particular, Lin and Zhang (2006) proposed the component selection and smoothing operator for model selection in a more general nonparametric ANOVA model. Meier, van de Geer, and Bühlmann (2009), Xue (2009), Huang, Horowitz, and Wei (2010), Xue, Qu and Zhou (2010), and Wang, Liu, Liang, and Carroll (2011) proposed to approximate the nonparametric components using B-spline bases and conduct variable selection using a group-wise model selection approach (Yuan and Lin 2006).
Among the penalization methods, the one with the SCAD penalty (Fan and Li 2001) is our focus in this thesis. Fan and Li (2001) showed that the SCAD penalty has the three properties of a desirable penalty function: sparsity, unbiasedness and continuity. That is, for a given tuning parameter, the SCAD penalty shrinks insignificant variables to zero while leaving relatively large significant variables unbiased after the shrinkage; in addition, the shrinkage is continuous under the SCAD penalty. Combining the SCAD penalty with polynomial spline estimation results in a penalized polynomial spline (PPS) method. In this method, for a given tuning parameter, we define the PPS estimator as the global minimizer of the objective function, which equals the sum of squared errors plus the sum of component-wise SCAD penalty functions. In the literature, the PPS method has recently been used for variable/model selection in various semiparametric models. For example, Wang, Li and Huang (2008), Wei, Huang and Li (2011), Lian (2012) and Xue and Qu (2012) considered PPS for variable selection in varying-coefficient models. Xue (2009), Huang, Horowitz and Wei (2010), and Xue, Qu and Zhou (2010) applied PPS for variable selection in (generalized) additive models. Ma, Song and Wang (2013) used PPS for variable selection in single index models.
In the PPS method, minimizing the objective function is challenging because the SCAD penalty is non-convex, and an approximation method is required to solve the minimization problem. A local quadratic approximation (LQA) method in the nonparametric framework was proposed by Xue (2009). A local linear approximation (LLA) method for linear regression was provided by Zou and Li (2008), who showed that the method is computationally efficient, since the one-step approximation is already asymptotically equivalent to the oracle estimator. In our research, we extend this LLA method to semiparametric models.
Finally, the tuning parameter involved in the PPS method needs to be determined via a model selection criterion such as the final prediction error (FPE, Akaike 1969), the Akaike information criterion (AIC, Akaike 1974), the Bayesian information criterion (BIC, Schwarz 1978), or a model selection method such as cross-validation (CV) or generalized cross-validation (GCV). These criteria and methods were originally proposed for the selection of parametric models for independent and identically distributed data. Recently, some of them have been proved consistent for the selection of nonparametric models for weakly dependent time series data. For example, Huang and Yang (2004) provided theoretical justification of the consistency of BIC for variable selection in nonlinear stochastic regression models, provided the data come from a strongly mixing (i.e., α-mixing) and strictly stationary stochastic process. The consistency of the cross-validation and FPE methods has been shown under stronger mixing conditions (i.e., β-mixing) elsewhere in the literature (Tschernig and Yang 2000).
The specific contribution of our research is the study of the theoretical properties and numerical performance of the PPS method for variable selection in two semiparametric models: the stochastic additive model (SAM, Chen and Tsay 1993) and the additive coefficient model (ACM, Xue and Yang 2006 a & b). The SAM is similar to the additive model (Stone 1985, Hastie and Tibshirani 1990); the difference is that the SAM assumes both responses and predictors come from a stationary stochastic process. We use the SAM to analyze weakly dependent time series data, and the ACM to analyze independent, randomly sampled data. For both models, in addition to numerical studies, we establish the asymptotic oracle properties in theory.

For the stochastic additive model (SAM), estimation methods have been studied extensively in the literature. For example, Chen and Tsay (1993) considered two backfitting techniques: the alternating conditional expectation algorithm of Breiman and Friedman (1985) and the BRUTO algorithm of Hastie and Tibshirani (1990). Huang and Yang (2004) estimated the additive components in the SAM using an efficient regression spline procedure. Wang and Yang (2007, 2009) proposed a spline-backfitted kernel estimator, which was shown to be both computationally expedient and theoretically reliable. However, the important issue of variable/lag selection in the SAM has not been well studied in general. Our PPS method performs estimation and variable selection in the same process. Consequently, consistency in both estimation and model selection is established theoretically under the more general assumption of weakly dependent data, rather than the commonly used assumption of independent data.
The second model we consider, the additive coefficient model (ACM) proposed by Xue and Yang (2006a), extends the varying coefficient model to be more flexible. It models the dynamic changes of regression coefficients by allowing the coefficient functions to vary with multiple variables in an additive functional form, and is particularly useful in modeling temporal and spatial data. As discussed in Xue and Yang (2006a), the additive coefficient model includes the aforementioned additive model, varying coefficient model, partially linear model, and linear regression model as special cases. Estimation of the additive coefficient model was studied in Xue and Yang (2006 a & b), using local polynomial and polynomial spline methods, respectively. The model selection problem for the ACM had not been studied prior to this research.

In our theoretical results, we prove a set of strong global oracle properties for the ACM. Recent literature, including Xue (2009) and Xue, Qu and Zhou (2010), considers the SCAD penalty but only guarantees that there exists a sequence of local minima that is consistent in recovering the sparse model and consistent in estimating the non-zero function components. However, when the objective function is non-convex, as in our case, a local minimum is not necessarily the global minimum. Therefore, there is a discrepancy between the theory and the fact that the PPS estimator is by definition a global solution. Previous literature has partially addressed this issue. For example, Xue and Qu (2012) proved the oracle properties of the global minimum under the varying coefficient model, a special case of the additive coefficient model, for the truncated L1 penalty. Kim, Choi and Oh (2008) discussed the oracle properties of the global minimum for the SCAD penalty, but only in linear regression models. Our research in this thesis is the first to study the global minimum under the general ACM with the SCAD penalty.
The remainder of the thesis contains two parts. Section 2 discusses lag selection in stochastic additive models. Subsection 2.1 defines the stochastic additive model and introduces the necessary notation. In Subsection 2.2, we describe an estimation procedure for the stochastic additive model and establish the asymptotic properties of the proposed estimator. We then introduce a penalized polynomial spline method for simultaneous estimation and variable selection in the stochastic additive model, and develop the model selection consistency and oracle properties of the proposed method. In Subsection 2.3, we describe an algorithm based on the local linear approximation of the nonconcave penalty function, and discuss how to select the tuning parameter via the Bayesian information criterion. Subsection 2.4 illustrates the numerical performance of the proposed methods through simulation studies, and Subsection 2.5 analyzes the US unemployment time series. Technical proofs, relevant definitions and assumptions are contained in Subsection 2.6.

Section 3 focuses on model selection in additive coefficient models. In Subsection 3.1, we introduce the additive coefficient model. In Subsection 3.2, we propose the penalized polynomial spline method for simultaneous model selection and estimation in the additive coefficient model. The asymptotic properties of the proposed method are established in Subsection 3.3. We describe an algorithm based on the local linear approximation to solve the non-convex minimization problem and discuss tuning parameter selection by the Bayesian information criterion (BIC) in Subsection 3.4. Subsections 3.5 and 3.6 contain simulation studies and an analysis of the Tucson housing price data. Technical proofs are relegated to Subsection 3.7.
2  LAG SELECTION IN STOCHASTIC ADDITIVE MODELS

2.1  Model and Introduction
Linear models such as the ARMA, ARIMA and SARIMA models (Brockwell and Davis 1991), and the VAR, VARMA and PAR models (Lütkepohl 1993), have been popular tools for modeling time series data. Their simple linear structures allow for easy estimation and inference about the data generating process. However, for many important time series, there is little a priori justification for assuming such linearity. Tong (1990, 1995) and Tjøstheim (1994) gave interesting examples of data with asymmetric cycles and nonlinear relationships between lagged variables. To model such nonlinear time series, non- and semi-parametric models are useful alternatives. In this paper, we consider a stochastic additive model. Let $(X_t, Y_t)_{t=-\infty}^{+\infty}$ with $X_t = (X_{t1}, \ldots, X_{td})^T$ be a stationary stochastic process satisfying
$$Y_t = \mu_0 + \mu_1(X_{t1}) + \mu_2(X_{t2}) + \cdots + \mu_d(X_{td}) + \varepsilon_t, \qquad (2.1)$$
where $\mu_0$ is an unknown constant, $\{\mu_l(\cdot)\}_{l=1}^d$ are unknown nonparametric functions, and $\{\varepsilon_t\}$ is white noise conditional on $X_t$. The covariates $X_t$ can contain both variables from an exogenous time series and lagged values of $Y_t$. When it contains only lagged values of $Y_t$, the model reduces to the nonlinear additive autoregressive model considered in Chen and Tsay (1993).
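To fix ideas, the following Python sketch generates data of the form (2.1) from a toy nonlinear additive autoregressive process and builds the lagged-covariate design used later in Sections 2.4 and 2.5. The specific component functions, burn-in length and noise level are illustrative choices of ours, not the models of Table 2.1, and the function names are hypothetical.

```python
import numpy as np

def simulate_nlar(n, burn_in=200, noise_var=0.1, seed=0):
    """Toy nonlinear additive autoregressive series of the form (2.1):
    Y_t = mu_1(Y_{t-1}) + mu_2(Y_{t-2}) + eps_t, with illustrative components."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        y[t] = (0.5 * np.sin(np.pi * y[t - 1])   # mu_1: nonlinear in lag 1
                - 0.3 * y[t - 2]                 # mu_2: linear in lag 2
                + rng.normal(0.0, np.sqrt(noise_var)))
    return y[burn_in:]                           # drop burn-in to approximate stationarity

def lag_design(y, d=10):
    """Covariate matrix with X_{tl} = Y_{t-l}, l = 1, ..., d, and the aligned response."""
    n = len(y)
    X = np.column_stack([y[d - l:n - l] for l in range(1, d + 1)])
    return X, y[d:]

y = simulate_nlar(500)
X, Y = lag_design(y, d=10)    # X.shape == (490, 10)
```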
2.2  Polynomial spline estimation
Let $(X_t, Y_t)_{t=1}^n$ be a sample of size $n$ generated from the stochastic additive model (2.1). The additive functions in model (2.1) are identified only up to additive constants. One practical solution is to impose the identification condition $E[\mu_l(X_{tl})] = 0$ for each $l = 1, \ldots, d$. Furthermore, since nonparametric functions are often estimated only on a compact set, we assume without loss of generality that each $X_{tl}$ has compact support $[0, 1]$. Note that the identification condition entails that $\mu_0 = E(Y)$. Therefore $\mu_0$ can be consistently estimated by the sample average $\bar{Y}$ at the root-$n$ rate, which is faster than any rate of convergence for nonparametric function estimation. For notational simplicity, we therefore assume $\mu_0 = 0$ and that the response variable $Y$ is centered with $\bar{Y} = 0$. Now let $\mu(X_t) = \sum_{l=1}^d \mu_l(X_{tl})$ be the additive function in (2.1) with $E[\mu_l(X_{tl})] = 0$ for each $l = 1, \ldots, d$. The main objectives of this paper are (a) to establish the rate of convergence of the polynomial spline estimator of the nonparametric functions in the stochastic additive model (2.1) for weakly dependent data, and (b) to propose a penalized polynomial spline method for simultaneous lag selection and nonparametric function estimation in model (2.1).
Let $H_{[0,1]}^l = \{f : E[f(X_l)] = 0,\ E[f(X_l)]^2 < \infty\}$ be the space of all square integrable functions that are centered with respect to the probability measure of $X_l$. Here $X_l$ is the $l$-th component of $X = (X_1, \ldots, X_d)^T$. Assume $\mu_l \in H_{[0,1]}^l$ for $l = 1, \ldots, d$. Then $H = \{h : h(X) = \sum_{l=1}^d h_l(X_l),\ h_l \in H_{[0,1]}^l \text{ for } l = 1, \ldots, d\}$ is the space of all square integrable additive functions on $[0,1]^d$, with $\mu \in H$. For any functions $f, g \in H$, define the theoretical and empirical inner products as $(f, g) = E[f(X)g(X)]$ and $(f, g)_n = \frac{1}{n}\sum_{t=1}^n f(X_t)g(X_t)$, respectively. The induced theoretical and empirical norms are $\|g\|^2 = (g, g)$ and $\|g\|_n^2 = (g, g)_n$, respectively. Then Lemma 2.6.1 in the Appendix entails that, under some reasonable assumptions on the joint density of $X$, $H$ is theoretically identifiable in the sense that for any $h \in H$, $\|h\| = 0$ implies $h_l = 0$ a.s. for $l = 1, \ldots, d$.
For each $l = 1, \ldots, d$, we estimate the unknown function $\mu_l$ in (2.1) by polynomial splines. A polynomial spline is a piecewise polynomial that is connected smoothly over a set of interior knots. Let $\{0 = x_{l,0} < x_{l,1} < \cdots < x_{l,N_n} < x_{l,N_n+1} = 1\}$ be a set of $N_n$ interior knots. Then for any integer $p > 0$, let $G_l$ be the space of polynomial spline functions that are polynomials of order $p$ (or less) on each interval $[x_{l,j}, x_{l,j+1}]$, $j = 0, \ldots, N_n$, and are overall $p - 1$ times continuously differentiable on $[0, 1]$. The polynomial spline has been widely used in nonparametric function estimation because it often provides a good approximation to smooth functions with only a small number of knots. Since each $\mu_l$ in model (2.1) is theoretically centered, we consider $G_l^n = \{g : g \in G_l,\ \sum_{t=1}^n g(X_{tl})/n = 0\}$, the space of empirically centered polynomial splines. Let $G^n = \{g : g(X) = g_1(X_1) + \cdots + g_d(X_d),\ g_l \in G_l^n \text{ for } l = 1, \ldots, d\}$, which provides an approximation for functions in the model space $H$. The following lemma shows that the theoretical and empirical norms are asymptotically equivalent on $G^n$.
Lemma 2.2.1. Under Assumptions (A1)-(A3) in the Appendix, one has
$$\sup_{g \in G^n} \left| \frac{\|g\|_n^2}{\|g\|^2} - 1 \right| = o_P(1),$$
that is, for any $\varepsilon > 0$,
$$\limsup_{n \to \infty} P\left( \sup_{g \in G^n} \left| \frac{\|g\|_n^2}{\|g\|^2} - 1 \right| > \varepsilon \right) = 0. \qquad (2.2)$$

Lemma 2.2.1 indicates that, except on a set whose probability goes to zero, the theoretical norm can be approximated by the empirical norm. Together with Lemma 2.6.1, this implies that the approximation space $G^n$ is identifiable under both the theoretical and the empirical norm.

Lemma 2.2.2. Under Assumption (A2), for any $g \in G^n$, $\|g\|_n = 0$ or $\|g\| = 0$ implies $g_l = 0$ a.s. for $l = 1, \ldots, d$.
2.2.1  An initial estimator
Suppose that each $\mu_l$ in model (2.1) is smooth and can be approximated by a polynomial spline function $g_l \in G_l^n$ for $l = 1, \ldots, d$. Then one has
$$\mu(X_t) \approx \sum_{l=1}^d g_l(X_{tl}), \qquad t = 1, \ldots, n. \qquad (2.3)$$
Let $b_l^n(x) = b_l(x) - \frac{1}{n}\sum_{t=1}^n b_l(X_{tl})$, where $b_l(x) = \big(x, x^2, \ldots, x^p, (x - x_{l,1})_+^p, \ldots, (x - x_{l,N})_+^p\big)^T$ and $(x)_+^p = (x_+)^p$. Here $b_l^n(x)$ is the empirically centered truncated power basis for $G_l^n$. Denote $\boldsymbol{\mu} = (\mu(X_1), \ldots, \mu(X_n))^T$ and $B_l = (b_l^n(X_{1l}), \ldots, b_l^n(X_{nl}))^T$. Then an equivalent matrix expression of (2.3) is
$$\boldsymbol{\mu} \approx \sum_{l=1}^d B_l \beta_l, \qquad (2.4)$$
where the coefficients $\beta = (\beta_1^T, \ldots, \beta_d^T)^T$ can be estimated by minimizing the sum of squares
$$\hat\beta = \arg\min_{\beta} \left\| Y - \sum_{l=1}^d B_l \beta_l \right\|_2^2. \qquad (2.5)$$
Here $Y = (Y_1, \ldots, Y_n)^T$ and $\|\cdot\|_2$ is the vector $L_2$-norm. Equivalently, $\hat\mu = \sum_{l=1}^d \hat\mu_l$ with $\hat\mu_l = (b_l^n)^T \hat\beta_l$ satisfies
$$\hat\mu = \arg\min_{g_l \in G_l^n} \left\| Y - \sum_{l=1}^d g_l \right\|_n^2, \qquad (2.6)$$
where, with a slight abuse of notation, $Y$ denotes a random function that interpolates the values $Y_1, \ldots, Y_n$ at the data points. For independent and identically distributed data, it has been shown in the literature (Stone 1985 and Huang 1998) that the polynomial spline estimator $\hat\mu$ in (2.6) is $L_2$-consistent. However, the asymptotic behavior of $\hat\mu$ for time series data is less well understood in general. Theorem 2.2.1 below establishes that $\hat\mu$ in (2.6) is also $L_2$-consistent with an optimal rate of convergence for weakly dependent data that are geometrically $\alpha$-mixing.

Theorem 2.2.1. Under Assumptions (A1)-(A3) in the Appendix, $\|\hat\mu - \mu\|^2 = O_P(N_n/n + \rho_n^2)$ and $\|\hat\mu - \mu\|_n^2 = O_P(N_n/n + \rho_n^2)$, with $\rho_n = 1/N_n^{p+1}$.

The proof of Theorem 2.2.1 is given in the Appendix. Theorem 2.2.1 demonstrates the consistency of the polynomial spline estimator; however, it fails to produce a parsimonious model when there are redundant variables. This result is useful for providing an initial consistent estimator for the subsequent development of simultaneous lag/variable selection and estimation in the stochastic additive model.
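As a concrete illustration of (2.3)-(2.6), the sketch below builds the empirically centered truncated power basis for each covariate, using equally spaced interior knots, and computes the unpenalized least squares fit. The helper names (`truncated_power_basis`, `centered_blocks`, `initial_spline_fit`) are ours, and the knot number is a placeholder.

```python
import numpy as np

def truncated_power_basis(x, knots, p=3):
    """b_l(x) = (x, ..., x^p, (x - k_1)_+^p, ..., (x - k_N)_+^p) for one covariate."""
    powers = np.column_stack([x ** j for j in range(1, p + 1)])
    trunc = np.column_stack([np.clip(x - k, 0.0, None) ** p for k in knots])
    return np.hstack([powers, trunc])

def centered_blocks(X, n_knots=5, p=3):
    """Empirically centered design blocks B_1, ..., B_d with equally spaced interior knots."""
    blocks = []
    for l in range(X.shape[1]):
        lo, hi = X[:, l].min(), X[:, l].max()
        knots = np.linspace(lo, hi, n_knots + 2)[1:-1]
        B = truncated_power_basis(X[:, l], knots, p)
        blocks.append(B - B.mean(axis=0))       # centering, as in the definition of b_l^n
    return blocks

def initial_spline_fit(X, Y, n_knots=5, p=3):
    """Unpenalized least squares estimate of the spline coefficients, as in (2.5)."""
    blocks = centered_blocks(X, n_knots, p)
    B = np.hstack(blocks)
    beta, *_ = np.linalg.lstsq(B, Y - Y.mean(), rcond=None)
    sizes = np.cumsum([blk.shape[1] for blk in blocks])[:-1]
    return np.split(beta, sizes), blocks        # per-component coefficients and blocks
```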
2.2.2  Penalized polynomial spline estimation
For variable selection in the stochastic additive model (2.1), we propose a penalized polynomial spline method. It minimizes the sum of squared errors subject to an additive penalty on the $L_2$ norms of the spline functions. To be more specific, let $\hat\mu^{PL}$ be the penalized polynomial spline (PPS) estimator defined as
$$\hat\mu^{PL} = \arg\min_{g \in G^n} l(g) = \arg\min_{g \in G^n} \left\{ \frac{1}{2} \left\| Y - \sum_{l=1}^d g_l \right\|_n^2 + \sum_{l=1}^d p_\lambda(\|g_l\|_n) \right\}. \qquad (2.7)$$
Here the penalty function $p_\lambda(\cdot)$ is the smoothly clipped absolute deviation (SCAD) penalty proposed in Fan and Li (2001), whose derivative is
$$p'_\lambda(|\beta|) = \lambda I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{a - 1} I(|\beta| > \lambda) \qquad (2.8)$$
with the constant $a = 3.7$ as in Fan and Li (2001) and a positive tuning parameter $\lambda$ whose selection is discussed in Subsection 2.3.1. The SCAD penalty, rather than other penalty functions, is used here because of its desirable properties of unbiasedness, sparsity and continuity (Fan and Li 2001).

The following two theorems show that, for geometrically $\alpha$-mixing time series data, the proposed penalized polynomial spline estimator with the SCAD penalty has nice asymptotic properties in terms of both estimation consistency and sparsity of the variable selection. Without loss of generality, assume that only the first $r$ components of $X_t$ contribute to explaining $Y_t$ in model (2.1). Correspondingly, $\mu = \sum_{l=1}^r \mu_l$, where $r = \min\{s : P[\mu_{s+1}(X_{s+1}) = 0] = \cdots = P[\mu_d(X_d) = 0] = 1\}$.

Theorem 2.2.2. (Estimation consistency) Under Assumptions (A1)-(A5) in the Appendix, for $n$ sufficiently large, there exists a local minimizer $\hat\mu^{PL}$ of the criterion function $l(\cdot)$ in $G^n$ such that $\|\hat\mu^{PL} - \mu\|^2 = O_P(N_n/n + \rho_n^2)$. Furthermore, $\|\hat\mu_l^{PL} - \mu_l\|^2 = O_P(N_n/n + \rho_n^2)$ for $l = 1, \ldots, d$.

Theorem 2.2.3. (Sparsity) Under Assumptions (A1)-(A5) in the Appendix, except on a set whose probability tends to zero as $n \to \infty$, $\hat\mu_l^{PL} = 0$ a.s. for $l = r + 1, \ldots, d$.

Theorem 2.2.3 shows that, unlike the initial polynomial spline estimator, the penalized polynomial spline estimator correctly shrinks the nonparametric components of redundant variables to zero and provides a parsimonious fitted model with probability approaching one. Theorem 2.2.2 further indicates that $\hat\mu^{PL}$ converges to $\mu$ at the same optimal rate as the initial polynomial spline estimator. Therefore, adding the SCAD penalty term does not change the accuracy of estimation. These two facts together establish the advantages of the penalized polynomial spline estimator with the SCAD penalty.
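For reference, the sketch below writes out the SCAD penalty value obtained by integrating the derivative in (2.8) with $a = 3.7$, and evaluates the penalized criterion (2.7) on the basis blocks from the previous sketch; the function names are ours.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) for t >= 0, obtained by integrating the derivative in (2.8)."""
    t = np.asarray(t, dtype=float)
    small = lam * t
    middle = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))

def pps_objective(blocks, Y, betas, lam, a=3.7):
    """Penalized criterion (2.7): 0.5 * ||Y - sum_l B_l beta_l||_n^2 + sum_l p_lambda(||g_l||_n)."""
    n = len(Y)
    fit = sum(B @ b for B, b in zip(blocks, betas))
    sse = 0.5 * np.sum((Y - fit) ** 2) / n                          # empirical squared norm
    comp_norms = np.array([np.linalg.norm(B @ b) / np.sqrt(n) for B, b in zip(blocks, betas)])
    return sse + float(np.sum(scad_penalty(comp_norms, lam, a)))
```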
2.3  Algorithm
Note that the Newton-Raphson algorithm cannot be used directly to minimize the penalized polynomial spline criterion in (2.7), since the SCAD penalty function does not have a continuous second-order derivative. Instead, we extend the local linear approximation (LLA) algorithm of Zou and Li (2008) to solve the penalized polynomial spline problem. In the LLA algorithm, we substitute the SCAD penalty in the criterion function $l(g)$ by its local linear approximation, and then compute the minimizer of the adjusted criterion function using a coordinate-wise descent algorithm (CWD, Yuan and Lin 2006).

Motivated by Zou and Li (2008), we first approximate the SCAD penalty function by a local linear function. To be more specific, for a given initial estimator $g_l^{(0)} = \hat\mu_l$, one can write $p_\lambda(\|g_l\|_n) \approx p_\lambda(\|g_l^{(0)}\|_n) + p'_\lambda(\|g_l^{(0)}\|_n)(\|g_l\|_n - \|g_l^{(0)}\|_n)$ for $l = 1, \ldots, d$. As a result, the objective function in (2.7) can be approximated, up to a constant, by
$$S(g \mid g^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d g_l \right\|_n^2 + \sum_{l=1}^d p'_\lambda(\|g_l^{(0)}\|_n)\, \|g_l\|_n. \qquad (2.9)$$
Therefore the minimizer of $S(g \mid g^{(0)})$ is an approximate solution to the original minimization problem (2.7). Equivalently, in terms of the spline basis representation, (2.9) can be written as
$$S(\beta \mid \beta^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d B_l \beta_l \right\|_n^2 + \sum_{l=1}^d q_{\lambda,l} \|\beta_l\|_{K_l}, \qquad (2.10)$$
where $\|\beta_l\|_{K_l} = \sqrt{\beta_l^T K_l \beta_l}$ with $K_l = B_l^T B_l / n$, and $q_{\lambda,l} = p'_\lambda(\|\beta_l^{(0)}\|_{K_l})$ is a constant depending only on the initial value $\beta_l^{(0)}$. This reduces to a group lasso problem with component-specific tuning parameters $q_{\lambda,l}$, which can be solved by applying the coordinate-wise descent (CWD) algorithm as in Yuan and Lin (2006). Notice that by letting $\beta_l^* = D_l \beta_l$ and $B_l^* = B_l D_l^{-1}$ with $D_l = (B_l^T B_l)^{1/2}$, the objective function in (2.10) becomes
$$n\, S(\beta \mid \beta^{(0)}) = \frac{1}{2}\left\| Y - \sum_{l=1}^d B_l^* \beta_l^* \right\|_2^2 + \sum_{l=1}^d \sqrt{n}\, q_{\lambda,l} \|\beta_l^*\|_2. \qquad (2.11)$$
Proposition 1 of Yuan and Lin (2006) shows that the solution to (2.11) can be found iteratively by
$$\beta_l^{*(k)} = \left( 1 - \frac{\sqrt{n}\, q_{\lambda,l}}{\|S_{l,k}^*\|_2} \right)_+ S_{l,k}^*, \qquad \text{where } S_{l,k}^* = B_l^{*T}\Big( Y - \sum_{j \neq l} B_j^* \beta_j^{*(k)} \Big).$$
Let $\hat\beta^* = (\hat\beta_1^{*T}, \ldots, \hat\beta_d^{*T})^T$ be the value of $\beta^*$ at convergence. Then the minimizer of (2.10), $\hat\beta = (\hat\beta_1^T, \ldots, \hat\beta_d^T)^T$, is recovered by $\hat\beta_l = D_l^{-1} \hat\beta_l^*$ for $l = 1, \ldots, d$. Our experience shows that the CWD algorithm converges in just a few iterations.
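A minimal sketch of one LLA step for (2.7) followed by the coordinate-wise descent updates for (2.11) is given below, assuming the centered design blocks and an initial unpenalized estimate (for example, from the earlier sketch). `scad_deriv` transcribes (2.8); the remaining names, the convergence tolerance and the Cholesky-based choice of $D_l$ are our own implementation details.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) from (2.8), for t >= 0."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + (np.clip(a * lam - t, 0.0, None) / (a - 1)) * (t > lam)

def lla_cwd(blocks, Y, lam, beta0, a=3.7, n_iter=100, tol=1e-8):
    """One LLA step for (2.7), solved by coordinate-wise descent on the
    orthonormalized representation (2.11) with group soft-thresholding."""
    n = len(Y)
    # D_l with D_l^T D_l = B_l^T B_l, so that B_l* = B_l D_l^{-1} has orthonormal columns.
    Ds = [np.linalg.cholesky(B.T @ B).T for B in blocks]
    Bstar = [B @ np.linalg.inv(D) for B, D in zip(blocks, Ds)]
    # Component-specific weights q_{lambda,l} = p'_lambda(||beta_l^(0)||_{K_l}).
    norms0 = np.array([np.linalg.norm(B @ b) / np.sqrt(n) for B, b in zip(blocks, beta0)])
    q = scad_deriv(norms0, lam, a)
    bstar = [D @ b for D, b in zip(Ds, beta0)]
    for _ in range(n_iter):
        max_change = 0.0
        for l, Bs in enumerate(Bstar):
            partial = Y - sum(Bstar[j] @ bstar[j] for j in range(len(Bstar)) if j != l)
            s = Bs.T @ partial
            s_norm = np.linalg.norm(s)
            shrink = max(0.0, 1.0 - np.sqrt(n) * q[l] / s_norm) if s_norm > 0 else 0.0
            new = shrink * s                     # group soft-thresholding update
            max_change = max(max_change, np.linalg.norm(new - bstar[l]))
            bstar[l] = new
        if max_change < tol:
            break
    return [np.linalg.solve(D, b) for D, b in zip(Ds, bstar)]   # beta_l = D_l^{-1} beta_l*
```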
2.3.1  Tuning Parameter and Knots Selection
In this section we discuss how to choose the smoothing and tuning parameters. For the polynomial spline estimation, the polynomial spline spaces $\{G_l\}_{l=1}^d$ depend on the knot sequence $\{x_{l,1}, \ldots, x_{l,N}\}$ with $N$ interior knots. We use an equally spaced knot sequence with the number of knots $N$ selected by the Bayesian information criterion (BIC), defined as $\mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k_n \ln(n)$, where RSS is the residual sum of squares, $k_n$ is the number of free parameters estimated in (2.6), and $n$ is the sample size. This strategy was also used in Xue (2009) and Huang et al. (2004). In general the number of interior knots can differ across the spline spaces $G_l$; here we use the same knot number $N$ for all $d$ spline spaces for computational simplicity.

For lag selection, we use the same optimal number of interior knots $N$ selected by the BIC in the polynomial spline approach. In addition, the definition of the SCAD penalty function in equation (2.8) involves two tuning parameters, $a$ and $\lambda_n$. Following Fan and Li (2001), we take $a = 3.7$. The parameter $\lambda_n$ plays an important role in the lag selection results: a larger value of $\lambda_n$ leads to a simpler model with fewer variables. We select the optimal $\lambda_n$ by the BIC, considering $\lambda_n$ on a grid from $\min_l \|\beta_l^{(0)}\|/a$ up to a value large enough to shrink all components of $\hat\mu^{PL}$ to zero.
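One possible implementation of the BIC-based choice of $\lambda_n$ described above is sketched here, reusing `lla_cwd` from the previous sketch. Counting the coefficients of the non-zero components as the degrees of freedom is our simple proxy for $k_n$, not necessarily the exact choice used in the thesis, and the grid construction is left to the caller.

```python
import numpy as np

def bic(Y, fitted, df):
    """BIC = n * ln(RSS / n) + k_n * ln(n), as defined in Subsection 2.3.1."""
    n = len(Y)
    rss = np.sum((Y - fitted) ** 2)
    return n * np.log(rss / n) + df * np.log(n)

def select_lambda(blocks, Y, beta0, lam_grid):
    """Refit by LLA/CWD over a grid of lambda values and keep the BIC minimizer."""
    Yc = Y - Y.mean()
    best_score, best_lam, best_beta = np.inf, None, None
    for lam in lam_grid:
        beta = lla_cwd(blocks, Yc, lam, beta0)                  # from the previous sketch
        fitted = sum(B @ b for B, b in zip(blocks, beta))
        df = sum(b.size for b in beta if np.linalg.norm(b) > 1e-10)
        score = bic(Yc, fitted, df)
        if score < best_score:
            best_score, best_lam, best_beta = score, lam, beta
    return best_lam, best_beta, best_score
```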
2.4  Simulation studies
In this section we use simulation to study the finite-sample performance of the proposed penalized polynomial spline method. Denote by $S_0$ the set of relevant variables in the true model (2.1). Following Huang and Yang (2004), we say that the variable set of the fitted model, denoted $S$, is an overfit of $S_0$ if $S$ is a proper superset of $S_0$, and a correct fit if $S$ is exactly $S_0$. In all other cases, we say $S$ is an underfit of $S_0$. Furthermore, we use the median integrated squared error (MISE) to evaluate the estimation accuracy of a fitted model, defined as
$$\mathrm{MISE} = \sum_{l=1}^d \mathop{\mathrm{Median}}_{1 \le r \le n_{rep}} \left\{ \frac{1}{n_{grid}} \sum_{i=1}^{n_{grid}} \left( \hat\mu_l^{(r)}(x_{i,l}^G) - \mu_l(x_{i,l}^G) \right)^2 \right\}.$$
Here $\hat\mu_l^{(r)}$ is the $l$-th component of the fitted function estimated from the $r$-th replication, and $x_{i,l}^G$, $i = 1, \ldots, n_{grid}$, are the grid points at which $\hat\mu_l^{(r)}$ is evaluated.
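The fit classification and the MISE above can be computed as in the sketch below; the array layout (replications by components by grid points) and the function names are our assumptions.

```python
import numpy as np

def classify_fit(selected, true_set):
    """Label a selected lag set as 'correct', 'overfit' (proper superset), or 'underfit'."""
    selected, true_set = set(selected), set(true_set)
    if selected == true_set:
        return "correct"
    return "overfit" if selected > true_set else "underfit"

def mise(estimates, truth):
    """MISE: for each component, the median over replications of the grid-averaged
    squared error, summed over components.
    estimates: shape (n_rep, d, n_grid); truth: shape (d, n_grid)."""
    ise = np.mean((estimates - truth[None, :, :]) ** 2, axis=2)   # (n_rep, d)
    return float(np.sum(np.median(ise, axis=0)))
```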
We simulate 500 random samples of size $n = 100$, 200 or 500 from each of the additive autoregressive models given in Table 2.1. The same models were also used in Huang and Yang (2004). The models in Table 2.1 contain only one or two relevant lags, with either linear or nonlinear functional forms. In addition, the error terms $\varepsilon_t$ in (2.1) are independent and identically distributed $N(0, 0.1)$ random variables. We consider the selection of relevant lags from a set of ten possible lags; that is, we take $X_t = (X_{t1}, \ldots, X_{td})^T$ in model (2.1) to be the lagged values of $Y_t$, with $X_{tl} = Y_{t-l}$ for $l = 1, \ldots, 10$. We also compare the estimation accuracy of our method with the polynomial spline estimates of the oracle and the full models. The full model is an additive autoregressive model (2.1) containing all 10 possible lags, while the oracle model contains only the relevant lags in model (2.1).

We applied the penalized spline regression with both linear ($p = 1$) and cubic ($p = 3$) splines. The BIC criterion was used to select the optimal tuning parameter $\hat\lambda_n$. For one run with sample size $n = 500$, Figure 2.1 plots the empirical norms of the estimated additive components $\hat\mu_l^{PL}(\lambda_n)$ from the penalized cubic spline ($p = 3$) against $\lambda_n$ for the linear AR3 model and the nonlinear NLAR1 model. One can see that most component norms shrink to zero once $\lambda_n$ is large enough; therefore, larger values of $\lambda_n$ lead to a simpler model with fewer selected variables. The dashed line in Figure 2.1 marks the location of the optimal $\hat\lambda_n$. It clearly indicates that the BIC works reasonably well for this run, since it chooses a $\hat\lambda_n$ that provides the correct fit. The lag selection results for each autoregressive model are presented in Table 2.2, in terms of under-fitting, correct-fitting and over-fitting percentages. One can see that, for all additive autoregressive models, the percentage of correct fitting increases quickly to 100%, or close to 100%, as the sample size increases to 500. This numerically verifies Theorem 2.2.3, namely that the penalized polynomial spline method is consistent for variable selection.
Besides the lag selection results, we also compared the estimation accuracy of the penalized polynomial spline estimator with the polynomial spline estimates of the oracle and full models. Table 2.3 summarizes the MISEs of the three estimators for all autoregressive models in Table 2.1. In almost all cases, the full estimator has the largest MISEs while the oracle estimator has the smallest. This is not surprising, since the oracle estimator uses information about the data generating process that is not available in real data analysis. The MISEs of the penalized spline estimator are much smaller than those of the full estimator and very close to those of the oracle estimator. Furthermore, for one run with sample size $n = 500$, Figure 2.2 plots the estimated component curves from the three approaches for model NLAR1 with $p = 1$ and model NLAR2 with $p = 3$. Figure 2.2 graphically confirms the results in Table 2.3 and clearly shows that the proposed method estimates the unknown functions reasonably well. Both Figure 2.2 and Table 2.3 numerically support Theorem 2.2.2: the penalized polynomial spline estimator can estimate the model as accurately as the oracle when the sample size is large enough. The performance of the cubic and linear spline fits is comparable, except that the cubic fit gives smoother estimated component curves, as can be seen in Figure 2.2.
2.5  Real data analysis
We applied the proposed method to the quarterly US unemployment rate data from the first quarter of 1948 to the last quarter of 1978. Denote this time series by $\{R_t\}_{t=1}^{120}$. The data cover unemployed people in the labor force who are at least 16 years old, of all ethnic origins, races and sexes, without distinction between industries and occupations. We deseasonalized the series by taking the fourth difference of the data, obtaining the new series $\{Y_t\}_{t=1}^{116}$ with $Y_t = R_{t+4} - R_t$ for $t = 1, \ldots, 116$. In our analysis, the last 16 observations of $Y_t$ were set aside for prediction, and the rest were used for model fitting.

As in the simulation study, we considered the last 10 lags of $Y_t$ as possible predictor variables. We then applied the penalized polynomial spline method for lag selection, as well as the unpenalized spline estimation of the full model consisting of all 10 lags. We used cubic spline functions ($p = 3$) for both approaches. For each model fit, the coefficient of determination $R^2$, the mean squared estimation error (MSEE) and the mean squared prediction error (MSPE) were calculated. Denoting by $\hat Y_t$ the estimated or predicted value of $Y_t$, and $\bar Y = \frac{1}{90}\sum_{t=11}^{100} Y_t$, we write
$$R^2 = 1 - \frac{\sum_{t=11}^{100} (Y_t - \hat Y_t)^2}{\sum_{t=11}^{100} (Y_t - \bar Y)^2}, \qquad \mathrm{MSEE} = \frac{1}{90}\sum_{t=11}^{100} (Y_t - \hat Y_t)^2, \qquad \mathrm{MSPE} = \frac{1}{16}\sum_{t=101}^{116} (Y_t - \hat Y_t)^2.$$
We also compute the mean absolute estimation error (MAEE) and the mean absolute prediction error (MAPE),
$$\mathrm{MAEE} = \frac{1}{90}\sum_{t=11}^{100} \big|Y_t - \hat Y_t\big|, \qquad \mathrm{MAPE} = \frac{1}{16}\sum_{t=101}^{116} \big|Y_t - \hat Y_t\big|.$$
The $R^2$, MSEE and MAEE measure how well the models fit the data, while MSPE and MAPE compare the prediction performance of the models. The results are reported in Table 2.4. The full model with all ten lags gives smaller estimation errors (MSEE and MAEE) than the penalized one, owing to its larger model size. However, the penalized method not only gave a parsimonious model that is much easier to interpret, but also had better prediction performance, with smaller prediction errors (MSPE and MAPE). Finally, we also considered a linear autoregressive model with the two lags $Y_{t-1}$ and $Y_{t-2}$ selected by the penalized polynomial spline method, that is, $Y_t = \beta_0 + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \epsilon_t$. It gives much larger estimation and prediction errors, suggesting that the nonlinear additive model better describes the underlying data generating structure of the quarterly US unemployment data.
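A minimal sketch of the preprocessing and the fit/prediction metrics used above follows; the split sizes (90 observations for fitting, 16 held out) follow the text, and the function names are ours.

```python
import numpy as np

def fourth_difference(R):
    """Deseasonalize the quarterly series: Y_t = R_{t+4} - R_t."""
    R = np.asarray(R, dtype=float)
    return R[4:] - R[:-4]

def fit_and_prediction_metrics(y_fit, yhat_fit, y_hold, yhat_hold):
    """R^2, MSEE, MAEE on the fitting sample; MSPE, MAPE on the hold-out sample."""
    resid = y_fit - yhat_fit
    return {
        "R2": 1.0 - np.sum(resid ** 2) / np.sum((y_fit - y_fit.mean()) ** 2),
        "MSEE": np.mean(resid ** 2),
        "MAEE": np.mean(np.abs(resid)),
        "MSPE": np.mean((y_hold - yhat_hold) ** 2),
        "MAPE": np.mean(np.abs(y_hold - yhat_hold)),
    }
```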
2.6  Assumptions and Proofs

2.6.1  Notation and Definitions
∞
First, suppose we have two sequences {an }∞
n=1 and {bn }n=1 .
Define an � bn if
lim an
n→∞
bn
= 0, an ≍ bn if an � bn and bn � an . For Gnl , the spline space of polynomial
functions on [0, 1] with order p (or less), we denote its dimension as Jn . Clearly Jn =
Nn + p + 1. On the approximation space Gn , we define two important constants, An =
{
}
1g1∞
1/2
sup
and ρn = infn
lg − µl∞ . Note that according to Huang(1998), An ≍ Jn ≍
1g1
g∈Gn
1/2
Nn ,
g∈G
1
≍ p+1
for polynomial spline space under Assumption (A3).
Nn
�
Recall that µ̂ = dl=1 µ
jl (Xl ) is the least squared estimator of µ in Gn based on the
�
sample. Furthermore, we denote µ̃ = dl=1 µ
Jl (Xl ) to be the best approximation of µ in
and ρn ≍
1
Jnp+1
Gn with respect to the empirical norm. Let Q be the projection operator onto Gn with
respect to the empirical inner product. We can write µ̃ = Qµ and µ̂ = QY . Consequently,
the error can be decomposed as µ̂ − µ = (µ̂ − µ̃) + (µ̃ − µ). By Triangular Inequality, one
has lµ̂ − µl � lµ̂ − µ̃l + lµ̃ − µl.
As mentioned in Section 2.2.2, when there are redundant variables, without loss
�
of generality, we can write the true regression function as µ(X) = rl=1 µl (Xl ) for some
r � d. Correspondingly denote G(0) = {g : g(X) = g1 (X1 ) + · · · + gr (Xr ),where gl ∈ Gnl
for l = 1, . . . , r}. Then G(0) is an approximation space of µ other than Gn . Similar to
Q, one can also define a projection operator Q(0) on to G(0) with respect to the empirical
inner product. Let µ̂(0) = Q(0) Y , which is the least square estimation of µ in the space
G(0) instead.
At last, we introduce the α-mixing coefficient α(s) of a stationary stochastic process
20
∞
(Yt , Xt )+
t=−∞ , which measures the strength of dependence for any two data points that
are at least s time units apart. To be more specific,
{
}
α(s) = sup |P (A)P (B) − P (A ∩ B)| :A∈σ({Xt′ , Yt′ , t′ � t}), B∈σ({Xt′ , Yt′ , t′ ; t + s}) .
(2.12)
Here for any index set Γ, σ({Xt , Yt , t ∈ Γ}) is the σ-field generated by {Xt , Yt , t ∈ Γ}.
Note that in stationary stochastic process, α(s) does not vary with t in Equation (2.12).
2.6.2  Assumptions
To establish the asymptotic theory, we need the following assumptions.

(A1) The stochastic process $(X_t, Y_t)$ is stationary and $\alpha$-mixing, with $\alpha$-mixing coefficient $\alpha(s) \le C_1 e^{-C_2 s}$ for some constants $C_1, C_2 > 0$.

(A2) The joint density of $X_t$, denoted $f_{X_t}$, is bounded away from $0$ and $\infty$ on the compact support $[0, 1]^d$.

(A3) For each spline space $G_l^n$, $l = 1, \ldots, d$, the number of interior knots $N_n$ satisfies $N_n \asymp n^\theta$ with $0 < \theta < \frac{1}{3}$, and the knots $t_{l,1}, \ldots, t_{l,N_n}$ satisfy
$$\frac{\max_{1 \le j \le N_n} |t_{l,j+1} - t_{l,j}|}{\min_{1 \le j \le N_n} |t_{l,j+1} - t_{l,j}|} < \eta$$
for a positive constant $\eta$.

(A4) The tuning parameter $\lambda_n$ in the SCAD penalty function satisfies $\lim_{n \to \infty} \lambda_n = 0$.

(A5) The tuning parameter $\lambda_n$ in the SCAD penalty function satisfies $\lim_{n \to \infty} \sqrt{N_n/n + \rho_n^2}\,\big/\,\lambda_n = 0$.
2.6.3  Proof of Preliminary Lemmas
Proof of Lemma 2.2.1. Let $G^n_{UB} = \{g \in G^n : \|g\| \le 1\} \subseteq G^n$ be the subset of functions in $G^n$ lying in the unit ball under the theoretical norm $\|\cdot\|$ defined before. Note that, to prove Lemma 2.2.1, it is sufficient to prove Equation (2.2) for all $g \in G^n_{UB}$.

For any $\varepsilon_1, \varepsilon_2 > 0$, let $f_1, f_2, g_1, g_2 \in G^n_{UB}$ with $\|f_1 - f_2\| \le \varepsilon_1$ and $\|g_1 - g_2\| \le \varepsilon_2$. One has
$$\|f_1 g_1 - f_2 g_2\|_\infty \le \|(f_1 - f_2) g_1\|_\infty + \|f_2 (g_1 - g_2)\|_\infty \le \|f_1 - f_2\|_\infty \|g_1\|_\infty + \|f_2\|_\infty \|g_1 - g_2\|_\infty \le A_n^2 \big[ \|f_1 - f_2\|\,\|g_1\| + \|f_2\|\,\|g_1 - g_2\| \big] \le A_n^2 (\varepsilon_1 + \varepsilon_2) \qquad (2.13)$$
and
$$\mathrm{Var}(f_1 g_1 - f_2 g_2) \le 2\,\mathrm{Var}[(f_1 - f_2) g_1] + 2\,\mathrm{Var}[f_2 (g_1 - g_2)] \le 2\|f_1 - f_2\|^2 \|g_1\|_\infty^2 + 2\|f_2\|_\infty^2 \|g_1 - g_2\|^2 \le 2 A_n^2 (\varepsilon_1^2 + \varepsilon_2^2).$$
Let $h = f_1 g_1 - f_2 g_2$. Then for any integer $r \ge 3$, one has
$$E|h - E(h)|^r = E\big\{ |h - E(h)|^2 |h - E(h)|^{r-2} \big\} \le E\big\{ |h - E(h)|^2 \big\}\, 2^{r-2} \|h\|_\infty^{r-2} \le 2^{r-2} \big[ A_n^2(\varepsilon_1 + \varepsilon_2) \big]^{r-2} E|h - E(h)|^2 \le r!\, \big[ A_n^2(\varepsilon_1 + \varepsilon_2) \big]^{r-2} E|h - E(h)|^2 = r!\, c^{r-2}\, E|h - E(h)|^2,$$
where $c = A_n^2(\varepsilon_1 + \varepsilon_2)$. Letting $m_2^2 = \max_{1 \le t \le n} E|h - E(h)|^2$ and $m_r^r = \max_{1 \le t \le n} E|h - E(h)|^r$, one observes that $m_2^2 \le 2 A_n^2(\varepsilon_1^2 + \varepsilon_2^2)$ and $m_r^r \le c^{r-2} r!\, m_2^2$. Consequently $25 m_2^2 + 5 c \varepsilon \asymp A_n^2$. Recall that $A_n \asymp N_n^{1/2}$, so that under Assumption (A3) one has $A_n^2/n = o(1)$. Therefore, for any integer $q$ between $1$ and $n/2$, Theorem 1.4 of Bosq (1998) gives
$$P\big((E_n - E)(h) > \varepsilon\big) \le 2\left( \frac{n}{q} + 1 + \frac{\varepsilon^2}{25 m_2^2 + 5c\varepsilon} \right) \exp\left( -\frac{q \varepsilon^2}{25 m_2^2 + 5c\varepsilon} \right) + 11 n \left( 1 + \frac{5\, m_r^{2r/(2r+1)}}{\varepsilon} \right) \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
$$\le 2\left( \frac{n}{q} + 2 \right) \exp\left( -\frac{q \varepsilon^2}{50 A_n^2(\varepsilon_1^2 + \varepsilon_2^2) + 5 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) + 11 n \left( 1 + \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \right) \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
$$\le \frac{4n}{q} \exp\left( -\frac{q \varepsilon^2}{50 A_n^2(\varepsilon_1^2 + \varepsilon_2^2) + 5 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) + 22 n \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}$$
for $n$ large enough. Furthermore, by the convexity of the function $e^{-1/x}$, one has
$$P\big((E_n - E)(h) > \varepsilon\big) \le \frac{2n}{q} \left[ \exp\left( -\frac{q \varepsilon^2}{100 A_n^2(\varepsilon_1^2 + \varepsilon_2^2)} \right) + \exp\left( -\frac{q \varepsilon^2}{10 A_n^2 \varepsilon (\varepsilon_1 + \varepsilon_2)} \right) \right] + 22 n \left(\frac{5}{\varepsilon}\right)^{\frac{r}{2r+1}} \Big[ r!\, A_n^{2(r-1)} (\varepsilon_1^2 + \varepsilon_2^2)(\varepsilon_1 + \varepsilon_2)^{r-2} \Big]^{\frac{1}{2r+1}} \Big[ C_1 e^{-C_2 \frac{n}{q}} \Big]^{\frac{2r}{2r+1}}. \qquad (2.14)$$
For any $g \in G^n_{UB}$, consider a sequence of subsets $\{g \equiv 0\} = \Im_0 \subset \Im_1 \subset \cdots \subset \Im_k \subset \Im_{k+1} \subset \cdots$ satisfying $\min_{g^* \in \Im_k} \|g - g^*\| \le \delta_k$, where $\delta_k = \frac{1}{3^k}$. Note that the cardinality of $\Im_k$ satisfies $\#(\Im_k) \le \left( \frac{1 + \delta_k/2}{\delta_k/2} \right)^{d J_n} \le 3^{(k+1) d J_n}$. Furthermore, for any given $t > 0$, choose $K$ to be the smallest nonnegative integer such that $\left(\frac{2}{3}\right)^K \le \frac{t}{4 A_n^2}$. Then for each $g \in G^n_{UB}$ one can find $g_K^* \in \Im_K$ such that $\|g - g_K^*\| \le \frac{1}{3^K}$. So for each fixed positive integer $k \le K$ and the corresponding $g_k^* \in \Im_k$, we can choose $g_{k-1}^* \in \Im_{k-1}$ such that $\|g_k^* - g_{k-1}^*\| \le \delta_{k-1} = \frac{1}{3^{k-1}}$. For any $f \in G^n_{UB}$, define $\{f_k^*\}_{k=0}^K$ in a similar way. From the definitions of $f_K^*$ and $g_K^*$, and (2.13), one has
$$\big|(E_n - E)(fg - f_K^* g_K^*)\big| < 2 \|fg - f_K^* g_K^*\|_\infty \le \frac{4 A_n^2}{3^K} \le \frac{t}{2^K}. \qquad (2.15)$$
Using (2.14) and (2.15), we now prove Lemma 2.2.1. First, the triangle inequality gives
$$\sup_{f, g \in G^n_{UB}} |(E_n - E)(fg)| \le \sup_{f, g \in G^n_{UB}} \big|(E_n - E)(fg - f_K^* g_K^*)\big| + \sum_{k=1}^K \sup_{f_k, g_k \in \Im_k} \big| (E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*) \big|.$$
Therefore, by (2.15), one has
$$P\left( \sup_{f,g \in G^n_{UB}} |(E_n - E)(fg)| > t \right) \le P\left( \sup_{f,g \in G^n_{UB}} \big|(E_n - E)(fg - f_K^* g_K^*)\big| > \frac{t}{2^K} \right) + \sum_{k=1}^K P\left( \sup_{f_k, g_k \in \Im_k} \big|(E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*)\big| > \frac{t}{2^k} \right)$$
$$\le \sum_{k=1}^{\infty} \#(\Im_k)^2 \sup_{f_k, g_k \in \Im_k} P\left( \big|(E_n - E)(f_k^* g_k^* - f_{k-1}^* g_{k-1}^*)\big| > \frac{t}{2^k} \right) \qquad (2.16)$$
for $n$ sufficiently large. Plugging (2.14) with $\varepsilon = t/2^k$ and $\varepsilon_1 = \varepsilon_2 = 1/3^{k-1}$ into the last term above, and letting $q = n^{2/3}$ (recall that $A_n \asymp N_n^{1/2}$ and $J_n \asymp N_n$, so Assumption (A3) gives $A_n^2 J_n/q \asymp n^{2(\theta - 1/3)} \to 0$ as $n \to \infty$), one has
$$P\left( \sup_{f,g \in G^n_{UB}} |(E_n - E)(fg)| > t \right) \le \sum_{k=1}^{\infty} \left\{ \frac{4n}{q} \exp\left( -\frac{q\, t/2^{k}}{20 A_n^2/3^{k-1}} \right) + \frac{4n}{q} \exp\left( -\frac{q\, t^2/2^{2k}}{200 A_n^2/3^{2(k-1)}} \right) + C_3\, n A_n^{\frac{2(r-1)}{2r+1}}\, e^{-C_2 \frac{2r}{2r+1}\frac{n}{q}} \left(\frac{2}{3}\right)^{\frac{r(k-1)}{2r+1}} \right\} 3^{(k+1) d J_n},$$
where $C_3$ is a constant depending only on $r$, $t$ and $C_1$. Then, by $e^{-x} \le x^{-1} e^{-1}$ and $\theta < 1/3$, the contribution of the first two terms is bounded, for $n$ large enough, by
$$\sum_{k=1}^{\infty} \left\{ 240\, \frac{n A_n^2}{q^2 t} \left(\frac{2}{3}\right)^{k} + 7200\, \frac{n A_n^2}{q^2 t^2} \left(\frac{2}{3}\right)^{2k} \right\},$$
which converges to $0$ as $n \to \infty$, since $n A_n^2 / q^2 \asymp n^{\theta - 1/3} \to 0$.

Let $r = 3$. Since $J_n q / n \asymp n^{\theta - 1/3} \to 0$ as $n \to \infty$, and only indices $k \le K = O(\log n)$ actually appear in (2.16), for $n$ large enough the remaining contribution satisfies
$$\sum_{k} C_3\, n A_n^{4/7}\, e^{-C_2 \frac{6n}{7q}} \left(\frac{2}{3}\right)^{\frac{3(k-1)}{7}} 3^{(k+1) d J_n} = \sum_{k} C_3\, n A_n^{4/7} \exp\left\{ -C_2 \frac{6n}{7q} + d(k+1)(\log 3) J_n + \frac{3(k-1)}{7}\log\frac{2}{3} \right\}$$
$$\le C_3\, n A_n^{4/7} \exp\left\{ -\frac{C_2}{2}\,\frac{6n}{7q} \right\} \sum_{k=1}^{\infty} \left(\frac{2}{3}\right)^{\frac{3(k-1)}{7}},$$
which also goes to $0$ as $n \to \infty$. Therefore
$$\limsup_{n\to\infty} P\left( \sup_{f,g \in G^n_{UB}} \big| (f, g)_n - (f, g) \big| > t \right) = 0,$$
and taking $f = g$ yields (2.2).
Lemma 2.6.1. Under Assumption (A2), for any function $h = \sum_{l=1}^d h_l \in H$, there exists a constant $0 < C \le 1$ such that $C\sum_{l=1}^d \|h_l\| \le \|h\| \le \sum_{l=1}^d \|h_l\|$.
Proof. Under Assumption (A2), one can assume that there exist constants $0 < b \le B$ such that the density satisfies $b \le f_{X_t} \le B$ on $[0, 1]^d$. Let $W_l = (X_1, \ldots, X_l)$ and $S_l = h_1 + \cdots + h_l$ for $l = 1, \ldots, d$. We show by induction that, for each $l = 1, \ldots, d$,
$$\left( \frac{1 - \delta}{2} \right)^{\frac{l-1}{2}} \big( \|h_1\| + \cdots + \|h_l\| \big) \le \|S_l\|, \qquad (2.17)$$
where $\delta = \sqrt{1 - \frac{b}{B}}$. For $l = 1$, (2.17) is trivial. Suppose (2.17) is true for some $l < d$; we show that (2.17) holds for $l + 1$. When $\|h_{l+1}\| = 0$ or $\|S_l\| = 0$, since $H$ is theoretically identifiable, the claim is trivial. Therefore one may assume that $\|h_{l+1}\| > 0$ and $\|S_l\| > 0$. Denote $\rho = \mathrm{corr}\big(S_l, h_{l+1}(X_{l+1})\big)$. Then the part of the variance of $h_{l+1}(X_{l+1})$ that cannot be explained linearly by $S_l(W_l)$ can be written as
$$(1 - \rho^2)\|h_{l+1}\|^2 = \min_{\gamma} E\big[h_{l+1}(X_{l+1}) - \gamma S_l(W_l)\big]^2 = \min_{\gamma} \int_{[0,1]^l} \int_0^1 \big[h_{l+1}(x_{l+1}) - \gamma S_l(w_l)\big]^2 f_{W_l, X_{l+1}}(w_l, x_{l+1})\, dx_{l+1}\, dw_l$$
$$\ge \frac{b}{B} \min_{\gamma} \int_{[0,1]^l} \int_0^1 \big(h_{l+1}(x_{l+1}) - \gamma S_l(w_l)\big)^2 f_{X_{l+1}}(x_{l+1})\, dx_{l+1}\, dw_l = \frac{b}{B} \min_{\gamma} \int_{[0,1]^l} E\big[h_{l+1}^2(X_{l+1}) + \gamma^2 S_l^2(w_l)\big]\, dw_l \ge \frac{b}{B}\, \|h_{l+1}\|^2.$$
Therefore $1 - \rho^2 \ge \frac{b}{B}$, and hence $-\delta \le \rho \le \delta$. Consequently,
$$\|S_{l+1}\|^2 = \|S_l\|^2 + 2\rho\|S_l\|\,\|h_{l+1}\| + \|h_{l+1}\|^2 \ge \frac{1 + \rho}{2}\big( \|S_l\| + \|h_{l+1}\| \big)^2 \ge \frac{1 - \delta}{2}\left[ \left(\frac{1-\delta}{2}\right)^{\frac{l-1}{2}}\big(\|h_1\| + \cdots + \|h_l\|\big) + \|h_{l+1}\| \right]^2 \ge \left( \frac{1 - \delta}{2} \right)^{l} \big( \|h_1\| + \cdots + \|h_{l+1}\| \big)^2.$$
Proof of Lemma 2.2.2. Given $h = \sum_{l=1}^d h_l \in H$ with $\|h\| = 0$, the definition of the theoretical inner product implies that $h = 0$ a.s. Moreover, Lemma 2.6.1 gives $0 \le C\sum_{l=1}^d \|h_l\| \le \|h\| = 0$, so $\|h_l\| = 0$ for $l = 1, \ldots, d$.

Given $g \in G^n$ with $\|g\|_n = 0$, Lemma 2.2.1 entails that, except on a set whose probability tends to zero as $n \to \infty$,
$$\frac{1}{2}\|g\|^2 \le \|g\|_n^2 \le 2\|g\|^2, \qquad g \in G^n. \qquad (2.18)$$
Thus $\|g\| = 0$, and since $g \in G^n \subseteq H$, one has $g = 0$ a.s.

For the last identifiability claim, the properties of centered spline functions ensure that there exist points $x_l^*$ with $g_l(x_l^*) = 0$, $l \in \{1, \ldots, d\}$. Given $l_0$, since the joint density of $X$ is bounded away from $0$, one has $P(X_l = x_l^*,\ l \neq l_0) > 0$. Therefore,
$$P\big(g_{l_0}(X_{l_0}) = 0\big) = P\Big( \sum_{l \neq l_0} g_l(x_l^*) + g_{l_0}(X_{l_0}) = 0 \Big) \ge P\Big( \sum_{l=1}^d g_l(X_l) = 0 \Big) = 1.$$

2.6.4  Proof of Theorems
Proof of Theorem 2.2.1. We divide the norms of estimation error into two parts
by Triangular Inequality. To be specific, lµ̂ − µl � lµ̂ − µ̃l + lµ̃ − µl and lµ̂ − µln �
−(p+1)
lµ̂ − µ̃ln + lµ̃ − µln . Recall that ρn ≍ Nn
. Theorem 2.2.1 can be proved by showing
(i) lµ̂ − µ̃l2n = OP ( Nnn ), lµ̂ − µ̃l2 = OP ( Nnn );
(ii) lµ̃ − µl2n = OP (ρn ), lµ̃ − µl2 = OP (ρn ).
n
n
To prove (i), denote {φj }dJ
j=1 as a set of orthonormal basis of the additive space G
�dJn
with respect to the empirical inner product. Note that µ̂ − µ̃ = QY − Qµ = j=1
(Q(Y −
�dJn
µ), φj )n φj = j=1 (Y − µ, φj )n φj . Therefore, with εt = Yt − µ(Xt ), one has
(
E lµ̂ −
µ̃l2n
)
=
dJn
L
j=1
=
E(Y −
dJn
L
E
j=1
�
dJn
L
µ, φj )n2
1
n2
n
L
=
dJn
L
j=1
2
n
1L
E
εt φj (Xt )
n t=1
ε2t φ2j (Xt ) + E
t=1
1
n2
L
εs εt φj (Xs )φj (Xt )
1�s<t�n
(Ij1 + Ij2 ) .
j=1
For the first part,
Ij1 =
n
n
r}
r
1 L [ 2 2
1 L { [ 2
φ
(X
)
=
E E φj (Xt )εt2 |Xt
E
ε
t
t j
2
2
n
n
t=1
=
1
E
n2
t=1
n
L
φ2j (Xt )σ 2 =
t=1
σ2
n
.
As in Ij2 , given each pair of (s, t), one has
E [εs εt φj (Xs )φj (Xt )]
= E {E [εs εt φj (Xs )φj (Xt )|σ {X1 , . . . , Xt }]}
= E {εs φj (Xs )φj (Xt )E [εt |σ {X1 , . . . , Xt }]} = 0
27
since E [εt |σ {X1 , . . . , Xt }] = 0. Therefore lµ̂ − µ̃l2n = OP ( Jnn )
=
OP ( Nnn ). The inequality
(2.18) further gives that lµ̂ − µ̃l2 = OP ( Nnn ).
To prove (ii), Theorem 6 on Page 149 of de Boor (2001) entails that, for ev­
ery l = 1, . . . , d, there exists a constant C ∗ and a spline function gl∗ ∈ Gnl
satisfying
�d
∗
lgl∗ − µl l � C ∗ ρn . Denote g∗ =
l=1 gl . Then through Triangular Inequality, one
gets lg∗ − µl = OP (ρn ). Inequality (2.18) entails that lg∗ − µln = OP (ρn ). As we
mentioned before, µ̃ is the best approximation of µ in Gn with respect to the empiri­
cal norm. Thus one has lµ̃ − µln � lg∗ − µln = OP (ρn ). Furthermore, lµ̃ − g∗ ln �
lµ̃ − µln +lg∗ − µln = OP (ρn ). Again inequality (2.18) gives lµ̃ − g∗ l = OP (ρn ). Therefore lµ̃ − µl � lµ̃ − g ∗ l + lg∗ − µl = OP (ρn ).
Corollary 2.6.4.1. Denote µ̂ =
Then
�d
jl ,
l=1 µ
µ̃ =
�d
Jl ,
l=1 µ
where µ
jl , µ
Jl ∈ Gnl for l = 1, . . . , d,
J
J
(i) lµ
jl − µ
Jl l = OP ( Nn /n) and lµ
jl − µ
Jl ln = OP ( Nn /n);
Jl − µl ln = OP (ρn );
(ii) lµ
Jl − µl l = OP (ρn ) and lµ
J
J
jl − µl ln = OP ( Nn /n + ρn ).
Consequently, lµ
jl − µl l = OP ( Nn /n + ρn ) and lµ
�
Proof. Lemma 2.6.1 and conclusion (i) in Theorem 2.2.1 gives dl=1 lµ
jl − µ
Jl l =
J
�d
�d
OP ( Nn /n) and l=1 lJ
µl − µl l = OP (ρn ). Inequality (2.18) further proves l=1 lµ
jl − µ
Jl ln =
J
OP ( Nn /n). Since µ̃ is the best approximation of µ in G with respect to the empirical
norm, Lemma 2.2.2 entails that µ
Jl is the best approximation of µl in G with respect to
the empirical norm for l = 1, · · · , d as well. Recall gl∗ in proof of Theorem 2.2.1 (ii) such
�
�d
that lgl∗ − µl l = OP (ρn ), then by (2.18) one has dl=1 lµ
Jl − µl ln � l=1
lgl∗ − µl ln �
�
2 dl=1 lgl∗ − µl l = OP (ρn ).
28
Corollary 2.6.4.2. For µ = µ1 + · · · + µr , replace the approximation space Gn with G(0) ,
(0)
(0)
one has that the least squared estimator µ
j(0) = Q(0) Y = g1 + · · · + gr
µ̂(0) − µ
2
= OP (Nn /n + ρ2n ) and µ̂(0) − µ
2
n
= OP (Nn /n + ρ2n ).
Proof of Theorem 2.2.2. It is sufficient to prove µ
jP L − µ
µ
jPl L − µl
also satisfies
2
= OP (Nn /n + ρ2n ) and
2
= OP (Nn /n + ρn2 ) for l = 1, · · · , d. For µ = µ1 + · · · + µr , we have two
�
(0)
least-square estimators µ
j ∈ Gn and µ
j(0) = rl=1 µ
jl ∈ G(0) . Again from Theorem 6 on
�
Page 149 of de Boor (2001), for arbitrary positive C4 and any g ∗ = dl=1 gl∗ ∈ Gn such
J
that lg ∗ − µl = C4 Nn /n + ρ2n . Since pλn (·) ; 0, pλn (0) = 0, one has
∗
(0)
j
l(g ) − l(µ
) ;
=
1
2
1
2
(
(
lY −
g ∗ l2n
− Y −µ
j
j−µ
j(0)
lµ
j − g ∗ l2n − µ
= I + II
Denote ǫn = sup
g∈G
{
n sufficiently large,
2I
1g12n
1g12
2
(0)
n
2
n
)
)
+
+
r [
L
l=1
r [
L
l=1
(0)
pλn (lgl∗ ln ) − pλn ( µ
jl
n
(0)
jl
pλn (lgl∗ ln ) − pλn ( µ
n
]
)
]
)
}
− 1 . Lemma 2.2.1 tells that ǫn → 0 as n → ∞. Therefore, for
2
; lj
µ − g ∗ l2 (1 − ǫn ) − µ
j−µ
j
(0) (1 + ǫn )
(
)
2
2
(0)
∗ 2
(0)
∗ 2
= lj
µ−g l − µ
j−µ
j
− ǫn lj
j−µ
j
µ−g l + µ
(
)
2
1
(0)
∗ 2
;
lj
µ−g l − µ
j−µ
j
2
(
)
2
1
2
∗
(0)
∗
(0)
;
lg − µl − 2 lµ
j − µl lg − µl −
µ
j − µ + 2 lµ
j − µl µ
j −µ
2
J
2
(
)
1
=
C42 Nn /n + ρ2n − 2C4 Nn /n + ρ2n lµ
j − µl − µ
j(0) − µ
2
]
(0)
−2 lµ
j − µl µ
j −µ .
The last three terms are all of OP (Nn /n + ρ2n ) by Theorem 2.2.1 and Corollary 2.6.4.2.
Therefore by choosing C4 large enough, one can assure that the first positive term dom­
inates the rest three, indicating that I ; 0 as for n large enough. For II, consider
29
any l ∈ {1, · · · , r}.
One has that
(0)
(0)
µ
jl
(0)
; lµl l − µ
jl
− µl .
Lemma 2.6.1 im­
plies that µ
jl − µl � C −1 µ
j(0) − µ , and Corollary 2.6.4.2 gives that µ
j(0) − µ =
J
J
(0)
−(p+1)
jl
; lµl l − OP ( Nn /n + ρn2 ). Again with ρn ≍ Nn
OP ( Nn /n + ρ2n ). Thus µ
,
J
θ−1
(0)
Nn
2
2 ) = OP (1). So
µ
jl
; 21 lµl l ; aλn
Assumption (A3) indicates
n + ρn = oP (n
(0)
since λn → 0. Consequently, pλn ( µ
jl
n
)=
a+1 2
2 λn
for n large enough. Similarly, since
lgl∗ l ; lµl l − lgl∗ − µl l ; lµl l − C ∗ ρn , one gets pλn (lgl∗ ln ) =
a+1
2
2 λn
for n large enough.
Therefore, II → 0 as n → ∞. In all, l(g∗ ) − l(µ
j(0) ) ; I + II > 0. Since for any g∗
J
J
that lg∗ − µl = C4 Nn /n + ρ2n , one has l(g ∗ ) > l(µ
j(0) ) with µ
j(0) < C4 Nn /n + ρn2 for
some sufficiently large C4 , one can conclude that there exists a local minimizer µ
jP L of the
{
}
J
criterion function l(g) in the subset of G, g : lg − µl � C4 Nn /n + ρ2n . This further
assures that µ
jP L − µ
2
= OP (Nn /n + ρ2n ).
Proof of Theorem 2.2.3. As in the proof of Theorem 2.2.2, given l ∈ {r + 1, · · · , d},
J
for arbitrary gl ∈ Gnl such that lgl l = OP ( Nn /n + ρ2n ) and arbitrary g(0) ∈ G(0) such
J
that g(0) − µ = OP ( Nn /n + ρ2n ), one can see that,
l(g
(0)
) − l(g
(0)
+ gl ) =
=
(
1
Y − g (0)
2
(
1
µ
j − g (0)
2
2
n
2
n
− Y −g
(0)
2
− gl
− µ
j − g (0) − gl
n
2
n
)
)
− pλn (lgl ln )
− pλn (lgl ln ).
30
By Lemma 2.2.1, the empirical norm can be switched to the theoretical norm, which is
l(g
(0)
) − l(g
(0)
+ gl ) �
(
(0)
2
2
(0)
)
µ
j−g
− µ
j − g − gl
− pλn (lgl l)
)
(
� lgl l µ
j − g(0) + µ
j − g (0) − gl − pλn (lgl l)
(
) p (lg l)
1
λ
l
lgl l µ
j − g (0) + µ
j− g(0) − gl − n
= λn
λn
λn
= λn lgl l
p′ (ω)
j − g(0) + µ
j − g(0) − gl
µ
− λn
λn
λn
2 µ
j− g (0) + lgl l p′λn (ω)
� λn lgl l
−
λn
λn
(
)
2 lµ
j − µl + µ − g (0) + lgl l pλ′ n (ω)
� λn lgl l
−
λn
λn
� λn lgl l
Rn p′λn (ω)
−
,
λn
λn
where 0 � ω � lgl l. The last term is derived from Taylor Expansion.
(√
)
Nn /n+ρ2n
n
=
O
= OP (1).
Theorem 2.2.1 and restrictions on g (0) and gl give R
P
λn
λn
′ (ω)
pλ
n
λn , since
p′ (ω)
Therefore λλnn
For
any g(0) ∈ G(0)
ω � lgl l → 0 as n → ∞, one has that
Rn
λn ,
l(g (0) ) � l(g(0) + gl ) −
J
with g (0) − µ = OP ( Nn /n + ρ2n ),
dominates
′ (ω)
pλ
n
λn
λn 1gl 1
2
= 1 for n large enough.
< l(g(0) + gl ). That is, for
min√
gl ∈Gl ,1gl 1=OP (
�d
jlP L , whose
l=1 µ
l(g (0) + gl ) =
Nn /n+ρ2n )
l(g(0) ). Therefore, for the local minimizer µ
jP L =
existence is assured by
[ PL
r
Theorem 2.2.2, one has that limn→∞ P µ
jl = 0 = 1 for l = r + 1, · · · , d.
31
TABLE 2.1: Autoregressive models in the simulation study
Model
Function
AR1
Yt =0.5Yt−1 + 0.4Yt−2 + 0.1εt
AR2
Yt =−0.5Yt−1 + 0.4Yt−2 + 0.1εt
AR3
Yt =−0.5Yt−6 + 0.5Yt−10 + 0.1εt
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
} {
}
{
3 / 1 + (Y
4
2 )/(1 + Y 2 ) + 0.6 3 − (Y
Yt =−0.4(3 − Yt−1
t−2 − 0.5)
t−2 − 0.5)
t−1
+0.1εt
{
}
{
}
2 ) Y
2
Yt = 0.4 − 2 exp(−50Yt−6
t−6 + 0.5 − 0.5 exp(−50Yt−10 ) Yt−10 + 0.1εt
{
}
2 ) Y
Yt = 0.4 − 2 cos (40Yt−6 ) exp(−30Yt−6
t−6
{
}
2
+ 0.55 − 0.55 sin (40Yt−10 ) exp(−10Yt−10
) × Yt−10 + 0.1εt
2 )/(1 + Y 2 ) + 0.1ε
Yt =−0.4(3 − Yt−1
t
t−1
{
}
{
}
Yt =0.6 3 − (Yt−2 − 0.5)3 / 1 + (Yt−2 − 0.5)4 + 0.1εt
32
TABLE 2.2: Lag selection results for the simulation study. The columns of U, C, O
give respectively the percentages of under-fitting, correct-fitting and over-fitting over 500
replications.
Model
AR1
AR2
AR3
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
n
p=1
p=3
U
C
O
U
C
O
100
0.212
0.628
0.116
0.356
0.544
0.100
200
0.030
0.898
0.072
0.146
0.834
0.020
500
0
0.992
0.008
0
1
0
100
0.272
0.570
0.158
0.424
0.468
0.108
200
0.044
0.892
0.064
0.198
0.770
0.032
500
0
0.994
0.006
0
0.998
0.002
100
0.008
0.836
0.156
0.018
0.856
0.126
200
0
0.972
0.028
0
0.982
0.018
500
0
0.994
0.006
0
1
0
100
0.002
0.940
0.058
0
0.966
0.034
200
0
1
0
0
0.994
0.006
500
0
1
0
0
0.998
0.002
100
0.766
0.172
0.062
0.554
0.402
0.044
200
0.622
0.336
0.042
0.060
0.888
0.052
500
0.004
0.974
0.022
0
0.990
0.010
100
0.048
0.628
0.324
0.204
0.572
0.224
200
0
0.906
0.094
0.010
0.948
0.042
500
0
0.988
0.012
0
1
0
100
0
0.984
0.016
0.002
0.996
0.002
200
0
0.994
0.006
0
0.998
0.002
500
0
1
0
0
1
0
100
0.070
0.552
0.378
0.064
0.66
0.276
200
0.024
0.680
0.296
0.066
0.784
0.0150
500
0
0.920
0.080
0
0.962
0.038
33
TABLE 2.3: The Penalty columns give MISEs from the penalized polynomial spline
method. The Oracle and Full columns give MISEs of the polynomial spline estimation of
the oracle and full models respectively.
Model
AR1
AR2
AR3
NLAR1
NLAR2
NLAR3
NLAR1U1
NLAR1U2
n
p=1
p = 3
Penalty
Oracle
Full
Penalty
Oracle
Full
100
0.0138
0.0124
0.0168
0.0504
0.0016
1.1498
200
0.0024
0.0020
0.0047
0.0223
0.0053
0.1530
500
0.0015
0.0014
0.0028
0.0245
0.0021
0.1433
100
0.0029
0.0005
0.0085
0.0872
0.0004
2.7515
200
0.0015
0.0005
0.0073
0.0747
0.0007
0.9650
500
0.0046
0.0046
0.0061
0.0534
0.0042
0.3213
100
0.0031
0.0025
0.0086
0.2572
0.0007
2.6233
200
0.0007
0.0006
0.0024
0.1117
0.0009
0.6983
500
0.0010
0.0010
0.0019
0.0367
0.0004
0.2134
100
3.5657
3.3146
3.6709
3.7874
3.2259
7.7016
200
3.4851
3.3389
3.5019
3.5770
3.4044
5.3536
500
3.7110
3.6349
3.7129
3.8708
3.2838
4.8348
100
0.0044
0.0034
0.0075
0.0180
0.0031
1.5706
200
0.0051
0.0051
0.0077
0.2566
0.0044
0.8909
500
0.0333
0.0056
0.0349
0.2926
0.0030
0.5308
100
0.0119
0.0109
0.0161
0.2935
0.0150
6.8577
200
0.0112
0.0108
0.0144
0.5081
0.0265
3.8070
500
0.0086
0.0085
0.0096
0.0842
0.0064
0.4677
100
0.4995
0.4993
0.5103
0.7834
0.5043
6.7995
200
0.4690
0.4687
0.4779
0.5454
0.4929
1.2789
500
0.4643
0.4639
0.4685
0.5400
0.4991
0.8170
100
2.0327
2.0317
2.1543
4.9415
2.0724
18.9815
200
2.0348
2.0329
2.0869
5.5130
2.3109
9.9782
500
2.0424
2.0239
2.0626
5.9310
2.7549
7.1970
34
TABLE 2.4: Analysis result of the US unemployment data. The Lags column gives the
selected significant lags of the quarterly US unemployment data.
R2
MSEE
MAEE
MSPE
MAPE
{1, 2}
0.9544
0.0382
0.1478
0.0531
0.2022
Full
{1, · · · , 10}
0.9724
0.0231
0.1220
0.1759
0.3417
AR(2)
{1, 2}
0.8466
0.1314
0.3232
0.2246
0.4322
Model
Lags
Penalty
35
NLAR1
0.05
0.04
0.03
0.02
0.00
0.01
The empirical norm of the components
0.006
0.004
0.002
0.000
The empirical norm of the components
0.008
AR3
0.01
0.02
0.03
lambda
0.04
0.05
0.02
0.06
0.10
0.14
lambda
FIGURE 2.1: The empirical norms of all estimated additive components are plotted
against the tuning parameter λn . We simulated data from both linear AR3 model (left)
and a nonlinear NLAR1 model (right side) in Table 2.1 for one run with n = 500. The
location of the optimal parameter λ̂n selected by BIC are marked by the dashed line.
0.0
−0.5
0.0
−1.5
−1.0
−1.0
−0.5
Yt
0.5
0.5
1.0
36
0.0
0.5
1.0
1.5
0.0
0.5
1.5
Yt−2
(b)
−0.6
−0.6
−0.2
−0.2
Yt
0.2
0.2
0.6
0.6
Yt−1
(a)
1.0
−0.3 −0.2 −0.1
0.0
Yt−6
(c)
0.1
0.2
−0.3 −0.2 −0.1
0.0
0.1
0.2
Yt−10
(d)
FIGURE 2.2: The estimated relevant component functions for model NLAR1 ((a) and
(b)) and NLAR2 ((c) and (d)) using three approaches. Model NLAR1 is fitted in linear
spline space while model NLAR2 is fitted in cubic spline space. The dash-dotted lines
and the dotted lines represent polynomial spline estimation of the oracle and full models
respectively. The dashed lines represent the penalized polynomial spline estimators. The
true component functions are also plotted in solid lines.
37
3 CONSISTENT MODEL SELECTION IN ADDITIVE
COEFFICIENT MODELS WITH GLOBAL OPTIMALITY
3.1
The Model
Let {(Yi , Xi , Ti )}ni=1 be a sequence of i.i.d. random vectors, where Yi is a response
variable and Ti = (Ti1 , . . . , Tid1 )T and Xi = (Xi1 , . . . , Xid2 )T are explanatory variables.
The additive coefficient model in Xue and Yang (2006 a & b) assumes that
Yi = m(Xi , Ti ) + εi , i = 1, · · · , n.
where
m(Xi , Ti ) =
d1
L
l=1
α(Xi )Til , with α(Xi ) = αl0 +
(3.1)
d2
L
αls (Xis ).
s=1
Similar to the additive model, the coefficients functions αls are not uniquely identified
up to a constant. Therefore, for model identification, we assume E (αls (Xs )) = 0 for
l = 1, . . . , d1 , s = 1, . . . , d2 .
Estimation of model (3.1) was studied in Xue and Yang (2006a), Xue and Yang
(2006b), and Liu and Yang (2010). In this paper, we are particularly interested in model
selection in the additive coefficient model. With the advance of technology, one is able to
collect massive data with high dimensions. However, in many applications, only a small
set of those variables are relevant. In our additive coefficient model, it is possible that
only some of the additive coefficients components αls (·) are relevant. We say, αls (·) is
irrelevant if αls (xs ) = 0 with probability one for s = 0, or αls = 0 for s = 0. Otherwise,
we say αls (·) is relevant. To distinguish these two kinds of terms, we define a subset S (0)
of the full index set S = {(l, s) : l = 1, · · · , d1 , s = 1, · · · , d2 } to identify all nonzero terms.
That is, for any index pair (l, s) ∈
/ S (0) , we have P (αls (Xs ) = 0) = 1 for s = 1 and αls = 0
for s = 0. Here Xs is a random variable having the same distribution as Xis , i = 1, · · · , n.
38
3.2
Penalized Polynomial Spline Estimation
Our method utilizes the polynomial spline functions to approximate unknown coeffi­
cient functions. Similarly with Section 2.2, or each s = 1, . . . , d2 , let ks,n = {0 = νs,0 < νs,1
}
< · · · < νs,Nn < νs,Nn+1 = 1 be a knot sequence on [0, 1]. For some integer p ≥ 0, a poly­
nomial spline with degree p on knot sequences ks,n are functions that are polynomials of
degree p or less on each of the intervals [νs,i , νs,i+1 ), i = 0, . . . , Nn − 1 and [νs,Nn , νs,Nn +1 ],
and are p − 1-times continuously differentiable on [0, 1]. Let ϕs = ϕ (ks,n , p) the space of
such polynomial spline functions. The success of the polynomial spline functions is that
it often provides good approximations to smooth functions with a small number of knots.
To consistently estimate the centered coefficient functions αl,s , we define a subspace of ϕs which consists of empirically centered polynomial spline functions. Let ϕ0s =
�
{g ∈ ϕs , En (g) = 0}, where En (g) = ni=1 g (Xis ). Let Jn = N +p and Bs = (Bs,1 , · · · , Bs,Jn )
be a set of basis of ϕ0s . For example, one can set Bs,j = bs,j − En (bs,j ) for j = 1, · · · , Jn ,
n
and {bs,j }Jj=1
be the truncated power basis
{
}
p
p
, . . . , (xs − νs,Nn )+
,
xs , . . . , xps , (xs − νs,1 )+
with (x)p+ = (x+ )p . Let B = {1, BT1 , · · · , BTd2 }T be the basis of the additive spline function.
One can approximate the regression function in (3.1) by
m (x, t) ≈
d1
L
γl0 +
d2
L
gls (xs ) tl =
s=1
l=1
T
d1
L
γl0 +
d2
L
T
γls
Bs (xs ) tl ,
s=1
l=1
where γls = (γls,1 , . . . , γls,Jn ) , and Bs (xs ) = (Bs1 (xs ) , . . . , BsJn (xs ))T . The standard
polynomial spline method in Xue and Yang (2006b) minimizes the sum of squares to
estimate the unknown coefficients γ = {γl0 , γls , 1 ≤ l ≤ d1 , 1 ≤ s ≤ d2 },
γ = argmin
J
γ
= argmin
γ
n
L
i=1
n
L
i=1
Yi −
Yi −
d1
L
l=1
d1
L
l=1
γl0 +
d2
L
T
γls
Bs (Xis )
2
Til
s=1
γl0 Zi,l0 +
d2
L
s=1
2
T
Zi,ls
γls
,
(3.2)
39
where Zi,l0 = Til and for s > 0, Zi,ls = Bs (Xis ) Til = (Bs1 (Xis )Til , · · · , BsJn (Xis )Til )T .
Then the resulting estimator of the coefficient functions are given by
T
Jls
α
Bs (xs ) =
J
ls (xs )
=
γ
Jn
L
j=1
Jls,j Bsj (xs )
.
γ
Xue and Yang (2006b) established that the standard polynomial spline estimator is con­
sistent and converges to the true function in the optimal L2 rate of convergence. However,
when there exist redundant terms in (3.1), the standard polynomial spline method is unable to produce a parsimonious estimate and deteriorates the estimation accuracies of the
nonzero coefficient functions. Therefore, in this paper, we propose a penalized polynomial
method for model selection in the additive coefficient model. We consider


2
d2
d1 L
d2
d1
n
 1
L
)
(
L
L
L
T
γ̂ = argmin
Yi −
γl0 Zi,l0 +
γls
Zi,ls
+
pλn lγls lWls
,


2n
γ
s=1
s=0
i=1
l=1
l=1
(3.3)
J
�n
T
T W γ , with W
=
γls
ls ls
ls =
i=1 Zi,ls Zi,ls /n. Therefore, for s > 0,
where lγls lWls
�
n
�
lγls lWls =
(gs (Xis ) Til )2 /n is the empirical norm of gs (xs ) tl . In (3.3), pλn (·) is
i=1
a penalty function depending on the tunning parameter λn . Although different penalty
functions can potentially be used, we focus on the smoothly clipped absolute deviance
(SCAD) penalty in Fan and Li (2001), which has the form



0 � |β| < λn ;
λn |β|



2
2
pλn (|β|)
=
aλn (|β|−λn )−(|β| −λn )/2 λn � |β| < aλn ;
a−1




(a−1)λ2n

+ λ2n
|β| ; aλn .
2
As pointed in Fan and Li (2001), the non-convex penalty function SCAD penalty achieves
three desirable properties of model selection: unbiasedness, sparsity and continuity. Model
selection result crucially depends on the choice of λn . Generally a larger λn shrinks more
additive coefficient components to zero, and results in a more parsimonious model. The
selection of λn will be discussed in Section 3.4 in details. Following Fan and Li (2001), we
set a = 3.7.
40
3.3
Optimal Properties
In this section we discuss the asymptotic properties of the global solution of the
penalized polynomial estimator with a non-convex SCAD penalty. When d1 = 1 and
T = 1, Model (3.1) reduces to an additive model. For additive models, Xue (2009)
established that there exist a sequence of local minimizer that correctly shrink the zero
function components to zero with probability approaching to one and estimate the nonzero
components at the same L2 rate as the standard polynomial spline. Xue, Qu and Zhou
(2010) and Jiang and Xue (2013) extended this local asymptotic result to (generalized)
additive models for longitudinal data and weekly dependent time series data respectively.
However, the penalized polynomial spline (PPS) estimator is the global optimal solution
in (3.3) by the definition. Therefore, there is still a gap between the theory of the local
optimality and the definition of PPS estimator.
In Xue and Qu (2012), the global optimal property of a penalized estimator was
established for the varying coefficient models, which also is a special case of the additive
coefficient model with d2 = 1. However the truncated L1 -penalty (TLP) was used instead
of the SCAD penalty. Different from SCAD, the TLP is a piecewise linear non-convex
penalty. The required techniques to establish the global optimal of SCAD is very different
from those for TLP in Xue and Qu (2012). For SCAD, the asymptotic property of the
global optimality was proved in the linear regression in Kim, Choi, Oh (2008). In this
paper, we extend the global optimality of SCAD estimator for a semi-parametric additive
coefficient model.
To prove our theoretical results, we need the following assumptions.
(C.1) There exists a positive constant c such that minl,s∈S (0) lαls l2 ; c.
(C.2) The tuning variables X = (X1 , · · · , Xd2 )T are compactly supported. Without the
loss of generality, we assume that its support is [0, 1]d2 . Furthermore, we assume the
41
density function of X is bounded away from 0 and infinity on its support.
r
[
(C.3) The eigenvalues of E TTT |X = x are bounded away from 0 and infinity uniformly
for all x ∈ [0, 1]d2 .
(C.4) There exists a constant c > 0 such that |Tl | < c with probability 1 for l = 1, · · · , d1 .
−(p+1)
(C.5) The tuning parameter λn satisfies (i) limn→∞
Nn
λn
= 0, (ii) limn→∞
log(Nn )
nλ2n
= 0,
(iii) limn→∞ λn = 0.
(C.6) For each s = 1, · · · , d2 , the set of interior knots ks,n = {0 = xs,0 < xs,1 < · · · < xs,Nn
< xs,Nn +1 = 1} satisfies, for a constant c,
max(xs,j+1 − xs,j , j = 0, · · · , Nn )
� c.
min(xs,j+1 − xs,j , j = 0, · · · , Nn )
Assumptions (C.1)-(C.6) are commonly used assumptions in literatures on polyno­
mial splines and variable selection. Similar conditions as (C.1)-(C.6) can found in Huang,
Wu and Zhou (2002), Xue and Yang (2006b), Xue, Qu and Zhou (2010), Xue and Qu
(2012). Assumption (C.4) can be relaxed for Tl to have a support on the entire real line.
The proofs of Theorems 1 and 2 can go through if the tail probability of each Tl vanishes
sufficiently fast. We impose assumption (C.4) only for simplicity of proof.
We define an oracle model as a sub-model of (3.1) which contains exactly those
nonzero additive coefficient components. Let S (0) be the index set of the oracle model. Let
the oracle estimator γ
j(0) be the standard polynomial spline estimation of spline coefficients
{
}
(0)
(0)
of the oracle model. That is, γ
jls = 0 for (l, s) ∈S (0) , and γ
jls , (l, s) ∈ S (0) minimizes
(3.2) with an oracle model.
Theorem 3.3.1. (Local Optimality) Under Assumptions (C.1) to (C.6), the oracle esti­
( (0)
)
mator γ
j(0) is a local minimizer with probability tending to 1. That is, P γ
j ∈ An (λn ) →
1 as n → ∞, where An (λn ) is the set of local minima of (3.3).
42
Theorem 3.3.2. (Global Optimality) Let j
γ = (γ
jls , l = 1, · · · , d1 , s = 1, · · · , d2 ) be the
global minima of (3.3). Under Assumptions (C.1) to (C.6), the estimator by minimizing
)
(
(3.3) enjoys the oracle property, that is P γ
j=γ
j(0) → 1 as n → ∞.
Remark 1. Rather than stating only the existence of a sequence of consistent local
minima of (3.3), Theorem 3.3.1 points out that with probability converging to one, the
oracle estimator is one of such sequence of local minima that is consistent. It is a stronger
conclusion than Theorem 2 in Xue (2009).
Remark 2. Theorem 3.3.2 further concludes the oracle estimator is not only a local
minima, but also the global minima of (3.3) with probability converging to one. Therefore,
the global solution of (3.3) enjoys the orale properties. That is, it can correctly identify
the non-zero terms of the true model and estimate the non-zero coefficients as well as if
the true model was known for a large sample size.
3.4
Implementation
In this section we extend the local linear approximation (LLA) algorithm proposed
by Zou and Li (2008) for linear regression models to our semi-parametric additive coef­
ficient models. In the LLA algorithm, a non-convex penalty function is approximated
locally by a linear function. Then one iteratively solves a sequence of upper convex ap­
proximation of the non-convex objective function for the final solution. Zou and Li (2008)
pointed that LLA not only provides better approximation of the original non-convex ob­
jective function, it is also more numerically stable and computationally efficient, compared
with other algorithms such as those based on the local quadratic approximation. In this
section, we also discussed the choices for tuning parameters involved in the proposed
method.
43
3.4.1
The local linear approximation algorithm
Solving the optimization problem in (3.3) with a given tuning parameter λn is
challenging. The SCAD penalty function is non-convex and is singular at the origin.
The traditional Newton-Raphson algorithm can not be applied directly to solve (3.3).
For linear models, Zou and Li (2008) developed a unified algorithm through local linear
approximation(LLA) to solve non-convex penalized problems. Inspired by Zou and Li
(2008), we extend the LLA algorithm to our semi-parametric additive coefficient models
as follows.
We first approximate the non-convex SCAD penalty locally by a linear function.
Specifically, given λn and an initial point γ 0 , the SCAD penalty function in (3.3) can be
approximated as
)
(
(
0
pλn lγls lWls ≈ pλn γls
)
+ p′λn
(
0
γls
)(
0
lγls lWls − γls
)
.
(3.4)
By removing all the constants irrelevant to γ, we have a one-step approximation


2
d1 L
d2
d1 L
d2
 1

L
L
Y−
Zls (xs , tl )γ ls +
λn,l lγls lWls .
γ (1) = argmin

γ∈Rd1 d2 Jn  2n
(3.5)
Wls
l=1 s=1
Wls
2
Wls
l=1 s=0
Therefore, the original optimization in (3.3) reduces to a group LASSO (Yuan and Lin,
(
)
0
2006) with component specific tuning λn,l = p′λn γls
. We can then use the
W
ls
coordinate-wise descent(CWD) algorithm (Yuan and Lin 2006) to get γ (1) by iteratively
applying

γls = 1 −
Here Sls = ZTls Y −
√ ′ ( 0
npλn γls
lSls l2
�
Wls
)
 Sls for l = 1, · · · , d1 , s = 0, · · · , d2 .
+
Zl′ s′ γl′ s′ .
(l′ ,s′ )i=(l,s)
The LLA is an efficient optimization algorithm. One does not need to apply the
local linear approximation (3.4) iteratively to get a closer and closer approximation, as
required by local quadratic approximation (LQA) algorithm (Fan and Li 2001). It was
44
shown in Zou and Li (2008) that, with a good initial point γ 0 , the one step estimator γ (1)
enjoys the same asymptotic property as the fully iterated solution. Therefore as in Zou
and Li (2008), we use the one-step LLA algorithm to reduce the computational complexity.
For any give λn , we only calculate the one-step approximation in (3.5) and use γ (1) as the
final solution. Our simulation study indicates that the one-step LLA works reasonably
well.
3.4.2
Selection of tuning parameters
In this subsection we discuss the selection of tuning parameters involved in our
estimation. There are two kinds of tuning parameters, those determine the polynomial
spline space, and those involved in the SCAD penalty function.
According to the theory of polynomial spline approximation, a polynomial spline
space is determined by (i) the degree p restricting the maximal degree of polynomials
within the space, and (ii) the set of interior nodes ks,n = {0 = νs,0 < νs,1 < · · · < νs,Nn
}
< νs,Nn+1 = 1 . The degree p determines the smoothness of the estimated curves. In
most applications, cubic polynomial splines with p = 3 often provide sufficiently smooth
fit. Therefore, for simplicity, we approximate the true model using cubic polynomial spline
space with p = 3 in the examples given in sections 3.5 and 3.6. The selection of interior
nodes is critical to the quality of the polynomial spline approximation. Following Huang,
Wu and Zhou (2002), Xue, Qu and Zhou (2010) and Jiang and Xue (2013), we use the
interior nodes that are equally spaced within the support of each Xs and select only the
number of interior nodes Nn using the Bayesian Information Criterion (BIC). That is, for
each given Nn , the unpenalized estimator from (3.2) can be calculated. Denote the RSS
as the corresponding residual sum of squares, and kn = (Nn + p)d1 d2 + d1 as the total
number of parameters in (3.2). Then the BIC is defined as
BIC (Nn ) = n ln(RSS) + kn ln(n).
45
We select the optimal N̂n,opt which has the smallest BIC value.
The SCAD penalty function involves two tuning parameters a and λn . For simplicity,
we set a = 3.7 as in Fan and Li (2001). However, the tuning parameter λn crucially
determines our model selection results. One can observed that a larger value of λn results
in a simpler model with more zero additive components in the estimated coefficients. In
particular, with λn = 0, the penalized estimator in (3.3) reduces to the unpenalized one
in (3.2). On the other hand, an improper large λn can mistakenly shrink all coefficients
to zero. Following Xue (2009) and Xue, Qu and Zhou (2010), we use the BIC to select
the optimal λn from the interval [0, λn,max ], where λn,max shrinks all components to zero.
That is, let j
γλn be the solution of the penalized polynomial spline (3.3) for a given λn .
jn,opt is defined by
Then the optimal λ
jn,opt =
λ
argmin
λn ∈[0,λn,max ]
{BIC(γ
jλn )} =
argmin
λn ∈[0,λn,max ]
where kn,;γλn is the number of nonzero terms in γ
jλn .
3.5
{
}
n ln(RSS;γλn ) + kn,;γλn ln(n) ,
Simulation studies
We studied the numerical performance of our proposed method through simulations
in both low and high dimensional cases. We are mostly interested in validating the model
selection consistency and the estimation accuracy through simulations.
We evaluated the model selection results by comparing additive terms selected by our
penalized method with those contained in the true model. Recall that S is the index pair
jls Bs l
set of all additive terms in the full model. We further define Sj = {(l, s)| lα
jls l = lγ
= 0, (l, s) ∈ S} as the index pair set of nonzero additive terms in our penalized estimator.
We say that it is an overfitting if S0 ⊆ Ŝ, a correct fitting if S0 = Ŝ, and an underfitting
otherwise.
We introduce the averaged integrated squared error (AISE) to evaluate the esti­
46
mation accuracy of the coefficient functions. Suppose for each αls (·), the estimator from
the k-th generated sample is α̂k,ls (·) for k = 1, . . . , nrep. Here nrep is total number of
n
grid
replications. Then given a set of grid points {xm }m=1
, the integrated squared error (ISE)
is given by
and AISE(α
jls ) =
ISE(α
jk,ls ) =
1
nrep
�nrep
k=1
1
ngrid
ngrid
L
m=1
ISE(α
jk,ls ).
{α
jk,ls (xm ) − αls (xm )}2 ,
In both simulation studies, three estimators are considered: the proposed estimator
using SCAD penalty (SCAD), the least squared estimators of the oracle model (ORACLE)
and the full model (FULL) respectively. The oracle model contains only nonzero additive
terms in the true model. The oracle estimator is not available in real data analysis where
the true model is unavailable. In our simulation study, we use the oracle estimator as a
benchmark to evaluate the estimation accuracies of other estimators.
3.5.1
Example 1: Low dimensional case
We generated 100 samples of size n = 100, 250, 500 respectively from the model
Y =
d1
L
l=1
αl0 +
d2
L
αls (Xs ) Tl + ε,
(3.6)
s=1
where d1 = 8, d2 = 2, α10 = 2, α20 = 1. In the model, there are only three rele­
vant coefficient functions α11 (x) = α21 (x) = sin(x), α12 (x) = x, and the rest coeffi­
cient functions are all zero. The true model index S0 = {(1, 0), (1, 1), (1, 2), (2, 0), (2, 1)},
and the full model index S = {(l, s), l = 1, . . . , 8, s = 0, 1, 2}. The explanatory variables
{Xi = (Xi1 , . . . , Xid2 )}ni=1 are uniformly distributed on [−π, π]d2 . The linear covariates
{
}n
Ti = (Ti1 , . . . , Tid1 )T i=1 have an i.i.d standard d1 dimensional multivariate normal dis­
tribution and the errors {εi }ni=1 follow i.i.d N (0, 1). Here {Xi }ni=1 , {Ti }ni=1 , and {εi }ni=1
are mutually independent.
Tables 3.1 and 3.2 report the performance of our penalized method on model selec­
tion and model estimation respectively. Table 3.1 clearly shows that as the sample size
47
increases, the rate of correct fitting increases. It reaches 100% when the sample size in­
creases to 250, showing that our method is consistent in model selection. In Table 3.2, we
reported the AISEs to compare the accuracy of SCAD, ORACLE and FULL in estimating
coefficient functions. To assess the estimation accuracy for the two interceptors α10 = 2
and α20 = 1, we report their empirical means and empirical standard errors from 100
replications. One can see that, as the sample size gets larger, AISEs for each estimator
get smaller. Furthermore, Table 3.2 also shows that the estimation accuracy of SCAD
estimator is almost as good as ORACLE, and performs better than FULL. Therefore, the
proposed SCAD method not only provides a parsimonious model which often is easier to
interpret, but also gives more accurate estimate of the model than the FULL. Table 3.1 and
3.2 support the asymptotic results given in in Section 3.3. Figure 3.1 plots the estimated
coefficient functions by SCAD from all 100 replications for sample size n = 100, 250 and
500 respectively. It also plots the typically estimated coefficient functions (whose ISE is
the median of the 100 ISEs from the replications) for different sample sizes. It graphically
verifies the results in Section 3.3.
3.5.2
Example 2: High dimensional case with intercept.
In this example we consider a similar model as (3.6) in Example 1, but with d1 = 50
and d2 = 2. It is a model with much higher dimension of linear covariates. But as
in Example 1, we only three nonzero coefficient functions α11 (x) = α21 (x) = sin(x),
α12 (x) = x. That is, only the first two linear covariates are relevant and the rest linear
covarites are redundant. For simplification, we consider αl0 = 0 for l = 1, . . . , 50. We
generated 100 samples of size 250 from the model.
Note that in such a high dimensional case, the least square estimator for the full
model is not feasible. Instead, we used zero as the initial value when applying the LLA
algorithm in Section 3.4 when find SCAD estimator. The SCAD method selects the
correct model 77 times out of the 100 replications, selects an overfitted model 23 times
48
and no underfit. Therefore, no important covariate is missed by the SCAD procedure. It
indicates that our method performances reasonably well even in high dimensional case.
Table 3.3 reports AISEs of SCAD estimators and the Oracle estimators of the three
nonzero coefficient functions. Figure 3.2 plots the typical fitted curves. It shows that the
SCAD estimates the nonzero coefficient functions reasonably well. But the SCAD shrinks
the nonzero function a little bit towards zero due to larger tuning parameter needed in
the high dimensional case and the fact that we started from a much worse starting point
for LLA algorithm when the FULL estimator is not available.
3.6
Real data analysis
In this section we analyze the Tucson housing price data in (Fik et al. 2003) using
the proposed penalized estimation method. Tucson housing data contains information of
the 2971 geocoded housing units sold during year 1998 in selected districts of Tucson,
Arizona. Fik et al. (2003) analyzed this data by performing their interactive variable
approach to explain the variation of housing price of an urban residential housing market.
Six variables describing the houses are considered,
• x: the latitude coordinate of the house referenced to the southern most house in
record.
• y: the longitude coordinate of the house referenced to the western most house in
record.
• AGE: the age of dwelling in years.
• LOT: the lot size of the house.
• SQFT: the square footage of the house.
• PRICE: the price at which the house was sold.
49
Fik et al. (2003) considered a linear regression model with the logarithm of PRICE
as the response and the rest variables with possible polynomial interactions as explana­
tory variables. In Fik et al. (2003), eleven polynomial interactive effects between ex­
planatory variables were tested to be significant. They are: AGE, AGE2 , LOT2 , SQFT2 ,
AGE∗SQFT, LOT∗x, LOT∗y 2 , LOT∗y 3 , SQFT∗x2 , SQFT∗y 3 , SQFT∗x2 ∗ y. Except for
the three-way interaction SQFT∗x2 ∗ y, all other interactions can be absorbed by the
following additive coefficient model:
log(PRICE) = [α10 + α11 (x) + α12 (y) + α13 (AGE)]
+ [α20 + α21 (x) + α22 (y) + α23 (AGE)] SQFT
+ [α30 + α31 (x) + α32 (y) + α33 (AGE)] SQFT2
+ [α40 + α41 (x) + α42 (y) + α43 (AGE)] LOT
+ [α50 + α51 (x) + α52 (y) + α53 (AGE)] LOT2 .
(3.7)
We consider (3.7) as the full model to start variable selection in the penalized estimation.
Therefore, in the notation of model (3.1), the linear covariates T consists of five compo­
nents: the constant 1, SQFT, SQFT2 , LOT, and LOT2 , and the covariates X consists of
x, y, and AGE. The model (3.7) effectively describes the spatial and temporal patterns of
the linear regression coefficients.
The proposed penalized polynomial spline with SCAD penalty (SCAD) is applied
and the tunning parameters are selected by the BIC as described in Section 3.4. For
comparison, we also consider a standard polynomial estimation of the full model (3.7) and
the parametric model in Fik et al (2003) of form,
log(PRICE) = β1 AGE + β2 AGE2 + β3 LOT2 + β4 SQFT2 + β5 AGE ∗ SQFT
+ β6 x2 ∗ SQFT + β7 x3 ∗ SQFT + β8 y 3 ∗ SQFT + β9 x ∗ LOT
+ β10 y 2 ∗ LOT + β11 y 3 ∗ LOT + β12 x2 ∗ y ∗ SQFT.
50
In all of our analyses, the variables are centered by the sample means and scaled
by the sample standard deviations. We exclude 29 data points with extreme values of
housing prices that had standard deviations four times larger than the mean response.
We only consider the remaining 2942 data points. We then randomly select 2059 points
for model estimation, and use the rest 883 data points for prediction. For each method,
we report the R2 statistics and the mean absolute estimation error (MAEE) to quantify
its estimation accuracy. For prediction, in addition to the mean absolute prediction error
(MAPE), we also calculate the percentage of “good” predicted prices. Here we define a
“good” predicted price as the predicted price within 10% of the actual price. For each
data point i, denote the actual price as Pi , the estimated or predicted value as Pji , and
the mean of the first 2059 Pi as P . Then R2 , MAEE and MAPE are calculated by
2
R =1−
2059
�
(Pi − Pji )2
i=1
2059
�
i=1
(Pi − P )2
, MAEE =
2059
2942
1 L
1 L
Pi − Pji .
Pi − Pji , and MAPE =
2059
883
i=1
i=2060
We carried out the whole process for 10 times with a random permutation on all
2942 data points at each time. The averaged model size selected by SCAD is 7.8 for the
10 replicates, which is much smaller than the size of 20 for the full model. The additive
terms α10 , α11 (x), α12 (y), α13 (AGE), and α21 (x)∗SQFT in (3.7) were always selected; and
α41 (x) ∗ LOT was selected 8 out of 10 times and α31 (x) ∗ SQFT2 was selected 5 times. The
fitted curves of the SCAD estimator for all additive coefficient components from one run
are plotted in Figure 3.3. Corresponding fitted curves of the unpenalized estimator under
the full model are also plotted for comparison. This model selection result is consistent
with the findings in Fik et al. (2003). It indicates that the absolute location has an
unique effect on the price of a housing unit. In particular, the plots of α11 (x), α12 (y),
and α21 (x) indicate that the housing price generally decreases from south to north, and
increases from west to east. Furthermore, houses from south are more sensitive to effect
of the square footage. For each unit of decreasing in SQFT, the housing price is likely to
51
decrease more in south town. And the plot α13 (AGE) also indicates that a newer house
is more expensive than a older one with similar condition.
Table 3.4 reports the averaged R2 , MAEE, MAPE and the percentage of ‘good”
predicted price over 10 replicates for SCAD, FULL and parametric model respectively.
From Table 3.4, FULL gives a slightly better estimation than SCAD at the expense
of a more complex model structure. However for prediction, SCAD outperforms FULL.
Therefore, SCAD not only gives a simpler and more interpretable model, but also improves
prediction accuracy. Furthermore, the parametric model gives the worst results in both
estimation and prediction, indicating that the data contains a nonlinear structure that can
not be fully explained by the parametric model. Figure 3.4 plots the randomly selected
actual prices against corresponding predicted prices under three models separately, with
criteria band enclosing “good” estimates. Again, it suggests both penalize model and full
model do much better than the parametric model.
3.7
Proof of Lemmas and Theorems
(0)
Based on the oracle estimator γ
j(0) = (γ
jls , l = 1, · · · , d1 , s = 0, · · · , d2 ) in Section
(0)
(0)
(0)
3.3, we further define α
jls for notation convenience. For (l, 0) ∈ S (0) , α
jl0 = γ
jl0 . For
(0)
(0)
s > 0 and (l, s) ∈ S (0) , α
jls = γ
jls Bs . Here Bs is the vector of the empirically centered
(0)
spline basis on xs defined in Section 3.2. For (l, s) ∈
/ S (0) , α
jls = 0.
For any square matrix U, we denote ρmin (U) and ρmax (U) as the minimal and
maximal eigenvalues of U respectively. For notation simplicity, we use the same c, c1 , c2
as general notations for positive constants with not necessarily the same value.
3.7.1
Preliminary Lemmas
Lemma 3.7.1. Under Assumptions (C.3) - (C.4), for each pair of (l, s) ∈ S, the eigen­
values of Wls are bounded by two positive constants with probability approaching to 1.
52
That is, let ρmin (Wls ) and ρmax (Wls ) be the minimal and maximal eigenvalues of Wls
respectively. Then there exist 0 < c1 < c2 , such that
P (c1 � ρmin (Wls ) � ρmax (Wls ) � c2 ) → 1 as n → ∞.
(3.8)
Proof. From Lemma 5 of Xue and Qu (2012), we know that (3.8) is true for
( T )
t t
(l, s) ∈ S and s = 0. Now, for s = 0, ρ (Wl0 ) = ρ ln l =
lTl l2n .
Therefore, according to
Assumption (C.4), one has that lTl l2n � c2 .
Furthermore, the Central Limit Theorem gives
( )
that, lTl l2n ; E Tl2 − oP (1) ; Var(Tl ). Therefore, by letting c1 = minl=1,··· ,d1 Var(Tl )
and c2 = c2 , Lemma 3.7.1 is proved.
Lemma 3.7.2. Let Z = (Zls , (l, s) ∈ S), W =
ZT Z
n ,
then under Assumptions (C.3) ­
(C.4), the eigenvalues of W are bounded within two positive constants. That is, there
exist c2 > c1 > 0 such that, except on an event whose probability tends to zero, as n → ∞,
c1 � ρmin (W) � ρmax (W) � c2 .
(3.9)
T , (l, s) ∈ S)T ∈ Rd1 (1+d2 Jn ) , one has
Proof. For any coefficients γ = (γls
 2

d1
d2 L
Jn
L
L
γl0 +
γ T Wγ =
γls,j Bs,j Tl 
.
l=1
s=1 j=1
n
Lemma A.5 in Xue and Yang (2006b) ensures that there exists a positive constants c1 such
d1
d2 �
Jn
�
�
2 +
2
γl0
γls,j
= c1 lγl2 . On the other hand, Cauchy-Schwarz
that γ T Wγ ; c1
s=1 j=1
l=1
inequality gives that there exists a constant c > 0 such that
 2

d1
d2 L
d1
d2
Jn
T
T
L
L
L
L
T Zl0 Zl0
T Zls Zls
γl0 Tl +
γ T Wγ
=
γls,j Bs,j Tl 
� c
γl0
γls
γl0 +
γls
n
n
l=1
s=1 j=1
n
l=1
.
s=1
Therefore, from Lemma 3.7.1, there exists a positive c2 such that γ T Wγ � c2 lγl2 .
As a consequence of Lemmas 3.7.1 and 3.7.2, one has the following corollary.
Corollary 3.7.1.1. Let A be a subset of the full index pair set S. Denote ZA = (Zls , (l, s)
∈ A) and WA =
ZT
A ZA
n ,
then under Assumption (C.3) - (C.4), there exist two positive
53
constants c1 , c2 , such that
P (c1 � ρmin (WA ) � ρmax (WA ) � c2 ) → 1 as n → +∞.
Lemma 3.7.3. Under Assumptions (C.3) - (C.4) and (C.6), each additive term of oracle
(0)
estimators, α
jls , converges to the corresponding true function αls in probability. Specifi­
cally, we have
d1
L
l=1
(0)
jl0 Tl
α
− αl0 Tl +
d1 L
d2
L
l=1 s=1
(0)
α
jls (Xs )Tl
− αls (Xs )Tl
2
= OP
+
�
Nn
n
.
+
�
Nn
n
.
Nn−(p+1)
Proof. Theorem 1 in Xue and Yang (2006) entails that
max
1�l�d1
(0)
α
jl0
− αl0 +
(0)
α
jls (Xs ) −
max
1�l�d1 ,1�s�d2
αls (Xs )
2
= OP
Nn−(p+1)
Therefore, Lemma follows from Assumption (C.4).
3.7.2
Proof of Theorem 3.3.1
For notation simplicity, denote
d
d
1 L
2
L
1
Y−
Zls (xs , tl )γls
Ln (γ) =
2n
l=1 s=0
2
+
2
d1 L
d2
L
l=1 s=0
)
(
pλn lγls lWls .
− 1
∗ = W γ . Consequently one has W∗ =
Now, let Z∗ls = Zls Wls 2 and γls
ls ls
ls
∗
Z∗T
ls Zls
n
= IJn for
∗ = 1. Therefore, (3.3) can be rewritten as
s = 0, and Wl0
γ
j∗ = argminLn (γ ∗ )
γ∗

d1 L
d2
 1
L
= argmin
Y−
Z∗ls (xs , tl )γ ∗ls
2n
γ∗
l=1 s=0

d1 L
d2
 1
L
∗
= argmin
Y−
Z∗ls (xs , tl )γls
 2n
γ∗
l=1 s=0
2
+
l=1 s=0
2
2
+
2
d1 L
d2
L
d1 L
d2
L
l=1 s=1

(
)
∗
pλn lγls
lW∗
ls 
pλn (lγls l2 ) +
d1
L
l=1
pλn (|γl∗0 |)
Therefore, for any given index pair (l, s) ∈ S, the partial derivative
d
d
1 L
2
L
∂Ln (γ ∗ )
1 ∗T
∗
=
−
Y
−
l2 ) ,
Z
Z∗l′ s′ (xs′ , tl′ )γl∗′ s′ + ∂pλn (lγls
ls
∗
∂γls
n
′
′
l =1 s =0



.
54
( ∗ )
( ∗ )
where ∂pλn lγls l2 is the subgradient of pλn lγls l2 .
[
]
�d1 �d2
∗ (x ′ , t ′ )γ ∗
Z
Y
−
We denote C∗ls (γ ∗ ) = − n1 Z∗T
l
ls
l′ =1
s′ =0 l′ s′ s
l′ s′ . Then the KKT
local optimality condition suggests that, any γ ∗ satisfying the following two conditions
must be a local minimum of our penalized objective function,
(i)
∗
l2 > aλn , for (l, s) ∈ S (0) ;
C∗ls (γ ∗ ) = 0, lγls
(ii)
∗
lC∗ls (γ ∗ )l2 � λn , lγls
l2 < λn , for (l, s) ∈
/ S (0) .
Equivalently in terms of γ, the vector of untransformed coefficients, the sufficient condi­
tions for a solution to be a local minimum are
(i)
Cls (γ) = 0, lγls lWls > aλn , for (l, s) ∈ S (0) ;
(3.10)
(ii)
lCls (γ)lW−1 � λn , lγls lWls < λn , for (l, s) ∈
/ S (0) .
(3.11)
ls
Therefore, to prove γ
j(0) ∈ An (λn ), one only needs to prove (3.10) and (3.11) for γ = γ
j(0) .
j(0) , the first equation in (3.10) and the second inequality in (3.11)
When γ = γ
naturally hold by the definition of γ
j(0) . So one only needs to prove that, except on an
event whose probability tends to 0, as n → ∞,
(i)
(ii)
(0)
j
γls
Wls
> aλn , for (l, s) ∈ S (0) ;
(0)
Cls (γ
jls )
−1
Wls
� λn , for (l, s) ∈
/ S (0) .
(0) 2
We first prove (3.12). Note that γ
jls
(3.12)
(0)
Wls
= α
jls (Xs )Tl
2
n
(3.13)
is the empirical norm
of an non-zero additive term of the oracle estimator. Lemma 3.7.3 implies that,
(0)
(0)
γls
j
Wls
=
(0)
jls (Xs )Tl
α
α
jls (Xs )Tl
n
(0)
α
jls (Xs )Tl
; [lαls (Xs )Tl l2 − oP (1)]
(0)
α
jls (Xs )Tl
(0)
α
jls (Xs )Tl
n
.
55
Furthermore, Assumptions (C.2), (C.6) and Lemma A.4 in Xue and Yang (2006b) imply
that the second multiplicative term converges to 1 in probability. Together with Assump­
tions (C.1), (C.3), and (iii) of Assumption (C.5), one has that, with probability goes to
1,
(0)
γls
j
; lαls (Xs )Tl l2 − oP (1)
J [
r
=
E α2ls (Xs )Tl2 − oP (1)
J [
}r
{
=
E α2ls (Xs )E Tl2 |Xs − oP (1)
J [
r
; c E α2ls (Xs ) − oP (1)
Wls
= c − oP (1) ; aλn .
(3.14)
Therefore (3.12) is proved. Now for (3.13), we define Z(1) = (Zls , (l, s) ∈ S (0) ) as the
column-wise combination of all Zls matrices corresponding to all “nonzero” components,
and Z(2) = (Zls , (l, s) ∈
/ S (0) ) as the column-wise combination of all Zls matrices corre­
�1 �
sponding to all “redundant” components. Recall that Y = dl=1
s∈Sl αls (Xs )Tl +ε. One
has
j(0) ) =
Cls (γ
=
)−1
(
1 T
Zls In − Z(1) ZT(1) Z(1)
ZT(1) Y
n
1 T
Z Hn [δ + ε] ,
n ls
(
)−1
�
T and δ =
T
where Hn = In − Z(1) ZT(1) Z(1)
Z(1)
(l,s)∈S (0) δls with δls = (δ1,ls , · · · , δn,ls )
(0)
and δi,ls = αls (xi,s )ti,l − α
jls (xi,s )ti,l . Therefore,
P
(
Cls (γ
j
(0)
)
−1
Wls
> λn , ∃(l, s) ∈
/S
(0)
)
� P
+ P
(
(
max
1
ZTls Hn δ
n
max
1
ZTls Hn ε
n
(l,s)∈S
/ (0)
(l,s)∈S
/ (0)
λn
−1 >
Wls
2
)
λn
(3.15)
−1 >
Wls
2 .
According to Lemma 3.7.1, one has
max
(l,s)∈
/S (0)
1
ZTls Hn δ
n
−1
Wls
�
max
(l,s)∈
/S (0)
c T
Z Hn δ
n ls
2
)
c
� √ lHn δl2 .
n
56
Note that Hn � In . That is, In − Hn is semi-positive definite. The approximation theory
√
−(p+1)
in de Boor (2001) (p.149) gives that lδl2 � c nNn
. Therefore, one has
max
(l,s)∈S
/ (0)
1
ZTls Hn δ
n
−1
Wls
�√
1
lδl2 � cNn−(p+1) .
nc1
Consequently, (i) of Assumption (C.5) entails that
(
)
1
λn
T
= 0.
lim P
max
Zls Hn δ W−1 >
n→∞
ls
2
(l,s)∈
/S (0) n
Similarly, one has max(l,s)∈/S (0)
1
n
other hand, denote H(2)T = ZT(2)
TH ε
Zls
n
(3.16)
1
c1 n
T H ε . On the
max(l,s)∈/S (0) Zls
n 2
)−1
(
T Z
T
T and let H(2) be the J
− Z(2)
Z(1)
n
(1) Z(1) Z(1)
ls
−1
Wls
�
(2)
columns of H(2) corresponding to Zls with Hls = ZTls Hn for (l, s) ∈ S (0) . Note that
H(2) H(2)T = Z(2) Hn ZT(2) � Z(2) ZT(2) . Therefore, one has ZTls Hn ε
2
� ZTls ε 2 . Conse­
quently,
max
(l,s)∈
/S (0)
1
ZTls Hn ε
n
−1
Wls
�
1
T
ε 2.
max Zls
c1 n (l,s)∈S
/ (0)
According to Lemma 7 in Xue and Qu (2012),
)
(
(J
)
1
T
=O
log (Nn d1 d2 ) .
E
max √ Zls ε
n
(l,s)∈
/S ( 0)
2
Then Markov inequality implies
(
1
P
max
ZTls Hn ε
(l,s)∈
/S (0) n
−1
Wls
λn
>
2
)
�C
�
log (Nn )
.
nλ2n
(3.17)
Then (3.13) follows from (3.16), (3.17) and (ii) of Assumption (C.5).
3.7.3
Proof of Theorem 3.3.2
To prove γ
j(0) converges to the global minimum in (3.3), it suffices to show that
(
)
P Ln (γ) ; Ln (γ
j(0)) for all γ ∈ Rd1 d2 Jn → 1 as n → ∞.
(3.18)
We first define several subset of the index pair set S = {(l, s)|l = 1, · · · , d1 , s = 1, · · · ,
}
{
/ S (0) . For any given γ ∈ Rd1 d2 Jn , let
d2 }. Let P = S (0) and N = S/S (0) = (l, s)|(l, s) ∈
{
}
(l, s) ∈ P = S (0) , lγls lWls > aλn , P − = P/P + ;
P+ =
{
}
N+ =
(l, s) ∈ N = S/S (0) , lγls lWls > λn , N − = N /N + .
57
Recall that Z(1) is the columnwise combination of Zls satisfying (l, s) ∈ S (0) = P, and
Z(2) is the columnwise combination of Zls satisfying (l, s) ∈ N = S (0) . We further define
j (2) as the column-wise projection of Z(2) onto Z(1) . That is, the j th column of Z
j (2) is the
Z
projection of the j th column of Z(2) onto the linear space spanned by the column vectors of
]−1
[
T .
Furthermore, we denote
j (2) = K(1) Z(2) , where K(1) = Z(1) ZT Z(1)
Z(1) with Z
Z(1)
(1)
J (2) = Z(2) − Z
j (2) and Y
j (0) = Zj
Z
γ (0)
. Then through orthogonal decomposition, one has
Ln (γ) =
1 j (0)
j (2) γ(2)
Y − Z(1) γ(1) − Z
2n
2
2
+
1
j (0) − Z
J (2) γ(2)
Y−Y
2n
2
2
+
L
(l,s)∈S
pλn (lγls lWls ).
Here γ(1) and γ(2) are coefficients corresponding to Z(1) and Z(2) . Note that
j (0) − Z
J (2) γ(2)
Y−Y
Then one has
2
2
j (0)
= Y−Y
2
2
T JT J
j (0) , Z
J (2) γ(2) ).
Z(2) Z(2) γ(2) − 2(Y − Y
+ γ(2)
Ln (γ) − Ln (j
γ (0) )
2
1 j (0)
JT Z
J γ
j (2) γ(2) + 1
γ T Z
Y − Z(1) γ(1) − Z
2n
2n (2) (2) (2) (2)
2
L
1
(0)
j (0) , Z
J (2) γ(2) ) +
jls
pλn (lγls lWls ) − pλn ( γ
− (Y − Y
n
;
(l,s)∈S
Consider the linear space A spanned by
Wls
) .
{
}
j ls : (l, s) ∈ P + ∪ N , where we let Z
j ls =
Z
j (0) be the projection of Y
j (0) onto A. Let Z
j A = (Z
j ls , (l, s) ∈
H(1) Zls for (l, s) ∈ S (0) . Let Y
A
j AC = (Z
j ls , (l, s) ∈ P − ) and
P + ∪ N ) and γ A = (γls , (l, s) ∈ P + ∪ N ). Similarly, we let Z
58
C
γ A = (γls , (l, s) ∈ P − ). Then we have
Ln (γ) − Ln (j
γ (0) )
1 j (0) j (0) j (0) j A j
1 T JT J
C 2
Y − YA + YA − ZA γ − ZAC γ A
+
γ Z Z γ
2n
2n (2) (2) (2) (2)
2
L
1
(0)
j (0) , Z
J (2) γ(2) ) +
pλn (lγls lWls ) − pλn ( γ
jls
)
− (Y − Y
n
Wls
(l,s)∈S
(
)
3 1 j (0) j (0) 2
1 j (0) j A 2
3 j
C 2
−
ZAC γ A
;
Y − YA
+
YA − Z A γ
4 2n
2n
2n
2
2
2
1
1
j (0) , Z
J (2) γ(2) ) +
J γ
JT Z
− (Y − Y
γT Z
n
2n (2) (2) (2) (2)
L
(0)
+
pλn (lγls lWls ) − pλn ( γ
jls
)
;
Wls
(l,s)∈S
;
c j (0) j (0) 2 c j
1 T JT J
C 2
Y − YA
−
ZAC γ A
+
γ Z Z γ
n
n
2n (2) (2) (2) (2)
2
2
L
1
(0)
j,Z
J (2) γ(2) ) +
pλn (lγls lWls ) − pλn ( γ
jls
− (Y − Y
n
(l,s)∈S
Wls
)
= I − II + III − IV + V.
(3.19)
In the following, we will discuss all the five additive terms in (3.19) separately. We first
J (2,2) =
discuss the term III. Let C
- (2)T Z
- (2)
Z
.
n
One has
(
)−1
J (2,2) = Z(2)T In − Z(1) Z(1)T Z(1)
C
Z(1)T Z(2)
=
(
−Z
(2)T
Z
(1)
(
Z
(1)T
Z
(1)
)−1
, I(2)
)(
ZT Z
n
)(
−Z
(2)T
Z
(1)
)T
)−1
(
(1)T (1)
Z
, I(2)
Z
.
Here I(2) is the identity matrix and has the same number of columns as Z(2) . Lemma 3.7.2
entails that, for any γ(2) ,
}
{(
}T ( ZT Z ) {(
)T
)T
D, I(2) γ(2)
D, I(2) γ(2)
n
( )
2
(
)T
(
) DT
T
γ
; c D, I(2) γ(2) = c γ(2) D, I(2)
I(2) (2)
2
T J (2,2)
γ(2)
C
γ(2) =
T
T
= c γ(2)
DDT γ(2) + γ(2)
γ(2) ; c γ(2)
2
,
2
59
(
)−1
where D = −Z(2)T Z(1) Z(1)T Z(1)
. Therefore, Corollary 3.7.1.1 gives that
III ; c γ(2)
; c
2
2
(l,s)∈N +
L
(l,s)∈N +
lγls l22
lγls l2Wls ; cλn
L
; cλn
L
;c
L
(l,s)∈N +
lγls lWls
lγls l2 .
(l,s)∈N +
(3.20)
Now without loss of generality, let ZA = (Zls : (l, s) ∈ P + ∪ N ). Similarly we have
r (0)
c j (0) j (0) 2
c (0) [
T
j Z In − ZA (ZA
= γ
γ
Y − YA
ZA )−1 ZA Zj
n
n
2
L
2
(0) 2
j(0) ; c
.
jls
; c γ
γ
I =
2
(3.21)
Wls
(l,s)∈P
Furthermore, using exactly the same argument, one can prove that
; (2)
; (2)T Z
Z
n
has the
largest eigenvalue bounded by c > 0. Thus for II,
II =
C
c j
ZAC γ A
n
j,Z
J (2) γ(2) ) = �
For IV = n1 (Y −Y
(l,s)∈N
2
2
1
n
C
� c γA
2
(3.22)
2
I
)
T
j , Zls γls = �
Y−Y
γ (0) ),
(l,s)∈N γls Cls (j
T C (j
(0) ) � C (j
(0) )
the Cauchy-Schwarz inequality gives that, for (l, s) ∈ N , γls
ls γ
ls γ
2
lγls l2 .
Furthermore, according to conclusion (3.13) proved in Theorem 3.3.1, we have Cls (j
γ (0) )
2
lγls l2 =
oP (λn ) lγls l2 . Thus,
IV =
L
1
j,Z
J (2) γ(2) ) = oP (λn )
(Y − Y
lγls l2 .
n
(3.23)
(l,s)∈N
For the last additive term of (3.19), we have
L
(l,s)∈S
L
=
(l,s)∈P +
+
(0)
pλn (lγls lWls ) − pλn ( γ
jls
L
Wls
(0)
pλn (lγls lWls ) − pλn ( γ
jls
(l,s)∈N +
)
(
pλn lγls lWls +
)
L
(l,s)∈N −
Wls
) +
L
(l,s)∈P −
)
(
pλn lγls lWls .
(0)
pλn (lγls lWls ) − pλn ( γ
jls
Wls
)
60
For (l, s) ∈ P + , one notes that lγls lWls > aλn by the definition of P + . Therefore
�
�
(0)
(3.12) entails that (l,s)∈P + pλn (lγls lWls ) − pλn ( γ
jls
) = 0 and (l,s)∈P − pλn (lγls lWls ) − pλn
Wls
(
)
�
−
−
2
; 0−|P | OP (λn ). For (l, s) ∈ N , we have lγls lWls < λn . Thus (l,s)∈N − pλn lγls lWls =
�
λn (l,s)∈N − lγls lWls . Therefore one has
V
=
L
(0)
(l,s)∈S
pλn (lγls lWls ) − pλn ( γ
jls
[
r
; 0 + 0 − P − OP (λ2n ) + λn
; cλn
L
(l,s)∈N −
Wls
L
(l,s)∈N −
)
lγls lWls
lγls l2 − P − OP (λ2n ).
(3.24)
Finally, (3.20), (3.21), (3.22), (3.23) and (3.24) together entail that
Ln (γ) − Ln (j
γ (0) ) ; c
L
(l,s)∈P
+cλn
(0) 2
γ
jls
L
(l,s)∈N −
Wls
C
− c γA
2
L
+ cλn
2
(l,s)∈N +
lγls l2 − P − OP (λ2n ).
lγls l2 −
L
(l,s)∈N
oP (λn ) lγls l2
(3.25)
Note that according to the definition of A which is the combination of P + and N ,
we have
C
γA
2
2
� |P − | a2 λ2n . For the same reason,
λn lγls , (l, s) ∈ N + l2 . Therefore, one has

L
Ln (γ) − Ln (j
γ (0) ) ; c
(l,s)∈P
(0) 2
γ
jls
Wls
+ [cλn − oP (λn )]
− P−
L
(l,s)∈N
Note that, according to (3.14) in proof of Theorem 3.3.1,
2
2
γ(2)
(
2
; l(γls , (l, s) ∈ N + )l2 ;

)
cλ2n + OP (λ2n ) 
lγls l2 .
�
(0) 2
(l,s)∈P
γ
jls
Wls
is great than a
positive constant with probability to 1. Therefore, (3.18) follows from (iii) of Assumption
(C.5). It completes the proof of Theorem 2.
61
TABLE 3.1: Variable selection results for Example 1. The columns of U, C, O give respec­
tively the numbers of under-fitting, correct-fitting and over-fitting from 100 replications.
n
U
C
O
100
4
28
68
250
0
100
0
500
0
100
0
TABLE 3.2: Estimation accuracy for Example 1. The Mean(SE) columns give the mean
and standard errors of α̂10 and α̂20 . The AISE columns give AISEs of α̂11 , α̂12 , α̂21 and
α̂22 respectively.
n
100
250
500
Method
Mean(SE)
AISE
α10 = 2
α20 = 1
α11
α12
α21
SCAD
1.866(0.0458)
0.849(0.0379)
0.9510
0.6230
1.7236
ORACLE
1.965(0.0241)
1.016(0.0132)
0.0935
0.0994
0.1193
FULL
1.868(0.0560)
1.023(0.0415)
1.9228
1.4381
2.7271
SCAD
2.004(0.0082)
1.016(0.0055)
0.0278
0.0241
0.0183
ORACLE
2.004(0.0083)
1.017(0.0053)
0.0242
0.0209
0.0177
FULL
2.006(0.0084)
1.021(0.0057)
0.0331
0.0305
0.0247
SCAD
2.000(0.0045)
1.002(0.0025)
0.0122
0.0133
0.0079
ORACLE
2.000(0.0045)
1.002(0.0025)
0.0121
0.0132
0.0079
FULL
2.001(0.0046)
1.002(0.0025)
0.0137
0.0141
0.0095
62
TABLE 3.3: Estimation accuracy for Example 2. The AISE columns give AISEs of α̂11 ,
α̂12 and α̂21 respectively.
Method
AISE
α11
α12
α21
SCAD
0.0871
0.0819
0.0735
Oracle
0.0218
0.0173
0.0202
TABLE 3.4: Analysis result of Tucson housing price data.
Method
R2
MAEE
MAPE
Percentage of “good” prediction
SCAD
0.846
13419.1
13801.7
61.3%
Full
0.861
12975.7
15385.0
61.9%
Linear
0.555
19904.2
19274.6
45.5%
2
1.5
1
1.0
0.5
−2
−1
0
1
2
3
−3
−2
0
1
2
3
−1
−2
−3
−2
−1
0
1
2
3
−3
−2
(c) (a3)
0
1
2
3
2
3
2
3
2
3
−2
−1
0
1
2
3
−2
0
1
2
3
−1
−3
−2
−1
0
1
2
3
−3
−2
(g) (b3)
0
1
−2
−1
0
1
2
3
−2
0
1
2
3
−1
−3
−2
−1
0
1
2
3
−3
−2
(k) (c3)
0
1
−3
−2
−1
0
1
(m) (d1)
2
3
−3
−2
−1
0
1
(n) (d2)
2
3
−2
−1.5
−1
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(l) (c4)
1.5
(j) (c2)
1.5
(i) (c1)
−1
−2
−1.5
−3
2
−3
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(h) (b4)
1.5
(f ) (b2)
1.5
(e) (b1)
−1
−2
−1.5
−3
2
−3
0
−0.5 0.0
0.5
1
1.0
4
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
−1
(d) (a4)
1.5
(b) (a2)
1.5
(a) (a1)
−1
2
−3
0
−0.5 0.0
−1.5
−4
−1.5
−2
0
−0.5 0.0
0.5
2
1.0
4
1.5
63
−3
−2
−1
0
1
(o) (d3)
2
3
−3
−2
−1
0
1
(p) (d4)
FIGURE 3.1: Fitted curves for each αls (·), l = 1, 2, s = 1, 2 in Example 1. Plots (a1)-(a4)
show the 100 fitted curves for α11 (x) = sin(x), α12 (x) = x, α21 (x) = sin(x), α22 (x) = 0
respectively when sample size n = 100. Plots (b1)-(b4) and (c1)-(c4) are for n = 250
and n = 500 respectively. (d1)-(d4) plot the true model functions (solid) as well as the
typically estimated curves of different sample sizes: dashed lines (n = 100), dotted lines
(n = 250), and the dot-dashed lines (n = 500).
4
2
0
−2
−4
−2
−1
0
1
2
64
−3
−2
−1
0
1
2
3
−3
−2
−1
1
2
3
(b) (b)
−2
−1
0
1
2
(a) (a)
0
−3
−2
−1
0
1
2
3
(c) (c)
FIGURE 3.2: Fitted curves for α11 (·), α12 (·), and α21 (·) in Example 2. Plotted are
the true functions (solid), typical oracle estimates (dashed) and typical SCAD estimates
(dotted). The typical estimated curve is the one whose ISE is the median among the 100
ISEs from replications.
FIGURE 3.3: Graphs of estimated curves of all fifteen additive coefficient components αls ,
l = 1, · · · , 5, s = 1, 2, 3. SCAD estimators are plotted in solid curves, and full estimators
are plotted in dashed curves. The non-zero components selected by SCAD are α11 , α12 ,
α13 , α21 , α41 .
FIGURE 3.4: Plots of actual housing prices against predicted values from SCAD, FULL and a linear regression model. To reduce overplotting of the 891 points, we randomly selected 80 points for plotting. In each plot, the dotted line is the reference line y = x, and the two dashed lines represent y = x + 0.1 |x| and y = x − 0.1 |x| respectively. Any point enclosed within the two dashed lines represents a “good” predicted price.
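The percentage of “good” predictions reported in Table 3.4 follows directly from this criterion: a prediction is counted as good when it lies within 10% of the actual price. A minimal sketch of the computation is given below; the variable names actual_price and predicted_price are illustrative, not objects from the analysis code.

    import numpy as np

    def percent_good_predictions(actual, predicted, tol=0.10):
        """Percentage of points with |predicted - actual| <= tol * |actual|,
        i.e. points lying between y = x - 0.1|x| and y = x + 0.1|x|."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        good = np.abs(predicted - actual) <= tol * np.abs(actual)
        return 100.0 * np.mean(good)

    # e.g. percent_good_predictions(actual_price, predicted_price) would be
    # comparable to the 61.3% reported for SCAD in Table 3.4.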
4 CONCLUSIONS
In this dissertation, we proposed the SCAD-penalized polynomial spline estimation method for simultaneous estimation and variable selection with nonlinear time series data and with randomly sampled data of nonparametric structure, using a stochastic additive model and the more general additive coefficient model, respectively. Both semi-parametric models can capture very general data structures while avoiding the “curse of dimensionality” through their additive formulation.
The proposed method requires minimizing an objective function that is the sum of squared errors plus a non-convex penalty. To solve this non-convex optimization problem, we adopted the local linear approximation (LLA) method, originally introduced for linear regression models, and extended it to both the SAM and the ACM. The LLA method greatly reduces the computational cost while still yielding estimators whose consistency we have established theoretically. We demonstrated the effectiveness of the proposed method through Monte Carlo studies and real data analyses.
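To make the LLA idea concrete, the sketch below shows its two ingredients: the derivative of the SCAD penalty of Fan and Li (2001), and a single LLA step that replaces each non-convex penalty term by a linear majorizer whose weight is that derivative evaluated at the current group norm, so that the update reduces to a convex, weighted group-penalized least squares problem. The group structure and the solver solve_weighted_problem are placeholders for illustration; this is a schematic sketch under those assumptions, not the code used for the reported results.

    import numpy as np

    def scad_derivative(t, lam, a=3.7):
        """Derivative p'_lam(t) of the SCAD penalty for t >= 0 (Fan and Li, 2001)."""
        t = np.asarray(t, dtype=float)
        return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

    def lla_step(gamma_groups, lam, solve_weighted_problem):
        """One local linear approximation (LLA) step.

        gamma_groups: dict mapping a component index (l, s) to its current
            spline coefficient vector (illustrative data structure).
        solve_weighted_problem: placeholder for a solver that minimizes the sum
            of squared errors plus the sum over (l, s) of weight * ||gamma_ls||,
            for example by a coordinate-wise algorithm.
        """
        # Linearize each SCAD term at the current group norm; the weight is
        # p'_lam evaluated at ||gamma_ls|| from the previous iteration.
        weights = {key: float(scad_derivative(np.linalg.norm(g), lam))
                   for key, g in gamma_groups.items()}
        # The resulting weighted problem is convex and is handed to the solver.
        return solve_weighted_problem(weights)

Iterating such steps until the group norms stabilize gives an approximate minimizer of the penalized objective.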
We further proved the oracle properties of the penalized estimators in both models. For the stochastic additive model, the proofs assume only weak dependence of the time series data, which is more general than assuming independent observations and therefore broadens the theoretical conclusions. For the additive coefficient model, we proved directly that the proposed estimator, the global minimizer of the objective function, possesses the oracle properties. This is a stronger conclusion than in previous work, which only established the existence of a sequence of consistent local minimizers.
Our research can be advanced in at least two directions. First, the analysis of time series data is an important problem. We proposed the SAM to analyze such data in this research, but the additive coefficient model can also be used for time series data; autoregressive models are one example. Unlike the SAM, the ACM allows interactions between lags and can therefore model more general data structures. The global optimality of the SCAD estimator could be proven for the ACM under appropriate assumptions on the time series data. Second, statistical genetics has been a growing area in recent years, and the analysis of genetic data often generates new research questions involving high-dimensional data. According to Fan and Li (2001), the SCAD penalty is able to handle high-dimensional data, so it would be interesting to extend the proposed method to the high-dimensional case where the number of predictors is no longer fixed but increases with the sample size. High-dimensional versions of the SAM and ACM could then contribute to solving problems in statistical genetics. Although new theory remains to be developed, a high-dimensional example was included in the numerical studies in subsection 3.5, and the proposed method performed satisfactorily.
BIBLIOGRAPHY
Ang, A. and Chen, J. (2002). Asymmetric correlations of equity portfolios. Journal of Financial Economics, 63, 443–494.
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, 2nd Ed. New York: Springer-Verlag.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals
of Statistics, 24, 2350–2383.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple
regression and correlation. Journal of the American Statistical Association, 80, 580–619.
Brockwell, P. J. and Davis, R. A. (1991). Time series: theory and methods. New York:
Springer-Verlag.
Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Annals of Statistics, 35, 2313–2404.
Chen, R. and Tsay, R. S. (1993). Nonlinear additive ARX models. Journal of the American
Statistical Association, 88, 955–967.
de Boor, C. (2001). A Practical Guide to Splines. New York: Springer.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression.
Annals of Statistics, 32, 407–499.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of
parameters. Annals of Statistics, 32, 928–961.
Fik, T. J., Ling, D. C. and Mulligan G. F. (2003). Modeling spatial variation in housing
prices: A variable interaction approach. Real Estate Economics, V31, 623–646.
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics tools. Technometrics, 35, 109–135.
Härdle, W., Liang, H. and Gao, J. (2000). Partially Linear Models, Physica-Verlag.
Härdle, W. and Stoker, T. (1989). Investigating Smooth Multiple Regression by the
Method of Average Derivatives. Journal of the American Statistical Association, 84,
986–995.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models. London: Chapman
and Hall.
Hastie, T. J. and Tibshirani, R. J. (1993). Varying-coefficient models. J. R. Stat. Soc. Ser. B, 55, 757–796.
Huang, J. Z. (1998a). Projection estimation in multiple regression with application to
functional ANOVA models. Annals of Statistics, 26, 242–272.
Huang, J. Z. (1998b). Functional ANOVA models for generalized regression. Journal of
Multivariate Analysis, 67, 49–71.
Huang, J. Z., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric
additive models. Annals of Statistics, 38, 2282–2313.
Huang, J. Z., Kooperberg, C., Stone, C. J., and Truong, Y. K. (2000). Functional ANOVA modeling for proportional hazards regression. Annals of Statistics, 28, 961–999.
Huang, J. Z., Wu, C. O. and Zhou, L. (2002). Polynomial spline estimation and inference
for varying coefficient models with longitudinal data. Statistica Sinica, 14, 763–788.
Huang, J. Z. and Yang, L. (2004). Identification of nonlinear additive autoregressive models. J. R. Stat. Soc. Ser. B, 66, 463–477.
Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of
single-index models. Journal of Econometrics, 58, 71–120.
Jiang, S. and Xue, L. (2013). Lag selection in stochastic additive models. Journal of
Nonparametric Statistics. 25, 129–146.
Kim, Y., Choi, H., and Oh, H. S. (2008). Smoothly clipped absolute deviation on high
dimensions. Journal of the American Statistical Association, 103, 1665–1673.
Lian, H. (2012). Variable selection for high-dimensional generalized varying-coefficient
models. Statistica Sinica, 22, 1563–1588.
Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear
models. Biometrika, 73, 13–22.
Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in smoothing spline
analysis of variance models. Annals of Statistics, 34, 2272–2297.
Linton, O. B. and Härdle, W. (1996). Estimation of additive regression models with known
links. Biometrika, 83, 529–540.
Liu, R. and Yang, L. (2010). Spline-backfitted kernel smoothing of additive coefficient
model. Econometric Theory, 12, 29–59.
Lütkepohl, H. (1993). Introduction to multiple time series analysis. Berlin: Springer-Verlag.
Ma, S., Song, Q. and Wang, L. (2013). Simultaneous variable selection and estimation in
semiparametric modeling of longitudinal/clustered data. Bernoulli, 19, 252–274.
Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling.
Annals of Statistics, 37, 3779–3821.
Shen, X., Pan, W. and Zhu, Y. (2011). Likelihood-based selection and sharp parameter
estimation. Journal of the American Statistical Association, 107, 223–232.
Schwarz, G. (1978). Estimating the Dimension of a Model. Annals of Statistics, 6, 461–464.
Stone, C. J. (1985). Additive regression and other nonparametric models. Annals of Statistics, 13, 689–705.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
Ser. B, 58, 267–288.
Tjøstheim, D. (1994). Non-linear time series: a selective review. Scand. J. Statist., 21, 97–130.
Tong, H. (1990). Nonlinear time series. A dynamical system approach. New York: Oxford
University Press.
Tong, H. (1995). A personal overview of non-linear time series analysis from a chaos
perspective. Scand. J. Statist., 22, 399–445.
Tschernig, R. and Yang, L. (2000). Nonparametric estimation of generalized impulse response function. Discussion Papers, Interdisciplinary Research Project 373: Quantification and Simulation of Economic Processes, 2000.
Wang, L., Li, H. and Huang, J. Z. (2008). Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. Journal of the American Statistical Association, 103, 1556–1569.
Wang, H., Li, R. and Tsai, C. (2008). Tuning parameter selectors for the smoothly clipped
absolute deviation method. Biometrika, 94, 553–568.
Wang, L., Liu, X., Liang, H. and Carroll, R. (2011). Estimation and variable selection for generalized additive partial linear models. Annals of Statistics, 39, 1827–1851.
Wang, L. and Yang, L. (2007). Spline-backfitted kernel smoothing of nonlinear additive
autoregression model. Annals of Statistics, 35, 2474–2503.
Wei, F., Huang, J., and Li, H. (2011). Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica, 21, 1515–1540.
Xue, L. (2009). Variable selection in additive models. Statistica Sinica, 19, 1281–1296.
Xue, L., and Qu, A. (2012). Variable selection in high-dimensional varying-coefficient models with global optimality. Journal of Machine Learning Research, 13, 1973–1998.
Xue, L., Qu, A. and Zhou, J. (2010). Consistent model selection for marginal generalized
additive model for correlated data. Journal of the American Statistical Association,
105, 1518–1530.
Xue, L., and Yang, L. (2006a). Estimation of semiparametric additive coefficient model.
J. Statist. Plann. Inference, 136, 2506–2534.
Xue, L., and Yang, L. (2006b). Additive coefficient modeling via polynomial spline. Statistica Sinica, 16, 1423–1446.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. J. R. Stat. Soc. Ser. B, 68, 49–67.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, 894–942.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American
Statistical Association, 101, 1418–1429.
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood
models. Annals of Statistics, 36, 1509–1533.