ECON 2140: Econometric Methods
Lecture 23: Nonparametric Estimation:
Sieve Estimator
Wayne Gao
Outline for Lecture 23
L23-1: Introduction to Sieve Approximation
L23-2: Polynomial Series Regression
L23-3: Sieve Estimation More Generally
L23-4: Inference at a Point
L23-1: Introduction to Sieve Approximation
Local vs Global
▶ Nonparametric regression:
$$m(x) = E[Y_i \mid X_i = x]$$
▶ Recall that kernel methods are “local” in nature:
− they seek to approximate m(x) using observations locally around x,
− separately for each x in the relevant domain X
▶ Today we are going to talk about sieve methods, which are “global”:
− they seek to approximate the whole function m on a domain X,
− using a sequence of whole functions m1, m2, m3, ...
“Sieve”
loosely speaking: “a grid of functions”
Weierstrass Approximation Theorem
▶ The idea of “sieves” dates back to:
− ancient Egypt for grain sieving...
− or the 19th century for the famous “approximation theory” result:
▶ Weierstrass Approximation Theorem:
− let m be any continuous function on [a, b] ⊆ R,
− let ϵ > 0;
− then there exists a k-th order polynomial of the form
$$m_{k,\beta}(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$$
for some $\beta_0, \beta_1, \ldots, \beta_k \in \mathbb{R}$ and some $k \in \mathbb{N}$,
− such that
$$\|m_{k,\beta} - m\|_\infty := \sup_{x \in [a,b]} |m_{k,\beta}(x) - m(x)| < \epsilon.$$
▶ Loosely speaking, the above says that, for any ϵ > 0,
− you can use polynomials to build a “grid of functions”
− with “holes” no larger than ϵ
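▶ A minimal numerical sketch of the theorem (the kinked target m(x) = |x − 0.3| and all other choices here are illustrative, not from the lecture; a least-squares fit on a dense grid stands in for the best sup-norm approximation):

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical target: continuous but kinked, so low-order polynomials struggle.
m = lambda x: np.abs(x - 0.3)

x = np.linspace(0.0, 1.0, 2001)  # dense grid; the max over it proxies the sup norm
for k in (2, 5, 10, 20):
    p = Polynomial.fit(x, m(x), deg=k)       # least-squares polynomial of order k
    sup_err = np.max(np.abs(p(x) - m(x)))    # approximate sup-norm error
    print(f"k = {k:2d}:  sup-norm error ~ {sup_err:.4f}")
```

The printed error shrinks as k grows, which is exactly the “holes no larger than ϵ” picture.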
L23-2: Polynomial Series Regression
Polynomial Series Regression
▶ Motivated by the above, we may consider estimating
$$m(x) = E[Y_i \mid X_i = x]$$
with a polynomial series, i.e.:
▶ For scalar x ∈ R, use
$$m_{k,\hat\beta}(x) = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \cdots + \hat\beta_{k-1} x^{k-1}$$
− for some chosen order k,
− with $\hat\beta$ given by
$$\hat\beta_n := \arg\min_\beta \frac{1}{n} \sum_{i=1}^n \left(Y_i - m_{k,\beta}(X_i)\right)^2$$
→ i.e., just run an OLS regression of Yi on $1, X_i, \ldots, X_i^{k-1}$ (see the sketch below)
▶ For vector-valued Xi ∈ Rd, we need to include “interaction terms”,
$$1,\ X_{i,1}, \ldots, X_{i,d},\ \ldots\ldots,\ \prod_{j=1}^{d} X_{i,j}^{J-1},$$
in which case we have $k = J^d$ terms
→ one can already sense the “curse of dimensionality” here
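▶ A minimal sketch of the scalar case (the sine DGP, sample size, and order k are my illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 6                                      # sample size and chosen order
X = rng.uniform(-1, 1, n)
Y = np.sin(2 * X) + 0.3 * rng.standard_normal(n)   # hypothetical DGP: m(x) = sin(2x)

P = np.vander(X, N=k, increasing=True)             # columns 1, X, X^2, ..., X^{k-1}
beta_hat, *_ = np.linalg.lstsq(P, Y, rcond=None)   # OLS = the series estimator

def m_hat(x):
    """Evaluate the fitted polynomial series at new points x."""
    return np.vander(np.atleast_1d(x), N=k, increasing=True) @ beta_hat

print(m_hat(0.5), np.sin(2 * 0.5))                 # estimate vs truth at x = 0.5
```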
Consistency
▶ Under appropriate conditions, the following can be stated in either of:
− the L2 norm $\|f\|_2 := \sqrt{E\left[f(X_i)^2\right]}$
− the sup norm $\|f\|_\infty := \sup_x |f(x)|$
▶ From our extremum consistency results, for fixed k, as n → ∞,
$$\hat\beta_n \xrightarrow{p} \beta_k^* := \arg\min_\beta E\left[\left(m(X_i) - m_{k,\beta}(X_i)\right)^2\right],$$
or in other words,
$$\left\|m_{k,\hat\beta_n} - m_{k,\beta_k^*}\right\| \xrightarrow{p} 0.$$
▶ The Weierstrass Approximation Theorem guarantees that
$$\lim_{k \to \infty} \left\|m_{k,\beta_k^*} - m\right\| = 0.$$
▶ Hence, as n → ∞, if we set kn → ∞ “slowly enough relative to n”, then $m_{k_n,\hat\beta_n}$ is consistent for m:
$$\left\|m_{k_n,\hat\beta_n} - m\right\| \le \left\|m_{k_n,\hat\beta_n} - m_{k_n,\beta_{k_n}^*}\right\| + \left\|m_{k_n,\beta_{k_n}^*} - m\right\| \xrightarrow{p} 0.$$
Rate of Convergence
▶ Very loosely, one can think of kn as playing the role of “1/hn”, the inverse bandwidth in kernel regression
▶ Results on rates of convergence for series estimators:
− derived in, e.g., Newey (1997) for polynomial (and spline) estimators
− surveyed in, e.g., Chen (2007) for many other series/sieve estimators
− “improved” in, e.g., Belloni et al. (2015) for polynomial/spline...
▶ Under some regularity conditions and
− “smoothness”: m(x) is s-times continuously differentiable
− “slow enough”: $k_n^2 \log k_n / n \to 0$ (Belloni et al., 2015)
we have
$$\left\|m_{k_n,\hat\beta_n} - m\right\|_2 = O_p\left(k_n^{-s/d} + \sqrt{k_n/n}\right),$$
$$\left\|m_{k_n,\hat\beta_n} - m\right\|_\infty = O_p\left(\left(k_n^{-s/d} + \sqrt{k_n/n}\right) k_n\right),$$
rates that improve with the smoothness s and deteriorate with the dimension d.
Optimal Rate of Convergence
$$\left\|m_{k_n,\hat\beta_n} - m\right\|_2 = O_p\left(k_n^{-s/d} + \sqrt{k_n/n}\right), \qquad \left\|m_{k_n,\hat\beta_n} - m\right\|_\infty = O_p\left(\left(k_n^{-s/d} + \sqrt{k_n/n}\right) k_n\right)$$
▶ Again, we can optimize over the rate of kn as n → ∞.
→ optimal rate for $k_n = J_n^d$:
$$k_n^* \propto n^{\frac{d}{2s+d}}, \qquad J_n^* \propto n^{\frac{1}{2s+d}}$$
→ with the optimal $k_n^*$,
$$\left\|m_{k_n^*,\hat\beta_n} - m\right\|_2 = O_p\left(n^{-\frac{s}{2s+d}}\right), \qquad \left\|m_{k_n^*,\hat\beta_n} - m\right\|_\infty = O_p\left(n^{-\frac{s-d}{2s+d}}\right)$$
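▶ Where kn* comes from: balance the approximation-error term against the variance term in the L2 bound,
$$k_n^{-s/d} \asymp \sqrt{k_n/n} \;\Longleftrightarrow\; k_n^{\frac{s}{d} + \frac{1}{2}} \asymp n^{\frac{1}{2}} \;\Longleftrightarrow\; k_n \asymp n^{\frac{d}{2s+d}},$$
$$\text{so that}\qquad \sqrt{k_n^*/n} \asymp n^{\frac{1}{2}\left(\frac{d}{2s+d} - 1\right)} = n^{-\frac{s}{2s+d}},$$
and plugging kn* back into either term gives the L2 rate above.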
Optimal Rate of Convergence
$$\left\|m_{k_n^*,\hat\beta_n} - m\right\|_2 = O_p\left(n^{-\frac{s}{2s+d}}\right), \qquad \left\|m_{k_n^*,\hat\beta_n} - m\right\|_\infty = O_p\left(n^{-\frac{s-d}{2s+d}}\right)$$
▶ Recall the Stone (1982) bound on the optimal rate of convergence:
→ the polynomial series estimator $m_{k_n^*,\hat\beta_n}$
− attains the optimal rate $n^{-\frac{s}{2s+d}}$ under the L2 norm,
− does not attain the optimal rate $\left(\frac{n}{\log n}\right)^{-\frac{s}{2s+d}}$ under the sup norm
→ which, however, can be attained using other types of sieves
→ so polynomial series are suboptimal in this sense.
L23-3: Sieve Estimation More Generally
Approximation of Smooth Functions
▶ The mathematical field of “approximation theory” studies how well one class of functions can approximate another class of functions.
− it has nothing to do with “statistical uncertainty” or “sampling randomness”
▶ In the sieve nonparametric regression context, we usually assume that the function of interest
$$m(x) = E[Y_i \mid X_i = x]$$
lies within some “smoothness class” of functions,
− to be approximated by functions in (a sequence of) sieve spaces.
▶ How small the approximation error is depends on:
− the structure/smoothness of the target class
− the structure/configuration of the sieve space
− the dimension of x
− the “norm” (L2, sup, etc.)
Smoothness Classes “of Order s”
▶ In the Weierstrass Approximation Theorem, we assumed that
“m is a continuous function on [a, b] ⊆ R,”
which is a “smoothness/regularity” condition.
▶ Usually, such smoothness/regularity conditions take the form of imposed bounds on the magnitude of the function and its derivatives up to a certain order.
▶ Example: Hölder space of order s: loosely speaking,
− uniformly bounded (partial) derivatives exist up to order s,
− order-s (partial) derivatives are Lipschitz continuous
▶ Another popular smoothness class is the “Sobolev space”
▶ See, e.g., Chen (2007) for exact definitions
→ the exact definition of the order of smoothness “s” is trickier
Some Classes of Linear Sieves
▶ We have talked about polynomials earlier.
▶ There are transformed types of polynomials (orthogonal, Hermite, ...)
▶ There are other types of linear sieves (series).
▶ Fourier/Cosine/Sine:
$$1,\ \left(\cos(2\pi j x),\ \sin(2\pi j x)\right)_{j=1}^{(k_n - 1)/2}$$
▶ Splines: e.g. a cubic spline with knots z1, z2, ...:
$$1,\ x,\ x^2,\ x^3,\ [x - z_1]_+^3,\ \ldots,\ [x - z_{k_n - 4}]_+^3$$
▶ Wavelets...
▶ Local polynomial partitions...
▶ And many more...
Some Classes of Linear Sieves
▶ Existing results show that the Stone optimal rates
$$\|\hat m^* - m\|_2 = O_p\left(n^{-\frac{s}{2s+d}}\right), \qquad \|\hat m^* - m\|_\infty = O_p\left(\left(\frac{n}{\log n}\right)^{-\frac{s}{2s+d}}\right)$$
are attainable by:
− Fourier series, splines, wavelets, local polynomial partitions...
− in i.i.d. and time-series settings...
− see, e.g., Newey (1997), ..., Chen (2007), Belloni et al. (2015), Chen & Christensen (2018)
▶ General takeaway:
− do not use polynomials if you care about uniform/sup-norm convergence...
Nonlinear Sieve: Neural Network
▶ So far we have been talking about linear sieve (series) estimators
▶ There are also nonlinear sieves, the most notable of which is...
▶ Neural network (NN):
− single-layer NN: linear combinations of nonlinear transformations (see the sketch below),
$$\sum_{j=1}^{k_n} \beta_j \, \phi\left(x' \gamma_j\right)$$
− multi-layer (deep) NN: recursion of the above
− they “perform surprisingly well” in many senses...
▶ Well, NNs cannot beat $n^{-\frac{s}{2s+d}}$, so what are people talking about?
− (a lot of) research on how NNs approximate certain target function classes, usually with more structure or nicer properties than the “smoothness classes” defined earlier
− exploring conditions for “no curse of dimensionality” for neural nets
− will talk about this a bit in the last lecture if time allows...
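▶ A minimal sketch of the single-layer form above, in a simplified “random features” variant: the directions γj are drawn at random and frozen, so β can be estimated by OLS. This is my simplification for illustration; a fully trained NN would also optimize over the γj (a nonlinear problem). The DGP and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k_n = 1000, 2, 50                           # illustrative choices
X = rng.uniform(-1, 1, (n, d))
Y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.standard_normal(n)  # hypothetical DGP

# Sieve sum_j beta_j * phi(x' gamma_j): gamma_j random and fixed, beta by OLS.
phi = np.tanh
Gamma = rng.standard_normal((d + 1, k_n))         # includes an intercept row
X1 = np.column_stack([X, np.ones(n)])
H = phi(X1 @ Gamma)                               # n x k_n matrix of basis functions
beta_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)

def m_hat(x):
    """Evaluate the fitted single-layer sieve at a point x (length-d array)."""
    x1 = np.append(np.asarray(x, dtype=float), 1.0)
    return phi(x1 @ Gamma) @ beta_hat

print(m_hat([0.5, -0.2]), np.sin(0.5) * np.cos(-0.2))  # estimate vs truth
```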
Choosing Sieve Class
▶ Short answer: it depends!
− on what you know about your “target function” (class)
▶ Popular choice: cubic splines with knots defined by quantiles
→ produces a “nice looking” curve
→ for each coordinate, this has J = 7 terms if the knots are set at the quartiles (a construction sketch follows below):
$$1,\ x,\ x^2,\ x^3,\ [x - z_{0.25}]_+^3,\ [x - z_{0.5}]_+^3,\ [x - z_{0.75}]_+^3$$
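▶ A minimal sketch of building this basis (the data are hypothetical and `cubic_spline_basis` is my name, not a library function):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3, and [x - z]_+^3 for each knot z."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - z, 0.0, None) ** 3 for z in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
X = rng.standard_normal(300)                 # hypothetical regressor
knots = np.quantile(X, [0.25, 0.50, 0.75])   # knots at the empirical quartiles
B = cubic_spline_basis(X, knots)             # design matrix for the series OLS
print(B.shape)                               # (300, 7): the J = 7 terms
```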
Choosing Sieve Dimension kn
▶ Usually done by cross validation (CV)
→ similar to choice of bandwidth hn through CV in PS10
▶ Divide the data into K “folds”:
− e.g. K = 5 or K = 10, or even K = n (“leave-one-out”)
− denote the observations in the j-th fold by Ij
− for each fold j and each k, take
$$\hat\beta^j(k) = \arg\min_\beta \frac{1}{n - |I_j|} \sum_{i \notin I_j} \left(Y_i - m_{k,\beta}(X_i)\right)^2,$$
$$e^j(k) = \frac{1}{|I_j|} \sum_{i \in I_j} \left(Y_i - m_{k,\hat\beta^j(k)}(X_i)\right)^2$$
− choose kn to minimize $\frac{1}{K} \sum_{j=1}^{K} e^j(k)$
→ an estimate of the “out-of-sample” error given sample size n
▶ Under mild conditions, one can show that the estimator with k̂n selected by CV converges at the optimal rate (a minimal implementation sketch follows below)
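▶ A sketch of this procedure for the polynomial series estimator (the DGP, the grid of k values, and K = 5 are my illustrative choices):

```python
import numpy as np

def cv_choose_k(X, Y, k_grid, K=5, seed=0):
    """K-fold cross-validation over the polynomial order k."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv_err = []
    for k in k_grid:
        errs = []
        for idx in folds:                                  # idx = fold I_j
            train = np.setdiff1d(np.arange(n), idx)
            P = np.vander(X[train], N=k, increasing=True)
            beta, *_ = np.linalg.lstsq(P, Y[train], rcond=None)  # fit off the fold
            P_out = np.vander(X[idx], N=k, increasing=True)
            errs.append(np.mean((Y[idx] - P_out @ beta) ** 2))   # e^j(k)
        cv_err.append(np.mean(errs))                       # average over folds
    return k_grid[int(np.argmin(cv_err))]

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, 400)
Y = np.exp(X) + 0.2 * rng.standard_normal(400)             # hypothetical DGP
print(cv_choose_k(X, Y, k_grid=list(range(2, 11))))
```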
L23-4: Inference at a Point
Inference at a Point
▶ We might also be interested in constructing standard errors and
confidence sets for m (·)
− e.g. standard errors for the estimate of the RD parameter
$$E[TE_i \mid R_i = 0] = \lim_{r \downarrow 0} E[Y_i \mid R_i = r] - \lim_{r \uparrow 0} E[Y_i \mid R_i = r]$$
− or standard errors for the CATE
$$E[TE_i \mid X_i = x^*] = E[Y_i \mid T_i = 1, X_i = x^*] - E[Y_i \mid T_i = 0, X_i = x^*]$$
for some value x* of interest
Inference at a Point
▶ For the optimal choice of bandwidth hn or sieve dimension kn, we will have
$$r_n \left(\hat m(x) - m(x)\right) \xrightarrow{d} N\left(b(x),\ \sigma(x)\right)$$
where:
− rn is the rate of convergence,
− b(x) is the asymptotic bias,
− σ(x) is the asymptotic variance, with σ̂(x) denoting some estimator of it
▶ e.g. for the NW estimator with scalar x, we have
$$n^{2/5} \left(\hat m(x) - m(x)\right) \xrightarrow{d} N\left(\frac{1}{2} m''(x) + \frac{f'(x)}{f(x)} m'(x),\ \frac{R_2 \, \sigma_\epsilon^2(x)}{f(x)}\right)$$
Dealing with the Bias
▶ σ̂(x) is usually straightforward to obtain: use the fact that
− kernel regressions are just weighted least squares,
− series regressions are just OLS
− see, e.g., the Hansen textbook for explicit formulas for kernel/series estimators (a minimal sketch for the series case follows below)
▶ The problem is usually the bias:
− it depends on unknown features of the DGP
− it is generally very hard, if possible at all, to “estimate” the bias and “de-bias”
▶ One trick is “under-smoothing”:
− in kernel regression (mentioned before): set the rate of hn to be smaller than the optimal rate hn*
− in sieve/series regression: set the rate of kn to be larger than the optimal rate kn*
− so that the bias vanishes asymptotically relative to the variance
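▶ Since the series estimator is just OLS, a pointwise standard error can be built from a standard heteroskedasticity-robust sandwich: $\widehat{se}(x) = \sqrt{p(x)' \hat V p(x)}$, where p(x) is the basis vector at x. A minimal sketch (HC0 version, ignoring the bias issue just discussed; the DGP is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 500, 5
X = rng.uniform(-1, 1, n)
Y = X**2 + 0.3 * rng.standard_normal(n)           # hypothetical DGP

P = np.vander(X, N=k, increasing=True)            # series design matrix
beta, *_ = np.linalg.lstsq(P, Y, rcond=None)
e = Y - P @ beta                                  # OLS residuals

# HC0 sandwich: (P'P)^{-1} P' diag(e^2) P (P'P)^{-1}
PtP_inv = np.linalg.inv(P.T @ P)
V = PtP_inv @ (P.T * e**2) @ P @ PtP_inv

def se_hat(x):
    """Pointwise standard error sqrt(p(x)' V p(x))."""
    p = np.vander(np.atleast_1d(x), N=k, increasing=True)[0]
    return np.sqrt(p @ V @ p)

x0 = 0.5
m0 = np.vander([x0], N=k, increasing=True)[0] @ beta
print(m0, se_hat(x0))                             # estimate and its SE at x = 0.5
```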
Under-smoothing: “Bad Idea”
▶ Common suggestion: easy to implement, then use $\hat m(x) \pm 1.96 \, \widehat{se}(x)$
▶ Unattractive for both theoretical and practical reasons
▶ Theoretical:
− under-smoothing means a sub-optimal rate of convergence
− in large samples, the under-smoothed approach is arbitrarily worse
− in the end, asymptotics are intended as an approximation of the finite-sample distribution
− undersmoothing ↔ “my asymptotic approximation neglects the bias”
▶ Practical:
− no guidance on how to choose the degree of under-smoothing
− “I’m promising I’d take the model to be even more flexible if I had more data, but why should this make my estimates for a given sample more convincing?”
Alternative: Bounding the Bias
▶ An alternative approach is to derive bounds on the bias:
− based on a priori assumptions on, e.g., the smoothness of the conditional expectation m(·)
− recently discussed by Armstrong and Kolesár (2018, 2020)
▶ A smoothness assumption is imposed anyway for nonparametric estimation
→ by doing nonparametric estimation at all, I’m already expressing a view that m(·) is reasonably “nice”
→ with the smoothness assumption made explicit, we can derive bounds on the bias b(·)
“Honest” CIs Based on Bias Bounds
▶ Once we impose common smoothness assumptions (e.g. bounds on derivatives),
→ we can derive the worst-case bias b̄ for a given m̂,
→ which can be used to adjust critical values:
▶ For a two-sided CI,
− use the 1 − α quantile of $\left|N\left(\bar b / \hat\sigma,\ 1\right)\right|$ as the critical value,
− rather than the 1 − α quantile of $|N(0, 1)|$
− this ensures correct coverage for any bias ≤ b̄
− e.g. for a kernel estimator with the $n^{-2/5}$ rate, the critical value moves from 1.96 to 2.18 (a computational sketch follows below)
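▶ A minimal sketch of computing this critical value, i.e. the 1 − α quantile of |N(t, 1)| with t = b̄/σ̂ (the function name `honest_cv` is mine; it solves the quantile equation numerically):

```python
from scipy.stats import norm
from scipy.optimize import brentq

def honest_cv(t, alpha=0.05):
    """1 - alpha quantile of |N(t, 1)|: the critical value c solving
    P(|Z + t| <= c) = 1 - alpha for Z ~ N(0, 1), where t = worst-case bias / se."""
    f = lambda c: norm.cdf(c - t) - norm.cdf(-c - t) - (1 - alpha)
    return brentq(f, 0.0, 20.0 + abs(t))          # bracket is wide enough for any t

print(round(honest_cv(0.0), 2))   # 1.96: zero bias recovers the usual critical value
print(round(honest_cv(1.0), 2))   # larger value, covering any |bias|/se up to 1
```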
References
▶ Stone, C. (1982), “Optimal Global Rates of Convergence for Nonparametric Regression,” Annals of Statistics, 10(4), 1040-1053.
▶ Newey, W. (1997), “Convergence Rates and Asymptotic Normality for Series Estimators,” Journal of Econometrics, 79(1), 147-168.
▶ Chen, X. (2007), “Large Sample Sieve Estimation of Semi-Nonparametric Models,” Handbook of Econometrics, Volume 6B.
▶ Belloni, A., Chernozhukov, V., Chetverikov, D., & Kato, K. (2015), “Some New Asymptotic Theory for Least Squares Series: Pointwise and Uniform Results,” Journal of Econometrics, 186(2), 345-366.
▶ Chen, X., & Christensen, T. M. (2018), “Optimal Sup-Norm Rates and Uniform Inference on Nonlinear Functionals of Nonparametric IV Regression,” Quantitative Economics, 9(1), 39-84.
▶ Armstrong, T. B., & Kolesár, M. (2018), “Optimal Inference in a Class of Regression Models,” Econometrica, 86(2), 655-683.
▶ Armstrong, T. B., & Kolesár, M. (2020), “Simple and Honest Confidence Intervals in Nonparametric Regression,” Quantitative Economics, 11(1), 1-39.