ECON 2140: Econometric Methods
Lecture 23: Nonparametric Estimation: Sieve Estimators
Wayne Gao
ECON 2140, Spring 2022

Outline for Lecture 23
 L23-1: Introduction to Sieve Approximation
 L23-2: Polynomial Series Regression
 L23-3: Sieve Estimation More Generally
 L23-4: Inference at a Point

L23-1: Introduction to Sieve Approximation

Local vs Global
▶ Nonparametric regression: m(x) = E[Y_i | X_i = x]
▶ Recall that kernel methods are "local" in nature:
 − they seek to approximate m(x) using observations locally around x,
 − for each x in the relevant domain X
▶ Today we will talk about sieve methods, which are "global":
 − they seek to approximate the whole function m on a domain X
 − using a sequence of whole functions m_1, m_2, m_3, ...

"Sieve", loosely speaking: "a grid on functions"

Weierstrass Approximation Theorem
▶ The idea of "sieves" dates back to:
 − ancient Egypt, for grain sieving...
 − or the 19th century, for the famous "approximation theory" result:
▶ Weierstrass Approximation Theorem:
 − let m be any continuous function on [a, b] ⊆ R,
 − let ε > 0;
 − then there exists a k-th order polynomial of the form
   m_{k,β}(x) = β_0 + β_1 x + β_2 x² + ... + β_k x^k
   for some β_0, β_1, ..., β_k ∈ R and some k ∈ N,
 − such that ∥m_{k,β} − m∥_∞ := sup_{x ∈ [a,b]} |m_{k,β}(x) − m(x)| < ε.
▶ Loosely speaking, the above says: for any ε > 0,
 − you can use polynomials to build a "grid of functions"
 − with "holes" no larger than ε.

L23-2: Polynomial Series Regression

Polynomial Series Regression
▶ Motivated by the above, we may consider estimating m(x) = E[Y_i | X_i = x] with a polynomial series, i.e.
▶ For scalar x ∈ R, use
   m_{k,β̂}(x) = β̂_0 + β̂_1 x + β̂_2 x² + ... + β̂_{k−1} x^{k−1}
 − for some chosen order k,
 − with β̂ given by β̂_n := argmin_β (1/n) Σ_{i=1}^n (Y_i − m_{k,β}(X_i))²
 → i.e., just run an OLS regression of Y_i on 1, X_i, ..., X_i^{k−1}
▶ For vector-valued X_i ∈ R^d, we need to include "interaction terms"
   1, X_{i,1}, ..., X_{i,d}, ......, Π_{j=1}^d X_{i,j}^{J−1},
 in which case we have k = J^d terms
 → one can already sense the "curse of dimensionality" here.

Consistency
▶ Under appropriate conditions, the following can be stated in either of:
 − the L₂ norm ∥f∥₂ := (E[f(X_i)²])^{1/2}
 − the sup norm ∥f∥_∞ := sup_x |f(x)|
▶ From our extremum consistency results, for fixed k, as n → ∞,
   β̂_n →p β*_k := argmin_β E[(m(X_i) − m_{k,β}(X_i))²],
 or in other words,
   ∥m_{k,β̂_n} − m_{k,β*_k}∥ →p 0.
▶ The Weierstrass Approximation Theorem guarantees that
   lim_{k→∞} ∥m_{k,β*_k} − m∥ = 0.
▶ Hence, as n → ∞, if we let k_n → ∞ "slowly enough relative to n", then m_{k_n,β̂_n} is consistent for m:
   ∥m_{k_n,β̂_n} − m∥ ≤ ∥m_{k_n,β̂_n} − m_{k_n,β*_{k_n}}∥ + ∥m_{k_n,β*_{k_n}} − m∥ →p 0.

Rate of Convergence
▶ Very loosely, one can think of k_n as playing the role of the inverse bandwidth 1/h_n in kernel regression.
▶ Results on rates of convergence for series estimators are:
 − derived in, e.g., Newey (1997) for polynomial (and spline) estimators,
 − surveyed in, e.g., Chen (2007) for many other series/sieve estimators,
 − "improved" in, e.g., Belloni et al. (2015) for polynomial/spline...
▶ Under some regularity conditions and
 − "smoothness": m(x) is s-times continuously differentiable,
 − "slow enough growth": k_n² log(k_n)/n → 0 (Belloni et al., 2015),
 then
 − ∥m_{k_n,β̂_n} − m∥₂ = O_p( k_n^{−s/d} + (k_n/n)^{1/2} )
 − ∥m_{k_n,β̂_n} − m∥_∞ = O_p( k_n^{−s/d} + k_n (k_n/n)^{1/2} )
 so the rates improve with the smoothness s and deteriorate with the dimension d.

Optimal Rate of Convergence
▶ Again, we can optimize over the rate of k_n as n → ∞.
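As a concrete sketch of the estimator itself — on hypothetical simulated data of my choosing, using `numpy` only, not code from the slides — the scalar-x polynomial series estimator is just OLS of Y_i on 1, X_i, ..., X_i^{k−1}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DGP (illustrative assumption): m(x) = sin(2*pi*x) plus noise.
n = 500
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

def poly_series_fit(x, y, k):
    """OLS of y on 1, x, ..., x^(k-1); returns the k coefficients beta_hat."""
    X = np.vander(x, N=k, increasing=True)  # columns: 1, x, ..., x^(k-1)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

def poly_series_predict(beta_hat, x_new):
    """Evaluate m_{k, beta_hat} at the points x_new."""
    return np.vander(x_new, N=len(beta_hat), increasing=True) @ beta_hat

beta_hat = poly_series_fit(x, y, k=8)
grid = np.linspace(0.0, 1.0, 201)
m_hat = poly_series_predict(beta_hat, grid)  # estimated regression function on a grid
```

Here the order k = 8 is fixed by hand purely for illustration; the rate results concern how k_n should grow with n, and a data-driven choice via cross-validation appears later in the lecture.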
 → The optimal rate for k_n = J_n^d:
   k_n* ∝ n^{d/(2s+d)},  J_n* ∝ n^{1/(2s+d)}
 → With the optimal k_n*:
   ∥m_{k_n*,β̂_n} − m∥₂ = O_p( n^{−s/(2s+d)} ),  ∥m_{k_n*,β̂_n} − m∥_∞ = O_p( n^{−(s−d)/(2s+d)} )

Optimal Rate of Convergence
▶ Recall the Stone (1982) bound on the optimal rate of convergence.
 → The polynomial series estimator m_{k_n*,β̂_n}:
 − attains the optimal rate n^{−s/(2s+d)} under the L₂ norm,
 − does not attain the optimal rate (n/log n)^{−s/(2s+d)} under the sup norm,
 → which, however, can be attained using other types of sieves
 → so polynomial series are suboptimal in this sense.

L23-3: Sieve Estimation More Generally

Approximation of Smooth Functions
▶ The mathematical field of "approximation theory" studies how well one class of functions can approximate another class of functions.
 − It has nothing to do with "statistical uncertainty" or "sampling randomness".
▶ In the sieve nonparametric regression context, we:
 − usually assume that the function of interest m(x) = E[Y_i | X_i = x] lies within some "smoothness class" of functions,
 − to be approximated by functions in (a sequence of) sieve spaces.
▶ How small the approximation error is depends on:
 − the structure/smoothness of the target class,
 − the structure/configuration of the sieve space,
 − the dimension of x,
 − the norm (L₂, sup, etc.).

Smoothness Classes "of Order s"
▶ In the Weierstrass Approximation Theorem, we assumed that "m is a continuous function on [a, b] ⊆ R", which is a "smoothness/regularity" condition.
▶ Usually, such smoothness/regularity conditions take the form of bounds imposed on the magnitude of the function and its derivatives up to a certain order.
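How much the smoothness of the target matters for the approximation error can be seen in a purely numerical experiment — no data, no sampling noise, just polynomial fits on a dense grid. This is an illustrative sketch; the target functions exp(x) and |x − 1/2| are my choices, not from the slides:

```python
import numpy as np

# Dense grid on [0, 1]: this isolates pure approximation error (no sampling noise).
x = np.linspace(0.0, 1.0, 2001)
targets = {
    "smooth": np.exp(x),        # infinitely differentiable: large s
    "kink": np.abs(x - 0.5),    # Lipschitz but not differentiable: s = 1, loosely
}

sup_err = {}
for name, m in targets.items():
    errs = []
    for k in (2, 5, 11):  # k series terms, i.e. polynomial degrees 1, 4, 10
        coefs = np.polynomial.polynomial.polyfit(x, m, deg=k - 1)
        fit = np.polynomial.polynomial.polyval(x, coefs)
        errs.append(np.max(np.abs(fit - m)))  # sup-norm approximation error
    sup_err[name] = errs
```

For the smooth target the sup-norm error collapses rapidly in k, while for the kinked target it shrinks only slowly — the numerical counterpart of rates that improve with s.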
▶ Example: the Hölder space of order s — loosely speaking,
 − uniformly bounded (partial) derivatives exist up to order s,
 − and the order-s (partial) derivatives are Lipschitz continuous.
▶ Another popular smoothness class is the "Sobolev space".
▶ See, e.g., Chen (2007) for exact definitions
 → the exact definition of the order of smoothness "s" is trickier.

Some Classes of Linear Sieves
▶ We have talked about polynomials earlier.
▶ There are transformed types of polynomials (orthogonal, Hermite, ...).
▶ There are also other types of linear sieves (series):
▶ Fourier/cosine/sine:
   {1, (cos(2πjx), sin(2πjx))_{j=1}^{(k_n−1)/2}}
▶ Splines: e.g., a cubic spline with knots z_1, z_2, ...:
   1, x, x², x³, [x − z_1]₊³, ..., [x − z_{k_n−4}]₊³
▶ Wavelets...
▶ Local polynomial partitions...
▶ And many more...

Some Classes of Linear Sieves
▶ Existing results show that the Stone optimal rates
   ∥m̂* − m∥₂ = O_p( n^{−s/(2s+d)} ),  ∥m̂* − m∥_∞ = O_p( (n/log n)^{−s/(2s+d)} )
 are attainable by:
 − Fourier series, splines, wavelets, local polynomial partitions...
 − in i.i.d. and time-series settings...
 − see, e.g., Newey (1997), ..., Chen (2007), Belloni et al. (2015), Chen & Christensen (2015)
▶ General takeaway:
 − do not use polynomials if you care about uniform/sup-norm convergence...

Nonlinear Sieve: Neural Networks
▶ So far we have been talking about linear sieve (series) estimators.
▶ There are also nonlinear sieves, the most notable of which is...
▶ The neural network (NN):
 − single-layer NN: linear combinations of nonlinear transformations,
   Σ_{j=1}^{k_n} β_j φ(x′γ_j)
 − multi-layer (deep) NN: a recursion of the above
 − NNs "perform surprisingly well" in many senses...
▶ Well, an NN cannot beat n^{−s/(2s+d)}, so what are people talking about?
 − There is (a lot of) research on how NNs approximate certain target function classes, usually ones with more structure or nicer properties than the "smoothness classes" defined earlier,
 − exploring conditions under which neural nets face "no curse of dimensionality".
 − We will talk about this a bit in the last lecture if time allows...

Choosing the Sieve Class
▶ Short answer: it depends!
 − on what you know about your "target function" (class)
▶ Popular choice: cubic splines with knots defined by quantiles
 → they produce a "nice looking" curve
 → for each coordinate, J = 7 terms if setting knots at the quartiles:
   1, x, x², x³, [x − z_{0.25}]₊³, [x − z_{0.5}]₊³, [x − z_{0.75}]₊³

Choosing the Sieve Dimension k_n
▶ Usually done by cross-validation (CV)
 → similar to the choice of bandwidth h_n through CV in PS10
▶ Divide the data into K "folds":
 − e.g., K = 5 or K = 10, or even K = n ("leave-one-out"),
 − denote the observations in the j-th fold by I_j,
 − for each fold j and each k, take
   β̂^j(k) = argmin_β (1/(n − |I_j|)) Σ_{i ∉ I_j} (Y_i − m_{k,β}(X_i))²
   e^j(k) = (1/|I_j|) Σ_{i ∈ I_j} (Y_i − m_{k,β̂^j(k)}(X_i))²
 − choose k_n to minimize (1/K) Σ_{j=1}^K e^j(k)
 → an estimate of the "out-of-sample" prediction error given sample size n
▶ Under mild conditions, one can show that the estimator based on the CV-selected k̂_n converges at the optimal rate.

L23-4: Inference at a Point

Inference at a Point
▶ We might also be interested in constructing standard errors and confidence sets for m(·):
 − e.g., standard errors for the estimate of the RD parameter
   E[TE_i | R_i = 0] = lim_{r↓0} E[Y_i | R_i = r] − lim_{r↑0} E[Y_i | R_i = r]
 − or standard errors for the CATE
   E[TE_i | X_i = x*] = E[Y_i | T_i = 1, X_i = x*] − E[Y_i | T_i = 0, X_i = x*]
   for some value x* of interest.

Inference at a Point
▶ For the optimal choice of bandwidth h_n or sieve dimension k_n, we will have
   r_n (m̂(x) − m(x)) →d N(b(x), σ²(x))
 where:
 − r_n is the rate of convergence,
 − b(x) is the asymptotic bias,
 − σ²(x) is the asymptotic variance, estimated by some σ̂²(x).
▶ E.g., for the NW estimator with scalar x, we have
   n^{2/5} (m̂(x) − m(x)) →d N( (1/2) m″(x) + (f′(x)/f(x)) m′(x),  R₂ σ_ε²(x)/f(x) )

Dealing with the Bias
▶ σ̂(x) is usually straightforward to obtain — use the fact that:
 − kernel regressions are just weighted least squares,
 − series regressions are just OLS,
 − see, e.g., the Hansen textbook for explicit formulas for kernel/series estimators.
▶ The problem is usually the bias:
 − it depends on unknown features of the DGP,
 − it is generally very hard, if possible at all, to "estimate" the bias and "de-bias".
▶ One trick is "under-smoothing":
 − in kernel regression (mentioned before): set the rate of h_n to be smaller than the optimal rate h_n*,
 − in sieve/series regression: set the rate of k_n to be larger than the optimal rate k_n*,
 − so that the bias vanishes asymptotically relative to the variance.

Under-smoothing: a "Bad Idea"
▶ Common suggestion: easy to implement — just use m̂(x) ± 1.96 ŝe(x)
▶ But it is unattractive for both theoretical and practical reasons.
▶ Theoretical:
 − under-smoothing means a sub-optimal rate of convergence,
 − in large samples, the under-smoothed approach is infinitely worse,
 − in the end, asymptotics are intended as an approximation of the finite-sample distribution,
 − under-smoothing ↔ "my asymptotic approximation neglects the bias".
▶ Practical:
 − no guidance on how to choose the degree of under-smoothing,
 − "I'm promising I'd take the model to be even more flexible if I had more data, but why should this make my estimates for a given sample more convincing?"

Alternative: Bounding the Bias
▶ An alternative approach is to derive bounds on the bias:
 − based on a-priori assumptions on, e.g., the smoothness of the conditional expectation m(·),
 − recently discussed by Armstrong and Kolesár (2018, 2020).
▶ A smoothness assumption is imposed anyway for nonparametric estimation
 → by doing nonparametric estimation at all, I'm already expressing a view that m(·) is reasonably "nice"
 → with the smoothness assumption made explicit, we can derive bounds on the bias b(·).

"Honest" CIs Based on Bias Bounds
▶ Once we impose common smoothness assumptions (e.g., bounds on derivatives),
 → we can derive the worst-case bias b̄ for a given m̂,
 → which can be used to adjust critical values.
▶ For a two-sided CI:
 − use the 1 − α quantile of |N(b̄/σ̂, 1)| as the critical value,
 − rather than the quantiles of |N(0, 1)|,
 − which ensures correct coverage for any bias ≤ b̄,
 − e.g., for a kernel estimator with the n^{−2/5} rate, the critical value rises from 1.96 to 2.18.

References
▶ Stone, C. (1982), "Optimal Global Rates of Convergence for Nonparametric Regression," Annals of Statistics, 10(4), 1040–1053.
▶ Newey, W. (1997), "Convergence Rates and Asymptotic Normality for Series Estimators," Journal of Econometrics, 79(1), 147–168.
▶ Chen, X. (2007), "Large Sample Sieve Estimation of Semi-Nonparametric Models," Handbook of Econometrics, Volume 6B.
▶ Belloni, A., Chernozhukov, V., Chetverikov, D., & Kato, K. (2015), "Some New Asymptotic Theory for Least Squares Series: Pointwise and Uniform Results," Journal of Econometrics, 186(2), 345–366.
▶ Chen, X., & Christensen, T. M. (2018), "Optimal Sup-norm Rates and Uniform Inference on Nonlinear Functionals of Nonparametric IV Regression," Quantitative Economics, 9(1), 39–84.
▶ Armstrong, T. B., & Kolesár, M. (2018), "Optimal Inference in a Class of Regression Models," Econometrica, 86, 655–683.
▶ Armstrong, T. B., & Kolesár, M. (2020), "Simple and Honest Confidence Intervals in Nonparametric Regression," Quantitative Economics, 11(1), 1–39.