Prediction variance in Linear Regression

• Assumptions on the noise in linear regression allow us to estimate the prediction variance due to the noise at any point.
• Prediction variance is usually large when you are far from a data point.
• We distinguish between interpolation, when we are inside the convex hull of the data points, and extrapolation, when we are outside it.
• Extrapolation is associated with larger errors, and in high dimensions it usually cannot be avoided.

Linear Regression

• The surrogate is a linear combination of $n_b$ given shape functions: $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$
• For a linear approximation, $\xi_1 = 1$ and $\xi_2 = x$.
• The difference (error) between the $n_y$ data points and the surrogate is $e_i = y_i - \sum_{j=1}^{n_b} b_j \xi_j(x_i)$, or in matrix form $e = y - Xb$.
• Minimize the square error $e^T e = (y - Xb)^T (y - Xb)$.
• Differentiate to obtain the normal equations $X^T X b = X^T y$.

Model based error for linear regression

• The common assumptions for linear regression:
– The true function is described by the functional form of the surrogate.
– The data is contaminated with normally distributed error with the same standard deviation at every point.
– The errors at different points are not correlated.
• Under these assumptions, the noise standard deviation (called the standard error) is estimated as $\hat{\sigma} = \sqrt{e^T e / (n_y - n_b)}$.
• $\hat{\sigma}$ is used as an estimate of the prediction error.

Prediction variance

• Linear regression model: $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$.
• Define $x_i^{(m)} = \xi_i(x)$; then $\hat{y} = x^{(m)T} b$.
• With some algebra, $\mathrm{Var}[\hat{y}(x)] = \sigma^2\, x^{(m)T} (X^T X)^{-1} x^{(m)}$.
• Standard error: $s_{\hat{y}} = \hat{\sigma} \sqrt{x^{(m)T} (X^T X)^{-1} x^{(m)}}$.

Interpolation, extrapolation and regression

• Interpolation is often contrasted with regression or least-squares fitting.
• Just as important is the contrast between interpolation and extrapolation.
• Extrapolation occurs when we are outside the convex hull of the data points, that is, when $x$ cannot be written as $x = \sum_{i=1}^{n} \alpha_i x_i$ with $\sum_{i=1}^{n} \alpha_i = 1$ and $\alpha_i \ge 0$.
• For high dimensional spaces we must have extrapolation!

2D example of convex hull

• By generating 20 points at random in the unit square we end up with a substantial region near the origin where we will need to use extrapolation.
• Exercise: using the data in the notes, give a couple of alternative sets of $\alpha_i$, approximately, for the point (0.4, 0.4).

[Figure: 20 random points in the unit square and their convex hull.]

Example of prediction variance

• For a linear polynomial response surface $y = b_1 + b_2 x_1 + b_3 x_2$, find the prediction variance in the region $-1 \le x_1 \le 1$, $-1 \le x_2 \le 1$.
• (a) For data at three vertices (omitting (1,1)): $x_1^T = (-1,-1)$, $x_2^T = (-1,1)$, $x_3^T = (1,-1)$, so that

$X = \begin{pmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix}, \qquad X^T X = \begin{pmatrix} 3 & -1 & -1 \\ -1 & 3 & -1 \\ -1 & -1 & 3 \end{pmatrix}$

• With $x^{(m)T} = (1, x_1, x_2)$ and $(X^T X)^{-1} = 0.25 \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$,

$s_{\hat{y}} = \hat{\sigma} \sqrt{x^{(m)T} (X^T X)^{-1} x^{(m)}} = \hat{\sigma} \sqrt{0.5\,(1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2)}$

[Figure: the square domain with the three data points at (-1,-1), (-1,1), and (1,-1).]

Interpolation vs. Extrapolation

• At the origin $s_{\hat{y}} = \hat{\sigma}/\sqrt{2}$; at the 3 data vertices $s_{\hat{y}} = \hat{\sigma}$; at (1,1), outside the convex hull, $s_{\hat{y}} = \sqrt{3}\,\hat{\sigma}$, using

$s_{\hat{y}} = \hat{\sigma} \sqrt{0.5\,(1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2)}$

Standard error contours

• The minimum error $s_{\hat{y}}$ is obtained by setting to zero the derivatives of the prediction variance with respect to $x_1$ and $x_2$: the minimum is at $x_1 = x_2 = -1/3$, where $s_{\hat{y}} = \hat{\sigma}/\sqrt{3}$.
• What is special about this point?
• Contours of the prediction variance provide more detail.

[Figure: contours of $s_{\hat{y}}/\hat{\sigma}$ for the three-point design, rising from about 0.6 near (-1/3, -1/3) to 1.6 toward the (1,1) corner.]
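These values are easy to check numerically. Below is a minimal sketch (assuming NumPy is available; the function name std_err_ratio is ours, not from the notes) that evaluates $s_{\hat{y}}/\hat{\sigma} = \sqrt{x^{(m)T}(X^T X)^{-1} x^{(m)}}$ for the three-vertex design at the points discussed above.

```python
import numpy as np

# Design matrix for the linear fit y = b1 + b2*x1 + b3*x2
# with data at three vertices of the square (omitting (1,1)).
X = np.array([[1.0, -1.0, -1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0]])
XtX_inv = np.linalg.inv(X.T @ X)

def std_err_ratio(x1, x2):
    """Return s_yhat / sigma_hat at the point (x1, x2)."""
    xm = np.array([1.0, x1, x2])   # shape-function vector x^(m)
    return np.sqrt(xm @ XtX_inv @ xm)

print(std_err_ratio(0.0, 0.0))     # origin: 1/sqrt(2) ~ 0.707
print(std_err_ratio(-1.0, -1.0))   # a data vertex: 1.0
print(std_err_ratio(1.0, 1.0))     # extrapolation corner: sqrt(3) ~ 1.732
print(std_err_ratio(-1/3, -1/3))   # minimum: 1/sqrt(3) ~ 0.577
```

Evaluating std_err_ratio on a grid and contouring it reproduces the plot sketched above.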
Data at four vertices

• Now $x_1^T = (-1,-1)$, $x_2^T = (-1,1)$, $x_3^T = (1,-1)$, $x_4^T = (1,1)$, so that

$X = \begin{pmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \\ 1 & 1 & 1 \end{pmatrix}, \qquad X^T X = 4 \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

• And $x^{(m)T} (X^T X)^{-1} x^{(m)} = 0.25\,(1 + x_1^2 + x_2^2)$.
• Error at the vertices: $s_{\hat{y}} = \frac{\sqrt{3}}{2}\hat{\sigma}$.
• At the origin the minimum is $s_{\hat{y}} = \frac{1}{2}\hat{\sigma}$.
• How can we reduce the error without adding points?

Graphical Comparison of Standard Errors

[Figure: contours of $s_{\hat{y}}/\hat{\sigma}$ for the three-point design (left, contour levels 0.6 to 1.6) and the four-point design (right, contour levels 0.55 to 0.8).]

Homework

• Redo the four-point example when the data points are not at the corners but inside the domain, at ±0.8. What does the difference in the results tell you?
• For a grid of 3×3 data points, compare the standard errors for linear and quadratic fits.
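As a starting point for the homework (a sketch under our own naming, not part of the assignment), the same computation generalizes to any set of data points and shape functions. The version below reproduces the four-vertex values, $\sqrt{3}/2 \approx 0.866$ at the corners and $1/2$ at the origin.

```python
import numpy as np

def design_matrix(points, shapes):
    """Rows are the shape functions evaluated at each data point."""
    return np.array([[f(p) for f in shapes] for p in points])

def std_err_ratio(x, points, shapes):
    """s_yhat / sigma_hat at x for a least-squares fit to the given points."""
    X = design_matrix(points, shapes)
    xm = np.array([f(x) for f in shapes])
    return np.sqrt(xm @ np.linalg.solve(X.T @ X, xm))

# Linear shape functions: 1, x1, x2.
linear = [lambda p: 1.0, lambda p: p[0], lambda p: p[1]]
corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

print(std_err_ratio((1, 1), corners, linear))   # sqrt(3)/2 ~ 0.866
print(std_err_ratio((0, 0), corners, linear))   # 1/2

# For the homework, replace `corners` with points at +-0.8, or with a
# 3x3 grid, and extend `linear` with the quadratic terms x1**2, x1*x2, x2**2.
```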