4.2.1 Interpolation, extrapolation and prediction variance

Prediction Variance in Linear Regression
• Assumptions on noise in linear regression allow
us to estimate the prediction variance due to the
noise at any point.
• Prediction variance is usually large when you are far from the data points.
• We distinguish between interpolation, when we are inside the convex hull of the data points, and extrapolation, when we are outside it.
• Extrapolation is associated with larger errors, and
in high dimensions it usually cannot be avoided.
Linear Regression
• Surrogate is a linear combination of $n_b$ given shape functions:
$$\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$$
• For a linear approximation, $\xi_1 = 1$, $\xi_2 = x$.
• Difference (error) between the $n_y$ data points and the surrogate:
$$e_i = y_i - \sum_{j=1}^{n_b} b_j \xi_j(x_i), \qquad \mathbf{e} = \mathbf{y} - X\mathbf{b}$$
• Minimize the square error
$$\mathbf{e}^T \mathbf{e} = (\mathbf{y} - X\mathbf{b})^T (\mathbf{y} - X\mathbf{b})$$
• Differentiate to obtain the normal equations $X^T X \mathbf{b} = X^T \mathbf{y}$ (a code sketch follows).
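As an illustration (not part of the original slides), here is a minimal NumPy sketch of this fit for the linear one-variable case; the data values are made up for the example:

```python
import numpy as np

# Shape functions for the linear case: xi_1(x) = 1, xi_2(x) = x
def build_X(x):
    """Assemble X with X[i, j] = xi_j(x_i)."""
    return np.column_stack([np.ones_like(x), x])

# Made-up data for illustration
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

X = build_X(x)
b = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations X^T X b = X^T y
e = y - X @ b                          # residuals e = y - Xb
print("coefficients:", b)
```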
Model-based error for linear regression
• The common assumptions for linear regression
– The true function is described by the functional form
of the surrogate.
– The data is contaminated with normally distributed
error with the same standard deviation at every point.
– The errors at different points are not correlated.
• Under these assumptions, the noise standard deviation (called the standard error) is estimated as
$$\hat{\sigma}^2 = \frac{\mathbf{e}^T \mathbf{e}}{n_y - n_b}$$
• $\hat{\sigma}$ is used as an estimate of the prediction error.
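Continuing the sketch above (an illustration, not from the slides), $\hat{\sigma}$ follows directly from the residual vector:

```python
n_y, n_b = X.shape
sigma_hat = np.sqrt(e @ e / (n_y - n_b))  # sigma_hat^2 = e^T e / (n_y - n_b)
print("standard error:", sigma_hat)
```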
Prediction variance
• Linear regression model:
$$\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(x)$$
• Define $x_i^{(m)} = \xi_i(x)$; then $\hat{y} = x^{(m)T} \mathbf{b}$.
• With some algebra,
$$\mathrm{Var}[\hat{y}(x)] = x^{(m)T} \Sigma_b\, x^{(m)} = \sigma^2\, x^{(m)T} (X^T X)^{-1} x^{(m)}$$
• Standard error (sketch below):
$$s_y = \hat{\sigma} \sqrt{x^{(m)T} (X^T X)^{-1} x^{(m)}}$$
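A minimal sketch of this formula (not from the slides; the helper name `prediction_std` is hypothetical, and `x_m` holds the shape functions evaluated at the prediction point):

```python
import numpy as np

def prediction_std(X, x_m, sigma_hat):
    """s_y = sigma_hat * sqrt(x_m^T (X^T X)^{-1} x_m)."""
    return sigma_hat * np.sqrt(x_m @ np.linalg.inv(X.T @ X) @ x_m)

# For the linear fit above, the standard error at x = 1.2 would be
# prediction_std(X, np.array([1.0, 1.2]), sigma_hat)
```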
Interpolation, extrapolation and regression
• Interpolation is often contrasted with regression or least-squares fitting.
• Just as important is the contrast between interpolation and extrapolation.
• Extrapolation occurs when we are outside the convex hull of the data points:
n 1
x    i xi ,
i 1
n 1

i 1
i
 1,
 i  0,
• For high dimensional spaces we must have
extrapolation!
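One way to test for extrapolation is to search for feasible weights $\alpha_i$ satisfying the convex-combination conditions above. A sketch using SciPy's linear-programming routine (an illustration, not part of the slides):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(points, x):
    """True if x = sum_i alpha_i * points[i] with alpha_i >= 0, sum alpha_i = 1."""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones(n)])  # match each coordinate, plus sum-to-one
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.success

corners = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(in_convex_hull(corners, np.array([0.4, 0.4])))  # True: interpolation
print(in_convex_hull(corners, np.array([1.5, 0.5])))  # False: extrapolation
```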
2D example of convex hull
• By generating 20 points at random in the unit square we end up with a substantial region near the origin where we will need to use extrapolation.
• Using the data in the notes, give a couple of alternative sets of $\alpha_i$ that approximately reproduce the point (0.4, 0.4).
[Figure: 20 random points in the unit square and their convex hull; axes $x_1, x_2 \in [0, 1]$.]
Example of prediction variance
• For a linear polynomial response surface $y = b_1 + b_2 x_1 + b_3 x_2$, find the prediction variance in the region
$$-1 \le x_1 \le 1, \qquad -1 \le x_2 \le 1$$
• (a) For data at three vertices (omitting (1,1)):
$$x_1^T = (-1,-1), \quad x_2^T = (-1,1), \quad x_3^T = (1,-1)$$
$$x^{(m)T} = (1,\, x_1,\, x_2), \qquad X = \begin{pmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix}, \qquad X^T X = \begin{pmatrix} 3 & -1 & -1 \\ -1 & 3 & -1 \\ -1 & -1 & 3 \end{pmatrix}$$
$$(X^T X)^{-1} = 0.25 \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$
$$s_y^2 = \hat{\sigma}^2\, x^{(m)T} (X^T X)^{-1} x^{(m)} = 0.5\, \hat{\sigma}^2 \left(1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\right)$$
[Figure: the three data points in the square $[-1,1]^2$.]
Interpolation vs. Extrapolation
• At the origin, $s_y = \hat{\sigma}/\sqrt{2}$. At the three vertices, $s_y = \hat{\sigma}$. At (1,1), $s_y = \sqrt{3}\,\hat{\sigma}$.
$$s_y = \hat{\sigma} \sqrt{x^{(m)T} (X^T X)^{-1} x^{(m)}} = \hat{\sigma} \sqrt{0.5 \left(1 + x_1 + x_2 + x_1^2 + x_2^2 + x_1 x_2\right)}$$
[Figure: the three data points and the extrapolation point (1,1) in the square $[-1,1]^2$.]
Standard error contours
• Minimum error $s_y = \hat{\sigma}/\sqrt{3}$, obtained by setting to zero the derivatives of the prediction variance with respect to $x_1$ and $x_2$: the minimum is at $x_1 = x_2 = -1/3$ (checked numerically below).
• What is special about this point?
• Contours of prediction variance provide more detail.
[Figure: contours of $s_y/\hat{\sigma}$ over $[-1,1]^2$, ranging from about 0.6 near $(-1/3, -1/3)$ to 1.6 near (1,1).]
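As a quick check of the stated minimum (an illustration, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Prediction variance (in units of sigma_hat^2) for the three-vertex design
var = lambda p: 0.5 * (1 + p[0] + p[1] + p[0]**2 + p[1]**2 + p[0] * p[1])
res = minimize(var, x0=np.zeros(2))
print(res.x)             # approximately (-1/3, -1/3)
print(np.sqrt(res.fun))  # s_y/sigma_hat = 1/sqrt(3) ~ 0.577
```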
Data at four vertices
1
1
X 
1

1
• Now
x1T   1, 1 , x2T  1,1 , x3T  1, 1 , x4T  1,1
• And
x
( m )T
X X 
T
1
 1  1
1 0 0 
 1 1 
, X T X  4 0 1 0 
1 1 
0 0 1 

1 1 
x ( m)  0.25(1  x12  x22 ),
• Error at vertices
3
sy 
ˆ
2
• At the origin minimum is
1
s y  ˆ
2
• How can we reduce error without adding points?
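A short sketch (not from the slides) comparing the two designs makes the gain from the fourth point explicit:

```python
import numpy as np

def std_ratio(X, x1, x2):
    """s_y / sigma_hat for a linear response surface, x_m = (1, x1, x2)."""
    x_m = np.array([1., x1, x2])
    return np.sqrt(x_m @ np.linalg.inv(X.T @ X) @ x_m)

X3 = np.array([[1., -1., -1.], [1., -1., 1.], [1., 1., -1.]])
X4 = np.vstack([X3, [1., 1., 1.]])  # add the fourth vertex (1,1)

for pt in [(0, 0), (-1, -1), (1, 1)]:
    print(pt, std_ratio(X3, *pt), std_ratio(X4, *pt))
# Four points: 0.5 at the origin and sqrt(3)/2 ~ 0.866 at every vertex
```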
Graphical Comparison of Standard Errors
[Figure: contours of $s_y/\hat{\sigma}$ for the two designs over $[-1,1]^2$. Three points: contours from about 0.6 up to 1.6 near (1,1). Four points: contours from about 0.55 near the origin up to 0.8 near the vertices.]
Homework
• Redo the four-point example when the data points are not at the corners but inside the domain, at ±0.8. What does the difference in the results tell you?
• For a grid of 3×3 data points, compare the standard errors for linear and quadratic fits.