Nonlinear Estimation
Professor David H. Staelin
Massachusetts Institute of Technology
3/27/01
Case I: Nonlinear Physics
p̂ = [D1 D2] [1  d]ᵀ
[Figure: data d plotted against parameter p, with the best-fit "linear regression" line (coefficients D1, D2: intercept and slope), the optimum estimator p̂(d), and the probability distribution P{p} of the parameter p.]
Minimum Square Error (MSE) is usually the "optimum" sought.
It is important that the "training" data used for regression
be similar to, but independent of, the "test" data used for
performance evaluation.
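The sketch below illustrates Case I in Python (not from the slides): a quadratic forward model stands in for the nonlinear physics, the regression coefficients D1, D2 are fit by least squares on a training set, and the MSE is evaluated on an independent test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative nonlinear physics (assumption): d = p + 0.3*p^2 + noise
def forward(p):
    return p + 0.3 * p**2 + 0.1 * rng.standard_normal(p.shape)

# Independent training and test sets drawn from the same prior P{p}
p_train = rng.uniform(0.0, 2.0, 500)
p_test = rng.uniform(0.0, 2.0, 500)
d_train, d_test = forward(p_train), forward(p_test)

# Linear regression p_hat = [D1 D2] [1 d]^T, fit by least squares (minimum MSE)
A_train = np.column_stack([np.ones_like(d_train), d_train])
D, *_ = np.linalg.lstsq(A_train, p_train, rcond=None)

# Performance is evaluated on the independent test set, not the training set
p_hat = np.column_stack([np.ones_like(d_test), d_test]) @ D
print("test MSE:", np.mean((p_hat - p_test) ** 2))
```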
Case II: Non-Gaussian Statistics
[Figure: estimate p̂ versus observation d, comparing the best linear estimator, the maximum a posteriori probability ("MAP") estimator, and the best nonlinear estimator, which avoids p̂ < 0. Inset: the conditional distribution P{p | dA} for a particular observation dA; p̂(dA) minimizes the MSE given dA.]
Note: The linear estimator can yield p̂ < 0 (which is non-physical); the nonlinear estimator approaches the correct asymptotes as d → ±∞.
Note: The physics here is linear. Polynomials, etc., can be used for p̂(d), or recursive linear estimators that update D.
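A minimal numerical sketch of this comparison, assuming a unit exponential prior (so p ≥ 0), linear physics d = p + n, and Gaussian noise; none of these specifics come from the slides. The conditional-mean (MMSE) estimator is computed on a grid and compared with the best linear estimator, which can go negative.

```python
import numpy as np

# Assumptions for illustration: p >= 0 with a unit exponential prior,
# linear physics d = p + n, Gaussian noise n with standard deviation sigma_n.
p_grid = np.linspace(0.0, 20.0, 4001)
prior = np.exp(-p_grid)
sigma_n = 1.0

def mmse_estimate(d):
    """Nonlinear estimator: conditional mean E[p | d], evaluated on the grid."""
    w = prior * np.exp(-0.5 * ((d - p_grid) / sigma_n) ** 2)
    return np.sum(p_grid * w) / np.sum(w)

# Best linear estimator p_hat = a + b*d, from the prior moments of p and the noise
mean_p, var_p = 1.0, 1.0
b = var_p / (var_p + sigma_n**2)
a = mean_p - b * mean_p

for d in (-2.0, 0.0, 3.0):
    # The linear estimate goes negative for d = -2; the nonlinear one stays >= 0
    print(f"d = {d:+.1f}   linear: {a + b*d:+.3f}   nonlinear: {mmse_estimate(d):+.3f}")
```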
Linear Estimates for Nonlinear Problems
[Figure: two panels relating p to d1 and to d2; each shows the curved truth (without noise) and the optimum linear estimate.]
Linear estimates are suboptimum for d1 or d2 alone.
Say d1 = a0 + a1p + a2p² and d2 = b0 + b1p + b2p².
Then p² = (d2 − b0 − b1p)/b2, so
d1 = a0 + a1p + a2(d2 − b0 − b1p)/b2 = c0 + c1p + c2d2
(which defines a plane in (p, d1, d2)-space).
Therefore p̂ = p = (d1 − c0 − c2d2)/c1, where c1 = a1 − a2b1/b2 ≠ 0.
p̂(d1, d2) can be linear in d1, d2 and (in the noiseless case) perfect, even though p(d1) and p(d2) individually are nonlinear.
In general, the more varied the data d, the smaller the performance gap
between linear estimators and nonlinear ones, which are a superset.
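A short numerical check of this derivation (the coefficients a_i, b_i are chosen arbitrarily for illustration): a purely linear fit in d1 and d2 recovers p essentially exactly in the noiseless case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic physics as above; coefficients chosen arbitrarily for illustration
a0, a1, a2 = 0.5, 1.0, 0.4
b0, b1, b2 = -0.2, 2.0, 1.5

p = rng.uniform(-3.0, 3.0, 1000)
d1 = a0 + a1 * p + a2 * p**2          # noiseless observations
d2 = b0 + b1 * p + b2 * p**2

# Purely linear estimator p_hat = w0 + w1*d1 + w2*d2, fit by least squares
A = np.column_stack([np.ones_like(p), d1, d2])
w, *_ = np.linalg.lstsq(A, p, rcond=None)

print("max |p_hat - p|:", np.max(np.abs(A @ w - p)))    # ~0 (machine precision)
print("c1 = a1 - a2*b1/b2 =", a1 - a2 * b1 / b2)        # nonzero, so the plane exists
```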
Linear Estimates for Nonlinear Problems
[Figure: 3-D sketch in (p, d1, d2)-space showing p as a function of d1 and of d2 separately and the surface p(d1, d2), together with the error ∆p. The plane p̂(d1, d2) is the linear estimator; it has two angular degrees of freedom to use for minimizing ∆p.]
Generalization to Nth-Order Nonlinearities
Let
d1 = c1 + a11p + a12p² + … + a1npⁿ
d2 = c2 + a21p + a22p² + … + a2npⁿ
⋮
dn = cn + an1p + an2p² + … + annpⁿ
where d1, d2, …, dn are observed noise-free data related to p by nth-order polynomials.
Claim: In non-singular cases there exists an exact linear estimator p̂ = Dd + constant.
Note: For perfect linear estimation the number of different
observations must equal or exceed n, the maximum
order of the polynomials characterizing the physics.
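A minimal sketch of the claim, assuming the polynomial coefficients c_i, a_ij are known and the coefficient matrix is nonsingular: the exact linear estimator follows from solving Aᵀw = e1, so that w·d + constant reproduces p.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4                                   # polynomial order = number of observations

# Assumed known physics: d = c + A @ [p, p^2, ..., p^n], with A nonsingular
A = rng.standard_normal((n, n))
c = rng.standard_normal(n)

# Exact linear estimator p_hat = w @ d + w0 requires A^T w = e1 and w0 = -w @ c
e1 = np.zeros(n)
e1[0] = 1.0
w = np.linalg.solve(A.T, e1)
w0 = -w @ c

for p in (-1.7, 0.3, 2.5):              # check on noise-free data
    d = c + A @ np.array([p**k for k in range(1, n + 1)])
    print(p, "->", w @ d + w0)          # reproduces p exactly
```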
Nonlinear Estimators for Nonlinear Problems
Approaches:
1) p̂ = Ddaug, where d is augmented via polynomials
2) Rank reduction first via KLT
3) Neural nets
4) Iterations with D(p̂)
Case (1): daug = [1, d1, d2, d1², d2², d1d2, d1³], for example
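A sketch of approach (1), with an illustrative two-channel nonlinear forward model that is not from the slides: the data are augmented with the polynomial terms listed above and D is fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(3)

def augment(d1, d2):
    """Case (1) augmentation: daug = [1, d1, d2, d1^2, d2^2, d1*d2, d1^3]."""
    return np.column_stack([np.ones_like(d1), d1, d2,
                            d1**2, d2**2, d1 * d2, d1**3])

# Illustrative nonlinear, noisy physics (assumption, not from the slides)
p = rng.uniform(0.0, 2.0, 2000)
d1 = np.exp(0.8 * p) + 0.05 * rng.standard_normal(p.size)
d2 = p + 0.5 * p**2 + 0.05 * rng.standard_normal(p.size)

# Fit D on the first half (training), evaluate on the independent second half
D, *_ = np.linalg.lstsq(augment(d1[:1000], d2[:1000]), p[:1000], rcond=None)
p_hat = augment(d1[1000:], d2[1000:]) @ D
print("test RMS error:", np.sqrt(np.mean((p_hat - p[1000:]) ** 2)))
```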
Case (2): Use of KLT to Reduce Rank
Option A: d → KLT → d′ (drop low-variance components) → nonlinear augmentation → daug → p̂ = Ddaug
Useful if d is of high order.
“Karhunen-Loeve Transform” (KLT)
d′ ≜ Kd transforms a noisy redundant data vector d into a [reduced-rank] vector d′ with most variance in d1′ and least in dn′, where E[di′dj′] = δij·λi and λi ≥ λj for j > i.
Cdd ≜ E[ddᵀ] = Kᵀ · diag(λ1, λ2, …, λn) · K
[Figure: e.g., an atmospheric temperature profile T(h) versus height h (0 to 10 km), with the first three KLT basis functions labeled 1, 2, 3; the basis functions are the rows of K.]
The KLT is essentially the same as Principal Component
Analysis (PCA) and Empirical Orthogonal Functions (EOF).
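A minimal KLT/PCA sketch on synthetic zero-mean data (the six-channel example is illustrative): the covariance Cdd is eigendecomposed, the rows of K are the eigenvectors, and the components of d′ = Kd come out ordered by variance.

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy, redundant data: 500 samples of a 6-channel vector with correlated components
latent = rng.standard_normal((500, 2))
d = latent @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((500, 6))
d = d - d.mean(axis=0)                      # work with zero-mean data

# KLT: eigendecomposition of Cdd = E[d d^T]; the rows of K are the eigenvectors
Cdd = d.T @ d / d.shape[0]
lam, vecs = np.linalg.eigh(Cdd)             # ascending eigenvalues
order = np.argsort(lam)[::-1]               # reorder so lambda_1 >= lambda_2 >= ...
lam, K = lam[order], vecs[:, order].T

d_prime = d @ K.T                           # d' = K d, one row per sample
print("variances of d'_i:", np.round(lam, 4))

# Rank reduction: keep the leading components, "drop" the rest
d_reduced = d_prime[:, :2]
```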
Nonlinear Estimators Using Rank Reduction
Option B: d → nonlinear augmentation → KLT (drop low-variance components) → d′ → p̂ = Dd′
Useful if d is very nonlinear.

Option C: d → KLT → d′ (drop) → nonlinear augmentation → daug → KLT → d′aug (drop) → p̂ = Dd′aug
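A sketch of the Option C chain on illustrative synthetic data (the sinusoidal channels, the number of components kept, and the cubic augmentation are all assumptions, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(5)

def klt(X, keep):
    """KLT of zero-mean data (samples x channels); return the leading 'keep' components."""
    lam, vecs = np.linalg.eigh(X.T @ X / X.shape[0])
    K = vecs[:, np.argsort(lam)[::-1]].T
    return X @ K[:keep].T

# Illustrative high-order nonlinear data: 8 noisy channels driven by one parameter p
p = rng.uniform(0.0, 2.0, 1000)
d = np.column_stack([np.sin((k + 1) * p) + 0.05 * rng.standard_normal(p.size)
                     for k in range(8)])

# Option C: d -> KLT (drop) -> nonlinear augmentation -> KLT (drop) -> linear estimator
d_prime = klt(d - d.mean(axis=0), keep=3)
daug = np.column_stack([d_prime, d_prime**2, d_prime**3])
daug_prime = klt(daug - daug.mean(axis=0), keep=5)

A = np.column_stack([np.ones(p.size), daug_prime])
D, *_ = np.linalg.lstsq(A, p, rcond=None)
print("fit RMS error:", np.sqrt(np.mean((A @ D - p) ** 2)))
```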
Neural Networks (NN)
[Figure: feed-forward neural network. Inputs 1, d1, …, dN are weighted by W01, W11, …, WNM and summed at nodes Σ1, …, ΣM to produce d1′, …, dM′ (block NN1); subsequent blocks NN2 and NN3, the "hidden layers," map d′ to d′′ and then to p̂.]

Neural nets compute complex polynomials with great efficiency and simplicity.

NN coefficients are computed via relaxation optimization (back propagation).

NN may be recursive or purely feed-forward (FFNN).
Neural Networks (NN)
Weights Wij are found by guessing, then perturbing them to reduce a cost
function on [p̂ − p] over a "training set" of known [d, p]. Training
can be tedious; commercial software tools are commonly used.
Degrees of freedom in training set should exceed the number of
weights by a modest factor of ~ 3-10. Risks of "overtraining" are
reduced by monitoring NN performance on independent test data.
More nonlinear problems need more layers and more internal nodes,
as determined empirically. Neural nets can also perform recognition tasks.
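A minimal sketch of a purely feed-forward net with one hidden layer, trained by back propagation (plain gradient descent); the training data, layer size, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative training set (assumption): scalar d, nonlinear relation p(d)
d = rng.uniform(-2.0, 2.0, (400, 1))
p = np.tanh(d) + 0.3 * d**2 + 0.02 * rng.standard_normal(d.shape)

# One hidden layer of M nodes; weights found by gradient descent ("back propagation")
M, lr = 8, 0.05
W1 = 0.5 * rng.standard_normal((1, M)); b1 = np.zeros(M)     # input -> hidden
W2 = 0.5 * rng.standard_normal((M, 1)); b2 = np.zeros(1)     # hidden -> output

for step in range(5000):
    h = np.tanh(d @ W1 + b1)                 # hidden-layer activations
    p_hat = h @ W2 + b2                      # linear output node
    err = p_hat - p                          # gradient of 0.5 * MSE
    # Back propagation of the error through the two layers
    gW2 = h.T @ err / len(d);  gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h**2)         # through the tanh nonlinearity
    gW1 = d.T @ dh / len(d);   gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

print("training MSE:", np.mean(err**2))
```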
Simple neural nets are easier to train correctly, so they are often merged with linear techniques, e.g.:
d → NN → linear estimator → p̂
Accommodating Nonlinearity via Iteration
A. Library Retrievals
[Figure: d plotted versus p, showing a library of linear retrieval operators D0, Di, …, Dk, each matched to the d(p) relation near a different estimate p̂i. Successive Di are chosen from the library as f(p̂i).]
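A sketch of library retrieval for a scalar problem; the exponential forward model and the way the library entries are built (local linearizations at reference values of p) are illustrative assumptions.

```python
import numpy as np

# Illustrative nonlinear physics (assumption): d = f(p)
def f(p):
    return np.exp(0.7 * p)

# "Library" of linearized retrieval operators D_k, each valid near a reference p_k:
# each entry stores (p_k, d_k = f(p_k), D_k = 1/f'(p_k)).
p_refs = np.linspace(0.0, 3.0, 7)
library = [(pk, f(pk), 1.0 / (0.7 * f(pk))) for pk in p_refs]

def retrieve(d, n_iter=5):
    p_hat = 1.5                               # first guess
    for _ in range(n_iter):
        # Choose the library entry as a function of the current estimate p_hat
        pk, dk, Dk = min(library, key=lambda entry: abs(entry[0] - p_hat))
        p_hat = pk + Dk * (d - dk)            # local linear retrieval about (pk, dk)
    return p_hat

print(retrieve(f(2.3)))                       # converges near 2.3
```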
Accommodating Nonlinearity via Iteration
B. Physics Recomputation
[Figure: block diagram of iterative retrieval with physics recomputation. Starting from a first guess p̂0, the residual d − d̂i is multiplied by D to give ∆p̂i; the accumulated estimate p̂n = p̂0 + Σ(i=1..n) ∆p̂i is passed through the nonlinear physics block W to recompute d̂i+1.]
It is essential only that D always yield ∆p̂i in the right direction (so errors shrink).
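A sketch of physics recomputation for a scalar problem; the quadratic forward model and the fixed gain D are illustrative, and the loop relies only on D pushing ∆p̂ in the right direction.

```python
# Illustrative nonlinear forward model (assumption): d = W(p)
def W(p):
    return p + 0.4 * p**2

d_obs = W(1.8)                    # noise-free observation to invert

# Fixed approximate inverse gain D; only its sign and rough size must be right,
# so that each delta_p_hat moves the estimate toward the answer
D = 0.3

p_hat = 0.5                       # "first guess" p_hat_0
for i in range(20):
    d_hat = W(p_hat)              # recompute the physics from the current estimate
    p_hat = p_hat + D * (d_obs - d_hat)   # delta_p_hat_i = D * (d - d_hat_i)

print(p_hat)                      # approaches 1.8
```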
Accommodating Nonlinearity via Iteration
C. Spatial Iteration
[Figure: data samples numbered 4-8 along a spatial coordinate i, and a block diagram in which delay elements supply di−1 and p̂i−1; the increment ∆di = di − di−1 is multiplied by D to give ∆p̂i, and p̂i = p̂i−1 + ∆p̂i. An alternate method is indicated for use if ∆di exceeds a threshold.]
Works well if the data is smooth.
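One plausible reading of the block diagram, sketched for a scalar scan; the smooth scene, the forward model, and the constant gain D are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative smooth scene and physics (assumption): d_i = g(p_i) + noise
x = np.linspace(0.0, 1.0, 200)
p_true = 1.0 + 0.5 * np.sin(2 * np.pi * x)            # smooth parameter field
d = p_true + 0.3 * p_true**2 + 0.01 * rng.standard_normal(x.size)

# Incremental spatial iteration:
# delta_d_i = d_i - d_{i-1},  delta_p_i = D * delta_d_i,  p_hat_i = p_hat_{i-1} + delta_p_i
D = 1.0 / (1.0 + 0.3 * 2.0)        # approximate local inverse slope dp/dd near p = 1
p_hat = np.empty_like(d)
p_hat[0] = 1.0                     # first-pixel estimate from some other method
for i in range(1, d.size):
    p_hat[i] = p_hat[i - 1] + D * (d[i] - d[i - 1])

print("RMS error:", np.sqrt(np.mean((p_hat - p_true) ** 2)))
```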
Multi-Dimensional Estimation
e.g., Atmospheric Temperature Profiles
1) T̂(h, x, y) = f(d(x, y))
2) T̂(h, x, y) = f(d(x, y), d(x ± ∆, y ± ∆), …)
Estimate improved by additional correlations
and degrees of freedom.
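A sketch contrasting forms (1) and (2) on a synthetic smooth 2-D field; the field, the noise level, and the use of the four nearest neighbors with ∆ = 1 pixel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative smooth 2-D parameter field and noisy observations of it
T = rng.standard_normal((64, 64))
T = 0.25 * (np.roll(T, 1, 0) + np.roll(T, -1, 0) + np.roll(T, 1, 1) + np.roll(T, -1, 1))
d = T + 0.3 * rng.standard_normal(T.shape)

# 1) estimator using only d(x, y); 2) estimator also using the four nearest
#    neighbors d(x +/- 1, y) and d(x, y +/- 1)
feats_local = d.reshape(-1, 1)
feats_nbrs = np.column_stack([d.reshape(-1)] +
                             [np.roll(d, s, axis=a).reshape(-1) for a in (0, 1) for s in (1, -1)])

for name, F in (("local only", feats_local), ("with neighbors", feats_nbrs)):
    A = np.column_stack([np.ones(F.shape[0]), F])
    w, *_ = np.linalg.lstsq(A, T.reshape(-1), rcond=None)
    print(name, "RMS error:", np.sqrt(np.mean((A @ w - T.reshape(-1)) ** 2)))
```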
Genetic Algorithms
1) Characterize algorithm by segmented character string,
e.g., a binary number.
2) The significance of the string is defined, e.g.,
A. As weights in a linear estimator or neural net.
B. As the architecture of a neural net.
3) Many algorithms (strings) are tested, and the metrics
for each are evaluated for some ensemble of test cases.
4) Randomly combine elements of best algorithms to form
new ones; some random changes may be made too.
5) Iterate steps (3) and (4) until an asymptotic optimum is
reached. Can be used for pattern recognition, signal
detection, parameter estimation, and other purposes.
6) Proper choice of "gene" size accelerates convergence
(genes swapped as units).
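A minimal sketch of steps (1) through (5), assuming the string encodes the weights of a linear estimator (option 2A); the population size, selection rule, and mutation scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

# Illustrative task (assumption): find weights w of a linear estimator p_hat = daug @ w
p = rng.uniform(-1.0, 1.0, 300)
daug = np.column_stack([np.ones_like(p), p + 0.1 * rng.standard_normal(p.size), p**2])

def fitness(w):
    return -np.mean((daug @ w - p) ** 2)      # metric evaluated over the test ensemble

pop = rng.standard_normal((40, 3))            # population of "strings" (real-valued genes here)
for generation in range(200):
    scores = np.array([fitness(w) for w in pop])
    best = pop[np.argsort(scores)[-10:]]      # keep the best algorithms
    # Randomly combine genes of the best, plus small random mutations
    parents_a = best[rng.integers(0, 10, 40)]
    parents_b = best[rng.integers(0, 10, 40)]
    mask = rng.random((40, 3)) < 0.5          # genes swapped as units
    pop = np.where(mask, parents_a, parents_b) + 0.02 * rng.standard_normal((40, 3))

print("best weights:", pop[np.argmax([fitness(w) for w in pop])])
```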