defense slides - UCLA Computer Science

2012
Moving Object Segmentation
by Pursuing Local SpatioTemporal Manifolds
Yuanlu Xu
Problem
Segmenting moving foreground in a video
Related work & intuitions
Dynamic background ~ dynamic textures
Image sequences of certain textures moving and changing under certain properties.
S. Soatto, G. Doretto, and Y. Wu. “Dynamic textures”. IJCV 2003.
Related work & intuitions
Dynamic background ~ dynamic textures
How to model?
The output of a linear dynamic system driven by IID Gaussian noise.
Intuition for moving object segmentation:
A complex scene containing a dynamic background is composed of several independent dynamic textures.
Related work & intuitions
Illumination changes ~ modeling illumination
Observing eigenvalue curves of bricks in different states: (a) background, (b) foreground occlusion.
Y. Zhao et al. “Spatiotemporal patches for night background modeling by subspace learning”. ICPR 2008.
Related work & intuitions
Illumination changes ~ modeling illumination
Intuition for handling illumination changes:
The set of bricks of a given background location under various
lighting conditions lies in a low-dimensional manifold.
Related work & intuitions
Indistinctive changes
Similar appearance → incorporating extra information
Intuition for distinguishing indistinctive moving objects:
Modeling background appearance variations, estimating next state,
distinguishing moving objects not following the similar changes
Intuitions & assumptions
Intuitions:
1. A complex scene containing a dynamic background is composed of several independent dynamic textures.
2. The set of bricks of a given background location under various lighting conditions lies in a low-dimensional manifold.
3. Modeling background appearance variations.
Assumptions:
1. Given a background location, the sequence of bricks (under dynamic changes and illumination changes) lies in a low-dimensional manifold, and the variations are locally linear.
2. The bricks with indistinctive and distinctive foreground occlusions can be well separated from the background by distinguishing differences in both appearance and variations.
Representation
Segmenting Brick in Video:
For each frame, we divide it into patches of size $h \times w$. At each location, $t$ patches are combined to form a brick.
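The brick construction above can be sketched in NumPy. This is only an illustration under stated assumptions: the video is a grayscale array of shape (T, H, W), bricks do not overlap in space or time, and leftover frames or pixels that do not fill a whole brick are dropped. The function name `extract_bricks` is hypothetical.

```python
import numpy as np

def extract_bricks(video, h, w, t):
    """Split a (T, H, W) video into flattened bricks of t stacked h-by-w patches.

    Returns an array of shape (T // t, H // h, W // w, t * h * w).
    Assumption of this sketch: non-overlapping bricks, remainders dropped.
    """
    T, H, W = video.shape
    nt, nh, nw = T // t, H // h, W // w
    v = video[:nt * t, :nh * h, :nw * w]
    v = v.reshape(nt, t, nh, h, nw, w)
    v = v.transpose(0, 2, 4, 1, 3, 5)  # (nt, nh, nw, t, h, w)
    return v.reshape(nt, nh, nw, t * h * w)
```

For a fixed location `(r, c)`, the slice `bricks[:, r, c, :].T` gives an m-by-n matrix whose columns are the successive bricks of that location.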
Representation
Center Symmetric – Spatio Temporal LTP (CS-STLTP) Descriptor
[Figure: CS-STLTP encoding example. A 3x3x3 cube of pixels (axes X, Y, T) is coded on 4 spatio-temporal planes with scale threshold t = 0.2; the resulting ternary codes (-1, 0, 1) form the feature vector.]
Mathematical formulation
Given a brick sequence $\mathbf{V} = \{v_1, v_2, \dots, v_n\} \in \mathbb{R}^{m \times n}$ of a background location, we assume that the manifold $\mathbf{V}$ lies in has dimension $d$.
The structure of this manifold:
\[ v_i = \sum_{j=1}^{d} z_{i,j} C_j + \omega \]
$\mathbf{C} = [C_1, C_2, \dots, C_d]$: bases of the manifold.
$z_{i,j}$: coefficient of basis $C_j$ given $v_i$.
$\omega$: structural residual.
Mathematical formulation
Given the corresponding coding $\mathbf{Z} = \{z_1, z_2, \dots, z_n\} \in \mathbb{R}^{d \times n}$ for $\mathbf{V} = \{v_1, v_2, \dots, v_n\}$, the coding variation is locally linear, according to the assumption.
The coding variation within this manifold:
\[ z_{i+1} = A z_i + \epsilon_i \]
$z_{i+1}, z_i$: two successive states.
$A \in \mathbb{R}^{d \times d}$: description of the coding variation.
$\epsilon_i$: state residual.
Mathematical formulation
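The problem of pursuing the structure of and the variation within a manifold is formulated as minimizing the empirical energy function:
\[ \min \; f_n(\mathbf{C}, A) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \| v_i - \mathbf{C} z_i \|_2^2 + \frac{1}{2} \| z_i - A z_{i-1} \|_2^2 \right) \]
($\mathbf{V} = \{v_1, v_2, \dots, v_n\} \in \mathbb{R}^{m \times n}$, $\mathbf{Z} \in \mathbb{R}^{d \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times d}$, $A \in \mathbb{R}^{d \times d}$)
The first term minimizes the structural residual; the second term minimizes the state residual.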
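As a concrete reading of the energy function, the sketch below evaluates it for given matrices. One assumption is needed: the slide sums the state term from i = 1, but z_0 is not defined, so this sketch starts the state term at i = 2. The function name `energy` is hypothetical.

```python
import numpy as np

def energy(V, C, Z, A):
    """Empirical energy: structural residual plus state residual, averaged over n.

    V: (m, n) bricks, C: (m, d) bases, Z: (d, n) codings, A: (d, d) dynamics.
    Assumption: the state term skips i = 1 because z_0 is undefined.
    """
    n = V.shape[1]
    structural = 0.5 * np.sum((V - C @ Z) ** 2)
    state = 0.5 * np.sum((Z[:, 1:] - A @ Z[:, :-1]) ** 2)
    return (structural + state) / n
```

The energy is zero exactly when every brick is reconstructed by its coding and every coding is predicted by the previous one.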
Mathematical formulation
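Because $\mathbf{Z}$ is unknown, we rewrite the problem as a joint optimization problem over $\mathbf{C}$, $\mathbf{Z}$, $A$:
\[ \min \; f(\mathbf{C}, \mathbf{Z}, A) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \| v_i - \mathbf{C} z_i \|_2^2 + \frac{1}{2} \| z_i - A z_{i-1} \|_2^2 \right) \]
Not jointly convex, but convex with respect to $(\mathbf{C}, A)$ when $\mathbf{Z}$ is fixed, and with respect to $\mathbf{Z}$ when $(\mathbf{C}, A)$ is fixed.
A numerical solution: alternate between the two blocks of variables, minimizing over one while keeping the other fixed.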
Representation
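\[ \min \; f(\mathbf{C}, \mathbf{Z}, A) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \| v_i - \mathbf{C} z_i \|_2^2 + \frac{1}{2} \| z_i - A z_{i-1} \|_2^2 \right) \]
Rewritten as a linear dynamic system (LDS):
\[ v_i = C z_i + \omega_i, \qquad \omega_i \overset{\text{IID}}{\sim} N(0, R) \]
\[ z_{i+1} = A z_i + \epsilon_i, \qquad \epsilon_i \overset{\text{IID}}{\sim} N(0, Q) \]
structural residual → structural noise
state residual → state noise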
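To make the LDS form concrete, the following sketch draws a synthetic brick sequence from it. It is only an illustration: the function name `simulate_lds`, the initial state `z0` and the seed are assumptions that do not appear in the slides.

```python
import numpy as np

def simulate_lds(C, A, R, Q, z0, n, seed=0):
    """Sample n observations from v_i = C z_i + w_i, z_{i+1} = A z_i + e_i,
    with w_i ~ N(0, R) and e_i ~ N(0, Q) drawn independently."""
    rng = np.random.default_rng(seed)
    m, d = C.shape
    V = np.empty((m, n))
    Z = np.empty((d, n))
    z = np.asarray(z0, dtype=float)
    for i in range(n):
        Z[:, i] = z
        V[:, i] = C @ z + rng.multivariate_normal(np.zeros(m), R)
        z = A @ z + rng.multivariate_normal(np.zeros(d), Q)
    return V, Z
```

Columns of `V` play the role of the bricks $v_i$ and columns of `Z` the role of the states $z_i$.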
Learning
\[ v_i = C z_i + \omega_i, \qquad z_{i+1} = A z_i + \epsilon_i \]
\[ \omega_i \overset{\text{IID}}{\sim} N(0, R), \qquad \epsilon_i \overset{\text{IID}}{\sim} N(0, Q) \]
Initial Learning
Given a training sequence $V = \{v_1, v_2, \dots, v_n\}$, identify $C_n, A_n, R_n, Q_n$:
\[ v_i = C_n z_i + \omega_i, \qquad z_{i+1} = A_n z_i + \epsilon_i \]
Online Learning
Given a new brick $v_{n+1}$, incrementally learn $C_{n+1}, A_{n+1}, R_{n+1}, Q_{n+1}$:
\[ v_{i+1} = C_{n+1} z_{i+1} + \omega_{i+1}, \qquad z_{i+2} = A_{n+1} z_{i+1} + \epsilon_{i+1} \]
Learning
Initial Learning
Sub-optimal analytical solution
S. Soatto, G. Doretto, and Y. Wu. “Dynamic textures”. IJCV 2003.
Online Learning
Learning $C_{n+1}$: incremental subspace learning with Candid Covariance-free IPCA (CCIPCA) and IPCA
J. Weng et al. “Candid covariance-free incremental principal component analysis”. TPAMI 2003.
Y. Li. “On incremental and robust subspace learning”. Pattern Recognition 2004.
Learning $A_{n+1}$: linear problem over the latest $l$ states
Inference
For a new brick $v_{n+1}$, the segmentation of moving objects is decided by the structural noise and the state noise.
Structural noise:
\[ z'_{n+1} = C_n^T v_{n+1}, \qquad \omega_{n+1} = v_{n+1} - C_n z'_{n+1} \]
State noise:
\[ \epsilon_n = z'_{n+1} - A_n z_n \]
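The two noise terms can be computed directly from the formulas above. The slides do not say how the noises are turned into a decision, so the thresholding in `is_foreground` (and the names `t_struct`, `t_state`) is purely a hypothetical rule added for illustration.

```python
import numpy as np

def brick_noise(v_new, C, A, z_prev):
    """Return (z', structural noise, state noise) for a new brick."""
    z_new = C.T @ v_new          # z'_{n+1} = C_n^T v_{n+1}
    omega = v_new - C @ z_new    # structural noise
    eps = z_new - A @ z_prev     # state noise
    return z_new, omega, eps

def is_foreground(omega, eps, t_struct, t_state):
    """Hypothetical decision rule: flag the brick if either noise norm is large."""
    return bool(np.linalg.norm(omega) > t_struct or np.linalg.norm(eps) > t_state)
```

A brick that lies in the learned subspace and follows the learned dynamics yields zero noise for both terms and is kept as background under this rule.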
Experimental Results
Datasets
Busy scenes: Airport, Train Station
Dynamic scenes: Water Surface, Swaying Trees, Heavy Rain, Active Fountain, Waving Curtain, Floating Bottle
Illumination changes: Sudden Light, Gradual Light
Experimental Results
Scene               GMM    ImGMM  OnlineAR  JDR    Struct1-SVM  SILTP  STDB (RGB)  STDB (Ftr.)
1# Airport          46.99  47.36  62.72     60.23  65.35        68.14  75.52       66.40
2# Floating Bottle  57.91  57.77  43.79     45.64  47.87        59.57  69.04       75.85
3# Waving Curtain   62.75  74.58  77.86     72.72  77.34        78.01  87.74       79.57
4# Active Fountain  52.77  60.11  70.41     68.53  74.94        76.33  76.85       79.68
5# Heavy Rain       71.11  81.54  78.68     75.88  82.62        76.71  86.86       81.35
6# Sudden Light     47.11  51.37  37.30     52.26  47.61        52.63  51.56       70.23
7# Gradual Light    51.10  50.12  13.16     47.48  62.44        54.86  54.84       72.52
8# Train Station    65.12  68.80  36.01     57.68  61.79        67.05  73.43       66.46
9# Swaying Trees    19.51  23.25  63.54     45.61  24.38        42.54  43.70       48.49
10# Water Surface   79.54  86.01  77.31     84.27  83.13        74.30  88.54       87.88
Average             55.39  59.56  57.02     60.23  59.79        63.08  70.81       72.84
Experimental Results
Selection of structural update approach
Scene               CCIPCA accuracy (%)  IPCA accuracy (%)
1# Airport          75.52                65.13
2# Floating Bottle  69.04                70.02
3# Waving Curtain   87.74                78.47
4# Active Fountain  76.85                81.38
5# Heavy Rain       86.86                79.84
6# Sudden Light     51.56                53.63
7# Gradual Light    54.84                59.79
8# Train Station    73.43                68.69
9# Swaying Trees    43.70                70.17
10# Water Surface   88.54                89.43
Average             70.81                71.66
Efficiency (fps)    4.1                  2.3
Dynamic scenes: IPCA is much better than CCIPCA.
Busy scenes: CCIPCA is much better than IPCA.
Illumination changes: IPCA is slightly better than CCIPCA.
Efficiency: CCIPCA is much faster than IPCA.
Contribution
1. Formulating the problem of modeling background by pursuing local
spatio-temporal manifolds of video brick sequences.
2. Representing spatio-temporal statistics in video bricks with the CS-STLTP descriptor.
3. Pursuing local spatio-temporal manifolds with two LDSs: a time-invariant LDS for initial learning and a time-variant LDS for online learning.
4. Online learning the structure of local spatio-temporal manifolds with
incremental subspace learning and the state variations with re-solving
linear problems.
Problems
1. CS-STLTP handles illumination changes well, but is not sufficient to capture variation statistics.
2. In highly dynamic scenes, the assumption of locally linear variation can hardly hold.
3. CCIPCA struggles to follow large changes in the structure of the manifold. IPCA behaves better than CCIPCA but suffers from higher computational complexity.
Published Papers
1. Yuanlu Xu, Hongfei Zhou, Qing Wang, Liang Lin. “Realtime Object-of-Interest Tracking by Learning Composite Patch-based Templates”.
ICIP 2012 (accepted)
2. Liang Lin, Yuanlu Xu, Xiaodan Liang. “Complex Background
Subtraction by Pursuing Dynamic Spatio-temporal Manifolds”.
ECCV 2012 (submitted)
QUESTIONS?
Difficulties
Dynamic backgrounds
Illumination changes (especially sudden changes)
Difficulties
Indistinctive moving objects
Moving camera (e.g., shaking, hand-held)
Contribution
1. Formulating the problem of modeling background by pursuing local
spatio-temporal manifolds of video brick sequences.
2. Representing spatio-temporal statistics in video bricks.
3. Pursuing local spatio-temporal manifolds.
4. Maintaining local spatio-temporal manifolds online.
Mathematical formulation
Similar to sparse coding, to prevent $\mathbf{C}$ from becoming arbitrarily large, which would make $\mathbf{Z}$ arbitrarily small, we add the constraint $\|C_k\|_2 \le 1$, and the constraint set $\Omega$ is formulated as:
\[ \Omega \triangleq \{ \mathbf{C} \in \mathbb{R}^{m \times d} : \forall k = 1, 2, \dots, d, \; \|C_k\|_2 \le 1 \} \]
For all $\|C_1\|_2 \le 1$, $\|C_2\|_2 \le 1$ and all $0 \le \theta \le 1$:
\[ \|\theta C_1 + (1-\theta) C_2\|_2 \le \|\theta C_1\|_2 + \|(1-\theta) C_2\|_2 = \theta \|C_1\|_2 + (1-\theta) \|C_2\|_2 \le \theta + (1-\theta) = 1 \]
Thus $\Omega$ is a convex set.
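The convexity argument can be checked numerically on a small example. The helper names `in_omega` and `convex_combination` are hypothetical; the check only illustrates the inequality chain above and is not a proof.

```python
import numpy as np

def in_omega(C, tol=1e-12):
    """True if every column of C has Euclidean norm at most 1."""
    return bool(np.all(np.linalg.norm(C, axis=0) <= 1.0 + tol))

def convex_combination(C1, C2, theta):
    """theta * C1 + (1 - theta) * C2 for 0 <= theta <= 1."""
    return theta * C1 + (1.0 - theta) * C2
```

For two matrices that satisfy the column-norm constraint, every convex combination of them should also pass `in_omega`.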
Mathematical formulation
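Because $\mathbf{Z}$ is unknown, we rewrite the problem as a joint optimization problem over $\mathbf{C}$, $\mathbf{Z}$, $A$:
\[ \min \; f(\mathbf{C}, \mathbf{Z}, A) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \| v_i - \mathbf{C} z_i \|_2^2 + \frac{1}{2} \| z_i - A z_{i-1} \|_2^2 \right) \quad \text{subject to } \mathbf{C} \in \Omega \]
Not jointly convex, but convex with respect to $(\mathbf{C}, A)$ when $\mathbf{Z}$ is fixed, and with respect to $\mathbf{Z}$ when $(\mathbf{C}, A)$ is fixed.
A numerical solution: alternate between the two blocks of variables, minimizing over one while keeping the other fixed.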
Mathematical formulation
In practice, the above joint optimization problem is simplified to a two-step optimization:
1. Rewrite the problem as a time-variant linear dynamic system and solve for the structure of the system, ignoring the state (coding) variation.
2. Given the structure of the system, solve for the state variation, based on the corresponding state of each brick.
Representation
Local Binary Pattern (LBP)
/ Local Ternary Pattern
(LTP)
Representation
Scale Invariant LTP
(SILTP)
S. Liao et al. “Modeling pixel process with
scale invariant local patterns for background
subtraction in complex scenes”. CVPR 2010
Representation
Scale Invariant LTP
(SILTP)
SILTP is more robust in handling scale changes (illumination changes).
Representation
Center Symmetric Coding
P0 P1 P2
P7 Pc P3
P6 P5 P4
The 8 neighboring pixels around the center are formed into 4 pairs, $(P_0, P_4)$, $(P_1, P_5)$, $(P_2, P_6)$, $(P_3, P_7)$, and each pair is compared to give one of the codes S0, S1, S2, S3.
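A minimal sketch of center-symmetric ternary coding on one 3x3 patch follows. The pairing and the pixel layout come from the slide; the comparison rule itself is an assumption (a SILTP-style scale threshold with tau = 0.2, matching the "t = 0.2" shown in the CS-STLTP figure), and the function name `cs_ternary` is hypothetical.

```python
import numpy as np

def cs_ternary(patch, tau=0.2):
    """Ternary codes S0..S3 for the pairs (P0,P4), (P1,P5), (P2,P6), (P3,P7).

    Layout as on the slide:  P0 P1 P2 / P7 Pc P3 / P6 P5 P4.
    Assumed rule: +1 if a > (1 + tau) * b, -1 if a < (1 - tau) * b, else 0.
    """
    p = np.asarray(patch, dtype=float)
    ring = [p[0, 0], p[0, 1], p[0, 2], p[1, 2],
            p[2, 2], p[2, 1], p[2, 0], p[1, 0]]  # P0..P7, clockwise
    codes = []
    for k in range(4):
        a, b = ring[k], ring[k + 4]
        if a > (1 + tau) * b:
            codes.append(1)
        elif a < (1 - tau) * b:
            codes.append(-1)
        else:
            codes.append(0)
    return codes
```

Only 4 comparisons are made per patch instead of 8 comparisons against the center, which halves the code length; scaling all pixel values by a common factor leaves the codes unchanged under the assumed rule.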
Representation
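\[ \min \; f(\mathbf{C}, \mathbf{Z}, A) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2} \| v_i - \mathbf{C} z_i \|_2^2 + \frac{1}{2} \| z_i - A z_{i-1} \|_2^2 \right) \]
Rewritten as a linear dynamic system (LDS):
\[ v_i = C z_i + \omega_i, \qquad \omega_i \overset{\text{IID}}{\sim} N(0, R) \]
\[ z_{i+1} = A z_i + \epsilon_i, \qquad \epsilon_i \overset{\text{IID}}{\sim} N(0, Q) \]
structure of the manifold → appearance matrix $C$
state variations of the manifold → dynamics matrix $A$
structural residual → structural noise $\omega_i$
state residual → state noise $\epsilon_i$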
Initial learning
Sub-optimal analytical solution
Assumption:
1. The dimension of the manifold is $d$ and the dimension of the state noise is $d_\epsilon$, with $d > d_\epsilon$. The appearance matrix satisfies $C_n^T C_n = I_d$.
2. The analytical solution for the structure of the manifold is obtained by a decomposition, which is computed by SVD:
\[ W = U S V^T, \qquad C_n = U(:, 1{:}d), \qquad Z_n = S(1{:}d, 1{:}d) \, V(:, 1{:}d)^T \]
S. Soatto, G. Doretto, and Y. Wu. “Dynamic textures”. IJCV 2003.
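A minimal NumPy sketch of the SVD step is given below. It assumes that W is the m-by-n matrix whose columns are the training bricks and that any mean subtraction has already been done; the function name `learn_structure` is hypothetical.

```python
import numpy as np

def learn_structure(W, d):
    """Rank-d structure from the thin SVD W = U S V^T.

    Returns C (m, d) with orthonormal columns and the states Z (d, n).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    C = U[:, :d]                      # first d left singular vectors
    Z = np.diag(s[:d]) @ Vt[:d, :]    # S(1:d, 1:d) V(:, 1:d)^T
    return C, Z
```

Because the columns of `C` are orthonormal, `C.T @ C` equals the identity, which is the constraint stated in the assumption, and `C @ Z` is the best rank-d approximation of `W` in the Frobenius norm.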
Initial learning
Given the states $z_1, z_2, \dots, z_n$, the dynamics matrix $A_n$ is obtained by solving a linear problem.
To estimate the noise covariance $Q_n$, we treat $\epsilon_i$ as the reconstruction error $e_i = z_{i+1} - A_n z_i$, and $Q_n$ is represented as
\[ Q_n = E[e_i e_i^T] = \lim_{j \to +\infty} \frac{1}{j} \sum_{k=1}^{j} e_k e_k^T \approx \frac{1}{n-1} \sum_{k=1}^{n-1} e_k e_k^T \]
To reduce the dimension of $e_i$, let $Q_n = B_n B_n^T$ and apply PCA to $Q_n$, so that $\epsilon_i = B_n^{-1} e_i$.
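The estimation of the dynamics and the noise covariance can be sketched as follows. The exact linear problem is not recoverable from the slide, so this sketch assumes an ordinary least-squares fit of $z_{i+1} \approx A z_i$ in the Frobenius norm (as in the cited dynamic textures paper); `learn_dynamics` and `d_eps` are hypothetical names.

```python
import numpy as np

def learn_dynamics(Z, d_eps):
    """Estimate A, Q and a rank-d_eps factor B with Q ~= B B^T from states Z (d, n)."""
    Z0, Z1 = Z[:, :-1], Z[:, 1:]
    A = Z1 @ np.linalg.pinv(Z0)            # least-squares fit of Z1 ~= A Z0 (assumption)
    E = Z1 - A @ Z0                        # reconstruction errors e_1 .. e_{n-1}
    Q = E @ E.T / (Z.shape[1] - 1)         # sample estimate of E[e e^T]
    evals, evecs = np.linalg.eigh(Q)       # PCA of Q
    top = np.argsort(evals)[::-1][:d_eps]  # keep the d_eps largest components
    B = evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))
    return A, Q, B
```

When the states follow the linear dynamics exactly, the fitted `A` matches the true matrix and `Q` is numerically zero.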
Initial learning
Since different manifolds have different dynamic properties, the dimension of each manifold is determined by the training samples.
Static: low dimension
Dynamic: high dimension
Online learning
Against foreground occlusions
We define a noise-free video brick under the current model to compensate for the missing background samples.
The noise-free video brick $v_{n+1}$ is defined as
Online learning
To update the structure of the manifold, we regard $W_{n+1}$ as the extension of $W_n$ obtained by adding a new column (the update sample).
The problem of updating $C_{n+1}$ is formulated as incremental subspace learning.
To find a more effective approach, we employ two incremental subspace
learning methods:
1. Candid Covariance-free Incremental PCA (CCIPCA), without
estimating the covariance matrix.
2. Incremental PCA (IPCA), estimating the covariance matrix.
Online learning
CCIPCA
J. Weng et al. “Candid covariance-free incremental principal
component analysis”. IEEE TPAMI 2003.
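For orientation, one CCIPCA update step in the form given by Weng et al. can be sketched as below. It assumes zero-mean samples and keeps unnormalised eigenvector estimates as rows of `V`; the function name `ccipca_update` and the default amnesic parameter of 0 are choices of this sketch, not details taken from the slides.

```python
import numpy as np

def ccipca_update(V, u, n, amnesic=0.0):
    """One CCIPCA step. V: (k, m) estimates, u: zero-mean sample, n: 1-based sample index."""
    V = V.copy()
    u = np.asarray(u, dtype=float).copy()
    for i in range(min(V.shape[0], n)):
        if i == n - 1:
            V[i] = u  # initialise the i-th estimate with the current residual
        else:
            norm = np.linalg.norm(V[i])
            w_old = (n - 1 - amnesic) / n
            w_new = (1 + amnesic) / n
            V[i] = w_old * V[i] + w_new * u * (u @ V[i]) / norm
            unit = V[i] / np.linalg.norm(V[i])
            u = u - (u @ unit) * unit  # remove this component before the next one
    return V
```

No covariance matrix is formed: each estimate is a running average of `u * (u . v) / ||v||`, which is why the method is called covariance-free.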
Online learning
IPCA
For a $d$-dimensional manifold with eigenvectors $C_n$ and eigenvalues $\Lambda_n$, the covariance matrix is estimated as
With the new sample, the new covariance matrix is estimated as
The new covariance matrix is then used to estimate the new eigenvectors $C_{n+1}$ and eigenvalues $\Lambda_{n+1}$.
Y. Li. “On incremental and robust subspace
learning”. Pattern Recognition 2004.
Online learning
Update the state variation $A_{n+1}$, $B_{n+1}$ by re-estimating the new state $z_{n+1}$:
$A_{n+1}$ is updated by re-solving the linear problem,
$B_{n+1}$ by re-estimating the covariance matrix:
\[ [\, e_{n-l+1} \;\; e_{n-l+2} \;\; \cdots \;\; e_n \,] = [\, z_{n-l+2} \;\; z_{n-l+3} \;\; \cdots \;\; z_{n+1} \,] - A_n [\, z_{n-l+1} \;\; z_{n-l+2} \;\; \cdots \;\; z_n \,] \]
\[ Q_{n+1} = E[e_i e_i^T] = \frac{1}{l} \sum_{k=n-l+1}^{n} e_k e_k^T \]
Online learning
Anti-degeneration
Algorithm
Experimental Results
Behaves poorly on highly dynamic backgrounds!