06.01.2012
1. Introduction
   i. Overview of the Proposed System
   ii. Literature Update
   iii. Previous Thesis Progress Report Comments
   iv. Summary of the Efforts Since the Last Thesis Progress Report
2. Proposed 3D Tracking Algorithm
   i. Initial Semi-Automatic Segmentation
   ii. Transferring Measurements between Consecutive Frames
   iii. Reliability Check and Correction of Correspondences
3. Test Results
   i. Artificial Data Tests
   ii. Real Data Tests
4. Conclusions and Future Work
References
In this PhD thesis study, our aim is to develop an accurate 3D tracking system that utilizes both RGB vision sensor data and 3D depth sensor data. As shown in Figure 1, the system estimates the 6-DOF 3D pose parameters between the sensors and the object of interest:

$R_{to} = R(\rho_{to}, \theta_{to}, \varphi_{to}), \qquad t_{to} = [t_{x,to}\ \ t_{y,to}\ \ t_{z,to}]^T$
Figure 1 Object and camera reference frames.
The main motivations to exploit both types of sensors are as follows:
• Commercial systems that can capture high-resolution (640x480), high-frame-rate (25 fps) color and depth data have emerged,
• Pure vision sensor based trackers require manual initialization or offline training, which may not be feasible for most robotic or augmented reality applications,
• Pure depth sensor based methods rely on 3D-3D registration, which may easily get trapped in local minima.
With the aforementioned motivations, the system summarized in Figure 2 is proposed. This system is very similar to the one detailed in the previous Thesis Progress Report; however, it is more robust and reliable.
Figure 2 Overview of the proposed 3D tracking system (initial semi-automatic segmentation of the first frame, KLT-based 2D-3D feature matching with reliability check and correction (RCC) between consecutive frames, EKF-based pose estimation, and drift detection with model-data realignment).
The system is composed of the following building blocks:
Initial Semi-Automatic Segmentation:
This step takes a user input to specify rough foreground and background regions in order to segment the object to track. Since the two sensors are calibrated, after this step we have a colored point cloud of the object, which will be tracked at the following time instants.
Transferring Measurements between Consecutive Frames
The proposed method mainly relies on transferring 2D pixel and 3D XYZ measurements across consecutive video frames. Thus, the accuracy of the system is directly related to the accuracy of the feature tracking. To this aim, a robust feature tracker, based on the Kanade-Lucas-Tomasi (KLT) tracker and utilizing both intensity and Shape Index Map (SIM) data, is proposed.
Reliability Check and Correction of Correspondences
The errors in feature tracking are automatically detected and corrected using the Template Inverse Matching (TIM) algorithm, which operates on the outputs of the intensity and SIM trackers.
State Estimation Using Extended Kalman Filter (EKF)
2D pixel and 3D XYZ measurements, obtained for the current time instant using the proposed KLT tracker, are fed to an Extended Kalman Filter (EKF). The filter uses a constant velocity motion model to estimate the 6-DOF object motion. A novel weighting scheme weights the 2D and 3D measurements based on their qualities, which increases the 3D tracking performance significantly.
Automatic Realignment to Prevent Drift (Optional)
Since cumulative systems are prone to drift due to error accumulation, in the proposed system, drift is detected and corrected automatically.
The system building blocks will be explained in detail in Section 2.
Literature Update

Within the last six-month period, the 3D tracking literature has been reviewed once more in order to "justify" the proposed algorithm and to keep up to date with the new trends in tracking.
Constant Velocity Motion Model
In the 3D tracking literature, a constant velocity motion model is used in many algorithms, such as [1]-[5]. In the well-known MonoSLAM approach [1], the 3D pose of a freely moving camera is estimated with respect to the scene, while the 3D scene is sparsely reconstructed. The motion model is similar to the formulation here. The state vector is composed of 7-dimensional pose parameters (4 parameters representing rotation as a quaternion and 3 parameters representing translation) and the associated velocities. Accelerations are modeled as noise updates.
In [2] and [3], using the previous measurements, a state estimate for the current instant is obtained. This estimate is used to render a synthetic view of the object. For each feature, a sum of squared differences (SSD) surface is generated using patches from the synthetic view and the current frame. The peak of the surface gives the corresponding feature location in the current frame. Moreover, a 2D Gaussian is fitted to the SSD surface. Hence, a variance estimate for each feature is obtained and passed to the EKF, which utilizes a constant velocity motion model. [4] also proposes a similar approach utilizing an Iterated EKF (IEKF).
A recent model-based 3D tracking approach, which utilizes vision and depth sensors, is proposed in [5]. The initial state estimate is utilized to transform the object initially. Then, the proposed articulated iterative closest point (ICP) algorithm is used to match features of the transformed colored object point cloud and the colored point cloud at the current instant. Finally, a measurement update is performed to correct the pose estimate of the articulated ICP. Although the method is quite similar to our approach, the EKF formulation is different.
Combination of Two Sets of Measurements
In the proposed system, the 2D pixel and 3D XYZ measurements of each feature are simply concatenated to obtain a 5Nx1 measurement vector for N features. At this point, one may question whether these two sets of measurements could be combined by some appropriate method other than concatenation. However, the authors of [6] prove that concatenation of measurements is more flexible and efficient than a weighted combination if the measurement matrices are not identical. Although they prove the theorem for the linear Kalman filter case, a 5Nx1 measurement vector is still a sound choice.
Error Accumulation
In the experiments with the proposed system, it is observed that the error increases gradually at each frame, i.e., there is error accumulation. Thus, an automatic realignment module is added. In the previous thesis progress report, there were some doubts related to such a drift. The authors of [7] state that inaccuracy can accumulate in filters due to limitations in the representation of probability distributions, such as the linearization assumptions of Gaussian-based filters like the EKF. Moreover, relying on optical flow for 3D tracking also results in error accumulation [8].
Non-sequential Methods
In the 3D tracking literature, there are recent algorithms that propose batch methods instead of sequential state updates, i.e., filters. For instance, utilizing the power of multi-core processors, the authors of [9] propose a parallel tracking and mapping algorithm, which also performs bundle adjustment to refine the camera poses and the 3D coordinates of map points. Moreover, in [7] it is stated that if the overall processing budget is small, filtering should be applied; otherwise, performing bundle adjustment on a small set of key-frames results in higher performance.

Furthermore, utilizing parallel processing on GPUs, an ICP-based 3D tracking algorithm is proposed in [10]. The system tracks all pixels in a 640x480 image, and the performance is quite satisfactory due to the dense tracking. Similarly, the authors of [11] maximize photo-consistency by linearizing the cost function to register all pixels of consecutive frames.
Previous Thesis Progress Report Comments

In the previous thesis progress report, the following comments were made:
• Monte Carlo simulations should be performed in order to analyze the system convergence characteristics thoroughly.
• Real data tests with the object/camera motion constrained to one direction should be performed in order to see whether the algorithm makes logical pose estimates or not.
As summarized in the following subsection, plenty of real and artificial tests have been performed in order to address the above comments.
Summary of the Efforts Since the Last Thesis Progress Report

The efforts within the last six-month period can be summarized as follows:
• Monte Carlo simulations are performed in order to analyze the system convergence characteristics.
• Monte Carlo simulations are performed in order to observe the strength of utilizing the EKF compared to an instantaneous state estimation approach.
• Monte Carlo simulations are performed in order to observe the effect of weighting 2D measurements on the 3D tracking quality when the generation and prediction noise parameters are different.
• Monte Carlo simulations are performed in order to observe the relation between the variance of the tracked 3D points and the 3D tracking quality.
• Monte Carlo simulations are performed in order to observe the effect of emphasizing the 3D measurements that are far from the center of mass of the 3D observations.
• Using real data, many methods are compared in terms of their effectiveness in detecting the quality of 2D KLT tracking.
• A robust KLT tracker, which utilizes intensity and SIM data, is proposed.
• The effect of weighting 2D measurements, based on TIM errors, on the 3D tracking quality is analyzed using real data.
• The effect of weighting 3D measurements, based on TIM errors and distances from the center of mass of the 3D observations, on the 3D tracking quality is analyzed using real data.
Proposed 3D Tracking Algorithm

As already mentioned, in the proposed system, the object selected by the user is tracked using the data provided by the vision and depth sensors. Considering Figure 1, the following variables define the overall system:
$[X_o^i\ Y_o^i\ Z_o^i]^T$ : 3D coordinates of the i-th object point with respect to the object reference frame (for instance, obtained by user segmentation at the very first frame or from a CAD model)

$[X_o^{di}\ Y_o^{di}\ Z_o^{di}]_t^T$ : 3D coordinates of the i-th object point at time instant t, measured by the depth sensor with respect to the depth camera reference frame

$[x_o^i\ y_o^i]_t^T$ : 2D pixel coordinate measurement of the i-th object point at time instant t

$R_{do} = R(\rho_{do}, \theta_{do}, \varphi_{do})$ : Rotation parameters between the object and depth camera in the x, y and z directions respectively

$t_{do} = [t_{x,do}\ t_{y,do}\ t_{z,do}]^T$ : Translation parameters between the object and depth camera in the x, y and z directions respectively

$v_{do} = [\dot{\rho}_{do}\ \dot{\theta}_{do}\ \dot{\varphi}_{do}\ \dot{t}_{x,do}\ \dot{t}_{y,do}\ \dot{t}_{z,do}]^T$ : Associated velocity parameters between the object and depth camera

$R_{dv} = R(\rho_{dv}, \theta_{dv}, \varphi_{dv})$ : Rotation parameters between the RGB and depth camera in the x, y and z directions respectively

$t_{dv} = [t_{x,dv}\ t_{y,dv}\ t_{z,dv}]^T$ : Translation parameters between the RGB and depth camera in the x, y and z directions respectively

$K_v$ : Internal calibration vector ($[f_x\ f_y\ o_x\ o_y]^T$) of the RGB camera, composed of the focal lengths and the principal point offsets
With the above parameters defined, the overall system is explained in detail in the following sub-sections.
Initial Semi-Automatic Segmentation

Since the utilized Kinect sensor provides RGB and 3D data, it is possible to extract 3D object models in the form of colored point clouds. For this purpose, a semi-automatic object model extraction algorithm, based on grow-cut segmentation [12], is utilized.
First of all, the user is asked to select rough foreground and background regions from the first RGB image of the sequence. Then, the grow-cut implementation of [13] extracts the foreground region belonging to the object, as shown in Figure 3. As the RGB and depth cameras are calibrated, the colored 3D point cloud model of the object is obtained. For complex objects, 3D object models can be generated using algorithms such as [14]. Figure 4 shows snapshots from the "Book" and "Face" 3D models.
Once the object is segmented and the associated point cloud model (PCM) is obtained, the next step is the determination of which features to utilize for 3D tracking, since, due to the computational requirements, all object points cannot be tracked. Intuitively, there should be a relation between the variance of the selected 3D points and the pose estimation accuracy. Indeed, simulations verify this intuition, and the results are summarized in Table 1. As the spread of the points in 3D increases, the pose estimation errors decrease. Consequently, the features to track are selected using a regular sampling grid, as shown in Figure 5.
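To illustrate the idea, a minimal sketch of regular-grid feature selection over a segmented object mask is given below; the step size and the helper name sample_features_on_grid are illustrative assumptions, not part of the implemented system.

import numpy as np

def sample_features_on_grid(object_mask, step=20):
    """Pick one point per grid cell so that the selected features
    spread over the whole segmented object (hypothetical helper)."""
    h, w = object_mask.shape
    features = []
    for y in range(0, h, step):
        for x in range(0, w, step):
            # take the first object pixel found inside the current cell
            cell = object_mask[y:y + step, x:x + step]
            ys, xs = np.nonzero(cell)
            if len(xs) > 0:
                features.append((x + xs[0], y + ys[0]))
    return np.array(features, dtype=np.float32)

# usage: the mask comes from the grow-cut segmentation of the first frame
# pts = sample_features_on_grid(mask, step=20)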
(a) Input image (b) Rough segmentation
(c) Object boundaries (d) Object mask
Figure 3 Segmentation results for the "Book" model.
(a) "Book" model (b) "Face" model
Figure 4 3D object models in the form of colored point clouds.
Table 1 Relation between norm of std vector and 3D tracking errors.

Norm    Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
44.92   3.8           3.6           4.1           2.1        3.0        1.9
59.83   3.6           2.4           3.5           1.7        3.1        1.2
68.76   3.1           2.3           3.4           1.6        2.5        1.2
Figure 5 Regular sampling of tracked points.
Transferring Measurements between Consecutive Frames

In the proposed system, the 3D object pose is determined using an EKF. However, this requires the 2D and 3D measurements associated with each object point i ($[x_o^i\ y_o^i]^T$ and $[X_o^{di}\ Y_o^{di}\ Z_o^{di}]^T$, respectively) to be transferred between consecutive time instants. In the algorithm proposed in the previous thesis progress report, 2D measurements are matched between consecutive frames using KLT tracking on intensity data. Thus, since the vision and depth sensors are externally calibrated, the 3D measurement of object point i is also obtained for the next time instant.
Figure 6-a illustrates typical KLT-tracked features using intensity data.
Due to the fact that relying on optical flow for 3D tracking results in error accumulation [8], on our way to obtaining a highly accurate 3D tracker, the association of measurements between consecutive frames should be handled with special care. Figure 7 shows SSD surfaces of typical features matched via KLT. A patch of dimensions 10x10 around a 2D measurement at time instant t is moved around the KLT-matched 2D location at time instant t+1 in order to calculate the SSD using:
$SSD_i(x, y) = \sum_{m=-5}^{5} \sum_{n=-5}^{5} \left( I_{t+1}\big((x_o^i)_{t+1} + x + m,\ (y_o^i)_{t+1} + y + n\big) - I_t\big((x_o^i)_{t} + m,\ (y_o^i)_{t} + n\big) \right)^2$

where I stands for intensity and (x, y) represents the offset from the KLT-matched position at t+1.
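For illustration, a small NumPy sketch of such an SSD surface computation is given below; the patch half-width and the search range are assumed values, and integer pixel positions are assumed.

import numpy as np

def ssd_surface(I_t, I_t1, p_t, p_t1, half_patch=5, search=15):
    """SSD between the patch around feature position p_t in frame I_t and
    patches around offsets (x, y) from the KLT match p_t1 in frame I_t1.
    p_t and p_t1 are integer (x, y) pixel positions."""
    xt, yt = p_t
    x1, y1 = p_t1
    ref = I_t[yt - half_patch:yt + half_patch + 1,
              xt - half_patch:xt + half_patch + 1].astype(np.float64)
    surf = np.zeros((2 * search + 1, 2 * search + 1))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = I_t1[y1 + dy - half_patch:y1 + dy + half_patch + 1,
                        x1 + dx - half_patch:x1 + dx + half_patch + 1].astype(np.float64)
            surf[dy + search, dx + search] = np.sum((cand - ref) ** 2)
    return surf  # the offset of the minimum indicates the true match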
(a) Intensity KLT tracking.
(b) Depth KLT tracking.
(c) SIM KLT tracking.
Figure 6 KLT tracking using different data.
In Figure 7-a, the peak of the SSD graph of the green feature in Figure 5 is at the origin, and hence, we conclude that the KLT tracking is successful. However, in Figure 7-b, the KLT tracker failed to locate the exact match of the blue feature in Figure 5, since the peak is at $(3, -10)^T$.
In order to increase the accuracy of the 3D tracker, the errors of the KLT tracker should definitely be corrected. At this point, the following question arises: "Can one increase the quality of the KLT tracker by utilizing the available 3D information?"
(a) Successfully tracked feature using intensity data.
(b) Unsuccessfully tracked feature with intensity data.
(c) Tracking corrected using SIM data.
Figure 7 SSD plots of different features.
An instant answer to this question may be the exploitation of depth maps, as in Figure 6. For each feature, two parallel KLT trackers are utilized using intensity and depth data, assuming that the pixel coordinates of the matches should ideally be the same. However, since details are lost in the depth data, the KLT tracker fails, as in Figure 6-b.
On the other hand, the Shape Index Map (SIM), proposed in [15], is well-suited for our approach, since the method exaggerates details by using the principal curvatures $\kappa_1$ and $\kappa_2$:

$SI = \frac{1}{2} - \left(\frac{1}{\pi}\right) \tan^{-1}\left(\frac{\kappa_1 + \kappa_2}{\kappa_1 - \kappa_2}\right)$

The principal curvatures ($\kappa_1$ and $\kappa_2$) stand for the minimum and maximum curvatures of the point of interest and are estimated as in [24].
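A minimal sketch of the shape index computation from given principal curvature maps is shown below; the curvature estimation itself (done as in [24]) is assumed to be available, and the max/min ordering convention is an assumption.

import numpy as np

def shape_index_map(kappa1, kappa2):
    """Shape index map from two principal curvature maps.
    Implements SI = 1/2 - (1/pi) * arctan((k_max + k_min) / (k_max - k_min)),
    using arctan2 so that equal curvatures do not cause a division by zero.
    Flat regions (both curvatures zero) map to 0.5."""
    k_max = np.maximum(kappa1, kappa2)
    k_min = np.minimum(kappa1, kappa2)
    return 0.5 - (1.0 / np.pi) * np.arctan2(k_max + k_min, k_max - k_min)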
Figure 8 shows a typical SIM. Since the formation of the SIM is computationally involved, it is computed over a dynamically selected bounding box, as in Figure 6-c.
Figure 8 A typical shape index map.
Although the feature of Figure 7-b is erroneously tracked using intensity data, the SIM KLT tracking is correct, as shown in Figure 7-c. Thus, we have a strong indication that KLT tracking using intensity data can be corrected using KLT tracking on SIM data. However, how can we decide which tracking is more accurate than the other without explicit calculation of the SSD values? The next sub-section answers this question.
Reliability Check and Correction of Correspondences

Figure 9 KLT accuracies of features.

Figure 9 shows the KLT accuracies of typically tracked features using intensity data. In order to obtain Figure 9, SSD plots, as described in Sub-section 2-ii, are computed, and the inverse of the distance of each SSD peak from the origin is plotted (a small ε is added to avoid division by zero). Although SSD plots reveal the accuracy of KLT tracking, due to the computational requirements, a method should be developed in order to detect the quality of KLT tracking without explicit calculation of the SSD values. To this aim, the following methods are tried:
• Shi-Tomasi Cornerness Measure
• Harris Cornerness Measure
• KLT Error
• Template Inverse Matching Error
Once an effective way to detect the reliability of a KLT tracked feature is developed, it can be utilized to correct intensity KLT tracking using SIM KLT tracking and vice versa. In the following paragraphs, each method is briefly explained and associated performances are given:
Shi-Tomasi Cornerness Measure
If a feature has a high "cornerness" measure with high spatial derivatives, it will probably be tracked with high accuracy. Hence, in [16], Shi and Tomasi propose a method to locate features with a high cornerness measure. First, using a patch P of size w x h around the interest point i, the structure tensor is calculated:

$A = \sum_{u=-h/2}^{h/2} \sum_{v=-w/2}^{w/2} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$
where $w(u,v)$ represents the weight of pixel $(u,v)$, generally obtained by sampling a 2D Gaussian, and $I_x$ and $I_y$ stand for the spatial derivatives in the x and y directions, respectively.

Figure 10 Shi-Tomasi cornerness measure.
If both eigenvalues of the structure tensor are large, then the point i has a high cornerness measure. Thus, in [16], the minimum of the eigenvalues of A is utilized as the cornerness measure. Figure 10 shows the Shi-Tomasi cornerness measures of the features whose reliabilities are given in Figure 9.
Harris Cornerness Measure
The Harris corner detector [17] works similarly to the Shi-Tomasi approach; however, it does not require explicit calculation of the eigenvalues of the structure tensor. Instead, it exploits the fact that if both eigenvalues are large, their product deviates much from their sum. Thus, the following cornerness measure is developed:
$C_H = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 = \det(A) - k\,\mathrm{trace}^2(A)$

where k is a constant, generally selected as 0.04.
Figure 11 shows Harris cornerness
measures of the features whose reliabilities are given in Figure 9 .
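For reference, both cornerness measures can be computed from the same structure tensor; a small OpenCV-based sketch is given below, where the patch size, the derivative kernel size and the constant k = 0.04 are illustrative.

import cv2
import numpy as np

def cornerness_measures(gray, block_size=5, deriv_ksize=3, k=0.04):
    """Per-pixel Shi-Tomasi (min eigenvalue of A) and Harris
    (det(A) - k * trace(A)^2) cornerness maps for a grayscale image."""
    gray32 = np.float32(gray)
    shi_tomasi = cv2.cornerMinEigenVal(gray32, block_size, ksize=deriv_ksize)
    harris = cv2.cornerHarris(gray32, block_size, deriv_ksize, k)
    return shi_tomasi, harris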
KLT Error
The KLT error simply calculates the SSD between the patch around the feature at time instant t and the patch around its match at t+1:

$KLT\ Error_i = \sum_{m=-h/2}^{h/2} \sum_{n=-w/2}^{w/2} \left( I_{t+1}\big((x_o^i)_{t+1} + m,\ (y_o^i)_{t+1} + n\big) - I_t\big((x_o^i)_{t} + m,\ (y_o^i)_{t} + n\big) \right)^2$
Figure 11 Harris cornerness measure.
The KLT error is expected to increase if a feature is tracked erroneously. Figure 12 illustrates the 1/KLT Error values of the features whose reliabilities are given in Figure 9.
Figure 12 1/KLT Errors.
Template Inverse Matching
Proposed by the authors of [18], Template Inverse Matching (TIM) can be utilized to detect the qualities of 2D measurements matched across consecutive frames. TIM simply calculates the Euclidean distance between the 2D measurement $[x_o^i\ y_o^i]_t^T$ associated with a feature i at time t and the 2D measurement $[x_o^i\ y_o^i]_t^{\prime\,T}$ obtained by tracking i's correspondence at time t+1 backward:
Figure 13 $1/d_{TIM}^{0.5}$ values.
values. d
TIM i
= β[x o i
y o i
] t
T
− [x o i
y o i
] t
′ T
β
Ideally zero, $d_{TIM}^i$ increases as the quality of feature i decreases. Instead of utilizing $d_{TIM}$ directly, the real-data tests detailed in Section 3 reveal that the utilization of $d_{TIM}^{0.5}$ increases the tracking accuracy.
Figure 13 shows the $1/d_{TIM}^{0.5}$ values for the features whose reliabilities are given in Figure 9.
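A minimal sketch of such a forward-backward (TIM-style) check on top of OpenCV's pyramidal KLT is given below; the window size, pyramid depth and the helper name tim_errors are assumptions.

import cv2
import numpy as np

def tim_errors(img_t, img_t1, pts_t, lk_params=None):
    """Forward-backward (Template Inverse Matching style) error per feature:
    track t -> t+1, track the result back to t, and measure the distance
    between the original and the back-tracked 2D positions."""
    if lk_params is None:
        lk_params = dict(winSize=(21, 21), maxLevel=3)
    pts_t = pts_t.reshape(-1, 1, 2).astype(np.float32)
    pts_t1, st_fwd, _ = cv2.calcOpticalFlowPyrLK(img_t, img_t1, pts_t, None, **lk_params)
    pts_back, st_bwd, _ = cv2.calcOpticalFlowPyrLK(img_t1, img_t, pts_t1, None, **lk_params)
    d_tim = np.linalg.norm(pts_t - pts_back, axis=2).ravel()
    valid = (st_fwd.ravel() == 1) & (st_bwd.ravel() == 1)
    return pts_t1.reshape(-1, 2), d_tim, valid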
A subjective comparison of the aforementioned methods can be done by observing the correlation between Figure 9 and Figures 10-13. A more reliable way is to calculate the sample correlation coefficients between the reliabilities and the tested metrics using:

$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ represent the means of the vectors x and y.
Table 2 shows the sample correlation coefficients between reliabilities and
the tested metrics:
Table 2 Sample correlation coefficients.

Method                          Sample Correlation Coefficient
d_TIM^0.5                       0.4983
KLT Error                       0.3161
d_TIM                           0.3146
Harris Cornerness Measure       0.1056
Shi-Tomasi Cornerness Measure   0.0606
Examining Table 2, we can conclude that, among the tested algorithms, the TIM method is the most accurate in terms of detecting the accuracy of KLT tracking.
Consequently, the following feature tracker is proposed:
1. Track features via the pyramidal KLT algorithm [19] using intensity data,
2. Track features via the pyramidal KLT algorithm [19] using SIM data,
3. Calculate TIM errors for the intensity and SIM trackers,
4. For each feature, comparing the TIM errors of the intensity and SIM trackers, assign the final correspondence at t+1 based on the tracker with the minimum error,
5. Discard features with a TIM error larger than a predefined threshold.
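A sketch of how steps 1-5 could be combined is given below, reusing the hypothetical tim_errors helper from the previous sketch; the threshold value is an assumption, and the SIM maps are assumed to be converted to 8-bit single-channel images before KLT tracking.

import numpy as np

def combined_klt_step(intensity_t, intensity_t1, sim_t, sim_t1, pts_t, tim_thresh=2.0):
    """Track every feature on intensity and on SIM data, keep the per-feature
    result with the smaller TIM error, and drop unreliable features."""
    pts_int, d_int, ok_int = tim_errors(intensity_t, intensity_t1, pts_t)
    pts_sim, d_sim, ok_sim = tim_errors(sim_t, sim_t1, pts_t)

    use_sim = d_sim < d_int                        # per-feature tracker selection
    pts_out = np.where(use_sim[:, None], pts_sim, pts_int)
    d_best = np.where(use_sim, d_sim, d_int)
    keep = (d_best < tim_thresh) & np.where(use_sim, ok_sim, ok_int)
    return pts_out, d_best, keep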
Figure 14 shows final feature correspondences on intensity and SIM data.
(a) Correspondences on intensity data
(b) Correspondences on SIM data
Figure 14 Final feature correspondences.
Relying on both intensity and SIM data, the proposed tracker has the following properties:

• It can operate under varying lighting conditions,
• It can track relatively smooth objects,
• It can track partially specular objects.
State Estimation Using Extended Kalman Filter (EKF)

Robust 2D and 3D measurements, obtained using the proposed feature tracker, are fed to an EKF, which estimates the 6-DOF 3D motion between the camera and the object of interest. Furthermore, a novel measurement-weighting scheme favors 'good' measurements and provides highly accurate tracking. In the following paragraphs, the state update and measurement equations, as well as the proposed weighting scheme, are highlighted.
State Update Equations
State update equations define the transition from the previous state $x_{t-1}$ to the current state $x_t$ when the input $u_t$ is applied to the system:

$x_t = f(x_{t-1}, u_t) + \varepsilon_t$
In the proposed system, a constant velocity motion model is applied:

$\begin{bmatrix} \rho_{do} \\ \theta_{do} \\ \varphi_{do} \\ t_{x,do} \\ t_{y,do} \\ t_{z,do} \end{bmatrix}_t = \begin{bmatrix} \rho_{do} \\ \theta_{do} \\ \varphi_{do} \\ t_{x,do} \\ t_{y,do} \\ t_{z,do} \end{bmatrix}_{t-1} + \begin{bmatrix} \dot{\rho}_{do} \\ \dot{\theta}_{do} \\ \dot{\varphi}_{do} \\ \dot{t}_{x,do} \\ \dot{t}_{y,do} \\ \dot{t}_{z,do} \end{bmatrix}_{t-1} + \varepsilon_t^{p}$

where $\varepsilon_t^{p}$ is the state update noise for the motion parameters between the depth camera and object reference frames. On the other hand, the velocity parameters only have noise updates:

$\begin{bmatrix} \dot{\rho}_{do} \\ \dot{\theta}_{do} \\ \dot{\varphi}_{do} \\ \dot{t}_{x,do} \\ \dot{t}_{y,do} \\ \dot{t}_{z,do} \end{bmatrix}_t = \begin{bmatrix} \dot{\rho}_{do} \\ \dot{\theta}_{do} \\ \dot{\varphi}_{do} \\ \dot{t}_{x,do} \\ \dot{t}_{y,do} \\ \dot{t}_{z,do} \end{bmatrix}_{t-1} + \varepsilon_t^{v}$
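A minimal sketch of the corresponding EKF prediction step under this constant velocity model is shown below; the 12x12 process noise covariance Q is an assumed tuning parameter, and a time step of one frame is used as in the equations above.

import numpy as np

def ekf_predict(x, P, Q):
    """Constant velocity prediction: x = [pose(6); velocity(6)].
    The pose is incremented by the velocity (one-frame time step);
    the velocities receive only a noise update through Q."""
    F = np.eye(12)
    F[:6, 6:] = np.eye(6)        # pose_t = pose_{t-1} + velocity_{t-1}
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred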
Measurement Equations
Measurement equations relate the current states and the current measurements as follows:

$z_t = h(x_t) + \varepsilon_t$
In the proposed model, the measurements are 3D coordinate measurements from the depth sensor and 2D pixel coordinates from the RGB camera. The 3D object coordinates with respect to the depth camera reference frame can be related to the states as follows:

$\begin{bmatrix} X_o^{di} \\ Y_o^{di} \\ Z_o^{di} \end{bmatrix}_t = R_{do} \begin{bmatrix} X_o^{i} \\ Y_o^{i} \\ Z_o^{i} \end{bmatrix} + t_{do} + \varepsilon_t^{i}$
Finally, the 2D pixel coordinates of the object points can be related to the states as follows:

$\alpha_i \begin{bmatrix} x_o^i \\ y_o^i \\ 1 \end{bmatrix}_t = K \left[ R_{dv} \left[ R_{do} \begin{bmatrix} X_o^i \\ Y_o^i \\ Z_o^i \end{bmatrix} + t_{do} \right] + t_{dv} \right] + \varepsilon_t^{ii}$

where the inner bracket gives the 3D object coordinates with respect to the depth camera and the outer bracket gives the 3D object coordinates with respect to the RGB camera, $\alpha_i$ represents a scale factor, and $\varepsilon_t^{i}$ and $\varepsilon_t^{ii}$ are observation noises with covariance matrices $Q_i$ and $Q_{ii}$.
Weighting Observations
Generally, the observation noise covariance matrices are assumed constant, and they are specified by the sensor characteristics. Assuming independent measurements, the $Q_i$ and $Q_{ii}$ covariance matrices for the 2D and 3D measurement noises have the following form:

$Q_i = diag(\sigma^2_{pix}), \qquad Q_{ii} = diag(\sigma^2_{XYZ})$

$\sigma^2_{pix}$ represents the variance of the noise on the 2D measurements, possibly caused by finite image resolution, quantization, motion blur, errors in KLT tracking, etc. On the other hand, finite resolution, multiple reflections, quantization, etc. can be possible sources of the noise on the 3D measurements, represented by $\sigma^2_{XYZ}$. Thus, in the previous thesis progress report, $Q_i$ and $Q_{ii}$ matrices having the above form are utilized.
Thorough analysis in Sub-section 2-iii reveals that the 2D tracking reliabilities of the features may vary significantly; hence, the utilization of a constant $Q_i$ matrix may decrease the system performance. Based on the reliabilities of the features determined by TIM, the following weighting scheme is proposed for the 2D measurements:

$\sigma^2_{pix,i} = N \, \frac{(d_{TIM}^i)^{0.5}}{\sum_{i=1}^{N} (d_{TIM}^i)^{0.5}} \, \sigma^2_{pix}$

Consequently, if a feature has a large TIM error, it has probably been tracked erroneously; it is assigned a higher measurement noise and contributes less to the state updates.
Moreover, as explained in Sub-section 2-ii, the 3D measurement of feature i is obtained from the corresponding 2D measurement (using the external calibration of the two sensors), so errors in KLT tracking also decrease the quality of the 3D measurements. Hence, using the perspective camera model, the 3D error corresponding to TIM can be found as:

$w_1^i = \frac{Z_o^{di}}{f} (d_{TIM}^i)^{0.5}, \qquad \bar{w}_1^i = \frac{w_1^i}{\sum_{i=1}^{N} w_1^i}$

where $Z_o^{di}$ is the depth measurement of feature i and f is the depth camera focal length.
In Section 2-i it is shown that there is a strong relation between the 3D variance of the selected object points and the pose estimation accuracy; thus, regular sampling is utilized in order to select the points to track. Hence, one may expect that points far away from the 3D center of mass should be favored during pose estimation. In order to validate this intuition, first the following object points are selected for pose estimation:
Figure 15 Selected object points.
Then pose estimations are performed using a 100-frame-long artificial sequence with the following properties:

• For the first case, the 3D measurements have generation and prediction noises of variance $\sigma^2_{XYZ} = 10$,
• For the second case, the prediction noises are weighted based on the distances of the 3D measurements from the 3D center of mass of the observations $[C_X\ C_Y\ C_Z]^T$:

$1/w_2^i = \left\| [X_o^{di}\ Y_o^{di}\ Z_o^{di}]^T - [C_X\ C_Y\ C_Z]^T \right\|, \qquad \bar{w}_2^i = \frac{w_2^i}{\sum_{i=1}^{N} w_2^i}$

$\sigma^2_{XYZ,i} = N \, \bar{w}_2^i \, \sigma^2_{XYZ}$
Pose estimation accuracies for the two cases are provided in Table 3.

Table 3 3D tracking accuracies.

         Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
Case II  5.9           3.7           5.9           2.5        4.47       2.00
Case I   6.3           4.0           5.5           2.69       4.67       2.20
It is clear that, for such an exaggerated point selection as in Figure 15, weighting the 3D measurements based on their distances from the 3D center of mass increases the tracking performance. Consequently, the following weighting scheme is proposed for the 3D measurements:

$\sigma^2_{XYZ,i} = N \, \frac{\bar{w}_1^i + \bar{w}_2^i}{\sum_{i=1}^{N} (\bar{w}_1^i + \bar{w}_2^i)} \, \sigma^2_{XYZ}$
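A small sketch of how these per-feature variances could be assembled is given below; all names are illustrative, and sigma2_pix and sigma2_xyz denote the base variances determined by the sensor characteristics.

import numpy as np

def measurement_noise(d_tim, Z, f, P3d, sigma2_pix, sigma2_xyz):
    """Per-feature 2D and 3D noise variances from TIM errors and 3D spread.
    d_tim: TIM errors (N,), Z: depth measurements (N,), P3d: 3D observations (N, 3)."""
    N = len(d_tim)
    s = np.sqrt(d_tim)                               # d_TIM^0.5
    var_pix = N * s / np.sum(s) * sigma2_pix         # weighted 2D variances

    w1 = Z / f * s                                   # TIM error propagated to 3D
    w1 /= np.sum(w1)
    dist = np.linalg.norm(P3d - P3d.mean(axis=0), axis=1)
    w2 = 1.0 / np.maximum(dist, 1e-6)                # favor points far from the center of mass
    w2 /= np.sum(w2)
    var_xyz = N * (w1 + w2) / np.sum(w1 + w2) * sigma2_xyz
    return var_pix, var_xyz

# The EKF measurement covariance is then a diagonal of the per-feature variances,
# e.g. R = np.diag(np.concatenate([np.repeat(var_xyz, 3), np.repeat(var_pix, 2)]))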
Automatic Realignment to Prevent Drift

Since cumulative systems are prone to drift due to error accumulation, in the proposed system drift is detected and corrected automatically. In the previous thesis progress report, this step was accomplished by automatic segmentation of the data and registration of the model and the segmented data. However, this step should also be modified, probably by utilizing wide feature matching via the SIM and intensity trackers.
Test Results

The proposed 3D tracking algorithm is thoroughly analyzed using experiments with real and artificial data. Using artificial data:
• The system convergence characteristics with different initial conditions are examined,
• The effect of weighting observations on the 3D tracking quality is studied,
• The strength of utilizing the EKF, compared to an instantaneous state estimation approach using quaternions, is examined.
Using real data:
• The effect of weighting observations on the 3D tracking quality is examined,
• The effects of regular sampling and of assisting KLT tracking with SIM data on the 3D tracking quality are studied.
Artificial Data Tests

As shown in Figure 2, the initial state is estimated using feature tracking and registering the matched features using a PnP- and LM-based approach [20]-[21]. The system performance with a known initial state is shown in Figure 16. It is assumed that the generation and prediction noise variances are the same for all 2D and 3D measurements: 3 pixels and 10 mm, respectively. Moreover, the object makes a dominant rotational motion in the y-direction, starting from the initial pose

$[\rho_{do}\ \theta_{do}\ \varphi_{do}\ t_{x,do}\ t_{y,do}\ t_{z,do}]^T = [10^{-4}\ \ {-1.4}\ \ 10^{-4}\ \ 50\ \ 50\ \ 100]^T$

with minor velocities in the other directions: $v_{do} = [10^{-4}\ \ 0.025\ \ 10^{-4}\ \ 10^{-4}\ \ 10^{-4}\ \ 10^{-4}]^T$.
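For completeness, a hedged sketch of such an initialization using OpenCV's iterative (LM-refined) PnP solver is given below; solvePnP is used here only as a stand-in for the approach of [20]-[21].

import cv2
import numpy as np

def initial_pose_pnp(object_pts, image_pts, K):
    """Estimate the initial object pose from Nx3 model points and their Nx2
    pixel matches via PnP with Levenberg-Marquardt refinement (sketch)."""
    ok, rvec, tvec = cv2.solvePnP(
        object_pts.astype(np.float64),        # Nx3 model coordinates
        image_pts.astype(np.float64),         # Nx2 pixel coordinates
        K, None,                              # intrinsics, no distortion assumed
        flags=cv2.SOLVEPNP_ITERATIVE)         # iterative (LM-based) solver
    R, _ = cv2.Rodrigues(rvec)
    return ok, R, tvec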
For the artificial data experiments, 20 object points are selected and tracked. Furthermore, the performance of the proposed formulation is compared to that of a quaternion-based approach, which finds instantaneous state estimates using 3D-3D correspondences [22]. It is clear from Figure 16 that the utilization of a filtering-based approach increases the pose estimation accuracy while reducing jitter.
(a) Rotation-x (b) Rotation-y (c) Rotation-z
(d) Translation-x (e) Translation-y (f) Translation-z
(g) Rotation-x Error (h) Rotation-y Error (i) Rotation-z Error
(j) Translation-x Error
(k) Translation-y Error (l) Translation-z Error
(m) Rotation-x Variance (n) Rotation-y Variance (o) Rotation-z Variance
(p) Translation-x Variance (q) Translation-y Variance (r) Translation-z Variance
Figure 16 Artificial results with known initial estimates
(a) Rotation-x Error (b) Rotation-y Error (c) Rotation-z Error
(d) Translation-x Error (e) Translation-y Error (f) Translation-z Error
Figure 17 Artificial Results with good initial estimate
Since the initial state is known, the associated errors increase slightly from 0, and then, as the filter converges, the errors decrease. Note that the errors will probably never reach 0, due to the observation and state update noises. Moreover, the variance of the estimates also converges towards 0 as the filter is updated. In addition, there is a strong correlation between the state errors and the variances.
Figure 17 illustrates the case when the initial state is not known and the filter is initiated with a good initial estimate:

$[\rho_{do}\ \theta_{do}\ \varphi_{do}\ t_{x,do}\ t_{y,do}\ t_{z,do}]^T = [10^{-4}\ \ {-1.41}\ \ 10^{-4}\ \ 75\ \ 75\ \ 150]^T$
As expected, the state errors decrease as the filter converges.
A final test is devoted to the case in which the filter diverges. When the initial state is

$[\rho_{do}\ \theta_{do}\ \varphi_{do}\ t_{x,do}\ t_{y,do}\ t_{z,do}]^T = [10^{-4}\ \ {-1.4}\ \ 10^{-4}\ \ 150\ \ 150\ \ 300]^T$

the filter diverges:
(a) Rotation-x Error (b) Rotation-y Error (c) Rotation-z Error
(d) Translation-x Error (e) Translation-y Error (f) Translation-z Error
Figure 18 Artificial results with bad initial estimate
So far, the generation and prediction noise parameters for the EKF are assumed to be equal (3 pixels and 10 mm for the 2D and 3D observations, respectively). However, as shown in Sub-section 2.ii, 2D measurements tend to have different qualities and, hence, different (generation) noise parameters. In order to observe the system accuracy under 2D observations with different noise characteristics, a set of artificial tests is also performed. The 2D observations are divided into 5 subsets, and each subset is assigned a specific generation noise variance, as shown in Table 4. Then, the following cases are compared:

• The prediction noise parameters are the same as the generation noise parameters,
• The prediction noise parameters are equal for all 2D observations.

Table 4 shows the 3D tracking accuracies for the aforementioned cases. It should be noted that the system has better performance when the observations are treated unequally depending on their generation noise parameters. However, in a practical scenario, it is not possible to determine the generation noise parameters of the observations without, for instance, the utilization of SSD plots. Instead, TIM is well suited to the problem of determining observation quality, as detailed in Sub-section 2-iii.
Table 4 Effect of difference between generation and prediction noise parameters.

Generation Noise Variances  Prediction Noise Variances  Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
0.5-1-1.5-2-2.5             0.5-1-1.5-2-2.5             3.7           2.2           3.3           1.62       3.13       1.13
0.5-1-1.5-2-2.5             All 1.5                     3.9           2.2           3.6           1.66       3.27       1.16
1-2-3-4-5                   1-2-3-4-5                   4.3           2.5           4.0           1.80       3.50       1.32
1-2-3-4-5                   All 3                       4.5           2.5           4.1           1.77       3.74       1.33
2-4-6-8-10                  2-4-6-8-10                  4.7           2.7           4.7           1.92       3.73       1.49
2-4-6-8-10                  All 6                       5.0           2.8           4.9           1.91       3.90       1.51
Real Data Tests

In order to examine the performance of the proposed 3D tracker using real data, 100-frame-long "Face" and "Book" sequences are utilized. Figure 19 shows typical frames from the "Face" and "Book" sequences.
(a) Color frame (b) Depth frame (c) SIM frame
(d) Color frame (e) Depth frame (f) SIM frame
Figure 19 Selected color, depth and SIM frames for the "Face" and "Book" sequences.
Since ground-truth camera poses are not available, the reprojection and 3D error metrics are utilized:

$Reprojection\ Error_i = \left\| [x_o^i\ y_o^i]_t^T - [x_o^i\ y_o^i]_{e,t}^T \right\|$

where $[x_o^i\ y_o^i]_t^T$ is the 2D pixel coordinate observation and $[x_o^i\ y_o^i]_{e,t}^T$ is the 2D pixel coordinate obtained using the associated state estimates for the i-th object point at time instant t:

$\alpha_i \begin{bmatrix} x_o^i \\ y_o^i \\ 1 \end{bmatrix}_{e,t} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \left[ R_{dv} \left[ R_{do} \begin{bmatrix} X_o^i \\ Y_o^i \\ Z_o^i \end{bmatrix} + t_{do} \right] + t_{dv} \right]$

$3D\ Error_i = \left\| \begin{bmatrix} X_o^{di} \\ Y_o^{di} \\ Z_o^{di} \end{bmatrix}_t - \left[ R_{do} \begin{bmatrix} X_o^i \\ Y_o^i \\ Z_o^i \end{bmatrix} + t_{do} \right] \right\|$
Mean error values are obtained by averaging the associated values over all tracked object points in a frame. Using real data, the following algorithms are tested:
• The final 3D tracker proposed in Section 2,
• A 3D tracker which weights the 2D and 3D observations as in Section 2.iv but does not utilize regular sampling of 3D points (Sub-section 2.i) or SIM to aid feature tracking (Sub-section 2.ii),
• The 3D tracker proposed in the previous thesis progress report, which does not utilize observation weighting, regular sampling of 3D points, or SIM to aid feature tracking (the so-called "Base" algorithm),
• The quaternion-based 3D tracker [22].
Figure 20 shows the mean reprojection and 3D errors for the "Face" sequence.
(a) Mean reprojection errors.
(b) Mean 3D errors.
Figure 20 Mean reprojection and 3D errors for the "Face" sequence.
Figure 21 shows the mean reprojection and 3D errors for the "Book" sequence.
(a) Mean reprojection errors.
(b) Mean 3D errors.
Figure 21 Mean reprojection and 3D errors for the "Book" sequence.
Examining Figure 21, one can conclude that weighting the observations increases the 3D tracking performance. However, the final proposed algorithm has superior performance in terms of both mean reprojection and 3D errors.
Conclusions and Future Work

In this progress report, the effort within the last six-month period is detailed. To summarize:
• A robust 3D tracker utilizing intensity, 3D and SIM data is proposed. The system weights observations based on their qualities, estimated using TIM and the 3D observation coordinates.
• The proposed formulation is simulated via Monte Carlo experiments and its convergence characteristics are analyzed. Moreover, the effect of weighting observations is examined by simulations.
• The system is tested with real data and the associated performance is compared to the previously proposed algorithm.
The proposed algorithm enables objects selected by the user to be tracked in
3D space with high accuracy and reduced drift. The system relies on robust features associated using KLT tracking of intensity and SIM data. Furthermore, the proposed formulation enables weighting of observations based on their qualities.
The algorithm tracks the 3D pose using the model points selected in the very first frame. However, due to occlusion or sensor noise, these points are eliminated over time by the algorithm of Sub-section 2.iii. Hence, the number of features decreases as time goes on, which may result in a decrease in tracking quality:
(a) Initial frame (b) Final frame
Figure 22 Initial and final frames of the "Face" sequence.
Figure 23 Regular lattice for feature selection.
Therefore, a routine that updates the model selected at $t_0$ will be developed.
In the proposed system, the model points utilized for tracking are selected using a regular grid, as in Figure 5. As future work, these features will be selected using intensity and SIM data. In order to select points with a high cornerness measure while maximizing their spread, the regular lattice shown in Figure 23 will be utilized. Features having the maximum intensity and SIM cornerness measures will be selected within each patch and utilized for 3D tracking.
Furthermore, the performance of the proposed algorithm will be tested using sequences with available ground truth camera poses; for instance, a recent RGB-D benchmark [23] may be utilized for this purpose. Relying on both intensity and SIM data, the proposed algorithm can operate under varying lighting conditions. Hence, this feature will also be tested in the future.
References

[1] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse, "MonoSLAM: Real-Time Single Camera SLAM", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, June 2007.
[2] Kevin Nickels and Seth Hutchinson, "Model-Based Tracking of Complex Articulated Objects", IEEE Transactions on Robotics and Automation, Vol. 17, No. 1, February 2001.
[3] Kevin Nickels and Seth Hutchinson, "Weighting Observations: The Use of Kinematic Models in Object Tracking", IEEE International Conference on Robotics and Automation, 1998.
[4] G. Taylor, L. Kleeman, "Fusion of Multimodal Visual Cues for Model-based Object Tracking", Australasian Conference on Robotics and Automation (ACRA 2003), Brisbane, Australia, 2003.
[5] Michael Krainin, Peter Henry, Xiaofeng Ren, Dieter Fox, "Manipulator and Object Tracking for In-Hand 3D Object Modeling", The International Journal of Robotics Research, Vol. 30, No. 11, pp. 1311-1327, September 2011.
[6] Q. Gan, C. J. Harris, "Comparison of Two Measurement Fusion Methods for Kalman-Filter-Based Multisensor Data Fusion", IEEE Transactions on Aerospace and Electronic Systems, Vol. 37, No. 1, January 2001.
[7] Hauke Strasdat, J. M. M. Montiel and Andrew J. Davison, "Real-time Monocular SLAM: Why Filter?", IEEE International Conference on Robotics and Automation, 2010.
[8] Vincent Lepetit, Pascal Fua, "Monocular Model-Based 3D Tracking of Rigid Objects: A Survey", Foundations and Trends in Computer Graphics and Vision, Vol. 1, No. 1, pp. 1-89, 2005.
[9] Georg Klein, David Murray, "Parallel Tracking and Mapping for Small AR Workspaces", IEEE International Symposium on Mixed and Augmented Reality, 2007.
[10] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, Andrew Fitzgibbon, "KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera", ACM Symposium on User Interface Software and Technology, 2011.
[11] Frank Steinbrücker, Jürgen Sturm, Daniel Cremers, "Real-Time Visual Odometry from Dense RGB-D Images", IEEE International Conference on Computer Vision Workshops, 2011.
[12] V. Vezhnevets, V. Konouchine, "Grow-Cut - Interactive Multi-Label N-D Image Segmentation", Graphicon, 2005.
[13] Matlab Central: Grow-cut Image Segmentation by Shawn Lankton. Retrieved 01.06.2011 from http://www.mathworks.com/matlabcentral/fileexchange/19091growcut-image-segmentation.
[14] G. Mu, M. Liao, R. Yang, D. Ouyang, Z. Xu and X. Guo, "Complete 3D Model Reconstruction Using Two Types of Depth Sensors", ICIS 2010.
[15] J. J. Koenderink, Solid Shape, MIT Press, 1990.
[16] Jianbo Shi, Carlo Tomasi, "Good Features to Track", Computer Vision and Pattern Recognition, 1994.
[17] C. Harris, M. Stephens, "A Combined Corner and Edge Detector", Alvey Vision Conference, 1988.
[18] R. Liu, Stan Z. Li, X. Yuan, and R. He, "Online Determination of Track Loss Using Template Inverse Matching", VS 2008.
[19] J.-Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker", Intel Corporation, Microprocessor Research Labs, http://www.intel.com/research/mrl/research/opencv/.
[20] Shay Ohayon and Ehud Rivlin, "Robust 3D Head Tracking Using Camera Pose Estimation", International Conference on Pattern Recognition, 2006.
[21] W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Numerical Recipes in C - The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992.
[22] R. Jain, R. Kasturi, B. G. Schunck, Machine Vision, McGraw-Hill, 1995.
[23] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Burgard, D. Cremers, R. Siegwart, "Towards a Benchmark for RGB-D SLAM Evaluation", In Proc. of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf. (RSS), 2011.
[24] Neslihan Yalcin Bayramoglu, "Range Data Recognition: Segmentation, Matching, and Similarity Retrieval", PhD Dissertation, The Graduate School of Natural and Applied Sciences of Middle East Technical University, 2011.