MIDDLE EAST TECHNICAL UNIVERSITY

Department of Electrical and Electronics Engineering

PhD Thesis Progress Report - 3 by

Osman Serdar GEDİK

06.01.2012


Table of Contents

1. Introduction
   i. Overview of the Proposed System
   ii. Literature Update
   iii. Previous Thesis Progress Report Comments
   iv. Summary of the Efforts Since the Last Thesis Progress Report
2. Proposed 3D Tracking Algorithm
   i. Initial Semi-Automatic Segmentation
   ii. Transferring Measurements between Consecutive Frames
   iii. Reliability Check and Correction of Correspondences
3. Test Results
   i. Artificial Data Tests
   ii. Real Data Tests
4. Conclusions and Future Work
References


1. Introduction

i. Overview of the Proposed System

In this PhD thesis study, our aim is to develop an accurate 3D tracking system which utilizes both RGB vision sensor data and 3D depth sensor data. As shown in Figure 1, the system estimates the 6-DOF pose parameters between the sensors and the object of interest:

$$R_{do} = R(\rho_{do}, \theta_{do}, \varphi_{do}), \qquad t_{do} = [t_{x_{do}},\ t_{y_{do}},\ t_{z_{do}}]^T$$

Figure 1 Object and camera reference frames.

The main motivations to exploit both types of sensors are as follows:

• Commercial systems that can capture high-resolution (640×480), high-frame-rate (25 fps) color and depth data have emerged,
• Pure vision sensor based trackers require manual initialization or offline training, which may not be feasible for most robotic or augmented reality applications,
• Pure depth sensor based methods rely on 3D-3D registration, which may easily get trapped in local minima.

With the aforementioned motivations, the system summarized in Figure 2 is proposed. This system is very similar to the one detailed in the previous Thesis Progress Report; however, it is more robust and reliable.

[Figure 2 is a flow diagram: the color and 3D data at $t_0$ pass through the initial semi-automatic segmentation; measurements are transferred to $t_1$ by KLT-based model-data association; the correspondences undergo reliability check and correction (RCC); an EKF estimates $R_{do}$, $t_{do}$, $v_{do}$ at each instant; the loop continues over $t_2, t_3, \ldots, t_n$, with model-data re-association, pose update, and EKF reset whenever drift is detected.]

Figure 2 Overview of the proposed 3D tracking system.

The system is composed of the following building blocks:

Initial Semi-Automatic Segmentation

This step takes user input to specify rough foreground and background regions in order to segment the object to track. Since the two sensors are calibrated, after this step we have a colored point cloud of the object, which is tracked at the following instants.

Transferring Measurements between Consecutive Frames

The proposed method mainly relies on transferring 2D pixel and 3D XYZ measurements to consecutive video frames. Thus, the accuracy of the system is directly related to the accuracy of feature tracking. To this aim, a robust feature tracker, based on the Kanade-Lucas-Tomasi (KLT) tracker and utilizing both intensity and Shape Index Map (SIM) data, is proposed.

Reliability Check and Correction of Correspondences

Errors in feature tracking are automatically detected and corrected using the Template Inverse Matching (TIM) algorithm, which operates on the intensity and SIM trackers.

State Estimation Using Extended Kalman Filter (EKF)

The 2D pixel and 3D XYZ measurements, obtained for the current time instant using the proposed KLT tracker, are fed to an Extended Kalman Filter (EKF). The filter uses a constant velocity motion model to estimate the 6-DOF object motion. A novel weighting scheme weights the 2D and 3D measurements based on their qualities, which increases 3D tracker performance significantly.

Automatic Realignment to Prevent Drift (Optional)

Since cumulative systems are prone to drift due to error accumulation, in the proposed system drift is detected and corrected automatically.

The system building blocks are explained in detail in Section 2.

ii. Literature Update

Within the last six-month period, the 3D tracking literature was reviewed once more in order to "justify" the proposed algorithm and to keep up to date with the new trends in tracking.


Constant Velocity Motion Model

In the 3D tracking literature, the constant velocity motion model is used in many algorithms, such as [1]-[5]. In the well-known MonoSLAM approach [1], the 3D pose of a freely moving camera is estimated with respect to the scene, while the 3D scene is sparsely reconstructed. The motion model is similar to the formulation used here. The state vector is composed of 7-dimensional pose parameters (4 parameters representing rotation in angle-axis form and 3 representing translation) and the associated velocities. Accelerations are modeled as noise updates.

In [2] and [3], using the previous measurements, a state estimate for the current instant is obtained. This estimate is used to render a synthetic view of the object. For each feature, a sum of squared differences (SSD) surface is generated using patches from the synthetic view and the current frame; the peak of the surface gives the corresponding feature location in the current frame. Moreover, a 2D Gaussian is fitted to the SSD surface, so a variance estimate for each feature is obtained and passed to the EKF, which utilizes a constant velocity motion model. [4] also proposes a similar approach utilizing an Iterated EKF (IEKF).

A recent model-based 3D tracking approach, which utilizes vision and depth sensors, is proposed in [5]. The initial state estimate is used to transform the object initially. Then, the proposed articulated iterative closest point (ICP) algorithm is used to match features of the transformed colored object point cloud and the colored point cloud at the current instant. Finally, a measurement update is performed to correct the pose estimate of the articulated ICP. Although the method is quite similar to our approach, the EKF formulation is different.

Combination of Two Sets of Measurements

In the proposed system, the 2D pixel and 3D XYZ measurements of each feature are simply concatenated to obtain a 5N×1 measurement vector for N features. At this point, one may question whether these two sets of measurements could be combined by some method other than concatenation. However, the authors of [6] prove that concatenation of measurements is more flexible and efficient than a weighted combination if the measurement matrices are not identical. Although they prove the theorem for the linear Kalman case, the 5N×1 measurement vector still appears sound.

Error Accumulation

In the experiments with the proposed system, it is observed that the error increases gradually at each frame, i.e., there is error accumulation. Thus, an automatic realignment module is added. In the previous thesis progress report, there were some doubts related to such a drift. The authors of [7] state that inaccuracy can accumulate in filters due to limitations in the representation of probability distributions, such as the linearization assumptions of Gaussian-based filters like the EKF. Moreover, relying on optical flow for 3D tracking also results in error accumulation [8]; 2D tracking errors may severely affect 3D tracking. However, error accumulation can be handled using loop closures [1].

Non-sequential Methods

In the 3D tracking literature, there are recent algorithms which propose batch methods instead of sequential state updates, i.e., filters. For instance, utilizing the power of multi-core processors, the authors of [9] propose a parallel tracking and mapping algorithm, which also performs bundle adjustment to refine camera poses and the 3D coordinates of map points. Moreover, in [7] it is stated that if the overall processing budget is small, filtering should be applied; otherwise, performing bundle adjustment on a small set of key-frames results in higher performance. Furthermore, utilizing parallel processing on GPUs, an ICP-based 3D tracking algorithm is proposed in [10]; that system tracks all pixels of a 640×480 image, and its performance is quite satisfactory due to dense tracking. Similarly, the authors of [11] maximize photo-consistency by linearizing the cost function to register all pixels of consecutive frames.

iii. Previous Thesis Progress Report Comments

In the previous thesis progress report, the following comments were made:

• Monte Carlo simulations should be performed in order to analyze the system convergence characteristics thoroughly.
• Real data tests with object/camera motion constrained to one direction should be performed in order to see whether the algorithm makes logical pose estimates or not.

As summarized in the following subsection, plenty of real and artificial tests were performed in order to address the above comments.

iv. Summary of the Efforts Since the Last Thesis Progress Report

Efforts within the last six-month period can be summarized as follows:

• Monte Carlo simulations are performed in order to analyze the system convergence characteristics.
• Monte Carlo simulations are performed in order to observe the strength of utilizing an EKF compared to an instantaneous state estimation approach.
• Monte Carlo simulations are performed in order to observe the effect of weighting 2D measurements on 3D tracking quality when the generation and prediction noise parameters are different.
• Monte Carlo simulations are performed in order to observe the relation between the variance of the tracked 3D points and 3D tracking quality.
• Monte Carlo simulations are performed in order to observe the effect of emphasizing the 3D measurements that are far from the center of mass of the 3D observations.
• Using real data, several methods are compared in terms of their effectiveness in detecting the quality of 2D KLT tracking.
• A robust KLT tracker, which utilizes intensity and SIM data, is proposed.
• The effect of weighting 2D measurements, based on TIM errors, on 3D tracking quality is analyzed using real data.
• The effect of weighting 3D measurements, based on TIM errors and distances from the center of mass of the 3D observations, on 3D tracking quality is analyzed using real data.

2. Proposed 3D Tracking Algorithm

As already mentioned, in the proposed system the object selected by the user is tracked using data provided by the vision and depth sensors. Considering Figure 1, the following variables define the overall system:

$[X_o^i\ Y_o^i\ Z_o^i]^T$ : 3D coordinates of the i-th object point with respect to the object reference frame (for instance, obtained by user segmentation at the very first frame, or from a CAD model)

$[X_o^{di}\ Y_o^{di}\ Z_o^{di}]_t^T$ : 3D coordinates of the i-th object point at time instant t, measured by the depth sensor with respect to the depth camera reference frame

$[x_o^i\ y_o^i]_t^T$ : 2D pixel coordinate measurement of the i-th object point at time instant t

$R_{do} = R(\rho_{do}, \theta_{do}, \varphi_{do})$ : rotation parameters between the object and the depth camera about the x, y and z directions, respectively

$t_{do} = [t_{x_{do}},\ t_{y_{do}},\ t_{z_{do}}]^T$ : translation parameters between the object and the depth camera in the x, y and z directions, respectively

$v_{do} = [\dot{\rho}_{do},\ \dot{\theta}_{do},\ \dot{\varphi}_{do},\ \dot{t}_{x_{do}},\ \dot{t}_{y_{do}},\ \dot{t}_{z_{do}}]^T$ : associated velocity parameters between the object and the depth camera

$R_{dv} = R(\rho_{dv}, \theta_{dv}, \varphi_{dv})$ : rotation parameters between the RGB and depth cameras about the x, y and z directions, respectively

$t_{dv} = [t_{x_{dv}},\ t_{y_{dv}},\ t_{z_{dv}}]^T$ : translation parameters between the RGB and depth cameras in the x, y and z directions, respectively

$K_v = [f_x,\ f_y,\ p_x,\ p_y]^T$ : internal calibration vector of the RGB camera, composed of the focal lengths and principal point offsets

With the above parameters defined, the overall system is explained in detail in the following sub-sections.

i. Initial Semi-Automatic Segmentation

Since the utilized Kinect sensor provides RGB and 3D data, it is possible to extract 3D object models in the form of colored point clouds. For this purpose, a semi-automatic object model extraction algorithm, based on grow-cut segmentation [12], is utilized.

First of all, the user is asked to select rough foreground and background regions on the first RGB image of the sequence. Then, the grow-cut implementation of [13] extracts the foreground region belonging to the object, as shown in Figure 3. As the RGB and depth cameras are calibrated, the colored 3D point cloud model of the object is obtained. For complex objects, 3D object models can be generated using algorithms such as [14]. Figure 4 shows snapshots of the "Book" and "Face" 3D models.
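Since the report gives no implementation, the following minimal sketch illustrates the back-projection step, assuming a pinhole depth camera with intrinsics (f_x, f_y, p_x, p_y) and a color image registered to the depth image; all names are illustrative.

```python
import numpy as np

def backproject_object(depth, color, mask, fx, fy, px, py):
    """Back-project masked depth pixels into a colored 3D point cloud.

    depth: HxW array of depth values (e.g., in mm), 0 where invalid
    color: HxWx3 RGB image registered to the depth image
    mask:  HxW boolean object mask from the grow-cut segmentation
    """
    v, u = np.nonzero(mask & (depth > 0))   # pixel rows/cols on the object
    Z = depth[v, u].astype(np.float64)
    X = (u - px) * Z / fx                   # pinhole model: X = (u - p_x) Z / f_x
    Y = (v - py) * Z / fy
    points = np.stack([X, Y, Z], axis=1)    # Nx3 object points (depth frame)
    colors = color[v, u]                    # Nx3 associated RGB values
    return points, colors
```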

Once the object is segmented and the associated point cloud model is obtained, the next step is determining which features to utilize for 3D tracking, since, due to computational requirements, all object points cannot be tracked. Intuitively, there should be a relation between the variance of the selected 3D points and the pose estimation accuracy. Simulations verify this intuition, and the results are summarized in Table 1: as the spread of the points in 3D increases, the pose estimation errors decrease. Consequently, the features to track are selected using a regular sampling grid, as shown in Figure 5; a sketch of this sampling is given below.
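The sketch keeps at most one masked pixel per grid cell; the cell size and the choice of the pixel closest to the cell center are assumed parameters, not values from the report.

```python
import numpy as np

def sample_features_on_grid(mask, cell=20):
    """Select at most one feature per grid cell inside the object mask,
    spreading the tracked points over the whole object."""
    h, w = mask.shape
    features = []
    for top in range(0, h, cell):
        for left in range(0, w, cell):
            ys, xs = np.nonzero(mask[top:top + cell, left:left + cell])
            if len(ys):                      # pick the masked pixel closest
                cy = cx = cell / 2           # to the cell center
                k = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
                features.append((left + xs[k], top + ys[k]))
    return np.array(features, dtype=np.float32)
```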


(a) Input image (b) Rough segmentation
(c) Object boundaries (d) Object mask

Figure 3 Segmentation results for the "Book" model.

(a) "Book" model (b) "Face" model

Figure 4 3D object models in the form of colored point clouds.

Table 1 Relation between the norm of the std vector of the tracked points and 3D tracking errors.

Norm    Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
44.92   3.8           3.6           4.1           2.1        3.0        1.9
59.83   3.6           2.4           3.5           1.7        3.1        1.2
68.76   3.1           2.3           3.4           1.6        2.5        1.2


Figure 5 Regular sampling of tracked points.

ii. Transferring Measurements between Consecutive Frames

In the proposed system, the 3D object pose is determined using an EKF. However, this requires the 2D and 3D measurements associated with each object point i ($[x_o^i\ y_o^i]^T$ and $[X_o^{di}\ Y_o^{di}\ Z_o^{di}]^T$, respectively) to be transferred between consecutive time instants. In the algorithm proposed in the previous thesis progress report, 2D measurements are matched between consecutive frames using KLT tracking on intensity data. Since the vision and depth sensors are externally calibrated, the 3D measurement of object point i is thereby also obtained for the next time instant. Figure 6-a illustrates typical KLT-tracked features using intensity data.

Since relying on optical flow for 3D tracking results in error accumulation [8], on our way to a highly accurate 3D tracker, the association of measurements between consecutive frames should be handled with special care.

Figure 7 illustrates the $1/SSD$ surfaces of typical features matched via KLT. A patch of dimensions 10×10 around a 2D measurement at time instant t is moved around the KLT-matched 2D location at time instant t+1 in order to calculate the SSD:

$$SSD_i(x, y) = \sum_{m=-5}^{5} \sum_{n=-5}^{5} \Big( I_{t+1}\big((x_o^i)_{t+1} + x + m,\ (y_o^i)_{t+1} + y + n\big) - I_t\big((x_o^i)_t + m,\ (y_o^i)_t + n\big) \Big)^2$$

where I stands for intensity and (x, y) represents the offset from the KLT-matched position at t+1. A sketch of this computation is given below.
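The sketch below is a direct transcription of the formula; the 10×10 patch follows the text, while the ±10-pixel search range is an assumption.

```python
import numpy as np

def ssd_surface(I_t, I_t1, pt_t, pt_t1, half=5, search=10):
    """SSD between the patch around the feature at time t and patches at
    offsets (x, y) from its KLT match at time t+1 (points assumed in-bounds)."""
    xt, yt = map(int, pt_t)
    x1, y1 = map(int, pt_t1)
    ref = I_t[yt - half:yt + half, xt - half:xt + half].astype(np.float64)
    ssd = np.empty((2 * search + 1, 2 * search + 1))
    for y in range(-search, search + 1):
        for x in range(-search, search + 1):
            cand = I_t1[y1 + y - half:y1 + y + half,
                        x1 + x - half:x1 + x + half].astype(np.float64)
            ssd[y + search, x + search] = np.sum((cand - ref) ** 2)
    return ssd  # a peak of 1/SSD at the center indicates a good KLT match
```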


(a) Intensity KLT tracking.

(b) Depth KLT tracking.

(c) SIM KLT tracking.

Figure 6 KLT tracking using different data.

In Figure 7, the origin represents the feature location matched by KLT at time instant t+1. In Figure 7-a, the peak of the 1/SSD graph of the green feature of Figure 5 is at the origin; hence, we conclude that the KLT tracking is successful. However, in Figure 7-b, the KLT tracker failed to locate the exact match of the blue feature of Figure 5, since the peak is at $(3, -10)^T$.

In order to increase the accuracy of the 3D tracker, the errors of the KLT tracker should definitely be corrected. At this point, the following question arises: "Can one increase the quality of the KLT tracker by utilizing the available 3D information?" An immediate answer may be the exploitation of depth maps, as in Figure 6. For each feature, two parallel KLT trackers are run on intensity and depth data, assuming the pixel coordinates of the matches should ideally be the same. However, since details are lost in depth data, the depth KLT tracker fails, as in Figure 6-b.

(a) Successfully tracked feature using intensity data.
(b) Unsuccessfully tracked feature using intensity data.
(c) Tracking corrected using SIM data.

Figure 7 1/SSD plots of different features.

On the other hand, the Shape Index Map (SIM) proposed in [15] is well suited to our approach, since it exaggerates details by using the principal curvatures $\kappa_1$ and $\kappa_2$:

$$SI = \frac{1}{2} - \frac{1}{\pi} \tan^{-1}\left(\frac{\kappa_1 + \kappa_2}{\kappa_1 - \kappa_2}\right)$$

The principal curvatures $\kappa_1$ and $\kappa_2$ stand for the minimum and maximum curvatures at the point of interest and are estimated as in [24].

Figure 8 shows a typical SIM. Since the formation of the SIM is computationally involved, it is computed over a dynamically selected bounding box, as in Figure 6-c.

Figure 8 A typical shape index map.
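Given per-pixel principal curvature maps κ1 and κ2 (estimated, e.g., as in [24]), the shape index map follows directly from the formula above; the sketch below is a minimal version, in which the guard against umbilic points (κ1 = κ2) is an added assumption.

```python
import numpy as np

def shape_index_map(k1, k2, eps=1e-9):
    """Shape Index Map SI = 1/2 - (1/pi) * arctan((k1 + k2) / (k1 - k2))
    from per-pixel principal curvatures k1 (minimum) and k2 (maximum)."""
    denom = np.where(np.abs(k1 - k2) < eps, eps, k1 - k2)  # guard umbilic points
    return 0.5 - np.arctan((k1 + k2) / denom) / np.pi
```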

Although the feature of Figure 7-b is erroneously tracked using intensity data, the SIM KLT tracking is correct, as shown in Figure 7-c. Thus, there is a strong indication that KLT tracking on intensity data can be corrected using KLT tracking on SIM data. However, how can we decide which tracking is more accurate without explicit calculation of the SSD values? The next sub-section answers this question.


Figure 9 KLT accuracies of features.

iii. Reliability Check and Correction of Correspondences

Figure 9 shows the KLT accuracies of typically tracked features using intensity data. To obtain Figure 9, SSD plots, as described in Sub-section 2-ii, are generated, and the inverses of the distances of the SSD peaks from the origin are plotted (a small ε is added to avoid division by zero). Although SSD plots reveal the accuracy of KLT tracking, due to the computational requirements a method should be developed to detect the quality of KLT tracking without explicit calculation of the SSD values. To this aim, the following methods are tried:

• Shi-Tomasi Cornerness Measure
• Harris Cornerness Measure
• KLT Error
• Template Inverse Matching Error

Once an effective way to detect the reliability of a KLT-tracked feature is developed, it can be utilized to correct the intensity KLT tracking using the SIM KLT tracking, and vice versa. In the following paragraphs, each method is briefly explained and the associated performances are given.

Shi-Tomasi Cornerness Measure

If a feature has a high "cornerness" measure, with high spatial derivatives, it will probably be tracked with high accuracy. Hence, in [16], Shi and Tomasi propose a method to locate features with a high cornerness measure. First, using a patch of size w × h around the interest point i, the structure tensor is calculated:

$$A = \sum_{u=-h/2}^{h/2} \sum_{v=-w/2}^{w/2} w(u, v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

where $w(u, v)$ represents the weight of pixel $(u, v)$, generally obtained by sampling a 2D Gaussian, and $I_x$ and $I_y$ stand for the spatial derivatives in the x and y directions, respectively. If both eigenvalues of the structure tensor are large, then point i has a high cornerness measure. Thus, in [16], the minimum of the eigenvalues of A is utilized as the cornerness measure. Figure 10 shows the Shi-Tomasi cornerness measures of the features whose reliabilities are given in Figure 9.

Figure 10 Shi-Tomasi cornerness measure.

Harris Cornerness Measure

The Harris corner detector [17] works similarly to the Shi-Tomasi approach; however, it does not require explicit calculation of the eigenvalues of the structure tensor. Instead, it exploits the fact that if both eigenvalues are large, their product deviates strongly from their sum. Thus, the following cornerness measure is developed:

$$C_i = \lambda_1 \lambda_2 - \kappa (\lambda_1 + \lambda_2)^2 = \det(A) - \kappa\, \mathrm{trace}^2(A)$$

where κ is a constant, generally selected as 0.04. Figure 11 shows the Harris cornerness measures of the features whose reliabilities are given in Figure 9; a sketch of both cornerness measures follows.

Figure 11 Harris cornerness measure.
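Both measures derive from the same structure tensor, so they can be computed together; the sketch below uses uniform patch weights instead of a 2D Gaussian w(u, v), which is a simplifying assumption.

```python
import numpy as np

def cornerness(I, x, y, half=7, kappa=0.04):
    """Shi-Tomasi (min eigenvalue) and Harris (det - kappa*trace^2)
    cornerness measures of the structure tensor around pixel (x, y)."""
    Iy, Ix = np.gradient(I.astype(np.float64))       # spatial derivatives
    sl = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    ixx = (Ix[sl] ** 2).sum()
    iyy = (Iy[sl] ** 2).sum()
    ixy = (Ix[sl] * Iy[sl]).sum()
    A = np.array([[ixx, ixy], [ixy, iyy]])           # structure tensor
    shi_tomasi = np.linalg.eigvalsh(A).min()
    harris = np.linalg.det(A) - kappa * np.trace(A) ** 2
    return shi_tomasi, harris
```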

KLT Error

The KLT error simply calculates the SSD between the patch around the feature at time instant t and the patch around its match at t+1:

$$KLT\ Error_i = \sum_{m=-h/2}^{h/2} \sum_{n=-w/2}^{w/2} \Big( I_{t+1}\big((x_o^i)_{t+1} + m,\ (y_o^i)_{t+1} + n\big) - I_t\big((x_o^i)_t + m,\ (y_o^i)_t + n\big) \Big)^2$$

The KLT error is expected to increase if a feature is tracked erroneously. Figure 12 illustrates 1/KLT Error for the features whose reliabilities are given in Figure 9.

Figure 12 1/KLT Errors.

Template Inverse Matching

Proposed by the authors of [18], Template Inverse Matching (TIM) can be utilized to detect the quality of 2D measurements matched across consecutive frames. TIM simply calculates the Euclidean distance between the 2D measurement $[x_o^i\ y_o^i]_t^T$ associated with feature i at time t and the 2D measurement $[x_o^i\ y_o^i]_t'^T$ obtained by tracking i's correspondence at time t+1 backward:

$$d_{TIM}^i = \left\| [x_o^i\ y_o^i]_t^T - [x_o^i\ y_o^i]_t'^T \right\|$$

Ideally zero, $d_{TIM}^i$ increases as the quality of feature i decreases. Instead of utilizing $d_{TIM}$ directly, the real-data tests detailed in Section 3 reveal that utilizing $(d_{TIM})^{0.5}$ increases tracking accuracy. Figure 13 shows $1/(d_{TIM})^{0.5}$ values for the features of Figure 9.

Figure 13 $1/(d_{TIM})^{0.5}$ values.
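As an illustration, the forward-backward check maps directly onto OpenCV's pyramidal KLT tracker; the window size and pyramid depth in the sketch below are assumed values, not parameters reported in the text.

```python
import cv2
import numpy as np

def tim_distance(img_t, img_t1, pts_t, lk=dict(winSize=(21, 21), maxLevel=3)):
    """Template Inverse Matching: track t -> t+1, track the matches back
    t+1 -> t, and return the matches plus the Euclidean drift d_TIM."""
    pts_t = pts_t.reshape(-1, 1, 2).astype(np.float32)
    pts_t1, _, _ = cv2.calcOpticalFlowPyrLK(img_t, img_t1, pts_t, None, **lk)
    pts_back, _, _ = cv2.calcOpticalFlowPyrLK(img_t1, img_t, pts_t1, None, **lk)
    d_tim = np.linalg.norm((pts_t - pts_back).reshape(-1, 2), axis=1)
    return pts_t1.reshape(-1, 2), d_tim
```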

A subjective comparison of the aforementioned methods can be made by observing the correlation between Figure 9 and Figures 10-13. However, a more reliable way is to calculate the sample correlation coefficient between the reliabilities and each tested metric:

$$\rho_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ represent the means of the vectors x and y.

Table 2 shows the sample correlation coefficients between reliabilities and

the tested metrics:


Table 2 Sample correlation coefficients.

Method                          Sample Correlation Coefficient
d_TIM^0.5                       0.4983
KLT Error                       0.3161
d_TIM                           0.3146
Harris Cornerness Measure       0.1056
Shi-Tomasi Cornerness Measure   0.0606

Examining Table 2, we can conclude that, among the tested algorithms, the TIM method is the most accurate in terms of detecting the accuracy of KLT tracking. Consequently, the following feature tracker is proposed (a sketch of steps 4-5 is given after the list):

1. Track features via the pyramidal KLT algorithm [19] using intensity data,
2. Track features via the pyramidal KLT algorithm [19] using SIM data,
3. Calculate the TIM errors of the intensity and SIM trackers,
4. For each feature, compare the TIM errors of the intensity and SIM trackers and assign the final correspondence at t+1 from the tracker with the minimum error,
5. Discard features with a TIM error larger than a predefined threshold.
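A minimal sketch of steps 4-5, reusing tim_distance from the previous sketch on the intensity and SIM images separately; the threshold value is an assumption, as the report does not state it.

```python
import numpy as np

def fuse_trackers(pts_int, d_int, pts_sim, d_sim, thresh=2.0):
    """For each feature keep the intensity or SIM correspondence with the
    smaller TIM error; discard features whose best error exceeds thresh."""
    use_sim = d_sim < d_int
    pts = np.where(use_sim[:, None], pts_sim, pts_int)
    best = np.minimum(d_int, d_sim)
    keep = best < thresh                     # threshold is an assumed value
    return pts[keep], keep
```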

Figure 14 shows the final feature correspondences on intensity and SIM data.

(a) Correspondences on intensity data
(b) Correspondences on SIM data

Figure 14 Final feature correspondences.

Relying on both intensity and SIM data, the proposed tracker has the following properties:

• It can operate under varying lighting conditions,
• It can track relatively smooth objects,
• It can track partially specular objects.

iv. State Estimation Using Extended Kalman Filter (EKF)

The robust 2D and 3D measurements obtained using the proposed feature tracker are fed to an EKF, which estimates the 6-DOF 3D motion between the camera and the object of interest. Furthermore, a novel measurement-weighting scheme favors 'good' measurements and provides highly accurate tracking. In the following paragraphs, the state update and measurement equations, as well as the proposed weighting scheme, are highlighted.

State Update Equations

The state update equations define the transition from the previous state $x_{t-1}$ to the current state $x_t$ when the input $u_t$ is applied to the system:

$$x_t = g(x_{t-1}, u_t) + \epsilon_t$$

In the proposed system, a constant velocity motion model is applied:

$$[\rho_{do}\ \theta_{do}\ \varphi_{do}\ t_{x_{do}}\ t_{y_{do}}\ t_{z_{do}}]_t^T = [\rho_{do}\ \theta_{do}\ \varphi_{do}\ t_{x_{do}}\ t_{y_{do}}\ t_{z_{do}}]_{t-1}^T + [\dot{\rho}_{do}\ \dot{\theta}_{do}\ \dot{\varphi}_{do}\ \dot{t}_{x_{do}}\ \dot{t}_{y_{do}}\ \dot{t}_{z_{do}}]_{t-1}^T + \epsilon_t^{i}$$

where $\epsilon_t^{i}$ is the state update noise for the motion parameters between the depth camera and object reference frames. On the other hand, the velocity parameters only have noise updates:

$$v_{do,t} = v_{do,t-1} + \epsilon_t^{ii}$$

Measurement Equations

The measurement equations relate the current states to the current measurements:

$$z_t = h(x_t) + \varepsilon_t$$

In the proposed model, the measurements are the 3D coordinate measurements from the depth sensor and the 2D pixel coordinates from the RGB camera. The 3D object coordinates with respect to the depth camera reference frame are related to the states as follows:

$$[X_o^{di}\ Y_o^{di}\ Z_o^{di}]_t^T = R_{do} [X_o^i\ Y_o^i\ Z_o^i]^T + t_{do} + \varepsilon_t^{i}$$

Finally, the 2D pixel coordinates of the object and scene points are related to the states as follows:

$$\alpha_i [x_o^i\ y_o^i\ 1]_t^T = K \Big[ R_{dv} \big[ R_{do} [X_o^i\ Y_o^i\ Z_o^i]^T + t_{do} \big] + t_{dv} \Big] + \varepsilon_t^{ii}$$

where the inner bracket gives the 3D object coordinates with respect to the depth camera and the outer bracket those with respect to the RGB camera, $\alpha_i$ represents the scale factor, and $\varepsilon_t^{i}$ and $\varepsilon_t^{ii}$ are observation noises with covariance matrices $R^{i}$ and $R^{ii}$.


Weighting Observations

Generally, the observation noise covariance matrices are assumed constant and are specified by the sensor characteristics. Assuming independent measurements, the covariance matrices $R^{i}$ and $R^{ii}$ for the 2D and 3D measurement noises have the following form:

$$R^{i} = \mathrm{diag}(\sigma_{pix}^2), \qquad R^{ii} = \mathrm{diag}(\sigma_{XYZ}^2)$$

$\sigma_{pix}^2$ represents the variance of the noise on the 2D measurements, possibly caused by finite image resolution, quantization, motion blur, errors in KLT tracking, etc. On the other hand, finite resolution, multiple reflections, quantization, etc. are possible sources of noise on the 3D measurements, represented by $\sigma_{XYZ}^2$. In the previous thesis progress report, $R^{i}$ and $R^{ii}$ matrices of the above form were utilized.

The thorough analysis in Sub-section 2-iii reveals that the 2D tracking reliabilities of features may vary significantly; hence, utilization of a constant $R^{i}$ matrix may decrease system performance. Based on the feature reliabilities determined by TIM, the following weighting scheme is proposed for the 2D measurements:

$$\sigma_{pix,i}^2 = n\, \frac{(d_{TIM}^i)^{0.5}}{\sum_{i=1}^{n} (d_{TIM}^i)^{0.5}}\, \sigma_{pix}^2$$

Consequently, if a feature has a large TIM error, it has probably been tracked erroneously; it is assigned a higher measurement noise and contributes less to the state updates.

Moreover, as explained in Sub-section 2-ii, the 3D measurement of feature i is obtained from the corresponding 2D measurement (using the external calibration of the two sensors), so errors in KLT tracking also decrease the quality of the 3D measurements. Hence, using the perspective camera model, the 3D error corresponding to the TIM error can be found as:

$$w_1^i = \frac{Z_o^{di}}{f}\, (d_{TIM}^i)^{0.5}, \qquad w_1^{i,n} = \frac{w_1^i}{\sum_{i=1}^{n} w_1^i}$$

where $Z_o^{di}$ is the depth measurement of feature i and f is the depth camera focal length.

In Sub-section 2-i it is shown that there is a strong relation between the 3D variance of the selected object points and the pose estimation accuracy; thus, regular sampling is utilized to select the points to track. Hence, one may expect that points far away from the 3D center of mass should be favored during pose estimation. In order to validate this intuition, the object points shown in Figure 15 are first selected for pose estimation.

Figure 15 Selected object points.

Then, pose estimation is performed using a 100-frame-long artificial sequence with the following properties:

• For the first case, the 3D measurements have generation and prediction noises of variance $\sigma_{XYZ}^2 = 10$,
• For the second case, the prediction noises are weighted based on the distances of the 3D measurements from the 3D center of mass of the observations, $[C_X\ C_Y\ C_Z]^T$:

$$\frac{1}{w_2^i} = \left\| [X_o^{di}\ Y_o^{di}\ Z_o^{di}]^T - [C_X\ C_Y\ C_Z]^T \right\|, \qquad w_2^{i,n} = \frac{w_2^i}{\sum_{i=1}^{n} w_2^i}, \qquad \sigma_{XYZ,i}^2 = n\, w_2^{i,n}\, \sigma_{XYZ}^2$$

Pose estimation accuracies for the two cases are provided in Table 3.

Table 3 3D tracking accuracies.

Case     Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
Case I   6.3           4.0           5.5           2.69       4.67       2.20
Case II  5.9           3.7           5.9           2.5        4.47       2.00

It is clear that, for such an exaggerated point selection as in Figure 15, weighting the 3D measurements based on their distances from the 3D center of mass increases tracking performance. Consequently, the following weighting scheme is proposed for the 3D measurements:

$$\sigma_{XYZ,i}^2 = n\, \frac{w_1^{i,n} + w_2^{i,n}}{\sum_{i=1}^{n} \left(w_1^{i,n} + w_2^{i,n}\right)}\, \sigma_{XYZ}^2$$


v. Automatic Realignment to Prevent Drift (Optional)

Since cumulative systems are prone to drift due to error accumulation, in the proposed system drift is detected and corrected automatically. In the previous thesis progress report, this step was accomplished by automatic segmentation of the data and registration of the model and the segmented data. However, this step should also be modified, probably by utilizing wide-baseline feature matching via the SIM and intensity trackers.

3. Test Results

The proposed 3D tracking algorithm is thoroughly analyzed through experiments with real and artificial data. Using artificial data:

• the system convergence characteristics with different initial conditions are examined,
• the effect of weighting observations on 3D tracking quality is studied,
• the strength of utilizing an EKF compared to an instantaneous state estimation approach using quaternions is examined.

Using real data:

• the effect of weighting observations on 3D tracking quality is examined,
• the effects of regular sampling and of assisting KLT tracking with SIM data on 3D tracking quality are studied.

i. Artificial Data Tests

As shown in Figure 2, the initial state is estimated by tracking features and registering the matched features using a PnP- and LM-based approach [20]-[21]. First, the system performance with a known initial state is shown in Figure 16. It is assumed that the generation and prediction noise variances are the same for all 2D and all 3D measurements: 3 pixels and 10 mm, respectively. Moreover, the object makes a dominant rotational motion in the y direction, starting from the initial pose

$$[\rho\ \theta\ \varphi\ t_x\ t_y\ t_z]_{do}^T = [10^{-4},\ -1.4,\ 10^{-4},\ 50,\ 50,\ 100]^T$$

with minor velocities in the other directions:

$$v_{do} = [10^{-4},\ 0.025,\ 10^{-4},\ 10^{-4},\ 10^{-4},\ 10^{-4}]^T$$

For the artificial data experiments, 20 object points are selected and tracked. Furthermore, the performance of the proposed formulation is compared to that of a quaternion-based approach, which finds instantaneous state estimates using 3D-3D correspondences [22]. It is clear from Figure 16 that utilizing a filtering-based approach increases pose estimation accuracy while reducing jitter.

Figure 16 Artificial results with known initial estimate: (a)-(f) rotation and translation estimates, (g)-(l) rotation and translation errors, and (m)-(r) rotation and translation variances, about/along the x, y and z axes.

Figure 17 Artificial results with a good initial estimate: (a)-(f) rotation and translation errors about/along the x, y and z axes.

Since the initial state is known, the associated errors first increase slightly from 0 and then decrease as the filter converges. Note that the errors will probably never reach 0, due to the observation and state update noises. Moreover, the variances of the estimates also converge towards 0 as the filter is updated, and there is a strong correlation between the state errors and the variances.

Figure 17 illustrates the case where the initial state is not known and the filter is initiated with a good initial estimate:

$$[\rho\ \theta\ \varphi\ t_x\ t_y\ t_z]_{do}^T = [10^{-4},\ -1.41,\ 10^{-4},\ 75,\ 75,\ 150]^T$$

As expected, the state errors decrease as the filter converges.

A final test is devoted to the case in which the filter diverges. When the initial state is

$$[\rho\ \theta\ \varphi\ t_x\ t_y\ t_z]_{do}^T = [10^{-4},\ -1.4,\ 10^{-4},\ 150,\ 150,\ 300]^T$$

the filter diverges, as shown in Figure 18.

Figure 18 Artificial results with a bad initial estimate: (a)-(f) rotation and translation errors about/along the x, y and z axes.

So far, the generation and prediction noise parameters for the EKF have been assumed equal (3 pixels and 10 mm for the 2D and 3D observations, respectively). However, as shown in Sub-section 2-ii, 2D measurements tend to have different qualities and, hence, different (generation) noise parameters. In order to observe the system accuracy under 2D observations with different noise characteristics, a set of artificial tests is also performed. The 2D observations are divided into 5 subsets, and each subset is assigned a specific generation noise variance, as shown in Table 4. Then, the following cases are compared:

• the prediction noise parameters are the same as the generation noise parameters,
• the prediction noise parameters are equal for all 2D observations.

Table 4 shows the 3D tracking accuracies for these cases. It should be noted that the system performs better when the observations are treated unequally, according to their generation noise parameters. However, in a practical scenario it is not possible to detect the generation noise parameters of the observations without, for instance, utilizing SSD plots. Instead, TIM is well suited to the problem of determining observation quality, as detailed in Sub-section 2-iii.

Table 4 Effect of the difference between generation and prediction noise parameters.

Generation Noise Variances  Prediction Noise Variances  Rot-x (mrad)  Rot-y (mrad)  Rot-z (mrad)  Tr-x (mm)  Tr-y (mm)  Tr-z (mm)
0.5-1-1.5-2-2.5             0.5-1-1.5-2-2.5             3.7           2.2           3.3           1.62       3.13       1.13
0.5-1-1.5-2-2.5             All 1.5                     3.9           2.2           3.6           1.66       3.27       1.16
1-2-3-4-5                   1-2-3-4-5                   4.3           2.5           4.0           1.80       3.50       1.32
1-2-3-4-5                   All 3                       4.5           2.5           4.1           1.77       3.74       1.33
2-4-6-8-10                  2-4-6-8-10                  4.7           2.7           4.7           1.92       3.73       1.49
2-4-6-8-10                  All 6                       5.0           2.8           4.9           1.91       3.90       1.51

ii. Real Data Tests

In order to examine the performance of the proposed 3D tracker on real data, the 100-frame-long "Face" and "Book" sequences are utilized. Figure 19 shows typical frames from the "Face" and "Book" sequences.

(a) Color frame (b) Depth frame (c) SIM frame
(d) Color frame (e) Depth frame (f) SIM frame

Figure 19 Selected color, depth and SIM frames for the "Face" and "Book" sequences.

Since ground-truth camera poses are not available, reprojection and 3D error metrics are utilized:

$$Reprojection\ Error = \left\| [x_o^i\ y_o^i]_t - [x_o^i\ y_o^i]_{P,t} \right\|$$

where $[x_o^i\ y_o^i]_t$ is the 2D pixel coordinate observation and $[x_o^i\ y_o^i]_{P,t}$ is the 2D pixel coordinate obtained using the associated state estimates for the i-th object point at time instant t:

$$\alpha_i [x_o^i\ y_o^i\ 1]_{P,t}^T = \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix} \Big[ R_{dv} \big[ R_{do} [X_o^i\ Y_o^i\ Z_o^i]^T + t_{do} \big] + t_{dv} \Big]$$

$$3D\ Error = \left\| [X_o^{di}\ Y_o^{di}\ Z_o^{di}]_t^T - \big[ R_{do} [X_o^i\ Y_o^i\ Z_o^i]^T + t_{do} \big] \right\|$$
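These two metrics translate directly into code; the sketch below is an illustration of the definitions, not the evaluation code used to produce the figures.

```python
import numpy as np

def tracking_errors(obs2d, obs3d, X_obj, R_do, t_do, R_dv, t_dv, K):
    """Mean reprojection (pixels) and 3D (depth units) errors for one frame.
    obs2d: Nx2 pixel observations, obs3d: Nx3 depth observations,
    X_obj: Nx3 model points in the object frame."""
    Xd = (R_do @ X_obj.T).T + t_do            # model points in the depth frame
    err3d = np.linalg.norm(obs3d - Xd, axis=1)
    uvw = (K @ ((R_dv @ Xd.T).T + t_dv).T).T  # project into the RGB camera
    proj = uvw[:, :2] / uvw[:, 2:3]
    err2d = np.linalg.norm(obs2d - proj, axis=1)
    return err2d.mean(), err3d.mean()
```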

The mean error values are obtained by averaging the associated values over all tracked object points in a frame. With real data, the following algorithms are tested:

• the final 3D tracker proposed in Section 2,
• a 3D tracker which weights the 2D and 3D observations as in Sub-section 2-iv, but utilizes neither regular sampling of the 3D points (Sub-section 2-i) nor SIM-aided feature tracking (Sub-section 2-ii),
• the 3D tracker proposed in the previous thesis progress report, which utilizes neither observation weighting, nor regular sampling of the 3D points, nor SIM-aided feature tracking (the so-called "Base" algorithm),
• the quaternion-based 3D tracker [22].

Figure 20 shows the mean reprojection and 3D errors for the "Face" sequence.

(a) Mean reprojection errors. (b) Mean 3D errors.

Figure 20 Mean reprojection and 3D errors for the "Face" sequence.

Figure 21 shows the mean reprojection and 3D errors for the "Book" sequence.

(a) Mean reprojection errors. (b) Mean 3D errors.

Figure 21 Mean reprojection and 3D errors for the "Book" sequence.

Examining Figure 20 and Figure 21, one can conclude that weighting the observations increases 3D tracking performance; moreover, the final proposed algorithm has the best performance in terms of mean reprojection and 3D errors.

4. Conclusions and Future Work

In this progress report, the effort within the last six-month period is detailed. To summarize:

• A robust 3D tracker utilizing intensity, 3D and SIM data is proposed. The system weights observations based on their qualities, estimated using TIM and the 3D observation coordinates.
• The proposed formulation is simulated via Monte Carlo experiments and its convergence characteristics are analyzed. Moreover, the effect of weighting observations is examined by simulations.
• The system is tested with real data, and its performance is compared to the previously proposed algorithm.

The proposed algorithm enables objects selected by the user to be tracked in 3D space with high accuracy and reduced drift. The system relies on robust features associated via KLT tracking on intensity and SIM data. Furthermore, the proposed formulation enables weighting of the observations based on their qualities.

The algorithm tracks the 3D pose using the model points selected in the very first frame. However, due to occlusion or sensor noise, these points are gradually eliminated by the algorithm of Sub-section 2-iii; hence, the number of features decreases over time, which may result in a decrease in tracking quality (Figure 22).

(a) Initial frame (b) Final frame

Figure 22 Initial and final frames of the "Face" sequence.

Therefore, a routine which will update the model selected at t_0 will be developed.

In the proposed system, the model points utilized for tracking are selected using a regular grid, as in Figure 5. As future work, these features will instead be selected using intensity and SIM data: in order to select points with a high cornerness measure while maximizing their spread, the regular lattice shown in Figure 23 will be formed, and within each patch the features having the maximum intensity and SIM cornerness measures will be selected and utilized for 3D tracking.

Figure 23 Regular lattice for feature selection.

Furthermore, the performance of the proposed algorithm will be tested using sequences with available ground-truth camera poses; for instance, a recent RGB-D dataset is proposed in [23]. Finally, since it relies on both intensity and SIM data, the proposed algorithm should be able to operate under varying lighting conditions; this capability will also be tested in the future.


References

[1] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse, "MonoSLAM: Real-Time Single Camera SLAM", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 6, June 2007.

[2] Kevin Nickels and Seth Hutchinson, "Model-Based Tracking of Complex Articulated Objects", IEEE Transactions on Robotics and Automation, Vol. 17, No. 1, February 2001.

[3] Kevin Nickels and Seth Hutchinson, "Weighting Observations: The Use of Kinematic Models in Object Tracking", IEEE International Conference on Robotics and Automation, 1998.

[4] G. Taylor and L. Kleeman, "Fusion of Multimodal Visual Cues for Model-based Object Tracking", Australasian Conference on Robotics and Automation (ACRA 2003), Brisbane, Australia, 2003.

[5] Michael Krainin, Peter Henry, Xiaofeng Ren, and Dieter Fox, "Manipulator and Object Tracking for In-Hand 3D Object Modeling", The International Journal of Robotics Research, Vol. 30, No. 11, pp. 1311-1327, September 2011.

[6] Q. Gan and C. J. Harris, "Comparison of Two Measurement Fusion Methods for Kalman-Filter-Based Multisensor Data Fusion", IEEE Transactions on Aerospace and Electronic Systems, Vol. 37, No. 1, January 2001.

[7] Hauke Strasdat, J. M. M. Montiel, and Andrew J. Davison, "Real-time Monocular SLAM: Why Filter?", IEEE International Conference on Robotics and Automation, 2010.

[8] Vincent Lepetit and Pascal Fua, "Monocular Model-Based 3D Tracking of Rigid Objects: A Survey", Foundations and Trends in Computer Graphics and Vision, Vol. 1, No. 1, pp. 1-89, 2005.

[9] Georg Klein and David Murray, "Parallel Tracking and Mapping for Small AR Workspaces", IEEE International Symposium on Mixed and Augmented Reality, 2007.

[10] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon, "KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera", ACM Symposium on User Interface Software and Technology, 2011.

[11] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers, "Real-Time Visual Odometry from Dense RGB-D Images", IEEE International Conference on Computer Vision Workshops, 2011.

[12] V. Vezhnevets and V. Konouchine, "GrowCut - Interactive Multi-Label N-D Image Segmentation", Graphicon, 2005.

[13] MATLAB Central: Grow-cut Image Segmentation by Shawn Lankton. Retrieved 01.06.2011 from http://www.mathworks.com/matlabcentral/fileexchange/19091growcut-image-segmentation.

[14] G. Mu, M. Liao, R. Yang, D. Ouyang, Z. Xu, and X. Guo, "Complete 3D Model Reconstruction Using Two Types of Depth Sensors", ICIS 2010.

[15] J. J. Koenderink, Solid Shape, MIT Press, 1990.

[16] Jianbo Shi and Carlo Tomasi, "Good Features to Track", Computer Vision and Pattern Recognition, 1994.

[17] C. Harris and M. Stephens, "A Combined Corner and Edge Detector", Alvey Vision Conference, 1988.

[18] R. Liu, Stan Z. Li, X. Yuan, and R. He, "Online Determination of Track Loss Using Template Inverse Matching", VS 2008.

[19] J.-Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker", Intel Corporation, Microprocessor Research Labs, http://www.intel.com/research/mrl/research/opencv/.

[20] Shay Ohayon and Ehud Rivlin, "Robust 3D Head Tracking Using Camera Pose Estimation", International Conference on Pattern Recognition, 2006.

[21] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C - The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992.

[22] R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, McGraw-Hill, 1995.

[23] J. Sturm, S. Magnenat, N. Engelhard, F. Pomerleau, F. Colas, W. Burgard, D. Cremers, and R. Siegwart, "Towards a Benchmark for RGB-D SLAM Evaluation", In Proc. of the RGB-D Workshop on Advanced Reasoning with Depth Cameras at Robotics: Science and Systems Conf. (RSS), 2011.

[24] Neslihan Yalcin Bayramoglu, "Range Data Recognition: Segmentation, Matching, and Similarity Retrieval", PhD Dissertation, The Graduate School of Natural and Applied Sciences of Middle East Technical University, 2011.
