Video Deblurring for Hand-held Cameras Using
Patch-based Synthesis
Team members:
瞿晖 1120349077
丰子灏 1120349043
李婷 1120349143
1. Introduction
Video stabilization systems [1, 2] have been proposed recently to smooth the camera motion in a shaky video. Although these approaches can successfully stabilize the video content, they leave the blurriness caused by the original camera motion untouched. On the other hand, blurry video frames also hinder video stabilization approaches from achieving high-quality results. Most stabilization systems rely on feature tracking to plan the camera motion, but feature tracking over blurry frames is often unreliable due to the lack of sharp image features. Restoring sharp frames from blurry ones caused by camera motion, which we dub video motion deblurring, is thus critical to generating high-quality stabilization results.
A straightforward idea for video motion deblurring is to first identify blurry frames and then apply existing single- or multiple-image deblurring techniques to them. Unfortunately, existing image deblurring approaches turn out to be incapable of generating satisfactory results on video.
Existing deblurring methods fall into several categories:
(1) Video deblurring by interpolation:
Matsushita et al. [3] proposed a practical video deblurring method in their video stabilization
system. They detected sharp frames using the statistics of image gradients, and interpolated sharp
frames to increase the sharpness of blurry frames.
(2) Lucky imaging:
Lucky imaging methods assume that the camera is static, so the small misalignment between images can be removed by a patch-based search with a simple comparison of pixel values.
(3) Patch-based synthesis:
Recently, Barnes et al. [4] extended their PatchMatch algorithm to also search across scales and rotations. HaCohen et al. [5] proposed a non-rigid dense correspondence algorithm and applied it to image deblurring by iteratively performing blind deconvolution and image registration.
However, the previous approaches either degrade the quality of the deblurred frames or throw away most of the data.
In the paper, the authors present an effective, practical solution for video motion deblurring,
which avoids applying direct kernel estimation and deconvolution to video frames. They first
estimate a parametric, homography-based motion for each frame as an approximation to the real
motion. They then use the approximated motion for defining the luckiness of a pixel to measure its
sharpness. To deblur a video frame, the authors search for luckier pixels in nearby frames and use
them to replace less lucky ones in the current frame. The pixel correspondence across frames is obtained using the estimated homographies, followed by a local search for the best-matching image patch to compensate for the inaccuracy of the motion model. To compare a sharp patch with a blurry one, they use forward convolution to blur the sharp patch with the estimated blur function of the blurry patch. When copying lucky pixels to a blurry frame, they adopt a patch-based texture synthesis approach [6] to better preserve object structures. Finally, they impose a similarity constraint on the corresponding patches in consecutive frames to maintain temporal coherence of the deblurred frames.
Our team implemented part of their work and achieved relatively good deblurring results. In the following, we introduce their work and our implementation in detail.
2. Blur model
2.1 Approximate blur model
The authors use homographies to approximate the motion blur introduced by hand-held cameras. As shown in Fig.2.1, they assume that the velocity of the motion is constant between adjacent frames, that the duty cycle of the camera is 2τ, and that the exposure time interval for frame f_i is [t_i - τ, t_i + τ]. Let the time interval between frames be T = t_i - t_{i-1}, and let the latent image of f_i be l_i. Then l_{i+1} = H_i(l_i), where H_i is a warping function parameterized by a 3 × 3 homography matrix H_i.
Fig.2.1. Illustration of the blur model
The blur model defined in the paper is:

f_i = b_i(l_i) = \frac{1}{1+2\tau}\left[\sum_{d=1}^{\tau}\left(H_{i-1}^{d}(l_i) + H_i^{d}(l_i)\right) + l_i\right]    (2-1)

where H_{i-1}^{t} and H_i^{t} are defined as:

H_{i-1}^{t} = \frac{T-t}{T} I + \frac{t}{T} H_{i-1}^{-1}, \qquad H_i^{t} = \frac{T-t}{T} I + \frac{t}{T} H_i    (2-2)

Here H_{i-1}^{-1} is the inverse matrix of H_{i-1}, and I is the 3 × 3 identity matrix. b_i is called the blur function of frame i. In this discrete form, T becomes the sampling rate in [t_i, t_{i+1}] and is set to 20 in our implementation, the same value as in the paper, and τ becomes the number of samples that fall into the duty cycle.
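As a concrete illustration of equations (2-1) and (2-2), the following Python sketch synthesizes a blurred frame from a latent image. The function name and the use of cv2.warpPerspective as the warping operator H(·) are choices of this sketch, not something prescribed by the paper.

```python
import cv2
import numpy as np

def blur_function(latent, H_prev, H_next, tau, T=20):
    """Synthesize a blurred frame from the latent image (Eqs. 2-1/2-2).
    H_prev = H_{i-1}, H_next = H_i (3x3 matrices); tau controls how many
    interpolated warps are accumulated on each side of the frame center,
    and T is the sampling rate within one frame interval."""
    h, w = latent.shape[:2]
    H_prev_inv = np.linalg.inv(H_prev)
    acc = latent.astype(np.float64).copy()
    for d in range(1, tau + 1):
        a = d / float(T)
        Hb = (1.0 - a) * np.eye(3) + a * H_prev_inv   # backward interpolated homography (Eq. 2-2)
        Hf = (1.0 - a) * np.eye(3) + a * H_next       # forward interpolated homography (Eq. 2-2)
        acc += cv2.warpPerspective(latent, Hb, (w, h)).astype(np.float64)
        acc += cv2.warpPerspective(latent, Hf, (w, h)).astype(np.float64)
    return acc / (1.0 + 2 * tau)                      # Eq. (2-1)
```

With τ = 0 the loop is skipped and the latent image is returned unchanged, which matches the limit of equation (2-1).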
2.2 Blur model estimation
There are two parameters to be estimated in order to obtain the blur function: the homography H_i and the duty cycle τ.
2.2.1 Homography calculation
To estimate the homography H_i, the authors use the KLT approach [7] to track features and then compute the initial homographies. They also refine the homographies using Lucas-Kanade registration [8] between frames. In our implementation, we use OpenCV functions to track features and then use linear optimization to compute the homographies, without the refinement step.
2.2.1.1 Feature extraction
The difficulty in image feature extraction is ensuring both robustness and discriminability. Robustness means that the features are not influenced by noise, lighting, or geometric transformations, and discriminability means that different features can be well separated. In recent years, SIFT [9] and Speeded-Up Robust Features (SURF) [10] have proven effective for feature extraction. However, the scale, rotation, and illumination of adjacent frames change very little, so we use the Harris corner detection operator [11] for simplicity.
Image points can be divided into three categories: flat, edge, and corner. The basic idea of Harris corner detection is that corners are easy to find by observing the image through a small window: the brightness inside a window centered on a corner changes considerably when the window is moved in any direction, as shown in Fig. 2.2.
Fig. 2.2 The basic idea of Harris corner detection: (a) flat region; (b) edge; (c) corner.
2.2.1.2 Feature tracking
After a certain number of feature points are obtained, the corresponding position of these
feature points in the next frame should be figured out in order to estimate the motion between two
frames. We use the pyramid Lucas-Kanade method (LK) [12] to track.
LK algorithm is based on the following three assumptions: (1) Constant brightness (2) Time
is continuous or movement is "little movement" (3) Space consistency.
These assumptions ensure that tracked motion is “small and consistent movement”, when the
LK algorithm can achieve good results. However, large and not consistent motion widely exists in
shaky videos, and LK algorithm cannot obtain good tracking results in this condition. Image
pyramid can decompose big and incoherent motion into small and coherent movement, so LK
algorithm based on image pyramid is effective on tracking larger and faster motion.
To make tracking more robust to moving objects, we adopt two methods:
(a) The minimum distance between feature points is set relatively large to ensure that the feature points are distributed uniformly over the whole image. Assuming that 100 feature points are to be selected, the minimum distance between feature points can be set to:

min\_dist = \min\left(\frac{h}{15}, \frac{w}{15}\right)    (2-3)

where h and w denote the height and width of the image, respectively.
(b) Re-select feature points every K frames to reduce the accumulated tracking error. K is set
to 10 in our implementation.
The result of feature extraction in the first frame of sequence “books” is shown in Fig.2.3.
The red points are extracted features.
Fig. 2.3 The extracted features in the first frame of sequence “books”
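A minimal OpenCV sketch of our detection-and-tracking step is given below. The quality level, LK window size, and pyramid depth are choices of this sketch rather than values taken from the paper; points are re-detected every K = 10 frames in the surrounding loop (not shown).

```python
import cv2

def detect_features(gray, max_corners=100):
    """Harris corner detection with a large minimum distance (Eq. 2-3),
    so the selected points spread over the whole image."""
    h, w = gray.shape
    min_dist = min(h, w) / 15.0
    return cv2.goodFeaturesToTrack(gray, maxCorners=max_corners, qualityLevel=0.01,
                                   minDistance=min_dist, useHarrisDetector=True)

def track_features(prev_gray, curr_gray, prev_pts):
    """Pyramidal Lucas-Kanade tracking of the detected corners into the next frame."""
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, prev_pts, None,
                                                   winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    return prev_pts[good], curr_pts[good]
```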
2.2.1.3 Outlier rejection
Before we use the tracked feature pairs to compute homographies, we perform outlier rejection to eliminate the feature pairs on foreground objects and improve the accuracy of homography estimation.
There are local and global outlier rejection methods; in our implementation we use global outlier rejection by RANSAC. The basic idea is to use RANSAC to estimate a translational model for all feature pairs and retain only those pairs that agree with the estimated model up to a threshold distance (10 pixels in our implementation).
One problem arises when estimating the translational model. The number of feature pairs is relatively large, but only one feature pair is selected in each loop, and not all feature pairs can be tried within the limited number of loops. This may lead to wrong outlier rejection, where some noise points or foreground points are considered inliers, see Fig.2.4. To solve this problem, we first check whether the number of inliers is larger than a threshold (one third of the total number of features); if not, we perform another round of outlier rejection on the remaining "outliers". The correct outlier rejection result is shown in Fig.2.5.
Fig. 2.4 Wrong outlier rejection: the points with green lines are considered inliers.
Fig. 2.5 Correct outlier rejection: the points with green lines are considered inliers.
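A sketch of the global outlier rejection described above follows. The number of RANSAC iterations (500) is a choice of this sketch; the fallback of re-running the rejection when fewer than one third of the pairs are inliers can be layered on top of this routine.

```python
import numpy as np

def ransac_translation_inliers(src, dst, thresh=10.0, iters=500, rng=None):
    """Fit a translational model to the feature pairs with RANSAC and keep the
    pairs that agree with it up to `thresh` pixels (10 in our implementation).
    src, dst: (N, 2) arrays of matched feature positions."""
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        k = rng.integers(len(src))           # a single pair defines a translation
        t = dst[k] - src[k]
        err = np.linalg.norm(src + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```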
2.2.1.4 Homography estimation by linear optimization
The homography matrix between two frames can be computed from at least four feature pairs. However, the number of feature pairs after outlier rejection is much larger, so we use linear optimization to compute the homographies.
The homography from frame i to frame j is denoted by H_{ij}, and (p_k^i, p_k^j), k = 1, 2, ..., N, is the k-th feature pair between frame i and frame j. Theoretically, we have

p_k^j = H_{ij} p_k^i    (2-4)

for all k. In practice, however, H_{ij} cannot reflect the motion between frames exactly. The error is denoted as:

error = \sum_{k=1}^{N} \left\| p_k^j - H_{ij} p_k^i \right\|    (2-5)

We treat equation (2-5) as the objective function; when it reaches its minimum value, the corresponding H_{ij} is the homography we want. The optimization problem can be solved by a linear programming solver after introducing slack variables e_k:

-e_k \le p_k^j - H_{ij} p_k^i \le e_k    (2-6)

with e_k \ge 0. We use the freely available COIN CLP simplex solver [13] to solve the problem.
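Neither the paper nor our report spells out the exact linear program, so the sketch below shows one plausible formulation: the residual in equation (2-6) is linearized as the algebraic error obtained by fixing h_33 = 1, and scipy.optimize.linprog (HiGHS backend) stands in for the COIN CLP solver.

```python
import numpy as np
from scipy.optimize import linprog

def fit_homography_l1(src, dst):
    """L1 homography fit via linear programming (a linearization of Eqs. 2-5/2-6).
    src, dst: (N, 2) arrays of matched points p_k^i -> p_k^j. Unknowns are the first
    eight entries of H (h33 = 1) plus one slack e_k per pair; we minimize sum_k e_k
    subject to |algebraic residual| <= e_k."""
    n = len(src)
    c = np.concatenate([np.zeros(8), np.ones(n)])          # objective: sum of slacks
    A_ub, b_ub = [], []
    for k in range(n):
        x, y = src[k]
        u, v = dst[k]
        e = np.zeros(n)
        e[k] = 1.0
        # r_u = u*(h31*x + h32*y + 1) - (h11*x + h12*y + h13)
        # r_v = v*(h31*x + h32*y + 1) - (h21*x + h22*y + h23)
        a_u = np.array([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y])
        a_v = np.array([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y])
        for a, const in ((a_u, u), (a_v, v)):
            A_ub.append(np.concatenate([a, -e]))           #  r <= e_k
            b_ub.append(-const)
            A_ub.append(np.concatenate([-a, -e]))          # -r <= e_k
            b_ub.append(const)
    bounds = [(None, None)] * 8 + [(0, None)] * n          # e_k >= 0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return np.append(res.x[:8], 1.0).reshape(3, 3)
```

For better numerical conditioning, the point coordinates should be shifted and scaled to roughly unit magnitude before building the constraints, with the result de-normalized afterwards.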
In the deblurring approach, we need to estimate homographies not only between adjacent frames, but also between any two frames in a local temporal window W_i = [i - M, i + M], where M is set to 5 frames. In the paper, the authors compute these homographies in the same way as those between adjacent frames. In our implementation, we obtain them by multiplying the adjacent-frame homographies:

H_{ij} = H_{i,i+1} H_{i+1,i+2} \cdots H_{j-1,j}, \quad i < j    (2-7)

H_{ij} = H_{i,i-1} H_{i-1,i-2} \cdots H_{j+1,j}, \quad i > j    (2-8)

This method also yields small errors in the homographies, because the adjacent-frame homographies are accurate, as we compute them by optimization.
2.2.2 Duty cycle τ calculation
To compute τ, the authors first select a set of frame pairs, where each pair has a large difference in the luckiness measurement, so that the accuracy of the blur functions can be effectively tested. They then minimize an energy function of τ to find its value. We first introduce the luckiness measurement and then the method for finding τ.
2.2.2.1 Luckiness measurement
The luckiness of a pixel describes the absolute displacement of the pixel among adjacent frames. For a pixel x in frame f_i, its luckiness is defined as:

\lambda_i(x) = \exp\left( - \frac{\left\| H_{i-1}^{-1}(x) - x \right\|^2 + \left\| H_i(x) - x \right\|^2}{2 \sigma_l^2} \right)    (2-9)

where H(x) is a function that maps a pixel position to another pixel position according to the homography H, and σ_l is a constant set to 20 pixels. When the motion of x is small, H_{i-1}^{-1} and H_i are close to I, so λ_i(x) is close to 1, indicating that the image patch centered at x is likely to be sharp. Otherwise λ_i(x) is small, indicating that the patch is likely to contain large motion blur. The luckiness λ_i of a whole frame f_i is simply defined as the average of λ_i(x) over all pixels in f_i.
We compute the luckiness of each frame according to equation (2-9). However, the authors do not mention how to deal with the first and last frames, whose luckiness cannot be obtained from equation (2-9) because there is no H_{i-1}^{-1} for the first frame and no H_i for the last frame. In our implementation, we set H_{i-1}^{-1} = H_i^{-1} for the first frame and H_i = H_{i-1} for the last frame. The frame luckiness values of sequence "books" are shown in Fig.2.6; our result is Fig.2.6(a) and the result in the paper is Fig.2.6(b). We can see that they are nearly the same.
Fig. 2.6 Frame luckiness values of sequence "books": (a) our implementation; (b) the paper.
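A small NumPy sketch of the per-pixel luckiness of equation (2-9) and the per-frame average; the vectorized form is a choice of this sketch.

```python
import numpy as np

def luckiness_map(H_prev_inv, H_next, h, w, sigma_l=20.0):
    """Per-pixel luckiness (Eq. 2-9): small displacement under the adjacent
    homographies gives a value close to 1 (sharp), large displacement a value
    close to 0. H_prev_inv = H_{i-1}^{-1}, H_next = H_i (3x3 point-mapping matrices)."""
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x (h*w)

    def displacement_sq(H):
        q = H @ pts
        q = q[:2] / q[2]                          # perspective division
        return np.sum((q - pts[:2]) ** 2, axis=0)

    d2 = displacement_sq(H_prev_inv) + displacement_sq(H_next)
    lam = np.exp(-d2 / (2.0 * sigma_l ** 2)).reshape(h, w)
    return lam, lam.mean()                        # per-pixel map and frame luckiness
```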
2.2.2.2 Computing the optimal τ
Let (f_i^k, f_j^k), k = 1, ..., K, be K pairs of frames with j ∈ W_i, for which the frame luckiness difference λ_i - λ_j is larger than a threshold (0.6λ* as the authors set, where λ* = \max_i λ_i). Then the optimal τ can be obtained by minimizing the following function:

E(\tau) = \sum_{k=1}^{K} \left\| b_j^k\left( H_{ij}^k\left( f_i^k \right) \right) - f_j^k \right\|^2    (2-10)
Here H_{ij}^k(f_i^k) is the aligned sharp frame, and b_j^k(H_{ij}^k(f_i^k)) is the synthetic blurred frame generated from the sharp frame f_i^k. If the blur functions b_j^k are accurate, the synthetic blurred frames are close to the real blurred frames, so the value of equation (2-10) should be small. From Fig.2.1, we can see that τ can take only a limited set of integer values from 1 to T/2, so the authors minimize equation (2-10) using a brute-force search. For simplicity, we use a linear search, which is a simple kind of brute-force search.
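A sketch of this linear (brute-force) search is shown below. It reuses the blur_function sketch given after equation (2-2), and the structure of frame_pairs (sharp frame, blurry frame, aligning homography, and the blurry frame's adjacent homographies) is an assumption of this sketch.

```python
import cv2
import numpy as np

def estimate_duty_cycle(frame_pairs, T=20):
    """Brute-force search for the tau minimizing Eq. (2-10).
    Each entry of frame_pairs is (f_i, f_j, H_ij, H_jm1, H_j): a sharp frame, a
    blurry frame, the homography warping f_i into f_j's view, and H_{j-1}, H_j.
    blur_function: see the sketch after Eq. (2-2)."""
    h, w = frame_pairs[0][1].shape[:2]
    best_tau, best_E = 1, float("inf")
    for tau in range(1, T // 2 + 1):
        E = 0.0
        for f_i, f_j, H_ij, H_jm1, H_j in frame_pairs:
            aligned = cv2.warpPerspective(f_i, H_ij, (w, h))        # H_ij(f_i)
            synthetic = blur_function(aligned, H_jm1, H_j, tau, T)  # b_j(H_ij(f_i))
            E += float(np.sum((synthetic - f_j) ** 2))              # Eq. (2-10)
        if E < best_E:
            best_tau, best_E = tau, E
    return best_tau
```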
Fig.2.7 shows some results of this process on sequence "books": Fig.2.7(a) is the sharp frame 6, Fig.2.7(b) is the blurred warped version of frame 6 aligned to frame 2, Fig.2.7(c) is the real blurry frame 2, and Fig.2.7(d) is the difference between the synthetic blurred frame and the real blurred frame.
Fig. 2.7 Results of the blur function on sequence "books": (a) sharp frame 6; (b) blurred warped frame 6; (c) blurry frame 2; (d) difference between (b) and (c).
3. Single frame deblurring
Once the blur function b_i for frame f_i has been determined, we can use sharp patches in nearby frames to restore the latent frame l_i.
3.1 Patch deblurring
Let f_{i,x} be an n × n image patch in frame f_i centered at a pixel x (n = 11). f_{i,x} can be deblurred by computing a weighted average of sharp patches from nearby frames f_j in the temporal window W_i:

l_{i,x} = \frac{1}{Z} \sum_{(j,y) \in \mathcal{N}_{i,x}} w(i,x,j,y)\, f_{j,y}    (3-1)

where l_{i,x} is the deblurred patch of f_{i,x}, and f_{j,y} is a patch in the warped frame H_{ji}(f_j) centered at a pixel y. The weight w(i,x,j,y) is defined as:

w(i,x,j,y) = \exp\left( - \frac{\left\| b_{j,y} - f_{i,x} \right\|^2}{2 \sigma_w^2} \right)    (3-2)

where b_{j,y} is the patch centered at y in the blurred warped frame b_i(H_{ji}(f_j)), σ_w is a constant set to 0.1, and Z is the normalization factor.
The key issue is how to find the N best-matching patches from the warped nearby frames H_{ji}(f_j).
3.1.1 Patch searching
If a patch f_{j,y} is the best-matching one, then its blurred version b_{j,y} should be very close to the real blurry patch f_{i,x}. Therefore, we just need to find the patch that solves

\arg\min_{j,y} \left\| b_{j,y} - f_{i,x} \right\|^2    (3-3)

To find a matching patch f_{j,y} in H_{ji}(f_j), the authors search an m × m window centered at the pixel x. Ideally, if H_{ji} were accurate enough, we could simply set the search range m to one. In practice, however, due to parallax and object motion, the real motion among frames is generally more complicated than a single homography. Therefore, they set m = 21, and we use the same value in our implementation.
In the paper, only the single best-matching patch is used to restore a latent patch, and equation (3-1) reduces to:

l_{i,x} = f_{j,y}    (3-4)

For patch deblurring, we found that one patch is not enough to obtain smooth deblurring results. We therefore search for the N = 3 best-matching patches and use equation (3-1) to restore a latent patch.
In our implementation, we compute \| b_{j,y} - f_{i,x} \|^2 pixel by pixel for each frame in the temporal window W_i and find the 3 best patch candidates in each frame. The 3 best-matching patches are then selected from the 33 patch candidates. The number of search loops is 11 × 11 × 11 = 1331, which is acceptable and not time-consuming.
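A sketch of the local patch search of equation (3-3) within a single warped nearby frame is shown below; image borders are ignored for brevity, and the exhaustive double loop mirrors our pixel-by-pixel implementation.

```python
import numpy as np

def best_patches_in_frame(blurry_patch, warped_sharp, warped_blurred, cx, cy,
                          m=21, n=11, k=3):
    """Search an m x m window centred at (cx, cy) in the warped frame H_ji(f_j).
    warped_blurred = b_i(H_ji(f_j)); a candidate is scored by the SSD between its
    blurred patch and the blurry input patch f_{i,x} (Eq. 3-3). Returns the k best
    (ssd, sharp_patch) pairs."""
    r, half = m // 2, n // 2
    candidates = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y, x = cy + dy, cx + dx
            blurred = warped_blurred[y - half:y + half + 1, x - half:x + half + 1]
            sharp = warped_sharp[y - half:y + half + 1, x - half:x + half + 1]
            diff = blurred.astype(np.float64) - blurry_patch
            candidates.append((float(np.sum(diff ** 2)), sharp))
    candidates.sort(key=lambda c: c[0])
    return candidates[:k]
```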
3.1.2 Latent patch restoration
After we obtain the 3 best-matching patches for the current blurry patch f_{i,x}, we use them to restore the latent patch l_{i,x} according to equations (3-1) and (3-2). We found that if the parameter σ_w in the weight function is set to 0.1 as in the paper, the weight becomes zero, because the difference \| b_{j,y} - f_{i,x} \|^2 is very large (75.43, for example) compared to the value of σ_w. For the weight to be non-negligible with σ_w = 0.1, the difference \| b_{j,y} - f_{i,x} \|^2 would have to be smaller than about 0.2; otherwise the weight is too small or even zero. However, the difference can hardly be so small, since that would require an average squared error per pixel of 0.2 / (11 × 11) ≈ 0.00165, which seems impossible given the errors in the homographies and the blur function. We therefore suspect that the authors wrote a wrong value of σ_w by mistake. In our implementation, we set σ_w = 10 and achieve good results.
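A sketch of the weighted patch restoration of equations (3-1) and (3-2), using the candidates returned by the search sketch above and our value σ_w = 10:

```python
import numpy as np

def restore_patch(blurry_patch, candidates, sigma_w=10.0):
    """Weighted average of the N best-matching sharp patches (Eqs. 3-1/3-2).
    candidates: list of (ssd, sharp_patch) pairs; sigma_w = 10 is our value,
    the paper states 0.1 (see the discussion above)."""
    acc = np.zeros_like(candidates[0][1], dtype=np.float64)
    Z = 0.0
    for ssd, sharp in candidates:
        w = np.exp(-ssd / (2.0 * sigma_w ** 2))   # Eq. (3-2)
        acc += w * sharp
        Z += w
    return acc / Z if Z > 0 else blurry_patch     # Eq. (3-1)
```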
3.1.3 Patch deblurring results
We show some results of our implementation of this part. Fig.3.1(a) is a region in the input blurry frame 4 of sequence "books", Fig.3.1(b) is the result of using one patch to restore the latent patch, and Fig.3.1(c) is the result of using three patches. We can see that the result in (c) is slightly better than that in (b), especially in the areas marked by the red rectangles. The difference is more obvious when magnified.
Fig.3.1 Deblurring results using different numbers of best patches: (a) a region in blurry frame 4; (b) restored patch using one patch; (c) restored patch using three patches.
3.2 Frame deblurring
Although we can restore the latent frame l_i using equation (3-1) patch by patch, this approach may lead to misalignment of object structures in l_i, as the pixels in l_i are determined individually without enforcing spatial coherence (see the region marked by the green rectangle in Fig.3.1(c)). To overcome this problem, the authors adapt a patch-based texture synthesis approach [6] to merge the effects of overlapping deblurred patches in l_i.
Let l_i(x) be the value of l_i at a pixel x. They determine l_i(x) as:

l_i(x) = \frac{1}{Z} \sum_{x'} l_{i,x'}(x) = \frac{1}{Z} \sum_{x'} \sum_{(j,y) \in \mathcal{N}_{i,x'}} w(i,x',j,y)\, f_{j,y}(x)    (3-5)

where the outer sum runs over the pixels x' whose deblurred patches l_{i,x'} cover x.
The precise meaning of the symbols can be found in the original paper. In general, they first compute deblurred patches for every pixel of the current frame f_i, so that each pixel x is covered by n^2 deblurred patches, whose values at x are averaged with the weights in equation (3-5). To speed up the whole process, the authors perform deblurring for a sparse regular grid of pixels instead of for every pixel, as done in [6].
Since the pixel-by-pixel deblurring process costs too much time (about 20 minutes per frame), we have not integrated this part into our system. Several issues would need to be addressed to accelerate the process; we could then integrate this part into our system and obtain better deblurring results.
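Although we have not integrated this part, a rough sketch of the overlapping-patch blend is given below; for brevity every patch contributes with equal weight instead of the per-patch weights of equation (3-5), and the patch grid is assumed to be given as a dictionary.

```python
import numpy as np

def blend_overlapping_patches(deblurred_patches, h, w, n=11):
    """Merge overlapping deblurred patches into one frame (a simplified Eq. 3-5).
    deblurred_patches maps a grid pixel (y, x) to its n x n deblurred patch;
    every patch is weighted equally here."""
    acc = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    half = n // 2
    for (y, x), patch in deblurred_patches.items():
        acc[y - half:y + half + 1, x - half:x + half + 1] += patch
        weight[y - half:y + half + 1, x - half:x + half + 1] += 1.0
    return acc / np.maximum(weight, 1e-8)
```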
3.3 Handling moving objects
The deblurring method can handle slightly moving objects thanks to the local search for matching patches. Objects with large motion are supposed to be kept untouched. However, our implementation fails to keep objects with large motion untouched. Fig.3.2 shows the blurry frame 28 of sequence "bicycle" and our deblurring result. It is obvious that the sportsman is not kept untouched in our result, and this region is not successfully deblurred either. One possible reason is that we have not integrated the frame deblurring part; another is that something is wrong in our implementation. We have not figured it out yet.
Fig.3.2 Deblurring result on a frame with a moving object: (a) the blurry frame 28; (b) our deblurred result.
4. Improved deblurring using luckiness
To further improve the sharpness of the deblurred frames, the authors incorporate luckiness into the process in three ways:
(1) They use the luckiness values to determine the processing order of the frames. The frame with the highest luckiness value is deblurred first, and most of its pixels remain unchanged after deblurring. As the luckiness values of the frames become lower, more pixels are updated with sharper pixels from already processed frames.
(2) They revise the weight function in equation (3-2):

w'(i,x,j,y) = w(i,x,j,y)\, \exp\left( - \frac{\left\| 1 - \Lambda_{j,y} \right\|^2}{2 \sigma^2} \right)    (4-1)

where Λ_{j,y} is an n × n patch centered at a pixel y in the luckiness map Λ_j, and σ is a constant. The luckiness map Λ_j holds the luckiness values of the pixels in the warped frame H_{ji}(f_j).
(3) They also introduce a luckiness term when searching for the best patches, and equation (3-3) becomes:

\arg\min_{j,y} \left\| b_{j,y} - f_{i,x} \right\|^2 + \alpha \left\| 1 - \Lambda_{j,y} \right\|^2    (4-2)

where α is a weight that adjusts the relative importance of the patch-matching term and the luckiness term.
The value of σ is not mentioned in the paper; we set σ = 1. The value of α is set to 0.01 in the paper, which is too small in our system and brings no improvement to the results. We therefore tried several values and found that α = 10 is suitable for our system.
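A sketch of the two luckiness terms with our parameter values (σ = 1, α = 10); lam_patch is the n × n block of the warped luckiness map around the candidate pixel.

```python
import numpy as np

def lucky_weight(w, lam_patch, sigma=1.0):
    """Luckiness-weighted blending weight (Eq. 4-1); sigma = 1 is our choice,
    since the paper does not specify it."""
    return w * np.exp(-float(np.sum((1.0 - lam_patch) ** 2)) / (2.0 * sigma ** 2))

def lucky_score(ssd, lam_patch, alpha=10.0):
    """Luckiness-augmented matching cost (Eq. 4-2): the SSD patch term plus a
    penalty for candidates drawn from unlucky (blurry) regions. alpha = 10 is the
    value we found to work; the paper uses 0.01."""
    return ssd + alpha * float(np.sum((1.0 - lam_patch) ** 2))
```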
The deblurred results before and after using luckiness on frame 4 of sequence “books” are
shown in Fig.4.1.
Fig.4.1 Deblurring results before and after using luckiness: (a) a region in blurry frame 4; (b) without luckiness; (c) with luckiness.
5. Conclusion and Future work
In general, we have implemented the basic components of the paper, including blur function estimation, patch deblurring, and improved deblurring using luckiness. The speed of our system is about 30 seconds per frame, and we obtain relatively good deblurring results with our code, see Fig.4.1(a) and (c). Our deblurring results are worse than the paper's, as we have not implemented all of their work. Another reason may be that we handle some issues differently from the paper, which may affect the deblurring results.
The differences between our implementation and the work in the paper lie in the following aspects:
(1) We compute the homographies between adjacent frames by linear programming, while the paper does not mention how they are computed specifically.
(2) We calculate the homographies between non-adjacent frames by matrix multiplication, instead of by feature tracking followed by refinement as in the paper.
(3) We solve the problem in equation (3-3) by linear search, while the authors' method is not mentioned.
(4) The values of some parameters in our implementation differ from those in the paper, i.e., σ_w, N, σ, and α.
The first two items may affect the accuracy of the homographies. However, the homographies are used in the luckiness measurement, and we obtain nearly the same luckiness results as those in the paper (Fig.2.6), which indicates that the difference in the homographies is small. The third item only concerns speed and has little influence on quality, while the last item strongly affects the deblurring results: we cannot obtain good results when using the values from the paper.
Due to the limited time, we could not implement the work completely and some questions remain. Next, we may implement the frame deblurring part and then investigate why the values of these parameters differ from those in the paper. We also need to improve our system to handle fast-moving objects. After that, the remaining parts of the paper may be implemented.
References
[1] F. Liu, M. Gleicher, J. Wang, et al. "Subspace video stabilization". ACM Transactions on Graphics (TOG), 30(1): 4, 2011.
[2] M. Grundmann, V. Kwatra, and I. Essa. "Auto-directed video stabilization with robust L1 optimal camera paths". In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[3] Y. Matsushita, et al. "Full-frame video stabilization with motion inpainting". IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7): 1150-1163, 2006.
[4] C. Barnes, et al. "The generalized PatchMatch correspondence algorithm". In Proc. ECCV, pp. 29-43, 2010.
[5] Y. HaCohen, et al. "Non-rigid dense correspondence with applications for image enhancement". ACM Transactions on Graphics (TOG), 30(4): 70, 2011.
[6] V. Kwatra, et al. "Texture optimization for example-based synthesis". ACM Transactions on Graphics (TOG), 24(3), 2005.
[7] J. Shi and C. Tomasi. "Good features to track". In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994.
[8] S. Baker and I. Matthews. "Lucas-Kanade 20 years on: A unifying framework". International Journal of Computer Vision, 56(3): 221-255, 2004.
[9] D. Lowe. "Distinctive image features from scale-invariant keypoints". International Journal of Computer Vision, 60(2): 91-110, 2004.
[10] H. Bay, T. Tuytelaars, and L. J. Van Gool. "SURF: Speeded Up Robust Features". In Proc. ECCV, pp. 404-417, 2006.
[11] C. Harris and M. Stephens. "A combined corner and edge detector". In Proceedings of the Fourth Alvey Vision Conference, pp. 147-151, 1988.
[12] B. D. Lucas and T. Kanade. "An iterative image registration technique with an application to stereo vision". In Proc. Int. Joint Conf. on Artificial Intelligence, pp. 674-679, 1981.
[13] COIN CLP simplex solver: http://www.coin-or.org/Clp/userguide/index.html