Video Deblurring for Hand-held Cameras Using Patch-based Synthesis

Team members: 瞿晖 1120349077, 丰子灏 1120349043, 李婷 1120349143

1. Introduction

Video stabilization systems [1, 2] have recently been proposed to smooth the camera motion in shaky videos. Although these approaches can successfully stabilize the video content, they leave the blurriness caused by the original camera motion untouched. Conversely, blurry video frames also hinder video stabilization approaches from achieving high-quality results: most stabilization systems rely on feature tracking to plan the camera motion, but feature tracking over blurry frames is often unreliable due to the lack of sharp image features. Restoring sharp frames from frames blurred by camera motion, which we dub video motion deblurring, is thus critical to generating high-quality stabilization results.

A straightforward idea for video motion deblurring is to first identify blurry frames and then apply existing single- or multiple-image deblurring techniques to them. Unfortunately, existing image deblurring approaches are incapable of generating satisfactory results on video. Previous deblurring methods fall into several categories:
(1) Video deblurring by interpolation: Matsushita et al. [3] proposed a practical video deblurring method in their video stabilization system. They detected sharp frames using the statistics of image gradients, and interpolated sharp frames to increase the sharpness of blurry frames.
(2) Lucky imaging: these approaches assume the camera is static, so the small amount of misalignment between images can be removed by a patch-based search with a simple comparison of pixel values.
(3) Patch-based synthesis: recently, Barnes et al. [4] extended their PatchMatch algorithm to also search across scales and rotations. HaCohen et al. [5] proposed a non-rigid dense correspondence algorithm and applied it to image deblurring by iteratively performing blind deconvolution and image registration.
However, these previous approaches either degrade the quality of the deblurred frames or throw away most of the data.

In the paper, the authors present an effective, practical solution for video motion deblurring that avoids applying direct kernel estimation and deconvolution to video frames. They first estimate a parametric, homography-based motion for each frame as an approximation to the real motion. They then use the approximated motion to define the luckiness of a pixel, which measures its sharpness. To deblur a video frame, the authors search for luckier pixels in nearby frames and use them to replace less lucky ones in the current frame. The pixel correspondence across frames is obtained using the estimated homographies, followed by a local search for the best-matching image patch to compensate for the inaccuracy of the motion model. To compare a sharp patch with a blurry one, they use forward convolution to blur the sharp patch with the estimated blur function of the blurry patch. When copying lucky pixels to a blurry frame, they adopt a patch-based texture synthesis approach [6] to better preserve object structures. Finally, they impose a similarity constraint on corresponding patches in consecutive frames to maintain temporal coherence of the deblurred frames.

Our team implemented part of their work and achieved relatively good deblurring results. Next, we introduce their work and our implementation in detail.
2. Blur model

2.1 Approximate blur model

The authors use homographies to approximate the motion blur introduced by hand-held cameras. As shown in Fig. 2.1, they assume that the velocity of the motion is constant between adjacent frames, that the duty cycle of the camera is $2\tau$, and that the exposure time interval for frame $f_i$ is $[t_i - \tau, t_i + \tau]$. Let the frame interval be $T = t_{i+1} - t_i$, and let the latent image of $f_i$ be $l_i$. Then $l_{i+1} = H_i(l_i)$, where $H_i$ is a warping function parameterized by a $3 \times 3$ homography matrix $\mathbf{H}_i$.

Fig. 2.1 Illustration of the blur model

The blur model defined in the paper is:

$$f_i = b_i(l_i) = \frac{1}{2\tau + 1}\Big[\sum_{d=1}^{\tau}\big(H_{i-1}^{d}(l_i) + H_i^{d}(l_i)\big) + l_i\Big] \qquad (2\text{-}1)$$

where the warping functions $H_{i-1}^{t}$ and $H_i^{t}$ are parameterized by the interpolated matrices

$$\mathbf{H}_{i-1}^{t} = \frac{T-t}{T}\,\mathbf{I} + \frac{t}{T}\,\mathbf{H}_{i-1}^{-1}, \qquad \mathbf{H}_i^{t} = \frac{T-t}{T}\,\mathbf{I} + \frac{t}{T}\,\mathbf{H}_i \qquad (2\text{-}2)$$

Here $\mathbf{H}_{i-1}^{-1}$ is the inverse matrix of $\mathbf{H}_{i-1}$, and $\mathbf{I}$ is the $3 \times 3$ identity matrix. $b_i$ is called the blur function for frame $i$. In the discrete model, $T$ becomes the sampling rate in $[t_i, t_{i+1}]$ and is set to 20 in our implementation, the same as in the paper, while $\tau$ becomes the number of samples that fall into the duty cycle. (A code sketch of this model is given together with the duty-cycle search at the end of Section 2.2.2.)

2.2 Blur model estimation

Two parameters must be estimated in order to obtain the blur function: the homography $\mathbf{H}_i$ and the duty cycle $\tau$.

2.2.1 Homography calculation

To estimate the homography $\mathbf{H}_i$, the authors use the KLT approach [7] to track features and compute initial homographies, which they then refine using Lucas-Kanade registration [8] between frames. In our implementation, we use OpenCV functions to track features and then compute the homographies by linear optimization, without the refinement step.

2.2.1.1 Feature extraction

The difficulty in image feature extraction is ensuring both robustness and discrimination. Robustness means that the features are not influenced by noise, lighting, or geometric transformations; discrimination means that different features can be well separated. SIFT [9] and Speeded-Up Robust Features (SURF) [10] are effective feature extractors, but since the scale, rotation, and illumination of adjacent frames change little, we use the Harris corner detector [11] for simplicity. Image points fall into three categories: flat, edge, and corner. The basic idea of Harris corner detection is that corners are easy to find by observing the image through a small window: the brightness inside a window centered on a corner changes significantly when the window is moved in any direction, as shown in Fig. 2.2.

Fig. 2.2 The basic idea of Harris corner detection: (a) flat region, (b) edge, (c) corner

2.2.1.2 Feature tracking

After a set of feature points is obtained, their corresponding positions in the next frame must be found in order to estimate the motion between the two frames. We use the pyramid Lucas-Kanade (LK) method [12] for tracking. The LK algorithm rests on three assumptions: (1) constant brightness; (2) temporal continuity, i.e. small inter-frame movement; (3) spatial consistency. Under these assumptions the tracked motion is small and consistent, and the LK algorithm achieves good results. However, large and inconsistent motion is common in shaky videos, and the plain LK algorithm tracks poorly in that setting. An image pyramid decomposes large, incoherent motion into small, coherent movement, so pyramid-based LK is effective at tracking larger and faster motion.
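Our extraction and tracking step maps directly onto standard OpenCV calls. The following is a simplified sketch of what we do, not our exact code; parameter values such as `maxCorners=100` and `maxLevel=3` are illustrative choices consistent with the description above.

```python
import cv2

def track_features(prev_gray, next_gray, min_dist):
    """Harris corners in the previous frame, tracked into the next frame
    with pyramid Lucas-Kanade; returns the matched point arrays (src, dst).
    min_dist is the minimum corner spacing, e.g. min(h, w) / 15 as in
    equation (2-3) below."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.01, minDistance=min_dist,
                                  useHarrisDetector=True, k=0.04)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                 pts, None,
                                                 winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1          # keep only successfully tracked points
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```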
In order to make tracking more robust to moving objects, we adopt two measures:
(a) The minimum distance between feature points is set relatively large, so that the feature points are distributed uniformly over the whole image. Assuming 100 feature points are to be selected, the minimum distance between feature points can be set to

$$\text{min\_dist} = \min\left(\frac{h}{15}, \frac{w}{15}\right) \qquad (2\text{-}3)$$

where $h$ and $w$ denote the height and width of the image, respectively.
(b) Feature points are re-selected every $K$ frames to reduce the accumulated tracking error. $K$ is set to 10 in our implementation.

The result of feature extraction in the first frame of the sequence "books" is shown in Fig. 2.3; the red points are the extracted features.

Fig. 2.3 The extracted features in the first frame of sequence "books"

2.2.1.3 Outlier rejection

Before using the tracked feature pairs to compute homographies, we perform outlier rejection to eliminate feature pairs on foreground objects, which improves the accuracy of homography estimation. Outlier rejection methods can be local or global; in our implementation we use global outlier rejection by RANSAC. The basic idea is to use RANSAC to estimate a translational model over all feature pairs and to retain only the feature pairs that agree with the estimated model up to a threshold distance (10 pixels in our implementation).

One problem arises when estimating the translational model. The number of feature pairs is relatively large, but only one feature pair is sampled per RANSAC iteration, and not every pair can be sampled within the limited number of iterations. This can lead to wrong outlier rejection, where noise points or foreground points are accepted as inliers; see Fig. 2.4. To solve this problem, we first check whether the number of inliers exceeds a threshold (one third of the total number of features); if not, we perform another round of outlier rejection on the remaining "outliers". A correct rejection result is shown in Fig. 2.5. (A sketch of this two-stage scheme is given at the end of Section 2.2.1.4.)

Fig. 2.4 Wrong outlier rejection: the points with green lines are accepted as inliers.
Fig. 2.5 Correct outlier rejection: the points with green lines are accepted as inliers.

2.2.1.4 Homography estimation by linear optimization

A homography matrix between two frames can be computed from as few as four feature pairs, but the number of feature pairs remaining after outlier rejection is much larger, so we use linear optimization to compute the homographies. Let $\mathbf{H}_{ij}$ denote the homography from frame $i$ to frame $j$, and let $(p_k^i, p_k^j)$, $k = 1, 2, \dots, N$, be the $k$-th feature pair between frame $i$ and frame $j$. Theoretically, we have

$$p_k^j = \mathbf{H}_{ij}\, p_k^i \qquad (2\text{-}4)$$

for all $k$. However, $\mathbf{H}_{ij}$ cannot reflect the motion between frames exactly. The error is defined as

$$error = \sum_{k=1}^{N} \left\| p_k^j - \mathbf{H}_{ij}\, p_k^i \right\| \qquad (2\text{-}5)$$

We treat equation (2-5) as the objective function; the $\mathbf{H}_{ij}$ that minimizes it is the homography we want. The optimization problem can be solved by a linear programming solver after introducing slack variables $e_k$:

$$-e_k \le p_k^j - \mathbf{H}_{ij}\, p_k^i \le e_k, \qquad e_k \ge 0 \qquad (2\text{-}6)$$

We use the freely available COIN CLP simplex solver [13] to solve the problem.
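The two-stage translational RANSAC of Section 2.2.1.3 can be sketched as follows. This is our illustration under the parameters stated above (10-pixel threshold, one-third inlier test); the function names and the iteration count of 500 are our own choices.

```python
import numpy as np

def translational_ransac(src, dst, thresh=10.0, iters=500, seed=0):
    """RANSAC with a pure translation model: a single feature pair fully
    determines a hypothesis, so each iteration samples one pair."""
    rng = np.random.default_rng(seed)
    d = dst - src                               # per-pair displacement
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        t = d[rng.integers(len(d))]             # one-pair hypothesis
        inliers = np.linalg.norm(d - t, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return best

def robust_inliers(src, dst, thresh=10.0):
    """Two-stage rejection described above: if the first pass keeps fewer
    than a third of the pairs, rerun RANSAC on the rejected pairs and keep
    whichever consensus set is larger."""
    inl = translational_ransac(src, dst, thresh)
    if inl.sum() < len(src) / 3.0:
        rest = np.flatnonzero(~inl)
        inl2 = translational_ransac(src[rest], dst[rest], thresh)
        if inl2.sum() > inl.sum():
            keep = np.zeros(len(src), dtype=bool)
            keep[rest[inl2]] = True
            return keep
    return inl
```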
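For the linear program of equations (2-5)/(2-6), we used the COIN CLP solver; for a self-contained sketch, the same program is expressed below with SciPy's `linprog`. We fix $h_{33} = 1$ and use the standard DLT linearization, with one slack variable per residual coordinate rather than one per pair, which is a mild reformulation of equation (2-6).

```python
import numpy as np
from scipy.optimize import linprog

def fit_homography_l1(src, dst):
    """Least-absolute-deviation homography fit, Eqs. (2-5)/(2-6), as an LP.
    src, dst: (N, 2) arrays of inlier feature pairs. The DLT linearization
    u*(h31*x + h32*y + 1) = h11*x + h12*y + h13 makes the residuals linear
    in the 8 unknowns."""
    N = len(src)
    A = np.zeros((2 * N, 8))
    b = np.zeros(2 * N)
    for k, ((x, y), (u, v)) in enumerate(zip(src, dst)):
        A[2 * k] = [x, y, 1, 0, 0, 0, -u * x, -u * y]
        b[2 * k] = u
        A[2 * k + 1] = [0, 0, 0, x, y, 1, -v * x, -v * y]
        b[2 * k + 1] = v
    m = 2 * N
    # variables [h (8), e (m)]: minimize sum(e)  s.t.  -e <= A h - b <= e
    c = np.concatenate([np.zeros(8), np.ones(m)])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * 8 + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return np.append(res.x[:8], 1.0).reshape(3, 3)
```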
In the deblurring approach, we need to estimate homographies not only between adjacent frames, but also between any two frames in a local temporal window $W_i = [i - M, i + M]$, where $M$ is set to 5 frames. In the paper, the authors compute these homographies in the same way as those between adjacent frames. In our implementation, we obtain them by multiplying the adjacent-frame homographies:

$$\mathbf{H}_{ij} = \mathbf{H}_{j-1,j}\,\mathbf{H}_{j-2,j-1} \cdots \mathbf{H}_{i,i+1}, \quad i < j \qquad (2\text{-}7)$$

$$\mathbf{H}_{ij} = \mathbf{H}_{j+1,j}\,\mathbf{H}_{j+2,j+1} \cdots \mathbf{H}_{i,i-1}, \quad i > j \qquad (2\text{-}8)$$

where $\mathbf{H}_{k+1,k} = \mathbf{H}_{k,k+1}^{-1}$. This method also yields small errors in the homographies, because the adjacent-frame homographies are accurate: we compute them by optimization.

2.2.2 Duty cycle calculation

To compute $\tau$, the authors first select a set of frame pairs in which the two frames of each pair differ strongly in the luckiness measurement, so that the accuracy of the blur functions can be tested effectively. They then find $\tau$ by minimizing an energy function of $\tau$. We first introduce the luckiness measurement and then the method for finding $\tau$.

2.2.2.1 Luckiness measurement

The luckiness of a pixel describes the absolute displacement of the pixel among adjacent frames. For a pixel $x$ in frame $f_i$, its luckiness is defined as

$$\lambda_i(x) = \exp\left(-\frac{\left\|H_{i-1}^{-1}(x) - x\right\|^2 + \left\|H_i(x) - x\right\|^2}{2\,\sigma_l^2}\right) \qquad (2\text{-}9)$$

where $H(\cdot)$ maps a pixel position to another pixel position according to the homography $\mathbf{H}$, and $\sigma_l$ is a constant set to 20 pixels. When the motion of $x$ is small, $H_{i-1}^{-1}$ and $H_i$ are close to the identity, so $\lambda_i(x)$ is close to 1, indicating that the image patch centered at $x$ is likely to be sharp. Otherwise $\lambda_i(x)$ is small, indicating that the patch is likely to contain large motion blur. The luckiness $\lambda_i$ of a whole frame $f_i$ is simply the average of $\lambda_i(x)$ over all pixels of $f_i$. (A sketch of this computation is given at the end of this section.)

We compute the luckiness of each frame exactly as in equation (2-9). However, the authors do not mention how to handle the first and last frames, whose luckiness cannot be obtained from equation (2-9): there is no $\mathbf{H}_{i-1}$ for the first frame and no $\mathbf{H}_i$ for the last frame. In our implementation, we set $\mathbf{H}_{i-1} = \mathbf{H}_i$ for the first frame and $\mathbf{H}_i = \mathbf{H}_{i-1}$ for the last frame.

The frame luckiness values of the sequence "books" are shown in Fig. 2.6: our result in (a) and the paper's result in (b) are nearly identical.

Fig. 2.6 Frame luckiness values of sequence "books": (a) the luckiness of our implementation, (b) the luckiness in the paper

2.2.2.2 Computing the optimal $\tau$

Let $(f_i^k, f_j^k)$, $k = 1, \dots, K$, be $K$ pairs of frames with $j \in W_i$, whose frame luckiness difference $\lambda_i - \lambda_j$ is larger than a threshold ($0.6\,\lambda_{\max}$ as set by the authors, where $\lambda_{\max} = \max_i \lambda_i$). The optimal $\tau$ can then be obtained by minimizing

$$E(\tau) = \sum_{k=1}^{K} \left\| b_j^k\!\left(H_{ij}^k(f_i^k)\right) - f_j^k \right\|^2 \qquad (2\text{-}10)$$

Here $H_{ij}^k(f_i^k)$ is the aligned sharp frame, and $b_j^k(H_{ij}^k(f_i^k))$ is the synthetic blurred frame generated from the sharp frame $f_i^k$. If the blur functions $b_j^k$ are accurate, the synthetic blurred frame is close to the real blurred frame, so the value of equation (2-10) is small. From Fig. 2.1 we can see that $\tau$ can take only a limited set of integer values from 1 to $T/2$, so the authors minimize equation (2-10) by brute-force search. For simplicity, we use linear search, which is a simple kind of brute-force search.

Fig. 2.7 shows some results of this process on the sequence "books": (a) is the sharp frame 6, (b) is frame 6 warped to frame 2 and blurred, (c) is the real blurry frame 2, and (d) is the difference between the synthetic blurred frame and the real blurry frame.

Fig. 2.7 Results of the blur function on sequence "books": (a) sharp frame 6, (b) blurred warped frame 6, (c) blurry frame 2, (d) difference between (b) and (c)
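The luckiness map of equation (2-9) reduces to warping a pixel grid twice and measuring the displacements. A minimal NumPy/OpenCV sketch follows; `pixel_luckiness` is a hypothetical helper name and the vectorized layout is our choice.

```python
import numpy as np
import cv2

def pixel_luckiness(H_prev, H_cur, shape, sigma_l=20.0):
    """Per-pixel luckiness (Eq. 2-9): a pixel that barely moves under the
    neighbouring homographies gets a value near 1 (likely sharp)."""
    h, w = shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float64)
    d2 = np.zeros(h * w)
    for H in (np.linalg.inv(H_prev), H_cur):    # H^{-1}_{i-1} and H_i
        moved = cv2.perspectiveTransform(pts, np.asarray(H, dtype=np.float64))
        d2 += ((moved - pts) ** 2).sum(axis=(1, 2))
    return np.exp(-d2 / (2.0 * sigma_l ** 2)).reshape(h, w)

# Frame luckiness is the average pixel luckiness, e.g.
# lam_i = pixel_luckiness(H_prev, H_cur, frame.shape[:2]).mean()
```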
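The linear search for $\tau$ can then be sketched as below. `synthesize_blur` is our rendering of the blur model in equations (2-1)/(2-2), and `estimate_duty_cycle` evaluates equation (2-10) over the selected frame pairs; the `pairs` packaging is an assumption of this sketch.

```python
import numpy as np
import cv2

def synthesize_blur(latent, H_prev, H_cur, tau, T=20):
    """Blur model of Eq. (2-1): average the sharp frame with its warps under
    the interpolated homographies of Eq. (2-2)."""
    img = latent.astype(np.float64)
    h, w = img.shape[:2]
    I, H_prev_inv = np.eye(3), np.linalg.inv(H_prev)
    acc = img.copy()
    for d in range(1, tau + 1):
        a = d / float(T)
        for H in ((1 - a) * I + a * H_prev_inv,   # backward half of the cycle
                  (1 - a) * I + a * H_cur):       # forward half of the cycle
            acc += cv2.warpPerspective(img, H, (w, h))
    return acc / (2 * tau + 1)

def estimate_duty_cycle(pairs, T=20):
    """Linear search for the integer tau minimizing Eq. (2-10).
    pairs: list of (sharp_aligned, blurry_real, H_prev, H_cur) tuples, where
    sharp_aligned = H_ij(f_i) is already warped into the blurry frame."""
    best_tau, best_err = 1, np.inf
    for tau in range(1, T // 2 + 1):
        err = sum(np.sum((synthesize_blur(s, Hp, Hc, tau, T) - b) ** 2)
                  for s, b, Hp, Hc in pairs)
        if err < best_err:
            best_tau, best_err = tau, err
    return best_tau
```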
3. Single frame deblurring

Once the blur function $b_i$ for frame $f_i$ has been determined, we can use sharp patches in nearby frames to restore the latent frame $l_i$.

3.1 Patch deblurring

Let $f_{i,x}$ be an $n \times n$ image patch in frame $f_i$ centered at a pixel $x$ ($n = 11$). $f_{i,x}$ can be deblurred by computing a weighted average of sharp patches from nearby frames $f_j$ in the temporal window $W_i$:

$$l_{i,x} = \frac{1}{Z} \sum_{(j,y) \in \Omega_{i,x}} w(i, x, j, y)\, f_{j,y} \qquad (3\text{-}1)$$

where $l_{i,x}$ is the deblurred patch for $f_{i,x}$, and $f_{j,y}$ is a patch in the warped frame $H_{ji}(f_j)$ centered at a pixel $y$. The weight $w(i, x, j, y)$ is defined as

$$w(i, x, j, y) = \exp\left(-\frac{\left\| b_{j,y} - f_{i,x} \right\|^2}{2\,\sigma_w^2}\right) \qquad (3\text{-}2)$$

where $b_{j,y}$ is the patch centered at $y$ in the blurred warped frame $b_i(H_{ji}(f_j))$, $\sigma_w$ is a constant set to 0.1 in the paper, and $Z$ is the normalization factor. The key issue is how to find the $N$ best-matching patches in the warped nearby frames $H_{ji}(f_j)$.

3.1.1 Patch searching

If a patch $f_{j,y}$ is the best-matching one, then its blurred version $b_{j,y}$ should be very close to the real blurry patch $f_{i,x}$. Therefore, we just need to find the patch that solves

$$\arg\min_{j,y} \left\| b_{j,y} - f_{i,x} \right\|^2 \qquad (3\text{-}3)$$

To find a matching patch $f_{j,y}$ in $H_{ji}(f_j)$, the authors search an $m \times m$ window centered at the pixel $x$. Ideally, if $H_{ji}$ were accurate enough, the search range $m$ could simply be set to one. In practice, however, due to parallax and object motion, the real motion among frames is generally more complicated than a single homography. Therefore they set $m = 21$, and we use the same value in our implementation. In the paper, only the single best-matching patch is used to restore a latent patch, so equation (3-1) reduces to:

$$l_{i,x} = f_{j,y} \qquad (3\text{-}4)$$

We found that one patch is not enough to obtain smooth deblurring results, so we search for the $N = 3$ best-matching patches and use equation (3-1) to restore a latent patch. In our implementation, we compute $\| b_{j,y} - f_{i,x} \|^2$ pixel by pixel for each frame in the temporal window $W_i$ and find the 3 best patch candidates in each frame; the 3 best-matching patches are then selected from the 33 candidates. The number of search loops is $11 \times 11 \times 11 = 1331$, which is acceptable and not time-consuming.

3.1.2 Latent patch restoration

After obtaining the 3 best-matching patches for the current blurry patch $f_{i,x}$, we use them to restore the latent patch $l_{i,x}$ according to equations (3-1) and (3-2). We found that if the parameter $\sigma_w$ in the weight function is set to 0.1 as in the paper, the weight becomes zero, because the difference $\| b_{j,y} - f_{i,x} \|^2$ is very large (75.43, for example) compared to $\sigma_w$. For the weight to remain non-negligible, the difference $\| b_{j,y} - f_{i,x} \|^2$ would have to be smaller than about 0.2 when $\sigma_w = 0.1$; otherwise the weight becomes vanishingly small or underflows to zero. However, the difference cannot be that small: it would require an average error per pixel of $0.2 / (11 \times 11) \approx 0.00165$, which seems impossible given the errors in the homographies and the blur function. We therefore suspect that the authors reported a wrong value of $\sigma_w$ by mistake. In our implementation, we set $\sigma_w = 10$ and achieve good results. (A sketch of the search and restoration steps is given at the end of this section.)

3.1.3 Patch deblurring results

We show some results of our implementation of this part. Fig. 3.1(a) is a region in the blurry input frame 4 of sequence "books", Fig. 3.1(b) is the result of using one patch to restore each latent patch, and Fig. 3.1(c) is the result of using three patches. The result in (c) is slightly better than that in (b), especially in the areas marked by the red rectangles; the difference is more obvious under magnification.

Fig. 3.1 Deblurring results using different numbers of best patches: (a) a region in blurry frame 4, (b) restored patch using one patch, (c) restored patch using three patches
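The following is a minimal sketch of the search (3-3) plus the weighted restoration (3-1)/(3-2), assuming the warped and re-blurred neighbour frames have already been computed. `deblur_patch` and its argument layout are our own, $\sigma_w = 10$ is the value we found workable, and the center pixel is assumed to lie far enough from the image border.

```python
import numpy as np

def deblur_patch(f_i, sharp_warps, blurred_warps, x, n=11, m=21,
                 n_best=3, sigma_w=10.0):
    """Restore the n x n latent patch centred at x = (row, col), Eq. (3-1).
    sharp_warps[j]   = H_{ji}(f_j)      : neighbour frame warped to frame i
    blurred_warps[j] = b_i(H_{ji}(f_j)) : the same, re-blurred for Eq. (3-3)"""
    r, c = x
    h = n // 2
    target = f_i[r - h:r + h + 1, c - h:c + h + 1]  # real blurry patch f_{i,x}
    cands = []                                      # (cost, sharp candidate)
    for sw, bw in zip(sharp_warps, blurred_warps):
        for dr in range(-(m // 2), m // 2 + 1):
            for dc in range(-(m // 2), m // 2 + 1):
                rr, cc = r + dr, c + dc
                b_patch = bw[rr - h:rr + h + 1, cc - h:cc + h + 1]
                if b_patch.shape != target.shape:   # ran off the border
                    continue
                cost = np.sum((b_patch - target) ** 2)
                cands.append((cost, sw[rr - h:rr + h + 1, cc - h:cc + h + 1]))
    cands.sort(key=lambda t: t[0])
    best = cands[:n_best]
    w = np.array([np.exp(-cost / (2 * sigma_w ** 2)) for cost, _ in best])
    patches = np.stack([p for _, p in best])
    return np.tensordot(w / w.sum(), patches, axes=1)   # weighted average
```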
3.2 Frame deblurring

Although we can restore the latent frame $l_i$ patch by patch using equation (3-1), this approach may misalign object structures in $l_i$, since the pixels of $l_i$ are determined individually without enforcing spatial coherence (see the region marked by the green rectangle in Fig. 3.1(c)). To overcome this problem, the authors adapt a patch-based texture synthesis approach [6] to merge the contributions of overlapping deblurred patches in $l_i$. Let $l_i(x)$ be the value of $l_i$ at a pixel $x$. They determine $l_i(x)$ as

$$l_i(x) = \frac{1}{Z} \sum_{x'} l_{i,x'}(x) = \frac{1}{Z} \sum_{x'} \sum_{(j,y) \in \Omega_{i,x'}} w(i, x', j, y)\, f_{j,y}(x) \qquad (3\text{-}5)$$

where the outer sum runs over all pixels $x'$ whose deblurred patch covers $x$; we refer the reader to the original paper for the precise meaning of each symbol. In essence, a deblurred patch is first computed for each pixel of the current frame $f_i$; a pixel $x$ is then covered by $n^2$ deblurred patches, whose values at $x$ are averaged with the weights in equation (3-5). To speed up the whole process, the authors perform deblurring on a sparse regular grid of pixels instead of at every pixel, as done in [6]. Since the pixel-by-pixel deblurring process costs too much time (about 20 minutes per frame), we have not integrated this part into our system; the acceleration issues need to be resolved before we can integrate it and obtain better deblurring results. (A minimal sketch of this merge step is given at the end of Section 4.)

3.3 Handling moving objects

The deblurring method can handle slightly moving objects thanks to the local search for matching patches, and it leaves objects with large motion untouched. Our implementation, however, fails to leave objects with large motion untouched. Fig. 3.2 shows the blurry frame 28 of sequence "bicycle" and our deblurring result: the cyclist is clearly not kept untouched in our result, and this region is not successfully deblurred either. One possible reason is that we have not integrated the frame deblurring part; another is a bug in our implementation that we have not yet located.

Fig. 3.2 Deblurring result on a frame with a moving object: (a) the blurry frame 28, (b) our deblurred result

4. Improved deblurring using luckiness

To further improve the sharpness of deblurred frames, the authors incorporate luckiness into the process in three ways:
(1) They use the luckiness values to determine the processing order of frames. The frame with the highest luckiness value is deblurred first, and most of its pixels remain unchanged after deblurring. As the luckiness values of frames decrease, more pixels are updated with sharper pixels from already-processed frames.
(2) They revise the weight function in equation (3-2):

$$w'(i, x, j, y) = w(i, x, j, y) \cdot \exp\left(-\frac{\left\| \mathbf{1} - \lambda_{j,y} \right\|^2}{2\,\sigma_\lambda^2}\right) \qquad (4\text{-}1)$$

where $\lambda_{j,y}$ is the $n \times n$ patch centered at pixel $y$ in a luckiness map $\lambda_j$, and $\sigma_\lambda$ is a constant. A luckiness map $\lambda_j$ holds the luckiness values of the pixels in the warped frame $H_{ji}(f_j)$.
(3) They also introduce a luckiness term when searching for the best patches, so that equation (3-3) becomes

$$\arg\min_{j,y} \left\{ \left\| b_{j,y} - f_{i,x} \right\|^2 + \gamma \left\| \mathbf{1} - \lambda_{j,y} \right\|^2 \right\} \qquad (4\text{-}2)$$

where $\gamma$ is a weight that balances the patch-match term against the luckiness term. The value of $\gamma$ is not mentioned in the paper; we set $\gamma = 1$. The value of $\sigma_\lambda$ is set to 0.01 in the paper, which is too small in our system and brings no improvement to the results. We therefore tried several values and found that $\sigma_\lambda = 10$ is suitable for our system.
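Returning to the frame-level merge of Section 3.2, the accumulation in equation (3-5) can be sketched as below: each deblurred patch is splatted into the frame with its weight, and every pixel is finally divided by its accumulated weight. The function name and data layout are ours, and for simplicity each patch carries a single scalar weight rather than per-source weights.

```python
import numpy as np

def merge_overlapping_patches(shape, patches, centers, weights, n=11):
    """Weighted accumulation of overlapping deblurred patches (Eq. 3-5 idea):
    every pixel ends up as the weighted average of all estimates covering it."""
    acc = np.zeros(shape, dtype=np.float64)   # weighted sum of estimates
    Z = np.zeros(shape, dtype=np.float64)     # accumulated weights
    h = n // 2
    for patch, (r, c), w in zip(patches, centers, weights):
        acc[r - h:r + h + 1, c - h:c + h + 1] += w * patch
        Z[r - h:r + h + 1, c - h:c + h + 1] += w
    return acc / np.maximum(Z, 1e-12)         # avoid division by zero
```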
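The luckiness-augmented score and weight of equations (4-1)/(4-2) amount to small additions to the patch search; the sketch below uses the values we settled on ($\gamma = 1$, $\sigma_\lambda = 10$), not the paper's.

```python
import numpy as np

def lucky_patch_cost(b_patch, f_patch, lam_patch, gamma=1.0):
    """Eq. (4-2): data term plus a penalty for un-lucky (blurry) sources."""
    return (np.sum((b_patch - f_patch) ** 2)
            + gamma * np.sum((1.0 - lam_patch) ** 2))

def lucky_weight(w, lam_patch, sigma_lam=10.0):
    """Eq. (4-1): scale the original weight by the source patch luckiness."""
    return w * np.exp(-np.sum((1.0 - lam_patch) ** 2) / (2 * sigma_lam ** 2))
```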
The deblurred results before and after using luckiness on frame 4 of sequence "books" are shown in Fig. 4.1.

Fig. 4.1 Deblurring results before and after using luckiness: (a) a region in blurry frame 4, (b) without luckiness, (c) with luckiness

5. Conclusion and future work

In general, we have implemented the basic parts of the paper, including blur function estimation, patch deblurring, and improved deblurring using luckiness. Our system runs at about 30 seconds per frame and obtains relatively good deblurring results (see Fig. 4.1(a) and (c)). Our deblurring results are worse than the paper's, since we have not implemented all of their work. Another reason may be that we handle some issues differently from the paper, which can affect the deblurring results. The differences between our implementation and the work in the paper are the following:
(1) We compute the homographies between adjacent frames by linear programming, while the paper does not mention how to compute them specifically.
(2) We calculate the homographies between non-adjacent frames by multiplication, instead of the feature tracking followed by refinement used in the paper.
(3) We solve the problem in equation (3-3) by linear search, while the authors' method is not mentioned.
(4) The values of some parameters in our implementation differ from those in the paper, namely $\sigma_w$, $N$, $\gamma$, and $\sigma_\lambda$.
The first two items may affect the accuracy of the homographies. However, the homographies are used in the luckiness measurement and we obtain nearly the same luckiness result as the paper (Fig. 2.6), which indicates that the difference in homographies is small. The third item is related only to speed and has little influence on quality, while the last item strongly affects the deblurring results: we cannot obtain good results using the values in the paper.

Due to the limited time, we could not implement the work completely, and some questions remain open. Next, we plan to implement the frame deblurring part and then investigate why our parameter values differ from those in the paper. We also need to improve our system to handle fast-moving objects. After that, the remaining parts of the paper may be implemented.

References
[1] Liu F., Gleicher M., Wang J., et al. "Subspace video stabilization". ACM Transactions on Graphics (TOG), 30(1): 4, 2011.
[2] Grundmann M., Kwatra V., and Essa I. "Auto-directed video stabilization with robust L1 optimal camera paths". In CVPR, 2011.
[3] Matsushita Y., et al. "Full-frame video stabilization with motion inpainting". IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7): 1150-1163, 2006.
[4] Barnes C., et al. "The generalized PatchMatch correspondence algorithm". In ECCV, pp. 29-43, 2010.
[5] HaCohen Y., et al. "Non-rigid dense correspondence with applications for image enhancement". ACM Transactions on Graphics (TOG), 30(4): 70, 2011.
[6] Kwatra V., et al. "Texture optimization for example-based synthesis". ACM Transactions on Graphics (TOG), 24(3), 2005.
[7] Shi J. and Tomasi C. "Good features to track". In CVPR, 1994.
[8] Baker S. and Matthews I. "Lucas-Kanade 20 years on: A unifying framework". International Journal of Computer Vision, 56(3): 221-255, 2004.
[9] Lowe D. "Distinctive image features from scale-invariant keypoints". International Journal of Computer Vision, 60(2): 91-110, 2004.
[10] Bay H., Tuytelaars T., and Van Gool L. "SURF: Speeded Up Robust Features". In ECCV, pp. 404-417, 2006.
[11] Harris C. and Stephens M. "A combined corner and edge detector". In Proceedings of the Fourth Alvey Vision Conference, pp. 147-151, 1988.
[12] Lucas B. D. and Kanade T. "An iterative image registration technique with an application to stereo vision". In Proc. Int. Joint Conf. on Artificial Intelligence, pp. 674-679, 1981.
[13] COIN CLP simplex solver: http://www.coin-or.org/Clp/userguide/index.html