A High-Quality Video Denoising Algorithm based on Reliable Motion Estimation Ce Liu (MSR) William T. Freeman (MSR, MIT) Initial Introduction • To remove real, structured noise introduced by low-end camcorders and digital cameras. • Structured noise: The noise in real cameras can have strong spatial correlations. • Sparsity in Image: For any image patch, there will be a similar ones in other locations of the image. This property is used in NLM method used for denoising. • Sparsity in video: A new frame can be well predicted from previous frames. (Temporally consistent or temporal coherence) Chicken and Egg Problem… • Motion Estimation techniques (e.g. Block matching) are mainly based on temporal coherence across the frames. • Structured noise can mislead the search for similar patches and then breaks the temporal coherence criterion in video de-noising. • Motion should be estimated from the underlying signal after denoising, and denoising relies on the temporal correspondence from motion estimation. • Solution: Use of robust optical flow with spatial regularization to establish reliable temporal correspondence despite noise. A temporally Coherent Video Denoising Framework • For every patch in a video, find a set of supporting patches from this frame and temporal adjacent frames that are similar to this patch. • Find spatially neighboring and temporally corresponding pixels. Algorithms Approximate KNN Optical flow Approximate K-nearest Neighbors (AKNN) for a single frame • • • • • • {I1, I2 , … , IT} – input noisy image. z = (x,y,t) – index to a space time volume. P(z) or P(x,y,t) – a patch at location z. N (q) P (q ) ; a set of approximate K-nearest neighbors for q = (x,y). vi = qi – q ; offset of the found patch from the current patch. Take a priority queue and store these K-nearest neighbors in it such that: K i i 1 D( P (q), P (q i )) D (P (q), P (q j )), 1 i j K Where D(·,·) is the sum of square distance (SSD) over two patches, defined as: D ( P (q ), P (q )) ( I (q u ) I (q u )) i u [ s, s ] [ s, s ] 2 i Three phases of AKNN • Initialization: The K-nearest neighbors are initialized by randomization v i sn i where ni is a 2d standard normal random variable, and controls the radius (=w/3 initially). • Propagation: s • – The idea is to improve approximate K-nearest neighbor set based on the fact that tothe neighboring pixels tend have similar AKNN structures (offsets). – In the scanline order, we attempt to improve AKNN {vi(x,y)} using neighbor {vi(x−1,y) and {vi(x,y,−1)}. In the reverse scanline order, we attempt to improve {vi(x,y)} using neighbor {vi(x+1,y)} and {vi(x,y+1)}. Random Search: After the propagation step, we allow every patch to randomly match other patches in the image for M times using the following mechanism vi = σsαini , i = 1,··· ,M where ni is a standard 2d normal random variable, α = 1 and M = min(log2σs, K). Non-local means with temporal coherence • Temporal coherence is vital for denoising. • Algorithm that produces more temporal coherent results is preferred. • Since, for real sequences, it can be difficult to distinguish high-intensity structured noise from image signal. Therefore, it is important to establish temporal correspondence between adjacent frames and require corresponding pixels to be similar. • For this optical flow algorithm is used. • Since, optical flow is not invertible, we estimate forward flow wf (z) = [vx, vy, 1] from frame It to It+1, and backward flow wb(z)=[vx,vy,−1] from frame It to It−1, in order to establish bidirectional correspondence. Optical flow • Method to estimate the motion between two image frames taken at time t and t+Δt at every pixel position. • Let a pixel I(x,y,t) is moved in next frame to I(x+Δx,y+Δy,t+Δt), then I(x,y,t) = I(x+Δx,y+Δy,t+Δt) • By Taylor Series expansion I( x x, y y, t t ) I( x, y, t ) I x x I y y I t t H .O .T . which gives: Ixu+Iyv+It=0 • Horn-Schunck algorithm considers a global constraint of smoothness for aperture problem and gives an energy functional: E [( I 2 u Iyv It ) ( u x 2 2 2 v )] dxdy Optical flow (cont.) • This functional can be minimized by Euler-Lagrange equations as: L u L v • Which gives – L x ux L x v x L 0 y uy L 0 y v y Ix ( Ix u Iyv It ) u 0 2 Iy ( Ix u Iyv It ) v 0 2 where Δ is laplacian operator, Δu(x,y) = u(x,y) – u(x,y). • The above equations are linear in u and v, which can be iteratively solved as: I (I u I u I ) k u k 1 u k x k x y 2 x v v k 2 y Iy ( Ix v Iy v It ) k k 1 t I I 2 k Ix Iy 2 2 2 Non-local means with temporal coherence (cont.) • So, pixel z corresponds to z + wf (z) in next frame and to z + wb(z) in previous frame across |H| frames. • Modified AKNN’s , which forms the supporting patches for P(z) is Ni = {P(zij)}, j=1,…,K. for ith frame. where, zij = (xij, yij,,i) Non-local means with temporal coherence (cont.) • The non local means estimate for pixel z can be written as: I( z) 1 Z tH D w (( P ( z), P (z ij )) I( z ij ) exp 2 2 t j 1 K |i t| i t H where Z is the normalization factor: tH Z i t H D w (( P (z), P ( z ij )) exp 2 2 t j 1 K |i t| and Dw (·,·) is a weighted SSD function, summed over spatial, but not temporal, offsets: u 2 D w (( P (z), P ( z ij )) (P (z1 u) P ( z 2 u)) exp 2 Z ' u[ s,s ] [ s,s ] 0 2 p 1 2 where σp = s/2 , and Z’ is a normalization constant, γ = 0.9 to control temporal decay and σt is related to the noise level in the video sequence. Non-local means with temporal coherence (cont.) • For a fixed number of iterations, the complexity of this denoising algorithm for a frame is O(NHKlogK), where N is the number of pixels per frame, H is the temporal window size, and K is the number of approximate K-nearest neighbors. • This is a significant reduction compared to O(N2) of the original NLM algorithm, since K << N (typically K = 10 and N = 640×480). Noise estimation for adaptive noise removal • Set σt small when noise level is low to avoid over smoothing, and when the noise level is high, set σt large to smooth out noise. • Proposed noise model based on optical flow: – Ideally, the difference between the warped frame and It should be the difference of independent noise, but due to unreliable motion estimation an outlier in noise estimation is used: f I t ( z ) I t 1 ( z w ( z )) z n z (1 z ) u z here, nz is a pixel-wise Gaussian random variable: E(nz)=0, E(nz2)= σn, and uz ∼ U[−1,1] is a pixel-wise uniform random variable. These two random variables are balanced by weight αz.Let Jt(z) = It(z) − It+1(z + wf (z)). Noise estimation for adaptive noise removal (cont.) – We use an expectation-maximization (EM) algorithm to estimate parameters: 1. Initialize σn = 20. Loop between step 2 and 3 until convergence. 2. (E-Step) Evaluate J ( z) exp t 2 2 n z J ( z ) 1 exp t 2 2 n 2 n 2 1. (M-step) Estimate z J t ( z) z 2 n z z – This estimation is performed for each of R, G and B channels independently. – The relationship between the noise level σn and scaling parameter σt in Eqn. (5) depends on K and H. – Empirically, it has been found that when K = 11 and H = 5, σt = σn generates visually pleasing results. Experimental Results (given by Authors) • Implementation Details: – – – – – Used Patch size – 7X7 K = 11 H=5 4 Iterations of random K-nearest neighbor matching for each frame. The EM algorithm for noise estimation converges in about 10 iterations. • Tennis sequence: – added artificially generated AWGN (σ = 20). – Compared with BM3D which gave PSNR 30.22db in first run and 31.20db in second run. – This algorithm gave PSNR 30.21db. • Real video sequence room captured by a Canon S90. • Denoising quality depends on the quality of motion estimation. Experimental Results (given by Authors) • Evaluation of motion estimation: – Applied block matching and optical flow methods on the video. – Block matching failed due to structured noise, but optical flow gave good smooth results. • For room sequence, two parameters are used σ = 20 and σ = 40. The adaptive noise filtering method outperforms BM3D. Experimental Results (given by Authors) • Temporal coherence as a quality measure: – Used the human-assisted motion annotation tool [27] to annotate the ground-truth motion of the room sequence. Experimental Results (given by Authors) Std. Dev. Input data BM3D (σ = 20) BM3D (σ = 20) Proposed Algorithm RED 4.20 2.22 1.52 1.23 BLUE 3.81 1.78 1.45 1.13 GREEN 9.55 5.77 2.81 2.91 The average standard deviation along motion paths is measured for different algorithms at different RGB channels. Proposed system has overall the least temporal fluctuation.