Scale-Invariant Feature Transform (SIFT)
Jinxiang Chai

Review: Image Processing
- Median filtering
- Bilateral filtering
- Edge detection
- Corner detection

Review: Corner Detection
1. Compute the image gradients Ix, Iy.
2. For each pixel (i, j), construct the matrix from gradient products over its neighborhood:
   C(i, j) = [ Σ Ix²    Σ IxIy ]
             [ Σ IxIy   Σ Iy²  ]
3. Determine the two eigenvalues λ(i, j) = [λ1, λ2].
4. If both λ1 and λ2 are large, we have a corner.

The Orientation Field
• Corners are detected where both λ1 and λ2 are large.

Good Image Features
• What are we looking for?
  - Strong features
  - Invariant to changes (affine and perspective/occlusion)
  - Solve the problem of correspondence
    • Locate an object in multiple images (i.e. in video)
    • Track the path of the object, infer 3D structure, object and camera movement

Scale Invariant Feature Transform (SIFT)
• Choose features that are invariant to image scaling and rotation
• Also partially invariant to changes in illumination and 3D camera viewpoint

Invariance
• Illumination
• Scale
• Rotation
• Affine

Required Readings
• David G. Lowe, "Object recognition from local scale-invariant features," ICCV 1999
• David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60, 2 (2004), pp. 91-110

Motivation for SIFT
• Earlier methods
  - Harris corner detector
    • Sensitive to changes in image scale
    • Finds locations in the image with large gradients in two directions
  - No method was fully affine invariant
• Although the SIFT approach is not fully invariant, it allows for considerable affine change
• SIFT also allows for changes in 3D viewpoint

SIFT Algorithm Overview
1. Scale-space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Generation of keypoint descriptors

Scale Space
• Different scales are appropriate for describing different objects in the image, and we may not know the correct scale/size ahead of time.

Scale Space (Cont.)
• Looking for features (locations) that are stable (invariant) across all possible scale changes
  - use a continuous function of scale (scale space)
• Which scale-space kernel will we use?
  - The Gaussian function

Scale-Space of Image
• L(x, y, σ) = G(x, y, σ) * I(x, y)
  G(x, y, σ) - variable-scale Gaussian
  I(x, y) - input image
• To detect stable keypoint locations, find the scale-space extrema of the difference-of-Gaussian function:
  D(x, y, σ) = L(x, y, kσ) - L(x, y, σ)
             = (G(x, y, kσ) - G(x, y, σ)) * I(x, y)
• Look familiar? - bandpass filter!

Difference of Gaussian
1. A = Convolve image with vertical and horizontal 1D Gaussians, σ = sqrt(2)
2. B = Convolve A with vertical and horizontal 1D Gaussians, σ = sqrt(2)
3. DOG (Difference of Gaussian) = A - B
4. So how to deal with different scales?

Difference of Gaussian (Cont.)
1. A = Convolve image with vertical and horizontal 1D Gaussians, σ = sqrt(2)
2.
B = Convolve A with vertical and horizontal 1D Gaussians, σ = sqrt(2)
3. DOG (Difference of Gaussian) = A - B
4. Downsample B with bilinear interpolation with a pixel spacing of 1.5 (a linear combination of 4 adjacent pixels)

Difference of Gaussian Pyramid
[Figure: each octave blurs the input to get A, blurs A again to get B, and takes DOG = A - B; B is then downsampled to start the next octave: A1/B1/DOG1 → A2/B2/DOG2 → A3/B3/DOG3.]

Other Issues
• Initial smoothing ignores the highest spatial frequencies of the image
  - expand the input image by a factor of 2, using bilinear interpolation, prior to building the pyramid
• How to do downsampling with bilinear interpolation?

Bilinear Filter
• Weighted sum of the four neighboring pixels.
• Sampling at S(x, y), where a and b are the fractional offsets of (x, y) from pixel (i, j) along the two axes:
  S(x,y) = (1-a)*(1-b)*S(i,j)
         + a*(1-b)*S(i,j+1)
         + (1-a)*b*S(i+1,j)
         + a*b*S(i+1,j+1)
• To optimize the above, interpolate along one axis first:
  Si = S(i,j) + a*(S(i,j+1) - S(i,j))
  Sj = S(i+1,j) + a*(S(i+1,j+1) - S(i+1,j))
  S(x,y) = Si + b*(Sj - Si)

Pyramid Example
[Figure: pyramid levels A1-A3 and B1-B3 with their differences DOG1-DOG3.]

Feature Detection
• Find the maxima and minima of the scale space
• For each point on a DOG level:
  - Compare to its 8 neighbors at the same level
  - If it is a max/min, identify the corresponding point at the pyramid level below
  - Determine if the corresponding point is a max/min of its 8 neighbors
  - If so, repeat at the pyramid level above
• Repeat for each DOG level
• Those that remain are key points

Identifying Max/Min
[Figure: a candidate point on DOG level L is checked against neighboring points on levels L+1 and L-1.]

Refining Key List: Illumination
• For all levels, use the "A" smoothed image to compute
  - Gradient magnitude M(i, j)
•
Threshold gradient magnitudes:
  - Remove all key points with M(i, j) less than 0.1 times the max gradient value
• Motivation: low contrast is generally less reliable than high contrast for feature points

Results: Eliminating Features
• Removing features in low-contrast regions

Assigning Canonical Orientation
• For each remaining key point:
  - Choose the surrounding N x N window at the DOG level where it was detected
• For all levels, use the "A" smoothed image to compute
  - Gradient orientation and gradient magnitude
• Weight the gradient magnitude by a 2D Gaussian with σ of 3 times that of the current smoothing scale:
  Weighted Magnitude = Gradient Magnitude * 2D Gaussian
• Accumulate the weighted magnitudes in a histogram based on gradient orientation
  - The histogram has 36 bins with 10° increments
• Identify the peak, and assign its orientation and sum of magnitudes to the key point

Eliminating Edges
• The difference-of-Gaussian function will be strong along edges
  - So how can we get rid of these edges?
• Similar to the Harris corner detector, use the 2x2 Hessian of D at the key point:
  H = [ Dxx  Dxy ]
      [ Dxy  Dyy ]
• We are not concerned with the actual eigenvalues α and β, just their ratio r = α/β:
  Tr(H)² / Det(H) = (α + β)² / (αβ) = (rβ + β)² / (rβ²) = (r + 1)² / r
• Reject key points for which this ratio is too large (one principal curvature much bigger than the other means an edge, not a corner-like feature).
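The edge test above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are ours, the Hessian is estimated with simple finite differences on the DOG image, and r_max = 10 is the threshold value suggested in Lowe's 2004 paper.

```python
import numpy as np

def hessian_2x2(dog, i, j):
    """Finite-difference 2x2 Hessian of the DOG image at pixel (i, j)."""
    dxx = dog[i, j + 1] - 2 * dog[i, j] + dog[i, j - 1]
    dyy = dog[i + 1, j] - 2 * dog[i, j] + dog[i - 1, j]
    dxy = (dog[i + 1, j + 1] - dog[i + 1, j - 1]
           - dog[i - 1, j + 1] + dog[i - 1, j - 1]) / 4.0
    return dxx, dyy, dxy

def passes_edge_test(dxx, dyy, dxy, r_max=10.0):
    """Keep the point only if Tr(H)^2 / Det(H) < (r_max + 1)^2 / r_max."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:          # eigenvalues of mixed sign (or zero): reject
        return False
    return tr * tr / det < (r_max + 1) ** 2 / r_max

# A blob-like point (both curvatures comparable) passes ...
print(passes_edge_test(-2.0, -2.0, 0.0))   # True
# ... while an edge-like point (one curvature dominates) is rejected.
print(passes_edge_test(-10.0, -0.1, 0.0))  # False

# The same test on an actual ridge in a tiny synthetic DOG patch:
dog = np.array([[0, 0, 0],
                [1, 1, 1],
                [0, 0, 0]], dtype=float)   # horizontal edge/ridge
print(passes_edge_test(*hessian_2x2(dog, 1, 1)))  # False
```

Note the ratio test never needs the eigenvalues themselves: trace and determinant are enough, which is exactly why the slide says only the ratio r matters.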
Local Image Description
• Each SIFT key is assigned:
  - Location
  - Scale (analogous to the level at which it was detected)
  - Orientation (assigned in the previous canonical-orientation step)
• Now: describe the local image region in a way that is invariant to the above transformations

SIFT: Local Image Description
• Needs to be invariant to changes in location, scale and rotation

SIFT Key Example
[Figure: example SIFT keys overlaid on an image.]

Local Image Description
For each key point:
1. Identify the 8x8 neighborhood (from the DOG level at which it was detected)
2. Align the orientation to the x-axis
3. Calculate the gradient magnitude and orientation map
4. Weight by a Gaussian
5. Calculate a histogram for each 4x4 region: 8 bins for gradient orientation, tallying the weighted gradient magnitudes
6. This histogram array is the image descriptor (the example here is a vector of length 4*8 = 32; Lowe's suggestion is a 128-vector from a 16x16 neighborhood, i.e. 4x4 regions with 8 bins each)

Applications: Image Matching
• Find all key points identified in the source and target images
  - Each key point will have a 2D location, scale and orientation, as well as an invariant descriptor vector
• For each key point in the source image, search for corresponding SIFT features in the target image
• Find the transformation between the two images using epipolar geometry constraints or an affine transformation

Image Matching via SIFT Features
• Feature detection
• Image matching via nearest-neighbor search
  - If the ratio of the closest distance to the 2nd-closest distance is greater than 0.8, reject the match as a false match
• Remove outliers using epipolar line constraints

Summary
• SIFT features are reasonably invariant to rotation, scaling, and illumination changes
• We can use them for image matching and object recognition, among other things
• Efficient on-line matching and recognition can be performed in real time
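The nearest-neighbor ratio test used for matching can be sketched as follows. This is a toy illustration under our own assumptions (function name and 4-D "descriptors" are hypothetical; real SIFT descriptors are 128-D): each source descriptor is matched to its nearest target descriptor only if the nearest distance is at most 0.8 times the second-nearest distance.

```python
import numpy as np

def match_descriptors(src, dst, ratio=0.8):
    """Return (i, j) pairs where src[i]'s nearest neighbor dst[j] is
    unambiguous: d(nearest) <= ratio * d(2nd nearest)."""
    matches = []
    for i, d in enumerate(src):
        dists = np.linalg.norm(dst - d, axis=1)  # L2 distance to every target
        j1, j2 = np.argsort(dists)[:2]           # nearest and 2nd nearest
        if dists[j1] <= ratio * dists[j2]:       # distinctive enough: keep
            matches.append((i, j1))
    return matches

# Toy descriptors: src[0] has one clear match in dst; src[1] has two
# near-identical candidates, so the ratio test rejects it as ambiguous.
src = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
dst = np.array([[0.9, 0.1, 0.0, 0.0],    # close to src[0]
                [0.0, 0.0, 1.0, 0.0],    # far from both
                [0.0, 0.98, 0.1, 0.0],   # close to src[1] ...
                [0.0, 0.97, 0.1, 0.0]])  # ... and so is this one
print(match_descriptors(src, dst))  # [(0, 0)]
```

The point of the test is that a correct match should be much closer than the best incorrect one; when the two nearest candidates are almost equally close, the match is likely spurious and is dropped before any geometric (epipolar) verification.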