CSE 185 Introduction to Computer Vision
Cameras

Cameras
• Camera models
 – Pinhole perspective projection
 – Weak perspective
 – Affine projection
• Cameras with lenses
• Sensing
• Human eye
• Reading: Szeliski, Chapter 2

Images are two-dimensional patterns of brightness values; they are formed by the projection of 3D objects.
Figure from US Navy Manual of Basic Optics and Optical Instruments, prepared by Bureau of Naval Personnel. Reprinted by Dover Publications, Inc., 1969.

A short history
• Animal eye: a long time ago
• Photographic camera: Niépce, 1816
• Pinhole perspective projection: Brunelleschi, XVth century
• Camera obscura: XVIth century

Perspective effects
• A is half the size of B; C is half the size of B; yet B' and C' have the same size in the image.
• Parallel lines converge at vanishing points, which lie on a line formed by the intersection of the image plane with a plane parallel to π.
• A line L in π that is parallel to the image plane has no image at all.
• See https://www.youtube.com/watch?v=4H-DYdKYkqk for more explanations.

Vanishing point: Last Supper, Leonardo da Vinci. The lines all converge in his right eye, drawing the viewer's gaze to this place.

Urban scene: vanishing points and vanishing lines.

Vanishing points and lines
• The projections of parallel 3D lines intersect at a vanishing point.
• The projections of parallel 3D planes intersect at a vanishing line.
• If a set of parallel 3D lines is also parallel to a particular plane, their vanishing point lies on the vanishing line of that plane.
• Not all lines that intersect in the image are parallel in 3D.
• Vanishing point <-> 3D direction of a line
• Vanishing line <-> 3D orientation of a surface

Perspectives: see https://www.youtube.com/watch?v=2HfIU5lp-0I&t=1s for more explanations.

Vanishing point: applications
• Camera calibration: use the properties of vanishing points to find intrinsic and extrinsic camera parameters.
• 3D reconstruction: man-made structures have two main characteristics: many lines are parallel and many edges are orthogonal. Using sets of parallel lines, the orientation of a plane can be estimated from its vanishing points.
• Robot navigation

Vanishing points and lines (photo from the online Tate collection)

Note on estimating vanishing points
• Use multiple lines for better accuracy, but in practice the lines will not intersect at exactly the same point.
• One solution: take the mean of pairwise intersections. Bad idea!
• Instead, minimize angular differences.
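As a concrete illustration of estimating a single point from many noisy lines, here is a minimal least-squares sketch in MATLAB. Note that it minimizes an algebraic error on homogeneous line equations rather than the angular-difference formulation mentioned above, and the variable segs and its layout are assumptions made for this example.

% Hypothetical sketch: estimate a vanishing point from detected line
% segments (segs is assumed to be N-by-4, one row [x1 y1 x2 y2] per segment).
n = size(segs, 1);
L = zeros(n, 3);
for i = 1:n
    p1 = [segs(i,1); segs(i,2); 1];        % segment endpoints, homogeneous
    p2 = [segs(i,3); segs(i,4); 1];
    l  = cross(p1, p2);                    % line through the two endpoints
    L(i,:) = (l / norm(l(1:2)))';          % normalize so residuals are comparable
end
[~, ~, V] = svd(L);                        % v minimizes ||L*v|| subject to ||v|| = 1
v = V(:, end);
v = v / v(3)                               % vanishing point in image coordinates

The smallest singular vector minimizes the sum of squared point-to-line residuals, a common linear surrogate when the full angular-error minimization is not needed.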
Vanishing points and lines: vertical vanishing point (at infinity), vanishing line, vanishing points (slide from Efros, photo from Criminisi).

Vanishing point: robot navigation.

Orthogonal vanishing points
• Once sets of mutually orthogonal vanishing points are detected, it is possible to search for 3D rectangular structures in the image.

Pinhole perspective equation
• P = (x, y, z) is a scene point and P' = (x', y', z') its image; the i, j, k axes correspond to x, y, z.
• C': image center; OC': optical axis; π': image plane, at a positive distance f' from the pinhole.
• Since OP' = λ OP and z' = f', similar triangles give
  x' = f' x / z,  y' = f' y / z.
• NOTE: z is always negative.

Weak perspective projection
• Consider a frontal-parallel plane π0 defined by z = z0. Then
  x' = −m x,  y' = −m y,  where m = −f' / z0 is the magnification.
• When the scene relief (depth) is small compared to its distance from the camera, m can be taken constant → weak perspective projection.

Orthographic projection
• x' = x, y' = y.
• When the camera is at a (roughly constant) distance from the scene, take m = −1 → orthographic projection.

Issues with the pinhole camera
• Pinhole too big: many directions are averaged, blurring the image.
• Pinhole too small: diffraction effects blur the image.
• Generally, pinhole cameras are dark, because only a very small set of rays from a particular point hits the screen.

Lenses
Snell's law (aka Descartes' law): n1 sin α1 = n2 sin α2, where n1 is the index of the incident medium and n2 the index of the refracting medium (reflection and refraction at the interface).

Paraxial (or first-order) optics
• For small angles, Snell's law n1 sin α1 = n2 sin α2 becomes n1 α1 = n2 α2.
• With α1 ≈ h/R + h/d1 and α2 ≈ h/R − h/d2, this gives
  n1/d1 + n2/d2 = (n2 − n1)/R.

Thin lens
• All other rays passing through P are focused on P':
  x' = x z'/z,  y' = y z'/z,  where 1/z' − 1/z = 1/f.
• f is the focal length, f = R / (2(n − 1)); F and F' are the focal points.

Depth of field and field of view
• Depth of field (field of focus): objects within a certain range of distances are in acceptable focus; depends on focal length and aperture.
• Field of view: the portion of scene space that is actually projected onto the camera sensor; defined not only by the focal length but also by the effective sensor area.

Depth of field
• f-number: N = f/D, where f is the focal length and D the aperture diameter; e.g., f/5.6 (large aperture) vs. f/32 (small aperture).
• Changing the aperture size affects depth of field: increasing the f-number (reducing the aperture diameter) increases DOF; a smaller aperture increases the range in which the object is approximately in focus.

Thick lenses
• Simple lenses suffer from several aberrations.
• The first-order approximation is not sufficient; use the 3rd-order Taylor approximation.

Orthographic ("telecentric") lenses, e.g., the Navitar telecentric zoom lens.

Correcting radial distortion.

Aberrations
• Spherical aberration: rays do not intersect at one point (circle of least confusion).
• Distortion (optics): pincushion and barrel distortion.
• Chromatic aberration: refracted rays of different wavelengths intersect the optical axis at different points.
• Vignetting: aberrations can be minimized by well-chosen shapes and refraction indices, separated by appropriate stops. However, light rays from object points off-axis are partially blocked by the lens configuration → vignetting → brightness drop in the image periphery.

Human eye
• Cornea: transparent, highly curved refractive component.
• Pupil: opening at the center of the iris that adapts in response to illumination.
• Helmholtz's Schematic Eye.

Retina
Retina: thin, layered membrane with two types of photoreceptors
• rods: very sensitive to light but poor spatial detail
• cones:
sensitive to spatial details but active at higher light level • generally called receptive field Cones in the fovea Rods and cones in the periphery Photographs (Niepce, “La Table Servie,” 1822) Milestones: Daguerreotypes (1839) Photographic Film (Eastman, 1889) Cinema (Lumière Brothers, 1895) Color Photography (Lumière Brothers, 1908) Television (Baird, Farnsworth, Zworykin, 1920s) CCD Devices (1970) Collection Harlingue-Viollet. . 360 degree field of view… • Basic approach – Take a photo of a parabolic mirror with an orthographic lens – Or buy one a lens from a variety of omnicam manufacturers… • See http://www.cis.upenn.edu/~kostas/omni.html Digital camera • A digital camera replaces film with a sensor array – Each cell in the array is a Charge Coupled Device (CCD) • • • • light-sensitive diode that converts photons to electrons Complementary Metal Oxide on Silicon (CMOS) sensor CMOS is becoming more popular http://electronics.howstuffworks.com/digital-camera.htm Image sensing pipeline A simple camera pipeline Gray-scale image • • Gray scale: 0-255 Usually normalized between 0 and 1 (dividing by 255) and convert it into a vector for processing In a 19 19 face image • Consider a thumbnail 19 19 face image • 256361 possible combination of gray values • 256361= 28361 = 22888 • Total world population (as of 2021) • 7,880,000,000 < 233 • 287 times more than the world population! • Extremely high dimensional space! Color image • Usually represented in three channels CSE 185 Introduction to Computer Vision Light and color Light and color • • • • Human eye Light Color Projection • Reading: Chapters 2,6 Camera aperture f / 5.6 (large aperture) f / 32 (small aperture) The human eye • The human eye is a camera – Iris: colored annulus with radial muscles – Pupil: the hole (aperture) whose size is controlled by the iris – What’s the “film”? photoreceptor cells (rods and cones) in the retina Human eye Retina: thin, layered membrane with two types of photoreceptors • rods: very sensitive to light but poor spatial detail • cones: sensitive to spatial details but active at higher light level • generally called receptive field Human vision system (HVS) Exploiting HVS model • Flicker frequency of film and TV • Interlaced television • Image compression JPEG compression Uncompressed 24 bit RGB bit map: 73,242 pixels require 219,726 bytes (excluding headers) Q=100 Compression ratio: 2.6 Q=50 Compression ratio: 15 Q=10 Compression ratio: 46 Q=1 Compression ratio: 144 Q=25 Compression ratio: 23 JPEG compression Digital camera CCD • • • Low-noise images Consume more power More and higher quality pixels vs. CMOS • • • • More noise (sensor area is smaller) Consume much less power Popular in camera phones Getting better all the time http://electronics.howstuffworks.com/digital-camera.htm Color What colors do humans see? 
The colors of the visible light spectrum
• red: ~700–635 nm (~430–480 THz)
• green: ~560–490 nm (~540–610 THz)
• blue: ~490–450 nm (~610–670 THz)

Color
• Plot of all visible colors (hue and saturation).
• Color spaces: RGB, CIE LUV, CIE XYZ, CIE LAB, HSV, HSL, …
• A color image can be represented by 3 image planes.

Bayer pattern / color filter array
• A practical way to record the primary colors is to use a color filter array.
• Single-chip image sensor: the filter pattern is 50% G, 25% R, 25% B.
• Since each pixel is filtered to record only one color, various demosaicing algorithms are used to interpolate a complete set of RGB values at each point.
• Some high-end video cameras have 3 CCD chips.

Demosaicing (figure: original vs. reconstructed)
• How can we compute an R, G, and B value for every pixel?

Color camera: single-CCD with a Bayer mosaic color filter vs. prism-based 3-CCD color configuration.

Grayscale image
• Mainly dealing with intensity (luminance).
• Usually 256 levels (1 byte per pixel): 0 (black) to 255 (white), often normalized between 0 and 1.
• Several ways to convert color to grayscale, e.g., Y = 0.2126 R + 0.7152 G + 0.0722 B.

Recoloring old photos: http://twistedsifter.com/2013/08/historic-black-white-photos-colorized/

Projection
• See Szeliski 2.1.
• Fun application: PhotoFunia.

Correspondence and alignment
• Correspondence: matching points, patches, edges, or regions across images.
• Alignment: how do we fit the best transformation between images?
• Common transformations (original vs. transformed): translation, rotation, aspect, affine, perspective.

Modeling projection
• The coordinate system
 – We will use the pinhole model as an approximation.
 – Put the optical center (center of projection, COP) at the origin.
 – Put the image plane (projection plane, PP) in front of the COP.
 – The camera looks down the negative z axis (we need this if we want right-handed coordinates).
• Projection equations
 – Compute the intersection with the PP of the ray from (x, y, z) to the COP.
 – Derived using similar triangles: x'/x = y'/y = −d/z, so (x, y, z) → (−d x/z, −d y/z, −d).
 – We get the projection by throwing out the last coordinate: (x, y, z) → (−d x/z, −d y/z).

Homogeneous coordinates
• Is this a linear transformation? No: division by z is nonlinear.
• Trick: add one more coordinate (homogeneous image coordinates, homogeneous scene coordinates), then convert back from homogeneous coordinates by dividing by the last coordinate.

Perspective projection
• Projection is a matrix multiply using homogeneous coordinates, followed by a divide by the third coordinate. This is known as perspective projection; the matrix is the projection matrix.
• How does scaling the projection matrix change the transformation?

Orthographic projection
• Special case of perspective projection: the distance from the COP to the PP is infinite.
• Good approximation for telephoto optics.
• Also called "parallel projection": (x, y, z) → (x, y).
• What's the projection matrix? (see the sketch below)
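As a small, hedged illustration of these two projection matrices (a sketch, not part of the original slides; it assumes the image plane sits at distance d in front of the COP, with the camera looking down the negative z axis as above):

% Minimal sketch: perspective vs. orthographic projection with
% homogeneous coordinates.
d = 1;
X = [2; 3; -5; 1];                 % homogeneous scene point

P_persp = [1 0 0    0;             % perspective projection matrix
           0 1 0    0;
           0 0 -1/d 0];
P_ortho = [1 0 0 0;                % orthographic ("parallel") projection matrix
           0 1 0 0;
           0 0 0 1];

x_p = P_persp * X;  x_p = x_p(1:2) / x_p(3)   % -> (-d*X/Z, -d*Y/Z)
x_o = P_ortho * X;  x_o = x_o(1:2) / x_o(3)   % -> (X, Y)

Scaling either matrix by a nonzero constant leaves the result unchanged, because the divide by the third coordinate cancels the scale.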
Orthographic projection variants • Scaled orthographic – Also called “weak perspective” – Affine projection • Also called “paraperspective” Camera parameters • See MATLAB camera calibration example Camera parameters A camera is described by several parameters • • • • Translation T of the optical center from the origin of world coords Rotation R of the image plane focal length f, principle point (x’c, y’c), pixel size (sx, sy) yellow parameters are called extrinsics, red are intrinsics Projection equation X Y = ΠX Z 1 The projection matrix models the cumulative effect of all parameters Useful to decompose into a series of operations identity matrix sx * * * * x = sy = * * * * s * * * * • • − fs x Π = 0 0 0 − fs y 0 intrinsics • x'c 1 0 0 0 R y 'c 0 1 0 0 3 x 3 0 1 0 0 1 0 1x 3 projection rotation 03 x1 I 3 x 3 1 01x 3 1 T 3 x1 translation The definitions of these parameters are not completely standardized – especially intrinsics—varies from one book to another CSE 185 Introduction to Computer Vision Image Filtering: Spatial Domain Image filtering • Spatial domain • Frequency domain • Reading: Chapter 3 3D world from 2D image Analysis from local evidence Image filters • Spatial domain – Filter is a mathematical operation of a grid of numbers – Smoothing, sharpening, measuring texture • Frequency domain – Filtering is a way to modify the frequencies of images – Denoising, sampling, image compression • Templates and image pyramids – Filtering is a way to match a template to the image – Detection, coarse-to-fine registration Image filtering • Image filtering: compute function (also known as kernel) of local neighborhood at each position • Important tools – Enhance images • Denoise, resize, increase contrast, etc. – Extract information from images • Texture, edges, distinctive points, etc. – Detect patterns • Template matching Box filter g[ , ] 1 1 1 1 1 1 1 1 1 filter Image filtering output h[.,.] f [.,.] input image g[ , ] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 90 90 0 0 90 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 30 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] 
input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 30 30 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 30 30 ? h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 30 30 ? 50 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 filter Image filtering g[ , ] output h[.,.] f [.,.] input image 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 30 30 30 20 10 0 0 0 90 90 90 90 90 0 0 0 20 40 60 60 60 40 20 0 0 0 90 90 90 90 90 0 0 0 30 60 90 90 90 60 30 0 0 0 90 90 90 90 90 0 0 0 30 50 80 80 90 60 30 0 0 0 90 0 90 90 90 0 0 0 30 50 80 80 90 60 30 0 0 0 90 90 90 90 90 0 0 0 20 30 50 50 60 40 20 0 0 0 0 0 0 0 0 0 0 10 20 30 30 30 30 20 10 0 0 90 0 0 0 0 0 0 0 10 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l 1 1 1 1 1 1 1 1 1 Example 3*1+0*0+1*(-1)+1*1+5*0+8*(-1)+2*1+7*0+2*(-1)=-5 convert image and filter mask into vectors and apply dot product Convolution of a 6×6 matrix with a 3×3 matrix. As a result we get a 4×4 matrix. See http://datahacker.rs/edge-detection/ Edge Detection Edge detection – an original image (left), a filter (in the middle), a result of a convolution (right) Dot product • Find the angle between • u = 4, 3 and v = 3, 5 • Solution: • When do two vectors have maximal value in dot product? • Consider convolution/filter in a similar way • Consider a filter as a probe to infer response in an image Box filter g[ , ] What does it do? • Replaces each pixel with an average of its neighborhood • Achieve smoothing effect (remove sharp features) 1 1 1 1 1 1 1 1 1 Smoothing with box filter Practice with linear filters 0 0 0 0 1 0 0 0 0 Original ? Practice with linear filters 0 0 0 0 1 0 0 0 0 Original Filtered (no change) Practice with linear filters 0 0 0 0 0 1 0 0 0 Original ? Practice with linear filters 0 0 0 0 0 1 0 0 0 Original Shifted left By 1 pixel Practice with linear filters 0 0 0 0 2 0 0 0 0 Original - 1 1 1 1 1 1 1 1 1 (Note that filter sums to 1) ? 
Practice with linear filters 0 0 0 0 2 0 0 0 0 Original - 1 1 1 1 1 1 1 1 1 Sharpening filter - Accentuates differences with local average Sharpening Sobel filter 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 -1 2 0 -2 1 0 -1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 Sobel Vertical Edge See https://www.youtube.com/watch?v=am36dePheDc (absolute value) Sobel filter 1 2 1 0 0 0 -1 -2 -1 Sobel Horizontal Edge (absolute value) Synthesize motion blur I = imread('cameraman.tif'); subplot(2,2,1);imshow(I);title('Original Image'); H = fspecial('motion',20,45); MotionBlur = imfilter(I,H,'replicate'); subplot(2,2,2);imshow(MotionBlur);title('Motion Blurred Image'); H2 = fspecial('disk',10); blurred2 = imfilter(I,H2,'replicate'); subplot(2,2,3);imshow(blurred2);title('Blurred Image'); H3 = fspecial('gaussian', [10 10], 2); blurred3 = imfilter(I,H3, 'replicate'); subplot(2,2,4);imshow(blurred3);title('Gaussian Blurred Image'); Sample code Prewitt filter I=imread('cameraman.tif'); Hpr=fspecial('prewitt'); Hso=fspecial('sobel'); Sobel filter fHpr=imfilter(I,Hpr); fHso=imfilter(I,Hso); fHso2=imfilter(I,Hso'); subplot(2,2,1);imshow(I); title('Original Image') subplot(2,2,2); imshow(fHpr); title('Prewitt filter'); subplot(2,2,3); imshow(fHso); title('Sobel filter horizontal'); subplot(2,2,4); imshow(fHso2); title('Sobel filter vertical'); Demosaicing • How can we compute an R, G, and B value for every pixel? Demosaicing • How can we compute an R, G, and B value for every pixel? Filtering vs. convolution • 2d filtering g=filter f=image – h=filter2(g,f); or h=imfilter(f,g); h[ m, n] = g[ k , l ] f [ m + k , n + l ] k ,l • 2d convolution – h=conv2(g,f); h[ m, n] = g[ k , l ] f [ m − k , n − l ] k ,l Key properties of linear filters • Linearity: filter(f1 + f2) = filter(f1) + filter(f2) • Shift invariance: same behavior regardless of pixel location filter(shift(f)) = shift(filter(f)) • Any linear, shift-invariant operator can be represented as a convolution More properties • Commutative: a * b = b * a – Conceptually no difference between filter and signal – But particular filtering implementations might break this equality • Associative: a * (b * c) = (a * b) * c – Often apply several filters one after another: (((a * b1) * b2) * b3) – This is equivalent to applying one filter: a * (b1 * b2 * b3) • Distributes over addition: a * (b + c) = (a * b) + (a * c) • Scalars factor out: ka * b = a * kb = k (a * b) • Identity: unit impulse e = [0, 0, 1, 0, 0], a*e=a Gaussian filter • Weight contributions of neighboring pixels by nearness 0.003 0.013 0.022 0.013 0.003 0.013 0.059 0.097 0.059 0.013 0.022 0.097 0.159 0.097 0.022 0.013 0.059 0.097 0.059 0.013 5 x 5, = 1 0.003 0.013 0.022 0.013 0.003 Smoothing with Gaussian filter Smoothing with box filter Gaussian filters Separability of Gaussian filter filter Separability example image 2D convolution (center location only) The filter factors into a product of 1D filters: Perform convolution along rows: * = Followed by convolution along the remaining column: * = Separability • Why is separability useful in practice? • Filtering an M-by-N image with a P-by-Q kernel requires roughly MNPQ multiples and adds. • For a separable filter, it requires – MNP multiples and adds for the first step – MNQ multiples and adds for the second step – MN(P+Q) multiples and adds • For a 9-by-9 filter, a theoretical speed-up of 4.5 Practical matters How big should the filter be? 
• Values at edges should be near zero • Rule of thumb for Gaussian: set filter half-width to about 2.4 (or 3) σ Practical matters • What about near the image boundary? – the filter window falls off the extent of the image – need to extrapolate – methods: • clip filter (black) • wrap around • copy edge • reflect across edge Q? Practical matters – methods (MATLAB): • clip filter (black): • wrap around: • copy edge: • reflect across edge: imfilter(f, g, 0) imfilter(f, g, ‘circular’) imfilter(f, g, ‘replicate’) imfilter(f, g, ‘symmetric’) Practical matters • What is the size of the output? • MATLAB: filter2(g, f, shape) g: filter, f:image – shape = ‘full’: output size is sum of sizes of f and g – shape = ‘same’: output size is same as f (default setting) – shape = ‘valid’: output size is difference of sizes of f and g g full g same g f g valid g g f g g g f g g g Median filters • A Median Filter operates over a window by selecting the median intensity in the window. • What advantage does a median filter have over a mean filter? • Is a median filter a kind of convolution? Comparison: salt and pepper noise Sharpening revisited • What does blurring take away? – = detail smoothed (5x5) original Let’s add it back: +α original = detail sharpened Take-home messages • Linear filtering is sum of dot product at each position – Can smooth, sharpen, translate (among many other uses) • Be aware of details for filter size, extrapolation, cropping 1 1 1 1 1 1 1 1 1 Practice questions 1. Write down a 3x3 filter that returns a positive value if the average value of the 4adjacent neighbors is less than the center and a negative value otherwise 2. Write down a filter that will compute the gradient in the x-direction: gradx(y,x) = im(y,x+1)-im(y,x) for each x, y CSE 185 CSE 185 Introduction to Computer Vision Introduction to Computer Vision Image Filtering: Image Filtering: Frequency Domain Frequency Domain Image filtering • Fourier transform and frequency domain – Frequency view of filtering – Hybrid images – Sampling • Reading: Chapters 3 • Some slides from James Hays, David Hoeim, Steve Seitz, Richard Szeliski, … Gaussian filter Gaussian Box filter Hybrid images • Filter one image with low-pass filter and the other with high-pass filter, and then superimpose them • A. Oliva, A. Torralba, P.G. Schyns, “Hybrid Images,” SIGGRAPH 2006 At a distance • What do you see? Close up • What do you see? High/low frequency components • What are stable components? • What describes image details? • Low-frequency components – Stable edges • High-frequency components – Texture details • Decompose image into high and low frequency components Why do we get different, distance-dependent interpretations of hybrid images? ? Why does a lower resolution image still make sense to us? What do we lose? Compression How is it that a 4MP image can be compressed to a few hundred KB without a noticeable change? Thinking in terms of frequency • Convert an image into a frequency domain • Analyze the filter responses at different scale, orientation, and number of occurrence over time • Represent an image with basis filters Jean Baptiste Joseph Fourier had crazy idea (1807): Any univariate function can be rewritten as a weighted sum of sines and cosines of different frequencies. ...the manner in which the author arrives at these equations is not exempt of difficulties and...his analysis to integrate them still leaves something to be desired on the score of generality and even rigour. • Don’t believe it? 
– Neither did Lagrange, Laplace, Poisson and other big wigs – Not translated into English until 1878! • But it’s (mostly) true! – called Fourier Series – there are some subtle restrictions • Fun with math genealogy Legendre Laplace Lagrange A sum of sines Our building block: Asin(x + ) Add enough of them to get any signal g(x) you want! A: amplitude 𝜔: frequency 𝜙: shift f(target)= f0+f1+f2+…+fn+… Frequency spectra • example : g(t) = sin(2πf t) + (1/3)sin(2π(3f) t) = + Frequency spectra Frequency spectra = = + Frequency spectra = = + Frequency spectra = = + Frequency spectra = = + Frequency spectra = = + Frequency spectra 1 = A sin(2 kt ) k =1 k Example: Music • We think of music in terms of frequencies at different magnitudes Fourier analysis in images Intensity Image Fourier Image Signals can be composed + = More: http://www.cs.unm.edu/~brayer/vision/fourier.html Fourier transform • Fourier transform stores the magnitude and phase at each frequency – Magnitude encodes how much signal there is at a particular frequency – Phase encodes spatial information (indirectly) – For mathematical convenience, this is often notated in terms of real and complex numbers Euler’s formula Amplitude: 𝑒 𝑗𝜙 = cos 𝜙 + 𝑗 sin 𝜙 A = R( ) + I ( ) 2 𝐴𝑒 𝑗𝜙 = 𝐴cos 𝜙 + 𝑗 𝐴 sin 𝜙 2 Phase: I ( ) = tan R( ) −1 Computing Fourier transform 𝑒 𝑗𝜙 = cos 𝜙 + 𝑗 sin 𝜙 A = R( ) 2 + I ( ) 2 = tan −1 Continuous Discrete Fast Fourier Transform (FFT): NlogN k = -N/2..N/2 I ( ) R( ) Phase and magnitude • Fourier transform of a real function is complex – difficult to plot – visualize instead, we can think of the phase and magnitude of the transform • Each frequency response is described by phase and magnitude • Phase: angle of complex transform • Magnitude: amplitude of complex transform • Curious fact – all natural images have about the same magnitude transform – hence, phase seems to matter, but magnitude largely doesn’t • Demonstration – take two pictures, swap the phase transforms, compute the inverse - what does the result look like? Phase and magnitude This is the magnitude transform of the cheetah pic This is the phase transform of the cheetah pic This is the magnitude transform of the zebra pic This is the phase transform of the zebra pic Reconstructio n with zebra phase, cheetah magnitude Reconstruction with cheetah phase, zebra magnitude The convolution theorem • The Fourier transform of the convolution of two functions is the product of their Fourier transforms F[ g h] = F[ g ] F[h] • Convolution in the spatial domain is equivalent to multiplication in the frequency domain! 
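A quick numerical check of this equivalence (a minimal sketch; zero-padding both signals to the full output size is what makes the spatial and frequency routes agree exactly):

% Minimal check of the convolution theorem on a random image and kernel.
f = rand(64, 64);                       % "image"
g = fspecial('gaussian', 9, 2);         % smoothing kernel

h_spatial = conv2(f, g);                % full 2D convolution (72 x 72)

N = size(f) + size(g) - 1;              % pad both to the full output size
h_freq = real(ifft2(fft2(f, N(1), N(2)) .* fft2(g, N(1), N(2))));

max(abs(h_spatial(:) - h_freq(:)))      % ~1e-15: the two results agree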
−1 g * h = F [F[ g ] F[h]] • Used in Fast Fourier Transform – Ten computer codes that transformed science, Nature, Jan 20 2021 Properties of Fourier transform • Linearity F 𝑎𝑥 𝑡 + 𝑏𝑦 𝑡 = 𝑎F 𝑥 𝑡 + 𝑏F(𝑦 𝑡 ) • Fourier transform of a real signal is symmetric about the origin • The energy of the signal is the same as the energy of its Fourier transform See Szeliski Book (3.4) 2D FFT • Fourier transform (discrete case) 1 F (u , v) = MN M −1 N −1 f ( x, y )e − j 2 (ux / M + vy / N ) x =0 y =0 for u = 0,1,2,..., M − 1, v = 0,1,2,..., N − 1 • Inverse Fourier transform: M −1 N −1 f ( x, y ) = F (u , v)e j 2 (ux / M + vy / N ) u =0 v =0 for x = 0,1,2,..., M − 1, y = 0,1,2,..., N − 1 • u, v : the transform or frequency variables • x, y : the spatial or image variables Euler’s formula Fourier bases in Matlab, check out: imagesc(log(abs(fftshift(fft2(im))))) 2D FFT Sinusoid with frequency = 1 and its FFT 2D FFT Sinusoid with frequency = 3 and its FFT 2D FFT Sinusoid with frequency = 5 and its FFT 2D FFT Sinusoid with frequency = 10 and its FFT 2D FFT Sinusoid with frequency = 15 and its FFT 2D FFT Sinusoid with varying frequency and their FFT Rotation Sinusoid rotated at 30 degrees and its FFT 2D FFT Sinusoid rotated at 60 degrees and its FFT 2D FFT Image analysis with FFT Image analysis with FFT http://www.cs.unm.edu/~brayer/vision/fourier.html Image analysis with FFT Filtering in spatial domain * = 1 0 -1 2 0 -2 1 0 -1 Filtering in frequency domain FFT FFT = Inverse FFT FFT in Matlab • Filtering with fft im = double(imread('cameraman.tif'))/255; [imh, imw] = size(im); hs = 50; % filter half-size fil = fspecial('gaussian', hs*2+1, 10); fftsize = 1024; % should be order of 2 (for speed) and include im_fft = fft2(im, fftsize, fftsize); % 1) fil_fft = fft2(fil, fftsize, fftsize); % 2) image im_fil_fft = im_fft .* fil_fft; % 3) im_fil = ifft2(im_fil_fft); % 4) im_fil = im_fil(1+hs:size(im,1)+hs, 1+hs:size(im, 2)+hs); % 5) figure, imshow(im); figure, imshow(im_fil); padding fft im with padding fft fil, pad to same size as multiply fft images inverse fft2 remove padding • Displaying with fft figure(1), imagesc(log(abs(fftshift(im_fft)))), axis image, colormap jet Questions Which has more information, the phase or the magnitude? What happens if you take the phase from one image and combine it with the magnitude from another image? Filtering Why does the Gaussian give a nice smooth image, but the square filter give edgy artifacts? Gaussian Box filter Gaussian filter Gaussian Box filter Box Filter Why do we get different, distance-dependent interpretations of hybrid images? ? Salvador Dali invented Hybrid Images? Salvador Dali “Gala Contemplating the Mediterranean Sea, which at 30 meters becomes the portrait of Abraham Lincoln”, 1976 Fourier bases Teases away fast vs. slow changes in the image. This change of basis is the Fourier Transform Fourier bases in Matlab, check out: imagesc(log(abs(fftshift(fft2(im))))) Hybrid image in FFT Hybrid Image Low-passed Image High-passed Image Application: Hybrid images • Combine low-frequency of one image with high-frequency of another one • Sad faces when looked closely • Happy faces when looked a few meters away • A. Oliva, A. Torralba, P.G. Schyns, “Hybrid Images,” SIGGRAPH 2006 Sampling Why does a lower resolution image still make sense to us? What do we lose? Why does a lower resolution image still make sense to us? What do we lose? 
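Before the 2D case, a tiny 1D sketch of what we lose: frequencies above half the sampling rate fold back ("alias") onto lower frequencies. The particular frequencies below are arbitrary choices for illustration.

% Sketch: a 9 Hz sine sampled at 10 Hz is indistinguishable from a 1 Hz sine.
t_fine = 0:0.001:1;
f0 = 9;                                   % signal frequency (Hz)
fs = 10;                                  % sampling rate below 2*f0 -> aliasing
t_samp = 0:1/fs:1;
plot(t_fine, sin(2*pi*f0*t_fine), 'b'); hold on
stem(t_samp, sin(2*pi*f0*t_samp), 'r');              % the coarse samples
plot(t_fine, -sin(2*pi*(fs - f0)*t_fine), 'r--');    % the 1 Hz alias the samples trace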
Subsampling by a factor of 2 Throw away every other row and column to create a 1/2 size image Aliasing problem • 1D example (sinewave): Aliasing problem • 1D example (sinewave): Aliasing problem • Sub-sampling may be dangerous…. • Characteristic errors may appear: – “Wagon wheels rolling the wrong way in movies” – “Checkerboards disintegrate in ray tracing” – “Striped shirts look funny on color television” Aliasing in video Aliasing in graphics Resample the checkerboard by taking one sample at each circle. In the case of the top left board, new representation is reasonable. Top right also yields a reasonable representation. Bottom left is all black (dubious) and bottom right has checks that are too big. Sampling scheme is crucially related to frequency Constructing a pyramid by taking every second pixel leads to layers that badly misrepresent the top layer Nyquist-Shannon sampling theorem • When sampling a signal at discrete intervals, the sampling frequency must be 2 fmax • fmax = max frequency of the input signal • This will allow to reconstruct the original from the sampled version without aliasing v v v good bad Anti-aliasing Solutions: • Sample more often • Get rid of all frequencies that are greater than half the new sampling frequency – Will lose information – But it’s better than aliasing – Apply a smoothing filter Algorithm for downsampling by factor of 2 1. Start with image(h, w) 2. Apply low-pass filter im_blur = imfilter(image, fspecial(‘gaussian’, 7, 1)) 3. Sample every other pixel im_small = im_blur(1:2:end, 1:2:end); Sampling without smoothing. Top row shows the images, sampled at every second pixel to get the next; bottom row shows the magnitude spectrum of these images. substantial aliasing Magnitude of the Fourier transform of each image displayed as a log scale (constant component is at the center). Fourier transform of a resampled image is obtained by scaling the Fourier transform of the original image and then tiling the plane Sampling with smoothing (small σ). Top row shows the images. We get the next image by smoothing the image with a Gaussian with σ=1 pixel, then sampling at every second pixel to get the next; bottom row shows the magnitude spectrum of these images. reducing aliasing Low pass filter suppresses high frequency components with less aliasing Sampling with smoothing (large σ). Top row shows the images. We get the next image by smoothing the image with a Gaussian with σ=2 pixels, then sampling at every second pixel to get the next; bottom row shows the magnitude spectrum of these images. lose details Large σ ➔ less aliasing, but with little detail Gaussian is not an ideal low-pass filter Subsampling without pre-filtering 1/2 1/4 (2x zoom) 1/8 (4x zoom) Subsampling with Gaussian prefiltering Gaussian 1/2 G 1/4 G 1/8 Visual perception • Early processing in humans filters for various orientations and scales of frequency • Perceptual cues in the mid-high frequencies dominate perception • When we see an image from far away, we are effectively subsampling it Early Visual Processing: Multi-scale edge and blob filters Campbell-Robson contrast sensitivity curve Detect slight change in shades of gray before they become indistinguishable Contrast sensitivity is better for mid-range spatial frequencies Things to remember • Sometimes it makes sense to think of images and filtering in the frequency domain – Fourier analysis • Can be faster to filter using FFT for large images (N logN vs. 
N2 for autocorrelation) • Images are mostly smooth – Basis for compression • Remember to low-pass before sampling Edge orientation Fourier bases Teases away fast vs. slow changes in the image. This change of basis is the Fourier Transform Fourier bases in Matlab, check out: imagesc(log(abs(fftshift(fft2(im))))) Edge orientation Man-made scene Can change spectrum, then reconstruct Low and high pass filtering Sinc filter • What is the spatial representation of the hard cutoff in the frequency domain? Frequency Domain Spatial Domain Review 1. Match the spatial domain image to the Fourier magnitude image 1 2 3 4 5 B A C E D Practice question 1. Match the spatial domain image to the Fourier magnitude image 1 2 3 4 5 B A C E D 1 – D, 2 – B, 3 – A, 4 – E, 5 - C CSE 185 Introduction to Computer Vision Image Filtering: Templates, Image Pyramids, and Filter Banks Image filtering • Template matching • Image Pyramids • Filter banks and texture Template matching • Goal: find in image • Main challenge: What is a good similarity or distance measure between two patches? – – – – Correlation Zero-mean correlation Sum Square Difference Normalized Cross Correlation Matching with filters • Goal: find in image • Method 0: filter the image with eye patch h[m, n] = g[k , l ] f [m + k , n + l ] k ,l f = image g = filter What went wrong? Problem: response is stronger for higher intensity Input Filtered Image Matching with filters • Goal: find in image • Method 1: filter the image with zero-mean eye h[ m, n] = ( f [ k , l ] − f ) ( g[ m + k , n + l ] ) k ,l True detections Problem: response is sensitive to gain/contrast: pixels in filter that are near the mean have little effect (does not require pixel values in image to be near or proportional to values in filter) Input Filtered Image (scaled) False detections Thresholded Image Matching with filters • Goal: find in image • Method 2: SSD h[ m, n] = ( g[ k , l ] − f [ m + k , n + l ] )2 k ,l True detections Problem: SSD sensitive to average intensity Input sqrt(SSD) Thresholded Image Matching with filters What’s the potential downside of SSD? • Goal: find in image SSD sensitive to average intensity • Method 2: SSD h[ m, n] = ( g[ k , l ] − f [ m + k , n + l ] )2 k ,l Input 1- sqrt(SSD) Matching with filters • Goal: find in image • Method 3: Normalized cross-correlation Invariant to mean and scale of intensity mean template h[m, n] = mean image patch å(g[k, l]- g)( f [m - k, n - l]- f m,n ) k,l æ ö ççå(g[k, l]- g)2 å ( f [m - k, n - l]- fm,n )2 ÷÷ è k,l ø k,l 0.5 Dot product of two normalized vectors Matlab: normxcorr2(template, im) Matching with filters • Goal: find in image • Method 3: Normalized cross-correlation True detections Input Normalized X-Correlation Thresholded Image What is the best method to use? A: Depends • SSD: faster, sensitive to overall intensity • Normalized cross-correlation: slower, invariant to local average intensity and contrast • But really, neither of these baselines are representative of modern recognition How to find larger or smaller eyes? A: Image Pyramid Review of sampling Gaussian Filter Image Low-Pass Filtered Image Sample Low-Res Image Gaussian pyramid Source: Forsyth Template matching with image pyramids Input: Image, Template 1. Match template at current scale 2. Downsample image 3. Repeat 1-2 until image is very small 4. Take responses above some threshold, perhaps with non-maxima suppression Coarse-to-fine image registration 1. Compute Gaussian pyramid 2. Align with coarse pyramid 3. 
Successively align with finer pyramids – Search smaller range Why is this faster? Are we guaranteed to get the same result? 2D edge detection filters Laplacian of Gaussian Gaussian derivative of Gaussian is the Laplacian operator: Laplacian filter unit impulse Gaussian Laplacian of Gaussian Gaussian/Laplacian pyramid Can we reconstruct the original from the Laplacian pyramid? Laplacian pyramid Visual representation Hybrid Image Hybrid Image in Laplacian pyramid High frequency → Low frequency Image representation • Pixels: great for spatial resolution, poor access to frequency • Fourier transform: great for frequency, not for spatial info • Pyramids/filter banks: balance between spatial and frequency information Main uses of image pyramids • Compression • Object detection – Scale search – Features • Detecting stable interest points • Registration – Course-to-fine Application: Representing texture Source: Forsyth Texture and material 50 100 150 200 250 300 50 100 150 200 250 300 350 400 50 100 150 200 250 300 50 100 150 200 250 300 350 400 450 50 100 150 200 250 300 50 100 150 200 250 300 350 400 450 50 100 150 200 250 300 350 50 100 150 200 250 300 350 400 450 500 Texture and orientation 50 100 150 200 250 300 350 50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 550 50 100 150 200 250 300 350 50 100 150 200 250 300 350 400 450 500 Texture and scale What is texture? Regular or stochastic patterns caused by bumps, grooves, and/or markings How can we represent texture? • Compute responses of blobs and edges at various orientations and scales Overcomplete representation Leung-Malik Filter Bank • First and second order derivatives fo Gaussians at 6 orientations and 3 scales • 8 Laplacian of Gaussian and 4 Gaussian filters Code for filter banks: www.robots.ox.ac.uk/~vgg/research/texclass/filters.html Filter banks • Process image with each filter and keep responses (or squared/abs responses) How can we represent texture? • Measure responses of blobs and edges at various orientations and scales • Idea 1: Record simple statistics (e.g., mean, standard deviation) of absolute filter responses Match the texture to the response? Filters A B 1 2 C 3 Mean abs responses Representing texture by mean abs response Filters Mean abs responses Representing texture • Idea 2: take vectors of filter responses at each pixel and cluster them, then take histograms (more on in later weeks) Compression How is it that a 4MP image can be compressed to a few hundred KB without a noticeable change? Lossy image compression (JPEG) 64 basis functions DFT: complex values DCT: real values Block-based Discrete Cosine Transform (DCT) See https://www.mathworks.com/help/images/discrete-cosine-transform.html Slides: Efros Using DCT in JPEG • The first coefficient B(0,0) is the DC component, the average intensity • The top-left coeffs represent low frequencies, the bottom right – high frequencies Image compression using DCT • Quantize – More coarsely for high frequencies (which also tend to have smaller values) – Many quantized high frequency values will be zero • Encode – Can decode with inverse dct Filter responses Quantization table Quantized values JPEG compression summary 1. Convert image to YCrCb 2. Subsample color by factor of 2 – People have bad resolution for color 3. Split into blocks (8x8, typically), subtract 128 4. For each block a. Compute DCT coefficients b. Coarsely quantize • c. 
Many high frequency components will become zero Encode (e.g., with Huffman coding) http://en.wikipedia.org/wiki/YCbCr http://en.wikipedia.org/wiki/JPEG Lossless compression (PNG) 1.Predict that a pixel’s value based on its upper-left neighborhood 2.Store difference of predicted and actual value 3.Pkzip it (DEFLATE algorithm) Denoising Gaussian Filter Additive Gaussian Noise Reducing Gaussian noise Smoothing with larger standard deviations suppresses noise, but also blurs the image Reducing salt-and-pepper noise by Gaussian smoothing 3x3 5x5 7x7 Alternative idea: Median filtering • A median filter operates over a window by selecting the median intensity in the window • Is median filtering linear? Median filter • What advantage does median filtering have over Gaussian filtering? – Robustness to outliers Median filter Salt-and-pepper noise Median filtered • MATLAB: medfilt2(image, [h w]) Median vs. Gaussian filtering 3x3 Gaussian Median 5x5 7x7 Other non-linear filters • Weighted median (pixels further from center count less) • Clipped mean (average, ignoring few brightest and darkest pixels) • Bilateral filtering (weight by spatial distance and intensity difference) Bilateral filtering Review: Image filtering g[ , ] 1 1 1 1 1 1 1 1 1 h[.,.] f [.,.] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 90 90 0 0 90 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 h[m, n] = f [k , l ] g[m + k , n + l ] k ,l Credit: S. Seitz Image filtering g[ , ] 1 1 1 1 1 1 1 1 1 h[.,.] f [.,.] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 h[m, n] = f [k , l ] g[m + k , n + l ] k ,l Credit: S. Seitz Image filtering g[ , ] 1 1 1 1 1 1 1 1 1 h[.,.] f [.,.] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 90 0 90 90 90 0 0 0 0 0 90 90 90 90 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 20 h[m, n] = f [k , l ] g[m + k , n + l ] k ,l Credit: S. Seitz Filtering in spatial domain * = 1 0 -1 2 0 -2 1 0 -1 Filtering in frequency domain FFT FFT = Inverse FFT Review of image filtering • Filtering in frequency domain – Can be faster than filtering in spatial domain (for large filters) – Can help understand effect of filter – Algorithm: 1. Convert image and filter to fft (fft2 in matlab) 2. Pointwise-multiply ffts 3. 
Convert result to spatial domain with ifft2 Review of image filtering • Linear filters for basic processing – Edge filter (high-pass) – Gaussian filter (low-pass) [-1 1] Gaussian FFT of Gradient Filter FFT of Gaussian Review of image filtering • Derivative of Gaussian Review of image filtering • Applications of filters – Template matching (SSD or Normxcorr2) • SSD can be done with linear filters, is sensitive to overall intensity – Gaussian pyramid • Coarse-to-fine search, multi-scale detection – Laplacian pyramid • Teases apart different frequency bands while keeping spatial information • Can be used for compositing in graphics – Downsampling • Need to sufficiently low-pass before downsampling EECS 274 Computer Vision Linear Filters and Edges Linear filters and edges • • • • Linear filters Scale space Gaussian pyramid and wavelets Edges • Reading: Chapters 7 and 8 of FP, Chapter 3 of S Linear filters • General process: • Example: smoothing by averaging – Form new image whose pixels are a weighted sum of original pixel values, using the same set of weights at each point. • Properties – Output is a linear function of the input – Output is a shift-invariant function of the input (i.e. shift the input image two pixels to the left, the output is shifted two pixels to the left) 1 Rij = (2k + 1) 2 – form the average of pixels in a neighbourhood • Example: smoothing with a Gaussian – form a weighted average of pixels in a neighbourhood • Example: finding a derivative – form a weighted average of pixels in a neighbourhood u =i + k v = j + k u =i − k 1 Fuv = 2 ( 2 k + 1 ) v= j −k F uv u ,v Convolution • Represent these weights as an image, H • H is usually called the kernel • Operation is called convolution – it’s associative • Notation in textbook: Rij = H i −u , j −v Fu ,v u ,v R = H F g (i, j ) = f (i + k , j + l )h(k , l ) k ,l g (i, j ) = f (i − k , j − l )h(k , l ) k ,l g = f h g = f *h • Notice wierd order of indices – all examples can be put in this form – it’s a result of the derivation expressing any shift-invariant linear operator as a convolution. Example: smoothing by averaging Smoothing with a Gaussian • Smoothing with an average actually doesn’t compare at all well with a defocussed lens – Most obvious difference is that a single point of light viewed in a defocussed lens looks like a fuzzy blob; but the averaging process would give a little square. 
• A Gaussian gives a good model of a fuzzy blob An isotropic Gaussian • The picture shows a smoothing kernel proportional to x2 + y2 G ( x, y ) = exp − 2 2 2 2 1 (which is a reasonable model of a circularly symmetric fuzzy blob) Smoothing with a Gaussian Differentiation and convolution • Recall f f ( x + , y ) − f ( x, y ) = lim x →0 • Now this is linear and shift invariant, so must be the result of a convolution 0 0 0 H = 1 0 − 1 0 0 0 • We could approximate this as f ( xn +1 , y ) − f ( xn , y ) f x x (which is obviously a convolution; it’s not a very good way to do things, as we shall see) Finite differences Partial derivative in y axis, respond strongly to horizontal edges Partial derivative in x axis, respond strongly to vertical edges Spatial filter • Approximation f 2 1/ 2 2 f f x f = f , | f |= + x y y | f | [( z5 − z8 ) 2 + ( z5 − z6 ) 2 ]1/ 2 | f || ( z5 − z8 ) | + | z5 − z6 ) | | f | | ( z5 − z9 ) | + | z6 − z8 ) | 1/ 2 | f || ( z5 − z9 ) | + | z6 − z8 ) | z1 z2 z3 z4 z5 z6 z7 z8 z9 Roberts operator One of the earliest edge detection algorithm by Lawrence Roberts Sobel operator One of the earliest edge detection algorithm by Irwine Sobel Noise • Simplest noise model – independent stationary additive Gaussian noise – the noise value at each pixel is given by an independent draw from the same normal probability distribution • Issues – this model allows noise values that could be greater than maximum camera output or less than zero – for small standard deviations, this isn’t too much of a problem it’s a fairly good model – independence may not be justified (e.g. damage to lens) – may not be stationary (e.g. thermal gradients in the ccd) sigma=1 sigma=16 Finite differences and noise • Finite difference filters respond strongly to noise – obvious reason: image noise results in pixels that look very different from their neighbours • Generally, the larger the noise the stronger the response • What is to be done? – intuitively, most pixels in images look quite a lot like their neighbours – this is true even at an edge; along the edge they’re similar, across the edge they’re not – suggests that smoothing the image should help, by forcing pixels different to their neighbours (=noise pixels?) to look more like neighbours Finite differences responding to noise σ=0.03 σ=0.09 Increasing noise -> (this is zero mean additive Gaussian noise) Difference operation is strongly influenced by noise (the image is increasingly grainy) The response of a linear filter to noise • Do only stationary independent additive Gaussian noise with zero mean (non-zero mean is easily dealt with) • Mean: – output is a weighted sum of inputs – so we want mean of a weighted sum of zero mean normal random variables – must be zero • Variance: – recall • variance of a sum of random variables is sum of their variances • variance of constant times random variable is constant^2 times variance – then if is noise variance and kernel is K, variance of response is 2 K2 u ,v u ,v Filter responses are correlated • Over scales similar to the scale of the filter • Filtered noise is sometimes useful – looks like some natural textures, can be used to simulate fire, etc. 
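The variance formula above is easy to verify empirically; a minimal sketch (the image size, noise level, and kernel are arbitrary choices, and the match is approximate because of boundary handling):

% Check that filtering zero-mean Gaussian noise with kernel K gives
% output variance approximately sigma^2 * sum(K(:).^2), as derived above.
sigma = 16/255;                          % noise standard deviation
noise = sigma * randn(512, 512);         % stationary additive Gaussian noise
K = fspecial('gaussian', 11, 2);         % smoothing kernel

smoothed = imfilter(noise, K);
var(smoothed(:))                         % measured output variance
sigma^2 * sum(K(:).^2)                   % predicted output variance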
Smoothed noise Smoothing stationary additive Gaussian noise results in signals where pixel values tend to increasingly similar to the value of neighboring pixels (as filter kernel causes correlation) Smoothing reduces noise • Generally expect pixels to “be like” their neighbours – surfaces turn slowly – relatively few reflectance changes • Generally expect noise processes to be independent from pixel to pixel • Implies that smoothing suppresses noise, for appropriate noise models • Scale – the parameter in the symmetric Gaussian – as this parameter goes up, more pixels are involved in the average – and the image gets more blurred – and noise is more effectively suppressed The effects of smoothing Each row shows smoothing with gaussians of different width; each column shows different realisations of an image of gaussian noise. Gradients and edges • Points of sharp change in an image are interesting: – – – – change in reflectance change in object change in illumination noise • Sometimes called edge points • General strategy – determine image gradient – now mark points where gradient magnitude is particularly large wrt neighbours (ideally, curves of such points). In one dimension, the 2nd derivative of a signal is zero when the derivative magnitude is extremal → a good place to look for edge is where the second derivative is zero. Smoothing and differentiation • Issue: noise – smooth before differentiation – two convolutions to smooth, then differentiate? – actually, no - we can use a derivative of Gaussian filter • because differentiation is convolution, and convolution is associative 1 pixel 3 pixels 7 pixels The scale of the smoothing filter affects derivative estimates, and also the semantics of the edges recovered. Image Gradient • Gradient equation: • Represents direction of most rapid change in intensity • Gradient direction: • The edge strength is given by the gradient magnitude Theory of Edge Detection Ideal edge L( x, y ) = x sin − y cos + = 0 B1 : L( x, y ) 0 B2 : L( x, y ) 0 Unit step function: 1 u (t ) = 1 2 0 for t 0 for t = 0 for t 0 u (t ) = (s )ds t − Image intensity (brightness): I ( x, y ) = B1 + (B2 − B1 )u ( x sin − y cos + ) Theory of Edge Detection • Image intensity (brightness): I ( x, y ) = B1 + (B2 − B1 )u ( x sin − y cos + ) • Partial derivatives (gradients): I = + sin (B2 − B1 ) ( x sin − y cos + ) x I = − cos (B2 − B1 ) ( x sin − y cos + ) y • Squared gradient: 2 I I 2 s ( x, y ) = + = (B2 − B1 ) ( x sin − y cos + ) x y Edge Magnitude: s ( x, y ) I I / (normal of the edge) arctan Edge Orientation: y x 2 Rotationally symmetric, non-linear operator Theory of Edge Detection • Image intensity (brightness): I ( x, y ) = B1 + (B2 − B1 )u ( x sin − y cos + ) • Partial derivatives (gradients): I = + sin (B2 − B1 ) ( x sin − y cos + ) x I = − cos (B2 − B1 ) ( x sin − y cos + ) y • Laplacian: 2 2 I I 2 I = 2 + 2 = (B2 − B1 ) ' ( x sin − y cos + ) x y I x Rotationally symmetric, linear operator 2I x 2 zero-crossing Discrete Edge Operators • How can we differentiate a discrete image? 
Finite difference approximations: I 1 ((I i+1, j +1 − I i, j +1 ) + (I i+1, j − I i, j )) x 2 I 1 ((I i+1, j +1 − I i+1, j ) + (I i, j +1 − I i, j )) y 2 Convolution masks : I 1 x 2 I 1 y 2 I i , j +1 I i +1, j +1 Ii, j I i +1, j Discrete Edge Operators • Second order partial derivatives: I i −1, j +1 I i , j +1 I i +1, j +1 2I 1 (I i−1, j − 2I i, j + I i+1, j ) 2 2 x 2I 1 (I i, j −1 − 2I i, j + I i, j +1 ) 2 2 y I i −1, j I i , j I i +1, j I i −1, j −1 I i , j −1 I i +1, j −1 • Laplacian : 2I 2I I= 2+ 2 x y 2 Convolution masks : 2 I 1 2 0 1 0 1 -4 1 0 1 0 or 1 6 2 1 4 1 4 -20 4 1 4 1 (more accurate) Effects of Noise • Consider a single row or column of the image – Plotting intensity as a function of position gives a signal Where is the edge?? Solution: Smooth First Where is the edge? Look for peaks in Derivative Theorem of Convolution …saves us one operation. Laplacian of Gaussian (LoG) 2 2 (h f ) = 2 2 x x h f Laplacian of Gaussian Laplacian of Gaussian operator Where is the edge? Zero-crossings of bottom graph ! 2D Gaussian Edge Operators Gaussian Derivative of Gaussian (DoG) Laplacian of Gaussian Mexican Hat (Sombrero) • is the Laplacian operator: threshold sigma=4 scale contrast=1 LOG zero crossings sigma=2 contrast=4 We still have unfortunate behavior at corners and trihedral areas σ=1 pixel σ=2 pixel There are three major issues: 1) The gradient magnitude at different scales is different; which should we choose? 2) The gradient magnitude is large along thick trail; how do we identify the significant points? 3) How do we link the relevant points up into curves? We wish to mark points along the curve where the magnitude is biggest. We can do this by looking for a maximum along a slice normal to the curve (non-maximum suppression). These points should form a curve. There are then two algorithmic issues: at which point is the maximum, and where is the next one? Non-maximum Suppression • Check if pixel is local maximum along gradient direction – requires checking interpolated pixels p and r Predicting the next edge point Assume the marked point is an edge point. Then we construct the tangent to the edge curve (which is normal to the gradient at that point) and use this to predict the next points (here either r or s). Edge following: edge points occur along curve like chains fine scale, high threshold σ=1 pixel coarse scale, high threshold σ=4 pixels coarse scale, low threshold σ=4 pixels Canny Edge Operator • Smooth image I with 2D Gaussian: GI • Find local edge normal directions for each pixel (G I ) n= (G I ) • Compute edge magnitudes (G I ) • Locate edges by finding zero-crossings along the edge normal directions (non-maximum suppression) 2 (G I ) =0 2 n Canny edge detector Original image magnitude of gradient Canny edge detector After non-maximum suppression Canny edge detector original Canny with • The choice of – large – small Canny with depends on desired behavior detects large scale edges detects fine features Difference of Gaussians (DoG) • Laplacian of Gaussian can be approximated by the difference between two different Gaussians DoG Edge Detection (a) (b) (b)-(a) Unsharp Masking 100 200 – 300 = 400 500 200 400 +a blurred positive 600 increase details in bright area 800 = Edge Thresholding • Standard Thresholding: • Can only select “strong” edges. • Does not guarantee “continuity”. • Hysteresis based Thresholding (use two thresholds) Example: For “maybe” edges, decide on the edge if neighboring pixel is a strong edge. 
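MATLAB's built-in edge function exposes this two-threshold hysteresis scheme for its Canny detector; a minimal sketch (the threshold values and sigma are arbitrary picks for illustration):

% Canny edge detection with hysteresis thresholds.
I = im2double(imread('cameraman.tif'));
low  = 0.05;                              % weak ("maybe") edge threshold
high = 0.20;                              % strong edge threshold
BW = edge(I, 'canny', [low high], 1.5);   % last argument: Gaussian sigma
imshow(BW)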
Remaining issues • Check that maximum value of gradient value is sufficiently large – drop-outs? use hysteresis • use a high threshold to start edge curves and a low threshold to continue them. Notice • Something nasty is happening at corners • Scale affects contrast • Edges aren’t bounding contours Scale space • Framework for multi-scale signal processing • Define image structure in terms of scale with kernel size, σ • Find scale invariant operations • See Scale-theory in computer vision by Tony Lindeberg Orientation representations • The gradient magnitude is affected by illumination changes – but it’s direction isn’t • We can describe image patches by the swing of the gradient orientation • Important types: – constant window • small gradient mags – edge window • few large gradient mags in one direction – flow window • many large gradient mags in one direction – corner window • large gradient mags that swing Representing windows Looking at variations H= (I )(I ) T window G G G G I I I I x x x y = G G G window G x I y I y I y I • Types – constant • small eigenvalues – edge • one medium, one small – flow • one large, one small – corner • two large eigenvalues Plotting in ellipses to understand the matrix (variation of gradient) of 3 × 3 window ( x, y )T H −1 ( x, y ) = Major and minor axes are along the eigenvectors of H, and the extent corresponds to the size of eigenvalues Plotting in ellipses to understand the matrix (variation of gradient) of 5 × 5 window Corners • Harris corner detector • Moravec corner detector • SIFT descriptors Filters are templates • Applying a filter at some point can be seen as taking a dotproduct between the image and some vector • Filtering the image is a set of dot products convolution is equivalent to taking the dot product of the filter with an image patch • Insight Rij = H i −u , j −v Fu ,v u ,v Rij = H −u , − v Fu ,v u ,v – filters look like the effects they are intended to find – filters find effects they look like derivative of Gaussian used as edge detection Normalized correlation • Think of filters of a dot product – now measure the angle – i.e normalised correlation output is filter output, divided by root sum of squares of values over which filter lies – cheap and efficient method for finding patterns • Tricks: – ensure that filter has a zero response to a constant region (helps reduce response to irrelevant background) – subtract image average when computing the normalizing constant (i.e. 
subtract the image mean in the neighbourhood) – absolute value deals with contrast reversal Positive responses Zero mean image, -1:1 scale Zero mean image, -max:max scale Positive responses Zero mean image, -1:1 scale Zero mean image, -max:max scale Figure from “Computer Vision for Interactive Computer Graphics,” W.Freeman et al, IEEE Computer Graphics and Applications, 1998 copyright 1998, IEEE Anistropic scaling • Symmetric Gaussian smoothing tends ot blur out edges rather aggressively • Prefer an oriented smoothing operator that smoothes – aggressively perpendicular to the gradient – little along the gradient • Also known as edge preserving smoothing • Formulated with diffusion equation Diffusion equation for anistropic filter • PDE that describes fluctuations in a material undergoing diffusion (r , t ) ( = D( , r ) (r , t ) ) t • For istropic filter 2 2 2 = (c( x, y, )) ) = C = 2 + 2 = 2 for istropic cases x y ( x, y,0) = I ( x, y ) as initial condition • For anistropic filter = (c( x, y, )) ) = c( x, y, ) 2 + (c( x, y, )) If c( x, y, ) = 1, just as before If c( x, y, ) = 0, no smoothing Anistropic filtering Edge preserving filter • Bilateral filter – Replace pixel’s value by a weighted average of its neighbors in both space and intensity One iteration Multiple iterations CSE 185 Introduction to Computer Vision Edges Edges • Edges • Scale space • Reading: Chapter 4.2 Edge detection • Goal: Identify sudden changes (discontinuities) in an image – Intuitively, most semantic and shape information from the image can be encoded in the edges – More compact than pixels • Ideal: artist’s line drawing (but artist is also using object-level knowledge) Why do we care about edges? • Extract information, recognize objects • Recover geometry and viewpoint – Measure distance – Compute area • See also single-view 3D Vanishing line Vanishing point [Criminisi et al. ICCV 99] Vertical vanishing point (at infinity) Vanishing point Origin of Edges surface normal discontinuity depth discontinuity surface color discontinuity illumination discontinuity • Edges are caused by a variety of factors Marr’s theory Example Closeup of edges Source: D. Hoiem Closeup of edges Closeup of edges Closeup of edges Characterizing edges • An edge is a place of rapid change in the image intensity function image intensity function (along horizontal scanline) first derivative edges correspond to extrema of derivative 1st and 2nd derivative See https://www.youtube.com/watch?v=uNP6ZwQ3r6A Intensity profile With a little Gaussian noise Gradient Consider pixel value as height Plot intensity value as needle map, i.e., (x, y, z), where x, y are image coordinate and z is the pixel value (color bar indicates the intensity value from low to high) I=imread('lena_gray.png'); [x,y]=size(I); X=1:x; Y=1:y; [xx,yy]=meshgrid(Y,X); i=im2double(I); figure;mesh(xx,yy,i); colorbar figure;imshow(i) First-order derivative filter Second-order derivative filter Effects of noise • Consider a single row or column of the image – Plotting intensity as a function of position gives a signal Where is the edge? Effects of noise • Difference filters respond strongly to noise – Image noise results in pixels that look very different from their neighbors – Generally, the larger the noise the stronger the response • What can we do about it? 
Solution: smooth first f g f*g d ( f g) dx • To find edges, look for peaks in d ( f g) dx Derivative theorem of convolution Recall associative property: a * (b * c) = (a * b) * c • Differentiation is convolution, and convolution is associative: dxd ( f g ) = f dxd g • This saves us one operation: f d g dx f d g dx Laplacian of Gaussian (LoG) 2 2 (h f ) = 2 2 x x h f Laplacian of Gaussian Laplacian of Gaussian operator Where is the edge? Zero-crossings of bottom graph ! Derivative of Gaussian filter * [1 -1] = Finite differences Partial derivative in y axis, respond strongly to horizontal edges Partial derivative in x axis, respond strongly to vertical edges Differentiation and convolution • Recall f f ( x + , y ) − f ( x, y ) = lim x → 0 • Now this is linear and shift invariant, so must be the result of a convolution 0 0 0 H = 1 0 − 1 0 0 0 • We could approximate this as f ( xn +1 , y ) − f ( xn , y ) f x x (which is obviously a convolution; it’s not a very good way to do things, as we shall see) Discrete edge operators • How can we differentiate a discrete image? Finite difference approximations: I 1 ((I i+1, j +1 − I i, j +1 ) + (I i+1, j − I i, j )) x 2 I 1 ((I i+1, j +1 − I i+1, j ) + (I i, j +1 − I i, j )) y 2 I i , j +1 I i +1, j +1 Ii, j Convolution masks : I 1 x 2 -1 1 -1 1 I 1 y 2 1 1 -1 -1 See https://en.wikipedia.org/wiki/Edge_detection I i +1, j 𝛆 Discrete edge operators • Second order partial derivatives: I i −1, j +1 I i , j +1 I i +1, j +1 2I 1 2 (I i −1, j − 2 I i , j + I i +1, j ) 2 x 2I 1 (I i, j −1 − 2 I i, j + I i, j +1 ) 2 2 y I i −1, j I i −1, j −1 I i , j −1 I i +1, j −1 𝜕𝐼 1 ≈ (𝐼𝑖+1,𝑗 − 𝐼𝑖,𝑗 ) 𝜕𝑥 𝜖 2 𝜕 𝐼 1 1 ≈ ( (𝐼 − 𝐼𝑖,𝑗 ) 𝜕𝑥 2 𝜖 𝜖 𝑖+1,𝑗 • Laplacian : 2 2 I I 2I = 2 + 2 x y I i , j I i +1, j 1 𝜖 − (𝐼𝑖,𝑗 - 𝐼𝑖−1,𝑗 )) Convolution masks : 2I 1 2 0 1 0 1 -4 1 0 1 0 or 1 6 2 1 4 1 4 -20 4 1 4 1 (more accurate) Spatial filter • Approximation of image gradient f 2 1/ 2 2 f f x f = f , | f |= + x y y | f | [( z5 − z8 ) 2 + ( z5 − z6 ) 2 ]1/ 2 | f || ( z5 − z8 ) | + | z5 − z6 ) | | f | | ( z5 − z9 ) | + | z6 − z8 ) | 1/ 2 | f || ( z5 − z9 ) | + | z6 − z8 ) | different ways to compute gradient magnitude Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 filter weight Roberts operator One of the earliest edge detection algorithm by Lawrence Roberts Convolve image with ÑI(x, y) = G(x, y) = Gx2 + Gy2 , to get Gx and Gy q (x, y) = arctan(Gy / Gx ) Sobel operator One of the earliest edge detection algorithm by Irwine Sobel Gx+Gy Gx Gy Image gradient • Gradient equation 𝑓(𝑥, 𝑦) • Represents direction of most rapid change in intensity • Gradient direction: • The edge strength is given by the gradient magnitude 2D Gaussian edge operators Gaussian Derivative of Gaussian (DoG) Laplacian of Gaussian (LoG) Mexican Hat (Sombrero) • is the Laplacian operator: Marr-Hildreth algorithm Difference of Gaussians (DoG) • Laplacian of Gaussian can be approximated by the difference between two different Gaussians Smoothing and localization 1 pixel 3 pixels 7 pixels • Smoothed derivative removes noise, but blurs edge. Also finds edges at different “scales”. threshold sigma=4 scale contrast=1 LOG zero crossings sigma=2 contrast=4 We still have unfortunate behavior at corners and trihedral areas Implementation issues • The gradient magnitude is large along a thick “trail” or “ridge,” so how do we identify the actual edge points? • How do we link the edge points to form curves? 
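As a concrete illustration of the derivative-of-Gaussian operators above, a minimal MATLAB sketch (the kernel size and sigma are arbitrary illustration choices; lena_gray.png is the test image used earlier): build a Gaussian, differentiate it, and convolve the image once to get a smoothed gradient, its magnitude, and its orientation. The last lines check the derivative theorem of convolution numerically.

I = im2double(imread('lena_gray.png'));        % grayscale test image
sigma = 2;
G = fspecial('gaussian', 6*sigma+1, sigma);    % 2D Gaussian kernel
[Gx, Gy] = gradient(G);                        % derivative-of-Gaussian filters (x and y)
Ix = imfilter(I, Gx, 'replicate');             % smoothed partial derivative in x (one pass)
Iy = imfilter(I, Gy, 'replicate');             % smoothed partial derivative in y
mag   = sqrt(Ix.^2 + Iy.^2);                   % gradient magnitude (edge strength)
theta = atan2(Iy, Ix);                         % gradient orientation
% Derivative theorem of convolution: d/dx (G * I) = (dG/dx) * I, so smoothing and
% then differentiating agrees with the one-pass filter above (exactly in the
% continuum; up to discretization and border handling here).
S = imfilter(I, G, 'replicate');               % smooth first ...
[Sx, ~] = gradient(S);                         % ... then differentiate
figure; imshow(mag, []);                       % bright pixels = strong edges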
Designing an edge detector • Criteria for a good edge detector: – Good detection • the optimal detector should find all real edges, ignoring noise or other artifacts – Good localization • the edges detected must be as close as possible to the true edges • the detector must return one point only for each true edge point • Cues of edge detection – Differences in color, intensity, or texture across the boundary – Continuity and closure – High-level knowledge Canny edge detector • Probably the most widely used edge detector in computer vision • Theoretical model: step-edges corrupted by additive Gaussian noise • John Canny has shown that the first derivative of the Gaussian closely approximates the operator that optimizes the product of signalto-noise ratio and localization J. Canny, A Computational Approach To Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679-714, 1986. http://www.mathworks.com/discovery/edge-detection.html Example original image (Lena) Derivative of Gaussian filter x-direction y-direction Compute gradients X-Derivative of Gaussian Y-Derivative of Gaussian Gradient Magnitude Get orientation at each pixel • Threshold at minimum level • Get orientation theta = atan2(gy, gx) Non-maximum suppression for each orientation At q, we have a maximum if the value is larger than those at both p and at r. Interpolate to get these values. Select the single maximum point across the width of an edge A variant is to use Laplacian zero crossing along the gradient direction. Edge linking Assume the marked point is an edge point. Then we construct the tangent to the edge curve (which is normal to the gradient at that point) and use this to predict the next points (here either r or s). Sidebar: Bilinear Interpolation http://en.wikipedia.org/wiki/Bilinear_interpolation Sidebar: Interpolation options • imx2 = imresize(im, 2, interpolation_type) • ‘nearest’ – Copy value from nearest known – Very fast but creates blocky edges • ‘bilinear’ – Weighted average from four nearest known pixels – Fast and reasonable results • ‘bicubic’ (default) – Non-linear smoothing over larger area (4x4) – Slower, visually appealing, may create negative pixel values Before non-max suppression After non-max suppression Hysteresis thresholding • Threshold at low/high levels to get weak/strong edge pixels • Use connected components, starting from strong edge pixels Hysteresis thresholding • Check that maximum value of gradient value is sufficiently large – drop-outs? use hysteresis • use a high threshold to start edge curves and a low threshold to continue them. Source: S. Seitz Final Canny edges Canny edge detector 1. Filter image with x, y derivatives of Gaussian 2. Find magnitude and orientation of gradient 3. Non-maximum suppression: – Thin multi-pixel wide “ridges” down to single pixel width 4. Thresholding and linking (hysteresis): – Define two thresholds: low and high – Use the high threshold to start edge curves and the low threshold to continue them • MATLAB: edge(image, ‘canny’) Effect of (Gaussian kernel size) original Canny with Canny with The choice of depends on desired behavior • large detects large scale edges • small detects fine features Where do humans see boundaries? 
image human segmentation gradient magnitude • Berkeley segmentation database: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/ pB boundary detector Martin, Fowlkes, Malik 2004: Learning to Detect Natural Boundaries… http://www.eecs.berkeley.edu/Research/Projects/CS/ vision/grouping/papers/mfm-pami-boundary.pdf Brightness Color Texture Combined Human Pb (0.88) Human (0.95) State of edge detection • Local edge detection works well – But many false positives from illumination and texture edges • Some methods to take into account longer contours, but could probably do better • Poor use of object and high-level information Sketching • Learn from artist’s strokes so that edges are more likely in certain parts of the face. Berger et al. SIGGRAPH 2013 CSE 185 Introduction to Computer Vision Local Invariant Features Local features • Interest points • Descriptors • Reading: Chapter 4 Correspondence across views • Correspondence: matching points, patches, edges, or regions across images ≈ Example: estimating fundamental matrix that corresponds two views Example: structure from motion Applications • Feature points are used for: – – – – – – Image alignment 3D reconstruction Motion tracking Robot navigation Indexing and database retrieval Object recognition Interest points • Note: “interest points” = “keypoints”, also sometimes called “features” • Many applications – tracking: which points are good to track? – recognition: find patches likely to tell us something about object category – 3D reconstruction: find correspondences across different views Interest points original • Suppose you need to click on some point, go away and come back after I deform the image, and click on the same points again. – Which points would you choose? deformed Keypoint matching A1 A2 A3 fA fB d ( f A, fB ) T 1. Find a set of distinctive keypoints 2. Define a region around each keypoint 3. Extract and normalize the region content 4. Compute a local descriptor from the normalized region 5. Match local descriptors Goals for keypoints Detect points that are repeatable and distinctive Key trade-offs A1 A2 A3 Detection of interest points More Repeatable Robust detection Precise localization More Points Robust to occlusion Works with less texture Description of patches More Distinctive Minimize wrong matches More Flexible Robust to expected variations Maximize correct matches Invariant local features • Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters Features Descriptors Natural Language • Stop words: occur frequently but do not contain much information Choosing interest points Where would you tell your friend to meet you? Choosing interest points Where would you tell your friend to meet you? Feature extraction: Corners Why extract features? • Motivation: panorama stitching – We have two images – how do we combine them? Local features: main components 1) Detection: Identify the interest points 2) Description: Extract vector (1) (1) x = [ x , , x ] 1 1 d feature descriptor surrounding each interest point. 
3) Matching: Determine correspondence between descriptors in two views x2 = [ x1( 2) ,, xd( 2) ] 444 Characteristics of good features • Repeatability – The same feature can be found in several images despite geometric and photometric transformations • Saliency – Each feature is distinctive • Compactness and efficiency – Fewer features than image pixels • Locality – A feature occupies a relatively small area of the image; robust to clutter and occlusion Interest operator repeatability • We want to detect (at least some of) the same points in both images No chance to find true matches! • Yet we have to be able to run the detection procedure independently per image Descriptor distinctiveness • We want to be able to reliably determine which point goes with which ? • Must provide some invariance to geometric and photometric differences between the two views Local features: main components 1) Detection: Identify the interest points 2) Description: Extract vector feature descriptor surrounding each interest point 3) Matching: Determine correspondence between descriptors in two views Many detectors Hessian & Harris Laplacian, DoG Harris-/Hessian-Laplace Harris-/Hessian-Affine EBR and IBR MSER Salient Regions Others… [Beaudet ‘78], [Harris ‘88] [Lindeberg ‘98], [Lowe 1999] [Mikolajczyk & Schmid ‘01] [Mikolajczyk & Schmid ‘04] [Tuytelaars & Van Gool ‘04] [Matas ‘02] [Kadir & Brady ‘01] Corner detection • We should easily recognize the point by looking through a small window • Shifting a window in any direction should give a large change in intensity “flat” region: no change in all directions “edge”: no change along the edge direction “corner”: significant change in all directions Finding corners • Key property: in the region around a corner, image gradient has two or more dominant directions • Corners are repeatable and distinctive Corner detection: Mathematics Change in appearance of window w(x,y) for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y I(x, y) E(u, v) E(3,2) w(x, y) Corner detection: Mathematics Change in appearance of window w(x,y) for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y I(x, y) E(u, v) E(0,0) w(x, y) Corner detection: Mathematics Change in appearance of window w(x,y) for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y Window function Shifted intensity Window function w(x,y) = Intensity or 1 in window, 0 outside Gaussian Moravec corner detector Test each pixel with change of intensity for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y Window function Shifted intensity Four shifts: (u,v) = (1,0), (1,1), (0,1), (-1, 1) Look for local maxima in min{E} When does this idea fail? Intensity Moravec corner detector • In a region of uniform intensity, then the nearby patches will look similar. • On an edge, then nearby patches in a direction perpendicular to the edge will look quite different, but nearby patches in a direction parallel to the edge will result in only a small change. • On a feature with variation in all directions, then none of the nearby patches will look similar. 
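A minimal MATLAB sketch of the Moravec test just described (the window size, the threshold, and the wrap-around boundary handling via circshift are illustration shortcuts, not part of the original detector):

I = im2double(imread('lena_gray.png'));            % grayscale test image
w = ones(3);                                       % binary window function w(x,y)
shifts = [1 0; 1 1; 0 1; -1 1];                    % the four (u,v) shifts used by Moravec
Emin = inf(size(I));
for k = 1:size(shifts, 1)
    u = shifts(k, 1);  v = shifts(k, 2);
    D = circshift(I, [-v -u]) - I;                 % I(x+u, y+v) - I(x, y)  (wraps at the border)
    E = conv2(D.^2, w, 'same');                    % sum of squared differences over the window
    Emin = min(Emin, E);                           % Moravec response: min of E over the shifts
end
% Corner candidates: local maxima of the response above a threshold
corners = (Emin == imdilate(Emin, ones(3))) & (Emin > 0.1 * max(Emin(:)));

Flat regions give a small Emin for every shift, edges give a small Emin along the edge direction, and only corner-like windows keep a large minimum response, which is exactly the behavior summarized on the next slides.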
“flat” region: no change in all directions “edge”: no change along the edge direction “corner”: significant change in all directions E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y Four shifts: (u,v) = (1,0), (1,1), (0,1), (-1, 1) Look for local maxima in min{E} Problem of Moravec detector • Only a set of shifts at every 45 degree is considered • Noisy response due to a binary window function • Only minimum of E is taken into account Harris corner detector (1988) solves these problems. C. Harris and M. Stephens. "A Combined Corner and Edge Detector.“ Proceedings of the 4th Alvey Vision Conference: pages 147--151. Corner detection: Mathematics Change in appearance of window w(x,y) for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y We want to find out how this function behaves for small shifts E(u, v) Corner detection: Mathematics Change in appearance of window w(x,y) for the shift [u,v]: E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y We want to find out how this function behaves for small shifts Local quadratic approximation of E(u,v) in the neighborhood of (0,0) is given by the second-order Taylor expansion: Eu (0,0) 1 Euu (0,0) Euv (0,0) u E (u , v) E (0,0) + [u v] + [u v] v E ( 0 , 0 ) E ( 0 , 0 ) E ( 0 , 0 ) 2 vv v uv Taylor series • Taylor series of the function f at a ¥ f (n) (a) f (x) = å (x - a)n n! n=0 f ¢(a) f ¢¢(a) f ¢¢¢(a) 2 = f (a) + (x - a) + (x - a) + (x - a)3 + 1! 2! 3! • Maclaurin series (Taylor series when a=0) Taylor expansion • Second order Taylor expansion for vectors Local quadratic approximation of E(u,v) in the neighborhood of (0,0) is given by the second-order Taylor expansion: Eu (0,0) 1 Euu (0,0) Euv (0,0) u E (u , v) E (0,0) + [u v] + [u v] v E ( 0 , 0 ) E ( 0 , 0 ) E ( 0 , 0 ) 2 vv v uv Corner detection: Mathematics E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y Second-order Taylor expansion of E(u,v) about (0,0): Eu (0,0) 1 Euu (0,0) Euv (0,0) u E (u , v) E (0,0) + [u v] + [u v] v E ( 0 , 0 ) E ( 0 , 0 ) E ( 0 , 0 ) 2 vv v uv Eu (u , v) = 2 w( x, y )I ( x + u , y + v) − I ( x, y )I x ( x + u , y + v) x, y Euu (u , v) = 2 w( x, y )I x ( x + u , y + v) I x ( x + u , y + v) x, y + 2 w( x, y )I ( x + u , y + v) − I ( x, y )I xx ( x + u , y + v) x, y Euv (u , v) = 2 w( x, y )I y ( x + u , y + v) I x ( x + u , y + v) x, y + 2 w( x, y )I ( x + u , y + v) − I ( x, y )I xy ( x + u , y + v) x, y Corner detection: Mathematics E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y Second-order Taylor expansion of E(u,v) about (0,0): Eu (0,0) 1 Euu (0,0) Euv (0,0) u E (u , v) E (0,0) + [u v] + [u v] v E ( 0 , 0 ) E ( 0 , 0 ) E ( 0 , 0 ) 2 vv v uv E (0,0) = 0 Eu (0,0) = 0 Ev (0,0) = 0 Euu (0,0) = 2 w( x, y )I x ( x, y ) I x ( x, y ) x, y E vv (0,0) = 2 w( x, y )I y ( x, y ) I y ( x, y ) x, y Euv (0,0) = 2 w( x, y )I x ( x, y ) I y ( x, y ) x, y Corner detection: Mathematics The quadratic approximation simplifies to u E (u , v) [u v] M v where M is a second moment matrix computed from image derivatives: I x2 M = w( x, y ) x, y I x I y M IxI y 2 I y Simpler Derivation of E(u,v) E (u , v) = w( x, y ) I ( x + u , y + v) − I ( x, y ) 2 x, y f (x + u, y + v) = f (x, y) + uf x (x, y) + vf y (x, y) I(x + u, y + v) = I(x, y) + uI x (x, y) + vI y (x, y) å( I(x + u, y + v) - I(x, y)) » å (I(x, y) +uI (x, y) + vI (x, y) - I(x, y)) = å u I + 2uvI I + v I 2 x 2 2 x x y é 2 Ix ê é ù = åë u v û ê I I ë x y y 2 2 y ù I x I y úé u ù ê ú 2 ú Iy ë v û û 2 Corners as distinctive Interest Points I x I x M = w( x, y ) I 
x I y IxI y IyIy 2 x 2 matrix of image derivatives (averaged in neighborhood of a point). Notation: I Ix x I Iy y I I IxI y x y Interpreting the second moment matrix The surface E(u,v) is locally approximated by a quadratic form. Let’s try to understand its shape. u E (u , v) [u v] M v I x2 M = w( x, y ) x, y I x I y IxI y 2 I y Interpreting the second moment matrix First, consider the axis-aligned case (gradients are either horizontal or vertical) I x2 M = w( x, y ) x, y I x I y IxI y −1 1 M =R 2 I y 0 0 R 2 Diagonalization of M Here, λ1 and λ2 are eigenvalues of M (and R is a matrix consisting of eigenvectors) If either λ is close to 0, then this is not a corner, so look for locations where both are large. Interpreting the second moment matrix E(u, v) [u v] M u v u Consider a horizontal “slice” of E(u, v): [u v] M = const v This is the equation of an ellipse. Interpreting the second moment matrix u Consider a horizontal “slice” of E(u, v): [u v] M = const v This is the equation of an ellipse. 1 0 M =R R Diagonalization of M: 0 2 The axis lengths of the ellipse are determined by the eigenvalues and the orientation is determined by R −1 direction of the fastest change (max)-1/2 direction of the slowest change (min)-1/2 Visualization of second moment matrices Visualization of second moment matrices Interpreting the eigenvalues Classification of image points using eigenvalues of M: “Edge” 2 >> 1 2 “Corner” 1 and 2 are large, 1 ~ 2 ; E increases in all directions 1 and 2 are small; E is almost constant in all directions “Edge” 1 >> 2 “Flat” region 1 Corner response function R = det( M ) − a trace( M ) 2 = 12 − a (1 + 2 ) 2 α: constant (0.04 to 0.06) “Edge” R<0 “Corner” R>0 recall 1 0 M =R R 0 2 −1 |R| small R is a matrix with two orthonormal bases (eigenvector of unit norm) “Flat” region “Edge” R<0 Harris corner detector 1) Compute M matrix for each image window to get their cornerness scores. 2) Find points whose surrounding window gave large corner response (f>threshold) 3) Take the points of local maxima, i.e., perform non-maximum suppression Harris corner detector [Harris88] • Second moment matrix 1. Image I x2 ( D ) I x I y ( D ) ( I , D ) = g ( I ) derivatives 2 I I ( ) I ( ) x y D y D (optionally, blur first) det M = 12 trace M = 1 + 2 Iy Ix2 Iy2 IxIy g(Ix2) g(Iy2) g(IxIy) 2. Square of derivatives 3. Gaussian filter g(I) recall Ix 4. Cornerness function – both eigenvalues are strong har = det[ ( I , D)] − a [trace( ( I , D)) 2 ] = g ( I x2 ) g ( I y2 ) − [ g ( I x I y )]2 − a [ g ( I x2 ) + g ( I y2 )]2 5. Non-maxima suppression 476 har Harris Detector: Steps Harris Detector: Steps Compute corner response R Harris Detector: Steps Find points with large corner response: R>threshold Harris Detector: Steps Take only the points of local maxima of R Harris Detector: Steps Invariance and covariance • We want corner locations to be invariant to photometric transformations and covariant to geometric transformations – Invariance: image is transformed and corner locations do not change – Covariance: if we have two transformed versions of the same image, features should be detected in corresponding locations Affine intensity change I→aI+b • Only derivatives are used => invariance to intensity shift I → I + b • Intensity scaling: I → a I R R threshold x (image coordinate) x (image coordinate) Partially invariant to affine intensity change Image translation • Derivatives and window function are shift-invariant Corner location is covariant w.r.t. 
translation Image rotation Second moment ellipse rotates but its shape (i.e. eigenvalues) remains the same Corner location is covariant w.r.t. rotation Scaling Corner All points will be classified as edges Corner location is not covariant to scaling! More local features • • • • • • • • • • SIFT (Scale Invariant Feature Transform) Harris Laplace Harris Affine Hessian detector Hessian Laplace Hessian Affine MSER (Maximally Stable Extremal Regions) SURF (Speeded-Up Robust Feature) BRIEF (Boundary Robust Independent Elementary Features) ORB (Oriented BRIEF) CSE 185 Introduction to Computer Vision Feature Matching Feature matching • Keypoint matching • Hough transform • Reading: Chapter 4 Feature matching • Correspondence: matching points, patches, edges, or regions across images ≈ Keypoint matching A1 A2 A3 fA fB d ( f A, fB ) T 1. Find a set of distinctive keypoints 2. Define a region around each keypoint 3. Extract and normalize the region content 4. Compute a local descriptor from the normalized region 5. Match local descriptors Review: Interest points • Keypoint detection: repeatable and distinctive – Corners, blobs, stable regions – Harris, DoG, MSER – SIFT Which interest detector • What do you want it for? – Precise localization in x-y: Harris – Good localization in scale: Difference of Gaussian – Flexible region shape: MSER • Best choice often application dependent – Harris-/Hessian-Laplace/DoG work well for many natural categories – MSER works well for buildings and printed things • Why choose? – Get more points with more detectors • There have been extensive evaluations/comparisons – [Mikolajczyk et al., IJCV’05, PAMI’05] – All detectors/descriptors shown here work well Local feature descriptors • Most features can be thought of as templates, histograms (counts), or combinations • The ideal descriptor should be – Robust and distinctive – Compact and efficient • Most available descriptors focus on edge/gradient information – Capture texture information – Color rarely used • Scale-invariant feature transform (SIFT) descriptor How to decide which features match? Feature matching • Szeliski 4.1.3 – – – – Simple feature-space methods Evaluation methods Acceleration methods Geometric verification (Chapter 6) Feature matching • Simple criteria: One feature matches to another if those features are nearest neighbors and their distance is below some threshold. • Problems: – Threshold is difficult to set – Non-distinctive features could have lots of close matches, only one of which is correct Matching local features • Threshold based on the ratio of 1st nearest neighbor to 2nd nearest neighbor distance Reject all matches in which the distance ratio > 0.8, which eliminates 90% of false matches while discarding less than 5% correct matches If there is a good match, the second closest matched point should not be close to the best match point (i.e., remove ambiguous matches) SIFT repeatability It shows the stability of detection for keypoint location, orientation, and final matching to a database as a function of affine distortion. The degree of affine distortion is expressed in terms of the equivalent viewpoint rotation in depth for a planar surface. Matching features What do we do about the “bad” matches? Good match: 1. crop a patch at each interest point and compute the distance between two 2. 
if the distance is small, then it is a good match RAndom SAmple Consensus Select one match, count inliers Blue line: randomly selected line Yellow line: inlier (4) Red line: outlier (2) RAndom SAmple Consensus Select one match, count inliers Blue line: randomly selected line Yellow line: inlier (1) Red line: outlier (5) Least squares fit Find “average” translation vector Blue line: randomly selected line Yellow line: inlier Red line: outlier RANSAC • Random Sample Consensus • Choose a small subset uniformly at random • Fit to that • Anything that is close to result is signal; all others are noise • Refit • Do this many times and choose the best • Issues – How many times? • Often enough that we are likely to have a good line – How big a subset? • Smallest possible – What does close mean? • Depends on the problem – What is a good line? • One where the number of nearby points is so big it is unlikely to be all outliers Descriptor Vector • Orientation = blurred gradient • Similarity Invariant Frame – Scale-space position (x, y, s) + orientation () Image Stitching 506 RANSAC for Homography • SIFT features of two similar images RANSAC for Homography • SIFT features common to both images RANSAC for Homography • Select random subset of features (e.g. 6) • Compute motion estimate • Apply motion estimate to all SIFT features • Compute error: feature pairs not described by motion estimate • Repeat many times (e.g. 500) • Keep estimate with best error RANSAC for Homography Probabilistic model for verification • Potential problem: • Two images don’t match… • … but RANSAC found a motion estimate • Do a quick check to make sure the images do match • MAP for inliers vs. outliers Finding the panoramas Finding the panoramas Finding connected components Finding the panoramas Results Fitting and alignment Fitting: find the parameters of a model that best fit the data Alignment: find the parameters of the transformation that best align matched points Checkerboard • Often used in camera calibration See https://docs.opencv.org/master/d9/dab/tutorial_homography.html https://www.mathworks.com/help/vision/ref/estimatecameraparameters.html Fitting and alignment • Design challenges – Design a suitable goodness of fit measure • Similarity should reflect application goals • Encode robustness to outliers and noise – Design an optimization method • Avoid local optima • Find best parameters quickly Fitting and alignment: Methods • Global optimization / Search for parameters – Least squares fit – Robust least squares – Iterative closest point (ICP) • Hypothesize and test – Generalized Hough transform – RANSAC Least squares line fitting y=mx+b • Data: (x1, y1), …, (xn, yn) • Line equation: yi = m xi + b • Find (m, b) to minimize (xi, yi) E = i =1 ( yi − m xi − b) 2 n dE = 2 A T Ap − 2 A T y = 0 dp A Ap = A T T Matlab: p = A \ y; y p = (A A ) T −1 AT y Least squares (global) optimization Good • Clearly specified objective • Optimization is easy Bad • May not be what you want to optimize • Sensitive to outliers – Bad matches, extra points • Doesn’t allow you to get multiple good fits – Detecting multiple objects, lines, etc. Hypothesize and test 1. Propose parameters – – – Try all possible Each point votes for all consistent parameters Repeatedly sample enough points to solve for parameters 2. Score the given parameters – Number of consistent points, possibly weighted by distance 3. Choose from among the set of parameters – Global or local maximum of scores 4. 
Possibly refine parameters using inliers Hough transform: Outline 1. Create a grid of parameter values 2. Each point votes for a set of parameters, incrementing those values in grid 3. Find maximum or local maxima in grid Hough transform Given a set of points, find the curve or line that explains the data points best y Duality: Each point has a dual line in the parameter space m x y=mx+b m and b are known variables y and x are unknown variables b Hough space m = -(1/x)b + y/x x and y are known variables m and b are unknown variables Hough transform y m b x y m 3 x 5 3 3 2 2 3 7 11 10 4 3 2 3 2 1 1 0 5 3 2 3 4 1 b Hough transform Issue : [m,b] is unbounded… −∞ ≤ 𝑚 ≤ ∞ → large accumulator Use a polar representation for the parameter space, 0 ≤ 𝜃 < 𝜋 Duality: Each point has a dual curve in the parameter space y x y = (− cos r )x + ( ) sin sin Hough space 𝑟 = 𝑥 cos 𝜃 + 𝑦 sin 𝜃 Hough transform: Experiments features votes Hough transform Hough transform: Experiments Noisy data features Need to adjust grid size or smooth votes Hough transform: Experiments features votes Issue: spurious peaks due to uniform noise 1. Image → Canny 2. Canny → Hough votes 3. Hough votes → Edges Find peaks and post-process Hough transform example Hough transform: Finding lines • Using m,b parameterization • Using r, theta parameterization – Using oriented gradients • Practical considerations – – – – Bin size Smoothing Finding multiple lines Finding line segments CSE 185 Introduction to Computer Vision Fitting and Alignment Fitting and alignment • RANSAC • Transformation • Iterative closest point • Reading: Chapter 4 Correspondence and alignment • Correspondence: matching points, patches, edges, or regions across images ≈ Fitting and alignment: Methods • Global optimization / Search for parameters – Least squares fit – Robust least squares – Iterative closest point (ICP) • Hypothesize and test – Hough transform – RANSAC Hough transform Good • Robust to outliers: each point votes separately • Fairly efficient (much faster than trying all sets of parameters) • Provides multiple good fits Bad • Some sensitivity to noise • Bin size trades off between noise tolerance, precision, and speed/memory – Can be hard to find sweet spot • Not suitable for more than a few parameters – grid size grows exponentially Common applications • Line fitting (also circles, ellipses, etc.) • Object instance recognition (parameters are affine transform) • Object category recognition (parameters are position/scale) RANSAC RANdom SAmple Consensus: Learning technique to estimate parameters of a model by random sampling of observed data Fischler & Bolles in ‘81. RANSAC Algorithm: 1. Sample (randomly) the number of points required to fit the model 2. Solve for model parameters using samples 3. Score by the fraction of inliers within a preset threshold of the model Repeat 1-3 until the best model is found with high confidence RANSAC Line fitting example Algorithm: 1. Sample (randomly) the number of points required to fit the model (#=2) 2. Solve for model parameters using samples 3. Score by the fraction of inliers within a preset threshold of the model Repeat 1-3 until the best model is found with high confidence RANSAC Line fitting example Algorithm: 1. Sample (randomly) the number of points required to fit the model (#=2) 2. Solve for model parameters using samples 3. Score by the fraction of inliers within a preset threshold of the model Repeat 1-3 until the best model is found with high confidence RANSAC Line fitting example NI = 6 Algorithm: 1. 
Sample (randomly) the number of points required to fit the model (#=2) 2. Solve for model parameters using samples 3. Score by the fraction of inliers within a preset threshold of the model Repeat 1-3 until the best model is found with high confidence RANSAC Algorithm: N I = 14 1. Sample (randomly) the number of points required to fit the model (#=2) 2. Solve for model parameters using samples 3. Score by the fraction of inliers within a preset threshold of the model Repeat 1-3 until the best model is found with high confidence How to choose parameters? • Number of samples N – Choose N so that, with probability p, at least one random sample is free from outliers (e.g. p=0.99) (outlier ratio: e ) • Number of sampled points s – Minimum number needed to fit the model • Distance threshold – Choose so that a good point with noise is likely (e.g., prob=0.95) within threshold – Zero-mean Gaussian noise with std. dev. σ: t2=3.84σ2 ( N = log(1 − p ) / log 1 − (1 − e ) s ) s 2 3 4 5 6 7 8 5% 2 3 3 4 4 4 5 10% 3 4 5 6 7 8 9 proportion of outliers e 20% 5 7 9 12 16 20 26 25% 6 9 13 17 24 33 44 30% 7 11 17 26 37 54 78 40% 50% 11 17 19 35 34 72 57 146 97 293 163 588 272 1177 RANSAC Good • Robust to outliers • Applicable for larger number of objective function parameters than Hough transform • Optimization parameters are easier to choose than Hough transform Bad • Computational time grows quickly with fraction of outliers and number of parameters • Not good for getting multiple fits Common applications • Computing a homography (e.g., image stitching) • Estimating fundamental matrix (relating two views) How do we fit the best alignment? Alignment • Alignment: find parameters of model that maps one set of points to another • Typically want to solve for a global transformation that accounts for *most* true correspondences • Difficulties – Noise (typically 1-3 pixels) – Outliers (often 50%) – Many-to-one matches or multiple objects Parametric (global) warping T p = (x,y) p’ = (x’,y’) Transformation T is a coordinate-changing machine: p’ = T(p) What does it mean that T is global? – Is the same for any point p – can be described by just a few numbers (parameters) For linear transformations, we can represent T as a matrix p’ = Tp x' x y ' = T y Common transformations original Transformed aspect rotation translation affine perspective Scaling • Scaling a coordinate means multiplying each of its components by a scalar • Uniform scaling means this scalar is the same for all components: 2 Scaling • Non-uniform scaling: different scalars per component: X 2, Y 0.5 Scaling • Scaling operation: x ' = ax y ' = by • Or, in matrix form: x ' a 0 x y ' = 0 b y scaling matrix S 2D rotation (x’, y’) (x, y) x’ = x cos() - y sin() y’ = x sin() + y cos() 2D rotation (x’, y’) (x, y) Polar coordinates… x = r cos () y = r sin () x’ = r cos ( + ) y’ = r sin ( + ) Trig Identity… x’ = r cos() cos() – r sin() sin() y’ = r sin() cos() + r cos() sin() Substitute… x’ = x cos() - y sin() y’ = x sin() + y cos() 2D rotation This is easy to capture in matrix form: x ' cos( ) − sin ( ) x y ' = sin ( ) cos( ) y R Even though sin() and cos() are nonlinear functions of , – x’ is a linear combination of x and y – y’ is a linear combination of x and y What is the inverse transformation? 
– Rotation by – – For rotation matrices R −1 = R T Basic 2D transformations x' s x y ' = 0 0 x s y y x' 1 y ' = a y Scale a x x 1 y Shear x' cos − sin x y ' = sin cos y Rotate x x 1 0 t x y = 0 1 t y y 1 Translate x a b y = d e Affine x c y f 1 Affine is any combination of translation, scale, rotation, shear Affine transformation Affine transformations are combinations of x • Linear transformations, and • Translations a y = d Properties of affine transformations: • • • • Lines map to lines Parallel lines remain parallel Ratios are preserved Closed under composition b e x c y f 1 or x' a y ' = d 1 0 b e 0 c x f y 1 1 Projective transformations Projective transformations are combos of • • Affine transformations, and Projective warps Properties of projective transformations: • • • • • • Lines map to lines Parallel lines do not necessarily remain parallel Ratios are preserved Closed under composition Models change of basis Projective matrix is defined up to a scale (8 DOF) x' a y ' = d w' g b e h c x f y i w 2D image transformations Example: solving for translation A1 A2 B1 A3 B2 B3 Given matched points in {A} and {B}, estimate the translation of the object xiB xiA t x B = A + t yi yi y Example: solving for translation A1 A2 A3 (tx, ty) B1 B2 Least squares solution 1. 2. 3. Write down objective function Derived solution a) Compute derivative b) Compute solution Computational solution a) Write in form Ax=b b) Solve using pseudo-inverse or eigenvalue decomposition xiB xiA t x B = A + t yi yi y B3 1 0 1 0 0 x1B B 1 t x y1 = ty 0 x nB y nB 1 − x1A − y1A − x nA − y nA Example: solving for translation A1 A5 A2 A3 (tx, ty) A4 B4 B1 B2 B5 B3 Problem: outliers RANSAC solution 1. 2. 3. 4. Sample a set of matching points (1 pair) Solve for transformation parameters Score parameters with number of inliers Repeat steps 1-3 N times xiB xiA t x B = A + t yi yi y Example: solving for translation B4 B5 B6 A1 A2 (tx, ty) A3 A4 A5 A6 B1 B2 B3 Problem: outliers, multiple objects, and/or many-to-one matches Hough transform solution 1. Initialize a grid of parameter values 2. Each matched pair casts a vote for consistent values 3. Find the parameters with the most votes 4. Solve using least squares with inliers xiB xiA t x B = A + t yi yi y Example: solving for translation (tx, ty) Problem: no initial guesses for correspondence xiB xiA t x B = A + t yi yi y When no prior matched pairs exist • Hough transform and RANSAC not applicable • Important applications Medical imaging: match brain scans or contours Robotics: match point clouds Iterative Closest Point (ICP) Goal: estimate transform between two dense sets of points 1. Initialize transformation (e.g., compute difference in means and scale) 2. Assign each point in {Set 1} to its nearest neighbor in {Set 2} 3. Estimate transformation parameters – e.g., least squares or robust least squares 4. Transform the points in {Set 1} using estimated parameters 5. Repeat steps 2-4 until change is very small https://www.youtube.com/watch?v=m64E47uvPYc https://www.youtube.com/watch?v=uzOCS_gdZuM Example: aligning boundaries p q Example: solving for translation (tx, ty) Problem: no initial guesses for correspondence ICP solution xiB xiA t x 1. Find nearest neighbors for each point B = A + t 2. Compute transform using matches yi yi y 3. Move points using transform 4. 
Repeat steps 1-3 until convergence Algorithm summary • Least Squares Fit – – – • Robust Least Squares – – • robust to noise and outliers can fit multiple models only works for a few parameters (1-4 typically) RANSAC – – • improves robustness to noise requires iterative optimization Hough transform – – – • closed form solution robust to noise not robust to outliers robust to noise and outliers works with a moderate number of parameters (e.g, 1-8) Iterative Closest Point (ICP) – For local alignment only: does not require initial correspondences Object instance recognition A1 1. Match keypoints to object model 2. Solve for affine transformation parameters 3. Score by inliers and choose solutions with score above threshold A2 A3 Matched keypoints Affine Parameters # Inliers Choose hypothesis with max score above threshold Keypoint matching A1 A2 A3 fA fB d ( f A, fB ) T 1. Find a set of distinctive keypoints 2. Define a region around each keypoint 3. Extract and normalize the region content 4. Compute a local descriptor from the normalized region 5. Match local descriptors Finding the objects Input Image 1. 2. 3. 4. 5. Stored Image Match interest points from input image to database image Matched points vote for rough position/orientation/scale of object Find position/orientation/scales that have at least three votes Compute affine registration and matches using iterative least squares with outlier check Report object if there are at least T matched points Object recognition using SIFT descriptors 1. 2. Match interest points from input image to database image Get location/scale/orientation using Hough voting – In training, each point has known position/scale/orientation wrt whole object – Matched points vote for the position, scale, and orientation of the entire object – Bins for x, y, scale, orientation • • 3. Wide bins (0.25 object length in position, 2x scale, 30 degrees orientation) Vote for two closest bin centers in each direction (16 votes total) Geometric verification – For each bin with at least 3 keypoints – Iterate between least squares fit and checking for inliers and outliers 4. Report object if > T inliers (T is typically 3, can be computed to match some probabilistic threshold) Examples of recognized objects CSE 185 Introduction to Computer Vision Stereo Stereo • • • • Multi-view analysis Stereopsis Finding correspondce Depth • Reading: Chapter 12 and 11 Multiple views • Taken at the same time or sequential in time • stereo vision • structure from motion • optical flow Why multiple views? • Structure and depth are inherently ambiguous from single views Why multiple views? • Structure and depth are inherently ambiguous from single views P1 P2 P1’=P2’ Optical center Shape from X • What cues help us to perceive 3d shape and depth? 
• Many factors – – – – – – – – Shading Motion Occlusion Focus Texture Shadow Specularity … Shading For static objects and known light source, estimate surface normals under different lighting conditions Focus/defocus Images from same point of view, different camera parameters far focused near focused 3d shape / depth estimates estimated depth map Texture estimate surface orientation Perspective effects perspective, relative size, occlusion, texture gradients contribute to 3D appearance of the scene Motion Motion parallax: as the viewpoint moves side to side, the objects at distance appear to move slower than the ones close to the camera Motion Parallax Occlusion and focus Estimating scene shape • Shape from X: Shading, texture, focus, motion… • Stereo: – shape from motion between two views – infer 3d shape of scene from two (multiple) images from different viewpoints Main idea: scene point image plane optical center Outline • Human stereopsis • Stereograms • Epipolar geometry and the epipolar constraint – Case example with parallel optical axes – General case with calibrated cameras Human eye Rough analogy with human visual system: Pupil/Iris – control amount of light passing through lens Retina - contains sensor cells, where image is formed Fovea – highest concentration of cones Human stereopsis: disparity Human eyes fixate on point in space – rotate so that corresponding images form in centers of fovea. See https://en.wikipedia.org/wiki/Binocular_disparity Human stereopsis: disparity Disparity occurs when eyes fixate on one object; others appear at different visual angles Human stereopsis: disparity Disparity: d = r-l = D-F. Random dot stereograms • Bela Julesz 1960: Do we identify local brightness patterns before fusion (monocular process) or after (binocular)? • To test: pair of synthetic images obtained by randomly spraying black dots on white objects Random dot stereograms Random dot stereograms 1. Create an image of suitable size. Fill it with random dots. Duplicate the image. 2. Select a region in one image. See also this video 3. Shift this region horizontally by a small amount. The stereogram is complete. (focus on a point behind the image by a small amount until the two images "snap" together) Random dot stereograms • When viewed monocularly, they appear random; when viewed stereoscopically, see 3d structure. • Conclusion: human binocular fusion not directly associated with the physical retinas; must involve the central nervous system • Imaginary “cyclopean retina” that combines the left and right image stimuli as a single unit • High level scene understanding not required for stereo Stereo photograph and viewer Take two pictures of the same subject from two slightly different viewpoints and display so that each eye sees only one of the images. Invented by Sir Charles Wheatstone, 1838 Image from fisher-price.com 3D effect http://www.johnsonshawmuseum.org https://en.wikipedia.org/wiki/Anaglyph_3D 3D effect http://www.johnsonshawmuseum.org https://en.wikipedia.org/wiki/Anaglyph_3D Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923 Stereo images present stereo images on the screen by simply putting the right and left images in an animated gif http://www.well.com/~jimg/stereo/stereo_list.html Autostereograms Exploit disparity as depth cue using single image. 
(Single image random dot stereogram, Single image stereogram) Tips: first move your nose to the screen and look through the screen, then move your head away slowly and you will see 3D See https://en.wikipedia.org/wiki/Autostereogram Autosteoreogram Designed to create the visual illusion of a 3D scene Using smooth gradients Estimating depth with stereo • Stereo: shape from motion between two views • Need to consider: – Info on camera pose (“calibration”) – Image point correspondences scene point image plane optical center Stereo vision Two cameras, simultaneous views Single moving camera and static scene Camera parameters Camera frame 2 Extrinsic parameters: Camera frame 1 → Camera frame 2 Camera frame 1 Intrinsic parameters: Image coordinates relative to camera → Pixel coordinates • Extrinsic params: rotation matrix and translation vector • Intrinsic params: focal length, pixel sizes (mm), image center point, radial distortion parameters We’ll assume for now that these parameters are given and fixed. Outline • Human stereopsis • Stereograms • Epipolar geometry and the epipolar constraint – Case example with parallel optical axes – General case with calibrated cameras Geometry for a stereo system • First, assuming parallel optical axes, known camera parameters (i.e., calibrated cameras): P World point point in 3D world Depth of p image point (left) image point image point (right) image point (left) depth of P (right) Focal focal length length optical center (left) optical center optical center (right) (right) optical center (left) baseline baseline Geometry for a stereo system • Assume parallel optical axes, known camera parameters (i.e., calibrated cameras). What is expression for Z? P Similar triangles (pl, P, pr) and (Ol, P, Or): T + xl − xr T = Z− f Z disparity 𝑇 − 𝑥𝑙 − 𝑥𝑟 = 𝑇 + 𝑥𝑙 − 𝑥𝑟 T Z= f xr − xl Depth from disparity image I(x,y) Disparity map D(x,y) image I´(x´,y´) (x´,y´)=(x+D(x,y), y) So if we could find the corresponding points in two images, we could estimate relative depth… CSE 185 Introduction to Computer Vision Stereo 2 Stereo • Epipolar geometry • Correspondence • Structure from motion • Reading: Chapter 11 Depth from disparity X (X – X’) / f = baseline / z z x f f C X – X’ = (baseline*f) / z x’ baseline C’ z = (baseline*f) / (X – X’) d=X-X’ (disparity) z is inversely proportional to d Outline • Human stereopsis • Stereograms • Epipolar geometry and the epipolar constraint – Case example with parallel optical axes – General case with calibrated cameras General case with calibrated camera • The two cameras need not have parallel optical axes. vs. Stereo correspondence constraint • Given p in left image, where can corresponding point p’ be? 
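Looking back at the parallel-axes geometry above, a minimal MATLAB sketch of turning disparity into depth via Z = f·B / d (depth inversely proportional to disparity). The focal length, baseline, and disparity map below are made-up placeholder values, not calibrated quantities.

f = 700;                        % focal length in pixels      (placeholder value)
B = 0.12;                       % baseline in meters          (placeholder value)
d = 1 + 63 * rand(240, 320);    % stand-in disparity map, d = x - x', in pixels
Z = f * B ./ d;                 % per-pixel depth in meters
figure; imagesc(Z); colorbar;   % large disparity <-> small depth (close to the cameras)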
Stereo correspondence constraints Epipolar constraint • Geometry of two views constrains where the corresponding pixel for some image point in the first view must occur in the second view • It must be on the line carved out by a plane connecting the world point and optical centers Epipolar geometry • Epipolar Plane Epipole Baseline Epipolar Line Epipole http://www.ai.sri.com/~luong/research/Meta3DViewer/EpipolarGeo.html Epipolar geometry: terms • • • • • • Baseline: line joining the camera centers 𝑂𝑂′ Epipole: point of intersection of baseline with image plane 𝑒 Epipolar plane: plane containing baseline and world point Epipolar line: intersection of epipolar plane with the image plane 𝑒𝑝 and 𝑒′𝑝′ All epipolar lines intersect at the epipole An epipolar plane intersects the left and right image planes in epipolar lines Why is the epipolar constraint useful? Epipolar constraint This is useful because it reduces the correspondence problem to a 1D search along an epipolar line Example What do the epipolar lines look like? 1. Ol Or 2. Ol Or Example: Converging camera Example: Parallel camera Where are the epipoles? Example: Forward motion What would the epipolar lines look like if the camera moves directly forward? Example: Forward motion e’ e Epipole has same coordinates in both images. Points move along lines radiating from e: “Focus of expansion” Recall • Cross product • Dot product in matrix notation M=[R | t] Recall • A line 𝑙: 𝑎𝑥 + 𝑏𝑦 + 𝑐 = 0 • In vector form 𝑥 𝑎 𝑙 = 𝑏 , a point 𝑝 = 𝑦 , 𝑝𝑇 𝑙 = 0 𝑐 1 Epipolar Constraint: Calibrated Case R is 3 x 3 matrix t is 3 x 1 vector 𝑝𝑙 = 𝑅𝑝𝑟 + 𝑡 See p347-348 of Szeliski (R, t) 𝑝 ∙ 𝑡 × 𝑅 𝑝′ + 𝑡) 𝑝 = (𝑢, 𝑣, 1)𝑇 = 0 𝑤𝑖𝑡ℎ ൞ 𝑝′ = (𝑢′ , 𝑣 ′ , 1)𝑇 𝑝′ in left camera: 𝑅𝑝′ + 𝑡 𝑝 ∙ [𝑡 × 𝑅𝑝′) = 0, 𝑝𝑇 [𝑡 × 𝑅𝑝′) =0 Essential Matrix 3 ×3 skew-symmetric matrix: rank=2 (Longuet-Higgins, 1981) 𝑡 × 𝑅𝑝′ = 𝑡 × 𝑅𝑝′ = 𝜀𝑝′ 𝑝𝑇 𝜀𝑝′ = 0 with 𝜀 = [𝑡]× 𝑅 𝑝𝑇 𝑙 = 0, epiolor line 𝑙 is formed by 𝜀𝑝′ Epipolar Constraint: Calibrated Case 𝑅11 𝑅12 𝑅13 𝑡 = [𝑡𝑥 , 𝑡𝑦 , 𝑡𝑧 ] , R = 𝑅21 𝑅22 𝑅23 𝑅31 𝑅32 𝑅33 0 −𝑡𝑧 𝑡𝑦 𝑅11 𝑅12 𝑅13 0 −𝑡𝑥 𝑅21 𝑅22 𝑅23 𝐸 = 𝑡𝑧 𝑅31 𝑅32 𝑅33 −𝑡𝑦 𝑡𝑥 0 𝑇 • Note that Ep’ can be interpreted as the coordinate vector representing the epipolar line associated with the point p’ in the first image • A line l can be defined by its equation au+bv+c=0 where (u,v) denote the coordinate of a point on the line, (a, b) is the unit normal to the line, and –c is the offset • Can normalize that with a2+b2=1 to have a unique answer https://www.youtube.com/watch?v=6kpBqfgSPRc http://www.rasmus.is/uk/t/F/Su58k05.htm https://sites.math.washington.edu/~king/coursedir/m445w04/notes/vector/normals-planes.html Epipolar Constraint: Uncalibrated Case pˆ , pˆ are normalized image coordinate Fundamental Matrix (Faugeras and Luong, 1992) Fundamental matrix • Let p be a point in left image, p’ in right image l • Epipolar relation l’ p p’ – p maps to epipolar line l’ – p’ maps to epipolar line l • Epipolar mapping described by a 3x3 matrix F • It follows that What is the physical meaning? Fundamental matrix • This matrix F is called – Essential Matrix • when image intrinsic parameters are known – Fundamental Matrix • more generally (uncalibrated case) • Can solve F from point correspondences – Each (p, p’) pair gives one linear equation in entries of F – F has 9 entries, but really only 7 or 8 degrees of freedom. 
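A minimal MATLAB sketch of the calibrated two-view constraint above: form the skew-symmetric matrix [t]x, the essential matrix E = [t]x R, and the epipolar line l = E p'. The rotation, translation, and image points are made-up values chosen so the constraint can be checked by hand.

R = eye(3);                           % rotation between the cameras    (illustrative value)
t = [0.2; 0; 0];                      % translation between the cameras (illustrative value)
tx = [  0    -t(3)   t(2);            % [t]x, the skew-symmetric matrix with tx*v = cross(t, v)
       t(3)    0    -t(1);
      -t(2)   t(1)    0  ];
E = tx * R;                           % essential matrix
p2 = [0.1; 0.3; 1];                   % a point p' in the second image (normalized coordinates)
l  = E * p2;                          % its epipolar line in the first image: l(1)*u + l(2)*v + l(3) = 0
% With R = I and a purely horizontal t, l is the horizontal scanline v = 0.3.
p1 = [0.5; 0.3; 1];                   % a candidate match p in the first image
residual = p1' * l;                   % the constraint p^T * E * p2 = 0 holds for a true match (here exactly 0)

When the intrinsic parameters are unknown, the fundamental matrix F plays the same role on pixel coordinates, as noted above.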
– With 8 points it is simple to solve for F, but it is also possible with 7 points Stereo image rectification Stereo image rectification • Reproject image planes onto a common plane parallel to the line between camera centers • Pixel motion is horizontal after this transformation • Two homographies (3x3 transform), one for each input image reprojection Rectification example Correspondence problem • Epipolar geometry constrains our search, but we still have a difficult correspondence problem Basic stereo matching algorithm • If necessary, rectify the two stereo images to transform epipolar lines into scanlines • For each pixel x in the first image – Find corresponding epipolar scanline in the right image – Examine all pixels on the scanline and pick the best match x’ – Compute disparity x-x’ and set depth(x) = fB/(x-x’) Correspondence search Left Right scanline Matching cost • Slide a window along the right scanline and compare contents of that window with the reference window in the left image • Matching cost: SSD or normalized correlation Correspondence search Left Right scanline SSD Correspondence search Left Right scanline Norm. corr Effect of window size • Smaller window + More detail – More noise • Larger window + Smoother disparity maps – Less detail W=3 W = 20 Failure cases Textureless surfaces Occlusions, repetition Non-Lambertian surfaces, specularities Results with window search Data Window-based matching Ground truth Beyond window-based matching • So far, matches are independent for each point • What constraints or priors can we add? Stereo constraints/priors • Uniqueness – For any point in one image, there should be at most one matching point in the other image Stereo constraints/priors • Uniqueness – For any point in one image, there should be at most one matching point in the other image • Ordering – Corresponding points should be in the same order in both views Stereo constraints/priors • Uniqueness – For any point in one image, there should be at most one matching point in the other image • Ordering – Corresponding points should be in the same order in both views Ordering constraint doesn’t hold Priors and constraints • Uniqueness – For any point in one image, there should be at most one matching point in the other image • Ordering – Corresponding points should be in the same order in both views • Smoothness – We expect disparity values to change slowly (for the most part) Scanline stereo • Try to coherently match pixels on the entire scanline • Different scanlines are still optimized independently Left image Right image “Shortest paths” for scan-line stereo Left image I Right image Sleft Right occlusion Left occlusion q t s p S right Can be implemented with dynamic programming Ccorr I Coherent stereo on 2D grid • Scanline stereo generates streaking artifacts • Can’t use dynamic programming to find spatially coherent disparities/ correspondences on a 2D grid Stereo matching as energy minimization I2 I1 W1(i) D W2(i+D(i)) E ( D ) = (W1 (i ) − W2 (i + D (i )) ) + 2 i D(i) (D(i) − D( j ) ) neighbors i , j data term smoothness term • Random field interpretation • Energy functions of this form can be minimized using graph cuts Graph cut Before Graph cuts Many of these constraints can be encoded in an energy function and solved using graph cuts Ground truth Y. Boykov, O. Veksler, and R. 
Zabih, Fast Approximate Energy Minimization via Graph Cuts, PAMI 2001 For the latest and greatest: http://www.middlebury.edu/stereo/ Active stereo with structured light • Project “structured” light patterns onto the object – Simplifies the correspondence problem – Allows us to use only one camera camera projector L. Zhang, B. Curless, and S. M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. 3DPVT 2002 Kinect: Structured infrared light http://bbzippo.wordpress.com/2010/11/28/kinect-in-infrared/ Summary: Epipolar constraint X X X x x’ x’ x’ Potential matches for x have to lie on the corresponding line l’. Potential matches for x’ have to lie on the corresponding line l. Summary • Epipolar geometry – Epipoles are intersection of baseline with image planes – Matching point in second image is on a line passing through its epipole – Fundamental matrix maps from a point in one image to a line (its epipolar line) in the other – Can solve for F given corresponding points (e.g., interest points) • Stereo depth estimation – Estimate disparity by finding corresponding points along scanlines – Depth is inverse to disparity Structure from motion • Given a set of corresponding points in two or more images, compute the camera parameters and the 3D point coordinates ? Camera 1 R1,t1 ? Camera 2 R2,t2 Camera 3 ? ? R ,t 3 3 Structure from motion ambiguity • If we scale the entire scene by some factor k and, at the same time, scale the camera matrices by the factor of 1/k, the projections of the scene points in the image remain exactly the same: 1 x = PX = P (k X) k It is impossible to recover the absolute scale of the scene! Structure from motion ambiguity • If we scale the entire scene by some factor k and, at the same time, scale the camera matrices by the factor of 1/k, the projections of the scene points in the image remain exactly the same ( x = PX = PQ -1 )(QX ) • More generally: if we transform the scene using a transformation Q and apply the inverse transformation to the camera matrices, then the images do not change Projective structure from motion • Given: m images of n fixed 3D points • xij = Pi Xj , i = 1,… , m, j = 1, … , n • Problem: estimate m projection matrices Pi and n 3D points Xj from the mn corresponding points xij Xj x1j x3j P1 x2j P3 P2 Projective structure from motion • Given: m images of n fixed 3D points • xij = Pi Xj , i = 1,… , m, j = 1, … , n • Problem: estimate m projection matrices Pi and n 3D points Xj from the mn corresponding points xij • With no calibration info, cameras and points can only be recovered up to a 4x4 projective transformation Q: • X → QX, P → PQ-1 • We can solve for structure and motion when • 2mn >= 11m +3n – 15 • For two cameras, at least 7 points are needed Projective ambiguity A Qp = T v ( x = PX = PQ -1 P )(Q X ) P t v Projective ambiguity Bundle adjustment • Non-linear method for refining structure and motion • Minimizing reprojection error 2 E (P, X) = D (x ij , Pi X j ) m n i =1 j =1 Xj P1Xj x3j x1j P1 P2Xj x2j P3Xj P3 P2 Photosynth Noah Snavely, Steven M. 
https://www.youtube.com/watch?v=p16frKJLVi0
http://photosynth.net/

CSE 185 Introduction to Computer Vision
Feature Tracking and Optical Flow

Motion estimation
• Feature tracking: extract visual features (corners, textured areas) and track them over multiple frames
• Optical flow: recover image motion at each pixel from spatio-temporal image brightness variations
Two problems, one registration method
• Reading: Chapter 4.1.4 and 8

Feature tracking
• Many problems, e.g., structure from motion, require matching points
• If the motion is small, tracking is an easy way to measure pixel movements
• Challenges
– Figure out which features can be tracked
– Efficiently track across frames
– Some points may change appearance over time (e.g., due to rotation, moving into shadows, etc.)
– Drift: small errors can accumulate as the appearance model is updated
– Points may appear or disappear; we need to be able to add/delete tracked points

Feature tracking: given two subsequent frames I(x, y, t) and I(x, y, t+1), estimate the point translation
• Key assumptions of the Lucas-Kanade tracker
– Brightness constancy: the projection of the same point looks the same in every frame
– Small motion: points do not move very far
– Spatial coherence: points move like their neighbors

Brightness constancy equation:
I(x, y, t) = I(x + u, y + v, t + 1)
Take the Taylor expansion of I(x + u, y + v, t + 1) at (x, y, t) to linearize the right side:
I(x + u, y + v, t + 1) ≈ I(x, y, t) + I_x u + I_y v + I_t
where I_x and I_y are the image derivatives along x and y, and I_t is the difference over frames. Hence
I(x + u, y + v, t + 1) − I(x, y, t) ≈ I_x u + I_y v + I_t
I_x u + I_y v + I_t ≈ 0  →  ∇I · [u v]ᵀ + I_t = 0

Detailed derivation (ignoring higher-order terms):
(∂I/∂x)(Δx/Δt) + (∂I/∂y)(Δy/Δt) + (∂I/∂t)(Δt/Δt) = 0
I_x u + I_y v + I_t = 0
∇I · [u v]ᵀ + I_t = 0, with ∇I = [I_x, I_y]ᵀ

How does this make sense?
• What do the static image gradients have to do with motion estimation?
• The equation relates spatial and temporal information
– Spatial information: the image gradient
– Temporal information: the temporal gradient

Computing gradients in X-Y-T (a sketch of this computation follows this section): average the forward differences over the 2x2x2 cube of neighboring samples, e.g.,
I_x ≈ 1/(4Δx) [ (I_{i+1,j,t} + I_{i+1,j,t+1} + I_{i+1,j+1,t} + I_{i+1,j+1,t+1}) − (I_{i,j,t} + I_{i,j,t+1} + I_{i,j+1,t} + I_{i,j+1,t+1}) ]
and likewise for I_y and I_t

Brightness constancy
• Can we use this equation to recover the image motion (u, v) at each pixel?
I_x u + I_y v + I_t = 0, i.e., ∇I · [u v]ᵀ + I_t = 0
• How many equations and unknowns per pixel? One equation (a scalar equation!) and two unknowns (u, v)
• The component of the motion perpendicular to the gradient (i.e., parallel to the edge) cannot be measured: if (u, v) satisfies the equation, so does (u + u’, v + v’) for any (u’, v’) with ∇I · [u’ v’]ᵀ = 0, since ∇I · [(u + u’) (v + v’)]ᵀ + I_t = 0
• There are multiple solutions (explanations) for the same observed motion along an edge

The aperture problem: actual motion vs. perceived motion
http://en.wikipedia.org/wiki/Motion_perception#The_aperture_problem
The grating appears to be moving down and to the right, perpendicular to the orientation of the bars. But it could be moving in many other directions, such as only down, or only to the right. It is not possible to determine the true motion unless the ends of the bars become visible in the aperture.
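A minimal sketch of the X-Y-T gradient computation above, assuming two consecutive grayscale frames of the same size and unit spacing in x, y, and t. The averaging pattern follows the 2x2x2 formula given earlier; this is for illustration, not an optimized implementation.

```python
import numpy as np

def spatiotemporal_gradients(I_t, I_t1):
    """Estimate I_x, I_y, I_t from two consecutive grayscale frames."""
    A = I_t.astype(np.float32)
    B = I_t1.astype(np.float32)
    # forward differences in x, averaged over the y-neighbor and both frames
    Ix = 0.25 * ((A[:-1, 1:] - A[:-1, :-1]) + (A[1:, 1:] - A[1:, :-1]) +
                 (B[:-1, 1:] - B[:-1, :-1]) + (B[1:, 1:] - B[1:, :-1]))
    # forward differences in y, averaged over the x-neighbor and both frames
    Iy = 0.25 * ((A[1:, :-1] - A[:-1, :-1]) + (A[1:, 1:] - A[:-1, 1:]) +
                 (B[1:, :-1] - B[:-1, :-1]) + (B[1:, 1:] - B[:-1, 1:]))
    # temporal differences, averaged over the 2x2 spatial neighborhood
    It = 0.25 * ((B[:-1, :-1] - A[:-1, :-1]) + (B[:-1, 1:] - A[:-1, 1:]) +
                 (B[1:, :-1] - A[1:, :-1]) + (B[1:, 1:] - A[1:, 1:]))
    return Ix, Iy, It
```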
The barber pole illusion
http://en.wikipedia.org/wiki/Barberpole_illusion
This visual illusion occurs when a diagonally striped pole is rotated around its vertical axis (horizontally): it appears as though the stripes are moving in the direction of the vertical axis (downwards, in the case of the animation) rather than around it. The barber pole turns in place on its vertical axis, but the stripes appear to move along the pole rather than turning with it.

Addressing the ambiguity
B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 674–679, 1981.
I_x u + I_y v + I_t = 0 (∇I · [u v]ᵀ + I_t = 0)
• How do we get more equations for a pixel?
• Spatial coherence constraint: assume the pixel’s neighbors have the same (u, v)
– If we use a 5x5 window, that gives us 25 equations per pixel

Solving the ambiguity: matching patches
• Overconstrained linear system A d = b, where each row of A is [I_x(p_i) I_y(p_i)], d = [u v]ᵀ, and b = −[I_t(p_i)]
• The least squares solution for d is given by AᵀA d = Aᵀb:
[ Σ I_x I_x   Σ I_x I_y ;  Σ I_x I_y   Σ I_y I_y ] [u ; v] = −[ Σ I_x I_t ;  Σ I_y I_t ]
The summations are over all pixels in the K x K window

Conditions for solvability: the optimal (u, v) satisfies the Lucas-Kanade equation above. When is this solvable, i.e., what are good points to track?
• AᵀA should be invertible
• AᵀA should not be too small due to noise: the eigenvalues λ1 and λ2 of AᵀA should not be too small
• AᵀA should be well-conditioned: λ1/λ2 should not be too large (λ1 = larger eigenvalue)
Does this remind you of anything? These are the criteria for the Harris corner detector

Pseudo inverse
• A d = b, d = A⁻¹ b when A is a well-conditioned square matrix
• For flow estimation, A is not a square matrix; multiplying both sides by Aᵀ gives AᵀA d = Aᵀ b, and AᵀA is a square matrix
• A⁺ = (AᵀA)⁻¹Aᵀ, so A⁺A = (AᵀA)⁻¹(AᵀA) = I and d = (AᵀA)⁻¹Aᵀ b
• See the Moore-Penrose inverse (pseudo inverse, left inverse)
• Need to make sure AᵀA is well-conditioned

M = AᵀA is the second moment matrix (the Harris corner detector matrix)
• The eigenvectors and eigenvalues of AᵀA relate to edge direction and magnitude
• The eigenvector associated with the larger eigenvalue points in the direction of fastest intensity change
• The other eigenvector is orthogonal to it
• Low-texture region: gradients have small magnitude; small λ1, small λ2
• Edge: gradients very large or very small; large λ1, small λ2
• High-texture region: gradients are different, with large magnitudes; large λ1, large λ2
(a minimal sketch of solving this 2x2 system for one window appears at the end of this section)

The aperture problem resolved: actual motion vs. perceived motion

Shi-Tomasi feature tracker
• Find good features using the eigenvalues of the second-moment matrix (e.g., Harris detector or a threshold on the smallest eigenvalue)
– Key idea: “good” features to track are the ones whose motion can be estimated reliably
• Track from frame to frame with Lucas-Kanade
– This amounts to assuming a translation model for frame-to-frame feature movement
• Check the consistency of tracks by affine registration to the first observed instance of the feature
– An affine model is more accurate for larger displacements
– Comparing to the first frame helps to minimize drift
Tracking example: J. Shi and C. Tomasi. Good Features to Track. CVPR 1994.
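A minimal sketch of the per-window Lucas-Kanade solve described above: build AᵀA and Aᵀb from the gradients inside a K x K window, check the smallest eigenvalue (the good-feature-to-track test), and solve the 2x2 system. The threshold value min_eig is a hypothetical choice for illustration.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It, min_eig=1e-2):
    """Solve A^T A d = A^T b for one K x K window.

    Ix, Iy, It are the spatial and temporal gradients inside the window
    (K x K arrays). Returns (u, v), or None if the second-moment matrix
    A^T A is too close to singular to give a reliable estimate.
    """
    ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
    AtA = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                    [np.sum(ix * iy), np.sum(iy * iy)]])
    Atb = -np.array([np.sum(ix * it), np.sum(iy * it)])
    eigvals = np.linalg.eigvalsh(AtA)        # eigenvalues in ascending order
    if eigvals[0] < min_eig:                 # low-texture or edge-like window
        return None
    u, v = np.linalg.solve(AtA, Atb)
    return u, v
```

In a full tracker this solve is applied around each selected corner, iterated with warping, and wrapped in a coarse-to-fine pyramid, as described in the following summary.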
Summary of KLT tracking
• Find a good point to track (Harris corner)
• Use the intensity second-moment matrix and the difference across frames to find the displacement
• Iterate, and use coarse-to-fine search to deal with larger movements
• When creating long tracks, check the appearance of the registered patch against the appearance of the initial patch to find points that have drifted

Implementation issues
• Window size
– A small window is more sensitive to noise and may miss larger motions (without a pyramid)
– A large window is more likely to cross an occlusion boundary (and it’s slower)
– 15x15 to 31x31 seems typical
• Weighting the window
– It is common to apply weights so that the center matters more (e.g., with a Gaussian)

Optical flow: a vector field function of the spatio-temporal image brightness variations

Motion and perceptual organization
• A Gestalt factor that leads to grouping
• Sometimes, motion is the only cue
• Even “impoverished” motion data can evoke a strong percept
G. Johansson, “Visual Perception of Biological Motion and a Model For Its Analysis,” Perception and Psychophysics 14, 201–211, 1973.

Uses of motion
• Estimating 3D structure
• Segmenting objects based on motion cues
• Learning and tracking dynamical models
• Recognizing events and activities
• Improving video quality (motion stabilization)

Motion field
• The motion field is the projection of the 3D scene motion into the image
• See Nayar’s lecture on optical flow and motion field
• What would the motion field of a non-rotating ball moving towards the camera look like?

Optical flow
• Definition: optical flow is the apparent motion of brightness patterns in the image
• Ideally, optical flow would be the same as the motion field
• Have to be careful: apparent motion can be caused by lighting changes without any actual motion
– Think of a uniform rotating sphere under fixed lighting vs. a stationary sphere under moving illumination

Lucas-Kanade optical flow
• Same as Lucas-Kanade feature tracking, but for each pixel
– As we saw, it works better for textured pixels
• Operations can be done one frame at a time, rather than pixel by pixel (efficient)

Iterative refinement: iterative Lucas-Kanade algorithm
1. Estimate the displacement at each pixel by solving the Lucas-Kanade equations
2. Warp I(t) towards I(t+1) using the estimated flow field (basically, just interpolation)
3. Repeat until convergence

Coarse-to-fine flow estimation
• Build Gaussian pyramids of image 1 (at time t) and image 2 (at time t+1); run iterative L-K at the coarsest level, then warp & upsample and run iterative L-K again at each finer level
• A motion of u = 10 pixels at full resolution corresponds to u = 5, 2.5, and 1.25 pixels at successively coarser pyramid levels

Larger motion: iterative refinement, starting from the original (x, y) position
1. Initialize (x’, y’) = (x, y)
2. Compute the displacement (u, v) from the second-moment matrix of the feature patch in the first image and It = I(x’, y’, t+1) − I(x, y, t)
3. Shift the window by (u, v): x’ = x’ + u; y’ = y’ + v
4. Recalculate It
5.
Repeat steps 2-4 until small change • Use interpolation for subpixel values Multi-scale Lucas Kanade Algorithm Example Multi-resolution registration Optical flow results Optical flow results Errors in Lucas-Kanade • The motion is large – Possible Fix: keypoint matching • A point does not move like its neighbors – Possible Fix: region-based matching • Brightness constancy does not hold – Possible Fix: gradient constancy Genesis of Lucas Kanade Algorithm More applications • • • • • Frame interpolation Stabilization Surveillance Interactive games See Nayar’s lecture video Recent optical flow methods Start with something similar to Lucas-Kanade + gradient constancy + energy minimization with smoothing term + region matching + keypoint matching (long-range) Region-based +Pixel-based +Keypoint-based Large displacement optical flow, Brox et al., CVPR 2009 Stereo vs. Optical Flow • Similar dense matching procedures • Why don’t we typically use epipolar constraints for optical flow? B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 674–679, 1981. Summary • Major contributions from Lucas, Tomasi, Kanade – Tracking feature points – Optical flow • Key ideas – By assuming brightness constancy, truncated Taylor expansion leads to simple and fast patch matching across frames – Coarse-to-fine registration CSE 185 Introduction to Computer Vision Pattern Recognition Computer vision: related topics Pattern recognition • One of the leading vision conferences: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) • Pattern recognition and machine learning • Goal: Making predictions or decisions from data Machine learning applications Image categorization Training Training Labels Training Images Image Features Classifier Training Trained Classifier Image categorization Training Training Labels Training Images Image Features Classifier Training Trained Classifier Testing Prediction Image Features Trained Classifier Outdoor Test Image Pattern recognition pipeline • Apply a prediction function to a feature representation of the image to get the desired output: f( ) = “apple” f( ) = “tomato” f( ) = “cow” Machine learning framework y = f(x) output prediction function Image feature • Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set • Testing: apply f to a never seen test example x and output the predicted value y = f(x) Example: Scene categorization • Is this a kitchen? Image features Training Training Labels Training Images Image Features Classifier Training Trained Classifier Image representations • Coverage – Ensure that all relevant info is captured • Concision – Minimize number of features without sacrificing coverage • Directness – Ideal features are independently useful for prediction Image representations • Templates – Intensity, gradients, etc. • Histograms – Color, texture, SIFT descriptors, etc. • Features – PCA, local features, corners, etc. 
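The training/testing framework just described (extract image features, fit a prediction function f on labeled examples, then apply y = f(x) to unseen images) can be sketched with scikit-learn. The library choice, the color-histogram feature, and the nearest-neighbor classifier are illustrative assumptions, not the course's prescribed pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def color_histogram(image, bins=8):
    """A simple global feature: a joint color histogram of an RGB image."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def train_and_predict(train_images, train_labels, test_images):
    """Training: fit f on labeled examples; testing: apply y = f(x)."""
    X_train = np.array([color_histogram(im) for im in train_images])
    X_test = np.array([color_histogram(im) for im in test_images])
    clf = KNeighborsClassifier(n_neighbors=3)   # one of many possible classifiers
    clf.fit(X_train, train_labels)              # minimize error on the training set
    return clf.predict(X_test)                  # predictions for unseen images
```

Any of the classifiers listed below could be swapped in for the nearest-neighbor model without changing the rest of the pipeline.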
Classifiers
Training: Training Images → Image Features → Classifier Training (with Training Labels) → Trained Classifier

Learning a classifier
Given some set of features with corresponding labels, learn a function to predict the labels from the features

Many classifiers to choose from
• Support vector machine
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized forests
• Boosted decision trees
• Adaboost classifier
• K-nearest neighbor
• Restricted Boltzmann machine
• Etc.
Which is the best one?

One way to think about it…
• Training labels dictate that two examples are the same or different, in some sense
• Features and distance measures define visual similarity
• Classifiers try to learn weights or parameters for features and distance measures so that visual similarity predicts label similarity

Machine learning topics

Dimensionality reduction
• Principal component analysis (PCA) is the most important technique to know. It takes advantage of correlations in data dimensions to produce the best possible lower-dimensional representation, according to reconstruction error.
• PCA should be used for dimensionality reduction, not for discovering patterns or making predictions. Don't try to assign semantic meaning to the bases.
• Independent component analysis (ICA), locally linear embedding (LLE), isometric mapping (Isomap), …

Clustering example: segmentation
Goal: break up the image into meaningful or perceptually similar regions
• Segmentation for feature support (e.g., 50x50 patches)
• Segmentation for efficiency [Felzenszwalb and Huttenlocher 2004] [Hoiem et al. 2005, Mori 2005] [Shi and Malik 2001]
• Segmentation as a result

Types of segmentations: oversegmentation, undersegmentation, multiple segmentations

Segmentation approaches
• Bottom-up: group tokens with similar features
• Top-down: group tokens that likely belong to the same object [Levin and Weiss 2006]

Clustering
• Clustering: group together similar points and represent them with a single token
• Key challenges:
– What makes two points/images/patches similar?
– How do we compute an overall grouping from pairwise similarities?
Slide: Derek Hoiem

Why do we cluster?
• Summarizing data
– Look at large amounts of data
– Patch-based compression or denoising
– Represent a large continuous vector with the cluster number
• Counting
– Histograms of texture, color, SIFT vectors
• Segmentation
– Separate the image into different regions
• Prediction
– Images in the same cluster may have the same labels

How do we cluster?
• K-means: iteratively re-assign points to the nearest cluster center
• Agglomerative clustering: start with each point as its own cluster and iteratively merge the closest clusters
• Mean-shift clustering: estimate the modes of the pdf
• Spectral clustering: split the nodes in a graph based on links with similarity weights

Clustering for summarization
Goal: cluster to minimize the variance in the data given the clusters (preserve information)
c*, δ* = argmin_{c,δ} (1/N) Σ_{j=1..N} Σ_{i=1..K} δ_ij (c_i − x_j)²
where the c_i are the cluster centers, the x_j are the data points, and δ_ij indicates whether x_j is assigned to c_i

K-means algorithm
1. Randomly select K centers
2. Assign each point to the nearest center
3. Compute the new center (mean) for each cluster; go back to 2
Illustration: http://en.wikipedia.org/wiki/K-means_clustering

K-means (formal statement)
1. Initialize cluster centers c⁰; t = 0
2.
Assign each point to the closest center:
δᵗ = argmin_δ (1/N) Σ_{j=1..N} Σ_{i=1..K} δ_ij (c_i^{t−1} − x_j)²
3. Update the cluster centers as the mean of their assigned points:
cᵗ = argmin_c (1/N) Σ_{j=1..N} Σ_{i=1..K} δ_ij^t (c_i − x_j)²
4. Repeat 2–3 until no points are re-assigned (t = t + 1); see the NumPy sketch at the end of this section
Slide: Derek Hoiem

Issues with K-means: it converges to a local minimum, so multiple runs give different results
More examples
• https://youtu.be/_aWzGGNrcic?t=262
• https://www.youtube.com/watch?v=BVFG7fd1H30

K-means: design choices
• Initialization
– Randomly select K points as the initial cluster centers
– Or greedily choose K points to minimize the residual
• Distance measures
– Traditionally Euclidean; could be others
• Optimization
– Will converge to a local minimum
– May want to perform multiple restarts

K-means using intensity or color: image → clusters on intensity, clusters on color

Number of clusters?
• Minimum Description Length (MDL) principle for model comparison
• Minimize the Schwarz criterion, also called the Bayesian Information Criterion (BIC)
• Validation set: try different numbers of clusters and look at performance
• When building dictionaries (discussed later), more clusters typically work better
Slide: Derek Hoiem

How to evaluate clusters?
• Generative: how well are points reconstructed from the clusters?
• Discriminative: how well do the clusters correspond to labels (purity)?
– Note: unsupervised clustering does not aim to be discriminative
Slide: Derek Hoiem

K-means pros and cons
• Pros
– Finds cluster centers that minimize conditional variance (a good representation of the data)
– Simple and fast*
– Easy to implement
• Cons
– Need to choose K
– Sensitive to outliers
– Prone to local minima
– All clusters have the same parameters (e.g., the distance measure is non-adaptive)
– *Can be slow: each iteration is O(KNd) for N d-dimensional points
• Usage: rarely used for pixel segmentation

Building visual dictionaries
1. Sample patches from a database, e.g., 128-dimensional SIFT vectors
2. Cluster the patches; the cluster centers are the dictionary
3. Assign a codeword (number) to each new patch, according to the nearest cluster
Examples of learned codewords: most likely codewords for 4 learned “topics” (EM with a multinomial, problem 3, to get the topics)
http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic05b.pdf

CSE 185 Introduction to Computer Vision
Pattern Recognition 2

Unsupervised learning
• Supervised learning: predict a target value (“y”) given features (“x”)
• Unsupervised learning: understand patterns of the data (just “x”)
– Useful for many reasons: data mining (“explain”), missing data values (“impute”), representation (feature generation or selection)
– One example: clustering

Clustering and data compression
• Clustering is related to vector quantization
– Dictionary of vectors (the cluster centers)
– Each original value is represented using a dictionary index
– Each center “claims” a nearby region (Voronoi region)

Agglomerative clustering
• Another simple clustering algorithm: initially, every datum is its own cluster
• Define a distance between clusters (we return to this)
• Initialize: every example is a cluster
• Iterate:
– Compute distances between all clusters (store them for efficiency)
– Merge the two closest clusters
• Save both the clustering and the sequence of cluster operations: the “dendrogram,” in which the height above two samples corresponds to their distance
• Builds up a sequence of clusters (“hierarchical”)
• Algorithm complexity O(N²) (why?)
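A minimal NumPy sketch of the k-means loop stated earlier in this section (assign each point to the closest center, then recompute each center as the mean of its points). The initialization and stopping rule are simple illustrative choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is an N x d array; returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # 1. random init
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # 2. assignment step: delta_ij = 1 for the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # 3. update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[assign == k].mean(axis=0)
                                if np.any(assign == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):   # 4. stop when nothing moves
            break
        centers = new_centers
    return centers, assign
```

Because the result depends on the random initialization and only a local minimum is reached, multiple restarts (different seeds) are commonly used, as noted in the design choices above.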
In Matlab: the “linkage” function (stats toolbox)

Dendrogram and cluster distances: single-link (minimum) distance produces a minimal spanning tree; complete-link (maximum) distance avoids elongated clusters

Example: microarray expression
• Measure gene expression under various experimental conditions
– Cancer vs. normal
– Time
– Subjects
• Explore similarities
– What genes change together?
– What conditions are similar?
• Cluster on both genes and conditions

Agglomerative clustering
• Good
– Simple to implement, widespread application
– Clusters have adaptive shapes
– Provides a hierarchy of clusters
• Bad
– May produce imbalanced clusters
– Still have to choose the number of clusters or a threshold
– Need a good metric to get a meaningful hierarchy

Sidebar: Gaussian modes (parametric distributions and their modes)

Statistical estimation
• Parametric distribution:
– based on some statistical functional form; estimate its parameters
– e.g., Gaussian, mixture of Gaussians
• Non-parametric distribution:
– no assumption about the statistical functional form
– based on kernels
– e.g., k-nearest neighbor, kernel density estimation, mean shift

Mean shift segmentation
D. Comaniciu and P. Meer, Mean Shift: A Robust Approach toward Feature Space Analysis, PAMI 2002.
• A versatile technique for clustering-based segmentation

Mean shift algorithm
• Try to find the modes of a non-parametric density
• Example: trajectories of the mean shift procedure on a 2D (first 2 components) dataset of 110,400 points in the LUV space, yielding 7 clusters
• The density is modeled by kernel density estimation with a Gaussian kernel

Computing the mean shift
Simple mean shift procedure: starting from a region of interest, repeatedly compute the mean shift vector (toward the center of mass of the points in the window) and translate the kernel window by m(x), where
m(x) = [ Σ_{i=1..n} x_i g(‖(x − x_i)/h‖²) ] / [ Σ_{i=1..n} g(‖(x − x_i)/h‖²) ] − x
and g(x) is typically a Gaussian kernel
(a small NumPy sketch of this iteration follows this section)

Attraction basin
• Attraction basin: the region for which all trajectories lead to the same mode
• Cluster: all data points in the attraction basin of a mode

Mean shift clustering
• The mean shift algorithm seeks the modes of the given set of points
1. Choose a kernel and bandwidth
2. For each point:
a) Center a window on that point
b) Compute the mean of the data in the search window
c) Center the search window at the new mean location
d) Repeat (b, c) until convergence
3. Assign points that lead to nearby modes to the same cluster
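A small NumPy sketch of the mean shift iteration above for a single starting point, assuming a Gaussian kernel and a user-chosen bandwidth h; the tolerance and iteration cap are illustrative.

```python
import numpy as np

def mean_shift_mode(x, points, h=1.0, n_iters=50, tol=1e-5):
    """Follow the mean-shift vector m(x) from a starting point to a mode.

    points: n x d data array; h: kernel bandwidth. Returns the mode that
    the trajectory starting at x converges to.
    """
    x = np.asarray(x, dtype=np.float64)
    for _ in range(n_iters):
        d2 = np.sum((points - x) ** 2, axis=1) / (h ** 2)
        w = np.exp(-0.5 * d2)                  # Gaussian kernel weights g(.)
        new_x = np.sum(w[:, None] * points, axis=0) / np.sum(w)
        if np.linalg.norm(new_x - x) < tol:    # converged to a mode
            return new_x
        x = new_x
    return x

# Clustering: run mean_shift_mode from every data point and group the points
# whose modes end up within a small distance of each other (same attraction basin).
```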
Segmentation by mean shift
• Compute features for each pixel (color, gradients, texture, etc.)
• Set the kernel size for features Kf and position Ks
• Initialize windows at individual pixel locations
• Perform mean shift for each window until convergence
• Merge windows that are within the widths of Kf and Ks
Mean shift segmentation results: http://www.caip.rutgers.edu/~comanici/MSPAMI/msPamiResults.html

Mean shift pros and cons
• Pros
– Good general-practice segmentation
– Flexible in the number and shape of regions
– Robust to outliers
• Cons
– Have to choose the kernel size in advance
– Not suitable for high-dimensional features
• When to use it
– Oversegmentation
– Multiple segmentations
– Tracking, clustering, filtering applications

Spectral clustering
• Group points based on links in a graph; segmenting corresponds to cuts in the graph
• Normalized cut
– An unnormalized cut tends to favor cutting off small, isolated segments
– Fix by normalizing the cut cost by the size of the segments
– volume(A) = sum of the costs of all edges that touch A

Which algorithm to use?
• Quantization/summarization: K-means
– Aims to preserve the variance of the original data
– Can easily assign a new point to a cluster
– Example: quantization for computing histograms; a summary of 20,000 photos of Rome using “greedy k-means,” http://grail.cs.washington.edu/projects/canonview/
• Image segmentation: agglomerative clustering
– More flexible with distance measures (e.g., they can be based on boundary prediction)
– Adapts better to specific data
– The hierarchy can be useful
http://www.cs.berkeley.edu/~arbelaez/UCM.html

Things to remember
• K-means is useful for summarization, building dictionaries of patches, and general clustering
• Agglomerative clustering is useful for segmentation and general clustering
• Spectral clustering is useful for determining relevance, summarization, and segmentation
Key clustering algorithm: K-means

CSE 185 Introduction to Computer Vision
Face Recognition

The space of all face images
• When viewed as vectors of pixel values, face images are extremely high-dimensional: a 100x100 image = 10,000 dimensions
• However, relatively few 10,000-dimensional vectors correspond to valid face images
• We want to effectively model the subspace of face images

In one thumbnail face image
• Consider a thumbnail 19 x 19 face pattern
• There are 256^361 possible combinations of gray values: 256^361 = 2^(8·361) = 2^2888
• The total world population is about 7,800,000,000 ≈ 2^33
• That is vastly more (about 2^2855 times more) than the world population!
• An extremely high dimensional space!
The space of all face images
• We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images

Recall
• For one random variable x with mean x̄, the variance is var(x) = E[(x − x̄)²]
• The standard deviation is σ = sqrt(var(x))

Covariance and correlation of two random variables x and y (heights and weights of 12 subjects):
Subject:    1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12   (mean)
Height (X): 69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63  (65.42)
Weight (Y): 108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120  (125.83)
Sum(X) = 785, Sum(Y) = 1510, Sum(X²) = 51473, Sum(Y²) = 192238, Sum(XY) = 99064
cov_xy = Σ(x − x̄)(y − ȳ)/(N − 1) = (99064 − (785)(1510)/12)/11 ≈ 25.89
r = cov_xy/(s_x s_y) = 25.89/((3.32)(14.24)) ≈ 0.55, where s_x and s_y are the standard deviations
(a small NumPy check of these numbers follows this section)

Different correlation coefficients: scatter plots for r = −1, −0.6, 0, +0.3, +1
Correlation captures linear relationships, not curvilinear ones, and ranges from strong to weak to no relationship:
0.0 ≤ |r| < 0.3: weak correlation
0.3 ≤ |r| < 0.7: moderate correlation
0.7 ≤ |r| < 1.0: strong correlation
|r| = 1.0: perfect correlation

Covariance of random vectors
• Covariance is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction
• For two random vectors X and Y (column vectors):
Cov(X, Y) = E[(X − X̄)(Y − Ȳ)ᵀ], with X̄ = E[X] and Ȳ = E[Y]
See the notes on random vectors and covariance for details

Covariance matrix
• Variance-covariance matrix: variances and covariances are displayed together in a variance-covariance matrix. The variances appear along the diagonal and the covariances appear in the off-diagonal elements
• For X = [X1, X2, …, Xp]ᵀ: Cov(X) = E[(X − X̄)(X − X̄)ᵀ] is a p x p matrix with entries C_ij = Cov(X_i, X_j), so the diagonal entries are Var(X1), …, Var(Xp)
• For face images, the diagonal describes the variance of any single pixel and the off-diagonal entries describe the covariance of any two pixels (see the video for details)

Dimensionality reduction
• Some datasets have no dominant variation along any direction; others have dominant variation along one direction
• Distribution of data points: points concentrated in a small region → low energy; points scattered everywhere → high energy
• Use the covariance matrix to represent how the points are scattered, and find the direction that captures the most variation/energy

Principal components
• When there is no dominant direction, the 1st and 2nd principal components are of equal importance (each captures a similar amount of variation/energy)
• When the dominant variation is along one direction (e.g., the y axis, or an oblique direction), the 1st principal component lies along that direction (capturing the maximum variation/energy) and the 2nd principal component is orthogonal to it

Projection and basis vectors
i = e_x = [1, 0, 0]ᵀ, j = e_y = [0, 1, 0]ᵀ, k = e_z = [0, 0, 1]ᵀ
a_x = a · e_x, a_y = a · e_y, a_z = a · e_z
a = a_x e_x + a_y e_y + a_z e_z
The same idea applies to projection onto any orthonormal basis (new basis vectors)

Eigenvector and eigenvalue
Ax = λx, where A is a square matrix, x is an eigenvector (characteristic vector), and λ is an eigenvalue (characteristic value)
Example: A = [2 −12; 1 −5]
|λI − A| = (λ − 2)(λ + 5) + 12 = λ² + 3λ + 2 = (λ + 1)(λ + 2), so the eigenvalues are λ = −1 and λ = −2
Example: A = [2 1 0; 0 2 0; 0 0 2]
|λI − A| = (λ − 2)³ = 0, so λ = 2 (with multiplicity 3)
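A small NumPy check of the numbers above, assuming NumPy is available; the printed values should match the hand computation up to rounding.

```python
import numpy as np

height = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63])
weight = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120])

cov_xy = np.cov(height, weight)[0, 1]           # sample covariance, about 25.9
r = np.corrcoef(height, weight)[0, 1]           # correlation coefficient, about 0.55

C = np.cov(np.vstack([height, weight]))         # 2 x 2 variance-covariance matrix
evals, evecs = np.linalg.eigh(C)                # eigen-decomposition of C

A = np.array([[2.0, -12.0],
              [1.0, -5.0]])
print(np.linalg.eigvals(A))                     # approximately [-1, -2], as derived above
```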
Principal component analysis (PCA)
• The direction that captures the maximum covariance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix
• Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues

Linear subspaces
• Convert a two-dimensional point x into (v1, v2) coordinates
– What does the v2 coordinate measure? The distance to the line; use it for classification (near 0 for the orange points)
– What does the v1 coordinate measure? The position along the line; use it to specify which orange point it is
• Classification can be expensive: a big search problem (e.g., nearest neighbors) or storing large PDFs
• Suppose the data points are arranged as above: the idea is to fit a line, and the classifier measures the distance to the line

Dimensionality reduction
• We can represent the orange points with only their v1 coordinates (since their v2 coordinates are all essentially 0)
• This makes it much cheaper to store and compare points
• It is a bigger deal for higher dimensional problems

Linear subspace
Consider the variation along a direction v among all of the orange points: what unit vector v minimizes the variance, and what unit vector v maximizes it?
Since (x · v)² = (xᵀv)ᵀ(xᵀv) = vᵀ x xᵀ v, the variance along v is a quadratic form in the matrix A = Σ x xᵀ.
Solution: v1 is the eigenvector of A with the largest eigenvalue; v2 is the eigenvector of A with the smallest eigenvalue

Principal component analysis
• Suppose each data point is p-dimensional; the same procedure applies
– The eigenvectors of A define a new coordinate system
• The eigenvector with the largest eigenvalue captures the most variation among the training vectors x
• The eigenvector with the smallest eigenvalue has the least variation
– We can compress the data using the top few eigenvectors
• This corresponds to choosing a “linear subspace”: representing points on a line, plane, or “hyper-plane”
• These eigenvectors are known as the principal components

Principal component analysis
• Given: N data points x1, …, xN in R^p
• We want to find a new set of features that are linear combinations of the original ones: u(xi) = uᵀ(xi − µ), where µ is the mean of the data points
• Choose the unit vector u in R^p that captures the most data variance, var(u(xi))
Forsyth & Ponce, Sec. 22.3.1, 22.3.2

Another derivation of PCA
• Find the direction that maximizes the variance of the projected data, subject to ‖u‖ = 1:
var(uᵀ(xi − µ)) = (1/N) Σ_{i=1..N} (uᵀ(xi − µ))² = uᵀ C u, where C is the covariance matrix of the data
• The direction that maximizes the variance is the eigenvector associated with the largest eigenvalue of C

Principal component analysis
• For x_i ∈ R^p, i = 1, …, N, the eigenvectors of the covariance C ∈ R^{p×p} are the principal components: Cx = λx, i.e., Cx_i = λ_i x_i, i = 1, …, p
• The first principal component is the eigenvector with the largest eigenvalue
• The second principal component is the eigenvector with the second largest eigenvalue

The space of faces
• An image is a point in a high dimensional space
– An N x M image is a point in R^{NM}
– We can define vectors in this space as we did in the 2D case

Eigenfaces: key idea
• Assume that most face images lie on a low-dimensional subspace determined by the first k (k < p) directions of maximum variance
• Use PCA to determine the vectors, or “eigenfaces,” u1, …, uk that span that subspace
• Represent all face images in the dataset as linear combinations of eigenfaces (a sketch follows this section)
M. Turk and A. Pentland, Face Recognition Using Eigenfaces, CVPR 1991
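A minimal sketch of the eigenface computation, projection, and reconstruction just described, assuming the face images have already been vectorized into the rows of a matrix. This is an illustration rather than the Turk-Pentland implementation: for p much larger than N, practical code uses the smaller N x N Gram-matrix trick or an SVD instead of forming the p x p covariance explicitly.

```python
import numpy as np

def fit_eigenfaces(X, k):
    """X: N x p matrix of vectorized face images (one image per row).

    Returns the mean face mu (length p) and the top-k eigenfaces U (p x k),
    i.e., the k eigenvectors of the covariance with the largest eigenvalues.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    C = (Xc.T @ Xc) / X.shape[0]            # p x p covariance matrix
    evals, evecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :k]               # top-k principal directions (eigenfaces)
    return mu, U

def project(x, mu, U):
    """Face-space coordinates w = (u1^T(x - mu), ..., uk^T(x - mu))."""
    return U.T @ (x - mu)

def reconstruct(w, mu, U):
    """x_hat = mu + w1 u1 + w2 u2 + ... + wk uk."""
    return mu + U @ w
```

Recognition then amounts to projecting a new face with project() and finding the nearest labeled training face in the k-dimensional coefficient space, as the algorithm below describes.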
Eigenfaces example
• Training images x1, …, xN
• Top eigenvectors u1, …, uk and mean μ (see the video for an illustration)
• Each principal component (eigenvector) uk can be visualized as μ + 3σk uk and μ − 3σk uk
• A face x in “face space” coordinates: (w1, …, wk) = (u1ᵀ(x − µ), …, ukᵀ(x − µ))
• Reconstruction: x̂ = µ + w1u1 + w2u2 + w3u3 + w4u4 + …

Reconstruction
Each image is 92 x 112 pixels, i.e., 10,304-dimensional; reconstructions with P = 4, P = 200, and P = 400 eigenfaces, after computing eigenfaces from 400 face images of the ORL face database

Recognition with eigenfaces
• Process labeled training images:
1. Find the mean µ and covariance matrix C
2. Find the k principal components (eigenvectors of C) u1, …, uk
3. Project each training image xi onto the subspace spanned by the principal components: (wi1, …, wik) = (u1ᵀ(xi − µ), …, ukᵀ(xi − µ))
• Given a novel image x:
1. Project onto the subspace: (w1, …, wk) = (u1ᵀ(x − µ), …, ukᵀ(x − µ))
2. Optional: check the reconstruction error x − x̂ to determine whether the image is really a face
3. Classify as the closest training face in the k-dimensional subspace

Eigenfaces
• PCA extracts the eigenvectors of C
– This gives a set of vectors v1, v2, v3, …
– Each vector is a direction in face space; what do these look like?
• Projecting onto the eigenfaces: the eigenfaces v1, …, vK span the space of faces, and a face is converted to eigenface coordinates by projecting onto them

Recognition with eigenfaces: algorithm
1. Process the image database (a set of images with labels)
• Run PCA to compute the eigenfaces
• Calculate the K coefficients for each image
2. Given a new image (to be recognized) x, calculate its K coefficients
3. Detect whether x is a face
4. If it is a face, who is it? Find the closest labeled face in the database (nearest neighbor in the K-dimensional space)

Choosing the dimension K
• How many eigenfaces to use?
• Look at the decay of the eigenvalues
– The eigenvalue tells you the amount of variance “in the direction” of that eigenface
– Ignore eigenfaces with low variance

Limitations
• Global appearance method: not robust to misalignment or background variation
• PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix C); the shape of some datasets is not well described by their principal components
• The direction of maximum variance is not always good for classification

CSE 185 Introduction to Computer Vision
Deep Learning

Deep learning
• Introduction
• Convolutional Neural Networks
• Other Neural Networks

Brief history
• 1942: McCulloch and Pitts neuron
• 1959: Adaline by Widrow and Hoff
• 1960: Mark I Perceptron by Rosenblatt
• 1969: Perceptrons by Minsky and Papert
• 1973: First AI winter
• 1986: Backpropagation by Rumelhart, Hinton, and Williams
• 1989: CNN by LeCun et al.
• 1990s-2000s: Second AI winter
• 2012: AlexNet by Krizhevsky, Sutskever, and Hinton
• 2014: Generative Adversarial Networks by Goodfellow, … and Bengio
• 2018: Turing Award to Hinton, Bengio, and LeCun
See note 1, note 2

Other neural networks
• Recurrent neural networks (RNNs)
• Long short-term memory (LSTM)
• Graph neural networks (GNNs)
• Generative adversarial networks (GANs)
• Transformers

Introduction
• Introduction to deep learning: slides, video
• Introduction to convolutional neural networks for visual recognition: slides

Some vision tasks
• Object detection: R-CNN, YOLO
• Object segmentation: Mask R-CNN
• Object tracking:
• Scene parsing: DeepLab
• Depth estimation: DPT
• Optical flow: PWC-Net
• Frame interpolation: DAIN
• Seeing through obstruction:

Recent results: StyleGAN, GPT-3, DALL-E

Online resources
• MIT: Introduction to deep learning
• Stanford: Convolutional networks for visual recognition, video
• Coursera: deep learning