
CSE 185 Computer Vision

CSE 185
Introduction to Computer Vision
Cameras
Cameras
• Camera models
– Pinhole perspective projection
– Weak perspective
– Affine Projection
• Camera with lenses
• Sensing
• Human eye
• Reading: S Chapter 2
Images are two-dimensional patterns of brightness values, formed by the projection of 3D objects.
Figure from US Navy Manual of Basic Optics and Optical Instruments, prepared by Bureau of Naval Personnel. Reprinted by Dover Publications, Inc., 1969.
Animal eye: a long time ago
Photographic camera:
Niepce, 1816
Pinhole perspective projection: Brunelleschi, XVth Century.
Camera obscura: XVIth Century
A is half the size of B
C is half the size of B
B’ and C’ have the same size
Parallel lines converge on a line formed by the intersection of the image plane with the plane parallel to π that passes through the pinhole.
A line L in π that is parallel to the image plane has no image at all.
Vanishing point
See https://www.youtube.com/watch?v=4H-DYdKYkqk for more explanations
Vanishing point
Last Supper, Leonardo da Vinci
Vanishing line
The lines all converge in his right eye, drawing the viewer's gaze to this point
Urban scene
Vanishing points and lines
Vanishing Point
Vanishing Line
• The projections of parallel 3D lines intersect at a vanishing point
• The projections of parallel 3D planes intersect at a vanishing line
• If a set of parallel 3D lines is also parallel to a particular plane, their vanishing point will lie on the vanishing line of the plane
• Not all lines that intersect are parallel
• Vanishing point <-> 3D direction of a line
• Vanishing line <-> 3D orientation of a surface
Perspectives
See https://www.youtube.com/watch?v=2HfIU5lp-0I&t=1s for more explanations
Vanishing point: applications
• Camera calibration:
– Use the properties of vanishing points to find intrinsic and
extrinsic camera parameters
• 3D reconstruction:
– Man-made structures have two main characteristics:
several lines are in parallel and a number of edges are
orthogonal
– Using sets of parallel lines, the orientation of the plane can
be estimated using vanishing points
• Robot navigation
Vanishing points and lines
Photo from online Tate collection
Note on estimating vanishing points
Use multiple lines for better accuracy
… but lines will not intersect at exactly the same point in practice
One solution: take mean of intersecting pairs
… bad idea!
Instead, minimize angular differences
Vanishing points and lines
Vertical vanishing
point
(at infinity)
Vanishing
line
Vanishing
point
Slide from Efros, Photo from Criminisi
Vanishing
point
Robot navigation
Orthogonal vanishing points
• Once sets of mutually orthogonal vanishing points are detected, it is
possible to search for 3D rectangular structures in the image
Pinhole perspective equation
P:(x, y, z)
P’: (x’, y’, z’)
i axis -> x
j axis -> y
k axis -> z
• C’ :image center
• OC’ : optical axis
• π’ : image plane is at a
positive distance f’ from
the pinhole
• OP' = λOP and z' = f', so

  x' = λx,  y' = λy,  f' = λz  ⟹  λ = x'/x = y'/y = f'/z

which gives the perspective projection equations

  x' = f' x / z
  y' = f' y / z
NOTE: z is
always
negative
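A minimal MATLAB sketch of these projection equations (not from the slides): the point coordinates and the image-plane distance fp (f' above) are made-up values, with z kept negative as noted.

fp = 0.05;                        % assumed focal distance f'
P  = [ 0.2 -0.1 -1.3;             % each row is a 3D point (x, y, z), z < 0
      -0.4  0.5 -2.0];
xp = fp * P(:,1) ./ P(:,3);       % x' = f' x / z
yp = fp * P(:,2) ./ P(:,3);       % y' = f' y / z
disp([xp yp]);                    % projected image coordinates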
Weak perspective projection
frontal-parallel plane π0 defined by z = z0

  x' = −m x,   y' = −m y,   where  m = −f'/z0  is the magnification
When the scene relief (depth) is small compared with its distance from the camera, m can be taken as constant → weak perspective projection
Orthographic projection

  x' = x,   y' = y

When the camera is at a (roughly constant) distance from the scene, take m = −1 → orthographic projection
Issues with pinhole camera
Pinhole too big:
many directions are
averaged, blurring the image
Pinhole too small:
diffraction effects blur the
image
Generally, pinhole
cameras are dark, because
a very small set of rays
from a particular point
hits the screen
Lenses
Snell's law (aka Descartes' law)

  n1 sin α1 = n2 sin α2

n1: index of refraction of the incident medium
n2: index of refraction of the refracting medium
(figure: the reflected and refracted rays at the interface)
Paraxial (or first-order) optics
Snell's law:  n1 sin α1 = n2 sin α2
Small angles:  n1 α1 ≈ n2 α2
Paraxial (or first-order) optics

  α1 = β1 + γ ≈ h/d1 + h/R
  α2 = γ − β2 ≈ h/R − h/d2

Small angles:  n1 α1 = n2 α2, which gives the paraxial refraction equation

  n1/d1 + n2/d2 = (n2 − n1)/R
Thin lens
All other rays
passing through P
are focused on P’
  x' = z' x / z
  y' = z' y / z

where f is the focal length:

  1/z' − 1/z = 1/f    and    f = R / (2(n − 1))
F, F’: focal points
Depth of field and field of view
• Depth of field (field of focus): objects within
certain range of distances are in acceptable
focus
– Depends on focal length and aperture
• Field of view: portion of scene space that is actually projected onto the camera sensor
– Not only defined by focal length
– But also effective sensor area
Depth of field
f-number:
N=f/D
f: focal length
D: aperture
diameter
f / 5.6 (large aperture)
f / 32 (small aperture)
• Changing the aperture size affects depth of field
– Increasing f-number (reducing aperture diameter) increases DOF
– A smaller aperture increases the range in which the object is
approximately in focus
Thick lenses
• Simple lenses suffer from several aberrations
• First order approximation is not sufficient
• Use 3rd order Taylor approximation
Orthographic (“telecentric”) lenses
Navitar telecentric zoom lens
Telecentric lens
Correcting radial distortion
Spherical aberration
• rays do not intersect at one point
• circle of least confusion
Distortion (optics)
• pincushion
• barrel
Chromatic aberration
• refracted rays of different wavelengths intersect the optical axis at different points
Vignetting
• Aberrations can be minimized by well-chosen shapes and refraction indexes,
separated by appropriate stops
• However, light rays from object points off-axis are partially blocked by lens
configuration → vignetting → brightness drop in the image periphery
Human eye
Cornea: transparent, highly curved refractive component
Pupil: opening at the center of the iris; its size changes in response to illumination
Helmholtz's Schematic Eye
Retina
Retina: thin, layered membrane with
two types of photoreceptors
• rods: very sensitive to light but
poor spatial detail
• cones: sensitive to spatial details
but active at higher light level
• generally called receptive field
Cones in the fovea
Rods and cones in the periphery
Photographs (Niepce, “La
Table Servie,” 1822)
Milestones:
Daguerreotypes (1839)
Photographic Film (Eastman,
1889)
Cinema (Lumière Brothers,
1895)
Color Photography (Lumière
Brothers, 1908)
Television (Baird, Farnsworth,
Zworykin, 1920s)
CCD Devices (1970)
Collection Harlingue-Viollet. .
360 degree field of view…
• Basic approach
– Take a photo of a parabolic mirror with an orthographic lens
– Or buy a lens from a variety of omnicam manufacturers…
• See http://www.cis.upenn.edu/~kostas/omni.html
Digital camera
• A digital camera replaces film with a sensor array
– Each cell in the array is a Charge Coupled Device (CCD)
• light-sensitive diode that converts photons to electrons
• Complementary Metal-Oxide-Semiconductor (CMOS) sensors are an alternative
• CMOS is becoming more popular
http://electronics.howstuffworks.com/digital-camera.htm
Image sensing pipeline
A simple camera pipeline
Gray-scale image
• Gray scale: 0–255
• Usually normalized between 0 and 1 (dividing by 255) and converted into a vector for processing
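A minimal MATLAB sketch of this normalization step ('cameraman.tif' ships with the Image Processing Toolbox):

I = imread('cameraman.tif');      % uint8 gray-scale image, values 0-255
x = double(I) / 255;              % normalize to [0, 1]
v = x(:);                         % flatten into a column vector for processing
fprintf('%d pixels, range [%.2f, %.2f]\n', numel(v), min(v), max(v));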
In a 19 × 19 face image
• Consider a thumbnail 19 × 19 face image (361 pixels)
• 256^361 possible combinations of gray values
• 256^361 = 2^(8·361) = 2^2888
• Total world population (as of 2021): 7,880,000,000 < 2^33
• About 2^2855 times the world population!
• Extremely high dimensional space!
Color image
• Usually represented in three channels
CSE 185
Introduction to Computer Vision
Light and color
Light and color
• Human eye
• Light
• Color
• Projection
• Reading: Chapters 2,6
Camera aperture
f / 5.6 (large aperture)
f / 32 (small aperture)
The human eye
• The human eye is a camera
– Iris: colored annulus with radial muscles
– Pupil: the hole (aperture) whose size is controlled by the iris
– What’s the “film”?
photoreceptor cells (rods and cones) in the retina
Human eye
Retina: thin, layered membrane with two types of photoreceptors
• rods: very sensitive to light but poor spatial detail
• cones: sensitive to spatial details but active at higher light level
• generally called receptive field
Human vision system (HVS)
Exploiting HVS model
• Flicker frequency of film and TV
• Interlaced television
• Image compression
JPEG compression
Uncompressed 24 bit RGB bit map: 73,242 pixels require 219,726 bytes (excluding headers)
Q=100 Compression ratio: 2.6
Q=50 Compression ratio: 15
Q=10 Compression ratio: 46
Q=1 Compression ratio: 144
Q=25 Compression ratio: 23
JPEG compression
Digital camera
CCD
• Low-noise images
• Consumes more power
• More and higher quality pixels
vs.
CMOS
• More noise (sensor area is smaller)
• Consumes much less power
• Popular in camera phones
• Getting better all the time
http://electronics.howstuffworks.com/digital-camera.htm
Color
What colors do humans see?
The colors of the visible light spectrum
  color    wavelength interval    frequency interval
  red      ~ 700–635 nm           ~ 430–480 THz
  green    ~ 560–490 nm           ~ 540–610 THz
  blue     ~ 490–450 nm           ~ 610–670 THz
Color
• Plot of all visible colors
(Hue and saturation):
• Color space: RGB, CIE
LUV, CIE XYZ, CIE LAB,
HSV, HSL, …
• A color image can be
represented by 3 image
planes
Bayer pattern
Color filter array
• A practical way to record primary colors is to use a color filter array
• Single-chip image sensor: filter pattern is 50% G, 25% R, 25% B
• Since each pixel is filtered to record only one color, various
demosaicing algorithms can be used to interpolate a set of
complete RGB for each point
• Some high-end video cameras have 3 CCD chips
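A minimal MATLAB sketch of the demosaicing step, assuming a raw single-channel Bayer frame stored in a hypothetical file and an 'rggb' filter alignment; demosaic interpolates a complete RGB triple for each pixel:

raw = imread('bayer_frame.png');  % hypothetical raw Bayer-pattern image (uint8)
rgb = demosaic(raw, 'rggb');      % interpolate full R, G, B at every pixel
figure, imshow(rgb), title('Demosaiced image');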
Demosaicing
original reconstructed
Demosaicing
• How can we compute an R, G, and B value for every pixel?
Color camera
Bayer mosaic color filter
CCD prism-based color configuration
Grayscale image
• Mainly dealing with intensity (luminance)
• Usually 256 levels (1 byte per pixel): 0 (black)
to 255 (white) (often normalized between 0
and 1)
• Several ways to convert color to grayscale,
e.g., Y=0.2126R+0.7152G+0.0722B
Recolor old photos
http://twistedsifter.com/2013/08/historic-black-white-photos-colorized/
Projection
• See Szeliski 2.1
Projection
• Fun application
• PhotoFunia
Correspondence and alignment
• Correspondence: matching points, patches,
edges, or regions across images
≈
How do we fit the best alignment?
Common transformations
original
Transformed
aspect
rotation
translation
affine
perspective
Modeling projection
• The coordinate system
  – We will use the pin-hole model as an approximation
  – Put the optical center (Center Of Projection) at the origin
  – Put the image plane (Projection Plane) in front of the COP
  – The camera looks down the negative z axis
    • we need this if we want right-handed coordinates
Modeling projection
• Projection equations
– Compute intersection with PP of ray from (x,y,z) to COP
– Derived using similar triangles
• By similar triangles:  x'/x = y'/y = −d/z,  so  (x, y, z) → (−d x/z, −d y/z, −d)
• We get the image coordinates by throwing out the last coordinate: (−d x/z, −d y/z)
Homogeneous coordinates
• Is this a linear transformation?
• no—division by z is nonlinear
Trick: add one more coordinate:
homogeneous image
coordinates
homogeneous scene
coordinates
Converting from homogeneous coordinates
Perspective projection
• Projection is a matrix multiply using homogeneous coordinates:
divide by third coordinate
This is known as perspective projection
• The matrix is the projection matrix
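A minimal MATLAB sketch of this matrix form, assuming an image-plane distance d; dividing the homogeneous result by its third coordinate recovers (-d x/z, -d y/z):

d  = 1;                           % assumed distance from COP to PP
Pi = [1 0  0   0;
      0 1  0   0;
      0 0 -1/d 0];                % projection matrix
X  = [2; 3; -5; 1];               % homogeneous scene point (z < 0)
xh = Pi * X;                      % homogeneous image point
x  = xh(1:2) / xh(3);             % divide by third coordinate
disp(x');                         % [-d*x/z, -d*y/z] = [0.4, 0.6]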
Perspective projection
• How does scaling the projection matrix change the transformation?
Orthographic projection
• Special case of perspective projection
– Distance from the COP to the PP is infinite
Image
World
– Good approximation for telephoto optics
– Also called “parallel projection”: (x, y, z) → (x, y)
– What’s the projection matrix?
Orthographic projection variants
• Scaled orthographic
– Also called “weak perspective”
– Affine projection
• Also called “paraperspective”
Camera parameters
• See MATLAB camera calibration example
Camera parameters
A camera is described by several parameters
• Translation T of the optical center from the origin of world coordinates
• Rotation R of the image plane
• Focal length f, principal point (x'c, y'c), pixel size (sx, sy)
• T and R are the extrinsics; f, (x'c, y'c), and (sx, sy) are the intrinsics
Projection equation

  x = [ s·x ]   [ * * * * ] [ X ]
      [ s·y ] = [ * * * * ] [ Y ] = Π X
      [  s  ]   [ * * * * ] [ Z ]
                            [ 1 ]

• The projection matrix models the cumulative effect of all parameters
• Useful to decompose into a series of operations

  Π =  [ −f·sx    0    x'c ]   [ 1 0 0 0 ]   [ R_3x3   0_3x1 ]   [ I_3x3   T_3x1 ]
       [   0   −f·sy   y'c ] · [ 0 1 0 0 ] · [ 0_1x3     1   ] · [ 0_1x3     1   ]
       [   0      0     1  ]   [ 0 0 1 0 ]
         (intrinsics)           (projection)   (rotation)          (translation)
The definitions of these parameters are not completely standardized
– especially intrinsics—varies from one book to another
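A minimal MATLAB sketch of the decomposition above, with made-up intrinsic and extrinsic values (focal length, pixel density, principal point, rotation, and translation are all assumptions):

f = 0.05;  sx = 1e4;  sy = 1e4;   % assumed focal length and pixel densities
xc = 320;  yc = 240;              % assumed principal point (pixels)
K  = [-f*sx  0     xc;
       0    -f*sy  yc;
       0     0     1];            % intrinsics
proj = [eye(3) zeros(3,1)];       % canonical projection [I | 0]
th = deg2rad(10);
R  = [cos(th) -sin(th) 0;
      sin(th)  cos(th) 0;
      0        0       1];        % rotation (here about the optical axis)
T  = [0.1; -0.2; 2];              % translation of the optical center
Pi = K * proj * [R zeros(3,1); zeros(1,3) 1] * [eye(3) T; zeros(1,3) 1];
Xw = [0.3; 0.4; -5; 1];           % homogeneous world point
x  = Pi * Xw;  x = x(1:2) / x(3); % pixel coordinates after the division
disp(x');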
CSE 185
Introduction to Computer Vision
Image Filtering:
Spatial Domain
Image filtering
• Spatial domain
• Frequency domain
• Reading: Chapter 3
3D world from 2D image
Analysis from local evidence
Image filters
• Spatial domain
– Filtering is a mathematical operation on a grid of numbers
– Smoothing, sharpening, measuring texture
• Frequency domain
– Filtering is a way to modify the frequencies of images
– Denoising, sampling, image compression
• Templates and image pyramids
– Filtering is a way to match a template to the image
– Detection, coarse-to-fine registration
Image filtering
• Image filtering: compute a function (also known as the kernel) of the local neighborhood at each position
• Important tools
– Enhance images
• Denoise, resize, increase contrast, etc.
– Extract information from images
• Texture, edges, distinctive points, etc.
– Detect patterns
• Template matching
Box filter

  g[·,·] =  1 1 1
            1 1 1
            1 1 1    (normalized by 1/9 so that it averages)

Image filtering

  h[m, n] = Σ_{k,l} g[k, l] f[m+k, n+l]

f[·,·]: input image,  g[·,·]: filter,  h[·,·]: output.
(The slides step the 3×3 box filter across a 10×10 test image: a block of 90s on a background of 0s, with one 0 inside the block and one stray 90 outside. The first outputs along the top of the block are 10, 20, 30, 30, ..., and the full output is a smoothed version of the input, ramping from 0 up to 90 and back down.)
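A minimal MATLAB sketch of this sliding-window sum; the 10x10 test image below only approximates the one on the slides (a block of 90s with one interior 0 and one stray 90):

f = zeros(10, 10);
f(3:7, 4:8) = 90;                 % bright block on a dark background
f(5, 6) = 0;                      % a dark pixel inside the block
f(8, 3) = 90;                     % a stray bright pixel outside it
g = ones(3, 3) / 9;               % 3x3 box filter (averaging)
h = imfilter(f, g, 0, 'corr');    % zero-padded correlation, as in the slides
disp(round(h));                   % smoothed output: values ramp 10, 20, 30, ...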
Example
3*1+0*0+1*(-1)+1*1+5*0+8*(-1)+2*1+7*0+2*(-1)=-5
convert image and filter mask into vectors and apply dot product
Convolution of a 6×6 matrix with a 3×3 matrix.
As a result we get a 4×4 matrix.
See http://datahacker.rs/edge-detection/
Edge Detection
Edge detection – an original image (left), a filter (in the
middle), a result of a convolution (right)
Dot product
• Find the angle between u = (4, 3) and v = (3, 5)
• Solution: cos θ = u·v / (|u||v|) = (4·3 + 3·5) / (5·√34) = 27 / (5√34) ≈ 0.93, so θ ≈ 22°
• When do two vectors have maximal value in
dot product?
• Consider convolution/filter in a similar way
• Consider a filter as a probe to infer response
in an image
Box filter
g[ , ]
What does it do?
• Replaces each pixel with
an average of its
neighborhood
• Achieve smoothing effect
(remove sharp features)
  1 1 1
  1 1 1
  1 1 1
Smoothing with box filter
Practice with linear filters
0 0 0
0 1 0
0 0 0
Original
?
Practice with linear filters
0 0 0
0 1 0
0 0 0
Original
Filtered
(no change)
Practice with linear filters
0 0 0
0 0 1
0 0 0
Original
?
Practice with linear filters
0 0 0
0 0 1
0 0 0
Original
Shifted left
By 1 pixel
Practice with linear filters
0 0 0
0 2 0
0 0 0
Original
-
1 1 1
1 1 1
1 1 1
(Note that filter sums to 1)
?
Practice with linear filters
0 0 0
0 2 0
0 0 0
Original
-
1 1 1
1 1 1
1 1 1
Sharpening filter
- Accentuates differences
with local average
Sharpening
Sobel filter
  Sobel (vertical edges):
     1  0  -1
     2  0  -2
     1  0  -1

(figure: the filter applied to an example image with a vertical intensity step; the response, in absolute value, highlights the vertical edge)
See https://www.youtube.com/watch?v=am36dePheDc
Sobel filter
  Sobel (horizontal edges):
     1   2   1
     0   0   0
    -1  -2  -1

(absolute value of the response)
Synthesize motion blur
I = imread('cameraman.tif');
subplot(2,2,1); imshow(I); title('Original Image');
H = fspecial('motion', 20, 45);
MotionBlur = imfilter(I, H, 'replicate');
subplot(2,2,2); imshow(MotionBlur); title('Motion Blurred Image');
H2 = fspecial('disk', 10);
blurred2 = imfilter(I, H2, 'replicate');
subplot(2,2,3); imshow(blurred2); title('Blurred Image');
H3 = fspecial('gaussian', [10 10], 2);
blurred3 = imfilter(I, H3, 'replicate');
subplot(2,2,4); imshow(blurred3); title('Gaussian Blurred Image');
Sample code
Prewitt filter / Sobel filter

I = imread('cameraman.tif');
Hpr = fspecial('prewitt');
Hso = fspecial('sobel');
fHpr = imfilter(I, Hpr);
fHso = imfilter(I, Hso);
fHso2 = imfilter(I, Hso');
subplot(2,2,1); imshow(I); title('Original Image');
subplot(2,2,2); imshow(fHpr); title('Prewitt filter');
subplot(2,2,3); imshow(fHso); title('Sobel filter horizontal');
subplot(2,2,4); imshow(fHso2); title('Sobel filter vertical');
Demosaicing
• How can we compute an R, G, and B value for every pixel?
Demosaicing
• How can we compute an R, G, and B value for every pixel?
Filtering vs. convolution
• 2D filtering (g = filter, f = image)
  – h = filter2(g, f);  or  h = imfilter(f, g);

    h[m, n] = Σ_{k,l} g[k, l] f[m+k, n+l]

• 2D convolution
  – h = conv2(g, f);

    h[m, n] = Σ_{k,l} g[k, l] f[m−k, n−l]
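A minimal MATLAB sketch of the difference between the two: with an asymmetric kernel, convolution equals filtering with the kernel rotated by 180 degrees.

f = magic(5);                          % small test "image"
g = [1 0 -1; 2 0 -2; 1 0 -1];          % asymmetric (Sobel-like) kernel
h_filt = filter2(g, f, 'same');        % correlation: g[k,l] f[m+k,n+l]
h_conv = conv2(f, g, 'same');          % convolution: g[k,l] f[m-k,n-l]
h_flip = conv2(f, rot90(g, 2), 'same');% convolution with the flipped kernel
disp(max(abs(h_filt(:) - h_flip(:)))); % matches correlation exactly (prints 0)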
Key properties of linear filters
• Linearity:
filter(f1 + f2) = filter(f1) + filter(f2)
• Shift invariance: same behavior regardless of
pixel location
filter(shift(f)) = shift(filter(f))
• Any linear, shift-invariant operator can be
represented as a convolution
More properties
• Commutative: a * b = b * a
– Conceptually no difference between filter and signal
– But particular filtering implementations might break this equality
• Associative: a * (b * c) = (a * b) * c
– Often apply several filters one after another: (((a * b1) * b2) * b3)
– This is equivalent to applying one filter: a * (b1 * b2 * b3)
• Distributes over addition: a * (b + c) = (a * b) + (a * c)
• Scalars factor out: ka * b = a * kb = k (a * b)
• Identity: unit impulse e = [0, 0, 1, 0, 0],
a*e=a
Gaussian filter
• Weight contributions of neighboring pixels by nearness
5 × 5 Gaussian kernel, σ = 1:

  0.003  0.013  0.022  0.013  0.003
  0.013  0.059  0.097  0.059  0.013
  0.022  0.097  0.159  0.097  0.022
  0.013  0.059  0.097  0.059  0.013
  0.003  0.013  0.022  0.013  0.003
Smoothing with Gaussian filter
Smoothing with box filter
Gaussian filters
Separability of Gaussian filter
filter
Separability example
image
2D convolution
(center location only)
The filter factors
into a product of 1D
filters:
Perform convolution
along rows:
*
=
Followed by convolution
along the remaining column:
*
=
Separability
• Why is separability useful in practice?
• Filtering an M-by-N image with a P-by-Q kernel
requires roughly MNPQ multiples and adds.
• For a separable filter, it requires
– MNP multiples and adds for the first step
– MNQ multiples and adds for the second step
– MN(P+Q) multiples and adds
• For a 9-by-9 filter, a theoretical speed-up of 4.5
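A minimal MATLAB sketch of separability: filtering with two 1D passes gives the same result as the 2D kernel, up to floating-point error.

im  = double(imread('cameraman.tif'));
g1  = fspecial('gaussian', [9 1], 2);       % 1D Gaussian (column vector)
g2d = g1 * g1';                             % equivalent 2D kernel (outer product)
out2d = imfilter(im, g2d, 'replicate');     % direct 2D filtering: ~MNPQ operations
out1d = imfilter(imfilter(im, g1, 'replicate'), g1', 'replicate'); % rows then columns: ~MN(P+Q)
fprintf('max difference: %g\n', max(abs(out2d(:) - out1d(:))));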
Practical matters
How big should the filter be?
• Values at edges should be near zero
• Rule of thumb for Gaussian: set filter half-width to
about 2.4 (or 3) σ
Practical matters
• What about near the image boundary?
– the filter window falls off the extent of the image
– need to extrapolate
– methods:
• clip filter (black)
• wrap around
• copy edge
• reflect across edge
Q?
Practical matters
– methods (MATLAB):
• clip filter (black):
• wrap around:
• copy edge:
• reflect across edge:
imfilter(f, g, 0)
imfilter(f, g, ‘circular’)
imfilter(f, g, ‘replicate’)
imfilter(f, g, ‘symmetric’)
Practical matters
• What is the size of the output?
• MATLAB: filter2(g, f, shape) g: filter, f:image
– shape = ‘full’: output size is sum of sizes of f and g
– shape = ‘same’: output size is same as f (default setting)
– shape = ‘valid’: output size is difference of sizes of f and g
(figure: the filter g positioned against the image f for the 'full', 'same', and 'valid' output sizes)
Median filters
• A Median Filter operates over a window by
selecting the median intensity in the window.
• What advantage does a median filter have over
a mean filter?
• Is a median filter a kind of convolution?
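A minimal MATLAB sketch comparing a 3x3 median filter with a 3x3 mean (box) filter on salt-and-pepper noise; the noise level here is an assumption:

I = im2double(imread('cameraman.tif'));
J = imnoise(I, 'salt & pepper', 0.05);       % corrupt ~5% of the pixels
med = medfilt2(J, [3 3]);                    % nonlinear: median of each window
avg = imfilter(J, fspecial('average', 3));   % linear: mean of each window
figure;
subplot(1,3,1), imshow(J),   title('Noisy');
subplot(1,3,2), imshow(med), title('Median 3x3');
subplot(1,3,3), imshow(avg), title('Mean 3x3');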
Comparison: salt and pepper noise
Sharpening revisited
• What does blurring take away?

  original − smoothed (5×5) = detail

Let's add it back:

  original + α · detail = sharpened
Take-home messages
• Linear filtering is sum of dot
product at each position
– Can smooth, sharpen, translate
(among many other uses)
• Be aware of details for filter size,
extrapolation, cropping
  1 1 1
  1 1 1
  1 1 1
Practice questions
1. Write down a 3x3 filter that returns a positive value if the average value of the 4-adjacent neighbors is less than the center, and a negative value otherwise
2. Write down a filter that will compute the
gradient in the x-direction:
gradx(y,x) = im(y,x+1)-im(y,x) for each x, y
CSE 185
Introduction to Computer Vision
Image Filtering:
Frequency Domain
Image filtering
• Fourier transform and frequency domain
– Frequency view of filtering
– Hybrid images
– Sampling
• Reading: Chapter 3
• Some slides from James Hays, David Hoeim,
Steve Seitz, Richard Szeliski, …
Gaussian filter
Gaussian
Box filter
Hybrid images
• Filter one image with low-pass filter and the other with high-pass
filter, and then superimpose them
• A. Oliva, A. Torralba, P.G. Schyns, “Hybrid Images,” SIGGRAPH 2006
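A minimal MATLAB sketch of the idea, assuming two aligned grayscale images of the same size (the file names are hypothetical); sigma sets the cutoff between the two frequency bands:

im1 = im2double(imread('face1.png'));        % kept at low frequencies
im2 = im2double(imread('face2.png'));        % kept at high frequencies
sigma = 6;
g = fspecial('gaussian', 6*sigma+1, sigma);
low1  = imfilter(im1, g, 'replicate');       % low-pass of image 1
high2 = im2 - imfilter(im2, g, 'replicate'); % high-pass of image 2
hybrid = low1 + high2;                       % superimpose the two bands
figure, imshow(hybrid, []);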
At a distance
• What do you see?
Close up
• What do you see?
High/low frequency components
• What are stable components?
• What describes image details?
• Low-frequency components
– Stable edges
• High-frequency components
– Texture details
• Decompose image into high and low
frequency components
Why do we get different, distance-dependent
interpretations of hybrid images?
?
Why does a lower resolution image still make
sense to us? What do we lose?
Compression
How is it that a 4MP image can be compressed
to a few hundred KB without a noticeable
change?
Thinking in terms of frequency
• Convert an image into a frequency domain
• Analyze the filter responses at different scale,
orientation, and number of occurrence over
time
• Represent an image with basis filters
Jean Baptiste Joseph Fourier
had a crazy idea (1807):
Any univariate function can be
rewritten as a weighted sum of
sines and cosines of different
frequencies.
...the manner in which the author arrives at
these equations is not exempt of difficulties
and...his analysis to integrate them still leaves
something to be desired on the score of
generality and even rigour.
• Don’t believe it?
– Neither did Lagrange,
Laplace, Poisson and other
big wigs
– Not translated into English
until 1878!
• But it’s (mostly) true!
– called Fourier Series
– there are some subtle
restrictions
• Fun with math genealogy
Legendre
Laplace
Lagrange
A sum of sines
Our building block:  A sin(ωx + φ)
Add enough of them to get any signal g(x) you want!
A: amplitude,  ω: frequency,  φ: phase shift
f(target) = f0 + f1 + f2 + … + fn + …
Frequency spectra
• example : g(t) = sin(2πf t) + (1/3)sin(2π(3f) t)
(figures: the signal and its frequency spectrum as successive sinusoidal components are added)

  = A Σ_{k=1}^{∞} (1/k) sin(2π k t)
Example: Music
• We think of music in terms of frequencies at
different magnitudes
Fourier analysis in images
Intensity Image
Fourier Image
Signals can be composed
+
=
More: http://www.cs.unm.edu/~brayer/vision/fourier.html
Fourier transform
• Fourier transform stores the magnitude and phase at each
frequency
– Magnitude encodes how much signal there is at a particular frequency
– Phase encodes spatial information (indirectly)
– For mathematical convenience, this is often notated in terms of real
and complex numbers
Euler’s formula
  e^{jφ} = cos φ + j sin φ
  A e^{jφ} = A cos φ + j A sin φ

Amplitude:  A = ±√( R(ω)² + I(ω)² )
Phase:      φ = tan⁻¹( I(ω) / R(ω) )
Computing Fourier transform

  e^{jφ} = cos φ + j sin φ
  A = ±√( R(ω)² + I(ω)² ),   φ = tan⁻¹( I(ω) / R(ω) )

Continuous and discrete versions (discrete: k = −N/2 .. N/2)
Fast Fourier Transform (FFT): O(N log N)
Phase and magnitude
• Fourier transform of a real
function is complex
– difficult to plot
– visualize instead, we can
think of the phase and
magnitude of the transform
• Each frequency response is
described by phase and
magnitude
• Phase: angle of complex
transform
• Magnitude: amplitude of
complex transform
• Curious fact
– all natural images have about
the same magnitude
transform
– hence, phase seems to
matter, but magnitude largely
doesn’t
• Demonstration
– take two pictures, swap the
phase transforms, compute
the inverse - what does the
result look like?
Phase and magnitude
This is the
magnitude
transform
of the
cheetah
pic
This is the
phase
transform
of the
cheetah
pic
This is the
magnitude
transform
of the
zebra pic
This is
the
phase
transform
of the
zebra pic
Reconstruction with zebra phase, cheetah magnitude
Reconstruction with cheetah phase, zebra magnitude
The convolution theorem
• The Fourier transform of the convolution of two
functions is the product of their Fourier transforms
  F[g ∗ h] = F[g] F[h]

• Convolution in the spatial domain is equivalent to multiplication in the frequency domain!

  g ∗ h = F⁻¹[ F[g] F[h] ]
• Used in Fast Fourier Transform
– Ten computer codes that transformed science, Nature, Jan 20 2021
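A minimal MATLAB sketch that checks the theorem numerically: direct convolution matches pointwise multiplication of zero-padded FFTs.

g = rand(32, 32);  h = rand(7, 7);
sz = size(g) + size(h) - 1;                  % size of the full convolution
direct  = conv2(g, h, 'full');
via_fft = real(ifft2(fft2(g, sz(1), sz(2)) .* fft2(h, sz(1), sz(2))));
fprintf('max difference: %g\n', max(abs(direct(:) - via_fft(:))));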
Properties of Fourier transform
• Linearity
  F[a x(t) + b y(t)] = a F[x(t)] + b F[y(t)]
• Fourier transform of a real signal is symmetric
about the origin
• The energy of the signal is the same as the
energy of its Fourier transform
See Szeliski Book (3.4)
2D FFT
• Fourier transform (discrete case)

  F(u, v) = (1/MN) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) e^{−j2π(ux/M + vy/N)}
  for u = 0, 1, 2, ..., M−1,  v = 0, 1, 2, ..., N−1

• Inverse Fourier transform:

  f(x, y) = Σ_{u=0}^{M−1} Σ_{v=0}^{N−1} F(u, v) e^{j2π(ux/M + vy/N)}
  for x = 0, 1, 2, ..., M−1,  y = 0, 1, 2, ..., N−1
• u, v : the transform or frequency variables
• x, y : the spatial or image variables
Euler’s formula
Fourier bases
in Matlab, check out: imagesc(log(abs(fftshift(fft2(im)))))
2D FFT
Sinusoid with frequency = 1 and its FFT
2D FFT
Sinusoid with frequency = 3 and its FFT
2D FFT
Sinusoid with frequency = 5 and its FFT
2D FFT
Sinusoid with frequency = 10 and its FFT
2D FFT
Sinusoid with frequency = 15 and its FFT
2D FFT
Sinusoid with varying frequency and their FFT
Rotation
Sinusoid rotated at 30 degrees and its FFT
2D FFT
Sinusoid rotated at 60 degrees and its FFT
2D FFT
Image analysis with FFT
Image analysis with FFT
http://www.cs.unm.edu/~brayer/vision/fourier.html
Image analysis with FFT
Filtering in spatial domain

  image  *  [ 1  0 -1         =  filtered (edge) image
              2  0 -2
              1  0 -1 ]
Filtering in frequency domain
FFT
FFT
=
Inverse FFT
FFT in Matlab
• Filtering with fft
im = double(imread('cameraman.tif'))/255;
[imh, imw] = size(im);
hs = 50;                                 % filter half-size
fil = fspecial('gaussian', hs*2+1, 10);
fftsize = 1024;                          % should be order of 2 (for speed) and include padding
im_fft  = fft2(im, fftsize, fftsize);                        % 1) fft im with padding
fil_fft = fft2(fil, fftsize, fftsize);                       % 2) fft fil, pad to same size as image
im_fil_fft = im_fft .* fil_fft;                              % 3) multiply fft images
im_fil = ifft2(im_fil_fft);                                  % 4) inverse fft2
im_fil = im_fil(1+hs:size(im,1)+hs, 1+hs:size(im,2)+hs);     % 5) remove padding
figure, imshow(im);
figure, imshow(im_fil);

• Displaying with fft
figure(1), imagesc(log(abs(fftshift(im_fft)))), axis image, colormap jet
Questions
Which has more information, the phase or the
magnitude?
What happens if you take the phase from one
image and combine it with the magnitude
from another image?
Filtering
Why does the Gaussian give a nice smooth image, but the square filter give
edgy artifacts?
Gaussian
Box filter
Gaussian filter
Gaussian
Box filter
Box Filter
Why do we get different, distance-dependent
interpretations of hybrid images?
?
Salvador Dali invented Hybrid Images?
Salvador Dali
“Gala Contemplating the Mediterranean Sea,
which at 30 meters becomes the portrait
of Abraham Lincoln”, 1976
Fourier bases
Teases away fast vs. slow changes in the image.
This change of basis is the Fourier Transform
Fourier bases
in Matlab, check out: imagesc(log(abs(fftshift(fft2(im)))))
Hybrid image in FFT
Hybrid Image
Low-passed Image
High-passed Image
Application: Hybrid images
• Combine low-frequency of one image with high-frequency of
another one
• Sad faces when viewed up close
• Happy faces when viewed from a few meters away
• A. Oliva, A. Torralba, P.G. Schyns, “Hybrid Images,” SIGGRAPH
2006
Sampling
Why does a lower resolution image still make
sense to us? What do we lose?
Why does a lower resolution image still make
sense to us? What do we lose?
Subsampling by a factor of 2
Throw away every other row and
column to create a 1/2 size image
Aliasing problem
• 1D example (sinewave):
Aliasing problem
• 1D example (sinewave):
Aliasing problem
• Sub-sampling may be dangerous….
• Characteristic errors may appear:
– “Wagon wheels rolling the wrong way in
movies”
– “Checkerboards disintegrate in ray tracing”
– “Striped shirts look funny on color television”
Aliasing in video
Aliasing in graphics
Resample the
checkerboard by taking
one sample at each
circle. In the case of the
top left board, new
representation is
reasonable.
Top right also yields a
reasonable
representation.
Bottom left is all black
(dubious) and bottom
right has checks that are
too big.
Sampling scheme is
crucially related to
frequency
Constructing a pyramid by
taking every second pixel
leads to layers that badly
misrepresent the top layer
Nyquist-Shannon sampling theorem
• When sampling a signal at discrete intervals, the sampling frequency must be ≥ 2 × fmax
• fmax = max frequency of the input signal
• This allows the original to be reconstructed from the sampled version without aliasing
(figure: spectra of sampled signals; "good" when the spectral copies do not overlap, "bad" when they do)
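A minimal MATLAB sketch of aliasing: a 9 Hz sine sampled at 12 Hz (below the 18 Hz Nyquist rate) produces samples indistinguishable from a 3 Hz sinusoid.

f0 = 9;  fs = 12;                            % signal and sampling frequencies (Hz)
t_fine = 0:1e-3:1;                           % dense grid for the continuous signal
t_samp = 0:1/fs:1;                           % sampling instants
figure; hold on;
plot(t_fine, sin(2*pi*f0*t_fine), 'b-');     % true 9 Hz signal
plot(t_samp, sin(2*pi*f0*t_samp), 'ro');     % its samples at 12 Hz
plot(t_fine, sin(2*pi*(f0-fs)*t_fine), 'g--'); % 3 Hz alias passes through the samples
legend('9 Hz signal', 'samples at 12 Hz', '3 Hz alias');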
Anti-aliasing
Solutions:
• Sample more often
• Get rid of all frequencies that are greater
than half the new sampling frequency
– Will lose information
– But it’s better than aliasing
– Apply a smoothing filter
Algorithm for downsampling by
factor of 2
1. Start with image(h, w)
2. Apply low-pass filter
im_blur = imfilter(image, fspecial(‘gaussian’, 7, 1))
3. Sample every other pixel
im_small = im_blur(1:2:end, 1:2:end);
Sampling without smoothing. Top row shows the images, sampled at every second pixel
to get the next; bottom row shows the magnitude spectrum of these images.
substantial aliasing
Magnitude of the Fourier transform of each image displayed as a log scale (constant component is at the center).
Fourier transform of a resampled image is obtained by scaling the Fourier transform
of the original image and then tiling the plane
Sampling with smoothing (small σ). Top row shows the images. We get the next image by
smoothing the image with a Gaussian with σ=1 pixel, then sampling at every second pixel to
get the next; bottom row shows the magnitude spectrum of these images.
reducing aliasing
Low pass filter suppresses high frequency components with less aliasing
Sampling with smoothing (large σ). Top row shows the images. We get the next image by
smoothing the image with a Gaussian with σ=2 pixels, then sampling at every second pixel to
get the next; bottom row shows the magnitude spectrum of these images.
lose details
Large σ ➔ less aliasing, but with little detail
Gaussian is not an ideal low-pass filter
Subsampling without pre-filtering
1/2
1/4
(2x zoom)
1/8
(4x zoom)
Subsampling with Gaussian prefiltering
Gaussian 1/2
G 1/4
G 1/8
Visual perception
• Early processing in humans filters for various orientations and scales of
frequency
• Perceptual cues in the mid-high frequencies dominate perception
• When we see an image from far away, we are effectively subsampling it
Early Visual Processing: Multi-scale edge and blob filters
Campbell-Robson contrast
sensitivity curve
Detect slight change in shades of gray before they become indistinguishable
Contrast sensitivity is better for mid-range spatial frequencies
Things to remember
• Sometimes it makes sense to think of
images and filtering in the frequency
domain
– Fourier analysis
• Can be faster to filter using FFT for
large images (N logN vs. N2 for autocorrelation)
• Images are mostly smooth
– Basis for compression
• Remember to low-pass before sampling
Edge orientation
Fourier bases
Teases away fast vs. slow changes in the image.
This change of basis is the Fourier Transform
Fourier bases
in Matlab, check out: imagesc(log(abs(fftshift(fft2(im)))))
Edge orientation
Man-made scene
Can change spectrum, then
reconstruct
Low and high pass filtering
Sinc filter
• What is the spatial representation of the hard
cutoff in the frequency domain?
Frequency Domain
Spatial Domain
Review / Practice question
1. Match the spatial domain image (1–5) to the Fourier magnitude image (A–E)
Answer: 1 – D, 2 – B, 3 – A, 4 – E, 5 – C
CSE 185
Introduction to Computer Vision
Image Filtering:
Templates, Image Pyramids, and Filter Banks
Image filtering
• Template matching
• Image Pyramids
• Filter banks and texture
Template matching
• Goal: find
in image
• Main challenge: What is a
good similarity or distance
measure between two
patches?
  – Correlation
  – Zero-mean correlation
  – Sum of Squared Differences (SSD)
  – Normalized cross-correlation
Matching with filters
• Goal: find
in image
• Method 0: filter the image with eye patch
  h[m, n] = Σ_{k,l} g[k, l] f[m+k, n+l]
f = image
g = filter
What went wrong?
Problem: response is stronger
for higher intensity
Input
Filtered Image
Matching with filters
• Goal: find
in image
• Method 1: filter the image with zero-mean eye
  h[m, n] = Σ_{k,l} (f[k, l] − f̄) g[m+k, n+l]
True detections
Problem: response is sensitive to gain/contrast: pixels in filter
that are near the mean have little effect (does not require pixel
values in image to be near or proportional to values in filter)
Input
Filtered Image (scaled)
False
detections
Thresholded Image
Matching with filters
• Goal: find
in image
• Method 2: SSD
  h[m, n] = Σ_{k,l} ( g[k, l] − f[m+k, n+l] )²
True detections
Problem: SSD sensitive to average intensity
Input
sqrt(SSD)
Thresholded Image
Matching with filters
What’s the potential
downside of SSD?
• Goal: find
in image
SSD sensitive to average intensity
• Method 2: SSD
  h[m, n] = Σ_{k,l} ( g[k, l] − f[m+k, n+l] )²
Input
1- sqrt(SSD)
Matching with filters
• Goal: find
in image
• Method 3: Normalized cross-correlation
Invariant to mean and scale of intensity
(ḡ: mean of the template;  f̄_{m,n}: mean of the image patch under the template)

  h[m, n] =  Σ_{k,l} (g[k, l] − ḡ)(f[m−k, n−l] − f̄_{m,n})
             / ( Σ_{k,l} (g[k, l] − ḡ)²  ·  Σ_{k,l} (f[m−k, n−l] − f̄_{m,n})² )^0.5

Dot product of two normalized vectors
Matlab: normxcorr2(template, im)
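A minimal MATLAB sketch of normxcorr2-based matching; here the "template" is just a patch cut from the image itself, so the peak should land back at that patch:

im  = im2double(imread('cameraman.tif'));
tpl = im(120:150, 100:140);                  % hypothetical eye-sized patch
ncc = normxcorr2(tpl, im);                   % normalized cross-correlation map
[~, idx] = max(ncc(:));
[r, c] = ind2sub(size(ncc), idx);
row = r - size(tpl,1) + 1;                   % convert peak to top-left corner in im
col = c - size(tpl,2) + 1;
fprintf('best match at (%d, %d)\n', row, col);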
Matching with filters
• Goal: find
in image
• Method 3: Normalized cross-correlation
True detections
Input
Normalized X-Correlation
Thresholded Image
What is the best method to use?
A: Depends
• SSD: faster, sensitive to overall intensity
• Normalized cross-correlation: slower,
invariant to local average intensity and
contrast
• But really, neither of these baselines are
representative of modern recognition
How to find larger or smaller eyes?
A: Image Pyramid
Review of sampling
Gaussian
Filter
Image
Low-Pass
Filtered Image
Sample
Low-Res
Image
Gaussian pyramid
Source: Forsyth
Template matching with image
pyramids
Input: Image, Template
1. Match template at current scale
2. Downsample image
3. Repeat 1-2 until image is very small
4. Take responses above some threshold, perhaps
with non-maxima suppression
Coarse-to-fine image registration
1. Compute Gaussian pyramid
2. Align with coarse pyramid
3. Successively align with finer
pyramids
–
Search smaller range
Why is this faster?
Are we guaranteed to get the same
result?
2D edge detection filters
Laplacian of Gaussian
Gaussian
derivative of Gaussian
∇² is the Laplacian operator:
Laplacian filter
unit impulse
Gaussian
Laplacian of Gaussian
Gaussian/Laplacian pyramid
Can we reconstruct the original from
the Laplacian pyramid?
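Yes: a minimal MATLAB sketch, assuming impyramid for the reduce/expand steps and imresize to keep the sizes aligned, reconstructs the image from its Laplacian pyramid with essentially zero error.

im = im2double(imread('cameraman.tif'));
G = {im};
for k = 2:4
    G{k} = impyramid(G{k-1}, 'reduce');            % Gaussian pyramid
end
L = cell(1, 3);
for k = 1:3
    up   = imresize(impyramid(G{k+1}, 'expand'), size(G{k}));
    L{k} = G{k} - up;                              % band-pass residual
end
rec = G{4};                                        % start from the coarsest level
for k = 3:-1:1
    rec = imresize(impyramid(rec, 'expand'), size(G{k})) + L{k};
end
fprintf('max reconstruction error: %g\n', max(abs(rec(:) - im(:))));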
Laplacian pyramid
Visual representation
Hybrid Image
Hybrid Image in Laplacian pyramid
High frequency → Low frequency
Image representation
• Pixels: great for spatial resolution, poor access to
frequency
• Fourier transform: great for frequency, not for spatial
info
• Pyramids/filter banks: balance between spatial and
frequency information
Main uses of image pyramids
• Compression
• Object detection
– Scale search
– Features
• Detecting stable interest points
• Registration
Coarse-to-fine
Application: Representing texture
Source: Forsyth
Texture and material
Texture and orientation
Texture and scale
What is texture?
Regular or stochastic patterns caused by
bumps, grooves, and/or markings
How can we represent texture?
• Compute responses of blobs and edges at
various orientations and scales
Overcomplete representation
Leung-Malik Filter Bank
• First and second order derivatives of Gaussians at 6 orientations and 3 scales
• 8 Laplacian of Gaussian and 4 Gaussian filters
Code for filter banks: www.robots.ox.ac.uk/~vgg/research/texclass/filters.html
Filter banks
• Process image with each filter and keep
responses (or squared/abs responses)
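A minimal MATLAB sketch of this idea with a tiny hand-picked bank (not the full Leung-Malik set), keeping the mean absolute response of each filter as a texture statistic:

im = im2double(imread('cameraman.tif'));
bank = {fspecial('log', 15, 2), ...          % Laplacian of Gaussian (blob)
        fspecial('sobel'), ...               % horizontal-edge filter
        fspecial('sobel')', ...              % vertical-edge filter
        fspecial('gaussian', 15, 3)};        % low-pass
feat = zeros(1, numel(bank));
for k = 1:numel(bank)
    resp    = imfilter(im, bank{k}, 'replicate');
    feat(k) = mean(abs(resp(:)));            % simple texture statistic
end
disp(feat);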
How can we represent texture?
• Measure responses of blobs and edges at
various orientations and scales
• Idea 1: Record simple statistics (e.g., mean,
standard deviation) of absolute filter
responses
Match the texture to the response?
(figure: filters A, B, C and textures 1, 2, 3, with the mean absolute response of each filter on each texture)
Representing texture by mean abs
response
Filters
Mean abs responses
Representing texture
• Idea 2: take vectors of filter responses at each pixel and
cluster them, then take histograms (more on in later weeks)
Compression
How is it that a 4MP image can be compressed
to a few hundred KB without a noticeable
change?
Lossy image compression (JPEG)
64 basis functions
DFT: complex values
DCT: real values
Block-based Discrete Cosine Transform (DCT)
See https://www.mathworks.com/help/images/discrete-cosine-transform.html
Slides: Efros
Using DCT in JPEG
• The first coefficient B(0,0) is the DC
component, the average intensity
• The top-left coeffs represent low frequencies,
the bottom right – high frequencies
Image compression using DCT
• Quantize
– More coarsely for high frequencies (which also tend to have smaller values)
– Many quantized high frequency values will be zero
• Encode
– Can decode with inverse dct
Filter responses
Quantization table
Quantized values
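A minimal MATLAB sketch of the block-DCT step on a single 8x8 block; the quantization table here is a made-up one that simply grows with frequency, not the actual JPEG table:

im    = double(imread('cameraman.tif')) - 128;   % center values around zero
block = im(81:88, 81:88);                        % one 8x8 block
C     = dct2(block);                             % DCT coefficients
[u, v] = meshgrid(0:7, 0:7);
Q     = 16 + 4*(u + v);                          % coarser steps at higher frequencies
Cq    = round(C ./ Q);                           % quantize: many zeros appear
rec   = idct2(Cq .* Q) + 128;                    % decode with the inverse DCT
fprintf('%d of 64 quantized coefficients are zero\n', nnz(Cq == 0));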
JPEG compression summary
1. Convert image to YCrCb
2. Subsample color by a factor of 2
   – People have bad resolution for color
3. Split into blocks (8x8, typically), subtract 128
4. For each block
   a. Compute DCT coefficients
   b. Coarsely quantize
      • Many high frequency components will become zero
   c. Encode (e.g., with Huffman coding)
http://en.wikipedia.org/wiki/YCbCr
http://en.wikipedia.org/wiki/JPEG
Lossless compression (PNG)
1. Predict a pixel's value based on its upper-left neighborhood
2. Store the difference between the predicted and actual value
3. Pkzip it (DEFLATE algorithm)
Denoising
Gaussian
Filter
Additive Gaussian Noise
Reducing Gaussian noise
Smoothing with larger standard deviations suppresses noise, but also blurs the
image
Reducing salt-and-pepper noise by
Gaussian smoothing
3x3
5x5
7x7
Alternative idea: Median filtering
• A median filter operates over a window by
selecting the median intensity in the window
• Is median filtering linear?
Median filter
• What advantage does median filtering have over
Gaussian filtering?
– Robustness to outliers
Median filter
Salt-and-pepper noise
Median filtered
• MATLAB: medfilt2(image, [h w])
Median vs. Gaussian filtering
3x3
Gaussian
Median
5x5
7x7
Other non-linear filters
• Weighted median (pixels further from center count less)
• Clipped mean (average, ignoring few brightest and darkest
pixels)
• Bilateral filtering (weight by spatial distance and intensity
difference)
Bilateral filtering
Review: Image filtering

  h[m, n] = Σ_{k,l} f[k, l] g[m+k, n+l]

(The review slides repeat the earlier walkthrough: a 3×3 filter of ones slid across the 10×10 test image of 0s and 90s, producing outputs 10, 20, ... at successive positions. Credit: S. Seitz)
Filtering in spatial domain

  image  *  [ 1  0 -1         =  filtered (edge) image
              2  0 -2
              1  0 -1 ]
Filtering in frequency domain
FFT
FFT
=
Inverse FFT
Review of image filtering
• Filtering in frequency domain
– Can be faster than filtering in spatial domain (for
large filters)
– Can help understand effect of filter
– Algorithm:
1. Convert image and filter to fft (fft2 in matlab)
2. Pointwise-multiply ffts
3. Convert result to spatial domain with ifft2
Review of image filtering
• Linear filters for basic processing
– Edge filter (high-pass)
– Gaussian filter (low-pass)
[-1 1]
Gaussian
FFT of Gradient Filter
FFT of Gaussian
Review of image filtering
• Derivative of Gaussian
Review of image filtering
• Applications of filters
– Template matching (SSD or Normxcorr2)
• SSD can be done with linear filters, is sensitive to overall
intensity
– Gaussian pyramid
• Coarse-to-fine search, multi-scale detection
– Laplacian pyramid
• Teases apart different frequency bands while keeping spatial
information
• Can be used for compositing in graphics
– Downsampling
• Need to sufficiently low-pass before downsampling
EECS 274 Computer Vision
Linear Filters and Edges
Linear filters and edges
• Linear filters
• Scale space
• Gaussian pyramid and wavelets
• Edges
• Reading: Chapters 7 and 8 of FP, Chapter 3 of S
Linear filters
• General process:
  – Form a new image whose pixels are a weighted sum of original pixel values, using the same set of weights at each point
• Properties
  – Output is a linear function of the input
  – Output is a shift-invariant function of the input (i.e., shift the input image two pixels to the left, and the output is shifted two pixels to the left)
• Example: smoothing by averaging
  – form the average of pixels in a neighbourhood

    R_ij = (1/(2k+1)²) Σ_{u=i−k}^{i+k} Σ_{v=j−k}^{j+k} F_uv

• Example: smoothing with a Gaussian
  – form a weighted average of pixels in a neighbourhood
• Example: finding a derivative
  – form a weighted average of pixels in a neighbourhood
Convolution
• Represent these weights as an
image, H
• H is usually called the kernel
• Operation is called convolution
– it’s associative
• Notation in textbook:

    R_ij = Σ_{u,v} H_{i−u, j−v} F_{u,v}        R = H ∗ F

    g(i, j) = Σ_{k,l} f(i+k, j+l) h(k, l)      g = f ⊗ h
    g(i, j) = Σ_{k,l} f(i−k, j−l) h(k, l)      g = f ∗ h

• Notice the weird order of indices
  – all examples can be put in this form
  – it's a result of the derivation expressing any shift-invariant linear operator as a convolution
Example: smoothing by averaging
Smoothing with a Gaussian
• Smoothing with an average
actually doesn’t compare at all
well with a defocussed lens
– Most obvious difference is that a
single point of light viewed in a
defocussed lens looks like a fuzzy
blob; but the averaging process
would give a little square.
• A Gaussian gives a good model of
a fuzzy blob
An isotropic Gaussian
• The picture shows a smoothing
kernel proportional to

    G(x, y) = (1 / (2π σ²)) exp( −(x² + y²) / (2σ²) )
(which is a reasonable model of a
circularly symmetric fuzzy blob)
Smoothing with a Gaussian
Differentiation and convolution
• Recall

    ∂f/∂x = lim_{ε→0} ( f(x+ε, y) − f(x, y) ) / ε

• Now this is linear and shift invariant, so it must be the result of a convolution

    H =  0  0  0
         1  0 −1
         0  0  0

• We could approximate this as

    ∂f/∂x ≈ ( f(x_{n+1}, y) − f(x_n, y) ) / Δx

  (which is obviously a convolution; it's not a very good way to do things, as we shall see)
Finite differences
Partial derivative in y axis,
respond strongly to
horizontal edges
Partial derivative in x axis,
respond strongly to
vertical edges
Spatial filter
• Approximation, using the 3×3 neighborhood

      z1 z2 z3
      z4 z5 z6
      z7 z8 z9

  with z5 at the center:

    ∇f = (∂f/∂x, ∂f/∂y)ᵀ,   |∇f| = [ (∂f/∂x)² + (∂f/∂y)² ]^{1/2}

    |∇f| ≈ [ (z5 − z8)² + (z5 − z6)² ]^{1/2}
    |∇f| ≈ |z5 − z8| + |z5 − z6|
    |∇f| ≈ [ (z5 − z9)² + (z6 − z8)² ]^{1/2}
    |∇f| ≈ |z5 − z9| + |z6 − z8|
Roberts operator
One of the earliest edge detection operators, by Lawrence Roberts
Sobel operator
One of the earliest edge detection operators, by Irwin Sobel
Noise
• Simplest noise model
– independent stationary additive
Gaussian noise
– the noise value at each pixel is
given by an independent draw
from the same normal
probability distribution
• Issues
– this model allows noise values
that could be greater than
maximum camera output or less
than zero
– for small standard deviations,
this isn’t too much of a problem it’s a fairly good model
– independence may not be
justified (e.g. damage to lens)
– may not be stationary (e.g.
thermal gradients in the ccd)
sigma=1
sigma=16
Finite differences and noise
• Finite difference filters respond
strongly to noise
– obvious reason: image noise
results in pixels that look very
different from their neighbours
• Generally, the larger the noise
the stronger the response
• What is to be done?
– intuitively, most pixels in images
look quite a lot like their
neighbours
– this is true even at an edge;
along the edge they’re similar,
across the edge they’re not
– suggests that smoothing the
image should help, by forcing
pixels different to their
neighbours (=noise pixels?) to
look more like neighbours
Finite differences responding to
noise
σ=0.03
σ=0.09
Increasing noise -> (this is zero mean additive Gaussian noise)
Difference operation is strongly influenced by noise (the image is
increasingly grainy)
The response of a linear filter to
noise
• Do only stationary independent
additive Gaussian noise with zero
mean (non-zero mean is easily
dealt with)
• Mean:
– output is a weighted sum of
inputs
– so we want mean of a weighted
sum of zero mean normal
random variables
– must be zero
• Variance:
– recall
• variance of a sum of random
variables is sum of their
variances
• variance of constant times
random variable is
constant^2 times variance
– then if σ² is the noise variance and the kernel is K, the variance of the response is

    σ² Σ_{u,v} K²_{u,v}
Filter responses are correlated
• Over scales similar to the scale of the filter
• Filtered noise is sometimes useful
– looks like some natural textures, can be used to
simulate fire, etc.
Smoothed noise
Smoothing stationary additive Gaussian noise results in signals where pixel values tend to be increasingly similar to the values of neighboring pixels (as the filter kernel causes correlation)
Smoothing reduces noise
• Generally expect pixels to “be
like” their neighbours
– surfaces turn slowly
– relatively few reflectance
changes
• Generally expect noise processes
to be independent from pixel to
pixel
• Implies that smoothing
suppresses noise, for appropriate
noise models
• Scale
– the parameter in the symmetric
Gaussian
– as this parameter goes up, more
pixels are involved in the average
– and the image gets more blurred
– and noise is more effectively
suppressed
The effects of smoothing
Each row shows smoothing
with gaussians of different
width; each column shows
different realisations of
an image of gaussian noise.
Gradients and edges
• Points of sharp change in an
image are interesting:
–
–
–
–
change in reflectance
change in object
change in illumination
noise
• Sometimes called edge points
• General strategy
– determine image gradient
– now mark points where gradient
magnitude is particularly large
wrt neighbours (ideally, curves of
such points).
In one dimension, the 2nd derivative of a signal is zero when the
derivative magnitude is extremal → a good place to look for edge
is where the second derivative is zero.
Smoothing and differentiation
• Issue: noise
– smooth before differentiation
– two convolutions to smooth, then differentiate?
– actually, no - we can use a derivative of Gaussian filter
• because differentiation is convolution, and convolution is
associative
1 pixel
3 pixels
7 pixels
The scale of the smoothing filter affects derivative estimates, and also
the semantics of the edges recovered.
Image Gradient
• Gradient equation:
• Represents direction of most rapid change in intensity
• Gradient direction:
• The edge strength is given by the gradient magnitude
Theory of Edge Detection
Ideal edge
A line in the image plane separates two regions B1 and B2:

  L(x, y) = x sin θ − y cos θ + ρ = 0
  B1: L(x, y) < 0,   B2: L(x, y) > 0

Unit step function:

  u(t) = 1 for t > 0,  1/2 for t = 0,  0 for t < 0
  u(t) = ∫_{−∞}^{t} δ(s) ds

Image intensity (brightness):

  I(x, y) = B1 + (B2 − B1) u(x sin θ − y cos θ + ρ)
Theory of Edge Detection
• Image intensity (brightness):

  I(x, y) = B1 + (B2 − B1) u(x sin θ − y cos θ + ρ)

• Partial derivatives (gradients):

  ∂I/∂x = + sin θ (B2 − B1) δ(x sin θ − y cos θ + ρ)
  ∂I/∂y = − cos θ (B2 − B1) δ(x sin θ − y cos θ + ρ)

• Squared gradient:

  s(x, y) = (∂I/∂x)² + (∂I/∂y)² = (B2 − B1)² δ(x sin θ − y cos θ + ρ)²

  Edge magnitude:   √( s(x, y) )
  Edge orientation: arctan( (∂I/∂y) / (∂I/∂x) )   (normal of the edge)

Rotationally symmetric, non-linear operator
Theory of Edge Detection
• Image intensity (brightness):

  I(x, y) = B1 + (B2 − B1) u(x sin θ − y cos θ + ρ)

• Partial derivatives (gradients):

  ∂I/∂x = + sin θ (B2 − B1) δ(x sin θ − y cos θ + ρ)
  ∂I/∂y = − cos θ (B2 − B1) δ(x sin θ − y cos θ + ρ)

• Laplacian:

  ∇²I = ∂²I/∂x² + ∂²I/∂y² = (B2 − B1) δ'(x sin θ − y cos θ + ρ)

Rotationally symmetric, linear operator
(figure: profiles of ∂I/∂x and ∂²I/∂x² across the edge; the edge lies at the zero-crossing of the second derivative)
Discrete Edge Operators
• How can we differentiate a discrete image?
Finite difference approximations (over the 2×2 neighborhood I_{i,j}, I_{i+1,j}, I_{i,j+1}, I_{i+1,j+1}):

  ∂I/∂x ≈ (1/(2ε)) ( (I_{i+1,j+1} − I_{i,j+1}) + (I_{i+1,j} − I_{i,j}) )
  ∂I/∂y ≈ (1/(2ε)) ( (I_{i+1,j+1} − I_{i+1,j}) + (I_{i,j+1} − I_{i,j}) )

with the corresponding 2×2 convolution masks.
Discrete Edge Operators
• Second order partial derivatives (on the 3×3 neighborhood around I_{i,j}):

  ∂²I/∂x² ≈ (1/ε²) ( I_{i−1,j} − 2 I_{i,j} + I_{i+1,j} )
  ∂²I/∂y² ≈ (1/ε²) ( I_{i,j−1} − 2 I_{i,j} + I_{i,j+1} )

• Laplacian:

  ∇²I = ∂²I/∂x² + ∂²I/∂y²

  Convolution masks:

    ∇²I ≈ (1/ε²)  0  1  0        or   (1/(6ε²))  1   4   1
                  1 −4  1                        4  −20  4
                  0  1  0                        1   4   1     (more accurate)
Effects of Noise
• Consider a single row or column of the image
– Plotting intensity as a function of position gives a signal
Where is the edge??
Solution: Smooth First
Where is the edge?
Look for peaks in
Derivative Theorem of Convolution
…saves us one operation.
Laplacian of Gaussian (LoG)

  ∂²/∂x² (h ⋆ f) = (∂²h/∂x²) ⋆ f
Laplacian of Gaussian
Laplacian of Gaussian
operator
Where is the edge?
Zero-crossings of bottom graph !
2D Gaussian Edge Operators
Gaussian
Derivative of Gaussian (DoG)
Laplacian of Gaussian
Mexican Hat (Sombrero)
• ∇² is the Laplacian operator
threshold
sigma=4
scale
contrast=1
LOG zero crossings
sigma=2
contrast=4
We still have unfortunate behavior
at corners and trihedral areas
σ=1 pixel
σ=2 pixel
There are three major issues:
1) The gradient magnitude at different scales is different; which should
we choose?
2) The gradient magnitude is large along thick trail; how
do we identify the significant points?
3) How do we link the relevant points up into curves?
We wish to mark points along the curve where the magnitude is biggest.
We can do this by looking for a maximum along a slice normal to the curve
(non-maximum suppression). These points should form a curve. There are
then two algorithmic issues: at which point is the maximum, and where is the
next one?
Non-maximum Suppression
• Check if pixel is local maximum along gradient direction
– requires checking interpolated pixels p and r
Predicting
the next
edge point
Assume the marked point is an edge point. Then we construct
the tangent to the edge curve (which is normal to the gradient
at that point) and use this to predict the next points (here either
r or s).
Edge following: edge points occur along curve like chains
fine scale, high threshold σ=1 pixel
coarse scale, high threshold σ=4 pixels
coarse scale, low threshold σ=4 pixels
Canny Edge Operator
• Smooth image I with a 2D Gaussian:  G ⋆ I
• Find the local edge normal direction for each pixel:

    n = ∇(G ⋆ I) / |∇(G ⋆ I)|

• Compute edge magnitudes:  |∇(G ⋆ I)|
• Locate edges by finding zero-crossings along the edge normal directions (non-maximum suppression):

    ∂²(G ⋆ I) / ∂n² = 0
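A minimal MATLAB sketch: edge() with the 'canny' option implements these steps (Gaussian smoothing, gradient, non-maximum suppression, hysteresis thresholding); the two sigma values are assumptions.

I = imread('cameraman.tif');
E_fine   = edge(I, 'canny', [], 1);      % sigma = 1: fine-scale edges
E_coarse = edge(I, 'canny', [], 4);      % sigma = 4: large-scale edges
figure;
subplot(1,2,1), imshow(E_fine),   title('Canny, sigma = 1');
subplot(1,2,2), imshow(E_coarse), title('Canny, sigma = 4');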
Canny edge detector
Original image
magnitude of gradient
Canny edge detector
After non-maximum suppression
Canny edge detector
original
Canny with different σ
• The choice of σ depends on desired behavior
  – large σ detects large scale edges
  – small σ detects fine features
Difference of Gaussians (DoG)
• Laplacian of Gaussian can be approximated by the
difference between two different Gaussians
DoG Edge Detection
(a)
(b)
(b)-(a)
Unsharp Masking
  original − blurred = detail
  original + α · detail = sharpened    (increases details in bright areas)
Edge Thresholding
• Standard Thresholding:
• Can only select “strong” edges.
• Does not guarantee “continuity”.
• Hysteresis based Thresholding (use two thresholds)
Example: For “maybe” edges, decide on the edge if neighboring pixel is a strong edge.
Remaining issues
• Check that the maximum gradient value is sufficiently large
– drop-outs? use hysteresis
• use a high threshold to start edge curves and a low
threshold to continue them.
Notice
• Something nasty is happening at corners
• Scale affects contrast
• Edges aren’t bounding contours
Scale space
• Framework for multi-scale signal processing
• Define image structure in terms of scale with kernel
size, σ
• Find scale invariant operations
• See Scale-theory in computer vision by Tony
Lindeberg
Orientation representations
• The gradient magnitude is
affected by illumination changes
– but it’s direction isn’t
• We can describe image patches
by the swing of the gradient
orientation
• Important types:
– constant window
• small gradient mags
– edge window
• few large gradient mags in
one direction
– flow window
• many large gradient mags in
one direction
– corner window
• large gradient mags that
swing
Representing windows
Looking at variations
H = Σ_window (∇I)(∇I)ᵀ

  = Σ_window [ (∂G/∂x ⋆ I)(∂G/∂x ⋆ I)    (∂G/∂x ⋆ I)(∂G/∂y ⋆ I) ]
             [ (∂G/∂x ⋆ I)(∂G/∂y ⋆ I)    (∂G/∂y ⋆ I)(∂G/∂y ⋆ I) ]
•
Types
– constant
• small eigenvalues
– edge
• one medium, one small
– flow
• one large, one small
– corner
• two large eigenvalues
Plotting in ellipses to understand the matrix
(variation of gradient) of 3 × 3 window
(x, y)ᵀ H⁻¹ (x, y) = ε
Major and minor axes are along
the eigenvectors of H, and the extent
corresponds to the size of eigenvalues
Plotting in ellipses to understand the matrix
(variation of gradient) of 5 × 5 window
Corners
• Harris corner detector
• Moravec corner detector
• SIFT descriptors
Filters are templates
• Applying a filter at some point
can be seen as taking a dot product between the image and
some vector
• Filtering the image is a set of dot
products
convolution is equivalent to
taking the dot product of the filter
with an image patch
• Insight
R_ij = Σ_{u,v} H_{i−u, j−v} F_{u,v}
R_ij = Σ_{u,v} H_{−u, −v} F_{i+u, j+v}
– filters look like the effects they
are intended to find
– filters find effects they look like
derivative of Gaussian used as edge
detection
Normalized correlation
• Think of filters as a dot product
– now measure the angle
– i.e. normalized correlation output
is filter output, divided by root
sum of squares of values over
which filter lies
– cheap and efficient method for
finding patterns
• Tricks:
– ensure that filter has a zero
response to a constant region
(helps reduce response to
irrelevant background)
– subtract image average when
computing the normalizing
constant (i.e. subtract the image
mean in the neighbourhood)
– absolute value deals with
contrast reversal
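A brute-force sketch of zero-mean normalized correlation for template matching; the image, the cropped template location, and the scan loop are illustrative assumptions:

% Sketch: zero-mean normalized cross-correlation of a template against an image
I = im2double(imread('cameraman.tif'));
T = I(120:150, 120:150);                     % placeholder: crop a patch to use as the template
T = T - mean(T(:));                          % zero response to constant regions
[th, tw] = size(T);
[H, W]   = size(I);
score = -inf(H - th + 1, W - tw + 1);
for r = 1:H - th + 1
  for c = 1:W - tw + 1
    P = I(r:r+th-1, c:c+tw-1);
    P = P - mean(P(:));                      % subtract the image mean in the neighbourhood
    d = norm(P(:)) * norm(T(:));
    if d > 0
      score(r, c) = (P(:)' * T(:)) / d;      % cosine of the angle between patch and template
    end
  end
end
[~, best] = max(score(:));
[br, bc]  = ind2sub(size(score), best);      % top-left corner of the best match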
Positive responses
Zero mean image, -1:1 scale
Zero mean image, -max:max scale
Positive responses
Zero mean image, -1:1 scale
Zero mean image, -max:max scale
Figure from “Computer Vision for Interactive Computer Graphics,” W.Freeman et al, IEEE Computer Graphics and Applications,
1998 copyright 1998, IEEE
Anisotropic scaling
• Symmetric Gaussian smoothing tends to blur out edges rather aggressively
• Prefer an oriented smoothing operator that
smoothes
– aggressively perpendicular to the gradient
– little along the gradient
• Also known as edge preserving smoothing
• Formulated with diffusion equation
Diffusion equation for anisotropic filter
• PDE that describes fluctuations in a material undergoing diffusion

  ∂φ(r, t)/∂t = ∇ · ( D(φ, r) ∇φ(r, t) )

• For isotropic filter

  ∂φ/∂t = ∇ · ( c ∇φ ) = c ∇²φ = c ( ∂²φ/∂x² + ∂²φ/∂y² )   for isotropic cases
  φ(x, y, 0) = I(x, y) as initial condition

• For anisotropic filter

  ∂φ/∂t = ∇ · ( c(x, y, t) ∇φ ) = c(x, y, t) ∇²φ + ∇c(x, y, t) · ∇φ

  If c(x, y, t) = 1, just as before
  If c(x, y, t) = 0, no smoothing
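A rough Perona–Malik-style sketch of this update rule; the conductance function exp(−(∇φ/κ)²) and the parameters κ, λ, and the iteration count are assumptions, not values from the slides:

% Sketch: simple anisotropic (edge-preserving) diffusion
I = im2double(imread('cameraman.tif'));
kappa = 0.1;  lambda = 0.2;  niter = 20;     % assumed parameters
phi = I;
for it = 1:niter
  % differences to the four neighbors (replicated at the borders)
  dN = [phi(1,:); phi(1:end-1,:)] - phi;
  dS = [phi(2:end,:); phi(end,:)] - phi;
  dE = [phi(:,2:end), phi(:,end)] - phi;
  dW = [phi(:,1), phi(:,1:end-1)] - phi;
  % conductance c is near 1 in flat regions and small across strong edges
  cN = exp(-(dN/kappa).^2);  cS = exp(-(dS/kappa).^2);
  cE = exp(-(dE/kappa).^2);  cW = exp(-(dW/kappa).^2);
  phi = phi + lambda * (cN.*dN + cS.*dS + cE.*dE + cW.*dW);
end
figure; imshow([I, phi]);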
Anisotropic filtering
Edge preserving filter
• Bilateral filter
– Replace pixel’s value by a weighted average of its
neighbors in both space and intensity
One iteration
Multiple iterations
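A brute-force single-iteration bilateral filter sketch; the window size and the spatial/range sigmas are assumptions:

% Sketch: bilateral filter = weighted average over space AND intensity
I = im2double(imread('cameraman.tif'));
w = 5;  sigma_s = 3;  sigma_r = 0.1;          % assumed half-window and sigmas
[X, Y] = meshgrid(-w:w, -w:w);
Gs = exp(-(X.^2 + Y.^2) / (2*sigma_s^2));     % spatial weights (fixed)
Ipad = padarray(I, [w w], 'replicate');
out = zeros(size(I));
for r = 1:size(I,1)
  for c = 1:size(I,2)
    P  = Ipad(r:r+2*w, c:c+2*w);
    Gr = exp(-(P - I(r,c)).^2 / (2*sigma_r^2));  % range weights from intensity differences
    Wt = Gs .* Gr;
    out(r,c) = sum(Wt(:) .* P(:)) / sum(Wt(:));
  end
end
figure; imshow([I, out]);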
CSE 185
Introduction to Computer Vision
Edges
Edges
• Edges
• Scale space
• Reading: Chapter 4.2
Edge detection
• Goal: Identify sudden
changes (discontinuities)
in an image
– Intuitively, most semantic
and shape information from
the image can be encoded
in the edges
– More compact than pixels
• Ideal: artist’s line drawing
(but artist is also using
object-level knowledge)
Why do we care about edges?
• Extract information,
recognize objects
• Recover geometry and
viewpoint
– Measure distance
– Compute area
• See also single-view 3D
Vanishing
line
Vanishing
point
[Criminisi et al. ICCV 99]
Vertical vanishing
point
(at infinity)
Vanishing
point
Origin of Edges
surface normal discontinuity
depth discontinuity
surface color discontinuity
illumination discontinuity
• Edges are caused by a variety of factors
Marr’s theory
Example
Closeup of edges
Source: D. Hoiem
Closeup of edges
Closeup of edges
Closeup of edges
Characterizing edges
• An edge is a place of rapid change in the image
intensity function
image
intensity function
(along horizontal scanline)
first derivative
edges correspond to
extrema of derivative
1st and 2nd derivative
See https://www.youtube.com/watch?v=uNP6ZwQ3r6A
Intensity profile
With a little Gaussian noise
Gradient
Consider pixel value as height
Plot intensity value as needle map,
i.e., (x, y, z), where x, y are image coordinate
and z is the pixel value (color bar indicates the
intensity value from low to high)
% Visualize pixel values as a surface: z = intensity at (x, y)
I = imread('lena_gray.png');
[nr, nc] = size(I);                % number of rows and columns
[xx, yy] = meshgrid(1:nc, 1:nr);   % pixel coordinate grids
Iv = im2double(I);                 % convert to double in [0, 1]
figure; mesh(xx, yy, Iv);          % intensity plotted as height
colorbar                           % color bar indicates intensity from low to high
figure; imshow(Iv)
First-order derivative filter
Second-order derivative filter
Effects of noise
• Consider a single row or column of the image
– Plotting intensity as a function of position gives a signal
Where is the edge?
Effects of noise
• Difference filters respond strongly to noise
– Image noise results in pixels that look very
different from their neighbors
– Generally, the larger the noise the stronger the
response
• What can we do about it?
Solution: smooth first
Convolve the signal f with a Gaussian kernel g to get f ⋆ g.
• To find edges, look for peaks in d/dx (f ⋆ g)
Derivative theorem of convolution
Recall associative property: a * (b * c) = (a * b) * c
• Differentiation is convolution, and convolution is associative:
  d/dx (f ⋆ g) = f ⋆ (d/dx g)
• This saves us one operation: convolve f directly with the derivative of the Gaussian, d/dx g.
Laplacian of Gaussian (LoG)
 2
2
(h  f ) =  2
2
x
 x

h   f

Laplacian of Gaussian
Laplacian of Gaussian
operator
Where is the edge?
Zero-crossings of bottom graph !
Derivative of Gaussian filter
* [1 -1] =
Finite differences
Partial derivative in the y axis responds strongly to horizontal edges.
Partial derivative in the x axis responds strongly to vertical edges.
Differentiation and convolution
• Recall
∂f/∂x = lim_{ε→0} ( f(x + ε, y) − f(x, y) ) / ε

• Now this is linear and shift invariant, so it must be the result of a convolution

  H = [ 0   0   0 ]
      [ 1   0  −1 ]
      [ 0   0   0 ]

• We could approximate this as

  ∂f/∂x ≈ ( f(x_{n+1}, y) − f(x_n, y) ) / Δx

  (which is obviously a convolution; it's not a very good way to do things, as we shall see)
Discrete edge operators
• How can we differentiate a discrete image?
Finite difference approximations:
∂I/∂x ≈ (1/2ε) [ (I_{i+1,j+1} − I_{i,j+1}) + (I_{i+1,j} − I_{i,j}) ]
∂I/∂y ≈ (1/2ε) [ (I_{i+1,j+1} − I_{i+1,j}) + (I_{i,j+1} − I_{i,j}) ]

over the 2×2 neighborhood
I_{i,j+1}   I_{i+1,j+1}
I_{i,j}     I_{i+1,j}

Convolution masks:

∂I/∂x ≈ (1/2ε) [ −1  1 ]        ∂I/∂y ≈ (1/2ε) [  1   1 ]
               [ −1  1 ]                        [ −1  −1 ]

See https://en.wikipedia.org/wiki/Edge_detection
Discrete edge operators
• Second order partial derivatives:
over the 3×3 neighborhood
I_{i−1,j+1}   I_{i,j+1}   I_{i+1,j+1}
I_{i−1,j}     I_{i,j}     I_{i+1,j}
I_{i−1,j−1}   I_{i,j−1}   I_{i+1,j−1}

∂I/∂x ≈ (1/ε) ( I_{i+1,j} − I_{i,j} )

∂²I/∂x² ≈ (1/ε) [ (1/ε)(I_{i+1,j} − I_{i,j}) − (1/ε)(I_{i,j} − I_{i−1,j}) ] = (1/ε²) ( I_{i−1,j} − 2 I_{i,j} + I_{i+1,j} )
∂²I/∂y² ≈ (1/ε²) ( I_{i,j−1} − 2 I_{i,j} + I_{i,j+1} )

• Laplacian:
∇²I = ∂²I/∂x² + ∂²I/∂y²

Convolution masks:

∇²I ≈ (1/ε²) [ 0   1   0 ]      or      (1/(6ε²)) [ 1    4    1 ]
             [ 1  −4   1 ]                        [ 4  −20    4 ]
             [ 0   1   0 ]                        [ 1    4    1 ]   (more accurate)
Spatial filter
• Approximation of image gradient
∇f = [ ∂f/∂x ; ∂f/∂y ],    |∇f| = [ (∂f/∂x)² + (∂f/∂y)² ]^(1/2)

For a 3×3 window of pixel values (filter weights)
Z1  Z2  Z3
Z4  Z5  Z6
Z7  Z8  Z9

different ways to compute the gradient magnitude:
|∇f| ≈ [ (z5 − z8)² + (z5 − z6)² ]^(1/2)
|∇f| ≈ |z5 − z8| + |z5 − z6|
|∇f| ≈ [ (z5 − z9)² + (z6 − z8)² ]^(1/2)
|∇f| ≈ |z5 − z9| + |z6 − z8|
Roberts operator
One of the earliest edge detection algorithms, by Lawrence Roberts.
Convolve the image with the two Roberts masks to get Gx and Gy, then
|∇I(x, y)| = G(x, y) = ( Gx² + Gy² )^(1/2)
θ(x, y) = arctan( Gy / Gx )
Sobel operator
One of the earliest edge detection algorithms, by Irwin Sobel
Gx+Gy
Gx
Gy
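A short sketch of the Sobel gradient computation (the test image is an assumption):

% Sketch: Sobel gradients, magnitude, and orientation
I  = im2double(imread('cameraman.tif'));
Sx = [-1 0 1; -2 0 2; -1 0 1];    % responds strongly to vertical edges (x derivative)
Sy = Sx';                         % responds strongly to horizontal edges (y derivative)
Gx = imfilter(I, Sx, 'replicate');
Gy = imfilter(I, Sy, 'replicate');
mag   = sqrt(Gx.^2 + Gy.^2);      % edge strength
theta = atan2(Gy, Gx);            % gradient direction
figure; imshow(mag, []);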
Image gradient
• Gradient: ∇f = ( ∂f/∂x , ∂f/∂y )
• Represents direction of most rapid change in intensity
• Gradient direction: θ = tan⁻¹( (∂f/∂y) / (∂f/∂x) )
• The edge strength is given by the gradient magnitude: |∇f| = ( (∂f/∂x)² + (∂f/∂y)² )^(1/2)
2D Gaussian edge operators
Gaussian
Derivative of Gaussian (DoG)
Laplacian of Gaussian (LoG)
Mexican Hat (Sombrero)
• ∇² is the Laplacian operator: ∇² = ∂²/∂x² + ∂²/∂y²
Marr-Hildreth algorithm
Difference of Gaussians (DoG)
• Laplacian of Gaussian can be approximated by the
difference between two different Gaussians
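A small sketch comparing a DoG kernel with the LoG kernel it approximates; the scale ratio k ≈ 1.6 and the kernel size are assumptions:

% Sketch: difference of Gaussians vs. Laplacian of Gaussian
sigma = 2;  k = 1.6;                          % assumed scale ratio
hsize = 2*ceil(4*k*sigma) + 1;
G1  = fspecial('gaussian', hsize, sigma);
G2  = fspecial('gaussian', hsize, k*sigma);
DoG = G2 - G1;                                % difference of Gaussians kernel
LoG = fspecial('log', hsize, sigma);          % Laplacian of Gaussian kernel
mid = ceil(hsize/2);
figure; plot(DoG(mid,:) / max(abs(DoG(:)))); hold on;
plot(LoG(mid,:) / max(abs(LoG(:))));          % the two center profiles agree up to scale
legend('DoG', 'LoG');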
Smoothing and localization
1 pixel
3 pixels
7 pixels
• Smoothed derivative removes noise, but blurs
edge. Also finds edges at different “scales”.
(figure: LoG zero crossings for sigma = 2, 4 and contrast = 1, 4, as threshold and scale vary)
We still have unfortunate behavior
at corners and trihedral areas
Implementation issues
• The gradient magnitude is large along a thick “trail”
or “ridge,” so how do we identify the actual edge
points?
• How do we link the edge points to form curves?
Designing an edge detector
• Criteria for a good edge detector:
– Good detection
• the optimal detector should find all real edges, ignoring
noise or other artifacts
– Good localization
• the edges detected must be as close as possible to the
true edges
• the detector must return one point only for each true
edge point
• Cues of edge detection
– Differences in color, intensity, or texture across the
boundary
– Continuity and closure
– High-level knowledge
Canny edge detector
• Probably the most widely used edge detector
in computer vision
• Theoretical model: step-edges corrupted by
additive Gaussian noise
• John Canny has shown that the first derivative
of the Gaussian closely approximates the
operator that optimizes the product of signal-to-noise ratio and localization
J. Canny, A Computational Approach To Edge Detection, IEEE
Trans. Pattern Analysis and Machine Intelligence, 8:679-714, 1986.
http://www.mathworks.com/discovery/edge-detection.html
Example
original image (Lena)
Derivative of Gaussian filter
x-direction
y-direction
Compute gradients
X-Derivative of Gaussian
Y-Derivative of Gaussian
Gradient Magnitude
Get orientation at each pixel
• Threshold at minimum level
• Get orientation
theta = atan2(gy, gx)
Non-maximum suppression for each
orientation
At q, we have a maximum if the value
is larger than those at both p and at r.
Interpolate to get these values.
Select the single maximum point
across the width of an edge
A variant is to use Laplacian
zero crossing along the gradient
direction.
Edge linking
Assume the marked point is
an edge point. Then we
construct the tangent to the
edge curve (which is normal
to the gradient at that point)
and use this to predict the
next points (here either r or s).
Sidebar: Bilinear Interpolation
http://en.wikipedia.org/wiki/Bilinear_interpolation
Sidebar: Interpolation options
• imx2 = imresize(im, 2, interpolation_type)
• ‘nearest’
– Copy value from nearest known
– Very fast but creates blocky edges
• ‘bilinear’
– Weighted average from four nearest known pixels
– Fast and reasonable results
• ‘bicubic’ (default)
– Non-linear smoothing over larger area (4x4)
– Slower, visually appealing, may create negative
pixel values
Before non-max suppression
After non-max suppression
Hysteresis thresholding
• Threshold at low/high levels to get weak/strong edge pixels
• Use connected components, starting from strong edge pixels
Hysteresis thresholding
• Check that the maximum gradient value is sufficiently large
– drop-outs? use hysteresis
• use a high threshold to start edge curves and a low
threshold to continue them.
Source: S. Seitz
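A sketch of the two-threshold idea using morphological reconstruction to keep weak pixels that are connected to a strong pixel; the threshold fractions are assumptions:

% Sketch: hysteresis thresholding of the gradient magnitude
I  = im2double(imread('cameraman.tif'));
Gx = imfilter(I, [-1 0 1; -2 0 2; -1 0 1], 'replicate');
Gy = imfilter(I, [-1 -2 -1; 0 0 0; 1 2 1], 'replicate');
mag = sqrt(Gx.^2 + Gy.^2);
hi = 0.2 * max(mag(:));  lo = 0.1 * max(mag(:));   % assumed high/low thresholds
strong = mag > hi;                                 % definite edge pixels
weak   = mag > lo;                                 % "maybe" edge pixels (a superset of strong)
edges  = imreconstruct(strong, weak);              % grow strong edges through connected weak pixels
figure; imshow(edges);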
Final Canny edges
Canny edge detector
1. Filter image with x, y derivatives of Gaussian
2. Find magnitude and orientation of gradient
3. Non-maximum suppression:
– Thin multi-pixel wide “ridges” down to single pixel width
4. Thresholding and linking (hysteresis):
– Define two thresholds: low and high
– Use the high threshold to start edge curves and the low
threshold to continue them
•
MATLAB: edge(image, ‘canny’)
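Usage sketch for the built-in detector (the image name, thresholds, and σ are assumptions):

% Sketch: Canny with default vs. explicit parameters
I  = imread('cameraman.tif');
E1 = edge(I, 'canny');                       % automatic thresholds, default sigma
E2 = edge(I, 'canny', [0.05 0.15], 2);       % [low high] thresholds, sigma = 2
figure; imshow([E1, E2]);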
Effect of σ (Gaussian kernel size)
original
Canny with large σ
Canny with small σ
The choice of σ depends on desired behavior
• large σ detects large scale edges
• small σ detects fine features
Where do humans see boundaries?
image
human segmentation
gradient magnitude
• Berkeley segmentation database:
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
pB boundary detector
Martin, Fowlkes, Malik 2004: Learning to Detect
Natural Boundaries…
http://www.eecs.berkeley.edu/Research/Projects/CS/
vision/grouping/papers/mfm-pami-boundary.pdf
Brightness
Color
Texture
Combined
Human
Pb (0.88)
Human (0.95)
State of edge detection
• Local edge detection works well
– But many false positives from illumination and
texture edges
• Some methods to take into account longer
contours, but could probably do better
• Poor use of object and high-level information
Sketching
• Learn from artist’s strokes so that edges are
more likely in certain parts of the face.
Berger et al. SIGGRAPH 2013
CSE 185
Introduction to Computer Vision
Local Invariant Features
Local features
• Interest points
• Descriptors
• Reading: Chapter 4
Correspondence across views
• Correspondence: matching points, patches,
edges, or regions across images
≈
Example: estimating fundamental
matrix that corresponds two views
Example: structure from motion
Applications
• Feature points are used for:
– Image alignment
– 3D reconstruction
– Motion tracking
– Robot navigation
– Indexing and database retrieval
– Object recognition
Interest points
• Note: “interest points” = “keypoints”, also
sometimes called “features”
• Many applications
– tracking: which points are good to track?
– recognition: find patches likely to tell us
something about object category
– 3D reconstruction: find correspondences
across different views
Interest points
original
• Suppose you need to
click on some point,
go away and come
back after I deform the
image, and click on the
same points again.
– Which points would
you choose?
deformed
Keypoint matching
A1
A2
A3
fA
fB
d(f_A, f_B) < T
1. Find a set of
distinctive keypoints
2. Define a region
around each
keypoint
3. Extract and
normalize the
region content
4. Compute a local
descriptor from the
normalized region
5. Match local
descriptors
Goals for keypoints
Detect points that are repeatable and distinctive
Key trade-offs
A1
A2
A3
Detection of interest points
More Repeatable
Robust detection
Precise localization
More Points
Robust to occlusion
Works with less texture
Description of patches
More Distinctive
Minimize wrong matches
More Flexible
Robust to expected variations
Maximize correct matches
Invariant local features
• Image content is transformed into local feature coordinates that are
invariant to translation, rotation, scale, and other imaging parameters
Features Descriptors
Natural Language
• Stop words: occur frequently but do not
contain much information
Choosing interest points
Where would you tell your friend to meet you?
Choosing interest points
Where would you tell your friend to meet you?
Feature extraction: Corners
Why extract features?
• Motivation: panorama stitching
– We have two images – how do we combine them?
Local features: main components
1) Detection: Identify the
interest points
2) Description: Extract a vector feature descriptor surrounding each interest point:
   x₁ = [ x₁⁽¹⁾, …, x_d⁽¹⁾ ]
3) Matching: Determine correspondence between descriptors in two views:
   x₂ = [ x₁⁽²⁾, …, x_d⁽²⁾ ]
Characteristics of good features
• Repeatability
– The same feature can be found in several images despite geometric and photometric
transformations
• Saliency
– Each feature is distinctive
• Compactness and efficiency
– Fewer features than image pixels
• Locality
– A feature occupies a relatively small area of the image; robust to clutter and
occlusion
Interest operator repeatability
• We want to detect (at least some of) the
same points in both images
No chance to find true matches!
• Yet we have to be able to run the detection
procedure independently per image
Descriptor distinctiveness
• We want to be able to reliably determine
which point goes with which
?
• Must provide some invariance to geometric
and photometric differences between the two
views
Local features: main components
1) Detection: Identify the interest points
2) Description: Extract vector feature descriptor surrounding
each interest point
3) Matching: Determine correspondence between
descriptors in two views
Many detectors
Hessian & Harris
Laplacian, DoG
Harris-/Hessian-Laplace
Harris-/Hessian-Affine
EBR and IBR
MSER
Salient Regions
Others…
[Beaudet ‘78], [Harris ‘88]
[Lindeberg ‘98], [Lowe 1999]
[Mikolajczyk & Schmid ‘01]
[Mikolajczyk & Schmid ‘04]
[Tuytelaars & Van Gool ‘04]
[Matas ‘02]
[Kadir & Brady ‘01]
Corner detection
• We should easily recognize the point by looking
through a small window
• Shifting a window in any direction should give a large
change in intensity
“flat” region:
no change in all
directions
“edge”:
no change along
the edge direction
“corner”:
significant change
in all directions
Finding corners
• Key property: in the region around a corner,
image gradient has two or more dominant
directions
• Corners are repeatable and distinctive
Corner detection: Mathematics
Change in appearance of window w(x,y)
for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
I(x, y)
E(u, v)
E(3,2)
w(x, y)
Corner detection: Mathematics
Change in appearance of window w(x,y)
for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
I(x, y)
E(u, v)
E(0,0)
w(x, y)
Corner detection: Mathematics
Change in appearance of window w(x,y)
for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
Window
function
Shifted
intensity
Window function w(x,y) =
Intensity
or
1 in window, 0 outside
Gaussian
Moravec corner detector
Test each pixel with change of intensity for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
Window
function
Shifted
intensity
Four shifts: (u,v) = (1,0), (1,1), (0,1), (-1, 1)
Look for local maxima in min{E}
When does this idea fail?
Intensity
Moravec corner detector
• In a region of uniform intensity,
then the nearby patches will look
similar.
• On an edge, then nearby patches in
a direction perpendicular to the
edge will look quite different, but
nearby patches in a direction
parallel to the edge will result in
only a small change.
• On a feature with variation in all
directions, then none of the nearby
patches will look similar.
“flat” region:
no change in all
directions
“edge”:
no change along the
edge direction
“corner”:
significant change in
all directions
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
Four shifts: (u,v) = (1,0), (1,1), (0,1), (-1, 1)
Look for local maxima in min{E}
Problem of Moravec detector
• Only a set of shifts at every 45 degree is
considered
• Noisy response due to a binary window
function
• Only minimum of E is taken into account
Harris corner detector (1988) solves these
problems.
C. Harris and M. Stephens. "A Combined Corner and Edge Detector.“
Proceedings of the 4th Alvey Vision Conference: pages 147--151.
Corner detection: Mathematics
Change in appearance of window w(x,y)
for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
We want to find out how this function behaves for
small shifts
E(u, v)
Corner detection: Mathematics
Change in appearance of window w(x,y)
for the shift [u,v]:
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
We want to find out how this function behaves for
small shifts
Local quadratic approximation of E(u,v) in the
neighborhood of (0,0) is given by the second-order
Taylor expansion:
E(u, v) ≈ E(0, 0) + [u v] [ E_u(0,0) ; E_v(0,0) ] + (1/2) [u v] [ E_uu(0,0)  E_uv(0,0) ; E_uv(0,0)  E_vv(0,0) ] [ u ; v ]
Taylor series
• Taylor series of the function f at a
f(x) = Σ_{n=0}^{∞} f⁽ⁿ⁾(a) / n! · (x − a)ⁿ
     = f(a) + f′(a)/1! · (x − a) + f″(a)/2! · (x − a)² + f‴(a)/3! · (x − a)³ + …
• Maclaurin series (Taylor series when a=0)
Taylor expansion
• Second order Taylor expansion for vectors
Local quadratic approximation of E(u,v) in the
neighborhood of (0,0) is given by the second-order
Taylor expansion:
E(u, v) ≈ E(0, 0) + [u v] [ E_u(0,0) ; E_v(0,0) ] + (1/2) [u v] [ E_uu(0,0)  E_uv(0,0) ; E_uv(0,0)  E_vv(0,0) ] [ u ; v ]
Corner detection: Mathematics
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
Second-order Taylor expansion of E(u,v) about (0,0):
E(u, v) ≈ E(0, 0) + [u v] [ E_u(0,0) ; E_v(0,0) ] + (1/2) [u v] [ E_uu(0,0)  E_uv(0,0) ; E_uv(0,0)  E_vv(0,0) ] [ u ; v ]
E_u(u, v) = Σ_{x,y} 2 w(x, y) [ I(x + u, y + v) − I(x, y) ] I_x(x + u, y + v)

E_uu(u, v) = Σ_{x,y} 2 w(x, y) I_x(x + u, y + v) I_x(x + u, y + v)
           + Σ_{x,y} 2 w(x, y) [ I(x + u, y + v) − I(x, y) ] I_xx(x + u, y + v)

E_uv(u, v) = Σ_{x,y} 2 w(x, y) I_y(x + u, y + v) I_x(x + u, y + v)
           + Σ_{x,y} 2 w(x, y) [ I(x + u, y + v) − I(x, y) ] I_xy(x + u, y + v)
Corner detection: Mathematics
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²
Second-order Taylor expansion of E(u,v) about (0,0):
E(u, v) ≈ E(0, 0) + [u v] [ E_u(0,0) ; E_v(0,0) ] + (1/2) [u v] [ E_uu(0,0)  E_uv(0,0) ; E_uv(0,0)  E_vv(0,0) ] [ u ; v ]
E(0, 0) = 0,   E_u(0, 0) = 0,   E_v(0, 0) = 0

E_uu(0, 0) = Σ_{x,y} 2 w(x, y) I_x(x, y) I_x(x, y)
E_vv(0, 0) = Σ_{x,y} 2 w(x, y) I_y(x, y) I_y(x, y)
E_uv(0, 0) = Σ_{x,y} 2 w(x, y) I_x(x, y) I_y(x, y)
Corner detection: Mathematics
The quadratic approximation simplifies to
E(u, v) ≈ [u v] M [ u ; v ]

where M is a second moment matrix computed from image derivatives:

M = Σ_{x,y} w(x, y) [ I_x²      I_x I_y ]
                    [ I_x I_y   I_y²    ]
Simpler Derivation of E(u,v)
E(u, v) = Σ_{x,y} w(x, y) [ I(x + u, y + v) − I(x, y) ]²

First-order Taylor approximation:
f(x + u, y + v) ≈ f(x, y) + u f_x(x, y) + v f_y(x, y)
I(x + u, y + v) ≈ I(x, y) + u I_x(x, y) + v I_y(x, y)

Σ [ I(x + u, y + v) − I(x, y) ]² ≈ Σ [ u I_x(x, y) + v I_y(x, y) ]²
  = Σ ( u² I_x² + 2uv I_x I_y + v² I_y² )
  = [u v] ( Σ [ I_x²      I_x I_y ] ) [ u ]
          (   [ I_x I_y   I_y²    ] ) [ v ]
Corners as distinctive Interest
Points
M = Σ w(x, y) [ I_x I_x   I_x I_y ]
              [ I_x I_y   I_y I_y ]

2 × 2 matrix of image derivatives (averaged in a neighborhood of a point).

Notation:   I_x ⇔ ∂I/∂x,   I_y ⇔ ∂I/∂y,   I_x I_y ⇔ (∂I/∂x)(∂I/∂y)
Interpreting the second moment
matrix
The surface E(u,v) is locally approximated by a
quadratic form. Let’s try to understand its shape.
E(u, v) ≈ [u v] M [ u ; v ]

M = Σ_{x,y} w(x, y) [ I_x²      I_x I_y ]
                    [ I_x I_y   I_y²    ]
Interpreting the second moment
matrix
First, consider the axis-aligned case
(gradients are either horizontal or vertical)
M = Σ_{x,y} w(x, y) [ I_x²      I_x I_y ]  =  [ λ₁   0 ]
                    [ I_x I_y   I_y²    ]     [ 0   λ₂ ]

Diagonalization of M:   M = R⁻¹ [ λ₁  0 ; 0  λ₂ ] R

Here, λ₁ and λ₂ are eigenvalues of M (and R is a matrix consisting of eigenvectors).
If either λ is close to 0, then this is not a corner, so look for
locations where both are large.
Interpreting the second moment
matrix

E(u, v) ≈ [u v] M [ u ; v ]

Consider a horizontal “slice” of E(u, v):   [u v] M [ u ; v ] = const
This is the equation of an ellipse.
Interpreting the second moment
matrix
Consider a horizontal “slice” of E(u, v):   [u v] M [ u ; v ] = const
This is the equation of an ellipse.

Diagonalization of M:   M = R⁻¹ [ λ₁  0 ; 0  λ₂ ] R

The axis lengths of the ellipse are determined by the eigenvalues and the orientation is determined by R:
the direction of the fastest change has axis length (λ_max)^(−1/2);
the direction of the slowest change has axis length (λ_min)^(−1/2).
Visualization of second moment
matrices
Visualization of second moment
matrices
Interpreting the eigenvalues
Classification of image points using the eigenvalues of M:

“Flat” region: λ₁ and λ₂ are small; E is almost constant in all directions
“Edge”: λ₁ >> λ₂ (or λ₂ >> λ₁)
“Corner”: λ₁ and λ₂ are large, λ₁ ~ λ₂; E increases in all directions
Corner response function
R = det(M) − α trace(M)² = λ₁ λ₂ − α (λ₁ + λ₂)²

α: constant (0.04 to 0.06)

“Corner”: R > 0
“Edge”: R < 0
“Flat” region: |R| small

recall   M = R⁻¹ [ λ₁  0 ; 0  λ₂ ] R,   where R is a matrix with two orthonormal bases (eigenvectors of unit norm)
Harris corner detector
1) Compute M matrix for each image window
to get their cornerness scores.
2) Find points whose surrounding window
gave large corner response (f>threshold)
3) Take the points of local maxima, i.e.,
perform non-maximum suppression
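A compact sketch of these three steps; the derivative filters, window σ, α, and the response threshold are assumptions:

% Sketch: Harris corner detection
I  = im2double(imread('cameraman.tif'));
Ix = imfilter(I, [-1 0 1],  'replicate');          % image derivatives
Iy = imfilter(I, [-1; 0; 1], 'replicate');
g  = fspecial('gaussian', 9, 1.5);                 % window function w(x,y)
Sxx = imfilter(Ix.^2,  g, 'replicate');            % entries of the second moment matrix M
Syy = imfilter(Iy.^2,  g, 'replicate');
Sxy = imfilter(Ix.*Iy, g, 'replicate');
alpha = 0.05;
R = (Sxx.*Syy - Sxy.^2) - alpha * (Sxx + Syy).^2;  % cornerness = det(M) - alpha*trace(M)^2
localMax = (R == imdilate(R, ones(3)));            % non-maximum suppression over 3x3 windows
corners  = localMax & (R > 0.01 * max(R(:)));      % keep strong local maxima
[ys, xs] = find(corners);
figure; imshow(I); hold on; plot(xs, ys, 'r+');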
Harris corner detector [Harris88]
• Second moment matrix

  μ(σ_I, σ_D) = g(σ_I) ⋆ [ I_x²(σ_D)       I_x I_y(σ_D) ]
                          [ I_x I_y(σ_D)    I_y²(σ_D)    ]

1. Image derivatives I_x, I_y (optionally, blur first)
2. Square of derivatives: I_x², I_y², I_x I_y
3. Gaussian filter g(σ_I): g(I_x²), g(I_y²), g(I_x I_y)
4. Cornerness function — both eigenvalues are strong:

  har = det[ μ(σ_I, σ_D) ] − α [ trace( μ(σ_I, σ_D) ) ]²
      = g(I_x²) g(I_y²) − [ g(I_x I_y) ]² − α [ g(I_x²) + g(I_y²) ]²

5. Non-maxima suppression

recall:   det M = λ₁ λ₂,   trace M = λ₁ + λ₂
Harris Detector: Steps
Harris Detector: Steps
Compute corner response R
Harris Detector: Steps
Find points with large corner response: R>threshold
Harris Detector: Steps
Take only the points of local maxima of R
Harris Detector: Steps
Invariance and covariance
•
We want corner locations to be invariant to photometric transformations and covariant
to geometric transformations
– Invariance: image is transformed and corner locations do not change
– Covariance: if we have two transformed versions of the same image,
features should be detected in corresponding locations
Affine intensity change
I→aI+b
• Only derivatives are used =>
invariance to intensity shift I → I + b
• Intensity scaling: I → a I
R
R
threshold
x (image coordinate)
x (image coordinate)
Partially invariant to affine intensity change
Image translation
• Derivatives and window function are shift-invariant
Corner location is covariant w.r.t. translation
Image rotation
Second moment ellipse rotates but its shape
(i.e. eigenvalues) remains the same
Corner location is covariant w.r.t. rotation
Scaling
Corner
All points will
be classified
as edges
Corner location is not covariant to scaling!
More local features
• SIFT (Scale Invariant Feature Transform)
• Harris Laplace
• Harris Affine
• Hessian detector
• Hessian Laplace
• Hessian Affine
• MSER (Maximally Stable Extremal Regions)
• SURF (Speeded-Up Robust Features)
• BRIEF (Binary Robust Independent Elementary Features)
• ORB (Oriented FAST and Rotated BRIEF)
CSE 185
Introduction to Computer Vision
Feature Matching
Feature matching
• Keypoint matching
• Hough transform
• Reading: Chapter 4
Feature matching
• Correspondence: matching points, patches,
edges, or regions across images
≈
Keypoint matching
A1
A2
A3
fA
fB
d(f_A, f_B) < T
1. Find a set of
distinctive keypoints
2. Define a region
around each
keypoint
3. Extract and
normalize the
region content
4. Compute a local
descriptor from the
normalized region
5. Match local
descriptors
Review: Interest points
• Keypoint detection: repeatable
and distinctive
– Corners, blobs, stable regions
– Harris, DoG, MSER
– SIFT
Which interest detector
• What do you want it for?
– Precise localization in x-y: Harris
– Good localization in scale: Difference of Gaussian
– Flexible region shape: MSER
• Best choice often application dependent
– Harris-/Hessian-Laplace/DoG work well for many natural categories
– MSER works well for buildings and printed things
• Why choose?
– Get more points with more detectors
• There have been extensive evaluations/comparisons
– [Mikolajczyk et al., IJCV’05, PAMI’05]
– All detectors/descriptors shown here work well
Local feature descriptors
• Most features can be thought of
as templates, histograms
(counts), or combinations
• The ideal descriptor should be
– Robust and distinctive
– Compact and efficient
• Most available descriptors focus
on edge/gradient information
– Capture texture information
– Color rarely used
• Scale-invariant feature transform
(SIFT) descriptor
How to decide which features match?
Feature matching
• Szeliski 4.1.3
–
–
–
–
Simple feature-space methods
Evaluation methods
Acceleration methods
Geometric verification (Chapter 6)
Feature matching
• Simple criteria: One feature matches to
another if those features are nearest
neighbors and their distance is below some
threshold.
• Problems:
– Threshold is difficult to set
– Non-distinctive features could have lots of close
matches, only one of which is correct
Matching local features
• Threshold based on the ratio of 1st nearest neighbor
to 2nd nearest neighbor distance
Reject all matches in which
the distance ratio > 0.8, which
eliminates 90% of false matches
while discarding less than 5%
correct matches
If there is a good match, the second
closest matched point should not be
close to the best match point
(i.e., remove ambiguous matches)
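A matching sketch using the ratio test; the descriptor matrices here are random placeholders standing in for real SIFT descriptors:

% Sketch: nearest-neighbor matching with Lowe's ratio test
descA = rand(100, 128);  descB = rand(120, 128);   % placeholder descriptors (rows = keypoints)
D = sqrt(max(0, sum(descA.^2, 2) + sum(descB.^2, 2)' - 2 * (descA * descB')));  % pairwise distances
[sorted, nn] = sort(D, 2);                         % for each A descriptor, sort B by distance
ratio = sorted(:, 1) ./ sorted(:, 2);              % 1st-NN distance / 2nd-NN distance
good  = ratio < 0.8;                               % reject ambiguous matches
matches = [find(good), nn(good, 1)];               % rows: [index in A, index in B]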
SIFT repeatability
It shows the stability of detection for keypoint location, orientation, and final matching to a
database as a function of affine distortion. The degree of affine distortion is expressed
in terms of the equivalent viewpoint rotation in depth for a planar surface.
Matching features
What do we do about the “bad” matches?
Good match: 1. crop a patch at each interest point and compute the distance between the two; 2. if the distance is small, then it is a good match
RAndom SAmple Consensus
Select one match, count inliers
Blue line: randomly selected line
Yellow line: inlier (4)
Red line: outlier (2)
RAndom SAmple Consensus
Select one match, count inliers
Blue line: randomly selected line
Yellow line: inlier (1)
Red line: outlier (5)
Least squares fit
Find “average” translation vector
Blue line: randomly selected line
Yellow line: inlier
Red line: outlier
RANSAC
• Random Sample Consensus
• Choose a small subset uniformly
at random
• Fit to that
• Anything that is close to result is
signal; all others are noise
• Refit
• Do this many times and choose
the best
• Issues
– How many times?
• Often enough that we are
likely to have a good line
– How big a subset?
• Smallest possible
– What does close mean?
• Depends on the problem
– What is a good line?
• One where the number of
nearby points is so big it is
unlikely to be all outliers
Descriptor Vector
• Orientation = blurred gradient
• Similarity Invariant Frame
– Scale-space position (x, y, s) + orientation ()
Image Stitching
RANSAC for Homography
• SIFT features of two similar images
RANSAC for Homography
• SIFT features common to both images
RANSAC for Homography
• Select random subset of features (e.g. 6)
• Compute motion estimate
• Apply motion estimate to all SIFT features
• Compute error: feature pairs not
described by motion estimate
• Repeat many times (e.g. 500)
• Keep estimate with best error
RANSAC for Homography
Probabilistic model for verification
• Potential problem:
• Two images don’t match…
• … but RANSAC found a motion estimate
• Do a quick check to make sure the images do
match
• MAP for inliers vs. outliers
Finding the panoramas
Finding the panoramas
Finding connected
components
Finding the panoramas
Results
Fitting and alignment
Fitting: find the parameters of a model that
best fit the data
Alignment: find the parameters of the
transformation that best align matched points
Checkerboard
• Often used in camera calibration
See https://docs.opencv.org/master/d9/dab/tutorial_homography.html
https://www.mathworks.com/help/vision/ref/estimatecameraparameters.html
Fitting and alignment
• Design challenges
– Design a suitable goodness of fit measure
• Similarity should reflect application goals
• Encode robustness to outliers and noise
– Design an optimization method
• Avoid local optima
• Find best parameters quickly
Fitting and alignment: Methods
• Global optimization / Search for parameters
– Least squares fit
– Robust least squares
– Iterative closest point (ICP)
• Hypothesize and test
– Generalized Hough transform
– RANSAC
Least squares line fitting
y = m x + b

• Data: (x₁, y₁), …, (xₙ, yₙ)
• Line equation: yᵢ = m xᵢ + b
• Find (m, b) that minimize

  E = Σ_{i=1}^{n} ( yᵢ − m xᵢ − b )²

Stacking the equations as y = A p, with A = [ x₁ 1 ; … ; xₙ 1 ] and p = [ m ; b ]:

  dE/dp = 2 Aᵀ A p − 2 Aᵀ y = 0
  Aᵀ A p = Aᵀ y   ⇒   p = ( Aᵀ A )⁻¹ Aᵀ y

Matlab: p = A \ y;
Least squares (global) optimization
Good
• Clearly specified objective
• Optimization is easy
Bad
• May not be what you want to optimize
• Sensitive to outliers
– Bad matches, extra points
• Doesn’t allow you to get multiple good fits
– Detecting multiple objects, lines, etc.
Hypothesize and test
1. Propose parameters
   – Try all possible
   – Each point votes for all consistent parameters
   – Repeatedly sample enough points to solve for parameters
2. Score the given parameters
   – Number of consistent points, possibly weighted by distance
3. Choose from among the set of parameters
   – Global or local maximum of scores
4. Possibly refine parameters using inliers
Hough transform: Outline
1. Create a grid of parameter values
2. Each point votes for a set of parameters,
incrementing those values in grid
3. Find maximum or local maxima in grid
Hough transform
Given a set of points, find the curve or line that explains
the data points best
Duality: each point has a dual line in the parameter space.

Image space (axes x, y):   y = m x + b — m and b are known variables, y and x are unknown variables
Hough space (axes m, b):   m = −(1/x) b + y/x — x and y are known variables, m and b are unknown variables
Hough transform
(figure: points in image space (x, y) and the corresponding accumulator of votes in Hough space (m, b))
Hough transform
Issue : [m,b] is unbounded… −∞ ≤ 𝑚 ≤ ∞ → large accumulator
Use a polar representation for the parameter space, 0 ≤ 𝜃 < 𝜋
Duality: Each
point has a dual
curve in the
parameter space
Image space (x, y)  →  Hough space (θ, r):

y = −( cos θ / sin θ ) x + ( r / sin θ )
𝑟 = 𝑥 cos 𝜃 + 𝑦 sin 𝜃
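A usage sketch with the Image Processing Toolbox Hough functions (the image name and parameter values are assumptions):

% Sketch: detect line segments with the (theta, r) Hough transform
I  = imread('cameraman.tif');
BW = edge(I, 'canny');
[H, theta, rho] = hough(BW);                  % accumulator over (theta, rho)
peaks = houghpeaks(H, 10);                    % up to 10 strongest peaks
lines = houghlines(BW, theta, rho, peaks);    % line segments supported by those peaks
figure; imshow(I); hold on;
for k = 1:numel(lines)
  xy = [lines(k).point1; lines(k).point2];
  plot(xy(:,1), xy(:,2), 'g', 'LineWidth', 2);
end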
Hough transform: Experiments
features
votes
Hough transform
Hough transform: Experiments
Noisy data
features
Need to adjust grid size or smooth
votes
Hough transform: Experiments
features
votes
Issue: spurious peaks due to uniform noise
1. Image → Canny
2. Canny → Hough votes
3. Hough votes → Edges
Find peaks and post-process
Hough transform example
Hough transform: Finding lines
• Using m,b parameterization
• Using r, theta parameterization
– Using oriented gradients
• Practical considerations
  – Bin size
  – Smoothing
  – Finding multiple lines
  – Finding line segments
CSE 185
Introduction to Computer Vision
Fitting and Alignment
Fitting and alignment
• RANSAC
• Transformation
• Iterative closest point
• Reading: Chapter 4
Correspondence and alignment
• Correspondence: matching points, patches,
edges, or regions across images
≈
Fitting and alignment: Methods
• Global optimization / Search for parameters
– Least squares fit
– Robust least squares
– Iterative closest point (ICP)
• Hypothesize and test
– Hough transform
– RANSAC
Hough transform
Good
• Robust to outliers: each point votes separately
• Fairly efficient (much faster than trying all sets of parameters)
• Provides multiple good fits
Bad
• Some sensitivity to noise
• Bin size trades off between noise tolerance, precision, and
speed/memory
– Can be hard to find sweet spot
• Not suitable for more than a few parameters
– grid size grows exponentially
Common applications
• Line fitting (also circles, ellipses, etc.)
• Object instance recognition (parameters are affine transform)
• Object category recognition (parameters are position/scale)
RANSAC
RANdom SAmple Consensus:
Learning technique to estimate
parameters of a model by random
sampling of observed data
Fischler & Bolles in ‘81.
RANSAC
Algorithm:
1. Sample (randomly) the number of points required to fit the model
2. Solve for model parameters using samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence
RANSAC
Line fitting example
Algorithm:
1. Sample (randomly) the number of points required to fit the model (#=2)
2. Solve for model parameters using samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence
RANSAC
Line fitting example
Algorithm:
1. Sample (randomly) the number of points required to fit the model (#=2)
2. Solve for model parameters using samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence
RANSAC
Line fitting example
N_I = 6
Algorithm:
1. Sample (randomly) the number of points required to fit the model (#=2)
2. Solve for model parameters using samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence
RANSAC
Algorithm:
N_I = 14
1. Sample (randomly) the number of points required to fit the model (#=2)
2. Solve for model parameters using samples
3. Score by the fraction of inliers within a preset threshold of the model
Repeat 1-3 until the best model is found with high confidence
How to choose parameters?
• Number of samples N
– Choose N so that, with probability p, at least one random sample is free
from outliers (e.g. p=0.99) (outlier ratio: e )
• Number of sampled points s
– Minimum number needed to fit the model
• Distance threshold t
  – Choose t so that a good point with noise is likely (e.g., prob = 0.95) to fall within the threshold
  – Zero-mean Gaussian noise with std. dev. σ: t² = 3.84 σ²

N = log(1 − p) / log( 1 − (1 − e)^s )

Number of samples N (for p = 0.99) as a function of the sample size s and the proportion of outliers e:

s \ e    5%   10%   20%   25%   30%   40%    50%
2         2    3     5     6     7    11     17
3         3    4     7     9    11    19     35
4         3    5     9    13    17    34     72
5         4    6    12    17    26    57    146
6         4    7    16    24    37    97    293
7         4    8    20    33    54   163    588
8         5    9    26    44    78   272   1177
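A one-line check of the formula against the table (s = 4 and e = 30% are picked as an example):

% Sketch: number of RANSAC samples needed
p = 0.99;  e = 0.30;  s = 4;                  % confidence, outlier ratio, sample size
N = ceil(log(1 - p) / log(1 - (1 - e)^s))     % = 17, matching the table entry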
RANSAC
Good
• Robust to outliers
• Applicable for larger number of objective function parameters
than Hough transform
• Optimization parameters are easier to choose than Hough
transform
Bad
• Computational time grows quickly with fraction of outliers
and number of parameters
• Not good for getting multiple fits
Common applications
• Computing a homography (e.g., image stitching)
• Estimating fundamental matrix (relating two views)
How do we fit the best alignment?
Alignment
• Alignment: find parameters of model that
maps one set of points to another
• Typically want to solve for a global
transformation that accounts for *most* true
correspondences
• Difficulties
– Noise (typically 1-3 pixels)
– Outliers (often 50%)
– Many-to-one matches or multiple objects
Parametric (global) warping
T
p = (x,y)
p’ = (x’,y’)
Transformation T is a coordinate-changing machine:
p’ = T(p)
What does it mean that T is global?
– Is the same for any point p
– can be described by just a few numbers (parameters)
For linear transformations, we can represent T as a matrix
p’ = Tp
[ x' ; y' ] = T [ x ; y ]
Common transformations
original
Transformed
aspect
rotation
translation
affine
perspective
Scaling
• Scaling a coordinate means multiplying each of its components by a
scalar
• Uniform scaling means this scalar is the same for all components:
2
Scaling
• Non-uniform scaling: different scalars per component:
X  2,
Y  0.5
Scaling
• Scaling operation:
x' = a x
y' = b y
• Or, in matrix form:
  [ x' ; y' ] = [ a  0 ; 0  b ] [ x ; y ]
  (scaling matrix S)
2D rotation
(x’, y’)
(x, y)
x' = x cos(θ) − y sin(θ)
y' = x sin(θ) + y cos(θ)
2D rotation
(x’, y’)
(x, y)
Polar coordinates…
x = r cos(φ),   y = r sin(φ)
x' = r cos(φ + θ),   y' = r sin(φ + θ)
Trig identity…
x' = r cos(φ) cos(θ) − r sin(φ) sin(θ)
y' = r sin(φ) cos(θ) + r cos(φ) sin(θ)
Substitute…
x' = x cos(θ) − y sin(θ)
y' = x sin(θ) + y cos(θ)
2D rotation
This is easy to capture in matrix form:
[ x' ; y' ] = [ cos(θ)  −sin(θ) ; sin(θ)  cos(θ) ] [ x ; y ]   (rotation matrix R)

Even though sin(θ) and cos(θ) are nonlinear functions of θ,
– x' is a linear combination of x and y
– y' is a linear combination of x and y
What is the inverse transformation?
– Rotation by −θ
– For rotation matrices, R⁻¹ = Rᵀ
Basic 2D transformations
Scale:       [ x' ; y' ] = [ s_x  0 ; 0  s_y ] [ x ; y ]
Shear:       [ x' ; y' ] = [ 1  a_x ; a_y  1 ] [ x ; y ]
Rotate:      [ x' ; y' ] = [ cos θ  −sin θ ; sin θ  cos θ ] [ x ; y ]
Translate:   [ x' ; y' ] = [ 1  0  t_x ; 0  1  t_y ] [ x ; y ; 1 ]
Affine:      [ x' ; y' ] = [ a  b  c ; d  e  f ] [ x ; y ; 1 ]

Affine is any combination of translation, scale, rotation, shear
Affine transformation
Affine transformations are combinations of
• Linear transformations, and
• Translations

[ x' ; y' ] = [ a  b  c ; d  e  f ] [ x ; y ; 1 ]
or
[ x' ; y' ; 1 ] = [ a  b  c ; d  e  f ; 0  0  1 ] [ x ; y ; 1 ]

Properties of affine transformations:
• Lines map to lines
• Parallel lines remain parallel
• Ratios are preserved
• Closed under composition
Projective transformations
Projective transformations are combinations of
• Affine transformations, and
• Projective warps

[ x' ; y' ; w' ] = [ a  b  c ; d  e  f ; g  h  i ] [ x ; y ; w ]

Properties of projective transformations:
• Lines map to lines
• Parallel lines do not necessarily remain parallel
• Ratios are not preserved
• Closed under composition
• Models change of basis
• Projective matrix is defined up to a scale (8 DOF)
2D image transformations
Example: solving for translation
A1
A2
B1
A3
B2
B3
Given matched points in {A} and {B}, estimate the translation of the object
[ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
Example: solving for translation
A1
A2
A3
(tx, ty)
B1
B2
Least squares solution
1. Write down objective function
2. Derived solution
   a) Compute derivative
   b) Compute solution
3. Computational solution
   a) Write in form Ax = b
   b) Solve using pseudo-inverse or eigenvalue decomposition
[ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
B3
1
0



1
0
0
 x1B
 B
1
 t x   y1
   = 

 ty
0    x nB
 y nB
1
− x1A 

− y1A 
 

− x nA 
− y nA 
Example: solving for translation
A1
A5
A2
A3
(tx, ty)
A4
B4
B1
B2
B5
B3
Problem: outliers
RANSAC solution
1. Sample a set of matching points (1 pair)
2. Solve for transformation parameters
3. Score parameters with number of inliers
4. Repeat steps 1-3 N times
[ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
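A sketch of these four steps for a translation-only model; the synthetic matches, iteration count, and inlier threshold are assumptions:

% Sketch: RANSAC for a pure translation between matched point sets A and B
A = rand(50, 2);  t_true = [2, 1];
B = [A + t_true; rand(20, 2)];                % 50 true matches ...
A = [A; rand(20, 2)];                         % ... plus 20 outlier "matches"
nIter = 200;  thresh = 0.05;  best = 0;  t_best = [0 0];
for it = 1:nIter
  k = randi(size(A, 1));                      % one match is enough to fit a translation
  t = B(k, :) - A(k, :);
  resid   = sqrt(sum((B - (A + t)).^2, 2));   % residual of every match under this translation
  inliers = resid < thresh;
  if nnz(inliers) > best
    best   = nnz(inliers);
    t_best = mean(B(inliers, :) - A(inliers, :), 1);   % refit on the inliers
  end
end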
Example: solving for translation
B4
B5 B6
A1
A2
(tx, ty)
A3
A4
A5 A6
B1
B2
B3
Problem: outliers, multiple objects, and/or many-to-one matches
Hough transform solution
1. Initialize a grid of parameter values
2. Each matched pair casts a vote for consistent
values
3. Find the parameters with the most votes
4. Solve using least squares with inliers
[ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
Example: solving for translation
(tx, ty)
Problem: no initial guesses for correspondence
[ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
When no prior matched pairs exist
• Hough transform and RANSAC not applicable
• Important applications
Medical imaging: match
brain scans or contours
Robotics: match point clouds
Iterative Closest Point (ICP)
Goal: estimate transform between two dense
sets of points
1. Initialize transformation (e.g., compute difference in means
and scale)
2. Assign each point in {Set 1} to its nearest neighbor in {Set 2}
3. Estimate transformation parameters
–
e.g., least squares or robust least squares
4. Transform the points in {Set 1} using estimated parameters
5. Repeat steps 2-4 until change is very small
https://www.youtube.com/watch?v=m64E47uvPYc
https://www.youtube.com/watch?v=uzOCS_gdZuM
Example: aligning boundaries
p
q
Example: solving for translation
(tx, ty)
Problem: no initial guesses for correspondence
ICP solution
1. Find nearest neighbors for each point
2. Compute transform using matches:  [ x_iᴮ ; y_iᴮ ] = [ x_iᴬ ; y_iᴬ ] + [ t_x ; t_y ]
3. Move points using transform
4. Repeat steps 1-3 until convergence
Algorithm summary
• Least Squares Fit
  – closed form solution
  – robust to noise
  – not robust to outliers
• Robust Least Squares
  – improves robustness to noise
  – requires iterative optimization
• RANSAC
  – robust to noise and outliers
  – works with a moderate number of parameters (e.g., 1-8)
• Hough transform
  – robust to noise and outliers
  – can fit multiple models
  – only works for a few parameters (1-4 typically)
• Iterative Closest Point (ICP)
  – For local alignment only: does not require initial correspondences
Object instance recognition
A1
1. Match keypoints to
object model
2. Solve for affine
transformation
parameters
3. Score by inliers and
choose solutions with
score above threshold
A2
A3
Matched
keypoints
Affine
Parameters
# Inliers
Choose hypothesis with max
score above threshold
Keypoint matching
A1
A2
A3
fA
fB
d(f_A, f_B) < T
1. Find a set of
distinctive keypoints
2. Define a region
around each
keypoint
3. Extract and
normalize the
region content
4. Compute a local
descriptor from the
normalized region
5. Match local
descriptors
Finding the objects
Input Image / Stored Image
1. Match interest points from input image to database image
2. Matched points vote for rough position/orientation/scale of object
3. Find position/orientation/scales that have at least three votes
4. Compute affine registration and matches using iterative least squares with outlier check
5. Report object if there are at least T matched points
Object recognition using SIFT descriptors
1. Match interest points from input image to database image
2. Get location/scale/orientation using Hough voting
   – In training, each point has known position/scale/orientation wrt the whole object
   – Matched points vote for the position, scale, and orientation of the entire object
   – Bins for x, y, scale, orientation
     • Wide bins (0.25 object length in position, 2x scale, 30 degrees orientation)
     • Vote for two closest bin centers in each direction (16 votes total)
3. Geometric verification
   – For each bin with at least 3 keypoints
   – Iterate between least squares fit and checking for inliers and outliers
4. Report object if > T inliers (T is typically 3, can be computed to match some probabilistic threshold)
Examples of recognized objects
CSE 185
Introduction to Computer Vision
Stereo
Stereo
• Multi-view analysis
• Stereopsis
• Finding correspondence
• Depth
• Reading: Chapter 12 and 11
Multiple views
• Taken at the same time
or sequential in time
• stereo vision
• structure from motion
• optical flow
Why multiple views?
• Structure and depth are inherently ambiguous from single
views
Why multiple views?
• Structure and depth are inherently ambiguous from single
views
P1
P2
P1’=P2’
Optical center
Shape from X
• What cues help us to perceive 3d shape and depth?
• Many factors
– Shading
– Motion
– Occlusion
– Focus
– Texture
– Shadow
– Specularity
– …
Shading
For static objects and known light source, estimate surface normals
under different lighting conditions
Focus/defocus
Images from
same point of
view, different
camera
parameters
far focused
near focused
3d shape / depth
estimates
estimated depth map
Texture
estimate surface
orientation
Perspective effects
perspective, relative size, occlusion, texture gradients contribute to
3D appearance of the scene
Motion
Motion parallax: as the viewpoint moves side to side, the
objects at distance appear to move slower than the ones
close to the camera
Motion Parallax
Occlusion and focus
Estimating scene shape
• Shape from X: Shading, texture, focus, motion…
• Stereo:
– shape from motion between two views
– infer 3d shape of scene from two (multiple) images
from different viewpoints
Main idea:
scene point
image plane
optical center
Outline
• Human stereopsis
• Stereograms
• Epipolar geometry and the epipolar constraint
– Case example with parallel optical axes
– General case with calibrated cameras
Human eye
Rough analogy with human visual system:
Pupil/Iris – control
amount of light
passing through lens
Retina - contains
sensor cells, where
image is formed
Fovea – highest
concentration of
cones
Human stereopsis: disparity
Human eyes fixate on point in space – rotate so that corresponding images form in
centers of fovea.
See https://en.wikipedia.org/wiki/Binocular_disparity
Human stereopsis: disparity
Disparity occurs when
eyes fixate on one object;
others appear at different
visual angles
Human stereopsis: disparity
Disparity:
d = r-l = D-F.
Random dot stereograms
• Bela Julesz 1960: Do we identify local brightness
patterns before fusion (monocular process) or after
(binocular)?
• To test: pair of synthetic images obtained by
randomly spraying black dots on white objects
Random dot stereograms
Random dot stereograms
1. Create an image of suitable size. Fill it with random dots. Duplicate the image.
2. Select a region in one image.
See also this video
3. Shift this region horizontally by a small amount. The stereogram is complete.
(focus on a point behind the image by a small amount until the two images "snap" together)
Random dot stereograms
• When viewed monocularly, they appear random;
when viewed stereoscopically, see 3d structure.
• Conclusion: human binocular fusion not directly
associated with the physical retinas; must involve
the central nervous system
• Imaginary “cyclopean retina” that combines the left
and right image stimuli as a single unit
• High level scene understanding not required for
stereo
Stereo photograph and viewer
Take two pictures of the same subject from two slightly
different viewpoints and display so that each eye sees
only one of the images.
Invented by Sir Charles Wheatstone, 1838
Image from fisher-price.com
3D effect
http://www.johnsonshawmuseum.org
https://en.wikipedia.org/wiki/Anaglyph_3D
3D effect
http://www.johnsonshawmuseum.org
https://en.wikipedia.org/wiki/Anaglyph_3D
Public Library, Stereoscopic Looking Room, Chicago, by Phillips, 1923
Stereo images
present stereo images on the screen by simply putting the right and left images in an animated gif
http://www.well.com/~jimg/stereo/stereo_list.html
Autostereograms
Exploit disparity as depth
cue using single image.
(Single image random dot
stereogram, Single image
stereogram)
Tips: first move your nose
to the screen and look
through the screen, then
move your head away
slowly and you will see 3D
See https://en.wikipedia.org/wiki/Autostereogram
Autosteoreogram
Designed to create the
visual illusion of a 3D
scene
Using smooth
gradients
Estimating depth with stereo
• Stereo: shape from motion between two views
• Need to consider:
– Info on camera pose (“calibration”)
– Image point correspondences
scene point
image plane
optical
center
Stereo vision
Two cameras, simultaneous
views
Single moving camera and
static scene
Camera parameters
Camera
frame 2
Extrinsic parameters:
Camera frame 1 → Camera frame 2
Camera
frame 1
Intrinsic parameters:
Image coordinates relative to
camera → Pixel coordinates
• Extrinsic params: rotation matrix and translation vector
• Intrinsic params: focal length, pixel sizes (mm), image center
point, radial distortion parameters
We’ll assume for now that these parameters are given and fixed.
Outline
• Human stereopsis
• Stereograms
• Epipolar geometry and the epipolar constraint
– Case example with parallel optical axes
– General case with calibrated cameras
Geometry for a stereo system
• First, assuming parallel optical axes, known
camera parameters (i.e., calibrated cameras):
P
World
point
point in 3D world
Depth of p
image point
(left)
image point
image point
(right)
image point
(left)
depth of P
(right)
Focal
focal
length
length
optical
center
(left)
optical
center
optical
center
(right)
(right)
optical center
(left)
baseline
baseline
Geometry for a stereo system
• Assume parallel optical axes, known camera parameters (i.e.,
calibrated cameras). What is expression for Z?
P
Similar triangles (p_l, P, p_r) and (O_l, P, O_r):

( T + x_l − x_r ) / ( Z − f ) = T / Z

⇒   Z = f · T / ( x_r − x_l )

disparity: x_r − x_l
Depth from disparity
image I(x,y)
Disparity map D(x,y)
image I´(x´,y´)
(x´,y´)=(x+D(x,y), y)
So if we could find the corresponding points in two images,
we could estimate relative depth…
CSE 185
Introduction to Computer Vision
Stereo 2
Stereo
• Epipolar geometry
• Correspondence
• Structure from motion
• Reading: Chapter 11
Depth from disparity
X
(X – X’) / f = baseline / z
z
x
f
f
C
X – X’ = (baseline*f) / z
x’
baseline
C’
z = (baseline*f) / (X – X’)
d=X-X’ (disparity)
z is inversely proportional to d
Outline
• Human stereopsis
• Stereograms
• Epipolar geometry and the epipolar constraint
– Case example with parallel optical axes
– General case with calibrated cameras
General case with calibrated camera
• The two cameras need not have parallel optical axes.
vs.
Stereo correspondence constraint
• Given p in left image, where can corresponding point
p’ be?
Stereo correspondence constraints
Epipolar constraint
• Geometry of two views constrains where the corresponding pixel for
some image point in the first view must occur in the second view
• It must be on the line carved out by a plane connecting the world point
and optical centers
Epipolar geometry
• Epipolar Plane
Epipole
Baseline
Epipolar Line
Epipole
http://www.ai.sri.com/~luong/research/Meta3DViewer/EpipolarGeo.html
Epipolar geometry: terms
• Baseline: line joining the camera centers OO′
• Epipole: point of intersection of the baseline with the image plane, e
• Epipolar plane: plane containing the baseline and the world point
• Epipolar line: intersection of the epipolar plane with the image plane, ep and e′p′
• All epipolar lines intersect at the epipole
• An epipolar plane intersects the left and right image planes in epipolar lines
Why is the epipolar constraint useful?
Epipolar constraint
This is useful because it reduces the correspondence
problem to a 1D search along an epipolar line
Example
What do the epipolar lines look
like?
1.
Ol
Or
2.
Ol
Or
Example: Converging camera
Example: Parallel camera
Where are the
epipoles?
Example: Forward motion
What would the epipolar lines look like if the
camera moves directly forward?
Example: Forward motion
e’
e
Epipole has same coordinates in both images.
Points move along lines radiating from e: “Focus
of expansion”
Recall
• Cross product
• Dot product in matrix notation
M=[R | t]
Recall
• A line 𝑙: 𝑎𝑥 + 𝑏𝑦 + 𝑐 = 0
• In vector form:   l = [ a ; b ; c ],   a point p = [ x ; y ; 1 ],   pᵀ l = 0
Epipolar Constraint: Calibrated Case
R is a 3 × 3 rotation matrix, t is a 3 × 1 translation vector (R, t); see pp. 347-348 of Szeliski

p_l = R p_r + t

p · [ t × (R p′ + t) ] = 0   with   p = (u, v, 1)ᵀ,   p′ = (u′, v′, 1)ᵀ
(p′ expressed in the left camera frame: R p′ + t)

p · [ t × R p′ ] = 0,   i.e.   pᵀ [ t × R p′ ] = 0

Essential Matrix (Longuet-Higgins, 1981):

t × R p′ = [t]_× R p′ = ε p′
pᵀ ε p′ = 0   with   ε = [t]_× R     ([t]_× is a 3 × 3 skew-symmetric matrix; rank 2)
pᵀ l = 0, where the epipolar line l is given by ε p′
Epipolar Constraint: Calibrated Case
t = [ t_x , t_y , t_z ]ᵀ,    R = [ R11  R12  R13 ; R21  R22  R23 ; R31  R32  R33 ]

E = [  0    −t_z    t_y ]   [ R11  R12  R13 ]
    [  t_z    0    −t_x ]   [ R21  R22  R23 ]
    [ −t_y   t_x     0  ]   [ R31  R32  R33 ]
• Note that Ep’ can be interpreted as the coordinate
vector representing the epipolar line associated with
the point p’ in the first image
• A line l can be defined by its equation au+bv+c=0
where (u,v) denote the coordinate of a point on the
line, (a, b) is the unit normal to the line, and –c is the
offset
• Can normalize that with a2+b2=1 to have a unique
answer
https://www.youtube.com/watch?v=6kpBqfgSPRc
http://www.rasmus.is/uk/t/F/Su58k05.htm
https://sites.math.washington.edu/~king/coursedir/m445w04/notes/vector/normals-planes.html
Epipolar Constraint: Uncalibrated Case
p̂, p̂′ are normalized image coordinates
Fundamental Matrix
(Faugeras and Luong, 1992)
Fundamental matrix
• Let p be a point in left image, p’ in right image
l
• Epipolar relation
l’
p
p’
– p maps to epipolar line l’
– p’ maps to epipolar line l
• Epipolar mapping described by a 3x3 matrix F
• It follows that
What is the physical meaning?
Fundamental matrix
• This matrix F is called
– Essential Matrix
• when image intrinsic parameters are known
– Fundamental Matrix
• more generally (uncalibrated case)
• Can solve F from point correspondences
– Each (p, p’) pair gives one linear equation in entries of F
– F has 9 entries, but really only 7 or 8 degrees of freedom.
– With 8 points it is simple to solve for F, but it is also possible with 7
points
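A linear eight-point sketch (the correspondences are random placeholders; in practice the coordinates are normalized first for numerical stability):

% Sketch: estimate F from N >= 8 correspondences p <-> p2 using p2' * F * p = 0
p  = rand(8, 2) * 100;  p2 = rand(8, 2) * 100;            % placeholder matched points
x = p(:,1);  y = p(:,2);  x2 = p2(:,1);  y2 = p2(:,2);
A = [x2.*x, x2.*y, x2, y2.*x, y2.*y, y2, x, y, ones(size(x))];  % one row per correspondence
[~, ~, V] = svd(A);
F = reshape(V(:, end), 3, 3)';       % null vector of A, reshaped row-wise into F
[U, S, V2] = svd(F);
S(3, 3) = 0;                         % enforce rank 2
F = U * S * V2';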
Stereo image rectification
Stereo image rectification
• Reproject image planes onto
a common plane parallel to
the line between camera
centers
• Pixel motion is horizontal
after this transformation
• Two homographies (3x3
transform), one for each
input image reprojection
Rectification example
Correspondence problem
• Epipolar geometry constrains our search, but
we still have a difficult correspondence
problem
Basic stereo matching algorithm
• If necessary, rectify the two stereo images to transform epipolar
lines into scanlines
• For each pixel x in the first image
– Find corresponding epipolar scanline in the right image
– Examine all pixels on the scanline and pick the best match x’
– Compute disparity x-x’ and set depth(x) = fB/(x-x’)
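The scanline search above can be sketched as a simple SSD block-matching loop; the window size, disparity range, and array names below are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

def ssd_disparity(left, right, max_disp=64, win=5):
    """Minimal SSD block matching on rectified grayscale images (float arrays).
    Returns an integer disparity map; border pixels are left at 0."""
    h, w = left.shape
    half = win // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y-half:y+half+1, x-half:x+half+1]
            best_cost, best_d = np.inf, 0
            # Search along the same scanline in the right image
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y-half:y+half+1, x-d-half:x-d+half+1]
                cost = np.sum((ref - cand) ** 2)   # SSD matching cost
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

# Depth then follows from disparity: depth = f * B / disparity (for disparity > 0),
# with focal length f and baseline B.
```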
Correspondence search
(Figure: a reference window in the left image is matched along the corresponding scanline in the right image)
Matching cost
• Slide a window along the right scanline and
compare contents of that window with the
reference window in the left image
• Matching cost: SSD or normalized correlation
Correspondence search
(Figure: SSD cost along the right-image scanline for the reference window)
Correspondence search
(Figure: normalized correlation along the right-image scanline for the reference window)
Effect of window size
• Smaller window
+ More detail
– More noise
• Larger window
+ Smoother disparity maps
– Less detail
W=3
W = 20
Failure cases
Textureless surfaces
Occlusions, repetition
Non-Lambertian surfaces, specularities
Results with window search
(Figure: input data, window-based matching result, and ground-truth disparity)
Beyond window-based matching
• So far, matches are independent for each
point
• What constraints or priors can we add?
Stereo constraints/priors
• Uniqueness
  – For any point in one image, there should be at most one matching point in the other image
• Ordering
  – Corresponding points should be in the same order in both views
  – (Figure: a configuration where the ordering constraint doesn't hold)
Priors and constraints
• Uniqueness
– For any point in one image, there should be at most one
matching point in the other image
• Ordering
– Corresponding points should be in the same order in both
views
• Smoothness
– We expect disparity values to change slowly (for the most
part)
Scanline stereo
• Try to coherently match pixels on the entire scanline
• Different scanlines are still optimized independently
(Figure: "shortest paths" formulation for scanline stereo — a path through a grid of left/right scanline positions, with matched states and left/right occlusion states, scored by the correlation cost)
• Can be implemented with dynamic programming
Coherent stereo on 2D grid
• Scanline stereo generates streaking artifacts
• Can’t use dynamic programming to find spatially coherent
disparities/ correspondences on a 2D grid
Stereo matching as energy minimization

E(D) = Σ_i ( W1(i) − W2(i + D(i)) )²  +  λ Σ_{neighbors i,j} ρ( D(i) − D(j) )

• The first term is the data term: W1(i) and W2(i + D(i)) are corresponding windows in images I1 and I2 under disparity D(i)
• The second term is the smoothness term over neighboring pixels
• Random field interpretation
• Energy functions of this form can be minimized using graph cuts
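A small sketch of evaluating this energy for a candidate disparity map; the truncated penalty ρ and the 4-connected neighborhood here are illustrative choices, not necessarily those used in the referenced graph-cut work.

```python
import numpy as np

def stereo_energy(left, right, D, lam=10.0, trunc=2.0):
    """left, right: rectified grayscale images; D: integer disparity map.
    Data term: squared intensity difference under disparity D.
    Smoothness term: truncated absolute difference of neighboring disparities."""
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = np.clip(xs - D, 0, w - 1)            # pixel (y, x) matches (y, x - D) in the right image
    data = np.sum((left - right[ys, xr]) ** 2)
    # 4-connected smoothness: horizontal and vertical neighbor pairs
    smooth = np.sum(np.minimum(np.abs(D[:, 1:] - D[:, :-1]), trunc)) \
           + np.sum(np.minimum(np.abs(D[1:, :] - D[:-1, :]), trunc))
    return data + lam * smooth
```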
Graph cut
Before
Graph cuts
Many of these constraints can be encoded in
an energy function and solved using graph cuts
Ground truth
Y. Boykov, O. Veksler, and R. Zabih, Fast Approximate Energy Minimization via Graph Cuts, PAMI 2001
For the latest and greatest: http://www.middlebury.edu/stereo/
Active stereo with structured light
• Project “structured” light patterns onto the object
– Simplifies the correspondence problem
– Allows us to use only one camera
camera
projector
L. Zhang, B. Curless, and S. M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass
Dynamic Programming. 3DPVT 2002
Kinect: Structured infrared light
http://bbzippo.wordpress.com/2010/11/28/kinect-in-infrared/
Summary: Epipolar constraint
(Figure: a 3D point X projects to x in the first view; as X moves along the ray through x, its projection x′ moves along the epipolar line l′ in the second view)
Potential matches for x have to lie on the corresponding line l’.
Potential matches for x’ have to lie on the corresponding line l.
Summary
• Epipolar geometry
– Epipoles are intersection of baseline with image planes
– Matching point in second image is on a line passing
through its epipole
– Fundamental matrix maps from a point in one image to a
line (its epipolar line) in the other
– Can solve for F given corresponding points (e.g., interest
points)
• Stereo depth estimation
– Estimate disparity by finding corresponding points along
scanlines
– Depth is inversely proportional to disparity (depth = fB / disparity)
Structure from motion
• Given a set of corresponding points in two or more images,
compute the camera parameters and the 3D point coordinates
(Figure: three cameras with unknown poses R1,t1, R2,t2, R3,t3 observing unknown 3D points)
Structure from motion ambiguity
• If we scale the entire scene by some factor k
and, at the same time, scale the camera
matrices by the factor of 1/k, the projections
of the scene points in the image remain
exactly the same:
1 
x = PX =  P (k X)
k 
It is impossible to recover the absolute scale of the scene!
Structure from motion ambiguity
• If we scale the entire scene by some factor k and, at
the same time, scale the camera matrices by the
factor of 1/k, the projections of the scene points in
the image remain exactly the same
x = PX = ( P Q⁻¹ ) ( Q X )
• More generally: if we transform the scene using a
transformation Q and apply the inverse
transformation to the camera matrices, then the
images do not change
Projective structure from motion
• Given: m images of n fixed 3D points
• xij = Pi Xj , i = 1,… , m, j = 1, … , n
• Problem: estimate m projection matrices Pi and n 3D points
Xj from the mn corresponding points xij
(Figure: a 3D point Xj projects to x1j, x2j, x3j in cameras P1, P2, P3)
• With no calibration info, cameras and points can
only be recovered up to a 4x4 projective
transformation Q:
• X → QX, P → PQ-1
• We can solve for structure and motion when
• 2mn >= 11m +3n – 15
• For two cameras, at least 7 points are needed
Projective ambiguity

Q_P = [ A    t ]
      [ vᵀ   v ]

x = PX = ( P Q_P⁻¹ ) ( Q_P X )
Projective ambiguity
Bundle adjustment
• Non-linear method for refining structure and motion
• Minimizing reprojection error
E(P, X) = Σ_{i=1..m} Σ_{j=1..n} D( x_ij , P_i X_j )²

(Figure: the reprojection error compares each observed point x_ij with the projection P_i X_j of the estimated 3D point into the estimated camera)
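A minimal sketch of the reprojection error being minimized (homogeneous projection with hypothetical 3×4 camera matrices; a real bundle adjuster would hand this residual to a non-linear least-squares solver):

```python
import numpy as np

def reprojection_error(cameras, points3d, observations):
    """cameras: list of 3x4 projection matrices P_i
    points3d: (n, 3) array of 3D points X_j
    observations: dict mapping (i, j) -> observed 2D point x_ij
    Returns the total squared reprojection error E(P, X)."""
    err = 0.0
    X_h = np.hstack([points3d, np.ones((points3d.shape[0], 1))])  # homogeneous 3D points
    for (i, j), x_obs in observations.items():
        proj = cameras[i] @ X_h[j]          # project X_j with camera P_i
        proj = proj[:2] / proj[2]           # convert to inhomogeneous image coordinates
        err += np.sum((proj - x_obs) ** 2)  # squared distance D(x_ij, P_i X_j)^2
    return err
```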
Photosynth
Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo tourism: Exploring
photo collections in 3D," SIGGRAPH 2006
https://www.youtube.com/watch?v=p16frKJLVi0
http://photosynth.net/
CSE 185
Introduction to Computer Vision
Feature Tracking and Optical Flow
Motion estimation
• Feature-tracking
– Extract visual features (corners, textured areas) and track them over
multiple frames
• Optical flow
– Recover image motion at each pixel from spatio-temporal image
brightness variations (optical flow)
Two problems, one registration method
• Reading: Chapter 4.1.4 and 8
Feature tracking
• Many problems, e.g., structure from motion,
require matching points
• If motion is small, tracking is an easy way to
measure pixel movements
Feature tracking
• Challenges
– Figure out which features can be tracked
– Efficiently track across frames
– Some points may change appearance over time
(e.g., due to rotation, moving into shadows, etc.)
– Drift: small errors can accumulate as appearance
model is updated
– Points may appear or disappear; need to be able
to add/delete tracked points
Feature tracking
I(x, y, t)
I(x, y, t+1)
• Given two subsequent frames, estimate the point
translation
• Key assumptions of Lucas-Kanade Tracker
• Brightness constancy: projection of the same point looks the
same in every frame
• Small motion: points do not move very far
• Spatial coherence: points move like their neighbors
Brightness constancy
I(x,y,t)
I(x,y,t+1)
• Brightness Constancy Equation:
  I(x, y, t) = I(x + u, y + v, t + 1)
Take the Taylor expansion of I(x + u, y + v, t + 1) at (x, y, t) to linearize the right side:
  I(x + u, y + v, t + 1) ≈ I(x, y, t) + Ix·u + Iy·v + It
where Ix, Iy are the image derivatives along x and y, and It is the difference over frames. Hence
  I(x + u, y + v, t + 1) − I(x, y, t) ≈ Ix·u + Iy·v + It
  Ix·u + Iy·v + It ≈ 0   →   ∇I · [u v]ᵀ + It = 0
Detailed derivation
Divide the total change by Δt and drop the higher-order terms (H.O.T.):
  (∂I/∂x)(Δx/Δt) + (∂I/∂y)(Δy/Δt) + (∂I/∂t)(Δt/Δt) = 0
  Ix·u + Iy·v + It = 0
  ∇I · [u v]ᵀ + It = 0,  with  ∇I = [Ix, Iy]ᵀ
How does this make sense?
∇I · [u v]ᵀ + It = 0
• What do the static image gradients have to do
with motion estimation?
• Relate both spatial and temporal information
– Spatial information: image gradient
– Temporal information: temporal gradient
Computing gradients in X-Y-T
Forward differences on the 2×2×2 cube of samples around (i, j, t):
  at I_{i,j,t}:     I_{i+1,j,t} − I_{i,j,t}
  at I_{i,j,t+1}:   I_{i+1,j,t+1} − I_{i,j,t+1}
  at I_{i,j+1,t}:   I_{i+1,j+1,t} − I_{i,j+1,t}
  at I_{i,j+1,t+1}: I_{i+1,j+1,t+1} − I_{i,j+1,t+1}
Averaging the four differences gives

  Ix = (1 / 4Δx) [ (I_{i+1,j,t} + I_{i+1,j,t+1} + I_{i+1,j+1,t} + I_{i+1,j+1,t+1})
                  − (I_{i,j,t} + I_{i,j,t+1} + I_{i,j+1,t} + I_{i,j+1,t+1}) ]

and likewise for Iy and It.
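A small NumPy sketch of these averaged forward differences for two consecutive frames (array names are illustrative; Δx = Δy = Δt = 1 is assumed):

```python
import numpy as np

def xyt_gradients(I0, I1):
    """Averaged forward differences over the 2x2x2 cube of two frames I0 (t) and I1 (t+1),
    with images indexed as I[y, x]. Returns Ix, Iy, It, each of shape (H-1, W-1)."""
    Ix = 0.25 * ((I0[:-1, 1:] + I1[:-1, 1:] + I0[1:, 1:] + I1[1:, 1:])
               - (I0[:-1, :-1] + I1[:-1, :-1] + I0[1:, :-1] + I1[1:, :-1]))
    Iy = 0.25 * ((I0[1:, :-1] + I1[1:, :-1] + I0[1:, 1:] + I1[1:, 1:])
               - (I0[:-1, :-1] + I1[:-1, :-1] + I0[:-1, 1:] + I1[:-1, 1:]))
    It = 0.25 * ((I1[:-1, :-1] + I1[:-1, 1:] + I1[1:, :-1] + I1[1:, 1:])
               - (I0[:-1, :-1] + I0[:-1, 1:] + I0[1:, :-1] + I0[1:, 1:]))
    return Ix, Iy, It
```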
Brightness constancy
Can we use this equation to recover image motion (u,v) at
each pixel?
  ∇I · [u v]ᵀ + It = 0
• How many equations and unknowns per pixel?
  • One equation (this is a scalar equation!), two unknowns (u, v)
The component of the motion perpendicular to the gradient (i.e., parallel to the edge) cannot be measured:
if (u, v) satisfies the equation and ∇I · [u′ v′]ᵀ = 0, then (u + u′, v + v′) also satisfies it, since
  ∇I · [(u + u′) (v + v′)]ᵀ + It = 0
Multiple solutions (explanations) for the same observed motion.
(Figure: the gradient direction, the edge direction, and the family of solutions (u + u′, v + v′))
The aperture problem
Actual motion
The aperture problem
Perceived motion
The aperture problem
http://en.wikipedia.org/wiki/Motion_perception#The_aperture_problem
The grating appears to be moving down and to the right, perpendicular to the
orientation of the bars. But it could be moving in many other directions, such as
only down, or only to the right. It is not possible to determine unless the ends of
the bars become visible in the aperture
The barber pole illusion
http://en.wikipedia.org/wiki/Barberpole_illusion
This visual illusion occurs when a diagonally striped pole is rotated around its vertical axis (horizontally): it appears as though the stripes are moving in the direction of the vertical axis (downwards in the case of the animation) rather than around it.
The barber pole turns in
place on its vertical axis,
but the stripes appear to
move upwards rather than
turning with the pole
Addressing the ambiguity issue
B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In
Proceedings of the International Joint Conference on Artificial Intelligence, pp. 674–679, 1981.
∇I · [u v]ᵀ + It = 0
• How to get more equations for a pixel?
• Spatial coherence constraint
• Assume the pixel’s neighbors have the same (u,v)
– If we use a 5x5 window, that gives us 25 equations per pixel
Solving the ambiguity
• Least squares problem:
Matching patches
• Overconstrained linear system
Least squares solution for d given by
The summations are over all pixels in the K x K window
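A minimal sketch of the resulting least-squares solve for one K×K window (gradient arrays as computed earlier; names and thresholds are illustrative):

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve for the flow (u, v) of a single window from its gradient samples.
    Ix, Iy, It are arrays of spatial/temporal derivatives over the K x K window."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # two columns, one row per pixel
    b = -It.ravel()
    ATA = A.T @ A                                    # 2x2 second-moment matrix
    # Only solve if ATA is well-conditioned (both eigenvalues sufficiently large)
    eigvals = np.linalg.eigvalsh(ATA)
    if eigvals.min() < 1e-2:
        return None                                  # untrackable window (flat region or edge)
    u, v = np.linalg.solve(ATA, A.T @ b)
    return u, v
```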
Conditions for solvability
Optimal (u, v) satisfies Lucas-Kanade equation
When is this solvable? I.e., what are good points to
track?
• AᵀA should be invertible
• AᵀA should not be too small due to noise
  – eigenvalues λ₁ and λ₂ of AᵀA should not be too small
• AᵀA should be well-conditioned
  – λ₁/λ₂ should not be too large (λ₁ = larger eigenvalue)
Does this remind you of anything?
Criteria for Harris corner detector
Pseudo inverse
  A d = b,   d = A⁻¹ b   when A is a well-conditioned square matrix
For flow estimation, A is not a square matrix. Multiply both sides by Aᵀ, so that AᵀA is a square matrix:
  AᵀA d = Aᵀ b
  A⁺ = (AᵀA)⁻¹ Aᵀ,   A⁺A = (AᵀA)⁻¹ AᵀA = I
  d = (AᵀA)⁻¹ Aᵀ b   (pseudo inverse, left inverse; see Moore–Penrose inverse)
Need to make sure AᵀA is well-conditioned.
M = ATA is the second moment matrix !
(Harris corner detector…)
• Eigenvectors and eigenvalues of ATA relate to
edge direction and magnitude
• The eigenvector associated with the larger eigenvalue points
in the direction of fastest intensity change
• The other eigenvector is orthogonal to it
Low-texture region
  – gradients have small magnitude
  – small λ₁, small λ₂
Edge
  – gradients very large or very small
  – large λ₁, small λ₂
High-texture region
  – gradients are different, with large magnitudes
  – large λ₁, large λ₂
The aperture problem resolved
Actual motion
The aperture problem resolved
Perceived motion
Shi-Tomasi feature tracker
• Find good features using the eigenvalues of the second-moment matrix (e.g., Harris detector or a threshold on the smallest eigenvalue)
  – Key idea: "good" features to track are the ones whose motion can be estimated reliably
• Track from frame to frame with Lucas-Kanade
  – This amounts to assuming a translation model for frame-to-frame feature movement
• Check the consistency of tracks by affine registration to the first observed instance of the feature
  – The affine model is more accurate for larger displacements
  – Comparing to the first frame helps to minimize drift
J. Shi and C. Tomasi. Good Features to Track. CVPR 1994.
Tracking example
J. Shi and C. Tomasi. Good Features to Track. CVPR 1994.
Summary of KLT tracking
• Find a good point to track (Harris corner)
• Use intensity second moment matrix and
difference across frames to find displacement
• Iterate and use coarse-to-fine search to deal with
larger movements
• When creating long tracks, check appearance of
registered patch against appearance of initial
patch to find points that have drifted
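In practice this pipeline is often run with OpenCV; the sketch below uses cv2.goodFeaturesToTrack and the pyramidal Lucas-Kanade tracker cv2.calcOpticalFlowPyrLK (parameter values and frame variables are illustrative assumptions, not the course's prescribed settings).

```python
import cv2
import numpy as np

def klt_track(prev_gray, next_gray):
    """Detect corners in the first frame and track them into the second with pyramidal LK."""
    # Shi-Tomasi corners: threshold on the smaller eigenvalue of the second-moment matrix
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    # Pyramidal Lucas-Kanade: coarse-to-fine search handles larger motions
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                                winSize=(21, 21), maxLevel=3)
    good_old = pts[status.ravel() == 1]
    good_new = nxt[status.ravel() == 1]
    return good_old, good_new
```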
Implementation issues
• Window size
– Small window more sensitive to noise and may miss larger
motions (without pyramid)
– Large window more likely to cross an occlusion boundary
(and it’s slower)
– 15x15 to 31x31 seems typical
• Weighting the window
– Common to apply weights so that center matters more
(e.g., with Gaussian)
Optical flow
Vector field function of the
spatio-temporal image
brightness variations
Motion and perceptual
organization
• Gestalt factor that leads to grouping
• Sometimes, motion is the only cue
Motion and perceptual
organization
• Even “impoverished” motion data can evoke a
strong percept
G. Johansson, “Visual Perception of Biological Motion and a Model For Its
Analysis", Perception and Psychophysics 14, 201-211, 1973.
Uses of motion
• Estimating 3D structure
• Segmenting objects based on motion cues
• Learning and tracking dynamical models
• Recognizing events and activities
• Improving video quality (motion stabilization)
Motion field
• The motion field is the projection of the 3D
scene motion into the image
See Nayar’s lecture on optical flow and motion field
What would the motion field of a non-rotating ball moving towards the camera look like?
Optical flow
• Definition: optical flow is the apparent motion
of brightness patterns in the image
• Ideally, optical flow would be the same as the
motion field
• Have to be careful: apparent motion can be
caused by lighting changes without any actual
motion
– Think of a uniform rotating sphere under fixed
lighting vs. a stationary sphere under moving
illumination
Lucas-Kanade optical flow
• Same as Lucas-Kanade feature tracking, but
for each pixel
– As we saw, works better for textured pixels
• Operations can be done one frame at a time,
rather than pixel by pixel
– Efficient
Iterative refinement
• Iterative Lucas-Kanade Algorithm
1. Estimate displacement at each pixel by solving
Lucas-Kanade equations
2. Warp I(t) towards I(t+1) using the estimated flow field
- Basically, just interpolation
3. Repeat until convergence
Coarse-to-fine flow estimation
(Figure: build Gaussian pyramids of image 1 (t) and image 2 (t+1); run iterative L-K at the coarsest level, then repeatedly warp & upsample and run iterative L-K at each finer level)
(Figure: a motion of u = 10 pixels at full resolution becomes u = 5, 2.5, and 1.25 pixels at successively coarser pyramid levels)
Larger motion: Iterative refinement
1. Initialize (x′, y′) = (x, y) at the original position
2. Compute the displacement (u, v) by solving the Lucas-Kanade equation, using the second-moment matrix for the feature patch in the first image and It = I(x′, y′, t+1) − I(x, y, t)
3. Shift the window by (u, v): x′ = x′ + u; y′ = y′ + v
4. Recalculate It
5. Repeat steps 2–4 until the change is small
• Use interpolation for subpixel values
Multi-scale Lucas Kanade Algorithm
Example
Multi-resolution registration
Optical flow results
Optical flow results
Errors in Lucas-Kanade
• The motion is large
– Possible Fix: keypoint matching
• A point does not move like its neighbors
– Possible Fix: region-based matching
• Brightness constancy does not hold
– Possible Fix: gradient constancy
Genesis of Lucas Kanade Algorithm
More applications
• Frame interpolation
• Stabilization
• Surveillance
• Interactive games
• See Nayar's lecture video
Recent optical flow methods
Start with something similar to Lucas-Kanade
+ gradient constancy
+ energy minimization with smoothing term
+ region matching
+ keypoint matching (long-range)
Region-based +Pixel-based +Keypoint-based
Large displacement optical flow, Brox et al., CVPR 2009
Stereo vs. Optical Flow
• Similar dense matching procedures
• Why don’t we typically use epipolar
constraints for optical flow?
B. Lucas and T. Kanade. An iterative image registration technique with an application to
stereo vision. In Proceedings of the International Joint Conference on Artificial
Intelligence, pp. 674–679, 1981.
Summary
• Major contributions from Lucas, Tomasi,
Kanade
– Tracking feature points
– Optical flow
• Key ideas
– By assuming brightness constancy, truncated
Taylor expansion leads to simple and fast patch
matching across frames
– Coarse-to-fine registration
CSE 185
Introduction to Computer Vision
Pattern Recognition
Computer vision: related topics
Pattern recognition
• One of the leading vision conferences: IEEE
Conference on Computer Vision and Pattern
Recognition (CVPR)
• Pattern recognition and machine learning
• Goal: Making predictions or decisions from
data
Machine learning applications
Image categorization
(Figure: training pipeline — training images → image features → classifier training, supervised by training labels, producing a trained classifier)
Image categorization
(Figure: the same training pipeline, plus testing — a test image → image features → trained classifier → prediction, e.g., "Outdoor")
Pattern recognition pipeline
• Apply a prediction function to a feature representation of the
image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Machine learning framework
y = f(x)
where y is the output, f is the prediction function, and x is the image feature.
• Training: given a training set of labeled examples {(x1,y1), …,
(xN,yN)}, estimate the prediction function f by minimizing the
prediction error on the training set
• Testing: apply f to a never seen test example x and output the
predicted value y = f(x)
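A minimal scikit-learn-style sketch of this train/test procedure (the random feature arrays and the linear SVM choice are illustrative assumptions, not the course's prescribed classifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical data: 100 training images described by 64-dimensional feature vectors
X_train = np.random.rand(100, 64)           # image features x_1, ..., x_N
y_train = np.random.randint(0, 2, 100)      # labels y_1, ..., y_N (e.g., 0 = indoor, 1 = outdoor)

f = LinearSVC()                 # prediction function f, estimated by minimizing training error
f.fit(X_train, y_train)         # training

X_test = np.random.rand(5, 64)  # never-seen test examples
y_pred = f.predict(X_test)      # testing: y = f(x)
```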
Example: Scene categorization
• Is this a kitchen?
Image features
(Figure: the same training pipeline, highlighting the image-feature stage)
Image representations
• Coverage
– Ensure that all relevant info is captured
• Concision
– Minimize number of features without sacrificing
coverage
• Directness
– Ideal features are independently useful for prediction
Image representations
• Templates
– Intensity, gradients, etc.
• Histograms
– Color, texture, SIFT descriptors, etc.
• Features
– PCA, local features, corners, etc.
Classifiers
(Figure: the same training pipeline, highlighting the classifier-training stage)
Learning a classifier
Given some set of features with
corresponding labels, learn a function to
predict the labels from the features
(Figure: a 2D feature space (x1, x2) with points labeled "x" and "o", and a decision boundary separating them)
Many classifiers to choose from
• Support vector machine
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized forests
• Boosted decision trees
• Adaboost classifier
• K-nearest neighbor
• Restricted Boltzmann machine
• Etc.
Which is the best one?
One way to think about it…
• Training labels dictate that two examples are
the same or different, in some sense
• Features and distance measures define visual
similarity
• Classifiers try to learn weights or parameters
for features and distance measures so that
visual similarity predicts label similarity
Machine learning
Topics
Dimensionality reduction
• Principal component analysis (PCA) is the most important technique to know. It takes advantage of correlations in data dimensions to produce the best possible lower-dimensional representation, according to reconstruction error.
• PCA should be used for dimensionality reduction, not for discovering patterns or making predictions. Don't try to assign semantic meaning to the bases.
• Other techniques: independent component analysis (ICA), locally linear embedding (LLE), isometric mapping (Isomap), …
Clustering example: Segmentation
Goal: Break up the image into meaningful or
perceptually similar regions
Segmentation for feature support
50x50 Patch
50x50 Patch
Segmentation for efficiency
[Felzenszwalb and Huttenlocher 2004]
[Hoiem et al. 2005, Mori 2005]
[Shi and Malik 2001]
Segmentation as a result
Types of segmentations
Oversegmentation
Undersegmentation
Multiple Segmentations
Segmentation approaches
• Bottom-up: group tokens with similar features
• Top-down: group tokens that likely belong to the
same object
[Levin and Weiss 2006]
Clustering
• Clustering: group together similar points and
represent them with a single token
• Key Challenges:
– What makes two points/images/patches similar?
– How do we compute an overall grouping from
pairwise similarities?
Slide: Derek Hoiem
Why do we cluster?
• Summarizing data
– Look at large amounts of data
– Patch-based compression or denoising
– Represent a large continuous vector with the cluster number
• Counting
– Histograms of texture, color, SIFT vectors
• Segmentation
– Separate the image into different regions
• Prediction
– Images in the same cluster may have the same labels
How do we cluster?
• K-means
– Iteratively re-assign points to the nearest cluster
center
• Agglomerative clustering
– Start with each point as its own cluster and iteratively
merge the closest clusters
• Mean-shift clustering
– Estimate modes of pdf
• Spectral clustering
– Split the nodes in a graph based on assigned links with
similarity weights
Clustering for Summarization
Goal: cluster to minimize variance in data
given clusters
– Preserve information
  c*, δ* = argmin_{c, δ} (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_ij ( c_i − x_j )²

where the c_i are the cluster centers, the x_j are the data points, and δ_ij indicates whether x_j is assigned to c_i.
K-means algorithm
1. Randomly
select K centers
2. Assign each
point to nearest
center
3. Compute new
center (mean)
for each cluster
Illustration: http://en.wikipedia.org/wiki/K-means_clustering
K-means algorithm
1. Randomly
select K centers
2. Assign each
point to nearest
center
Back to 2
3. Compute new
center (mean)
for each cluster
Illustration: http://en.wikipedia.org/wiki/K-means_clustering
K-means
1. Initialize cluster centers c⁰; set t = 0
2. Assign each point to the closest center:
   δᵗ = argmin_δ (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_ij ( c_i^{t−1} − x_j )²
3. Update each cluster center as the mean of its assigned points:
   cᵗ = argmin_c (1/N) Σ_{j=1}^{N} Σ_{i=1}^{K} δ_ijᵗ ( c_i − x_j )²
4. Repeat steps 2–3 (t = t + 1) until no points are re-assigned
Slide: Derek Hoiem
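A compact NumPy sketch of these steps (random initialization and Euclidean distance are assumed, as in the design choices discussed below):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Basic K-means on an (N, d) data array. Returns centers (K, d) and labels (N,)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)  # 1. random initial centers
    labels = np.full(len(X), -1)
    for it in range(n_iter):
        # 2. assign each point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # 4. no re-assignments: converged
        labels = new_labels
        # 3. recompute each center as the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```

Because the procedure only converges to a local minimum, it is common to run it with several random restarts and keep the best result.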
Issues with K-means
converges to a local minimum
different results from multiple runs
More examples
• https://youtu.be/_aWzGGNrcic?t=262
• https://www.youtube.com/watch?v=BVFG7fd
1H30
K-means: design choices
• Initialization
– Randomly select K points as initial cluster center
– Or greedily choose K points to minimize residual
• Distance measures
– Traditionally Euclidean, could be others
• Optimization
– Will converge to a local minimum
– May want to perform multiple restarts
K-means using intensity or color
Image
Clusters on intensity
Clusters on color
Number of clusters?
• Minimum Description Length (MDL) principle for model comparison
• Minimize the Schwarz criterion
  – also called the Bayesian Information Criterion (BIC)
How to evaluate clusters?
• Generative
– How well are points reconstructed from the
clusters?
• Discriminative
– How well do the clusters correspond to labels?
• Purity
– Note: unsupervised clustering does not aim to be
discriminative
Slide: Derek Hoiem
Number of clusters?
• Validation set
– Try different numbers of clusters and look at
performance
• When building dictionaries (discussed later), more
clusters typically work better
Slide: Derek Hoiem
K-Means pros and cons
Pros
• Finds cluster centers that minimize conditional variance (good representation of data)
• Simple and fast*
• Easy to implement
Cons
• Need to choose K
• Sensitive to outliers
• Prone to local minima
• All clusters have the same parameters (e.g., the distance measure is non-adaptive)
• *Can be slow: each iteration is O(KNd) for N d-dimensional points
Usage
• Rarely used for pixel segmentation
Building Visual Dictionaries
1. Sample patches from a database
   – e.g., 128-dimensional SIFT vectors
2. Cluster the patches
   – the cluster centers are the dictionary
3. Assign a codeword (number) to each new patch, according to the nearest cluster
Examples of learned codewords
Most likely codewords for 4 learned “topics”
EM with multinomial (problem 3) to get topics
http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic05b.pdf
CSE 185
Introduction to Computer Vision
Pattern Recognition 2
Unsupervised learning
• Supervised learning
– Predict target value (“y”) given features (“x”)
• Unsupervised learning
– Understand patterns of data (just “x”)
– Useful for many reasons
• Data mining (“explain”)
• Missing data values (“impute”)
• Representation (feature generation or selection)
• One example: clustering
Clustering and data compression
• Clustering is related to vector quantization
– Dictionary of vectors (the cluster centers)
– Each original value represented using a dictionary index
– Each center “claims” a nearby region (Voronoi region)
Agglomerative clustering
• Another simple clustering algorithm: initially, every datum is its own cluster
• Define a distance between clusters (return to this)
• Initialize: every example is a cluster
• Iterate:
  – Compute distances between all clusters (store for efficiency)
  – Merge the two closest clusters
• Save both the clustering and the sequence of cluster operations
• The result is a "dendrogram": the height above two samples corresponds to the distance at which they were merged
• Builds up a hierarchy of clusters ("hierarchical" clustering)
• Algorithm complexity O(N²) (Why?)
• In Matlab: "linkage" function (stats toolbox)
Cluster distances
• The minimum-distance (single-link) definition produces a minimal spanning tree
• The maximum-distance (complete-link) definition avoids elongated clusters
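A short SciPy sketch of building and cutting this hierarchy (the random data, single-linkage choice, and cut into 3 clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)                 # hypothetical 2D data points

Z = linkage(X, method='single')           # agglomerative merges; 'single' = min inter-cluster distance
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters

# Each row of Z records one merge: (cluster a, cluster b, merge distance, new cluster size),
# which is exactly the information drawn as a dendrogram.
```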
Example: microarray expression
• Measure gene expression
• Various experimental conditions
– Cancer, normal
– Time
– Subjects
• Explore similarities
– What genes change together?
– What conditions are similar?
• Cluster on both genes and
conditions
Agglomerative Clustering
Good
• Simple to implement, widespread application
• Clusters have adaptive shapes
• Provides a hierarchy of clusters
Bad
• May have imbalanced clusters
• Still have to choose number of clusters or
threshold
• Need to use a good metric to get a meaningful
hierarchy
Sidebar: Gaussian modes
• Parametric distribution and modes
Statistical estimation
• Parametric distribution:
– based on an assumed statistical functional form; estimate its parameters
– e.g., Gaussian, mixture of Gaussians
• Non-parametric distribution:
– no assumption of the statistical function form
– based on kernels
– e.g., k-nearest neighbor, kernel density
estimation, mean-shift
Mean shift segmentation
D. Comaniciu and P. Meer, Mean Shift: A Robust Approach toward Feature Space Analysis, PAMI 2002.
• Versatile technique for clustering-based
segmentation
Mean shift algorithm
• Try to find modes of this non-parametric
density
(Figure: a 2D dataset (first two components) of 110,400 points in the LUV space, the trajectories of the mean shift procedure, and the resulting 7 clusters)
Kernel density estimation
Kernel density estimation function
Gaussian kernel
Mean shift
(Figure: the mean shift procedure — repeatedly compute the center of mass of the data inside the region of interest and translate the window by the mean shift vector, until it converges on a mode)
Computing the mean shift
Simple mean shift procedure:
• Compute the mean shift vector m(x)
• Translate the kernel window by m(x)

  m(x) = [ Σ_{i=1}^{n} x_i g( ‖(x − x_i)/h‖² ) ] / [ Σ_{i=1}^{n} g( ‖(x − x_i)/h‖² ) ]  −  x

g(·): typically a Gaussian kernel; h is the bandwidth
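A minimal NumPy sketch of iterating this update from a single starting point (the Gaussian kernel profile and bandwidth value are illustrative assumptions):

```python
import numpy as np

def mean_shift_mode(X, x0, h=1.0, n_iter=100, tol=1e-5):
    """Follow the mean shift trajectory from x0 to a mode of the kernel density of X (N x d)."""
    x = x0.astype(float)
    for _ in range(n_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / h**2)    # Gaussian profile g(||(x - xi)/h||^2)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()      # weighted mean of the data
        if np.linalg.norm(x_new - x) < tol:                 # mean shift vector m(x) = x_new - x
            break
        x = x_new
    return x
```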
Attraction basin
• Attraction basin: the region for which all
trajectories lead to the same mode
• Cluster: all data points in the attraction basin
of a mode
Attraction basin
Mean shift clustering
• The mean shift algorithm seeks the modes of the given set of points
1. Choose a kernel and bandwidth
2. For each point:
   a) Center a window on that point
   b) Compute the mean of the data in the search window
   c) Center the search window at the new mean location
   d) Repeat (b, c) until convergence
3. Assign points that lead to nearby modes to the same cluster
Segmentation by mean shift
• Compute features for each pixel (color, gradients, texture, etc.)
• Set the kernel size for features K_f and position K_s
• Initialize windows at individual pixel locations
• Perform mean shift for each window until convergence
• Merge windows that are within a width of K_f and K_s
Mean shift segmentation
http://www.caip.rutgers.edu/~comanici/MSPAMI/msPamiResults.html
Mean shift pros and cons
• Pros
– Good general-practice segmentation
– Flexible in number and shape of regions
– Robust to outliers
• Cons
– Have to choose kernel size in advance
– Not suitable for high-dimensional features
• When to use it
– Oversegmentation
– Multiple segmentations
– Tracking, clustering, filtering applications
Spectral clustering
• Group points based on links in a graph
(Figure: a graph whose nodes are grouped into segments A and B)
Cuts in a graph
(Figure: a cut separating segments A and B)
Normalized Cut
• a plain minimum cut favors cutting off small, isolated segments
• fix by normalizing the cut cost by the size of the segments
• volume(A) = sum of costs of all edges that touch A
Normalized cuts
Which algorithm to use?
• Quantization/Summarization: K-means
– Aims to preserve variance of original data
– Can easily assign new point to a cluster
Quantization for
computing histograms
Summary of 20,000 photos of Rome using
“greedy k-means”
http://grail.cs.washington.edu/projects/canonview/
Which algorithm to use?
• Image segmentation: agglomerative clustering
– More flexible with distance measures (e.g., can be
based on boundary prediction)
– Adapts better to specific data
– Hierarchy can be useful
http://www.cs.berkeley.edu/~arbelaez/UCM.html
Things to remember
• K-means useful for summarization,
building dictionaries of patches,
general clustering
• Agglomerative clustering useful for
segmentation, general clustering
• Spectral clustering useful for
determining relevance,
summarization, segmentation
Clustering
Key algorithm
• K-means
CSE 185
Introduction to Computer Vision
Face Recognition
The space of all face images
• When viewed as vectors of pixel values, face images are
extremely high-dimensional
– 100x100 image = 10,000 dimensions
• However, relatively few 10,000-dimensional vectors
correspond to valid face images
• We want to effectively model the subspace of face images
In one thumbnail face image
• Consider a thumbnail 19 × 19 face pattern
• 256³⁶¹ possible combinations of gray values
  – 256³⁶¹ = 2^(8·361) = 2^2888
• Total world population: about 7,800,000,000 ≈ 2³³
• That is roughly 2^2855 times more possible patterns than people!
• Extremely high dimensional space!
The space of all face images
• We want to construct a low-dimensional linear subspace
that best explains the variation in the set of face images
Recall
• For one random variable x with mean x̄, the variance is var(x) = E[(x − x̄)²]
• Standard deviation: σ = √var(x)
Covariance and correlation
• Two random variables x and y
Subject | Height (X) | Weight (Y)
   1    |     69     |    108
   2    |     61     |    130
   3    |     68     |    135
   4    |     66     |    135
   5    |     66     |    120
   6    |     63     |    115
   7    |     72     |    150
   8    |     62     |    105
   9    |     62     |    115
  10    |     67     |    145
  11    |     66     |    132
  12    |     63     |    120
 Mean   |   65.42    |  125.83

Sum(X) = 785, Sum(Y) = 1510, Sum(X²) = 51473, Sum(Y²) = 192238, Sum(XY) = 99064

  cov_xy = Σ(x − x̄)(y − ȳ) / (N − 1)
  cov_xy = ( 99064 − (785)(1510)/12 ) / 11 = 25.89

  r = cov_xy / (s_x s_y)        (s: standard deviation)
  r = 25.89 / ( (3.32)(14.24) ) = 0.55
Different correlation coefficients
(Figure: scatter plots of Y vs. X illustrating r = −1, r = −0.6, r = 0, r = +1, r = +0.3, and a nonlinear case with r = 0)
Correlation
(Figure: linear vs. curvilinear relationships)
(Figure: strong vs. weak relationships)
(Figure: no relationship)
• 0.0 ≤ |r| < 0.3  → weak correlation
• 0.3 ≤ |r| < 0.7  → moderate correlation
• 0.7 ≤ |r| < 1.0  → strong correlation
• |r| = 1.0        → perfect correlation
Covariance of random vectors
• Covariance is a measure of the extent to
which corresponding elements from two sets
of ordered data move in the same direction
• Two random vectors X and Y (column vectors)
  Cov(X, Y) = E[ (X − X̄)(Y − Ȳ)ᵀ ],   X̄ = E[X],  Ȳ = E[Y]
See notes on random vectors and covariace for details
Covariance matrix
• Variance-covariance matrix: variances and covariances are displayed together in a single matrix. The variances appear along the diagonal and the covariances in the off-diagonal elements.
• For X = [X₁, X₂, …, X_p]ᵀ:  Cov(X) = E[ (X − X̄)(X − X̄)ᵀ ]

  Cov(X) = [ Var(X₁)       Cov(X₁,X₂)   …  Cov(X₁,X_p) ]
           [ Cov(X₂,X₁)    Var(X₂)      …  Cov(X₂,X_p) ]
           [    ⋮              ⋮         ⋱      ⋮       ]
           [ Cov(X_p,X₁)   Cov(X_p,X₂)  …  Var(X_p)    ]

  C_ij = Cov(X)_ij = Cov(X_i, X_j)
Variance and covariance
(Figure: for any single pixel, the corresponding diagonal entry of C is its variance; for any two pixels, the corresponding off-diagonal entry is their covariance. See video for details.)
Dimensionality reduction
No dominant variation along any direction
Dominant variation along one direction
Dominant variation along one direction
Distribution of data points
• Points concentrated in a small region → low energy
• Points scattered everywhere → high energy
• Use the covariance matrix to represent how the points are scattered
• Find the direction that captures the most variation/energy
Principal components
(Figure: no dominant variation along any direction — the 1st and 2nd principal components are of equal importance, each capturing a similar amount of variation/energy)

Principal components
(Figure: dominant variation along one direction — the 1st principal component is along the y axis and captures the maximum variation/energy; the 2nd principal component is orthogonal to it)

Principal components
(Figure: dominant variation along an oblique direction — the 1st principal component is along that direction and captures the maximum variation/energy; the 2nd principal component is orthogonal to it)
Projection and basis vectors
  e_x = [1, 0, 0]ᵀ,  e_y = [0, 1, 0]ᵀ,  e_z = [0, 0, 1]ᵀ
  a_x = a · e_x,  a_y = a · e_y,  a_z = a · e_z
  a = a_x e_x + a_y e_y + a_z e_z
Orthonormal basis and projection
Projection onto new basis vectors
Eigenvector and eigenvalue
  A x = λ x
A: square matrix
x: eigenvector (characteristic vector)
λ: eigenvalue (characteristic value)
2 − 12
A=

1
−
5


2 1 0
A = 0 2 0
0 0 2
I − A =
 −2
−1
12
= ( − 2)( + 5) + 12
 +5
= 2 + 3 + 2 = ( + 1)( + 2)
I − A =
 −2
−1
0
0
 −2
0
0
0 = ( − 2)3 = 0
 −2
Principal component analysis (PCA)
• The direction that captures the maximum
covariance of the data is the eigenvector
corresponding to the largest eigenvalue of the
data covariance matrix
• Furthermore, the top k orthogonal directions
that capture the most variance of the data are
the k eigenvectors corresponding to the k
largest eigenvalues
Linear subspaces
Convert two-dimensional x into v1, v2 coordinates
What does the v2 coordinate measure?
- distance to line
- use it for classification — near 0 for orange pts
What does the v1 coordinate measure?
- position along line
- use it to specify which orange point it is
• Classification can be expensive:
– big search problem (e.g., nearest neighbors) or storing large PDFs
• Suppose the data points are arranged as above
– Idea—fit a line, classifier measures distance to line
Dimensionality reduction
Dimensionality reduction
• We can represent the orange points with only their v1 coordinates
(since v2 coordinates are all essentially 0)
• This makes it much cheaper to store and compare points
• A bigger deal for higher dimensional problems
Linear subspace
Consider the variation along direction v
among all of the orange points:
What unit vector v minimizes var?
What unit vector v maximizes var?
Solution: v1 is eigenvector of A with largest eigenvalue
v2 is eigenvector of A with smallest eigenvalue
  (x · y)² = (xᵀy)² = (xᵀy)ᵀ(xᵀy) = yᵀ x xᵀ y
Principal component analysis
• Suppose each data point is p-dimensional
– Same procedure applies:
– The eigenvectors of A define a new coordinate system
• eigenvector with largest eigenvalue captures the most variation among
training vectors x
• eigenvector with smallest eigenvalue has least variation
– We can compress the data using the top few eigenvectors
• corresponds to choosing a “linear subspace”
– represent points on a line, plane, or “hyper-plane”
• these eigenvectors are known as the principal components
Principal component analysis
• Given: N data points x1, … ,xN in Rp
• We want to find a new set of features that are
linear combinations of original ones:
u(xi) = uT(xi – µ)
(µ: mean of data points)
• Choose unit vector u in Rp that captures the
most data variance, var(u(xi))
Forsyth & Ponce, Sec. 22.3.1, 22.3.2
Another derivation of PCA
• Direction u that maximizes the variance of the projected data:

  var( uᵀ(x_i − µ) ) = (1/N) Σ_{i=1}^{N} ( uᵀ(x_i − µ) )²  =  uᵀ C u

  where uᵀ(x_i − µ) is the projection of data point x_i and C is the covariance matrix of the data.
• Maximize uᵀ C u subject to ‖u‖ = 1
• The direction that maximizes the variance is the eigenvector associated with the largest eigenvalue of C
Principal component analysis
• For data points x_i ∈ Rᵖ, i = 1, …, N
• The eigenvectors of the covariance matrix C ∈ Rᵖˣᵖ are the principal components:
  C u = λ u,  i.e.,  C u_i = λ_i u_i,  i = 1, …, p
• The first principal component is the eigenvector with the largest eigenvalue
• The second principal component is the eigenvector with the second largest eigenvalue
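A compact NumPy sketch of PCA by eigendecomposition of the data covariance (array names are illustrative; for very high-dimensional data such as face images, one would typically use an SVD-based trick rather than forming the p×p covariance explicitly):

```python
import numpy as np

def pca(X, k):
    """X: (N, p) data matrix. Returns the mean, the top-k principal components (p, k),
    and the projected coefficients (N, k)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # center the data
    C = Xc.T @ Xc / X.shape[0]                   # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # indices of the top-k eigenvalues
    U = eigvecs[:, order]                        # principal components u_1, ..., u_k
    W = Xc @ U                                   # coefficients w_i = u_i^T (x - mu)
    return mu, U, W
```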
The space of faces
• An image is a point in a high dimensional space
– An N x M image is a point in RNM
– We can define vectors in this space as we did in the 2D case
Eigenfaces: Key idea
• Assume that most face images lie on
a low-dimensional subspace determined by the first
k (k<p) directions of maximum variance
• Use PCA to determine the vectors or “eigenfaces”
u1,…uk that span that subspace
• Represent all face images in the dataset as linear
combinations of eigenfaces
M. Turk and A. Pentland, Face Recognition using Eigenfaces, CVPR 1991
Eigenfaces example
• Training
images
x1,…,xN
Eigenfaces example
Top eigenvectors: u1,…uk
Mean: μ
See video for illustration
Eigenfaces example
Principal component (eigenvector) uk
μ + 3σkuk
μ – 3σkuk
See video for illustration
Eigenfaces example
• Face x in "face space" coordinates:
  (w₁, …, w_k) = (u₁ᵀ(x − µ), …, u_kᵀ(x − µ))
• Reconstruction:
  x̂ = µ + w₁u₁ + w₂u₂ + w₃u₃ + w₄u₄ + …
Reconstruction
Each image is 92 × 112 pixels, i.e., 10,304-dimensional
P=4
P = 200
P = 400
After computing eigenfaces using 400 face images from ORL
face database
Recognition with eigenfaces
• Process labeled training images:
  1. Find the mean µ and covariance matrix C
  2. Find the k principal components (eigenvectors of C): u₁, …, u_k
  3. Project each training image x_i onto the subspace spanned by the principal components:
     (w_i1, …, w_ik) = (u₁ᵀ(x_i − µ), …, u_kᵀ(x_i − µ))
• Given a novel image x:
  1. Project onto the subspace: (w₁, …, w_k) = (u₁ᵀ(x − µ), …, u_kᵀ(x − µ))
  2. Optional: check the reconstruction error ‖x − x̂‖ to determine whether the image is really a face
  3. Classify as the closest training face in the k-dimensional subspace
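A minimal sketch of this recognition pipeline, built on the pca() sketch above (the training arrays, label list, threshold, and nearest-neighbor rule are illustrative assumptions):

```python
import numpy as np

def train_eigenfaces(faces, k):
    """faces: (N, p) matrix of vectorized training face images."""
    mu, U, W = pca(faces, k)          # eigenfaces U and training coefficients W (see pca() above)
    return mu, U, W

def recognize(x, mu, U, W, labels, face_thresh=None):
    """Project a novel image x into face space and classify it as the nearest training face."""
    w = U.T @ (x - mu)                        # coordinates in face space
    if face_thresh is not None:
        x_hat = mu + U @ w                    # reconstruction from the subspace
        if np.linalg.norm(x - x_hat) > face_thresh:
            return None                       # reconstruction error too large: probably not a face
    dists = np.linalg.norm(W - w, axis=1)     # distances to all training projections
    return labels[int(np.argmin(dists))]
```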
Eigenfaces
• PCA extracts the eigenvectors of C
– Gives a set of vectors v1, v2, v3, ...
– Each vector is a direction in face space
• what do these look like?
Projecting onto the eigenfaces
• The eigenfaces v₁, …, v_K span the space of faces
  – A face x is converted to eigenface coordinates by projecting onto them: w_i = v_iᵀ(x − µ)
Recognition with eigenfaces
• Algorithm
  1. Process the image database (set of images with labels)
     • Run PCA to compute the eigenfaces
     • Calculate the K coefficients for each image
  2. Given a new image (to be recognized) x, calculate its K coefficients
  3. Detect whether x is a face
  4. If it is a face, who is it?
     • Find the closest labeled face in the database (nearest neighbor in K-dimensional space)
Choosing the dimension K
(Figure: the eigenvalue spectrum λ_i plotted for i = 1, …, NM)
• How many eigenfaces to use?
• Look at the decay of the eigenvalues
– the eigenvalue tells you the amount of variance “in the direction”
of that eigenface
– ignore eigenfaces with low variance
Limitations
• Global appearance method: not robust to
misalignment, background variation
Limitations
• PCA assumes that the data has a Gaussian distribution (mean
µ, covariance matrix C)
The shape of this dataset is not well described by its principal components
Limitations
• The direction of maximum variance is not
always good for classification
CSE 185
Introduction to Computer Vision
Deep Learning
Deep learning
• Introduction
• Convolutional Neural Networks
• Other Neural Networks
Brief history
• 1943: McCulloch and Pitts neuron
• 1959: Adaline by Widrow and Hoff
• 1960: Mark I Perceptron by Rosenblatt
• 1969: Perceptrons by Minsky and Papert
• 1973: First AI winter
• 1986: Backpropagation by Rumelhart, Hinton, and Williams
• 1989: CNN by LeCun et al.
• 1990s-2000s: Second AI winter
• 2012: AlexNet by Krizhevsky, Sutskever, and Hinton
• 2014: Generative Adversarial Networks by Goodfellow, … and Bengio
• 2018: Turing award to Hinton, Bengio and LeCun
• See note 1, note 2
Other neural networks
• Recurrent neural networks (RNNs)
• Long short-term memory (LSTM)
• Graph neural networks (GNNs)
• Generative adversarial networks (GANs)
• Transformers
Introduction
• Introduction to deep learning: slides, video
• Introduction to convolutional neural networks
for visual recognition: slides
Some vision tasks
• Object detection: R-CNN, Yolo
• Object segmentation: mask R-CNN
• Object tracking:
• Scene parsing: deeplab
• Depth estimation: DPT
• Optical flow: PWCNet
• Frame interpolation: DAIN
• Seeing through obstruction:
Recent results
StyleGAN
GPT-3
DALL-E
Online resources
• MIT: Introduction to deep learning
• Stanford: Convolutional networks for visual
recognition, video
• Coursera: deep learning